A message of hope.
In past years, I've featured pictures of great food from my travels. In this very different year, I'm showing some joyful creations from my kitchen.
The impact of Covid-19 on the economy is sharp and sudden, which makes for some dramatic data visualization. I enjoy reading the set of charts showing consumer spending in different categories in the U.S., courtesy of Visual Capitalist.
The designer did a nice job cleaning up the data and building a sequential story line. Spending is grouped into categories such as restaurants and travel, and then into sub-categories such as fast food and fine dining.
Spending is presented as year-on-year change, smoothed.
Here is the chart for the General Commerce category:
The visual design is clean and efficient. Perhaps even too sparse, because one has to keep returning to the top to decipher the key events labelled 1, 2, 3, 4. Also, to find out that the percentages express year-on-year change, the reader must scroll to the bottom and locate a footnote.
As you move down the page, you will surely make a stop at the Food Delivery category, noting that the routine is broken.
I've featured this device - an element of surprise - before. Remember this Quartz chart that depicts drinking around the world (link)?
This chart contains a slight oversight - the red line should be labeled "Takeout" because food delivery is the label for the larger category.
Another surprise is in store for us in the Travel category.
I kept staring at the Cruise line, and how it kept dipping below -100 percent. That seems impossible mathematically - unless these cardholders are receiving more in refunds than they are spending on new bookings. Not only must the entire sum of 2019 bookings be wiped out, but the records must also show credits issued to these credit (or debit) cards. It's curious that the same situation did not befall the airlines. I think many readers would have liked to see some text discussing this pattern.
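To see how a year-on-year change can fall below -100 percent, here is a minimal sketch with made-up numbers. It assumes that spending is measured net of refunds; the function name and all figures are hypothetical and are not taken from the Visual Capitalist data.

```python
def yoy_change_pct(current_net, prior_year):
    """Year-on-year change in percent, where current_net = purchases - refunds."""
    return (current_net - prior_year) / prior_year * 100.0


# Hypothetical figures, for illustration only.
prior_year_bookings = 100.0  # spending in the same week of 2019
new_bookings = 5.0           # new charges this year
refunds = 30.0               # credits issued back to the cards

# Net spending is negative when refunds exceed new charges.
net_spending = new_bookings - refunds
change = yoy_change_pct(net_spending, prior_year_bookings)
print(f"{change:.0f}%")  # -125%
```

With zero new bookings and zero refunds, the change bottoms out at exactly -100 percent. Only negative net spending can push the line lower.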
Now, let me put on a data analyst's hat, and describe some thoughts that raced through my head as I read these charts.
Data analysis is hard, especially if you want to convey the meaning of the data.
The charts clearly illustrate the trends but what do the data reveal? The designer adds commentary on each chart. But most of these comments count as "story time." They contain speculation on what might be causing the trend but there isn't additional data or analyses to support the storyline. In the General Commerce category, the 50 to 100 percent jump in all subcategories around late March is attributed to people stockpiling "non-perishable food, hand sanitizer, and toilet paper". That might be true but this interpretation isn't supported by credit or debit card data because those companies do not have details about what consumers purchased, only the total amount charged to the cards. It's a lot more work to solidify these conclusions.
Having a lot of data does not mean having complete or unbiased data.
The data platform provided data on 5 million consumers. We don't know if these 5 million consumers are representative of the 300+ million people in the U.S. Some basic demographic or geographic analysis can help establish the validity. Strictly speaking, I think they have data on 5 million card accounts, not unique individuals. Most Americans use more than one credit or debit card. It's not likely that the data vendor has a full picture of an individual's or a family's spending.
It's also unclear how much of consumer spending is captured in this dataset. Credit and debit cards are only one form of payment.
Data quality tends to get worse.
One thing drives data analysts nuts: the spending categories are becoming blurrier. In the last decade or so, big business has come to dominate the American economy. Big business, with bipartisan support, has grown by (a) absorbing little guys, and (b) eliminating boundaries between industry sectors. Around me, there is a Walgreens, several Duane Reades, and a RiteAid. They currently have the same owner, and increasingly offer the same selection. In the meantime, Walmart (big box), CVS (pharmacy), Costco (wholesale), etc. all won regulatory relief to carry groceries, fresh foods, toiletries, etc. So, while CVS or Walgreens is classified as a pharmacy, it's not clear what proportion of the spending there is for medicines. As big business grows, these categories become less and less meaningful.
My friend Ray Vella at The Conference Board has a few charts up on their coronavirus website. TCB is a trusted advisor and consultant to large businesses and thus is a good place to learn how the business community is thinking about this crisis.
I particularly like the following chart:
This puts the turmoil in the stock market in perspective. We are roughly tracking the decline of the Great Recession of the late 2000s. It's interesting that 9/11 caused very mild gyrations in the S&P index compared to any of the other events.
The chart uses an index with value 100 at Day 0. Day 0 is defined by the trigger event for each crisis. About three weeks into the current crisis, the S&P has lost over 30% of its value.
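The indexing described here can be sketched in a few lines. This is a minimal illustration, not The Conference Board's actual method. The closing levels are made up, and the function name is hypothetical.

```python
def rebase_to_100(levels):
    """Rescale a series so that its first value (Day 0) equals 100."""
    base = levels[0]
    return [100.0 * level / base for level in levels]


# Made-up index levels: Day 0 (the trigger event) followed by later days.
closes = [3000.0, 2700.0, 2400.0, 2070.0]

index = rebase_to_100(closes)
loss_pct = 100.0 - index[-1]  # percent of value lost since Day 0

print(index)
print(f"Lost {loss_pct:.0f}% since Day 0")
```

Because every crisis starts at 100, the vertical gap between a line and 100 reads directly as the percent gained or lost since the trigger event. That is what makes the crises comparable on one chart.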
The device of a gray background for the bottom half of the chart is surprisingly effective.
Here is a chart showing the impact of the Covid-19 crisis on different sectors.
So the full-service restaurant industry is a huge employer. Restaurants employ 7-8 times more people than airlines. Airlines employ about the same number of people as "beverage bars" (which I suppose is the same as "bars", which apparently is different from "drinking places"). Bars employ 7 times more people than "Cafeterias, etc.".
The chart describes where the jobs are, and which sectors they believe will be most impacted. It's not clear yet how deeply these will be impacted. Being in NYC, the complete shutdown is going to impact 100% of these jobs in certain sectors like bars, restaurants and coffee shops.
The Hustle wrote a strong analysis of the business of buffets. If you've read my analysis of Groupon's business model in Numbersense (link), you'll find some similarities. A key is to not think of every customer as an average customer; there are segments of customers who behave differently, and creating a proper mix of different types of customers is the management's challenge. I will make further comments on the statistics in a future post on the sister blog.
At Junk Charts, we'll focus on visualizing and communicating data. The article in The Hustle comes with the following dataviz:
This dataviz fails my self-sufficiency test. Recall: self-sufficiency is a basic requirement of visualizing data - that the graphical elements should be sufficient to convey the gist of the data. Otherwise, there is no point in augmenting the data with graphical elements.
The self-sufficiency test is to remove the dataset from the dataviz, and ask whether the graphic can stand on its own. So here:
The entire set of ingredient costs appears on the original graphic. When these numbers are removed, the reader gets the wrong message - that the cost is equally split between these five ingredients.
This chart reminds me of the pizza chart that everyone thought was a pie chart except its designer! I wrote about it here. Food coma is a thing.
The original chart may be regarded as an illustration rather than data visualization. If so, it's just a few steps from becoming a dataviz. Like this:
P.S. A preview of what I'll be talking about at the sister blog. The above diagram illustrates the average case - for the average buffet diner. Underneath these costs is an assumption about the relative amounts of each food that is eaten. But eaten by whom?
Also, if you have Numbersense (link), the chapter on measuring the inflation rate is relevant here. Any inflation metric must assume a basket of goods, but then the goods within the basket have to be weighted by the amount of expenditure. It's much harder to get the ratio of expenditures correct compared to getting price data.
A recent article in the Wall Street Journal about a challenger to the dominant weedkiller, Roundup, contains a nice selection of graphics. (Dicamba is the up-and-comer.)
The staircase chart shows weeds have become resistant to Roundup over time. This is considered a weakness in the Roundup business.
In this post, my focus is on the chart at the bottom, which shows complaints about Dicamba by state in 2019. This is a bubble chart, with the bubbles sorted along the horizontal axis by the acreage of farmland by state.
Below left is a more standard version of such a chart, in which the bubbles are allowed to overlap. (I only included the bubbles that were labeled in the original chart).
The WSJ’s twist is to use the vertical spacing to avoid overlapping bubbles. The vertical axis serves a design prerogative and does not encode data.
I’m going to stick with the more traditional overlapping bubbles here – I’m getting to a different matter.
The question being addressed by this chart is: which states have the most serious Dicamba problem, as revealed by the frequency of complaints? The designer recognizes that the amount of farmland matters. One should expect the more acres, the more complaints.
Let's consider computing directly the number of complaints per million acres.
The resulting chart (shown below right) – while retaining the design – gives a wholly different feeling. Arkansas now owns the largest bubble even though it has the least acreage among the included states. The huge Illinois bubble is still large but is no longer a loner.
Now return to the original design for a moment (the chart on the left). In theory, this should work in the following manner: if complaints grow purely as a function of acreage, then the bubbles should grow proportionally from left to right. The trouble is that proportional areas are not as easily detected as proportional lengths.
The pair of charts below depict made-up data in which all states have 30 complaints for each million acres of farmland. It’s not intuitive that the bubbles on the left chart are growing proportionally.
Now if you look at the right chart, which shows the relative metric of complaints per million acres, it’s impossible not to notice that all bubbles are the same size.
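The made-up scenario can be reproduced numerically. The sketch below uses hypothetical states with exactly 30 complaints per million acres. It assumes bubble area (not radius) is proportional to the plotted value, which is the usual convention but is not confirmed for the WSJ chart.

```python
import math

# Made-up data: (farmland in millions of acres, number of complaints).
# Every hypothetical state has exactly 30 complaints per million acres.
states = {
    "State A": (5.0, 150),
    "State B": (10.0, 300),
    "State C": (20.0, 600),
}


def complaints_per_million_acres(acres_millions, complaints):
    return complaints / acres_millions


def bubble_radius(value):
    """Radius of a circle whose area equals value (area-proportional bubble)."""
    return math.sqrt(value / math.pi)


rates = {name: complaints_per_million_acres(a, c) for name, (a, c) in states.items()}
raw_radii = {name: bubble_radius(c) for name, (a, c) in states.items()}
rate_radii = {name: bubble_radius(r) for name, r in rates.items()}

for name in states:
    print(name, rates[name], round(raw_radii[name], 2), round(rate_radii[name], 2))
```

In the raw-count version, doubling the acreage doubles the bubble area but multiplies the radius by only about 1.41. That is part of why proportional growth is hard to see. In the per-acre version, every radius is identical.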
Happy new year! Good luck and best wishes!
We'll start 2020 with something lighter. On a recent flight, I saw a chart in The Economist that shows the proportion of operating income derived from overseas markets by major grocery chains - the headline said that some of these chains are withdrawing from international markets.
The designer used one color for each grocery chain, and two shades within each color. The legend describes the shades as "total" and "of which: overseas". As with all stacked bar charts, it's a bit confusing where to find the data. The "total" is actually the entire bar, not just the darker shaded part. The darker shaded part is better labeled "home market" as shown below:
The designer's instinct to bring out the importance of international markets to each company's income is well placed. A second small edit helps: plot the international income amounts first, so they line up with the vertical zero axis. Like this:
This is essentially the same chart. The order of international and home market is reversed. I also reversed the shading, so that the international share of income is displayed darker. This shading draws the readers' attention to the key message of the chart.
A stacked bar chart of the absolute dollar amounts is not ideal for showing proportions, because each bar is a different length. Sometimes, plotting relative values summing to 100% for each company may work better.
As it stands, the chart above calls attention to a different message: that Walmart dwarfs the other three global chains. Just the international income of Walmart is larger than the total income of Costco.
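For the alternative of plotting relative values that sum to 100% for each company, the conversion is a one-liner. The sketch below uses hypothetical income figures, not the numbers from The Economist's chart, and the function name is made up.

```python
def income_shares(overseas, home):
    """Convert absolute income amounts into percentage shares of the total."""
    total = overseas + home
    return {
        "overseas": 100.0 * overseas / total,
        "home": 100.0 * home / total,
    }


# Hypothetical operating income, in billions of dollars.
chains = {
    "Chain A": (5.0, 15.0),  # (overseas, home)
    "Chain B": (1.0, 3.0),
}

shares = {name: income_shares(o, h) for name, (o, h) in chains.items()}
for name, s in shares.items():
    print(name, s)
```

The two hypothetical chains differ fivefold in size but have the same overseas share. The absolute-dollar chart emphasizes the first fact, and the 100% chart the second.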
Please comment below or write me directly if you have ideas for this blog as we enter a new decade. What do you want to see more of? less of?
When making a scatter plot, the two variables should not be placed arbitrarily. There is a rule governing this: the outcome variable should be shown on the vertical axis (also called y-axis), and the explanatory variable on the horizontal (or x-) axis.
This chart from the archives of the Economist has this reversed:
The title of the accompanying article is "Ice Cream and IQ"...
In a Trifecta Checkup (link), it's a Type DV chart. It's preposterous to claim eating ice cream makes one smarter without more careful studies. The chart also carries the xyopia fallacy: by showing just two variables, readers are unwittingly led to explain differences in "IQ" using differences in per-capita ice-cream consumption when lots of other stronger variables will explain any gaps in IQ.
In this post, I put aside my objections to the analysis, and focus on the issue of assigning variables to axes. Notice that this chart reverses the convention: the outcome variable (IQ) is shown on the horizontal, and the explanatory variable (ice cream) is shown on the vertical.
Here is a reconstruction of the above chart, showing only the dots that were labeled with country names. I fitted a straight regression line instead of a curve. (I don't understand why the red line in the original chart bends upwards when the data for Japan, South Korea, Singapore and Hong Kong should be dragging it down.)
Note that the interpretation of the regression line raises eyebrows because the presumed causality is reversed. For each 50-point increase in PISA score (IQ), this line says to expect ice cream consumption to rise by about 1-2 liters per person per year. So higher IQ makes people eat more ice cream.
If the convention is respected, then the following scatter plot results:
The first thing to note is that the regression analysis is different here from that shown in the previous chart. The blue regression line is not equivalent to the black regression line from the previous chart. You cannot reverse the roles of the x and y variables in a regression analysis, and so neither should you reverse the roles of the x and y variables in a scatter plot.
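The non-equivalence of the two regression lines can be checked numerically. The sketch below fits ordinary least squares both ways on made-up data. The numbers are hypothetical and are not the Economist's, and the helper function is written out by hand for illustration.

```python
def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y on x; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up data: ice cream (liters per person per year) and test score.
ice_cream = [1.0, 2.0, 4.0, 6.0, 8.0, 10.0]
score = [400.0, 450.0, 480.0, 500.0, 505.0, 510.0]

# Score as the outcome (vertical axis), ice cream as the explanation.
slope_score_on_ice, _ = ols_slope_intercept(ice_cream, score)

# Roles reversed: ice cream as the outcome, score as the explanation.
slope_ice_on_score, _ = ols_slope_intercept(score, ice_cream)

# If the two lines were the same line, this would equal slope_score_on_ice.
implied_slope = 1.0 / slope_ice_on_score

print(slope_score_on_ice, implied_slope)
```

The product of the two fitted slopes equals the squared correlation. So the two lines coincide only when the points fall exactly on a straight line.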
The blue regression line can be interpreted as having two sections, roughly, for countries consuming more than or less than 6 liters of ice cream per person per year. In the less-ice-cream countries, the correlation between ice cream and IQ is stronger (I don't endorse the causal interpretation of this statement).
When you make a scatter plot, you have two variables for which you want to analyze their correlation. In most cases, you are exploring a cause-effect relationship.
Higher-income households care more about politics.
Less educated citizens are more likely to not register to vote.
Companies with a more diverse workforce have better business performance.
Frequently, the reverse correlation does not admit a causal interpretation:
Caring more about politics does not make one richer.
Not registering to vote does not make one less educated.
Making more profits does not lead to more diversity in hiring.
In each of these examples, it's clear that one variable is the outcome, and the other variable is the explanatory factor. Always put the outcome on the vertical axis, and the explanation on the horizontal axis.
The justification is scientific. If you are going to add a regression line (what Excel calls a "trendline"), you must follow this convention, otherwise, your regression analysis will yield the wrong result, with an absurd interpretation!
[PS. 11/3/2019: The comments below contain different theories that link the two variables, including theories that treat PISA score ("IQ") as the explanatory variable and ice cream consumption as the outcome. Also, I elaborated that the rule does not dictate which variable is the outcome - the designer effectively signals to the reader which variable is regarded as the outcome by placing it in the vertical axis.]
I have a longer article on the sister blog about the research design of a study claiming 420 "cannabis" Day caused more road accident fatalities (link). The blog also has a discussion of the graphics used to present the analysis, which I'm excerpting here for dataviz fans.
The original chart looks like this:
The question being asked is whether April 20 is a special day when viewed against the backdrop of every day of the year. The answer is pretty clear. From this chart, the reader can see:
It doesn't even matter what the vertical axis is measuring. The visual elements did their job.
If you look closely, you can even assess the "magnitude" of the evidence, not just the "direction." While April 20 isn't special, it nonetheless is somewhat noteworthy. The vertical line associated with April 20 sits on the positive side of the range of possibilities, and appears to sit above most other days.
The chart form shown above is better at conveying the direction of the evidence than its strength. If the strength of the evidence is required, we use a different chart form.
I produced the following histogram, using the same data:
The histogram is produced by first sorting the midpoints# of the vertical lines into buckets, and then counting the number of days that fall into each bucket. (# Strictly speaking, I use the point estimates.)
The midpoints# are estimates of the fatal crash ratio, which is defined as the excess crash fatalities reported on the "analysis day" relative to the "reference days," which are situated one week before and one week after the analysis day. So April 20 is compared to April 13 and 27. Therefore, a ratio of 1 indicates no excess fatalities on the analysis day. And the further the ratio is above 1, the more special is the analysis day.
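As a rough sketch of this definition, the code below finds the two reference days and computes a ratio. It assumes the ratio is the analysis-day count divided by the average of the two reference-day counts; the study's actual estimator may differ. The fatality counts are made up, and the year is arbitrary.

```python
from datetime import date, timedelta


def reference_days(analysis_day):
    """Return the days one week before and one week after the analysis day."""
    week = timedelta(days=7)
    return analysis_day - week, analysis_day + week


def crash_ratio(analysis_count, reference_counts):
    """Assumed definition: analysis-day count over the mean reference-day count."""
    reference_mean = sum(reference_counts) / len(reference_counts)
    return analysis_count / reference_mean


before, after = reference_days(date(2019, 4, 20))
print(before, after)  # 2019-04-13 2019-04-27

# Hypothetical fatality counts: 112 on the analysis day, 95 and 105 on reference days.
ratio = crash_ratio(112, [95, 105])
print(ratio)  # 1.12
```

A ratio of 1 means the analysis day matches its reference days. In this made-up case, the analysis day shows 12 percent more fatalities.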
If we were to pick a random day from the histogram above, we would likely land somewhere in the middle, which is to say, a day of the year on which no excess car crash fatalities could be confirmed in the data.
As shown above, the ratio for April 20 (about 1.12) is located on the right tail, and at roughly the 94th percentile, meaning that there were 6 percent of analysis days in which the ratios would have been more extreme.
This is in line with our reading above, that April 20 is noteworthy but not extraordinary.
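The percentile reading can be sketched as follows. The list of ratios is made up and far shorter than a full year of analysis days, so its output does not reproduce the 94th percentile quoted above. The function names are hypothetical.

```python
def percentile_rank(values, x):
    """Percent of values strictly below x."""
    below = sum(1 for v in values if v < x)
    return 100.0 * below / len(values)


def share_more_extreme(values, x):
    """Percent of values strictly above x (the right tail beyond x)."""
    above = sum(1 for v in values if v > x)
    return 100.0 * above / len(values)


# Made-up crash ratios for 20 hypothetical analysis days.
ratios = [0.88, 0.91, 0.93, 0.95, 0.96, 0.97, 0.98, 0.99, 1.00, 1.00,
          1.01, 1.02, 1.03, 1.04, 1.05, 1.06, 1.08, 1.10, 1.12, 1.15]

rank = percentile_rank(ratios, 1.12)
tail = share_more_extreme(ratios, 1.12)
print(rank, tail)  # 90.0 5.0
```

A day that sits high in the right tail but still has other days beyond it is noteworthy without being extraordinary.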
P.S. [4/27/2019] Replaced the first chart with a newer version from Harper's site. The newer version contains the point estimates inside the vertical lines, which are used to generate the histogram.
Saw this great little sign at Ippudo, the ramen shop, the other day:
It's a great example of highly effective data visualization. The names on the board are sake brands.
The menu (a version of a data table) is the conventional way of displaying this information.
Customers are selecting a sake. They don't have a favorite, or don't recognize many of these brands. They know a bit about their preferences: I like full-bodied, or I want the dry one.
On a menu, the key data are missing. So the first order of business is to find data on full- and light-bodied, and dry and sweet. The pricing data are omitted, possibly because it clutters up the design, or because the shop doesn't want customers to focus on price - or both.
The design uses a scatter plot. The customer finds the right quadrant, thus narrowing the choices to three or four brands. Then, the positions on the two axes allow the customer to drill down further.
This user experience is leaps and bounds above scanning a list of names, and asking someone who may or may not be an expert.
Back to the Data
The success of the design depends crucially on selecting the right data. Baked into the scatter plot is the assumption that the designer knows the two factors most influential to the customer's decision. Technically, this is a "variable selection" problem: of all factors determining the brand choice, which two are the most important?
Think about the downside of selecting the wrong factors. Then, the scatter plot makes it harder to choose the sake compared to the menu.
At first glance, this graphic's message seems clear: what proportion of Americans are exceeding or lagging guidelines for consumption of different food groups. Blue for exceeding; orange for lagging. The stacked bars are lined up at the central divider - the point of meeting recommended volumes - to make it easy to compare relative proportions.
The original chart is here, on the Health.gov website.
The little icons illustrating the food groups are cute and unintrusive.
It's when you read further that things start to get complicated. The last three rows display a flipping of the color scheme, with orange on the right, blue on the left. Up to this point, you may understand blue to mean over the recommended value, and orange to mean under. Suddenly, the orange is shown on the right side.
The designer was wrestling with a structural issue in the data. The last three food groups - sugars, fats and sodium - are things to eat less of. So, having long bars on the right side is not good. The orange/blue colors should be interpreted as bad/good and not as under/over.
The problem with this design is that it draws attention to this color flip - that is to say, it draws attention to which food groups are favored and which ones are to be avoided. This insight is actually in the metadata, not what this dataset is about.
In the following chart, I enforce the bad/good color scheme while ignoring the direction of good. The text is adjusted to use words that do not suggest direction.
Dieticians are probably distressed by this chart, given that most Americans are lagging on almost all of the recommendations.
In a final edit, I re-ordered the categories.