Consumption patterns during the pandemic

The impact of Covid-19 on the economy is sharp and sudden, which makes for some dramatic data visualization. I enjoy reading the set of charts showing consumer spending in different categories in the U.S., courtesy of Visual Capitalist.

The designer did a nice job cleaning up the data and building a sequential story line. Spending is grouped into categories such as restaurants and travel, and then sub-categories such as fast food and fine dining.

Spending is presented as year-on-year change, smoothed.

Here is the chart for the General Commerce category:


The visual design is clean and efficient, perhaps even too sparse: one has to keep returning to the top to decipher the key events labelled 1, 2, 3, 4. Also, to find out that the percentages express year-on-year change, the reader must scroll to the bottom and locate a footnote.

As you move down the page, you will surely make a stop at the Food Delivery category, noting that the routine is broken.


I've featured this device - an element of surprise - before. Remember this Quartz chart that depicts drinking around the world (link).

The rule for small multiples is to keep the visual design identical but vary the data from chart to chart. Here, the exceptional data force the vertical axis to extend tremendously.

This chart contains a slight oversight - the red line should be labeled "Takeout" because food delivery is the label for the larger category.

Another surprise is in store for us in the Travel category.


I kept staring at the Cruise line, and how it kept dipping below -100 percent. That seems impossible mathematically - unless these cardholders are receiving more in refunds than they are spending on new bookings. Not only must the entire sum of 2019 bookings be wiped out, but the records must also show credits issued to these credit (or debit) cards. It's curious that the same situation did not befall the airlines. I think many readers would have liked to see some text discussing this pattern.
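To see how a year-on-year change can fall below -100 percent, here is a minimal sketch in Python. The dollar amounts are made up for illustration, and treating net spending as new bookings minus refunds is my assumption about how the card data are netted, not something the chart documents.

```python
def yoy_change_pct(current: float, year_ago: float) -> float:
    """Year-on-year change, as a percentage of the year-ago value."""
    if year_ago == 0:
        raise ValueError("year-ago spending must be non-zero")
    return (current - year_ago) * 100.0 / year_ago

# Hypothetical weekly figures for one category (not taken from the chart).
bookings_2019 = 1000.0      # spending in the same week of 2019
new_bookings_2020 = 50.0    # new charges in 2020
refunds_2020 = 200.0        # credits issued back to the cards in 2020

# Assumption: net spending is new charges minus refunds, so it can go negative.
net_spending_2020 = new_bookings_2020 - refunds_2020

change = yoy_change_pct(net_spending_2020, bookings_2019)
print(change)  # -115.0
```

With no refunds at all, the worst possible reading is -100 percent (zero spending). Only negative net spending can push the line lower.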


Now, let me put on a data analyst's hat, and describe some thoughts that raced through my head as I read these charts.

Data analysis is hard, especially if you want to convey the meaning of the data.

The charts clearly illustrate the trends but what do the data reveal? The designer adds commentary on each chart. But most of these comments count as "story time." They contain speculation on what might be causing the trend but there isn't additional data or analyses to support the storyline. In the General Commerce category, the 50 to 100 percent jump in all subcategories around late March is attributed to people stockpiling "non-perishable food, hand sanitizer, and toilet paper". That might be true but this interpretation isn't supported by credit or debit card data because those companies do not have details about what consumers purchased, only the total amount charged to the cards. It's a lot more work to solidify these conclusions.

A lot of data do not mean complete or unbiased data.

The data platform provided data on 5 million consumers. We don't know if these 5 million consumers are representative of the 300+ million people in the U.S. Some basic demographic or geographic analysis can help establish the validity. Strictly speaking, I think they have data on 5 million card accounts, not unique individuals. Most Americans use more than one credit or debit card. It's not likely the data vendor has a full picture of an individual's or a family's spending.

It's also unclear how much of consumer spending is captured in this dataset. Credit and debit cards are only one form of payment.

Data quality tends to get worse.

One thing drives data analysts nuts: the spending categories are becoming blurrier. In the last decade or so, big business has come to dominate the American economy. Big business, with bipartisan support, has grown by (a) absorbing little guys, and (b) eliminating boundaries between industry sectors. Around me, there is a Walgreens, several Duane Reades, and a RiteAid. They currently have the same owner, and increasingly offer the same selection. In the meantime, Walmart (big box), CVS (pharmacy), Costco (wholesale), etc. all won regulatory relief to carry groceries, fresh foods, toiletries, etc. So, while CVS or Walgreens is classified as a pharmacy, it's not clear what proportion of the spending there is for medicines. As big business grows, these categories become less and less meaningful.

The elusive meaning of black paintings and red blocks

Joe N, a longtime reader, tweeted about the following chart, by the People's Policy Project:


This is a simple column chart containing only two numbers, far exceeded by the count of labels and gridlines.

I look at charts like the lady staring at these Ad Reinhardts:



My artist friends say the black squares are not the same, if you look hard enough.

Here is what I learned after one such sitting:

The tiny data labels sitting on the inside top edges of the columns hint that the right block is slightly larger than the left block.

The five labels of the vertical axis serve no purpose, nor do the gridlines.

The horizontal axis for time is reversed, with 2019 appearing after 2020 (when read left to right).

The left block has one month while the right block has 12 months. This is further confused by the word "All" which shares the same starting and ending letters as "April".

As far as I can tell, the key message of this chart is that the month of April has the impact of a full year. It's like 12 months of outflows from employment hitting the economy in one month.


My first response is this chart:


Breaking the left block into 12 pieces, and color-coding the April piece brings out the comparison. You can also see that in 2019, the outflows from employment to unemployment were steady month to month.

Next, I want to see what happens if I restore the omitted months of Jan to March, 2020.


The story changes slightly. Now, the chart says that the first four months have already exceeded the full year of 2019.

Since the values hold steady month to month, with the exception of April 2020, I make a monthly view:


You can see the slight nudge-up in March 2020 as well. This draws more attention to the break in pattern.

For time-series data, I prefer to look at line charts:


As I explained in this post about employment statistics (or Chapter 6 of Numbersense (link)), the Bureau of Labor Statistics classifies people into three categories: Employed, Unemployed and Not in Labor Force. Exits from Employed to Unemployed status contribute to unemployment in the U.S. To depict a negative trend, it's often natural to use negative numbers:


You may realize that this data series paints only a partial picture of the health of the labor market. While some people exit the Employed status each month, there are others who re-enter or enter the Employed status. We should really care about net flows.


In all of 2019, there were more entrants than exits, leading to a slightly positive net inflow to the Employed status from Unemployed (blue line). In April 2020, the red line (exits) drags the blue line dramatically.

Of course, even this chart is omitting important information. There are also flows from Employed to and from Not in Labor Force.






Hope and reality in one Georgia chart

Over the weekend, Georgia's State Health Department agitated a lot of people when it published the following chart:


(This might have appeared a week ago as the last date on the chart is May 9 and the title refers to "past 15 days".)

They could have avoided the embarrassment if they had read my article at (link). In that article, I lay out a set of the "unspoken conventions," things that visual designers are, or should be, doing more or less in their sleep. Under the section titled "Order", I explain the following two "rules":

  • Place values in the natural order when it is available
  • Retain the same order across all plots in a panel of charts

In the chart above, the natural order for the horizontal (time) axis is time running left to right. The order chosen by the designer is roughly but not precisely decreasing height of the tallest column in each daily group. Many observers suggested that the columns were arranged to give the appearance of cases dropping over time.

Within each day, the counties are ordered in decreasing number of new cases. The title of the chart reads "number of cases over time" which sounds like cumulative cases but it's not. The "lead" changed hands so many times over the 15 days, meaning the data sequence was extremely noisy, which would be unlikely for cumulative cases. There are thousands of cases in each of these counties by May. Switching the order of the columns within each daily group defeats the purpose of placing these groups side-by-side.

Responding to the bad press, the department changed the chart design for this week's version:


This chart now conforms to the two spoken rules described above. The time axis runs left to right, and within each group of columns, the order of the counties is maintained.

The chart is still very noisy, with no apparent message.


Next, I'd like to draw your attention to a Data issue. Notice that the 15-day window has shifted. This revised chart runs from May 2 to May 16, which is this past Saturday. The previous chart ran from Apr 26 to May 9. 

Here's the data for May 8 and 9 placed side by side.


There is a clear time lag in the reporting of cases in the State of Georgia. This chart should always exclude the last few days. The case counts keep going up until they stabilize. The same mistake occurs in the revised chart - the last two days appear as if new cases have dwindled toward zero when in fact, this reflects a lag in reporting.

The disconnect between the Question being posed and the quality of the Data available dooms this visualization. It is not possible to provide a reliable assessment of the "past 15 days" when during perhaps half of that period, the cases are under-counted.


This graphical distortion due to "immature" data has become very commonplace in Covid-19 graphics. It's similar to placing partial-year data next to full-year results, without calling out the partial data.

The following post from the ancient past (2005!) about a New York Times graphic shows that calling out this data problem does not actually solve it. It's a less-bad kind of thing.

The coronavirus data present more headaches for graphic designers than the financial statistics. Because of accounting regulations, we know that only the current quarter's data are immature. For Covid-19 reporting, the numbers are being adjusted for days and weeks.

Practically all immature counts are under-estimates. Over time, more cases are reported. Thus, any plots over time - if unadjusted - paint a misleading picture of declining counts. The effect of the reporting lag is predictable, having a larger impact as we run from left to right in time. Thus, even if the most recent data show a downward trend, it can eventually mean anything: down, flat or up. This is not random noise though - we know for certain of the downward bias; we just don't know the magnitude of the distortion for a while.
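The downward bias is easy to reproduce. Below is a minimal sketch in Python, under assumptions of my own: the true number of new cases is flat at 100 per day, and the reporting-completeness shares (30 percent of a day's cases known on the day itself, 55 percent a day later, and so on) are invented for illustration, not estimated from Georgia's data.

```python
# Hypothetical share of a day's cases that has been reported
# 0, 1, 2, 3 and 4+ days after the fact (an assumption, not real data).
COMPLETENESS = [0.30, 0.55, 0.75, 0.90, 1.00]

def reported_so_far(true_counts, completeness=COMPLETENESS):
    """Counts visible on the last day of the series, given the reporting lag."""
    last = len(true_counts) - 1
    visible = []
    for day, count in enumerate(true_counts):
        age = last - day  # how many days the data for this day have had to mature
        share = completeness[min(age, len(completeness) - 1)]
        visible.append(round(count * share))
    return visible

true_counts = [100] * 10  # flat epidemic: no decline at all
print(reported_so_far(true_counts))
# [100, 100, 100, 100, 100, 100, 90, 75, 55, 30]
```

The true series is flat, yet the chart drawn on the last day shows cases "dwindling" over the final four days. The size of the dip depends entirely on the completeness shares, which is why the magnitude of the distortion is unknown until the data mature.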

Another issue that concerns coronavirus reporting but not financial reporting is inconsistent standards across counties. Within a business, if one were to break out statistics by county, the analysts would naturally apply the same counting rules. For Covid-19 data, each county follows its own set of rules, not just how to count things but also how to conduct testing, and so on.

Finally, with the politics of re-opening, I find it hard to trust the data. Reported cases are human-driven data - by changing the number of tests, by testing different mixes of people, by delaying reporting, by timing the revision of older data, by explicit manipulation, and so on, the numbers can be tortured into any shape. That's why it is extremely important that the bean-counters are civil servants, and that politicians are kept away. In the current political environment, that separation between politics and statistics has been breached.


Why do we have low-quality data? Human decisions, frequently political decisions, adulterate the data. Epidemiologists are then forced to use the bad data, because that's what they have. Bad data lead to bad predictions and bad decisions, or if the scientists account for the low quality, predictions with high levels of uncertainty. Then, the politicians complain that predictions are wrong, or too wide-ranging to be useful. If they really cared about those predictions, they could start by being more transparent about reporting and more proactive at discovering and removing bad accounting practices. The fact that they aren't focused on improving the data gives the game away. Here's a recent post on the politics of data.


How Covid-19 deaths sneaked into Florida's statistics

Like many others, some Floridians are questioning their state's Covid statistics. It's clear there are numerous "degrees of freedom" for politicians to manipulate the numbers. What's not clear is who's influencing these decisions. Are they public-health experts, donors, voters, or whom?

A Twitter follower sent in the following chart, embedded in an informative article in the Sun-Sentinel:


I like the visual design. It's clean, and conveys a moderately complex concept effectively. The reader may not immediately get what metrics are being plotted but the idea that the blue line should operate within the gray area... until it doesn't is easily grasped. The range is technically an uncertainty band.

The metric is the proportion of total deaths (all causes) that are attributed to pneumonia and flu. Typical influenza deaths are found in that category. This chart investigates whether there were excess (unexplained) P&F deaths. The gray band measures the variability in the proportions of past years. When the blue line operates inside the band, the metric is normal. When it pierces the upper band, which happened here around week 25, a rare event has occurred.

The concern on Twitter was about the horizontal axis. Those integer labels can be confusing. The designer places a "how to read this" message in a footnote, explaining that week 1 is the first week of a typical flu season (which corresponds to late September 2019). This nugget of information helps a lot. We can see that the flu season peaks around week 20, and by the spring, it should be waning. Not so in 2020.

It's hard to escape the conclusion that deaths from Covid-19 are hiding inside the statistics of Pneumonia & Flu. As a statistician, I want to tell you Statistics Don't Lie! You can hide the data along one dimension, but they show up elsewhere. Misclassifying the deaths does buy someone some time. It takes a few weeks to compile all-cause mortality data (gasp, the CDC said mortality records are only 75 percent accurate after 8 weeks!)

The other small problem with the chart is the labeling. Neither axis has labels. The data label that shows up when you click on the line might be a default from the software that can't be turned off. It shows the two numbers being plotted without labels.


Here is a re-working of the chart that tells the story:


The proportion of deaths attributed to P&F and Covid together is roughly double the upper end of what Florida should be seeing this time of the year (without Covid). Covid-19 accounts for half the gap. The other half are still being classified as P&F. However, I suspect CDC will adjust these numbers later to reflect the reality. (In making this chart, I also learned that Florida stopped including seasonal visitors in the death counts. This is egregious manipulation. If someone died while in Florida, they should be counted. I didn't investigate whether this counting rule applies only to Covid-19 deaths, or to deaths from all causes. If they had always done that, then I might give them a pass.)

On second thought, maybe not. The other egregious thing that appears to have happened is that the Florida state health department unplugged their prior website so no one can cross-reference any prior documents. The only website I can access now for Florida state health is a Covid-specific site.


There must be something juicy on the previous influenza page, no?


Lastly, when you look at my chart, please pretend that the last week is not on there. In all likelihood, the "drop" is fake because the mortality data have not been fully updated. My chart contains one more week than the Sun-Sentinel chart. So you can see that the drastic decline shown on their chart turned into a big uptick on mine (next to last week).

This is a common mistake on many charts I see these days. Half-baked numbers are shown next to fully-baked ones.

Reviewing the charts in the Oxford Covid-19 study

On my sister (book) blog, I published a mega-post that examines the Oxford study that was cited two weeks ago as a counterpoint to the "doomsday" Imperial College model. These studies bring attention to the art of statistical modeling, and those six posts together are designed to give you a primer, and you don't need math to get a feel.

One aspect that didn't make it to the mega-post is the data visualization. Sad to say, the charts in the Oxford study (link) are uniformly terrible. Figure 3 is typical:


There are numerous design decisions that frustrate readers.

a) The graphic contains two charts, one on top of the other. The left axis extends floor-to-ceiling, giving the false impression that it is relevant to both charts. In fact, the graphic uses dual axes. The bottom chart references the axis shown in the bottom right corner; the left axis is meaningless. The two charts should be drawn separately.

For those who have not read the mega-post about the Oxford models, let me give a brief description of what these charts are saying. The four colors refer to four different models - these models have the same structure but different settings. The top chart shows the proportion of the population that is still susceptible to infection by a certain date. In these models, no one can get re-infected, and so you see downward curves. The bottom chart displays the growth in deaths due to Covid-19. The first death in the UK was reported on March 5.  The black dots are the official fatalities.

b) The designer allocates two-thirds of the space to the top chart, which has a much simpler message. This causes the bottom chart to be compressed beyond cognition.

c) The top chart contains just five lines, smooth curves of the same shape but different slopes. The designer chose to use thick colored lines with black outlines. As a result, nothing precise can be read from the chart. When does the yellow line start dipping? When do the two orange lines start to separate?

d) The top chart should have included margins of error. These models are very imprecise due to the sparsity of data.

e) The bottom chart should be rejected by peer reviewers. We are supposed to judge how well each of the five models fits the cumulative death counts. But three design decisions conspire to prevent us from getting the answer: (i) the vertical axis is severely compressed by tucking this chart underneath the top chart, (ii) the vertical axis uses a log scale, which compresses large values, and (iii) the dots are larger than life.

As I demonstrated in this post, also from the sister blog, many models, especially those assuming an exponential growth rate, fit poorly after the first few days. Charting in log scale hides the degree of error.

f) There is a third chart squeezed into the same canvas. Notice the four little overlapping hills located around Feb 1. These hills are probability distributions, which are presented without an appropriate vertical axis. Each hill represents a particular model's estimate of the date on which the novel coronavirus entered the UK. But that date is unknowable. So the model expresses this uncertainty using a probability distribution. The "peak" of the distribution is the most likely date. The spread of the hill gives the range of plausible dates, and the height at a given date indicates the chance that that is the date of introduction. The missing axis is a probability scale, which is neither the left nor the right axis.


The bottom chart shows up in a slightly different form as Figure 1(A).


Here, the green, gray (blocked) and red thick lines correspond to the yellow/orange/red diamonds in Figure 3. The thin green and red lines show the margins of error I referred to above (these lines are not explicitly explained in the chart annotation.) The actual counts are shown as white rather than black diamonds.

Again, the thick lines and big diamonds conspire to swamp the gaps between model fit and actual data. Again, notice the use of a log scale. This means that the same amount of gap signifies much bigger errors as time moves to the right.

When using the log scale, we should label it using the original units. With a base 10 logarithm, the axis should have labels 1, 10, 100, 1000 instead of 0, 1, 2, 3. (This explains my previous point - why small gaps between a model line and a diamond can mean a big error as the counts go up.)
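A few lines of Python make the point concrete. This is a sketch with made-up death counts and an arbitrary gap of 0.1 units on a base-10 log axis; none of the numbers are read off the Oxford charts.

```python
def absolute_error(actual: float, log10_gap: float) -> float:
    """Error in original units when the model sits `log10_gap` above
    the actual count on a base-10 log axis."""
    predicted = actual * 10 ** log10_gap
    return predicted - actual

# The same visual gap (0.1 log units) at three hypothetical count levels.
for actual in (10, 100, 1000):
    print(actual, round(absolute_error(actual, 0.1), 1))
```

A gap of 0.1 on the log axis is always the same relative error (roughly 26 percent, since 10 to the power 0.1 is about 1.26), so the error in deaths grows tenfold each time the count grows tenfold, while the gap on the page stays the same size.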

Also notice how the line of white diamonds makes it impossible to see what the models are doing prior to March 5, the date of the first reported death. The models apparently start showing fatalities prior to March 5. This is a key part of their conclusion - the Oxford team concluded that the coronavirus has been circulating in the U.K. even before the first infection was reported. The data visualization should therefore bring out the difference in timing.

I hope by the time the preprint is revised, the authors will have improved the data visualization.




The hidden bad assumption behind most dual-axis time-series charts

[Note: As of Monday afternoon, Typepad is having problems rendering images. Please try again later if the charts are not loading properly.]

DC sent me the following chart over Twitter. It supposedly showcases one sector that has bucked the economic collapse, and has conversely been boosted by the stay-at-home orders around the world.


At first glance, I was drawn to the yellow line and the axis title on the right side. I understood the line to depict the growth rate in traffic "vs a normal day". The trend is clear as day. Since March 10 or so, the website has become more popular by the week.

For a moment, I thought the thin black line was a trendline that fits the rather ragged traffic growth data. But looking at the last few data points, I was afraid it was a glove that didn't fit. That's when I realized this is a dual-axis chart. The black line shows the worldwide total Covid-19 cases, with the axis shown on the left side.

As with any dual-axis chart, you can modify the relationship between the two scales to paint a different picture.

This next chart says that the site traffic growth lagged Covid-19 growth until around March 14.


This one gives an ambiguous picture. One can't really say there is a strong correlation between the two time series.



Now, let's look at the chart from the DATA corner of the Trifecta Checkup (link). The analyst selected definitions that are as far apart as possible. So this chart gives a good case study of the intricacy of data definitions.

First, notice the smoothness of the line of Covid-19 cases. This data series is naturally "smoothed" because it is an aggregate of country-level counts, which themselves are aggregates of regional counts.

By contrast, the line of traffic growth rates has not been smoothed. That's why we see sharp ups and downs. This series should be smoothed as well.


The seven-day moving average line indicates a steady growth in traffic. The day-to-day fluctuations represent noise that distracts us from seeing the trendline.
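For readers who want to try the smoothing themselves, here is a minimal sketch of a trailing moving average in Python. The trailing (rather than centered) window is my assumption for the sketch, and the traffic numbers are invented.

```python
def moving_average(values, window=7):
    """Trailing moving average: each output is the mean of the current
    value and the (window - 1) values before it."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for i in range(window - 1, len(values)):
        chunk = values[i - window + 1 : i + 1]
        smoothed.append(sum(chunk) / window)
    return smoothed

# Hypothetical ragged daily growth rates (percent), not the site's data.
daily_growth = [10, 30, 12, 28, 14, 26, 16, 24, 18, 22]
print(moving_average(daily_growth))
```

The output is shorter than the input by six days because the first full seven-day window only becomes available on day seven.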

Second, the Covid-19 series is a cumulative count, which means it's constantly heading upward over time (on rare days, it may go flat but never decrease). The traffic series represents change, is not cumulative, and so it can go up or down over time. To bring the data closer together, the Covid-19 series can be converted into new cases so they are change values.


Third, the traffic series are growth rates as percentages while the Covid-19 series are counts. It is possible to turn Covid-19 counts into growth rates as well. Like this:


By standardizing the units of measurement, both time series can be plotted on the same axis. Here is the new plot:


Fourth, the two growth rates have different reference levels. The Covid-19 growth rate I computed is day-on-day growth. This is appropriate since we don't presume there is a seasonal effect - something like new cases on Mondays are typically larger than new cases on Tuesday doesn't seem plausible.
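The two conversions described above (cumulative counts to new cases, and counts to day-on-day growth rates) can be sketched in Python as follows. The cumulative counts are invented, and applying the growth rate to the cumulative series is one possible choice for illustration; the same function works on any positive series.

```python
def new_cases(cumulative):
    """Daily new cases: the difference between consecutive cumulative counts."""
    return [today - yesterday
            for yesterday, today in zip(cumulative, cumulative[1:])]

def day_on_day_growth_pct(series):
    """Percent change of each value relative to the previous day's value."""
    return [(today - yesterday) * 100.0 / yesterday
            for yesterday, today in zip(series, series[1:])]

# Hypothetical cumulative case counts (not real data).
cumulative = [100, 120, 150, 150, 180]

print(new_cases(cumulative))              # [20, 30, 0, 30]
print(day_on_day_growth_pct(cumulative))  # [20.0, 25.0, 0.0, 20.0]
```

Note that the cumulative series never goes down, while both derived series can rise or fall, which is what makes them comparable to the traffic growth rates.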

Thanks to this helpful explainer (link), I learned what the data analyst meant by a "normal day". The growth rate of traffic is not day-on-day change. It is the change in traffic relative to the average traffic in the last four weeks on the same day of week. If it's a Monday, the change in traffic is relative to the average traffic of the last four Mondays.

This type of seasonal adjustment is used if there is a strong day-of-week effect. For example, if the website reliably gets higher traffic during weekends than weekdays, then the Saturday traffic may always exceed the Friday traffic; instead of comparing Saturday to the day before, we index Saturday to the previous Saturday, Friday to the previous Friday, and then compare those two values.
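Here is how I would sketch the "normal day" calculation in Python, based on my reading of the explainer: each day is compared with the average of the same weekday over the previous four weeks. The traffic numbers are invented, with weekends deliberately higher than weekdays.

```python
def growth_vs_normal_day(traffic, weeks=4):
    """Percent change of each day's traffic relative to the average traffic
    on the same day of week over the previous `weeks` weeks.
    `traffic` is a list of daily values in date order."""
    growth = []
    for i in range(7 * weeks, len(traffic)):
        # Same weekday, 1 to `weeks` weeks back.
        baseline = sum(traffic[i - 7 * k] for k in range(1, weeks + 1)) / weeks
        growth.append((traffic[i] - baseline) * 100.0 / baseline)
    return growth

# Hypothetical traffic: four ordinary weeks (Mon-Sun), then a week 20% higher.
normal_week = [100, 100, 100, 100, 100, 150, 150]
busy_week = [120, 120, 120, 120, 120, 180, 180]
traffic = normal_week * 4 + busy_week

print(growth_vs_normal_day(traffic))
# [20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0]
```

A plain day-on-day comparison would show a 50 percent jump every Saturday in this series, even in ordinary weeks. Indexing to the same weekday removes that pattern and leaves only the 20 percent lift.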


Let's consider the last chart above, the one where I got rid of the dual axes.

A major problem with trying to establish correlation of two time series is time lag. Most charts like this make a critical and unspoken assumption - that the effect of X on Y is immediate. This chart assumes that the higher the number of Covid-19 cases, the more people stay home that day, and the more people swarm the site that day. Said that way, you might see it's ridiculous.

In any correlation in the wild, there is always some amount of time lag. It is usually hard to know how much.
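One common way to probe the lag is to shift one series against the other and recompute the correlation at each shift. The Python sketch below does this with a hand-written Pearson correlation. Both series are invented, and the three-day delay is built in by construction, so this shows only the mechanics, not a finding about the chart.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def lagged_correlation(x, y, lag):
    """Correlation between x on day t and y on day t + lag (lag >= 0)."""
    if lag == 0:
        return pearson(x, y)
    return pearson(x[:-lag], y[lag:])

# Hypothetical series: y repeats x three days later.
x = [1, 3, 2, 5, 4, 6, 8, 7, 9, 10]
y = [0, 0, 0] + x[:-3]

best_lag = max(range(6), key=lambda lag: lagged_correlation(x, y, lag))
print(best_lag)  # 3
```

With real data, the peak is rarely this clean: two trending series correlate at many lags, so the exercise suggests plausible lags rather than proving one.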


Finally, the chart omitted a huge factor driving the growth in traffic. At various times dependent on the country, the website rolled out a free premium service offer. This is the primary reason for the spike around mid March. How much of the traffic growth is due to the popular marketing campaign, and how much is due to stay-at-home orders - that's the real question.

An exposed seam in the crystal ball of coronavirus recovery

One of the questions being asked by the business community is when the economy will recover and how. The Conference Board has offered their outlook in this new article. (This link takes you to the collection of Covid-19 related graphics. You have to find the right one from the carousel. I can't seem to find the direct link to that page.)

This chart summarizes their viewpoint:


They considered three scenarios, starting the recovery in May, over the summer, and in the Fall. In all scenarios, the GDP of the U.S. will contract in 2020 relative to 2019. The earlier the recovery starts, the smaller the decline.

My reaction to the map icon is different from the oil-drop icon in the previously-discussed chart (link). I think here, the icon steals too much attention. The way lines were placed on the map initially made me think the chart is about cross-country travel.

On the other hand, I love the way the designer did the horizontal axis / time-line. It elegantly tells us which numbers are actual and which numbers are projected, without explicitly saying so.


Also notice how, through the use of color, font size and bolding, the designer organizes the layers of detail, and conveys which items are more important to read first.


As I round out the Trifecta Checkup, I found a seam in the Data.

On the right edge, the number for December 2020 is 100.6 which is 0.6 above the reference level. But this number corresponds to a 1.6% reduction. How so?

This seam exposes a gap between how modelers and decision-makers see the world. Evidently, the projections by the analyst are generated using Q3 2019's GDP as baseline (index=100). I'm guessing the analyst chose that quarter because at the time of analysis, the Q4 data had not reached the final round of revision (which came out at the end of March).

A straight-off-the-report conclusion of the analysis is that the GDP would be just back to Q3 2019 level by December 2020 in the most optimistic scenario. (It's clear to me that the data series has been seasonally adjusted as well so that we can compare any month to any month. Years ago, I wrote this primer to understand seasonal adjustments.)

Decision-makers might push back on that conclusion because the reference level of Q3 2019 seems arbitrary. Instead, what they would like to know is the year-on-year change in GDP. A small calculation bridges the two numbers.

The decision-makers are satisfied after finding the numbers they care about. They are not curious about how the sausage is made, i.e., how the monthly numbers result in the year-on-year change. So the seam is left on the chart.
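For the curious, here is one way the sausage could be made, sketched in Python. I am assuming the year-on-year figure compares the average index level across all of 2020 with the average across all of 2019; the Conference Board does not spell this out on the chart. Every monthly value below is invented, except that the series is made to end at 100.6 like the chart.

```python
def annual_change_pct(this_year, last_year):
    """Percent change in the average monthly index level between two years."""
    avg_this = sum(this_year) / len(this_year)
    avg_last = sum(last_year) / len(last_year)
    return (avg_this - avg_last) * 100.0 / avg_last

# Hypothetical monthly GDP index values (baseline = 100), for illustration only.
index_2019 = [100.0] * 12
index_2020 = [100.5, 100.5, 97.0, 90.0, 92.0, 94.0,
              96.0, 97.5, 98.5, 99.5, 100.0, 100.6]

print(index_2020[-1] > 100.0)                            # True: December ends above baseline
print(annual_change_pct(index_2020, index_2019) < 0.0)   # True: yet the year as a whole shrinks
```

The December value describes where the economy ends up, while the annual change averages over the deep hole in the middle of the year. Both numbers can be correct at once, which is why they look contradictory when printed side by side.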


The why axis

A few weeks ago, I replied to a tweet by someone who was angered by the amount of bad graphics about coronavirus. I take a glass-half-full viewpoint: it's actually heart-warming for dataviz designers to realize that their graphics are being read! When someone critiques your work, it is proof that they cared enough to look at it. Worse is when you publish something, and no one reacts to it.

That said, I just wasted half an hour trying to get into the head of the person who made the following:


Longtime reader Chris P. forwarded this tweet to me, and I saw that Andrew Gelman got sent this one, too.

The chart looks harmless until you check out the vertical axis labels. It's... um... the most unusual. The best way to interpret what the designer did is to break up the chart into three components. Like this:


The big mystery is why the designer spent the time and energy to make this mischief.

The usual suspect is fake news. The clearest sign of malintent is the huge size of the dots. Each dot spans almost the entirety of the space between gridlines.

But there is almost no fake news here. The overall trend line is intact despite the attempted distortion. The following is a superposition of an unmanipulated line (yellow) on top of the manipulated:



The next guess is incompetence. The evidence against this view is the amount of energy required to execute these changes. In Excel, it takes a lot of work. It's easier to do this in R or any programming language with which you can design your own axis.

Even for the R coders, the easy part is to replicate the design, but the hard part is to come up with the concept in the first place!

You can't just stumble onto a design like this. So I am not convinced the designer is an idiot.


How much work? You have to create three separate charts, with three carefully chosen vertical scales, and then clip, merge, and sew the seam. The weirdest bit is throwing away three of the twelve axis labels and writing in three fake numbers.

Here's the recipe: (if the gif doesn't load automatically, click on it)


Help me readers! I'm stumped. Why oh why did someone make this? What is the point?


P.S. [4/9/2020] A conversation with Carlos on Andrew's blog reveals another issue. I pointed out that the "Total cases" printed up top was not the sum of the 15 numbers on the chart. There was a gap of 184 cases. Carlos sent me a link showing a day on which the total cases in Colorado was 183 cases. I didn't quite get the point initially. He explained that it's 183 existing cases prior to the start of the period of this chart, plus the new cases during this period, leading to the "Total cases" as of the end of the period of this chart.

So, another mystery solved. This brings up an important point about making effective charts: confusion arises when two things on the visual seem to contradict each other. In most line charts, if there is a line and then a "total", the natural expectation is that the "total" is the sum of the data that make up the line. In this case, that sum would be the total new cases during the time period depicted. Total new cases isn't the same as total cases from case #1, which is what the chart prints.

It's clearer to say "Total Cases on 3/17 = 183; on 4/1 = 3342".


The epidemic of simple comparisons

Another day, another Twitter user sent a sloppy chart featured on TV news. This CNN graphic comes from Hugo K. by way of Kevin T.

And it's another opportunity to apply the self-sufficiency test.


Like before, I removed the data printed on the graphic. In reading this chart, we'd like to know the number of U.S. reported cases of coronavirus relative to China, and Italy relative to the U.S.

So, our eyes trace these invisible lines:


U.S. cases are roughly two-thirds of China's while Italian cases are 90% of the U.S. count.

That's what the visual elements, the columns, are telling us. But it's fake news. Here is the chart with the data:


The counts of reported cases in all three countries were neck and neck around this time.

What this quick exercise shows is that anyone who correctly reads this chart is reading the data off the chart, and ignoring the contradictory message sent by the relative column heights. Thus, the visual elements are not self-sufficient in conveying the message.


In a Trifecta Checkup, I'd be most concerned about the D corner. The naive comparison of these case counts is an epidemic of its own. It sometimes leads to poor decisions that can exacerbate the public-health problems. See this post on my sister blog.

The difference in case counts between different countries (or regions or cities or locales) is not a direct measure of the difference in coronavirus spread in these places! This is because there are many often-unobserved factors that will explain most if not all of the differences.

After a lot of work by epidemiologists, medical researchers, statisticians and the like, we now realize that different places conduct different numbers of tests. No test, no positive. The U.S. has been slow to get testing ramped up.

Less understood is the effect of testing selection. Consider the U.S. where it is still hard to get tested. Only those who meet a list of criteria are eligible. Imagine an alternative reality in which the U.S. conducted the same number of tests but, instead of selecting the people most likely to be infected, tested a random sample of people. The incidence of the virus in a random sample is much lower than among the severely ill people who currently qualify for a test. Therefore, in this alternative reality, the number of positives would be lower despite equal numbers of tests.
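The arithmetic behind this thought experiment is simple enough to sketch. Every number below is hypothetical, picked only to illustrate the direction of the effect; none of them is an estimate for any real country.

```python
def expected_positives(num_tests, positivity_rate):
    """Expected number of positive results under a given testing strategy."""
    return num_tests * positivity_rate


# All values are hypothetical, for illustration only.
NUM_TESTS = 10_000      # same number of tests under both strategies
TARGETED_RATE = 0.20    # share infected among people who meet strict criteria
RANDOM_RATE = 0.01      # share infected in a random sample of the population

targeted = expected_positives(NUM_TESTS, TARGETED_RATE)
random_sample = expected_positives(NUM_TESTS, RANDOM_RATE)
print(f"targeted testing: {targeted:.0f} positives")
print(f"random testing:   {random_sample:.0f} positives")
```

Same number of tests, same underlying epidemic, very different reported case counts. The selection rule alone moves the number.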

That's for an equal number of tests. If test kits are not readily available, then a targeted (triage) testing strategy will under-count cases since mild cases or asymptomatic infections escape attention. (See my Wired column for problems with triage.)

To complicate things even more, in most countries, the number of tests and the testing selection have changed over time so a cumulative count statistic obscures those differences.

Besides testing, there are a host of other factors that affect reported case counts. These are less talked about now but eventually will be.

Different places have different population densities. A lot of cases in a big city and an equal number of cases in a small town do not signify equal severity. Clearly, the situation in the latter is more serious, since the same count makes up a much larger share of the town's population.

Because the virus affects age groups differently, a direct comparison of the case counts without adjusting for age is also misleading. The number of deaths of 80-year-olds in a college town is low not because the chance of dying from COVID-19 is lower there than in a retirement community; it's low because 80-year-olds are a small proportion of the population.
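Here is a small sketch of the age-mix effect. The two towns and the death risks are hypothetical; the point is only that identical age-specific risks can produce very different death counts when the age composition differs.

```python
def expected_deaths(population_by_age, risk_by_age):
    """Expected deaths per age group, given group sizes and per-person risk."""
    return {
        age: population_by_age[age] * risk_by_age[age]
        for age in population_by_age
    }


# Hypothetical per-person death risk, identical in both places.
RISK = {"under 30": 0.0001, "30-79": 0.001, "80+": 0.01}

# Hypothetical towns with the same total population (100,000).
COLLEGE_TOWN = {"under 30": 80_000, "30-79": 19_000, "80+": 1_000}
RETIREMENT_COMMUNITY = {"under 30": 5_000, "30-79": 65_000, "80+": 30_000}

college = expected_deaths(COLLEGE_TOWN, RISK)
retirement = expected_deaths(RETIREMENT_COMMUNITY, RISK)
print("80+ deaths, college town:        ", round(college["80+"]))
print("80+ deaths, retirement community:", round(retirement["80+"]))
print("all ages, college town:          ", round(sum(college.values()), 1))
print("all ages, retirement community:  ", round(sum(retirement.values()), 1))
```

An 80-year-old faces the same risk in both towns in this setup. The counts differ only because one town has thirty times as many 80-year-olds.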

Next, the cumulative counts ignore which stage of the "epi curve" these countries are at. The following chart can replace most of the charts you're inundated with by the media:


(I found the chart here.)

An epi curve traces the time line of a disease outbreak. Every location is expected to move through stages, with cases reaching a peak and, eventually, the number of newly recovered exceeding the number of newly infected.

Notice that China, Italy and the U.S. occupy different stages of this curve. It's proper to compare the U.S. to China and Italy when they were at a similar early phase of their respective epi curves.

In addition, any cross-location comparison should account for how reliable the data sources are, and the different definitions of a "case" in different locations.


Finally, let's consider the Question posed by the graphic designer. It is the morbid question: which country is hit the worst by coronavirus?

This is a Type DV chart. It's got a reasonable question, but the data require a lot more work to adjust for the list of biases. The visual design is hampered by the common mistake of not starting columns at zero.
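To see how much damage a non-zero baseline can do, here is a quick sketch. The case counts and the baseline are hypothetical, since I don't know where the designer actually started the columns.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of column heights for values a and b when columns start at baseline."""
    if baseline >= min(a, b):
        raise ValueError("baseline must be below both values")
    return (a - baseline) / (b - baseline)


# Hypothetical counts that are neck and neck, and a hypothetical axis start.
COUNTRY_A = 80_000
COUNTRY_B = 82_000
TRUNCATED_BASELINE = 74_000

true_ratio = apparent_ratio(COUNTRY_A, COUNTRY_B)
drawn_ratio = apparent_ratio(COUNTRY_A, COUNTRY_B, TRUNCATED_BASELINE)
print(f"ratio in the data:       {true_ratio:.2f}")
print(f"ratio of column heights: {drawn_ratio:.2f}")
```

With columns starting at zero, the two heights are nearly equal, matching the data. With the truncated baseline, the shorter column is drawn at three-quarters the height of the taller one, which is the kind of gap the self-sufficiency test exposes.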


Comparing chance of death of coronavirus and flu

The COVID-19 charts are proving one thing. When the topic of a dataviz is timely and impactful, readers will study the graphics and ask questions. I've been sent some of these charts lately, and will be featuring them here.

A former student saw this chart from Business Insider (link) and didn't like it.


My initial reaction was generally positive. It's clear the chart addresses a comparison between death rates of the flu and COVID-19, an important current question. The side-by-side panel is effective at allowing such a comparison. The column charts look decent, and there aren't excessive gridlines.

Sure, one sees a few simple design fixes, like removing the vertical axis altogether (since the entire dataset has already been printed). I'd also un-slant the age labels.


I'd like to discuss some subtler improvements.

A primary challenge is dealing with the different definitions of age groups across the two datasets. While the side-by-side column charts prompt readers to go left-right, right-left in comparing death rates, it's not easy to identify which column to compare to which. This is not fixable in the datasets because the organizations that compile them define their own age groups.

Also, I prefer to superimpose the death rates on the same chart, using something like a dot plot rather than a column chart. This makes the comparison even easier.

Here is a revised visualization:


The contents of this chart raise several challenges for public health officials. Clearly, hospital resources should be preferentially offered to older patients. But young people could be spreading the virus among the community.

Caution is advised as the data for COVID-19 suffer from many types of inaccuracies, as outlined here.