When the visual runs away from the data

The pressure of the coronavirus news cycle has gotten the better of some graphics designers. Via Twitter, Mark B sent me the following chart:

Junkcharts_abccovidbiggestworries_sufficiency

I applied the self-sufficiency test to this pie chart. That's why you can't see the data which were also printed on the chart.

The idea of self-sufficiency is to test how much work the visual elements of the graphic are doing to convey its message. Look at the above chart, and guess the three values are.

Roughly speaking, all three answers are equally popular, with perhaps a little less than a third of respondents indicating "Getting It" as their biggest COVID-19 worry.

If measured, the slices represent 38%, 35% and 27%.

Now, here is the same chart with the data:

Abc_covidbiggestworries

Each number is way off! In addition, the three numbers sum to 178%.

Trifectacheckup_junkcharts_imageThis is an example of the Visual being at odds with the Data, using a Trifecta Checkup analysis. (Read about the Trifecta here.)

What the Visual is saying is not the same as what the data are saying. So the green arrow between D and V is broken.

***

This is a rather common mistake. This survey question apparently allows each respondent to select more than one answers. Whenever more than one responses are accepted, one cannot use a pie chart.

Here is a stacked bar chart that does right by the data.

Redo_junkcharts_abcbiggestcovidworries

 


The epidemic of simple comparisons

Another day, another Twitter user sent a sloppy chart featured on TV news. This CNN graphic comes from Hugo K. by way of Kevin T.

And it's another opportunity to apply the self-sufficiency test.

Junkcharts_cnncovidcases_sufficiency_1

Like before, I removed the data printed on the graphic. In reading this chart, we like to know the number of U.S. reported cases of coronavirus relative to China, and Italy relative to the U.S.

So, our eyes trace these invisible lines:

Junkcharts_cnncovidcases_sufficiency_2

U.S. cases are roughly two-thirds of China while Italian cases are 90% of U.S.

That's what the visual elements, the columns, are telling us. But it's fake news. Here is the chart with the data:

Cnn_covidcases

The counts of reported cases in all three countries were neck and neck around this time.

What this quick exercise shows is that anyone who correctly reads this chart is reading the data off the chart, and ignoring the contradictionary message sent by the relative column heights. Thus, the visual elements are not self-sufficient in conveying the message.

***

In a Trifecta Checkup, I'd be most concerned about the D corner. The naive comparison of these case counts is an epidemic of its own. It sometimes leads to poor decisions that can exacerbate the public-health problems. See this post on my sister blog.

The difference in case counts between different countries (or regions or cities or locales) is not a direct measure of the difference in coronavirus spread in these places! This is because there are many often-unobserved factors that will explain most if not all of the differences.

After a lot of work by epidemiologists, medical researchers, statisticians and the likes, we now realize that different places conduct different numbers of tests. No test, no positive. The U.S. has been slow to get testing ramped up.

Less understood is the effect of testing selection. Consider the U.S. where it is still hard to get tested. Only those who meet a list of criteria are eligible. Imagine an alternative reality in which the U.S. conducted the same number of tests but instead of selecting most likely infected people to be tested, we test a random sample of people. The incidence of the virus in a random sample is much lower than in the severely infected, therefore, in this new reality, the number of positives would be lower despite equal numbers of tests.

That's for equal number of tests. If test kits are readily available, then a targeted (triage) testing strategy will under-count cases since mild cases or asymptomatic infections escape attention. (See my Wired column for problems with triage.)

To complicate things even more, in most countries, the number of tests and the testing selection have changed over time so a cumulative count statistic obscures those differences.

Beside testing, there are a host of other factors that affect reported case counts. These are less talked about now but eventually will be.

Different places have different population densities. A lot of cases in a big city and an equal number of cases in a small town do not signify equal severity.  Clearly, the situation in the latter is more serious.

Because the virus affects age groups differently, a direct comparison of the case counts without adjusting for age is also misleading. The number of deaths of 80-year-olds in a college town is low not because the chance of dying from COVID-19 is lower there than in a retirement community; it's low because 80-year-olds are a small proportion of the population.

Next, the cumulative counts ignore which stage of the "epi curve" these countries are at. The following chart can replace most of the charts you're inundated with by the media:

Epicurve_coronavirus

(I found the chart here.)

An epi curve traces the time line of a disease outbreak. Every location is expected to move through stages, with cases reaching a peak and eventually the number of newly recovered will exceed the number of newly infected.

Notice that China, Italy and the US occupy different stages of this curve.  It's proper to compare U.S. to China and Italy when they were at a similar early phase of their respective epi curve.

In addition, any cross-location comparison should account for how reliable the data sources are, and the different definitions of a "case" in different locations.

***

Finally, let's consider the Question posed by the graphic designer. It is the morbid question: which country is hit the worst by coronavirus?

This is a Type DV chart. It's got a reasonable question, but the data require a lot more work to adjust for the list of biases. The visual design is hampered by the common mistake of not starting columns at zero.

 


Food coma and self-sufficiency in dataviz

The Hustle wrote a strong analysis of the business of buffets. If you've read my analysis of Groupon's business model in Numbersense (link), you'll find some similarities. A key is to not think of every customer as an average customer; there are segments of customers who behave differently, and creating a proper mix of different types of customers is the management's challenge. I will make further comments on the statistics in a future post on the sister blog.

At Junk Charts, we'll focus on visualizing and communciating data. The article in The Hustle comes with the following dataviz:

Hustle_buffetcost

This dataviz fails my self-sufficiency test. Recall: self-sufficiency is a basic requirement of visualizing data - that the graphical elements should be sufficient to convey the gist of the data. Otherwise, there is no point in augmenting the data with graphical elements.

The self-sufficiency test is to remove the dataset from the dataviz, and ask whether the graphic can stand on its own. So here:

Redo_hustlebuffetcost_selfsufficiency

The entire set of ingredient costs appears on the original graphic. When these numbers are removed, the reader gets the wrong message - that the cost is equally split between these five ingredients.

This chart reminds me of the pizza chart that everyone thought was a pie chart except its designer! I wrote about it here. Food coma is a thing.

The original chart may be regarded as an illustration rather than data visualization. If so, it's just a few steps from becoming a dataviz. Like this:

Redo_hustlebuffetcost

P.S. A preview of what I'll be talking about at the sister blog. The above diagram illustrates the average case - for the average buffet diner. Underneath these costs is an assumption about the relative amounts of each food that is eaten. But eaten by whom?

Also, if you have Numbersense (link), the chapter on measuring the inflation rate is relevant here. Any inflation metric must assume a basket of goods, but then the goods within the basket have to be weighted by the amount of expenditure. It's much harder to get the ratio of expenditures correct compared to getting price data.

 

 


Watching a valiant effort to rescue the pie chart

Today we return to the basics. In a twitter exchange with Dean E., I found the following pie chart in an Atlantic article about who's buying San Francisco real estate:

Atlantic_sfrealestatepie

The pie chart is great at one thing, showing how workers in the software industry accounted for half of the real estate purchases. (Dean and I both want to see more details of the analysis as we have many questions about the underlying data. In this post, I ignore these questions.)

After that, if we want to learn anything else from the pie chart, we have to read the data labels. This calls for one of my key recommendations: make your charts sufficient. The principle of self-sufficiency is that the visual elements of the data graphic should by themselves say something about the data. The test of self-sufficiency is executed by removing the data printed on the chart so that one can assess how much work the visual elements are performing. If the visual elements require data labels to work, then the data graphic is effectively a lookup table.

This is the same pie chart, minus the data:

Redo_atlanticsfrealestate_sufficiency

Almost all pie charts with a large number of slices are packed with data labels. Think of the labeling as a corrective action to fix the shortcoming of the form.

Here is a bar chart showing the same data:

Junkcharts_redo_atlanticsfrealestatebar

***

Let's look at all the efforts made to overcome the lack of self-sufficiency.

Here is a zoom-in on the left side of the chart:

Redo_atlanticsfrealestate_labeling_1

Data labels are necessary to help readers perceive the sizes of the slices. But as the slices are getting smaller, the labels are getting too dense, so the guiding lines are being stretched.

Eventually, the designer gave up on labeling every slice. You can see that some slices are missing labels:

Redo_atlanticsfrealestate_labeling_3

The designer also had to give up on sequencing the slices by the data. For example, hardware with a value of 2.4% should be placed between Education and Law. It is shifted to the top left side to make the labeling easier.

Redo_atlanticsfrealestate_labeling_2

Fitting all the data labels to the slices becomes the singular task at hand.

 


How to describe really small chances

Reader Aleksander B. sent me to the following chart in the Daily Mail, with the note that "the usage of area/bubble chart in combination with bar alignment is not very useful." (link)

Dailymail-image-a-35_1431545452562

One can't argue with that statement. This chart fails the self-sufficiency test: anyone reading the chart is reading the data printed on the right column, and does not gain anything from the visual elements (thus, the visual representation is not self-sufficient). As a quick check, the size of the risk for "motorcycle" should be about 30 times larger than that of "car"; the size of the risk for "car" should be 100 times larger than that of "airplane". The risk of riding motorcycles then is roughly 3,000 times that of flying in an airplane. 

The chart does not appear to be sized properly as a bubble chart:

Dailymail_travelrisk_bubble

You'll notice that the visible proportion of the "car" bubble is much larger than that of the "motorcycle" bubble, which is one part of the problem.

Nor is it sized as a bar chart:

Dailymail_travelrisk_bar

As a bar chart, both the widths and the heights of the bars vary; and the last row presents a further challenge as the bubble for the airplane does not touch the baseline.

***

Besides the Visual, the Data issues are also quite hard. This is how Aleksander describes it: "as a reader I don't want to calculate all my travel distances and then do more math to compare different ways of traveling."

The reader wants to make smarter decisions about travel based on the data provided here. Aleksandr proposes one such problem:

In terms of probability it is also easier to understand: "I am sitting in my car in strong traffic. At the end in 1 hour I will make only 10 miles so what's the probability that I will die? Is it higher or lower than 1 hour in Amtrak train?"

The underlying choice is between driving and taking Amtrak for a particular trip. This comparison is relevant because those two modes of transport are substitutes for this trip. 

One Data issue with the chart is that riding a motorcycle and flying in a plane are rarely substitutes. 

***

A way out is to do the math on behalf of your reader. The metric of deaths per 1 billion passenger-miles is not intuitive for a casual reader. A more relevant question is what's the chance of dying from the time I spend per year of driving (or riding a plane). Because the chance will be very tiny, it is easier to express the risk as the number of years of travel before I expect to see one death.

Let's assume someone drives 300 days per year, and 100 miles per day so that each year, this driver contributes 30,000 passenger-miles to the U.S. total (which is 3.2 trillion). We convert 7.3 deaths per 1 billion passenger-miles to 1 death per 137 million passenger-miles. Since this driver does 30K per year, it will take (137 million / 30K) = about 4,500 years to see one death on average. This calculation assumes that the driver drives alone. It's straightforward to adjust the estimate if the average occupancy is higher than 1. 

Now, let's consider someone who flies once a month (one outbound trip plus one return trip). We assume that each plane takes on average 100 passengers (including our protagonist), and each trip covers on average 1,000 miles. Then each of these flights contributes 100,000 passenger-miles. In a year, the 24 trips contribute 2.4 million passenger-miles. The risk of flying is listed at 0.07 deaths per 1 billion, which we convert to 1 death per 14 billion passenger-miles. On this flight schedule, it will take (14 billion / 2.4 million) = almost 6,000 years to see one death on average.

For the average person on those travel schedules, there is nothing to worry about. 

***

Comparing driving and flying is only valid for those trips in which you have a choice. So a proper comparison requires breaking down the average risks into components (e.g. focusing on shorter trips). 

The above calculation also suggests that the risk is not evenly spread out throughout the population, despite the use of an overall average. A trucker who is on the road every work day is clearly subject to higher risk than an occasional driver who makes a few trips on rental cars each year.

There is a further important point to note about flight risk, due to MIT professor Arnold Barnett. He has long criticized the use of deaths per billion passenger-miles as a risk metric for flights. (In Chapter 5 of Numbers Rule Your World (link), I explain some of Arnie's research on flight risk.) The problem is that almost all fatal crashes involving planes happen soon after take-off or not long before landing. 

 


Is the visual serving the question?

The following chart concerns California's bullet train project.

California_bullettrain

Now, look at the bubble chart at the bottom. Here it is - with all the data except the first number removed:

Highspeedtrains_sufficiency

It is impossible to know how fast the four other train systems run after I removed the numbers. The only way a reader can comprehend this chart is to read the data inside the bubbles. This chart fails the "self-sufficiency test". The self-sufficiency test asks how much work the visual elements on the chart are doing to communicate the data; in this case, the bubbles do nothing at all.

Another problem: this chart buries its lede. The message is in the caption: how California's bullet train rates against other fast train systems. California's train speed of 220 mph is only mentioned in the text but not found in the visual.

Here is a chart that draws attention to the key message:

Redo_highspeedtrains

In a Trifecta checkup, we improved this chart by bringing the visual in sync with the central question of the chart.


McKinsey thinks the data world needs more dataviz talent

Note about last week: While not blogging, I delivered four lectures on three topics over five days: one on the use of data analytics in marketing for a marketing class at Temple; two on the interplay of analytics and data visualization, at Yeshiva and a JMP Webinar; and one on how to live during the Data Revolution at NYU.

This week, I'm back at blogging.

McKinsey publishes a report confirming what most of us already know or experience - the explosion of data jobs that just isn't stopping.

On page 5, it says something that is of interest to readers of this blog: "As data grows more complex, distilling it and bringing it to life through visualization is becoming critical to help make the results of data analyses digestible for decision makers. We estimate that demand for visualization grew roughly 50 percent annually from 2010 to 2015." (my bolding)

The report contains a number of unfortunate graphics. Here's one:

Mckinseyreport_pageiii

I applied my self-sufficiency test by removing the bottom row of data from the chart. Here is what happened to the second circle, representing the fraction of value realized by the U.S. health care industry.

Mckinseyreport_pageiii_inset

What does the visual say? This is one of the questions in the Trifecta Checkup. We see three categories of things that should add up to 100 percent. With a little more effort, we find the two colored categories are each 10% while the white area is 80%. 

But that's not what the data say, because there is only one thing being measured: how much of the potential has already been realized. The two colors is an attempt to visualize the uncertainty of the estimated proportion, which in this case is described as 10 to 20 percent underneath the chart.

If we have to describe what the two colored sections represent: the dark green section is the lower bound of the estimate while the medium green section is the range of uncertainty. The edge between the two sections is the actual estimated proportion (assuming the uncertainty bound is symmetric around the estimate)!

A first attempt to fix this might be to use line segments instead of colored arcs. 

Redo_mckinseyreport_inset_jc_1

The middle diagram emphasizes the mid-point estimate while the right diagram, the range of estimates. Observe how differently these two diagrams appear from the original one shown on the left.

This design only works if the reader perceives the chart as a "racetrack" chart. You have to see the invisible vertical line at the top, which is the starting line, and measure how far around the track has the symbol gone. I have previously discussed why I don't like racetracks (for example, here and here).

***

Here is a sketch of another design:

Redo_mckinseyreport_jc_2

The center figure will have to be moved and changed to a different shape. This design conveys the sense of a goal (at 100%) and how far one is along the path. The uncertainty is represented by wave-like elements that make the exact location of the pointer arrow appear as wavering.

 

 

 

 


Plotted performance guaranteed not to predict future performance

On my flight back from Lyon, I picked up a French magazine, and found the following chart:

French interest rates chart small

A quick visit to Bing Translate tells me that this chart illustrates the rates of return of different types of investments. The headline supposedly says "Only the risk pays". In many investment brochures, after presenting some glaringly optimistic projections of future returns, the vendor legally protects itself by proclaiming "Past performance does not guarantee future performance."

For this chart, an appropriate warning is PLOTTED PERFORMANCE GUARANTEED NOT TO PREDICT THE FUTURE!

***

Two unusual decisions set this chart apart:

1. The tree ring imagery, which codes the data in the widths of concentric rings around a common core

2. The placement of larger numbers toward the middle, and smaller numbers in the periphery.

When a reader takes in the visual design of this chart, what is s/he drawn to?

The designer evidently hopes the reader will focus on comparing the widths of the rings (A), while ignoring the areas or the circumferences. I think it is more likely that the reader will see one of the following:

(B) the relative areas of the tree rings

(C) the areas of the full circles bounded by the circumferences

(D) the lengths of the outer rings

(E) the lengths of the inner rings

(F) the lengths of the "middle" rings (defined as the average of the outer and inner rings)

Here is a visualization of six ways to "see" what is on the French rates of return chart:

Redo_jc_frenchinterestrates_1

Recall the Trifecta Checkup (link). This is an example where "What does the visual say" and "What does the data say" may be at variance. In case (A), if the reader is seeing the ring widths, then those two aspects are in sync. In every other case, the two aspects are disconcordant. 

The level of distortion is visualized in the following chart:

Redo_jc_frenchinterestrates_2

Here, I normalized everything to the size of the SCPI data. The true data is presented by the ring width column, represented by the vertical stripes on the left. If the comparisons are not distorted, the other symbols should stay close to the vertical stripes. One notices there is always distortion in cases (B)-(F). This is primarily due to the placement of the large numbers near the center and the small numbers near the edge. In other words, the radius is inversely proportional to the data!

 The amount of distortion for most cases ranges from 2 to 6 times. 

While the "ring area" (B) version is least distorted on average, it is perhaps the worst of the six representations. The level of distortion is not a regular function of the size of the data. The "sicav monetaries" (smallest data) is the least distorted while the data of medium value are the most distorted.

***

To improve this chart, take a hint from the headline. Someone recognizes that there is a tradeoff between risk and return. The data series shown, which is an annualized return, only paints the return part of the relationship. 

 

 

 


No Latin honors for graphic design

Paw_honors_2018This chart appeared on a recent issue of Princeton Alumni Weekly.

If you read the sister blog, you'll be aware that at most universities in the United States, every student is above average! At Princeton,  47% of the graduating class earned "Latin" honors. The median student just missed graduating with honors so the honors graduate is just above average! The 47% number is actually lower than at some other peer schools - at one point, Harvard was giving 90% of its graduates Latin honors.

Side note: In researching this post, I also learned that in the Senior Survey for Harvard's Class of 2018, two-thirds of the respondents (response rate was about 50%) reported GPA to be 3.71 or above, and half reported 3.80 or above, which means their grade average is higher than A-.  Since Harvard does not give out A+, half of the graduates received As in almost every course they took, assuming no non-response bias.

***

Back to the chart. It's a simple chart but it's not getting a Latin honor.

Most readers of the magazine will not care about the decimal point. Just write 18.9% as 19%. Or even 20%.

The sequencing of the honor levels is backwards. Summa should be on top.

***

Warning: the remainder of this post is written for graphics die-hards. I go through a bunch of different charts, exploring some fine points.

People often complain that bar charts are boring. A trendy alternative when it comes to count or percentage data is the "pictogram."

Here are two versions of the pictogram. On the left, each percent point is shown as a dot. Then imagine each dot turned into a square, then remove all padding and lines, and you get the chart on the right, which is basically an area chart.

Redo_paw_honors_2018

The area chart is actually worse than the original column chart. It's now much harder to judge the areas of irregularly-shaped pieces. You'd have to add data labels to assist the reader.

The 100 dots is appealing because the reader can count out the number of each type of honors. But I don't like visual designs that turn readers into bean-counters.

So I experimented with ways to simplify the counting. If counting is easier, then making comparisons is also easier.

Start with this observation: When asked to count a large number of objects, we group by 10s and 5s.

So, on the left chart below, I made connectors to form groups of 5 or 10 dots. I wonder if I should use different line widths to differentiate groups of five and groups of ten. But the human brain is very powerful: even when I use the same connector style, it's easy to see which is a 5 and which is a 10.

Redo_paw_honors_2

On the left chart, the organizing principles are to keep each connector to its own row, and within each category, to start with 10-group, then 5-group, then singletons. The anti-principle is to allow same-color dots to be separated. The reader should be able to figure out Summa = 10+3, Magna = 10+5+1, Cum Laude = 10+5+4.

The right chart is even more experimental. The anti-principle is to allow bending of the connectors. I also give up on using both 5- and 10-groups. By only using 5-groups, readers can rely on their instinct that anything connected (whether straight or bent) is a 5-group. This is powerful. It relieves the effort of counting while permitting the dots to be packed more tightly by respective color.

Further, I exploited symmetry to further reduce the counting effort. Symmetry is powerful as it removes duplicate effort. In the above chart, once the reader figured out how to read Magna, reading Cum Laude is simplified because the two categories share two straight connectors, and two bent connectors that are mirror images, so it's clear that Cum Laude is more than Magna by exactly three dots (percentage points).

***

Of course, if the message you want to convey is that roughly half the graduates earn honors, and those honors are split almost even by thirds, then the column chart is sufficient. If you do want to use a pictogram, spend some time thinking about how you can reduce the effort of the counting!

 

 

 

 

 


Two views of earthquake occurrence in the Bay Area

This article has a nice description of earthquake occurrence in the San Francisco Bay Area. A few quantities are of interest: when the next quake occurs, the size of the quake, the epicenter of the quake, etc. The data graphic included in the article fails the self-sufficiency test: the only way to read this chart is to read out the entire data set - in other words, the graphical details have no utility.

Earthquake-probability-chart

The article points out the clustering of earthquakes. In particular, there is a 68-year "quiet period" between 1911 and 1979, during which no quakes over 6.0 in size occurred. The author appears to have classified quakes into three groups: "Largest" which are those at 6.5 or over; "Smaller but damaging" which are those between 6.0 and 6.5; and those below 6.0 (not shown).

For a more standard and more effective visualization of this dataset, see this post on a related chart (about avian flu outbreaks). The post discusses a bubble chart versus a column chart. I prefer the column chart.

image from junkcharts.typepad.com

This chart focuses on the timing of rare events. The time between events is not as easy to see. 

What if we want to focus on the "quiet years" between earthquakes? Here is a visualization that addresses the question: when will the next one hit us?

Redo_jc_earthquakeprobability