The why axis

A few weeks ago, I replied to a tweet by someone who was angered by the amount of bad graphics about coronavirus. I take a glass-half-full viewpoint: it's actually heart-warming for  dataviz designers to realize that their graphics are being read! When someone critiques your work, it is proof that they cared enough to look at it. Worse is when you publish something, and no one reacts to it.

That said, I just wasted half an hour trying to get into the head of the person who made the following:

Fox31_co_newcases edited

Longtime reader Chris P. forwarded this tweet to me, and I saw that Andrew Gelman got sent this one, too.

The chart looked harmless until you check out the vertical axis labels. It's... um... the most unusual. The best way to interpret what the designer did is to break up the chart into three components. Like this:

Redo_junkcharts_fox31cocases

The big mystery is why the designer spent the time and energy to make this mischief.

The usual suspect is fake news. The clearest sign of malintent is the huge size of the dots. Each dot spans almost the entirety of the space between gridlines.

But there is almost no fake news here. The overall trend line is intact despite the attempted distortion. The following is a superposition of an unmanipulated line (yellow) on top of the manipulated:

Redo_junkcharts_fox31cocases2

***

The next guess is incompetence. The evidence against this view is the amount of energy required to execute these changes. In Excel, it takes a lot of work. It's easier to do this in R or any programming languages with which you can design your own axis.

Even for the R coders, the easy part is to replicate the design, but the hard part is to come up with the concept in the first place!

You can't just stumble onto a design like this. So I am not convinced the designer is an idiot.

***

How much work? You have to create three separate charts, with three carefully chosen vertical scales, and then clip, merge, and sew the seam. The weirdest bit is throwing away three of the twelve axis labels and writing in three fake numbers.

Here's the recipe: (if the gif doesn't load automatically, click on it)

Fox31_co_cases_B6

Help me readers! I'm stumped. Why oh why did someone make this? What is the point?

 

 


Graphing the economic crisis of coronavirus 2

Last week, I discussed Ray's chart that compares the S&P 500 performance in this crisis against previous crises.

A reminder:

Tcb_stockmarketindices_fourcrises

Another useful feature is the halo around the right edge of the COVID-19 line. This device directs our eyes to where he wants us to look.

In the same series, he made the following for The Conference Board (link):

TCB-COVID-19-impact-oil-prices-640

Two things I learned from this chart:

The oil market takes a much longer time to recover after crises, compared to the S&P. None of these lines reached above 100 in the first 150 days (5 months).

Just like the S&P, the current crisis is most similar in severity to the 2008 Great Recession, only worse, and currently, the price collapse in oil is quite a bit worse than in 2008.

***
The drop of oil is going to be contentious. This is a drop too many for a Tufte purist. It might as well symbolize a tear shed.

The presence of the icon tells me these lines depict the oil market without having to read text. And I approve.


The epidemic of simple comparisons

Another day, another Twitter user sent a sloppy chart featured on TV news. This CNN graphic comes from Hugo K. by way of Kevin T.

And it's another opportunity to apply the self-sufficiency test.

Junkcharts_cnncovidcases_sufficiency_1

Like before, I removed the data printed on the graphic. In reading this chart, we like to know the number of U.S. reported cases of coronavirus relative to China, and Italy relative to the U.S.

So, our eyes trace these invisible lines:

Junkcharts_cnncovidcases_sufficiency_2

U.S. cases are roughly two-thirds of China while Italian cases are 90% of U.S.

That's what the visual elements, the columns, are telling us. But it's fake news. Here is the chart with the data:

Cnn_covidcases

The counts of reported cases in all three countries were neck and neck around this time.

What this quick exercise shows is that anyone who correctly reads this chart is reading the data off the chart, and ignoring the contradictionary message sent by the relative column heights. Thus, the visual elements are not self-sufficient in conveying the message.

***

In a Trifecta Checkup, I'd be most concerned about the D corner. The naive comparison of these case counts is an epidemic of its own. It sometimes leads to poor decisions that can exacerbate the public-health problems. See this post on my sister blog.

The difference in case counts between different countries (or regions or cities or locales) is not a direct measure of the difference in coronavirus spread in these places! This is because there are many often-unobserved factors that will explain most if not all of the differences.

After a lot of work by epidemiologists, medical researchers, statisticians and the likes, we now realize that different places conduct different numbers of tests. No test, no positive. The U.S. has been slow to get testing ramped up.

Less understood is the effect of testing selection. Consider the U.S. where it is still hard to get tested. Only those who meet a list of criteria are eligible. Imagine an alternative reality in which the U.S. conducted the same number of tests but instead of selecting most likely infected people to be tested, we test a random sample of people. The incidence of the virus in a random sample is much lower than in the severely infected, therefore, in this new reality, the number of positives would be lower despite equal numbers of tests.

That's for equal number of tests. If test kits are readily available, then a targeted (triage) testing strategy will under-count cases since mild cases or asymptomatic infections escape attention. (See my Wired column for problems with triage.)

To complicate things even more, in most countries, the number of tests and the testing selection have changed over time so a cumulative count statistic obscures those differences.

Beside testing, there are a host of other factors that affect reported case counts. These are less talked about now but eventually will be.

Different places have different population densities. A lot of cases in a big city and an equal number of cases in a small town do not signify equal severity.  Clearly, the situation in the latter is more serious.

Because the virus affects age groups differently, a direct comparison of the case counts without adjusting for age is also misleading. The number of deaths of 80-year-olds in a college town is low not because the chance of dying from COVID-19 is lower there than in a retirement community; it's low because 80-year-olds are a small proportion of the population.

Next, the cumulative counts ignore which stage of the "epi curve" these countries are at. The following chart can replace most of the charts you're inundated with by the media:

Epicurve_coronavirus

(I found the chart here.)

An epi curve traces the time line of a disease outbreak. Every location is expected to move through stages, with cases reaching a peak and eventually the number of newly recovered will exceed the number of newly infected.

Notice that China, Italy and the US occupy different stages of this curve.  It's proper to compare U.S. to China and Italy when they were at a similar early phase of their respective epi curve.

In addition, any cross-location comparison should account for how reliable the data sources are, and the different definitions of a "case" in different locations.

***

Finally, let's consider the Question posed by the graphic designer. It is the morbid question: which country is hit the worst by coronavirus?

This is a Type DV chart. It's got a reasonable question, but the data require a lot more work to adjust for the list of biases. The visual design is hampered by the common mistake of not starting columns at zero.

 


More visuals of the economic crisis

As we move into the next phase of the dataviz bonanza arising from the coronavirus pandemic, we will see a shift from simple descriptive graphics of infections and deaths to bivariate explanatory graphics claiming (usually spurious) correlations.

The FT is leading the way with this effort, and I hope all those who follow will make a note of several wise decisions they made.

  • They source their data. Most of the data about business activities come from private entities, many of which are data vendors who make money selling the data. In this article, FT got restaurant data from OpenTable, retail foot traffic data from Springboard, box office data from Box Office Mojo, flight data from Flightradar24, road traffic data from TomTom, and energy use data from European Network of Transmission System Operators for Electricity.
  • They generally let the data and charts speak without "story time". The text primarily describes the trends of the various data series.
  • They selected sectors that are obviously impacted by the shutdowns so any link between the observed trends and the crisis is plausible.

The FT charts are examples of clarity. Here is the one about road traffic patterns in major cities:

Ft_roadusage_corona_wrongsource

The cities are organized into regions: Europe, US, China, other Asia.

The key comparison is the last seven days versus the historical averages. The stories practically jump out of the page. Traffic in Paris collapsed on Tuesday. Wuhan is still locked down despite the falloff in infections. Drivers of Tokyo are probably wondering why teams are not going to the Olympics this year. Londoners? My guess is they're determined to not let another Brexit deadline slip.

***

I'd hope we go even further than FT when publishing this type of visual analytics involving "Big Data." These business data obtained from private sources typically have OCCAM properties: they are observational, seemingly complete, uncontrolled, adapted and merged. All these properties make the data very challenging to interpret.

The coronavirus case and death counts are simple by comparison. People are now aware of all the problems from differential rates of testing to which groups are selectively tested (i.e. triage) to how an infection or death is defined. The problems involving Big Data are much more complex.

I have three additional proposals:

Disclosure of Biases and Limitations

The private data have many more potential pitfalls. Take OpenTable data for example. The data measure restaurant bookings, not revenues. It measures gross bookings, not net bookings (i.e. removing no-shows). Only a proportion of restaurants use OpenTable (which cost owners money). OpenTable does not strike me as a quasi-monopoly so there are competitors with significant market share. The restaurants that use OpenTable do not form a random subsample of all restaurants. One of the most popular restaurants in the U.S. are pizza joints, with little of no seating, which do not feature in the bookings data. OpenTable also has differential popularity by country, region, or probably cuisine. 

I believe data journalists ought to provide such context in a footnote. Readers should have the information to judge whether they believe the data are sufficiently representative. Private data vendors who want data journalists to feature their datasets should be required to supply a footnote that describes the biases and limitations of their data.

Data journalists should think seriously about how they headline this type of chart. The standard practice is what FT adopted. The headline said "Restaurant bookings have collapsed" with a small footnote saying "Source: OpenTable". Should the headline have said "OpenTable bookings have collapsed" instead?

Disclosure of Definitions and Proxies

In the road traffic chart shown above, the metric is called "TomTom traffic congestion index". In order to earn this free advertising (euphemistically called "earned media" by industry), TomTom should be obliged to explain how this index is constructed. What does index = 100 mean?

[For example, it is curious that the Madrid index values are much lower across the board than those in Paris and Roma.]

For the electric usage chart, FT discloses the name of the data provider as a group of "43 electricity transmission system operators in 36 countries across Europe." Now, that is important context but can be better. The group may consist of 43 operators but how many of them are in the dataset? What proportion of the total electric usage do they account for in each country? If they have low penetration in a particular country, do they just report the low statistics or adjust the numbres?

If the journalist decides to use a proxy, for example, OpenTable restaurant bookings to reflect restaurant revenues, that should be explained, perhaps even justified.

Data as a Public Good

If private businesses choose to supply data to media outlets as a public service, they should allow the underlying data to be published.

Speaking from experience, I've seen a lot of bad data. It's one thing to hold your nose when the data are analyzed to make online advertising more profitable, or to find signals to profit from the stock market. It's another thing for the data analysis to drive public policy, in this case, policies that will have life-or-death implications.


Bubble charts, ratios and proportionality

A recent article in the Wall Street Journal about a challenger to the dominant weedkiller, Roundup, contains a nice selection of graphics. (Dicamba is the up-and-comer.)

Wsj_roundup_img1


The change in usage of three brands of weedkillers is rendered as a small-multiples of choropleth maps. This graphic displays geographical and time changes simultaneously.

The staircase chart shows weeds have become resistant to Roundup over time. This is considered a weakness in the Roundup business.

***

In this post, my focus is on the chart at the bottom, which shows complaints about Dicamba by state in 2019. This is a bubble chart, with the bubbles sorted along the horizontal axis by the acreage of farmland by state.

Wsj_roundup_img2

Below left is a more standard version of such a chart, in which the bubbles are allowed to overlap. (I only included the bubbles that were labeled in the original chart).

Redo_roundupwsj0

The WSJ’s twist is to use the vertical spacing to avoid overlapping bubbles. The vertical axis serves a design perogative and does not encode data.  

I’m going to stick with the more traditional overlapping bubbles here – I’m getting to a different matter.

***

The question being addressed by this chart is: which states have the most serious Dicamba problem, as revealed by the frequency of complaints? The designer recognizes that the amount of farmland matters. One should expect the more acres, the more complaints.

Let's consider computing directly the number of complaints per million acres.

The resulting chart (shown below right) – while retaining the design – gives a wholly different feeling. Arkansas now owns the largest bubble even though it has the least acreage among the included states. The huge Illinois bubble is still large but is no longer a loner.

Redo_dicambacomplaints1

Now return to the original design for a moment (the chart on the left). In theory, this should work in the following manner: if complaints grow purely as a function of acreage, then the bubbles should grow proportionally from left to right. The trouble is that proportional areas are not as easily detected as proportional lengths.

The pair of charts below depict made-up data in which all states have 30 complaints for each million acres of farmland. It’s not intuitive that the bubbles on the left chart are growing proportionally.

Redo_dicambacomplaints2

Now if you look at the right chart, which shows the relative metric of complaints per million acres, it’s impossible not to notice that all bubbles are the same size.


Revisiting global car sales

We looked at the following chart in the previous blog. The data concern the growth rates of car sales in different regions of the world over time.

Cnbc zh global car sales

Here is a different visualization of the same data.

Redo_cnbc_globalcarsales

Well, it's not quite the same data. I divided the global average growth rate by four to yield an approximation of the true global average. (The reason for this is explained in the other day's post.)

The chart emphasizes how each region was helping or hurting the global growth. It also features the trend in growth within each region.

 


This Excel chart looks standard but gets everything wrong

The following CNBC chart (link) shows the trend of global car sales by region (or so we think).

Cnbc zh global car sales

This type of chart is quite common in finance/business circles, and has the fingerprint of Excel. After examining it, I nominate it for the Hall of Shame.

***

The chart has three major components vying for our attention: (1) the stacked columns, (2) the yellow line, and (3) the big red dashed arrow.

The easiest to interpret is the yellow line, which is labeled "Total" in the legend. It displays the annual growth rate of car sales around the globe. The data consist of annual percentage changes in car sales, so the slope of the yellow line represents a change of change, which is not particularly useful.

The big red arrow is making the point that the projected decline in global car sales in 2019 will return the world to the slowdown of 2008-9 after almost a decade of growth.

The stacked columns appear to provide a breakdown of the global growth rate by region. Looked at carefully, you'll soon learn that the visual form has hopelessly mangled the data.

Cnbc_globalcarsales_2006

What is the growth rate for Chinese car sales in 2006? Is it 2.5%, the top edge of China's part of the column? Between 1.5% and 2.5%, the extant of China's section? The answer is neither. Because of the stacking, China's growth rate is actually the height of the relevant section, that is to say, 1 percent. So the labels on the vertical axis are not directly useful to learning regional growth rates for most sections of the chart.

Can we read the vertical axis as global growth rate? That's not proper either. The different markets are not equal in size so growth rates cannot be aggregated by simple summing - they must be weighted by relative size.

The negative growth rates present another problem. Even if we agree to sum growth rates ignoring relative market sizes, we still can't get directly to the global growth rate. We would have to take the total of the positive rates and subtract the total of the negative rates.  

***

At this point, you may begin to question everything you thought you knew about this chart. Remember the yellow line, which we thought measures the global growth rate. Take a look at the 2006 column again.

The global growth rate is depicted as 2 percent. And yet every region experienced growth rates below 2 percent! No matter how you aggregate the regions, it's not possible for the world average to be larger than the value of each region.

For 2006, the regional growth rates are: China, 1%; Rest of the World, 1%; Western Europe, 0.1%; United States, -0.25%. A simple sum of those four rates yields 2%, which is shown on the yellow line.

But this number must be divided by four. If we give the four regions equal weight, each is worth a quarter of the total. So the overall average is the sum of each growth rate weighted by 1/4, which is 0.5%. [In reality, the weights of each region should be scaled to reflect its market size.]

***

tldr; The stacked column chart with a line overlay not only fails to communicate the contents of the car sales data but it also leads to misinterpretation.

I discussed several serious problems of this chart form: 

  • stacking the columns make it hard to learn the regional data

  • the trend by region takes a super effort to decipher

  • column stacking promotes reading meaning into the height of the column but the total height is meaningless (because of the negative section) while the net height (positive minus negative) also misleads due to presumptive equal weighting

  • the yellow line shows the sum of the regional data, which is four times the global growth rate that it purports to represent

 

***

PS. [12/4/2019: New post up with a different visualization.]


Graph literacy, in a sense

Ben Jones tweeted out this chart, which has an unusual feature:

Malefemaleliteracyrates

What's unusual is that time runs in both directions. Usually, the rule is that time runs left to right (except, of course, in right-to-left cultures). Here, the purple area chart follows that convention while the yellow area chart inverts it.

On the one hand, this is quite cute. Lines meeting in the middle. Converging. I get it.

On the other hand, every time a designer defies conventions, the reader has to recognize it, and to rationalize it.

In this particular graphic, I'm not convinced. There are four numbers only. The trend on either side looks linear so the story is simple. Why complicate it using unusual visual design?

Here is an entirely conventional bumps-like chart that tells the story:

Redo_literacyratebygender

I've done a couple of things here that might be considered controversial.

First, I completely straightened out the lines. I don't see what additional precision is bringing to the chart.

Second, despite having just four numbers, I added the year 1996 and vertical gridlines indicating decades. A Tufte purist will surely object.

***

Related blog post: "The Return on Effort in Data Graphics" (link)


Does this chart tell the sordid tale of TI's decline?

The Hustle has an interesting article on the demise of the TI calculator, which is popular in business circles. The article uses this bar chart:

Hustle_ti_calculator_chart

From a Trifecta Checkup perspective, this is a Type DV chart. (See this guide to the Trifecta Checkup.)

The chart addresses a nice question: is the TI graphing calculator a victim of new technologies?

The visual design is marred by the use of the calculator images. The images add nothing to our understanding and create potential for confusion. Here is a version without the images for comparison.

Redo_junkcharts_hustlet1calc

The gridlines are placed to reveal the steepness of the decline. The sales in 2019 will likely be half those of 2014.

What about the Data? This would have been straightforward if the revenues shown are sales of the TI calculator. But according to the subtitle, the data include a whole lot more than calculators - it's the "other revenues" category in the financial reports of Texas Instrument which markets the TI. 

It requires a leap of faith to believe this data. It is entirely possible that TI calculator sales increased while total "other revenues" decreased! The decline of TI calculator could be more drastic than shown here. We simply don't have enough data to say for sure.

 

P.S. [10/3/2019] Fixed TI.

 

 


The windy path to the Rugby World Cup

When I first saw the following chart, I wondered whether it is really that challenging for these eight teams to get into the Rugby World Cup, currently playing in Japan:

1920px-2019_Rugby_World_Cup_Qualifying_Process_Diagram.svg

Another visualization of the process conveys a similar message. Both of these are uploaded to Wikipedia.

Rugby_World_Cup_2019_Qualification_illustrated_v2

(This one hasn't been updated and still contains blank entries.)

***

What are some of the key messages one would want the dataviz to deliver?

  • For the eight countries that got in (not automatically), track their paths to the World Cup. How many competitions did they have to play?
  • For those countries that failed to qualify, track their paths to the point that they were stopped. How many competitions did they play?
  • What is the structure of the qualification rounds? (These are organized regionally, in addition to certain playoffs across regions.)
  • How many countries had a chance to win one of the eight spots?
  • Within each competition, how many teams participated? Did the winner immediately qualify, or face yet another hurdle? Did the losers immediately disqualify, or were they offered another chance?

Here's my take on this chart:

Rugby_path_to_world_cup_sm