Same data + same chart form = same story. Maybe.

We love charts that tell stories.

Some people believe that if they situate the data in the right chart form, the stories reveal themselves.

Some people believe for a given dataset, there exists a best chart form that brings out the story.

An implication of these beliefs is that the story is immutable, given the dataset and the chart form.

If you use the Trifecta Checkup, you already know I don't subscribe to those ideas. That's why the Trifecta has three legs, the third is the question - which is related to the message or the story.


I came across the following chart by Statista, illustrating the growth in Covid-19 cases from the start of the pandemic to this month. The underlying data are collected by WHO and cover the entire globe. The data are grouped by regions.


The story of this chart appears to be that the world moves in lock step, with each region behaving more or less the same.

If you visit the WHO site, they show a similar chart:


On this chart, the regions at the bottom of the graph (esp. Southeast Asia in purple) clearly do not follow the same time patterns as Americas (orange) or Europe (green).

What we're witnessing is: same data, same chart form, different stories.

This is a feature, not a bug, of the stacked area chart. The story is driven largely by the order in which the pieces are stacked. In the Statista chart, the largest pieces are placed at the bottom while for WHO, the order is exactly reversed.

(There are minor differences which do not affect my argument. The WHO chart omits the "Other" category which accounts for very little. Also, the Statista chart shows the smoothed data using 7-day averaging.)

In this example, the order chosen by WHO preserves the story while the order chosen by Statista wipes it out.


What might be the underlying question of someone who makes this graph? Perhaps it is to identify the relative prevalence of Covid-19 in different regions at different stages of the pandemic.

Emphasis on the word "relative". Instead of plotting absolute number of cases, I consider plotting relative number of cases, that is to say, the proportion of cases in each region at given times.

This leads to a stacked area percentage chart.


In this side-by-side view, you see that this form is not affected by flipping the order of the regions. Both charts say the same thing: that there were two waves in Europe and the Americas that dwarfed all other regions.



Reading an infographic about our climate crisis

Let's explore an infographic by SCMP, which draws attention to the alarming temperature recorded at Verkhoyansk in Russia on June 20, 2020. The original work was on the back page of the printed newspaper, referred to in this tweet.

This view of the globe brings out the two key pieces of evidence presented in the infographic: the rise in temperature in unexpected places, and the shrinkage of the Arctic ice.


A notable design decision is to omit the color scale. On inspection, the scale is present - it was sewn into the graphic.


I applaud this decision as it does not take the reader's eyes away from the graphic. Some information is lost as the scale isn't presented in full details but I doubt many readers need those details.

A key takeaway is that the temperature in Verkhoyansk, which is on the edge of the Arctic Circle, was the same as in New Delhi in India on that day. We can see how the red was encroaching upon the Arctic Circle.


Next, the rapid shrinkage of the Arctic ice is presented in two ways. First, a series of maps.

The annotations are pared to the minimum. The presentation is simple enough such that we can visually judge that the amount of ice cover has roughly halved from 1980 to 2009.

A numerical measure of the drop is provided on the side.

Then, a line chart reinforces this message.

The line chart emphasizes change over time while the series of maps reveals change over space.


This chart suggests that the year 2020 may break the record for the smallest ice cover since 1980. The maps of Australia and India provide context to interpret the size of the Arctic ice cover.

I'd suggest reversing the pink and black colors so as to refer back to the blue and pink lines in the globe above.


The final chart shows the average temperature worldwide and in the Arctic, relative to a reference period (1981-2000).


This one is tough. It looks like an area chart but it should be read as a line chart. The darker line is the anomaly of Arctic average temperature while the lighter line is the anomaly of the global average temperature. The two series are synced except for a brief period around 1940. Since 2000, the temperatures have been dramatically rising above that of the reference period.

If this is a stacked area chart, then we'd interpret the two data series as summable, with the sum of the data series signifying something interesting. For example, the market shares of different web browsers sum to the total size of the market.

But the chart above should not be read as a stacked area chart because the outside envelope isn't the sum of the two anomalies. The problem is revealed if we try to articulate what the color shades mean.


On the far right, it seems like the dark shade is paired with the lighter line and represents global positive anomalies while the lighter shade shows Arctic's anomalies in excess of global. This interpretation only works if the Arctic line always sits above the global line. This pattern is broken in the late 1990s.

Around 1999, the Arctic's anomaly is negative while the global anomaly is positive. Here, the global anomaly gets the lighter shade while the Arctic one is blue.

One possible fix is to encode the size of the anomaly into the color of the line. The further away from zero, the darker the red/blue color.



Aligning the visual and the data

The Washington Post reported a surge in donations to the Democrats after the death of Justice Ruth Ginsberg (link). A secondary effect, perhaps unexpected, was that donors decided to spread the money around; the proportion of donors who gave to six or more candidates jumped to 65%, where normally it is at 5%.


The text tells us what to look for, and the axis labels are commendably restrained. The color scheme is also intuitive.

There is something frustrating about this chart, though. It's that the spike is shown upside down. The level that the arrow points at is 45%, which is the total of the blue columns. The visual suggests the proportion of multiple beneficiaries (2 or more) should be 55%. There is a divergence between what the visual is saying and what the data are saying. Whichever number is correct, the required proportion is the inverse of the level shown on the percentage axis!


This is the same chart flipped over.


Now, the number we need can be read off the vertical axis.

I also moved the color legend to the right side so that the entries can be printed vertically, in the same direction as the data. This is one of the unspoken rules of data visualization I featured in my feature for


In the Trifecta Checkup (link), the issue is with the green arrow between the D corner and the V corner. The data and the visual are not in sync. 


I made a streamgraph

The folks at FiveThirtyEight were excited about the following dataviz they published last week two weeks ago, illustrating the progression of vote-counting by state. (link) That was indeed the unique and confusing feature of the 2020 Presidential election in the States. For those outside the U.S., what happened (by and large) was that many Americans, skewing Biden supporters, voted by mail before Election Day but their votes were sometimes counted after the same-day votes were tallied.



A number of us kept staring at these charts, hoping for a how-to-read-it explanation. Here is a zoom-in for the state of Michigan:


To save you the trouble, here is how.

The key is to fight your urge to look at the brown area. I know, it's pretty hard to ignore the biggest areas of every chart. But try to make them disappear.

Focus on the top edge of the chart. This line gives the total number of votes counted so far. In Michigan, by hour 12, about 2.4 million votes were counted, and by hour 72, 2.8 million votes were on the book. This line gives the sum of the two major parties' vote totals [since third parties got negligible votes in this election, I'm ignoring them so as to simplify the discussion].

Next, look at the red and blue areas. These represent the gap in the number of votes between the two parties' current vote totals. If the area is red, Trump was leading; if blue, Biden was leading. Each color flip represents a lead change. Suppress the urge to interpret red as the number or share of Trump votes.


What have we learned about the vote counting in Michigan?

Counting significantly slowed after the 12th hour. Trump raced to a lead on Election Day, and around hour 20, the race was dead even, and after that, Biden overtook Trump and never looked back. Throughout most of this period, the vote lead was small compared to the total votes cast although at the end, the Biden lead was noticeable.

If you insist on interpreting the brown area, it is equal to twice the vote total of the second-place candidate, so it really isn't something you want to look at.

Just for contrast, here is the chart for Iowa:


Trump led from beginning to end, with his lead widening slightly as more votes were counted.


As I was stewing over this chart, a ominous thought overcame me. Would a streamgraph work for this data? You don't hear much about streamgraphs here because I rarely favor them (see this long-ago post) but let's just try one and see.


(These streamgraphs were made in R using the streamgraph package. Post-processing was applied to customize the labeling.)

This chart conveys all the key points listed before. You can see how the gap evolved over time, the lead flips, which candidate was in the lead, and the total mass of votes counted at different times. The gap is shown in the middle.

I can't say I'm completely happy with the streamgraph - I hope readers don't care about the numbers because it's hard to evaluate a difference when it's split two ways on either side of the middle axis!


If you come up with a better idea, make sure to leave a comment.





Locating the political center

I mentioned the September special edition of Bloomberg Businessweek on the election in this prior post. Today, I'm featuring another data visualization from the magazine.



Here are the rightmost two charts.

Bloomberg_politicalcenter_rightside Time runs from top to bottom, spanning four decades.

Each chart covers a political issue. These two charts concern abortion and marijuana.

The marijuana question (far right) has only two answers, legalize or don't legalize. The underlying data measure the proportions of people agreeing to each point of view. Roughly three-quarters of the population disagreed with legalization in 1980 while two-thirds agree with it in 2020.

Notice that there are no horizontal axis labels. This is a great editorial decision. Only coarse trends are of interest here. It's not hard to figure out the relative proportions. Adding labels would just clutter up the display.

By contrast, the abortion question has three answer choices. The middle option is "Sometimes," which is represented by a white color, with a dot pattern. This is an issue on which public opinion in aggregate has barely shifted over time.

The charts are organized in a small-multiples format. It's likely that readers are consuming each chart individually.


What about the dashed line that splits each chart in half? Why is it there?

The vertical line assists our perception of the proportions. Think of it as a single gridline.

In fact, this line is underplayed. The headline of the article is "tracking the political center." Where is the center?

Until now, we've paid attention to the boundaries between the differently colored areas. But those boundaries do not locate the political center!

The vertical dashed line is the political center; it represents the view of the median American. In 1980, the line sat inside the gray section, meaning the median American opposed legalizing marijuana. But the prevalent view was losing support over time and by 2010, there wer more Americans wanting to legalize marijuana than not. This is when the vertical line crossed into the green zone.

The following charts draw attention to the middle line, instead of the color boundaries:

Junkcharts_redo_bloombergpoliticalcenterrightsideOn these charts, as you glance down the middle line, you can see that for abortion, the political center has never exited the middle category while for marijuana, the median American didn't want to legalize it until an inflection point was reached around 2010.

I highlight these inflection points with yellow dots.


The effect on readers is entirely changed. The original charts draw attention to the areas first while the new charts pull your eyes to the vertical line.


Putting vaccine trials in boxes

Bloomberg Businessweek has a special edition about vaccines, and I found this chart on the print edition:


The chart's got a lot of white space. Its structure is a series of simple "treemaps," one for each type of vaccine. Though simple, such a chart burns a few brain cells.

Here, I've extracted the largest block, which corresponds to vaccines that work with the virus's RNA/DNA. I applied a self-sufficiency test, removing the data from the boxes. 


What proportion of these projects have moved from pre-clinical to Phase 1?  To answer this question, we have to understand the relative areas of boxes, since that's how the data are encoded. How many yellow boxes can fit into the gray box?

It's not intuitive. We'd need a ruler to do this task properly.

Then, we learn that the gray box is exactly 8 times the size of the yellow box (72 projects are pre-clinical while 9 are in Phase I). We can cram eight yellows into the gray box. Imagine doing that, and it's pretty clear the visual elements fail to convey the meaning of the data.

Self-sufficiency is the idea that a data graphic should not rely on printed data to convey its meaning; the visual elements of a data graphic should bear much of the burden. Otherwise, use a data table. To test for self-sufficiency, cover up the printed data and see if the chart still works.


A key decision for the designer is the relative importance of (a) the number of projects reaching Phase III, versus (b) the number of projects utilizing specific vaccine strategies.

This next chart emphasizes the clinical phases:



Contrast this with the version shown in the online edition of Bloomberg (link), which emphasizes the vaccine strategies.


If any reader can figure out the logic of the ordering of the vaccine strategies, please leave a comment below.

Deaths as percent neither of cases nor of population. Deaths as percent of normal.

Yesterday, I posted a note about excess deaths on the book blog (link). The post was inspired by a nice data visualization by the New York Times (link). This is a great example of data journalism.


Excess deaths is a superior metric for measuring the effect of Covid-19 on public health. It's better than deaths as percent of cases. Also better than percent of the population.What excess deaths measure is deaths as a percent of normal. Normal is usually defined as the average deaths in the respective week in years past.

The red areas indicate how far the deaths in the Southern states are above normal. The highest peak, registered in Texas in late July, is 60 percent above the normal level.


The best way to appreciate the effort that went into this graphic is to imagine receiving the outputs from the model that computes excess deaths. A three-column spreadsheet with columns "state", "week number" and "estimated excess deaths".

The first issue is unequal population sizes. More populous states of course have higher death tolls. Transforming death tolls to an index pegged to the normal level solves this problem. To produce this index, we divide actual deaths by the normal level of deaths. So the spreadsheet must be augmented by two additional columns, showing the historical average deaths and actual deaths for each state for each week. Then, the excess death index can be computed.

The journalist builds a story around the migration of the coronavirus between different regions as it rages across different states  during different weeks. To this end, the designer first divides the dataset into four regions (South, West, Midwest and Northeast). Within each region, the states must be ordered. For each state, the week of peak excess deaths is identified, and the peak index is used to sort the states.

The graphic utilizes a small-multiples framework. Time occupies the horizontal axis, by convention. The vertical axis is compressed so that the states are not too distant. For the same reason, the component graphs are allowed to overlap vertically. The benefit of the tight arrangement is clearer for the Northeast as those peaks are particularly tall. The space-saving appearance reminds me of sparklines, championed by Ed Tufte.

There is one small tricky problem. In most of June, Texas suffered at least 50 percent more deaths than normal. The severity of this excess death toll is shortchanged by the low vertical height of each component graph. What forced such congestion is probably the data from the Northeast. For example, New York City:



New York City's death toll was almost 8 times the normal level at the start of the epidemic in the U.S. If the same vertical scale is maintained across the four regions, then the Northeastern states dwarf all else.


One key takeaway from the graphic for the Southern states is the persistence of the red areas. In each state, for almost every week of the entire pandemic period, actual deaths have exceeded the normal level. This is strong indication that the coronavirus is not under control.

In fact, I'd like to see a second set of plots showing the cumulative excess deaths since March. The weekly graphic is better for identifying the ebb and flow while the cumulative graphic takes measure of the total impact of Covid-19.


The above description leaves out a huge chunk of work related to computing excess deaths. I assumed the designer receives these estimates from a data scientist. See the related post in which I explain how excess deaths are estimated from statistical models.


More visuals of the economic crisis

As we move into the next phase of the dataviz bonanza arising from the coronavirus pandemic, we will see a shift from simple descriptive graphics of infections and deaths to bivariate explanatory graphics claiming (usually spurious) correlations.

The FT is leading the way with this effort, and I hope all those who follow will make a note of several wise decisions they made.

  • They source their data. Most of the data about business activities come from private entities, many of which are data vendors who make money selling the data. In this article, FT got restaurant data from OpenTable, retail foot traffic data from Springboard, box office data from Box Office Mojo, flight data from Flightradar24, road traffic data from TomTom, and energy use data from European Network of Transmission System Operators for Electricity.
  • They generally let the data and charts speak without "story time". The text primarily describes the trends of the various data series.
  • They selected sectors that are obviously impacted by the shutdowns so any link between the observed trends and the crisis is plausible.

The FT charts are examples of clarity. Here is the one about road traffic patterns in major cities:


The cities are organized into regions: Europe, US, China, other Asia.

The key comparison is the last seven days versus the historical averages. The stories practically jump out of the page. Traffic in Paris collapsed on Tuesday. Wuhan is still locked down despite the falloff in infections. Drivers of Tokyo are probably wondering why teams are not going to the Olympics this year. Londoners? My guess is they're determined to not let another Brexit deadline slip.


I'd hope we go even further than FT when publishing this type of visual analytics involving "Big Data." These business data obtained from private sources typically have OCCAM properties: they are observational, seemingly complete, uncontrolled, adapted and merged. All these properties make the data very challenging to interpret.

The coronavirus case and death counts are simple by comparison. People are now aware of all the problems from differential rates of testing to which groups are selectively tested (i.e. triage) to how an infection or death is defined. The problems involving Big Data are much more complex.

I have three additional proposals:

Disclosure of Biases and Limitations

The private data have many more potential pitfalls. Take OpenTable data for example. The data measure restaurant bookings, not revenues. It measures gross bookings, not net bookings (i.e. removing no-shows). Only a proportion of restaurants use OpenTable (which cost owners money). OpenTable does not strike me as a quasi-monopoly so there are competitors with significant market share. The restaurants that use OpenTable do not form a random subsample of all restaurants. One of the most popular restaurants in the U.S. are pizza joints, with little of no seating, which do not feature in the bookings data. OpenTable also has differential popularity by country, region, or probably cuisine. 

I believe data journalists ought to provide such context in a footnote. Readers should have the information to judge whether they believe the data are sufficiently representative. Private data vendors who want data journalists to feature their datasets should be required to supply a footnote that describes the biases and limitations of their data.

Data journalists should think seriously about how they headline this type of chart. The standard practice is what FT adopted. The headline said "Restaurant bookings have collapsed" with a small footnote saying "Source: OpenTable". Should the headline have said "OpenTable bookings have collapsed" instead?

Disclosure of Definitions and Proxies

In the road traffic chart shown above, the metric is called "TomTom traffic congestion index". In order to earn this free advertising (euphemistically called "earned media" by industry), TomTom should be obliged to explain how this index is constructed. What does index = 100 mean?

[For example, it is curious that the Madrid index values are much lower across the board than those in Paris and Roma.]

For the electric usage chart, FT discloses the name of the data provider as a group of "43 electricity transmission system operators in 36 countries across Europe." Now, that is important context but can be better. The group may consist of 43 operators but how many of them are in the dataset? What proportion of the total electric usage do they account for in each country? If they have low penetration in a particular country, do they just report the low statistics or adjust the numbres?

If the journalist decides to use a proxy, for example, OpenTable restaurant bookings to reflect restaurant revenues, that should be explained, perhaps even justified.

Data as a Public Good

If private businesses choose to supply data to media outlets as a public service, they should allow the underlying data to be published.

Speaking from experience, I've seen a lot of bad data. It's one thing to hold your nose when the data are analyzed to make online advertising more profitable, or to find signals to profit from the stock market. It's another thing for the data analysis to drive public policy, in this case, policies that will have life-or-death implications.

Graph literacy, in a sense

Ben Jones tweeted out this chart, which has an unusual feature:


What's unusual is that time runs in both directions. Usually, the rule is that time runs left to right (except, of course, in right-to-left cultures). Here, the purple area chart follows that convention while the yellow area chart inverts it.

On the one hand, this is quite cute. Lines meeting in the middle. Converging. I get it.

On the other hand, every time a designer defies conventions, the reader has to recognize it, and to rationalize it.

In this particular graphic, I'm not convinced. There are four numbers only. The trend on either side looks linear so the story is simple. Why complicate it using unusual visual design?

Here is an entirely conventional bumps-like chart that tells the story:


I've done a couple of things here that might be considered controversial.

First, I completely straightened out the lines. I don't see what additional precision is bringing to the chart.

Second, despite having just four numbers, I added the year 1996 and vertical gridlines indicating decades. A Tufte purist will surely object.


Related blog post: "The Return on Effort in Data Graphics" (link)

Elegant way to present a pair of charts

The Bloomberg team has come up with a few goodies lately. I was captivated by the following graphic about the ebb and flow of U.S. presidential candidates across recent campaigns. Link to the full presentation here.

The highlight is at the bottom of the page. This is an excerpt of the chart:


From top to bottom are the sequential presidential races. The far right vertical axis is the finish line. Going right to left is the time before the finish line. In 2008, for example, there are candidates who entered the race much earlier than typical.

This chart presents an aggregate view of the data. We get a sense of when most of the candidates enter the race, when most of them are knocked out, and also a glimpse of outliers. The general pattern across multiple elections is also clear. The design is a stacked area chart with the baseline in the middle, rather than the bottom, of the chart.

Sure, the chart can disappoint those readers who want details and precise numbers. It's not immediately apparent how many candidates were in the race at the height of 2008, nor who the candidates were.

The designer added a nice touch. By clicking on any of the stacks, it transforms into a bar chart, showing the extent of each candidate's participation in the race.


I wish this was a way to collapse the bar chart back to the stack. You can reload the page to start afresh.


This elegant design touch makes the user experience playful. It's also an elegant way to present what is essentially a panel of plots. Imagine the more traditional presentation of placing the stack and the bar chart side by side.

This design does not escape the trade-off between entertainment value and data integrity. Looking at the 2004 campaign, one should expect to see the blue stack halve in size around day 100 when Kerry became the last man standing. That moment is not marked in the stack. The stack can be interpreted as a smoothed version of the count of active candidates.


I suppose some may complain the stack misrepresents the data somewhat. I find it an attractive way of presenting the big-picture message to an audience that mostly spend less than a minute looking at the graphic.