Make your color legend better with one simple rule

The pie chart about COVID-19 worries illustrates why we should follow a basic rule of constructing color legends: order the categories in the way you expect readers to encounter them.

Here is the chart that I discussed the other day, with the data removed since they are not of concern in this post. (link)

Junkcharts_abccovidbiggestworries_sufficiency

First, look at the pie chart. Like me, you probably looked at the orange or the yellow slice first, then moved clockwise around the pie.

Notice that the legend leads with the red square ("Getting It"), which is likely the last item you'll see on the chart.

This is the same chart with the legend re-ordered:

Redo_junkcharts_abcbiggestcovidworries_legend

***

Simple charts can be made better if we follow basic rules of construction. With frequent use, these rules become second nature, no longer needing to be spoken. I cover rules for legends as well as many other rules in this Long Read article titled "The Unspoken Conventions of Data Visualization" (link).


Bad data leave chart hanging by the thread

IGNITE National put out a press release saying that Gen Z white men are different from all other race-gender groups because they are more likely to be or lean Republican. The evidence is in this chart:

Genz_survey

Or is it?

Following our Trifecta Checkup framework (link), let's first look at the data. White men are the bottom left group. Democratic = 42%, Independent = 28%, Republican = 48%. That's a total of 118%. Unfortunately, this chart construction error erases the message. We don't know which of the three columns was incorrectly sized, or whether the data were incorrectly weighted so that the error is spread across the three columns.
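Errors of this class are easy to catch programmatically: check that each group's percentages sum to roughly 100. Here is a minimal sketch using the figures quoted above (the function and tolerance are mine, not anything from the press release):

```python
def check_totals(groups, tol=1.0):
    """Flag any group whose percentages do not sum to (about) 100."""
    errors = {}
    for name, parts in groups.items():
        total = sum(parts.values())
        if abs(total - 100) > tol:
            errors[name] = total
    return errors

# Figures quoted from the erroneous chart for the white men group
white_men = {"Democratic": 42, "Independent": 28, "Republican": 48}
print(check_totals({"White men": white_men}))  # {'White men': 118}
```

A check like this, run before publication, would have caught the 118% total immediately.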

But the story of the graphic is hanging by the thread - the gap between the Democratic and Republican leans among white men is 6 percentage points, which is smaller than the data error of 18 percentage points. I sent them a tweet asking for a correction. Will post the corrected version if they respond.

Update: The thread didn't break. They replied quickly and issued the following corrected chart:

Genz_corrected

Now, the data for white men are: Democratic = 35%, Independent = 22%, Republican = 40%. That's roughly a 7-point shift for each party affiliation, so they may have just started the baseline at the wrong level when inverting the columns.

***

The visual design also has some problems. I am not a fan of inverting columns. In fact, column inversion may be the root of the error above.

Genz_whitemen

Let me zoom in on the white men columns (see right).

Without looking at the legend, can you guess which color is Democratic, Independent or Republican? Go ahead and take your best guess.

For me, red is Republican (by convention), white is Independent (a neutral color), which means yellow is Democratic.

Here is the legend:

Genz-legend

So I got the yellow and white reversed. And that is another problem with the visual design. For a chart that shows two-party politics in the U.S., there is really no good reason to deviate from the red-blue convention. The color for Independents doesn't matter since it would be understood that the third color would represent them.

If the red-blue convention were followed, readers do not need to consult the legend.

***

In my Long Read article at DataJournalism.com, I included an "unspoken rule" about color selection: use the natural color mapping whenever possible. Go here to read about this and other rules.

The chart breaks another one of the unspoken conventions. When making a legend, place it near the top of the chart. Readers need to know the color mapping before they can understand the chart.

In addition, you want the reader's eyes to read the legend in the same way they read the columns. The columns go left to right from Democratic to Independent to Republican. The legend should do the same!

***

Here is a quick re-do that fixes the visual issues (except the data error). It's an Excel chart but it doesn't have to be bad.

Redo_genzsurvey


Gazing at petals

Reader Murphy pointed me to the following infographic developed by Altmetric to explain their analytics of citations of journal papers. These metrics are alternative in that they arise from non-academic media sources, such as news outlets, blogs, twitter, and reddit.

The key graphic is the petal diagram with a number in the middle.

Altmetric_tetanus

I have a hard time thinking of this object as “data visualization”. Data visualization should visualize the data. Here, the connection between the data and the visual design is tenuous.

There are eight petals arranged around the circle. The legend below the diagram maps the color of each petal to a source of data. Red, for example, represents mentions in news outlets, and green represents mentions in videos.

Each petal is the same size, even though the counts given below differ. So, the petals are like a duplicative legend.

The order of the colors around the circle does not match their order in the table below, for reasons unknown.

Then comes another puzzle. The bluish-gray petal appears three times in the diagram. This color is mapped to tweets. Does the number of petals represent the much higher counts of tweets compared to other mentions?

To confirm, I pulled up the graphic for a different paper.

Altmetric_worldwidedeclineofentomofauna

Here, each petal has a different color. Eight petals, eight colors. The count of tweets is still much larger than the frequencies of the other sources. So, the rule of construction appears to be one petal for each relevant data source, and if the total number of data sources falls below eight, Twitter claims all the unclaimed petals.
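The inferred rule can be written down to make it concrete. This is my guess at the construction logic, not Altmetric's actual algorithm:

```python
def assign_petals(counts, n_petals=8, filler="twitter"):
    """One petal per source with a nonzero count; Twitter claims
    any petals left unclaimed (inferred rule, not Altmetric's code)."""
    active = [source for source, n in counts.items() if n > 0]
    return active + [filler] * (n_petals - len(active))

# Hypothetical paper mentioned in only three source types
petals = assign_petals({"news": 5, "blogs": 2, "twitter": 120})
print(petals.count("twitter"))  # 6 of the 8 petals go to Twitter
```

Note that nothing in this rule depends on the counts themselves beyond being zero or nonzero, which is exactly why the petals carry so little data.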

A third sample paper confirms this rule:

Altmetric_dnananodevices

None of the places we were hoping to find data – size of petals, color of petals, number of petals – actually contain any data. Anything the reader wants to learn must be read directly from the printed numbers. The “score” that reflects the aggregate “importance” of the corresponding paper is found at the center of the circle. The legend provides the raw data.

***

Some years ago, one of my NYU students worked on a project relating to paper citations. He eventually presented the work at a conference. I featured it previously.

Michaelbales_citationimpact

Notice how the visual design provides context for interpretation – by placing each paper/researcher among its peers, and by using a relative scale (percentiles).

***

I’m ignoring the D corner of the Trifecta Checkup in this post. For any visualization to be meaningful, the data must be meaningful. The type of counting used by Altmetric treats every tweet, every mention, etc. as a tally, making everything worth the same. A mention on CNN counts as much as a mention by a pseudonymous redditor. A pan is the same as a rave. Let’s not forget the fake data menace (link), which affects all performance metrics.


Taking small steps to bring out the message

Happy new year! Good luck and best wishes!

***

We'll start 2020 with something lighter. On a recent flight, I saw a chart in The Economist that shows the proportion of operating income derived from overseas markets by major grocery chains - the headline said that some of these chains are withdrawing from international markets.

Econ_internationalgroceries_sm

The designer used one color for each grocery chain, and two shades within each color. The legend describes the shades as "total" and "of which: overseas". As with all stacked bar charts, it's a bit confusing where to find the data. The "total" is actually the entire bar, not just the darker shaded part. The darker shaded part is better labeled "home market" as shown below:

Redo_econgroceriesintl_1

The designer's instinct to bring out the importance of international markets to each company's income is well placed. A second small edit helps: plot the international income amounts first, so they line up with the vertical zero axis. Like this:

Redo_econgroceriesintl_2

This is essentially the same chart. The order of international and home market is reversed. I also reversed the shading, so that the international share of income is displayed darker. This shading draws the readers' attention to the key message of the chart.

A stacked bar chart of the absolute dollar amounts is not ideal for showing proportions, because each bar is a different length. Sometimes, plotting relative values summing to 100% for each company may work better.
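If one goes the relative route, the conversion is simple arithmetic. A sketch with made-up income figures (not the Economist's data):

```python
def to_shares(home, overseas):
    """Convert absolute income amounts into percentages summing to 100."""
    total = home + overseas
    return round(100 * home / total, 1), round(100 * overseas / total, 1)

# Hypothetical company: 300 at home, 100 overseas (illustration only)
print(to_shares(home=300, overseas=100))  # (75.0, 25.0)
```

With every bar normalized to 100%, the proportions become directly comparable across companies, at the cost of hiding the absolute sizes.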

As it stands, the chart above calls attention to a different message: that Walmart dwarfs the other three global chains. Just the international income of Walmart is larger than the total income of Costco.

***

Please comment below or write me directly if you have ideas for this blog as we enter a new decade. What do you want to see more of? Less of?


This Excel chart looks standard but gets everything wrong

The following CNBC chart (link) shows the trend of global car sales by region (or so we think).

Cnbc zh global car sales

This type of chart is quite common in finance/business circles, and has the fingerprint of Excel. After examining it, I nominate it for the Hall of Shame.

***

The chart has three major components vying for our attention: (1) the stacked columns, (2) the yellow line, and (3) the big red dashed arrow.

The easiest to interpret is the yellow line, which is labeled "Total" in the legend. It displays the annual growth rate of car sales around the globe. The data consist of annual percentage changes in car sales, so the slope of the yellow line represents a change of change, which is not particularly useful.

The big red arrow is making the point that the projected decline in global car sales in 2019 will return the world to the slowdown of 2008-9 after almost a decade of growth.

The stacked columns appear to provide a breakdown of the global growth rate by region. Look carefully, though, and you'll soon find that the visual form has hopelessly mangled the data.

Cnbc_globalcarsales_2006

What is the growth rate for Chinese car sales in 2006? Is it 2.5%, the top edge of China's part of the column? Is it between 1.5% and 2.5%, the extent of China's section? The answer is neither. Because of the stacking, China's growth rate is actually the height of the relevant section, that is to say, 1 percent. So the labels on the vertical axis are not directly useful for reading regional growth rates for most sections of the chart.

Can we read the vertical axis as global growth rate? That's not proper either. The different markets are not equal in size so growth rates cannot be aggregated by simple summing - they must be weighted by relative size.

The negative growth rates present another problem. Even if we agree to sum growth rates ignoring relative market sizes, we still can't get directly to the global growth rate. We would have to take the total of the positive rates and subtract the total of the negative rates.  

***

At this point, you may begin to question everything you thought you knew about this chart. Remember the yellow line, which we thought measures the global growth rate. Take a look at the 2006 column again.

The global growth rate is depicted as 2 percent. And yet every region experienced growth rates below 2 percent! No matter how you aggregate the regions, it's not possible for the world average to be larger than the value of each region.

For 2006, the regional growth rates are: China, 1%; Rest of the World, 1%; Western Europe, 0.1%; United States, -0.25%. A simple sum of those four rates yields 2%, which is shown on the yellow line.

But this number must be divided by four. If we give the four regions equal weight, each is worth a quarter of the total. So the overall average is the sum of each growth rate weighted by 1/4, which is 0.5%. [In reality, the weights of each region should be scaled to reflect its market size.]
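To double-check the arithmetic, here are the simple sum and the equal-weight average computed from the four rates quoted above (the equal weights are only a stand-in; true weights should reflect market size):

```python
# Regional growth rates for 2006, as read off the chart
rates = {"China": 1.0, "Rest of the World": 1.0,
         "Western Europe": 0.1, "United States": -0.25}

simple_sum = sum(rates.values())            # what the yellow line plots
equal_weight_avg = simple_sum / len(rates)  # a crude global growth rate

print(round(simple_sum, 2))        # 1.85, which the chart shows as ~2%
print(round(equal_weight_avg, 2))  # 0.46, i.e. roughly 0.5%
```

The gap between 1.85 and 0.46 is precisely the factor-of-four distortion introduced by summing rather than averaging.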

***

tl;dr: The stacked column chart with a line overlay not only fails to communicate the contents of the car sales data but also leads to misinterpretation.

I discussed several serious problems of this chart form: 

  • stacking the columns makes it hard to learn the regional data

  • the trend by region takes a super effort to decipher

  • column stacking invites readers to read meaning into the height of the column, but the total height is meaningless (because of the negative sections), while the net height (positive minus negative) also misleads due to the presumption of equal weighting

  • the yellow line shows the sum of the regional data, which is four times the global growth rate that it purports to represent


***

PS. [12/4/2019: New post up with a different visualization.]


The time of bird seeds and chart tuneups

The recent post about multi-national companies reminded me of an older post, in which I stepped through data table enhancements.

Here is a video of the process. You can use any tool to implement the steps; even Excel is good enough.

The video is part of a series called "Data science: the Missing Pieces". In these episodes, I cover the parts of data science that fall between the cracks, the little things that textbooks and courses do not typically cover - the things that often block students from learning efficiently.

If you have encountered such things, please comment below to suggest future topics. What is something about visualizing data you wish you learned formally?

***

P.S. Placed here to please the twitter-bot

DSTMP2_goodchart_thumb


Pulling the multi-national story out, step by step

Reader Aleksander B. found this Economist chart difficult to understand.

Redo_multinat_1

Given the chart title, the reader is looking for a story about multinationals producing lower return on equity than local firms. The first item displayed indicates that multinationals out-performed local firms in the technology sector.

The pie charts in the right column provide additional information about the share of each sector by the type of firms. Is there a correlation between the share of multinationals and their performance differential relative to local firms?

***

We can clean up the presentation. The first changes include using dots in place of pipes, removing the vertical gridlines, and pushing the zero line to the background:

Redo_multinat_2

The horizontal gridlines attached to the zero line can also be removed:

Redo_multinat_3

Now, we re-order the rows. Start with the aggregate "All sectors". Then, order sectors from the largest under-performance by multinationals to the smallest.

Redo_multinat_4

The pie charts focus only on the share of multinationals. Taking away the remainders speeds up our perception:

Redo_multinat_5

Help the reader understand the data by dividing the sectors into groups, organized by the performance differential:

Redo_multinat_6

For what it's worth, re-sort the sectors from largest to smallest share of multinationals:

Redo_multinat_7

Having created groups of sectors by share of multinationals, I simplify further by showing the average pie chart within each group:

Redo_multinat_8

***

To recap all the edits, here is an animated gif: (if it doesn't play automatically, click on it)

Redo_junkcharts_econmultinat

***

Judging from the last graphic, I am not sure there is much correlation between share of multinationals and the performance differentials. It's interesting that in aggregate, local firms and multinationals performed the same. The average hides the variability by sector: in some sectors, local firms out-performed multinationals, as the original chart title asserted.


As Dorian confounds meteorologists, we keep our minds clear on hurricane graphics, and discover correlation as our friend

As Hurricane Dorian threatens the southeastern coast of the U.S., forecasters are fretting about the lack of consensus among the various models used to predict the storm’s trajectory. The uncertainty of these models, as reflected in graphical displays, has been a controversial issue in the visualization community for some time.

Let’s start by reviewing a visual design that has captured meteorologists in recent years, something known as the cone map.

Charley_oldconemap

If asked to explain this map, most of us would read the line through the middle of the cone as the center of the storm’s path, the “cone” as the areas near the storm center that are affected, and the warmer colors (red, orange) as indicating higher levels of impact. [Note: This was the design of this type of map circa the 2000s.]

The above interpretation is coherent and feasible. Nevertheless, the data used to make the map are forward-looking, not historical. It is still possible to stick with the same interpretation by substituting the historical measurement of impact with its projection. As such, the “warmer” regions are projected to suffer worse damage from the storm than the “cooler” regions (yellow).

After I replace the text that was removed from the map (see below), you may notice the color legend, which discloses that the colors on the map encode probabilities, not storm intensity. The text further explains that the chart shows the most probable path of the center of the storm – while the coloring shows the probability that the storm center will reach specific areas.

Charley_oldconemap

***

When reading a data graphic, we rarely first look for text about how to read the chart. In the case of the cone map, those who didn’t seek out the instructions may form one of these misunderstandings:

  1. For someone living in the yellow-shaded areas, the map does not say that the impact of the storm is projected to be lighter; it’s that the center of the storm has a lower chance of passing right through. If, however, the storm does pay a visit, the intensity of the winds will reach hurricane grade.
  2. For someone living outside the cone, the map does not say that the storm will definitely bypass you; it’s that the chance of a direct hit is below the threshold needed to show up on the cone map. The threshold is set to attain 66% accuracy. The actual paths of storms are expected to stay inside the cone two out of three times.

Adding to the confusion, other designers have produced cone maps in which color is encoding projections of wind speeds. Here is the one for Dorian.

AL052019_wind_probs_64_F120

This map displays essentially what we thought the first cone map was showing.

One way to differentiate the two maps is to roll time forward, and imagine what the maps should look like after the storm has passed through. In the wind-speed map (shown below right), we will see a cone of damage, with warmer colors indicating regions that experienced stronger winds.

Projectedactualwinds_irma

In the storm-center map (below right), we should see a single curve, showing the exact trajectory of the center of the storm. In other words, the cone of uncertainty dissipates over time, just like the storm itself.

Projectedactualstormcenter_irma


After scientists learned that readers were misinterpreting the cone maps, they started to issue warnings, and also re-designed the cone map. The cone map now comes with a black-box health warning right up top. Also, in the storm-center cone map, color is no longer used. The National Hurricane Center even made a YouTube video pointing out the dos and don’ts of using the cone map.

AL052019_5day_cone_with_line_and_wind

***

The conclusion drawn from misreading the cone map isn’t as devastating as it’s made out to be. This is because the two issues are correlated. Since wind speeds are likely to be stronger nearer to the center of the storm, if one lives in a region that has a low chance of being a direct hit, then that region is also likely to experience lower average wind speeds than those nearer to the projected center of the storm’s path.

Alberto Cairo has written often about these maps, and in his upcoming book, How Charts Lie, there is a nice section addressing his work with colleagues at the University of Miami on improving public understanding of these hurricane graphics. I highly recommend Cairo’s book here.

P.S. [9/5/2019] Alberto also put out a post about the hurricane cone map.


Too much of a good thing

Several of us discussed this data visualization over twitter last week. The dataviz by Aero Data Lab is called “A Bird’s Eye View of Pharmaceutical Research and Development”. There is a separate discussion on STAT News.

Here is the top section of the chart:

Aerodatalab_research_top

We faced a number of hurdles in understanding this chart as there is so much going on. The size of the shapes is perhaps the first thing readers notice, followed by where the shapes are located along the horizontal (time) axis. After that, readers may see the color of the shapes, and finally, the different shapes (circles, triangles,...).

It would help to have a legend explaining the sizes, shapes and colors. These were explained within the text. The size encodes the number of test subjects in the clinical trials. The color encodes pharmaceutical companies, of which the graphic focuses on 10 major ones. Circles represent completed trials, crosses inside circles represent terminated trials, triangles represent trials that are still active and recruiting, and squares for other statuses.

The vertical axis presents another challenge. It shows the disease conditions being investigated. As a lay-person, I cannot comprehend the logic of the order. With over 800 conditions, it became impossible to find a particular condition. The search function on my browser skipped over the entire graphic. I believe the order is based on some established taxonomy.

***

In creating the alternative shown below, I stayed close to the original intent of the dataviz, retaining all the dimensions of the dataset. Instead of the fancy dot plot, I used an enhanced data table. The encoding methods reflect what I’d like my readers to notice first. The color shading reflects the size of each clinical trial. The pharmaceutical companies are represented by their first initials. The status of the trial is shown by a dot, a cross or a square.

Here is a sketch of this concept showing just the top 10 rows.

Redo_aero_pharmard

Certain conditions attracted much more investment. Certain pharmas are placing bets on cures for certain conditions. For example, Novartis is heavily into research on "Meningitis, meningococcal" while GSK has spent quite a bit on researching "bacterial infections."


Wayward legend takes sides in a chart of two sides, plus data woes

Reader Chris P. submitted the following graph, found on Axios:

Axios_newstopics

From a Trifecta Checkup perspective, the chart has a clear question: are consumers getting what they wanted to read in the news they are reading?

Nevertheless, the chart is a visual mess, and the underlying data analytics fail to convince. So, it’s a Type DV chart. (See this overview of the Trifecta Checkup for the taxonomy.)

***

The designer did something tricky with the axis but the trick went off the rails. The underlying data consist of two sets of ranks, one for news people consumed and the other for news people wanted covered. With 14 topics included in the study, the two data series contain the same values, 1 to 14. The trick is to collapse both axes onto one. The trouble is that the same value occurs twice, and the reader must differentiate the plot symbols (triangle or circle) to figure out which is which.

It does not help that the lines look like arrows suggesting movement. Without first reading the text, readers may assume that topics changed in rank between two periods of time. Some topics moved right, increasing in importance, while others shifted left.

The design wisely separated the 14 topics into three logical groups. The blue group comprises news topics for which “want covered” ranking exceeds the “read” ranking. The orange group has the opposite disposition such that the data for “read” sit to the right side of the data for “want covered”. Unfortunately, the legend up top does more harm than good: it literally takes sides!

***

Here, I've put the data onto a scatter plot:

Redo_junkcharts_aiosnewstopics_1

The two sets of ranks are basically uncorrelated, as the regression line is almost flat, with “R-squared” of 0.02.

The analyst tried to "rescue" the data in the following way. Draw the 45-degree line, and color the points above the diagonal blue, and those below the diagonal orange. Color the points on the line gray. Then, write stories about those three subgroups.

Redo_junkcharts_aiosnewstopics_2
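The grouping rule is purely mechanical: compare each topic's two ranks. Here is a sketch with hypothetical rank pairs (the article's actual rankings are not reproduced here):

```python
def classify(read_rank, want_rank):
    """Place a topic relative to the 45-degree diagonal of the scatter."""
    if want_rank > read_rank:
        return "want > read"
    if want_rank < read_rank:
        return "read > want"
    return "on the diagonal"

# Hypothetical (read rank, want rank) pairs, for illustration only
topics = {"Politics": (1, 5), "Health": (6, 2), "Sports": (4, 4)}
print({t: classify(r, w) for t, (r, w) in topics.items()})
```

Because the rule only looks at the sign of the rank difference, it produces a story for any dataset, even one with zero correlation between the two rankings.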

Further, the ranking of what was read came from Parse.ly, which appears to be surveillance data (“traffic analytics”), while the ranking of what people want covered came from an Axios/SurveyMonkey poll. As far as I could tell, there was no attempt to establish that the two populations are compatible and comparable.