Superb tile map offering multiple avenues for exploration

Here's a beauty by WSJ Graphics:


The article is here.

This data graphic illustrates the power of the visual medium. The underlying dataset is complex: power production by type of source by state by month by year. That's more than 90,000 numbers. They all reside on this graphic.

Readers amazingly make sense of all these numbers without much effort.

It starts with the summary chart on top.


The designer made decisions. The data are presented in relative terms, as proportion of total power production. Only the first and last years are labeled, thus drawing our attention to the long-term trend. The order of the color blocks is carefully selected so that the cleaner sources are listed at the top and the dirtier sources at the bottom. The order of the legend labels mirrors the color blocks in the area chart.

It takes only a few seconds to learn that U.S. power production has largely shifted away from coal with most of it substituted by natural gas. Other than wind, the green sources of power have not gained much ground during these years - in a relative sense.

This summary chart serves as a reading guide for the rest of the chart, which is a tile map of all fifty states. Embedded in the tile map is a small-multiples arrangement.


The map offers multiple avenues for exploration.

Some readers may look at specific states. For example, California.


Currently, about half of the power production in California come from natural gas. Notably, there is no coal at all in any of these years. In addition to wind, solar energy has also gained. All of these insights come without the need for any labels or gridlines!

Wsj_powerproduction_westernstatesBrowsing around California, readers find different patterns in other Western states like Oregon and Washington.

Hydroelectric energy is the dominant source in those two states, with wind gradually taking share.

At this point, readers realize that the summary chart up top hides remarkable state-level variations.


There are other paths through the map.

Some readers may scan the whole map, seeking patterns that pop out.

One such pattern is the cluster of states that use coal. In most of these states, the proportion of coal has declined.

Yet another path exists for those interested in specific sources of power.

For example, the trend in nuclear power usage is easily followed by tracking the purple. South Carolina, Illinois and New Hampshire are three states that rely on nuclear for more than half of its power.

Wsj_powerproduction_vermontI wonder what happened in Vermont about 8 years ago.

The chart says they renounced nuclear energy. Here is some history. This one-time event caused a disruption in the time series, unique on the entire map.


This work is wonderful. Enjoy it!

Easy breezy bar charts, perhaps

I came across the following bar chart (link), which presents the results of a survey of CMOs (Chief Marketing Officers) on their attitudes toward data analytics.

Big-Data-and-the-CMO_chart5-Hurdle-800_30Apr2013Responses are tabulated to the question about the most significant hurdle(s) against the increasing use of data and analytics for marketing.

Eleven answers were presented, in addition to the catchall "Other" response. I'm unable to divine the rule used by the designer to sequence the responses.

It's not in order of significance, the most obvious choice. It's not alphabetical, either.


I think this indiscretion is partially redeemed by the use of color shades. The darkest blue shade points our eyes to the most significant hurdle - lack of investment in technology (44% of respondents). The second most significant hurdle is "availability of credible tools for measuring effectiveness" (31%), and that too is in dark blue.

Now what? The third most popular answer has 30% of the respondents, but it's shown by the second palest blue! I then realize the colors don't actually convey any information. Five shades of blue were selected, and they are laid out from top to bottom, from palest to darkest, in a sequential, recursive manner.


This chart is wild. Notice how the heights of the bars are variable. It seems that some bars have been widened to accommodate wrapped lines of text. These small edits introduce visual distortion so that the areas of these bars no longer are proportional to the data.

I like a pair of design decisions. Not showing decimal places and appending the % sign on each bar label is good. They also extend the horizontal axis to 100%. This shows what proportion of the respondents selected any particular answer - we note that a respondent is allowed to select more than one response.

The following is a more standard way of making a bar chart. (The color shading is not necessary.)


This example proves that the V corner of the Trifecta Checkup is still relevant. After one develops a good question, collects useful data and selects a standard chart form, figuring out how to visually display the information is not as easy breezy as one might think.

Visualizing fertility rates around the globe

The following chart dropped on my Twitter feed.


It's an ambitious chart that tries to do a lot. The underlying data set contains fertility rate data from over 200 countries over 20 years.

The basic chart form is a column chart that is curled up into a ball. The column chart is given colors that map to continents. All countries are grouped into five continents. The column chart can only take a single data series, so the 2019 fertility rate is chosen.

Beyond this basic setup, the designer embellishes the chart with a trove of information. Here's a close up:


The first number is the 2019 fertility rate, which means all the data encoded into the columns are also printed on the chart itself. Then, the flag of each country forms the next ring. Then, the name of the country. Finally, in brackets, the percent change in fertility rate between 2000 and 2019.

That is not all. Some contextual information are injected in those arrows that connect the columns to the data labels. A green arrow indicates that the fertility rate is trending lower - which is the case in most countries around the world. Once in a while, a purple arrow pops up. In the above excerpt, Seychelles gets a purple arrow because this island nation has increased the fertility rate from 2000 to 2019.

Also hiding in the background are several dashed rings. I think only the one that partially overlaps with the column chart contains any information - the other rings are inserted for an artistic reason. To decipher this dashed ring, we must look at the inset in the top left corner. We learn that the value of 2.1 children per woman is known as the replacement fertility rate. So it's also possible to assess whether each country is above or below the replacement fertility rate threshold.


[I'm presuming that this replacement threshold is about the births necessary to avoid a population decline. If that's the case, then comparing each country's fertility rate to a global fertility rate threshold is too simplistic because fertility is only one of several key factors driving a country's population growth. A more sophisticated model should generate country-level thresholds.]


Data graphics serve many functions. This chart works well as an embellished data table. It does take some time to find a specific country because the columns have been sorted by decreasing 2019 fertility rate but once we locate the column, all the other data fields are clearly laid out.

As a generator of data insights, this chart is less effective. The main insight I obtained from it is a rough ranking of continents, with African countries predominantly having higher fertility rates, followed by Asia and Oceania, then Americas, and finally, Europe which has the lowest fertility rates. If this is the key message, a standard choropleth map brings it out more directly.


Here is a small-multiples rendering of the fertility dataset. I chose 1999 values instead of 2000 to make a complete two-decade view.


The columns represent a grouping of countries based on their 1999 fertility rates. The left column contains countries with the lowest number of births per woman, and the fertility rate increases left to right - both within an individual plot and in the grid.

If you're wondering, the hidden vertical axis sorts the countries by their 1999 rank. The lighter colors are 1999 values while the darker colors are 2019 values. For most countries the dots are shifting left over the 20 years. There are some exceptions. I have labeled several of these exceptions (e.g. Kazakhstan and Mongolia), and rendered them in italic.




Ridings, polls, elections, O Canada

Stephen Taylor reached out to me about his work to visualize Canadian elections data. I took a look. I appreciate the labor of love behind this project.

He led with a streamgraph, which presents a quick overview of relative party strengths over time.


I am no Canadian election expert, and I did a bare minimum of research in writing this blog. From this chart, I learn that:

  • the Canadians have an irregular election schedule
  • Canada has a two party plus breadcrumbs system
  • The two dominant parties are Liberals and Conservatives. The Liberals currently hold just less than half of the seats. The Conservatives have more than half of the seats not held by Liberals
  • The Conservative party (maybe) rebranded as "progressive conservative" for several decades. The Reform/Alliance party was (maybe) a splinter movement within the Conservatives as well.
  • Since the "width" of the entire stream increased over time, I'm guessing the number of seats has expanded

That's quite a bit of information obtained at a glance. This shows the power of data visualization. Notice Stephen didn't even have to include a "how to read this" box.

The streamgraph form has its limitations.

The feature that makes it more attractive than an area chart is its middle anchoring, resulting in a form of symmetry. The same feature produces erroneous intuition - the red patch draws out a declining trend; the reader must fight the urge to interpret the lines and focus on the areas.

The breadcrumbs are well hidden. The legend below discloses that the Green Party holds 3 seats currently. The party has never held enough seats to appear on the streamgraph though.

The bars showing proportions in the legend is a very nice touch. (The numbers appear messed up - I have to ask Stephen whether the seats shown are current values, or some kind of historical average.) I am a big fan of informative legends.


The next featured chart is a dot plot of polling results since 2020.


One can see a three-tier system: the two main parties, then the NDP (yellow) is the clear majority of the minority, and finally you have a host of parties that don't poll over 10%.

It looks like the polls are favoring the Conservatives over the Liberals in this election but it may be an election-day toss-up.

The purple dots represent "PPC" which is a party not found elsewhere on the page.

This chart is clear as crystal because of the structure of the underlying data. It just amazes me that the polls are so highly correlated. For example, across all these polls, the NDP has never once polled better than either the Liberals or the Conservatives, and in addition, it has never polled worse than any of the small parties.

What I'd like to see is a chart that merges the two datasets, addressing the question of how well these polls predicted the actual election outcomes.


The project goes very deep as Stephen provides charts for individual "ridings" (perhaps similar to U.S. precincts).

Here we see population pyramids for Vancouver Center, versus British Columbia (Province), versus Canada.


This riding has a large surplus of younger people in their twenties and thirties. Be careful about the changing scales though. The relative difference in proportions are more drastic than visually displayed because the maximum values (5%) on the Province and Canada charts are half that on the Riding chart (10%). Imagine squashing the Province and Canada charts to half their widths.

Analyses of income and rent/own status are also provided.

This part of the dashboard exhibits a problem common in most dashboards - they present each dimension of the data separately and miss out on the more interesting stuff: the correlation between dimensions. Do people in their twenties and thirties favor specific parties? Do richer people vote for certain parties?


The riding-level maps are the least polished part of the site. This is where I'm looking for a "how to read it" box.


It took me a while to realize that the colors represent the parties. If I haven't come in from the front page, I'd have been totally lost.

Next, I got confused by the use of the word "poll". Clicking on any of the subdivisions bring up details of an actual race, with party colors, candidates and a donut chart showing proportions. The title gives a "poll id" and the name of the riding in parentheses. Since the poll id changes as I mouse over different subdivisions, I'm wondering whether a "poll" is the term for a subdivision of a riding. A quick wiki search indicates otherwise.


My best guess is the subdivisions are indicated by the numbers.

Back to the donut charts, I prefer a different sorting of the candidates. For this chart, the two most logical orderings are (a) order by overall popularity of the parties, fixed for all ridings and (b) order by popularity of the candidate, variable for each riding.

The map shown above gives the winner in each subdivision. This type of visualization dumps a lot of information. Stephen tackles this issue by offering a small multiples view of each party. Here is the Liberals in Vancouver.


Again, we encounter ambiguity about the color scheme. Liberals have been associated with a red color but we are faced with abundant yellow. After clicking on the other parties, you get the idea that he has switched to a divergent continuous color scale (red - yellow - green). Is red or green the higher value? (The answer is red.)

I'd suggest using a gray scale for these charts. The hardest decision is going to be the encoding between values and shading. Should each gray scale be different for each riding and each party?

If I were to take a guess, Stephen must have spent weeks if not months creating these maps (depending on whether he's full-time or part-time). What he has published here is a great start. Fine-tuning the issues I've mentioned may take more weeks or months more.


Stephen is brave and smart to send this project for review. For one thing, he's got some free consulting. More importantly, we should always send work around for feedback; other readers can tell us where our blind spots are.

To read more, start with this post by Stephen in which he introduces his project.

Simple charts are the hardest to do right

The CDC website has a variety of data graphics about many topics, one of which is U.S. vaccinations. I was looking for information about Covid-19 data broken down by age groups, and that's when I landed on these charts (link).


The left panel shows people with at least one dose, and the right panel shows those who are "fully vaccinated." This simple chart takes an unreasonable amount of time to comprehend.


The analyst introduces three metrics, all of which are described as "percentages". Upon reflection, they are proportions of the people in specific age ranges.

Readers are thus invited to compare these proportions. It's not clear, however, which comparisons are intended. The first item listed in the legend states "Percent among Persons who completed all recommended doses in last 14 days". For most readers, including me, this introduces an unexpected concept. The 14 days here do not refer to the (in)famous 14-day case-counting window but literally the most recent two weeks relative to when the chart was produced.

It would have been clearer if the concept of Proportions were introduced in the chart title or axis title, while the color legend explains the concept of the base population. From the lighter shade to the darker shade (of red and blue) to the gray color, the base population shifts from "Among Those Who Completed/Initiated Vaccinations Within Last 14 Days" to "Among Those Who Completed/Initiated Vaccinations Any Time" to "Among the U.S. Population (regardless of vaccination status)".

Also, a reverse order helps our comprehension. Each subsequent category is a subset of the one above. First, the whole population, then those who are fully vaccinated, and finally those who recently completed vaccinations.

The next hurdle concerns the Q corner of our Trifecta Checkup. The design leaves few hints as to what question(s) its creator intended to address. The age distribution of the U.S. population is useless unless it is compared to something.

One apparently informative comparison is the age distribution of those fully vaccinated versus the age distribution of all Americans. This is revealed by comparing the lengths of the dark blue bar and the gray bar. But is this comparison informative? It's telling me that people aged 50 to 64 account for ~25% of those who are fully vaccinated, and ~20% of all Americans. Because proportions necessarily add to 100%, this implies that other age groups have been less vaccinated. Duh! Isn't that the result of an age-based vaccination prioritization? During the first week of the vaccination campaign, one might expect close to 100% of all vaccinations to be in the highest age group while it was 0% for the other age groups.

This is a chart in search of a question. The 25% vs 20% comparison does not assist readers in making a judgement. Does this mean the vaccination campaign is working as expected, worse than expected or better than expected? The problem is the wrong baseline. The designer of this chart implies that the expected proportions should conform to the overall age distribution - but that clearly stands in the way of CDC's initial prioritization of higher-risk age groups.


In my version of the chart, I illustrate the proportion of people in each age group who have been fully vaccinated.


Among those fully vaccinated, some did it within the most recent two weeks:



Elsewhere on the CDC site, one learns that on these charts, "fully vaccinated" means one shot of J&J or 2 shots of Pfizer or Moderna, without dealing with the 14-day window or other complications. Why do we think different definitions are used in different analyses? Story-first thinking, as I have explained here. When it comes to telling the story about vaccinations, the story is about the number of shots in arms. They want as big a number as possible, and abandon any criterion that decreases the count. When it comes to reporting on vaccine effectiveness, they want as small a number of cases as possible.






Plotting the signal or the noise

Antonio alerted me to the following graphic that appeared in the Economist. This is a playful (?) attempt to draw attention to racism in the game of football (soccer).

The analyst proposed that non-white players have played better in stadiums without fans due to Covid19 in 2020 because they have not been distracted by racist abuse from fans, using Italy's Serie A as the case study.


The chart struggles to bring out this finding. There are many lines that criss-cross. The conclusion is primarily based on the two thick lines - which show the average performance with and without fans of white and non-white players. The blue line (non-white) inched to the right (better performance) while the red line (white) shifted slightly to the left.

If the reader wants to understand the chart fully, there's a lot to take in. All (presumably) players are ranked by the performance score from lowest to highest into ten equally sized tiers (known as "deciles"). They are sorted by the 2019 performance when fans were in the stadiums. Each tier is represented by the average performance score of its members. These are the values shown on the top axis labeled "with fans".

Then, with the tiers fixed, the players are rated in 2020 when stadiums were empty. For each tier, an average 2020 performance score is computed, and compared to the 2019 performance score.

The following chart reveals the structure of the data:


The players are lined up from left to right, from the worst performers to the best. Each decile is one tenth of the players, and is represented by the average score within the tier. The vertical axis is the actual score while the horizontal axis is a relative ranking - so we expect a positive correlation.

The blue line shows the 2019 (with fans) data, which are used to determine tier membership. The gray dotted line is the 2020 (no fans) data - because they don't decide the ranking, it's possible that the average score of a lower tier (e.g. tier 3 for non-whites) is higher than the average score of a higher tier (e.g. tier 4 for non-whites).

What do we learn from the graphic?

It's very hard to know if the blue and gray lines are different by chance or by whether fans were in the stadium. The maximum gap between the lines is not quite 0.2 on the raw score scale, which is roughly a one-decile shift. It'd be interesting to know the variability of the score of a given player across say 5 seasons prior to 2019. I suspect it could be more than 0.2. In any case, the tiny shifts in the averages (around 0.05) can't be distinguished from noise.


This type of analysis is tough to do. Like other observational studies, there are multiple problems of biases and confounding. Fan attendance was not the only thing that changed between 2019 and 2020. The score used to rank players is a "Fantacalcio algorithmic match-level fantasy-football score." It's odd that real-life players should be judged by their fantasy scores rather than their on-the-field performance.

The causal model appears to assume that every non-white player gets racially abused. At least, the analyst didn't look at the curves above and conclude, post-hoc, that players in the third decile are most affected by racial abuse - which is exactly what has happened with the observational studies I have featured on the book blog recently.

Being a Serie A fan, I happen to know non-white players are a small minority so the error bars are wider, which is another issue to think about. I wonder if this factor by itself explains the shifts in those curves. The curve for white players has a much higher sample size thus season-to-season fluctuations are much smaller (regardless of fans or no fans).





Come si dice donut in italiano

One of my Italian readers sent me the following "horror chart". (Last I checked, it's not Halloween.)


I mean, people are selling these rainbow sunglasses.


The dataset behind the chart is the market share of steel production by country in 1992 and in 2014. The presumed story is how steel production has shifted from country to country over those 22 years.

Before anything else, readers must decipher the colors. This takes their eyes off the data and on to the color legend placed on the right column. The order of the color legend is different from that found in the nearest object, the 2014 donut. The following shows how our eyes roll while making sense of the donut chart.


It's easier to read the 1992 donut because of the order but now, our eyes must leapfrog the 2014 donut.


This is another example of a visualization that fails the self-sufficiency test. The entire dataset is actually printed around the two circles. If we delete the data labels, it becomes clear that readers are consuming the data labels, not the visual elements of the chart.


The chart is aimed at an Italian audience so they may have a patriotic interest in the data for Italia. What they find is disappointing. Italy apparently completely dropped out of steel production. It produced 3% of the world's steel in 1992 but zero in 2014.

Now I don't know if that is true because while reproducing the chart, I noticed that in the 2014 donut, there is a dark orange color that is not found in the legend. Is that Italy or a mysterious new entrant to steel production?

One alternative is a dot plot. This design accommodates arrows between the dots indicating growth versus decline.



And you thought that pie chart was bad...

Vying for some of the worst charts of the year, Adobe came up with a few gems in its Digital Trends Survey. This was a tip from Nolan H. on Twitter.

There are many charts that should be featured; I'll focus on this one.


This is one of those survey questions that allow each respondent to select multiple responses so that adding up the percentages exceeds 100%. The survey asks people which of these futuristic products do they think is most important. There were two separate groups of respondents, consumers (lighter red) and businesses (darker red).

If, like me, you are a left-to-right, top-to-bottom reader, you'd have consumed this graphic in the following way:


The most important item is found in the lower bottom corner while the least important is placed first.

Here is a more sensible order of these objects:


To follow this order, our eyes must do this:


Now, let me say I like what they did with the top of the chart:


Put the legend above the chart because no one can understand it without first reading the legend.


Junkcharts_adobedigitaltrends_datadistortionData are embedded into part-circles (i.e. sectors)... but where do we find the data? The most obvious place to look for them is the areas of the sectors. But that's the wrong place. As I show in the explainer, the designer placed the data in the "height" - the distance from the peak point of the object to the horizontal baseline.

As a result of this choice, the areas of the sectors distort the data - they are proportional to the square of the data.

One simple way to figure out that your graphical objects have obscured the data is the self-sufficiency test. Remove all data labels from the chart, and ask if you still have something understandable.


With these unusual shapes, it's not easy to judge how much larger is one object from the next. That's why the data labels were included - the readers are looking at the data values, rather than the graphical objects. That's sad, if you are the designer.


One last mystery. What decides the layering of the light vs dark red sectors?


This design always places the smaller object in front of the larger object. Recall that the light red is for consumers and dark red for businesses. The comparison between these disjoint segments is not as interesting as the comparison of different ratings of technologies with each segment. So it's unfortunate that this aspect may get more attention than it deserves. It's also a consequence of the chart form. If the light red is always placed in front, then in some panels (such as the middle one shown above), the light red completely blocks the dark red.


Same data + same chart form = same story. Maybe.

We love charts that tell stories.

Some people believe that if they situate the data in the right chart form, the stories reveal themselves.

Some people believe for a given dataset, there exists a best chart form that brings out the story.

An implication of these beliefs is that the story is immutable, given the dataset and the chart form.

If you use the Trifecta Checkup, you already know I don't subscribe to those ideas. That's why the Trifecta has three legs, the third is the question - which is related to the message or the story.


I came across the following chart by Statista, illustrating the growth in Covid-19 cases from the start of the pandemic to this month. The underlying data are collected by WHO and cover the entire globe. The data are grouped by regions.


The story of this chart appears to be that the world moves in lock step, with each region behaving more or less the same.

If you visit the WHO site, they show a similar chart:


On this chart, the regions at the bottom of the graph (esp. Southeast Asia in purple) clearly do not follow the same time patterns as Americas (orange) or Europe (green).

What we're witnessing is: same data, same chart form, different stories.

This is a feature, not a bug, of the stacked area chart. The story is driven largely by the order in which the pieces are stacked. In the Statista chart, the largest pieces are placed at the bottom while for WHO, the order is exactly reversed.

(There are minor differences which do not affect my argument. The WHO chart omits the "Other" category which accounts for very little. Also, the Statista chart shows the smoothed data using 7-day averaging.)

In this example, the order chosen by WHO preserves the story while the order chosen by Statista wipes it out.


What might be the underlying question of someone who makes this graph? Perhaps it is to identify the relative prevalence of Covid-19 in different regions at different stages of the pandemic.

Emphasis on the word "relative". Instead of plotting absolute number of cases, I consider plotting relative number of cases, that is to say, the proportion of cases in each region at given times.

This leads to a stacked area percentage chart.


In this side-by-side view, you see that this form is not affected by flipping the order of the regions. Both charts say the same thing: that there were two waves in Europe and the Americas that dwarfed all other regions.



Atypical time order and bubble labeling

This chart appeared in a Charles Schwab magazine in Summer, 2019.


This bubble chart does not print any data labels. The bubbles take our attention but the designer realizes that the actual values of the volatility are not intuitive numbers. The same is true of any standard deviation numbers. If you're told SD of a data series is 3, it doesn't tell you much by itself.

I first transformed this chart into the equivalent column chart:


Two problems surface on the axes.

For the time axis, the years are jumbled. Readers experience vertigo, as we try to figure out how to read the chart. Our expectation that time moves left to right is thwarted. This ordering also requires every single year label to be present.

For the vertical axis, I could have left out the numbers completely. They are not really meaningful. These represent the areas of the bubbles but only relative to how I measured them.


In the next version, I sorted time in the conventional manner. Following Tufte's classic advice, only the tops of the columns are plotted.


What you see is that this ordering is much easier to comprehend. Figuring out that 2018 is an average year in terms of volatility is not any harder than in the original. In fact, we can reproduce the order of the previous chart just by letting our eyes sweep top to bottom.

To make it even easier to read the vertical axis, I converted the numbers into an index, with the average volatility as 100 (assigned to 0% on the chart) .


Now, you can see that 2018 is roughly at the average while 2008 is 400% above the average level. (How should we interpret this statement? That's a question I pose to my statistics students. It's not intuitive how one should interpret the statement that the standard deviation is 5 times higher.)