Digital revolution in China: two visual takes

The following map accompanied an article in the Economist about China's drive to create a "digital silkroad," roughly defined as making a Silicon Valley. 


The two variables plotted are the wealth of each province (measured by GDP per capita) and the level of Internet penetration. The designer made the following choices:

  • GDP per capita is presented with less precision than Internet penetration. The former is grouped into five large categories while the latter is given as a percentage to one decimal place.
  • The visual design favors GDP per capita, which is encoded as the shade of color of each province. The Internet penetration data appear to have been added on as an afterthought.

If we apply the self-sufficiency test (i.e. by removing the printed data from the chart), it's immediately clear that the visual elements convey zero information about Internet penetration. This is a serious problem for a chart about the "digital silkroad"!


If those two variables are chosen, it would seem appropriate to convey to readers the correlation between the two variables. The following sketch is focused on surfacing the correlation.


(Click on the image to see it in full.) Here is the top of the graphic:


The individual maps are not strictly necessary. Just placing provincial names onto the grid is enough, because regional pattern isn't salient here.

The Internet penetration data were grouped into five categories as well, putting them on an equal footing with GDP per capita.
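For anyone who wants to try this kind of grid, here is a minimal sketch of the grouping step. The province names and numbers are made up, and the equal-count (quintile) binning is my assumption; the sketch above may well have used different cut points.

```python
from collections import defaultdict

# Hypothetical values for illustration only; not the Economist's data.
# name: (GDP per capita, Internet penetration in percent)
provinces = {
    "P01": (30000, 75.1), "P02": (28000, 70.4), "P03": (22000, 66.0),
    "P04": (18000, 52.3), "P05": (15000, 58.9), "P06": (12000, 47.5),
    "P07": (11000, 49.8), "P08": (9000, 44.2), "P09": (8000, 40.6),
    "P10": (7000, 38.1),
}

def quintile_bins(values, n_bins=5):
    """Assign each value a bin from 0 (lowest) to n_bins - 1 by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

names = list(provinces)
gdp_bins = quintile_bins([provinces[n][0] for n in names])
net_bins = quintile_bins([provinces[n][1] for n in names])

# Each cell of the 5x5 grid lists the provinces that fall into it.
grid = defaultdict(list)
for name, g, p in zip(names, gdp_bins, net_bins):
    grid[(g, p)].append(name)

for cell in sorted(grid):
    print(cell, grid[cell])
```

Provinces that land on the diagonal of the grid are the ones where the two variables agree; the off-diagonal cells are where the story is.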


Discoloring the chart to re-discover its plot

Today's chart comes from Pew Research Center, and the big question is why the colors?


The data show the age distributions of the followers of different religions. It's a stacked bar chart, in which the ages have been grouped into the young (under 15), the old (60 plus) and everyone else. Five religions are afforded their own bars while "folk" religions are grouped as one, as are "other" religions. There is even a bar for the unaffiliated. "World" presumably is the aggregate of all the other bars, weighted by the popularity of each religion group.
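To make the "World" row concrete: if it is indeed a popularity-weighted aggregate (a presumption, as noted), it would be computed along these lines. The group names and shares below are invented for illustration and are not Pew's figures.

```python
# Hypothetical shares for illustration only; not Pew's figures.
# name: (share of world population, share under 15, share 60 and over)
groups = {
    "Group A": (0.50, 0.30, 0.10),
    "Group B": (0.30, 0.20, 0.15),
    "Group C": (0.20, 0.10, 0.25),
}

def weighted_average(groups, index):
    """Average the column at `index`, weighting each group by its population share."""
    total_weight = sum(g[0] for g in groups.values())
    return sum(g[0] * g[index] for g in groups.values()) / total_weight

world_young = weighted_average(groups, 1)
world_old = weighted_average(groups, 2)
print(round(world_young, 3), round(world_old, 3))
```

The large groups dominate the aggregate, which is why the "World" bar tends to sit close to the biggest religions.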

So far so good. But what is it that demands 9 colors, and 27 total shades? In other words, one shade for every data point on this chart.

Here is a more restrained view:



Let's follow the designer's various decisions. The choice of those age groups indicates that the story is really happening at the "margins": Muslims and Hindus have higher proportions of younger followers while Jews and Buddhists have higher concentrations of older followers.

Therein lies the problem. Because of the lengths, their central locations, and the tints, the middle section of each bar is the most eye-catching: the reader is glancing at the wrong part of the chart.

So, let me fix this by re-ordering the three panels:

Is there really a need to draw those gray bars? The middle age group (a catch-all) only exists to assure readers that everyone who's supposed to be included has been included. Why plot it?


The above chart says "trust me, what isn't drawn here constitutes the remaining population, and the whole adds to 100%."


Another issue with these charts, exacerbated by inflexible software defaults, is the forced choice of imbuing one variable with a super status above the others. In the Pew chart, the rows are ordered by decreasing proportion of the young age group, except for the "everyone" group pinned as the bottom row. As a result, the green bars (old age group) are not in any particular order, and their pattern is much harder to comprehend.

In the final version, I break the need to keep bars of the same religion on the same row:


Five colors are used. Three of them are used to cluster similar religions: Muslims and Hindus (in blue) have higher proportions of the young compared to the world average (gray) while the religions painted in green have higher proportions of the old. Christians (in orange) are unusual in that the proportions are higher than average in both young and old age groups. Everyone and unaffiliated are given separate colors.

The colors here serve two purposes: connecting the two panels, and revealing the cluster structure.
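The difference between a shared row order and an independent order per panel is easy to see in a few lines. This is only a sketch with invented percentages, not Pew's data.

```python
# Hypothetical percentages for illustration only; not Pew's figures.
# (label, percent under 15, percent 60 and over)
rows = [
    ("Group A", 34, 7),
    ("Group B", 27, 14),
    ("Group C", 20, 15),
    ("Group D", 21, 20),
]

# Shared ordering: the "young" variable dictates the row order of both panels.
shared = sorted(rows, key=lambda r: r[1], reverse=True)
old_in_shared_order = [r[2] for r in shared]

# Independent ordering: each panel is sorted by its own variable.
young_panel = sorted(rows, key=lambda r: r[1], reverse=True)
old_panel = sorted(rows, key=lambda r: r[2], reverse=True)

print(old_in_shared_order)          # not monotone: hard to read
print([r[0] for r in old_panel])    # sorted: easy to read
```

Once the rows no longer line up across panels, something else has to connect them, and that is the job the colors do.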





Governor of Maine wants a raise

In a Trifecta checkup, this map scores low on the Q corner: what is its purpose? What have readers learned about the salaries of state governors after looking at the map? (Link to original)


The most obvious "insights" include:

  • There are more Republican governors than Democratic governors
  • Most Democratic governors are from the coastal states
  • There is exactly one Independent governor
  • Small states on the Eastern seaboard are messing up the design

Notice I haven't said anything about salaries. That's because the reader has to read the data labels to learn the governor's salary in each state. There is no way to learn the average or median salary, or even the maximum and minimum, without spending quality time with the labels.

This is also an example of a chart that is invariant to the data. The chart would look exactly the same if I substituted the real salaries with 50 fake numbers.


The following design attempts to say something about the data. The dataset is actually not that interesting because the salaries are relatively closely clustered.

You get to see the full range of salaries, with the median, 25th and 75th percentiles marked off. The states are divided into top and bottom halves, with the median as the splitting level. A simple clustering algorithm is applied to group the salaries into similar categories, then color-coded.
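Here is a rough sketch of the summary steps just described. The salaries are invented, and the gap-based grouping stands in for the "simple clustering algorithm", which is not specified; treat both the method and the 10,000-dollar threshold as assumptions.

```python
import statistics

# Hypothetical salaries in dollars, for illustration only.
salaries = {
    "S1": 70000, "S2": 95000, "S3": 98000, "S4": 105000,
    "S5": 130000, "S6": 133000, "S7": 139000, "S8": 180000,
}

values = sorted(salaries.values())
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")

# Split the states at the median.
top_half = [s for s, v in salaries.items() if v >= median]
bottom_half = [s for s, v in salaries.items() if v < median]

def gap_clusters(sorted_values, max_gap):
    """Start a new cluster whenever the jump to the next value exceeds max_gap."""
    clusters = [[sorted_values[0]]]
    for v in sorted_values[1:]:
        if v - clusters[-1][-1] > max_gap:
            clusters.append([v])
        else:
            clusters[-1].append(v)
    return clusters

clusters = gap_clusters(values, max_gap=10000)
print(q1, median, q3)
print(clusters)
```

Each cluster would then get its own color, so that states with similar salaries read as one group.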

The Maine governor is the least compensated.

If you have other ideas for this dataset, feel free to submit them to me.

A gem among the snowpack of Olympics data journalism

It's not often I come across a piece of data journalism that pleases me so much. Here it is: the "Happy 700" article by the Washington Post, and it is amazing.



When data journalism and dataviz are done right, the designers have made good decisions. Here are some of the key elements that make this article work:

(1) Unique

The topic is timely but timeliness heightens both the demand and supply of articles, which means only the unique and relevant pieces get the readers' attention.

(2) Fun

The tone is light-hearted. It's a fun read. It's also a little bit informative, as when they describe the towns that few have heard of. The notion is slightly silly but the reader won't care.

(3) Data

It's always a challenge to make data come alive, and these authors succeeded. Most of the data work involves finding, collecting and processing the data. There isn't any sophisticated analysis. But it is a powerful demonstration that complex analysis is not always necessary.

(4) Organization

The structure of the data is three criteria (elevation, population, and terrain) by cities. A typical way of showing such data might be an annotated table, or a Bumps-type chart, grouped columns, and so on. All these formats try to stuff the entire dataset onto one chart. The designers chose to highlight one variable at a time, cumulatively, on three separate maps. This presentation fits perfectly with the flow of the writing. 

(5) Details

The execution involves some smart choices. I am a big fan of legend/axis labels that are informative, for example, note that the legend doesn't say "Elevation in Meters":


The color scheme across all three maps shows a keen awareness of background/foreground concerns. 

Excellent visualization of gun violence in American cities

I like the Guardian's feature (undated) on gun violence in American cities a lot.

The following graphic illustrates the situation in Baltimore.


The designer starts by plotting where the gun homicides occurred in 2015, then leads readers through an exploration of the key factors that might be associated with the spatial distribution of those homicides.

The blue color measures poverty levels. There is a moderate correlation between high numbers of dots (homicides) and deeper blue (poorer). The magenta color measures education attainment and the orange color measures proportion of blacks. In Baltimore, it appears that race is substantially better at explaining the prevalence of homicides.

This work is exemplary because it transcends description (first map) and explores explanations for the spatial pattern. Because three factors are explored together in a small-multiples layout, readers learn that no single factor can explain everything. In addition, we learn that different factors have different degrees of explanatory power.

Attentive readers will also find that the three factors of poverty, education attainment and proportion black are mutually correlated.  Areas with large black populations also tend to be poorer and less educated.
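One way to check such mutual correlation with numbers rather than by eye is to compute pairwise Pearson correlations across neighborhoods. The sketch below uses invented neighborhood-level values, not the Guardian's data, and the variable names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for five neighborhoods, in percent; illustration only.
factors = {
    "poverty": [10, 20, 30, 40, 50],
    "low_education": [12, 18, 33, 37, 52],
    "pct_black": [5, 30, 45, 70, 90],
}

names = list(factors)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, round(pearson(factors[a], factors[b]), 2))
```

When all three pairs come out high, as in this made-up example, the maps cannot by themselves tell us which factor is doing the explaining.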


I also like the introductory section in which a little dose of interactivity is used to sequentially present the four maps, now superimposed. It then becomes possible to comprehend the rest quickly.



The top section is less successful as proportions are not easily conveyed via dot density maps.


Dropping the map form helps. Here is a draft of what I have in mind. I just pulled some data from online sources at the metropolitan area (MSA) level, and it doesn't have as striking a comparison as the city-level data, it seems.



 PS. On Twitter, Aliza tells me the article was dated January 9, 2017.

The visual should be easier to read than your data

A reader sent this tip in some time ago and I lost track of who he/she is. This graphic looks deceptively complex.


What's complex is not the underlying analysis. The design is complex and so the decoding is complex.

The question of the graphic is a central concern of anyone who's retired: how long will one's savings last? There are two related metrics to describe the durability of the stash, and they are both present on this chart. The designer first presumes that one has saved $1 million for retirement. Then he/she computes how many years the savings will last. That, of course, depends on the cost of living, which naively can be expressed as a projected annual expenditure. The designer allows the cost of living to vary by state, which is the main source of variability in the computations. The time-based and dollar-based metrics are directly linked to one another via a formula.
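The formula linking the two metrics is presumably nothing more than savings divided by annual spending. A minimal sketch, assuming no investment returns and no inflation (the graphic does not say how these were handled), with made-up annual costs:

```python
def years_and_months(savings, annual_cost):
    """How long `savings` lasts at a constant `annual_cost`, as (years, months).

    Assumes no investment returns and no inflation; whole dollars in, whole months out.
    """
    total_months = savings * 12 // annual_cost
    return divmod(total_months, 12)

# Hypothetical annual costs of living, for illustration only.
for cost in (40_000, 48_000, 50_000):
    years, months = years_and_months(1_000_000, cost)
    print(f"${cost:,} a year: {years} years, {months} months")
```

Since one metric is a simple transformation of the other, encoding both on the same chart doubles the decoding work without adding information.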

The design encodes the time metric in a grid of dots, and the dollar-metric in the color of the dots. The expenditures are divided into eight segments, given eight colors from deep blue to deep pink.

Thirteen of those dots are invariable, appearing in every state. Readers are drawn into a ranking of the states, which is nothing but a ranking of costs of living. (We don't know, but presume, that the cost of living computation is appropriate for retirees, and not averaged.) This order obscures any spatial correlation. There are a few production errors in the first row in which the year and month numbers are misstated slightly; the numbers should be monotonically decreasing. In terms of years and months, the difference between many states is immaterial. The pictogram format is more popular than it deserves: only highly motivated readers will count individual dots. If readers are merely reading the printed text, which contains all the data encoded in the dots, then the graphic has failed the self-sufficiency principle - the visual elements are not doing any work.


In my version, I surface the spatial correlation using maps. The states are classified into sensible groups that allow a story to be told around the analysis. Three groups of states are identified and separately portrayed. The finer variations between states within each state group appear as shades.


Data visualization should make the underlying data easier to comprehend. It's a problem when the graphic is harder to decipher than the underlying dataset.




Getting into the head of the chart designer

When I look at this chart (from Business Insider), I try to understand the decisions made by its designer - which things are important to her/him, and which things are less important.


The chart shows average salaries in the top 2 percent of income earners. The data are split by gender and by state.

First, I notice that the designer chooses to use the map form. This decision suggests that the spatial pattern of top incomes is of top interest to the designer because she/he is willing to accept the map's constraints - namely, the designer loses control of the x and y dimensions, as well as the area and shape of the data containers. For the U.S. state map, there is no elegant solution to the large number of small states problem in the Northeast.

Second, I notice the color choice. The designer provides actual values on the visualization but also groups all state-average incomes into five categories. It's not clear how she/he determines the boundaries of these income brackets. There are many more dark blue states than there are light blue states in the map for men. Because women's incomes are everywhere lower than men's, the map at the bottom fits all states into two large buckets, plus Connecticut. Women's incomes are lower than men's, but there is no need to break the data down by state to convey this message.

Third, the use of two maps indicates that the designer does not care much about gender comparisons within each state. These comparisons are difficult to accomplish on the chart - one must involuntarily bob one's head up and down to make the comparisons. The head bobbing isn't even enough: then you must pull out your calculator and compute the ratio of women to men average. If the designer wants to highlight state-level comparisons, she/he could have plotted the gender ratio on a single map, like this:



So far, I infer that the key questions are (a) the gender gap in aggregate (b) the variability of incomes within each gender, or the spatial clustering (c) the gender gap within each state.

(a) is better conveyed in more aggregate form. Goal (b) is defeated by the lack of clear clustering. (c) is not helped by the top-bottom split.

In making the above chart, I discover a pattern - that women fare better in the smaller states like Montana, Iowa, North & South Dakota. Meanwhile, the disparity in New York is of the same degree as Oklahoma and Wyoming.


 This chart tells readers a bit more about the underlying data, without having to print the entire dataset on the page.
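The calculator work mentioned earlier, the ratio of women's to men's average income in each state, takes only a few lines. The state names and incomes below are invented for illustration.

```python
# Hypothetical average incomes in dollars (men, women); illustration only.
incomes = {
    "State A": (300_000, 240_000),
    "State B": (400_000, 260_000),
    "State C": (250_000, 225_000),
}

# Ratio of women's to men's average income: 1.0 would mean parity.
ratios = {state: women / men for state, (men, women) in incomes.items()}

# Rank states from the smallest gap to the largest.
ranked = sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)
for state, ratio in ranked:
    print(f"{state}: {ratio:.2f}")
```

Plotting this single number per state is what lets one map replace two.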




Making people jump over hoops

Take a look at the following chart, and guess what message the designer wants to convey:


This chart accompanied an article in the Wall Street Journal about Wells Fargo losing brokers due to the fake account scandal, and using bonuses to lure them back. Like you, my first response to the chart was that little has changed from 2015 to 2017.

The intention of the whitespace inserted to split the four columns into two pairs is a bit mysterious. It's not obvious that UBS and Merrill are different from Wells Fargo and Morgan Stanley. This device might have been used to overcome the difficulty of reading four columns side by side.

The additional challenge of this dataset is the outlier values for UBS, which elongate the range of the vertical axis, squeezing together the values of the other three banks.

In this first alternative version, I play around with irregular gridlines.


Grouped column charts are not great at conveying changes over time, as they cause our eyes to literally jump over hoops. In the second version, I use a bumps chart to compactly highlight the trends. I also zoom in on the quarterly growth rates.
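For reference, quarterly growth rates are just the quarter-over-quarter percentage changes. A minimal sketch with invented values (not the Journal's data):

```python
def growth_rates(series):
    """Quarter-over-quarter growth: (current - previous) / previous."""
    return [(curr - prev) / prev for prev, curr in zip(series, series[1:])]

# Hypothetical quarterly values for one bank, for illustration only.
quarterly = [15000, 14850, 14700, 14553]

rates = growth_rates(quarterly)
print([f"{r:+.2%}" for r in rates])
```

Working with rates instead of levels also takes care of the outlier problem, since every bank is measured against its own previous quarter.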


The rounded interpolation removes the sharp angles from the typical bumps chart (aka slopegraph) but it does add patterns that might not be there. This type of interpolation however respects the values at the "knots" (here, the quarterly values) while a smoother may move those points. On balance, I like this treatment.
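The distinction between an interpolation and a smoother can be shown with a toy example. I am swapping in cosine interpolation and a 3-point moving average here as stand-ins; the chart above may use a different rounded interpolation, and the values are invented.

```python
import math

def cosine_interpolate(knots, steps):
    """Smooth curve through equally spaced knots, with `steps` points per segment."""
    curve = []
    for y0, y1 in zip(knots, knots[1:]):
        for k in range(steps):
            t = k / steps
            w = (1 - math.cos(math.pi * t)) / 2  # eases from 0 to 1
            curve.append(y0 + (y1 - y0) * w)
    curve.append(knots[-1])
    return curve

def moving_average(values):
    """3-point moving average; the two endpoints are left unchanged."""
    out = list(values)
    for i in range(1, len(values) - 1):
        out[i] = (values[i - 1] + values[i] + values[i + 1]) / 3
    return out

# Hypothetical quarterly values, for illustration only.
knots = [10.0, 14.0, 9.0, 12.0]

curve = cosine_interpolate(knots, steps=10)
smoothed = moving_average(knots)

print(curve[::10])  # the interpolated curve still passes through every knot
print(smoothed)     # the smoother has moved the interior points
```

The cosine curve flattens out at every knot, which is exactly the kind of added pattern that might not be in the data.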


PS. [6/2/2017] Given the commentary below, I am including the straight version of the chart, so you can compare. The straight-line version is more precise. One aspect of this chart form I dislike is the sharp angles. When there are more lines, it gets very entangled.


Making the world a richer place #onelesspie #PiDay

Xan Gregg and I have been at it for a number of years. To celebrate Pi Day today, I am ridding the world of one pie chart.

Here is a pie chart that is found on Wikipedia:


Here is the revised chart:


It's been designed to highlight certain points of interest.

I find the data quite educational. These are some other insights that are not clear from the revised chart:

  • Japan's economy is larger than Germany's
  • Russia's economy is smaller than that of Germany, Italy, India, Brazil, or South Korea
  • China and Japan combined have GDP (probably) larger than Western Europe
  • Turkey, Netherlands, Switzerland, South Africa are in the Top 20

PS. Xan re-worked a radar chart this year. (link)



Showing three dimensions using a ternary plot

Long-time reader Daniel L. isn't a fan of this chart, especially when it is made to spin, as you can see at this link:


Like other 3D charts, this one is hard to read. The vertical lines are both good and bad: they make one dimension very easy to read, but their very existence makes one realize the challenges of reading the other dimensions without guidelines.

This dataset allows me to show a ternary plot. The ternary plot is an ingenious way of putting three dimensions onto a flat surface. I have found few good uses of this chart type, though.


Let's get to the core of the issue: the analyst started with 25 skills that are frequently required by data science and analytics jobs, and his goal is to classify these skills into three groups. The underlying method used to create these groups is factor analysis.

Each dot above is a skill. The HQ of each grouping of skills (known as a factor) is a corner of the plot. The closer the dot is to the corner, the more relevant that skill is to the skill group.
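For the curious, placing a dot on a ternary plot comes down to a small coordinate conversion. The sketch below assumes each skill has three non-negative scores that can be rescaled to sum to one; raw factor loadings may need other preprocessing, and the corner assignment is my own choice.

```python
import math

def ternary_to_xy(a, b, c):
    """Map three non-negative weights to a point inside an equilateral triangle.

    Corners: A at (0, 0), B at (1, 0), C at (0.5, sqrt(3) / 2).
    """
    total = a + b + c
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    a, b, c = a / total, b / total, c / total
    x = b + 0.5 * c
    y = (math.sqrt(3) / 2) * c
    return x, y

# Hypothetical scores on three skill groups, for illustration only.
print(ternary_to_xy(0.8, 0.1, 0.1))  # lands close to corner A
print(ternary_to_xy(1, 1, 1))        # lands at the center of the triangle
```

A skill that scores equally on two groups and low on the third ends up near the midpoint of one edge, which is where the straddling skills show up.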

In the above chart, I highlighted four skills that are not clearly in one or another skill group. For example, Communication straddles the Math/Stats and Business dimensions but scores low on the Technology/Programming dimension.


The ternary plot has a few problems. Like any scatter plot, once you have 10 or more dots, it is hard to fit all the data labels. Further, the axis labels must be carefully done to help readers understand the plot. 

Before long, the chart looks very cluttered. There just isn't enough room to get all your words in. Here is another version of the same chart, with a different set of annotations.


Instead of drawing attention to those skills that have no clear home, this version of the chart focuses on the dots close to each corner.

In two cases, I classified the skills differently from the original. For example, the Machine Learning skill is part of Math/Stats on my charts but it is part of Technology/Programming on the original.

The ternary plot is interesting and unusual but is only useful in selected problems.