Unlocking the secrets of a marvellous data visualization

The graphics team at my hometown paper, SCMP, has developed a formidable reputation in data visualization, and I lapped up every drop of goodness in this beautiful graphic showing how the coronavirus spread around Hong Kong (in the first wave in April). Marcelo uploaded an image of the printed version to his Twitter. This graphic occupied the entire back page of that day's paper.

An online version of the chart is found here.

The data graphic is a masterclass in organizing data. While it looks complicated, I had no problem unpacking the different layers.

Cases were divided into imported cases (people returning to Hong Kong) and local cases. A small number of cases are considered in-betweens.


The two major classes then occupy one half page each. I first looked at the top half, where my attention was drawn to the thickest flows. The majority of imported cases arrived from the U.K., and most of those were returning students. The U.S. is the next largest source of imported cases. The flows are carefully ordered by continent, with the Americas on the left, followed by Europe, Middle East, Africa, and Asia.


Where there are interesting back stories, the flow blossoms into a flower. An annotation explains the cluster of cases. Each anther represents a case. Eight people caught the virus while touring Bolivia together.


One reads the local cases in the same way. Instead of flowers, think of roots. The biggest cluster by far was a band that played at clubs in three different parts of the city, infecting a total of 72 people.


Everything is understood immediately, without a need to read text or refer to legends. The visual elements carry that kind of power.


This data graphic presents a perfect amalgam of art and science. For a flow chart, the data are encoded in the relative thickness of the lines. This leaves two unused dimensions of these lines: the curvature and lengths. The order of the countries and regions takes up the horizontal axis, but the vertical axis is free. Unshackled from the data, the designer introduced curves into the lines, varied their lengths, and dispersed their endings around the white space in an artistic manner.

The flowers/roots present another opportunity for creativity. The only data constraint is the number of cases in a cluster. The positions of the dots, and the shape of the lines leading to the dots are part of the playground.

What's more, the data visualization is a powerful reminder of the benefits of testing and contact tracing. The band cluster led to the closure of bars, which helped slow the spread of the coronavirus. 


Marketers want millennials to know they're millennials

When I posted about the lack of a standard definition of "millennials", Dean Eckles tweeted about the arbitrary division of age into generational categories. His view is further reinforced by the following chart, courtesy of PewResearch by way of MarketingCharts.com.


Pew asked people what generation they belong to. The number of people who fail to place themselves in the right category is remarkable. One way to interpret this finding is that these are marketing categories created by the marketing profession. We learned in my other post that even people who use the term millennial do not have a consensus definition of it. Perhaps the 8 percent of "millennials" who identify as "boomers" are casting a protest vote!

The chart is best read row by row - the use of stacked bar charts provides a clue. Forty percent of millennials identified as millennials, which leaves sixty percent identifying as some other generation (with about 5 percent indicating "other" responses). 
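Reading row by row amounts to normalizing each row of a cross-tabulation so that it sums to 100 percent. Here is a minimal sketch of that idea. The counts are invented for illustration (they are not Pew's data, apart from echoing the 40 and 8 percent figures mentioned above), and the function name is hypothetical.

```python
# Rows are the true generations; columns are the self-identified labels.
# All counts are made up for illustration and are NOT Pew's data.
counts = {
    "Millennial": {"Millennial": 40, "Gen X": 33, "Boomer": 8, "Other": 19},
    "Gen X": {"Millennial": 5, "Gen X": 58, "Boomer": 15, "Other": 22},
}

def row_percentages(table):
    """Convert each row of counts into percentages that sum to 100 within the row."""
    result = {}
    for true_gen, row in table.items():
        total = sum(row.values())
        result[true_gen] = {label: 100 * n / total for label, n in row.items()}
    return result

pct = row_percentages(counts)

# The diagonal: the share of each generation that picks its own label.
diagonal = {gen: pct[gen][gen] for gen in pct}
print(diagonal)
```

A column-by-column reading would require normalizing by column totals instead, which answers a different question.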

While this chart is not pretty, and may confuse some readers, it actually shows a healthy degree of analytical thinking. Arranging for the row-first interpretation is a good start. The designer also realizes the importance of the diagonal entries - what proportion of each generation self-identify as a member of that generation. Dotted borders are deployed to draw eyes to the diagonal.


The design doesn't do full justice to the analytical intelligence. Despite the use of the bar chart form, readers may be tempted to read column by column due to the color scheme. The chart doesn't have an easy column-by-column interpretation.

It's not obvious which axis has the true category and which, the self-identified category. The designer adds a hint in the sub-title to counteract this problem.

Finally, the dotted borders are no match for the differential colors. So a key message of the chart is buried.

Here is a revised chart, using a grouped bar chart format:



In a Trifecta checkup (link), the original chart is a Type V chart. It addresses a popular, pertinent question, and it shows mature analytical thinking but the visual design does not do full justice to the data story.



The windy path to the Rugby World Cup

When I first saw the following chart, I wondered whether it is really that challenging for these eight teams to get into the Rugby World Cup, currently being played in Japan:


Another visualization of the process conveys a similar message. Both of these are uploaded to Wikipedia.


(This one hasn't been updated and still contains blank entries.)


What are some of the key messages one would want the dataviz to deliver?

  • For the eight countries that got in (not automatically), track their paths to the World Cup. How many competitions did they have to play?
  • For those countries that failed to qualify, track their paths to the point that they were stopped. How many competitions did they play?
  • What is the structure of the qualification rounds? (These are organized regionally, in addition to certain playoffs across regions.)
  • How many countries had a chance to win one of the eight spots?
  • Within each competition, how many teams participated? Did the winner immediately qualify, or face yet another hurdle? Were the losers immediately eliminated, or were they offered another chance?

Here's my take on this chart:



Pulling the multi-national story out, step by step

Reader Aleksander B. found this Economist chart difficult to understand.


Given the chart title, the reader is looking for a story about multinationals producing lower return on equity than local firms. The first item displayed indicates that multinationals out-performed local firms in the technology sector.

The pie charts on the right column provide additional information about the share of each sector by the type of firms. Is there a correlation between the share of multinationals, and their performance differential relative to local firms?


We can clean up the presentation. The first changes include using dots in place of pipes, removing the vertical gridlines, and pushing the zero line to the background:


The horizontal gridlines attached to the zero line can also be removed:


Now, we re-order the rows. Start with the aggregate "All sectors". Then, order sectors from the largest under-performance by multinationals to the smallest.


The pie charts focus only on the share of multinationals. Taking away the remainders speeds up our perception:


Help the reader understand the data by dividing the sectors into groups, organized by the performance differential:


For what it's worth, re-sort the sectors from largest to smallest share of multinationals:


Having created groups of sectors by share of multinationals, I simplify further by showing the average pie chart within each group:



To recap all the edits, here is an animated gif: (if it doesn't play automatically, click on it)



Judging from the last graphic, I am not sure there is much correlation between share of multinationals and the performance differentials. It's interesting that in aggregate, local firms and multinationals performed the same. The average hides the variability by sector: in some sectors, local firms out-performed multinationals, as the original chart title asserted.

It's hot even in Alaska

A Twitter user pointed to the following chart, which shows that Alaska has experienced extreme heat this summer, with the July statewide average temperature shattering the previous record:


This column chart is clear in its primary message: the red column shows that the average temperature this year is quite a bit higher than the next highest temperature, recorded in July 2004. The error bar is useful for statistically-literate people - the uncertainty is (presumably) due to measurement errors. (If a similar error bar is drawn for the July 2004 column, these bars probably overlap a bit.)

The chart violates one of the rules of making column charts - the vertical axis is truncated at 53F, thus the heights or areas of the columns shouldn't be compared. This violation was recently nominated by two dataviz bloggers when asked about "bad charts" (see here).
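To see why a truncated axis distorts comparisons of column heights, here is a small sketch. The two temperatures are hypothetical values chosen for illustration, not readings taken from the chart; only the 53F baseline comes from the discussion above.

```python
def drawn_height_ratio(a, b, baseline):
    """Ratio of two column heights as drawn when the axis starts at `baseline`."""
    return (a - baseline) / (b - baseline)

# Hypothetical temperatures in Fahrenheit, invented for illustration only.
record_year = 58.0
runner_up = 56.0

# Heights as drawn on an axis truncated at 53F versus an axis starting at 0F.
truncated = drawn_height_ratio(record_year, runner_up, baseline=53.0)
zero_based = drawn_height_ratio(record_year, runner_up, baseline=0.0)

print(round(truncated, 2), round(zero_based, 2))
```

With these made-up numbers, the record column is drawn about two-thirds taller than its neighbor on the truncated axis, while the two values differ by only a few percent when measured from zero.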

Now look at the horizontal axis. These are the years of the top 20 temperature records, ordered from highest to lowest. The months are almost always July except for the year 2004 when all three summer months entered the top 20. I find it hard to make sense of these dates when they are jumping around.

In the following version, I plotted the 20 temperatures on a chronological axis. Color is used to divide the 20 data points into four groups. The chart is meant to be read top to bottom. 



The Periodic Table, a challenge in information organization

Reader Chris P. points me to this article about the design of the Periodic Table. I then learned that 2019 is the “International Year of the Periodic Table,” according to the United Nations.

Here is the canonical design of the Periodic Table that science students are familiar with.


(Source: Wikipedia.)

The Periodic Table is an exercise in information organization and display. It's about adding structure to over 100 elements, so as to enhance comprehension and lookup. The canonical tabular design has columns and rows. The columns (Groups) impose a primary classification; the rows (Periods) provide a secondary classification. The elements also follow an aggregate order, which is traced by reading from top left to bottom right. The row structure makes clear the "periodicity" of the elements: the "period" of recurrence is not constant, tending to increase with the heavier elements at the bottom.

As with most complex datasets, these elements defy simple organization, due to a curse of dimensionality. The general goal is to put the similar elements closer together. Similarity can be defined in an infinite number of ways, such as chemical, physical or statistical properties. The canonical design, usually attributed to Russian chemist Mendeleev, attained its status because the community accepted his organizing principles, that is, his definitions of similarity (subsequently modified).


Of interest, there is a list of unsettled issues. According to Wikipedia, the most common arguments concern:

  • Hydrogen: typically shown as a member of Group 1 (first column), some argue that it doesn’t belong there since it is a gas not a metal. It is sometimes placed in Group 17 (halogens), where it forms a nice “triad” with fluorine and chlorine. Other designers just float hydrogen up top.
  • Helium: typically shown as a member of Group 18 (rightmost column), the noble gases, it may also be placed in Group 2.
  • Mercury: usually found in Group 12, some argue that it is not a metal like cadmium and zinc.
  • Group 3: other than the first two elements, opinions differ on how to place the other elements in Group 3. In particular, the pairs of lanthanum / actinium and lutetium / lawrencium are sometimes shown in the main table, sometimes shown in the ‘f-orbital’ sub-table usually placed below the main table.


Over the years, there have been numerous attempts to re-design the Periodic table. Some of these are featured in the article that Chris sent me (link).

I checked how these alternative designs deal with those unsettled issues. The short answer is they don't settle the issues.

Wide Table (Janet)

The key change is to remove the separation between the main table and the f-orbital (pink) section shown below, as a "footnote". This change clarifies the periodicity of the elements, especially the elongating periods as one moves down the table. This form is also called "long step".


As a tradeoff, this table requires more space and has an awkward aspect ratio.

In this version of the wide table, the designer chooses to stack lutetium / lawrencium in Group 3 as part of the main table. Other versions place lanthanum / actinium in Group 3 as part of the main table. There are even versions that leave Group 3 with two elements.

Hydrogen, helium and mercury retain their conventional positions.


Spiral Design (Hyde)

There are many attempts at spiral designs. Here is one I found on this tumblr:


The spiral leverages the correspondence between periodic and circular. It is visually more pleasing than a tabular arrangement. But there is a tradeoff. Because of the increasing "diameter" from inner to outer rings, the inner elements are visually constrained compared to the outer ones.

In these spiral diagrams, the designer solves the aspect-ratio problem by creating local loops, sometimes called peninsulas. This is analogous to the footnote table solution, and visually distorts the longer periodicity of the heavier elements.

For Hyde's diagram, hydrogen is floated, helium is assigned to Group 2, and mercury stays in Group 12.



I also found this design on the same tumblr, but unattributed. It may have come from Life magazine.


It's a variant of the spiral. Instead of peninsulas, the designer squeezes the f-orbital section under Group 3, so this is analogous to the wide table solution.

The circular diagrams convey the sense of periodic return but the wide table displays the magnitudes more clearly.

This designer places hydrogen in Group 17, forming a triad with fluorine and chlorine. Helium is in Group 18 and mercury in the usual Group 12.


Cartogram (Sheehan)

This version is different.


The designer chooses a statistical property (abundance) as the primary organizing principle. The key insight is that the lighter elements in the top few rows are generally more abundant - thus more important in a sense. The cartogram reveals a key weakness of the spiral diagrams that draw the reader's attention to the outer (heavier) elements.

Because of the distorted shapes, the cartogram form obscures much of the other data. In terms of the unsettled issues, hydrogen and helium are placed in Groups 1 and 2. Mercury is in Group 12. Group 3 is squeezed inside the main table rather than shown below.



The centerpiece of the article Chris sent me is a network graph.


This is a complete redesign, de-emphasizing the periodicity. It's a result of radically changing the definition of similarity between elements. One barrier when introducing entirely new displays is the tendency of readers to expect the familiar.


I found the following articles useful when researching this post:

The Conversation

Royal Society of Chemistry


An exercise in decluttering

My friend Xan found the following chart by Pew hard to understand. Why is the chart so taxing to look at? 


It's packing too much.

I first notice the shaded areas. Shading usually signifies "look here". On this chart, the shading is highlighting the least important part of the data. Since the top line shows applicants and the bottom line admitted students, the shaded gap displays the rejections.

The numbers printed on the chart are growth rates but they confusingly do not sync with the slopes of the lines because the vertical axis plots absolute numbers, not rates. 

The vertical axis presents the total number of applicants, and the total number of admitted students, in each "bucket" of colleges, grouped by their admission rate in 2017. On the right, I drew in two lines, both growth rates of 100%, from 500K to 1 million, and from 1 to 2 million. The slopes are not the same even though the rates of growth are.
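The arithmetic behind the two lines can be checked with a short sketch. The function names are hypothetical, and the horizontal span is set to one unit since only the comparison between the two lines matters. The last step, which moves to a log scale, is my own addition to show one setting in which equal growth rates do produce equal slopes.

```python
import math

def slope(start, end, span=1.0):
    """Slope of the line on a linear vertical axis over a horizontal span."""
    return (end - start) / span

def growth_rate(start, end):
    """Relative growth from start to end (1.0 means 100%)."""
    return (end - start) / start

line_a = (500_000, 1_000_000)
line_b = (1_000_000, 2_000_000)

# Same growth rate, different slopes on a linear axis.
print(growth_rate(*line_a), growth_rate(*line_b))
print(slope(*line_a), slope(*line_b))

# On a log axis, the vertical rise depends only on the ratio end/start,
# so the two lines would be parallel.
rise_a = math.log10(line_a[1]) - math.log10(line_a[0])
rise_b = math.log10(line_b[1]) - math.log10(line_b[0])
print(math.isclose(rise_a, rise_b))
```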

Therefore, the growth rates printed on the chart must be read as extraneous data unrelated to other parts of the chart. Attempts to connect those rates to the slopes of the corresponding lines are frustrated.

Another lurking factor is the unequal sizes of the buckets of colleges. There are fewer than 10 colleges in the most selective bucket, and over 300 colleges in the largest bucket. We are unable to interpret properly the total number of applicants (or admissions). The quantity of applications in a bucket depends not just on the popularity of the colleges but also the number of colleges in each bucket.

The solution isn't to resize the buckets but to select a more appropriate metric: the number of applicants per enrolled student. The most selective colleges are attracting about 20 applicants per enrolled student while the least selective colleges (those that accept almost everyone) are getting 4 applicants per enrolled student, in 2017.
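The metric is a simple ratio, and it removes the dependence on how many colleges sit in each bucket. A minimal sketch follows. The bucket totals are hypothetical numbers, chosen only so that the ratios match the roughly 20 and 4 quoted above; they are not Pew's figures.

```python
# Hypothetical bucket totals, invented so the ratios match the text (about 20 and 4).
buckets = {
    "most selective": {"applicants": 400_000, "enrolled": 20_000},
    "least selective": {"applicants": 800_000, "enrolled": 200_000},
}

def applicants_per_enrolled(bucket):
    """Number of applicants for every enrolled student in a bucket of colleges."""
    return bucket["applicants"] / bucket["enrolled"]

ratios = {name: applicants_per_enrolled(b) for name, b in buckets.items()}
print(ratios)
```

In this made-up example, the least selective bucket has twice as many total applicants, yet its ratio is one-fifth that of the most selective bucket, which is why the totals alone are hard to interpret.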

As the following chart shows, the number of applicants has doubled across the board in 15 years. This raises an intriguing question: why would a college that accepts pretty much all applicants need more applicants than enrolled students?


Depending on whether you are a school administrator or a student, a virtuous (or vicious) cycle has been realized. The four most selective groups of colleges have been able to attract progressively more applicants. Since class size did not expand appreciably, more applicants result in an ever-lower admit rate. A lower admit rate reduces the chance of getting admitted, which causes prospective students to apply to even more colleges, which further suppresses the admit rate.




Appreciating population mountains

Tim Harford tweeted about a nice project visualizing the world's population distribution, and wondered why he likes it so much.

That's the question we'd love to answer on this blog! Charts make us emotional - some we love, some we hate. We like to think that designers can control those emotions, via design choices.

I happen to like the "Population Mountains" project as well. It fits nicely into a geography class.

1. Chart Form

The key feature is to adopt a 3D column chart form, instead of the more conventional choropleth or dot density. The use of columns is particularly effective here because it is natural - cities do tend to expand vertically upwards when ever more people cram into the same amount of surface area.


Imagine the same chart form is used to plot the number of swimming pools per square meter. It just doesn't make the same impact. 

2. Color Scale

The designer also made judicious choices on the color scale. The discrete, 5-color scheme is a clear winner over the more conventional, continuous color scale. The designer made a deliberate choice because most software by default uses a continuous color scale for continuous data (population density per square meter).


Also, notice that the color intervals in the 5-color scale are not set uniformly because there is a power law in effect - the dense areas are orders of magnitude denser than the sparsely populated areas, and most locations are low-density.
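Here is a sketch of why non-uniform intervals matter for this kind of data. The break points, the densities and the helper name are all hypothetical; the project's actual breaks are not reproduced here. The sketch compares five classes with breaks spaced by powers of ten against five equal-width classes.

```python
import bisect

# Hypothetical break points (arbitrary density units); 4 breaks give 5 classes.
log_breaks = [10, 100, 1_000, 10_000]
uniform_breaks = [10_000, 20_000, 30_000, 40_000]

def color_class(density, breaks):
    """Index (0 to 4) of the color class that a density value falls into."""
    return bisect.bisect_right(breaks, density)

# Made-up densities spanning several orders of magnitude.
densities = [5, 50, 500, 5_000, 50_000]

log_classes = [color_class(d, log_breaks) for d in densities]
uniform_classes = [color_class(d, uniform_breaks) for d in densities]

print(log_classes)
print(uniform_classes)
```

With equal-width intervals, every location except the densest lands in the lowest class, so the map would show almost no variation. The breaks spaced by powers of ten spread the same values across all five colors.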

These decisions have a strong influence on the perception of the information: it affects the heights of the peaks, the contrasts between the highs and lows, etc. It also injects a degree of subjectivity into the data visualization exercise that some find offensive.

3. Background

The background map is stripped of unnecessary details so that the attention is focused on these "population mountains". No unnecessary labels, roads, relief, etc. This demonstrates an acute awareness of foreground/background issues.

4. Insights on the "shape" of the data 

The article makes the following comment:

What stands out is each city’s form, a unique mountain that might be like the steep peaks of lower Manhattan or the sprawling hills of suburban Atlanta. When I first saw a city in 3D, I had a feel for its population size that I had never experienced before.

I'd strike out population size and replace it with population density. In theory, the sum of the volumes of the columns in any given surface area gives you the "population size" but given the fluctuating heights of these columns, and the different surface areas (sprawls) of different cities, it is an Olympian task to estimate the volumes of the population mountains!

The more salient features of these mountains, most easily felt by readers, are the heights of the peak columns, the sprawl of the cities, and the general form of the mass of columns. The volume of the mountain is one of the tougher things to see. Similarly, the taller 3D columns hide what's behind them, and you'd need to spin and rotate the map to really get a good feel.

Here is the contrast between Paris and London, with comparable population sizes. You can see that the population in Paris (and by extension, France) is much more concentrated than in the U.K. This difference is a surprise to me.


5. Sourcing

Some of the other mountains, especially those in India and China, look a bit odd to me, which leads me to wonder about the source of the data. This project has a great set of footnotes that not only point to the source of the data but also discuss its limitations, including the possibility of inaccuracies in places like India and China.


Check out Population Mountains!






The merry-go-round of investment bankers

Here is the start of my blog post about the chart I teased the other day:



Today's post deals with the following chart, which appeared recently at Business Insider (hat tip: my sister).

It's immediately obvious that this chart requires a heroic effort to decipher. The question shown in the chart title "How many senior investment bankers left their firms?" is the easiest to answer, as the designer places the number of exits in the central circle of each plot relating to a top-tier investment bank (aka "featured bank"). Note that the visual design plays no role in delivering the message, as readers just scan the data from those circles.

Anyone persistent enough to explore the rest of the chart will eventually discover these features...


The entire post including an alternative view of the dataset is a guest blog at the JMP Blog here. This is a situation in which plotting everything will make an unreadable chart, and the designer has to think hard about what s/he is really trying to accomplish.

Message-first visualization

Sneaky Pete via Twitter sent me the following chart, asking for guidance:


This is a pretty standard dataset, frequently used in industry. It shows a breakdown of a company's profit by business unit, here classified by "state". The profit projection for the next year is measured on both absolute dollar terms and year-on-year growth.

Since those two metrics have completely different scales, in both magnitude and unit, it is common to use dual axes. In the case of the Economist, they don't use dual axes; they usually just print the second data series in its own column.


I first recommended looking at the scatter plot to see if there are any bivariate patterns. In this case, the scatter provides little insight.

From there, I looked at the data again, and ended up with the following pair of bumps charts (slopegraphs):


A key principle I used is message-first. That is to say, the designer should figure out what message s/he wants to convey via the visualization, and then design the visualization to convey that message.

A second key observation is that the business units are divided into two groups, the two large states (A and F) and the small states (B to E). This is a Pareto principle that very often applies to real-world businesses, i.e. a small number of entities contribute most of the revenues (or profits). It is very likely that these businesses are structured to serve the large and small states differently, and so the separation onto two charts mirrors the internal structure.

Then, within each chart, there is a message. For the large states, it looks like state F is projected to overtake state A next year. That is a big deal because we're talking about the largest unit in the entire company.

For the small states, the standout is state B, decidedly more rosy than the other three small states with similar projected growth rates.

Note also I chose to highlight the actual dollar profits, letting the growth rates be implied in the slopes. Usually, executives are much more concerned about hitting a dollar value than a growth rate target. But that, of course, depends on your management's preference.