## The windy path to the Rugby World Cup

##### Sep 26, 2019

When I first saw the following chart, I wondered whether it is really that challenging for these eight teams to get into the Rugby World Cup, currently being played in Japan:

Another visualization of the process conveys a similar message. Both of these are uploaded to Wikipedia.

(This one hasn't been updated and still contains blank entries.)

***

What are some of the key messages one would want the dataviz to deliver?

• For the eight countries that got in (not automatically), track their paths to the World Cup. How many competitions did they have to play?
• For those countries that failed to qualify, track their paths to the point that they were stopped. How many competitions did they play?
• What is the structure of the qualification rounds? (These are organized regionally, in addition to certain playoffs across regions.)
• How many countries had a chance to win one of the eight spots?
• Within each competition, how many teams participated? Did the winner immediately qualify, or face yet another hurdle? Were the losers immediately eliminated, or were they offered another chance?

Here's my take on this chart:

## Pulling the multi-national story out, step by step

##### Sep 18, 2019

Reader Aleksander B. found this Economist chart difficult to understand.

Given the chart title, the reader is looking for a story about multinationals producing lower return on equity than local firms. The first item displayed indicates that multinationals out-performed local firms in the technology sector.

The pie charts on the right column provide additional information about the share of each sector by the type of firms. Is there a correlation between the share of multinationals, and their performance differential relative to local firms?

***

We can clean up the presentation. The first changes include using dots in place of pipes, removing the vertical gridlines, and pushing the zero line to the background:

The horizontal gridlines attached to the zero line can also be removed:

Now, we re-order the rows. Start with the aggregate "All sectors". Then, order sectors from the largest under-performance by multinationals to the smallest.
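The re-ordering rule can be sketched in a few lines of Python. This is only an illustration: the sector names other than "All sectors" and "Technology", and all of the numbers, are made up, and the differential is assumed to be the multinationals' return on equity minus that of local firms, so the most negative value is the largest under-performance.

```python
# Hypothetical rows: (sector, multinational ROE minus local ROE, in points).
# The values are placeholders, not the Economist's data.
rows = [
    ("Technology", 3.0),
    ("All sectors", 0.0),
    ("Retail", -4.0),
    ("Energy", -1.5),
]

def display_order(rows, aggregate="All sectors"):
    """Pin the aggregate row on top, then sort the rest from the
    largest under-performance (most negative) to the smallest."""
    head = [r for r in rows if r[0] == aggregate]
    rest = sorted((r for r in rows if r[0] != aggregate), key=lambda r: r[1])
    return head + rest

ordered = display_order(rows)
for sector, diff in ordered:
    print(f"{sector:12s} {diff:+.1f}")
```

Pinning the aggregate row keeps the reference point in a fixed place while the sort carries the story.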

The pie charts focus only on the share of multinationals. Taking away the remainders speeds up our perception:

Help the reader understand the data by dividing the sectors into groups, organized by the performance differential:

For what it's worth, re-sort the sectors from largest to smallest share of multinationals:

Having created groups of sectors by share of multinationals, I simplify further by showing the average pie chart within each group:

***

To recap all the edits, here is an animated gif: (if it doesn't play automatically, click on it)

***

Judging from the last graphic, I am not sure there is much correlation between share of multinationals and the performance differentials. It's interesting that in aggregate, local firms and multinationals performed the same. The average hides the variability by sector: in some sectors, local firms out-performed multinationals, as the original chart title asserted.

## It's hot even in Alaska

##### Jul 31, 2019

A Twitter user pointed to the following chart, which shows that Alaska has experienced extreme heat this summer, with the July statewide average temperature shattering the previous record:

This column chart is clear in its primary message: the red column shows that the average temperature this year is quite a bit higher than the next highest temperature, recorded in July 2004. The error bar is useful for statistically-literate people - the uncertainty is (presumably) due to measurement errors. (If a similar error bar is drawn for the July 2004 column, these bars probably overlap a bit.)

The chart violates one of the rules of making column charts - the vertical axis is truncated at 53F, thus the heights or areas of the columns shouldn't be compared. This violation was recently nominated by two dataviz bloggers when asked about "bad charts" (see here).

Now look at the horizontal axis. These are the years of the top 20 temperature records, ordered from highest to lowest. The months are almost always July except for the year 2004 when all three summer months entered the top 20. I find it hard to make sense of these dates when they are jumping around.

In the following version, I plotted the 20 temperatures on a chronological axis. Color is used to divide the 20 data points into four groups. The chart is meant to be read top to bottom.

## The Periodic Table, a challenge in information organization

##### Jun 24, 2019

Reader Chris P. points me to this article about the design of the Periodic Table. I then learned that 2019 is the “International Year of the Periodic Table,” according to the United Nations.

Here is the canonical design of the Periodic Table that science students are familiar with.

(Source: Wikipedia.)

The Periodic Table is an exercise in information organization and display. It's about adding structure to over 100 elements, so as to enhance comprehension and lookup. The canonical tabular design has columns and rows. The columns (Groups) impose a primary classification; the rows (Periods) provide a secondary classification. The elements also follow an aggregate order, which is traced by reading from top left to bottom right. The row structure makes clear the "periodicity" of the elements: the "period" of recurrence is not constant, tending to increase with the heavier elements at the bottom.

As with most complex datasets, these elements defy simple organization, due to a curse of dimensionality. The general goal is to put the similar elements closer together. Similarity can be defined in an infinite number of ways, such as chemical, physical or statistical properties. The canonical design, usually attributed to Russian chemist Mendeleev, attained its status because the community accepted his organizing principles, that is, his definitions of similarity (subsequently modified).

***

Of interest, there is a list of unsettled issues. According to Wikipedia, the most common arguments concern:

• Hydrogen: typically shown as a member of Group 1 (first column), some argue that it doesn’t belong there since it is a gas not a metal. It is sometimes placed in Group 17 (halogens), where it forms a nice “triad” with fluorine and chlorine. Other designers just float hydrogen up top.
• Helium: typically shown as a member of Group 18 (rightmost column), the noble gases, it may also be placed in Group 2.
• Mercury: usually found in Group 12, some argue that it is not a metal like cadmium and zinc.
• Group 3: other than the first two elements, there are differing views about how to place the other elements in Group 3. In particular, the pairs of lanthanum / actinium and lutetium / lawrencium are sometimes shown in the main table, sometimes shown in the ‘f-orbital’ sub-table usually placed below the main table.

***

Over the years, there have been numerous attempts to re-design the Periodic Table. Some of these are featured in the article that Chris sent me (link).

I checked how these alternative designs deal with those unsettled issues. The short answer is they don't settle the issues.

Wide Table (Janet)

The key change is to remove the separation between the main table and the f-orbital (pink) section shown below, as a "footnote". This change clarifies the periodicity of the elements, especially the elongating periods as one moves down the table. This form is also called "long step".

As a tradeoff, this table requires more space and has an awkward aspect ratio.

In this version of the wide table, the designer chooses to stack lutetium / lawrencium in Group 3 as part of the main table. Other versions place lanthanum / actinium in Group 3 as part of the main table. There are even versions that leave Group 3 with two elements.

Hydrogen, helium and mercury retain their conventional positions.

Spiral Design (Hyde)

There are many attempts at spiral designs. Here is one I found on this tumblr:

The spiral leverages the correspondence between periodic and circular. It is visually more pleasing than a tabular arrangement. But there is a tradeoff. Because of the increasing "diameter" from inner to outer rings, the inner elements are visually constrained compared to the outer ones.

In these spiral diagrams, the designer solves the aspect-ratio problem by creating local loops, sometimes called peninsulas. This is analogous to the footnote table solution, and visually distorts the longer periodicity of the heavier elements.

For Hyde's diagram, hydrogen is floated, helium is assigned to Group 2, and mercury stays in Group 12.

Racetrack

I also found this design on the same tumblr, but unattributed. It may have come from Life magazine.

It's a variant of the spiral. Instead of peninsulas, the designer squeezes the f-orbital section under Group 3, so this is analogous to the wide table solution.

The circular diagrams convey the sense of periodic return but the wide table displays the magnitudes more clearly.

This designer places hydrogen in Group 17, forming a triad with fluorine and chlorine. Helium is in Group 18 and mercury in the usual Group 12.

Cartogram (Sheehan)

This version is different.

The designer chooses a statistical property (abundance) as the primary organizing principle. The key insight is that the lighter elements in the top few rows are generally more abundant - thus more important in a sense. The cartogram reveals a key weakness of the spiral diagrams that draw the reader's attention to the outer (heavier) elements.

Because of the distorted shapes, the cartogram form obscures much of the other data. In terms of the unsettled issues, hydrogen and helium are placed in Groups 1 and 2. Mercury is in Group 12. Group 3 is squeezed inside the main table rather than shown below.

Network

The centerpiece of the article Chris sent me is a network graph.

This is a complete redesign, de-emphasizing the periodicity. It's a result of radically changing the definition of similarity between elements. One barrier when introducing entirely new displays is the tendency of readers to expect the familiar.

***

I found the following articles useful when researching this post:

The Conversation

Royal Society of Chemistry

## An exercise in decluttering

##### Apr 18, 2019

My friend Xan found the following chart by Pew hard to understand. Why is the chart so taxing to look at?

It's packing too much.

I first notice the shaded areas. Shading usually signifies "look here". On this chart, the shading is highlighting the least important part of the data. Since the top line shows applicants and the bottom line admitted students, the shaded gap displays the rejections.

The numbers printed on the chart are growth rates but they confusingly do not sync with the slopes of the lines because the vertical axis plots absolute numbers, not rates.

The vertical axis presents the total number of applicants, and the total number of admitted students, in each "bucket" of colleges, grouped by their admission rate in 2017. On the right, I drew in two lines, both growth rates of 100%, from 500K to 1 million, and from 1 to 2 million. The slopes are not the same even though the rates of growth are.

Therefore, the growth rates printed on the chart must be read as extraneous data unrelated to other parts of the chart. Attempts to connect those rates to the slopes of the corresponding lines are frustrated.
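The arithmetic behind the two lines I drew can be checked with a short Python sketch. The 15-year span is an assumption made for illustration; any span gives the same conclusion. The last function shows, as an aside, that on a log scale equal growth rates would produce equal slopes.

```python
import math

def slope_and_growth(start, end, years=15):
    """Return (slope per year on a linear axis, total growth rate)."""
    slope = (end - start) / years
    growth = (end - start) / start
    return slope, growth

def log_slope(start, end, years=15):
    """Slope per year when the vertical axis is on a log10 scale."""
    return (math.log10(end) - math.log10(start)) / years

small = slope_and_growth(500_000, 1_000_000)    # 500K to 1 million
large = slope_and_growth(1_000_000, 2_000_000)  # 1 to 2 million

# Both lines represent 100% growth, yet the second is twice as steep
# on a linear axis.
print("growth rates:", small[1], large[1])
print("linear slopes:", small[0], large[0])
print("log slopes:", log_slope(500_000, 1_000_000), log_slope(1_000_000, 2_000_000))
```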

Another lurking factor is the unequal sizes of the buckets of colleges. There are fewer than 10 colleges in the most selective bucket, and over 300 colleges in the largest bucket. We are unable to interpret properly the total number of applicants (or admissions). The quantity of applications in a bucket depends not just on the popularity of the colleges but also the number of colleges in each bucket.

The solution isn't to resize the buckets but to select a more appropriate metric: the number of applicants per enrolled student. The most selective colleges are attracting about 20 applicants per enrolled student while the least selective colleges (those that accept almost everyone) are getting 4 applicants per enrolled student, in 2017.
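Here is a minimal sketch of why the ratio is the fairer metric. The bucket names and all counts are invented for illustration (they are not Pew's figures); they are chosen only so that the ratios land near the 20 and 4 quoted above.

```python
# Hypothetical buckets: counts are placeholders, not Pew's data.
buckets = {
    "most selective":  {"colleges": 10,  "applicants": 400_000,   "enrolled": 20_000},
    "least selective": {"colleges": 300, "applicants": 1_200_000, "enrolled": 300_000},
}

def applicants_per_enrolled(bucket):
    """Normalize total applicants by enrolled students in the bucket."""
    return bucket["applicants"] / bucket["enrolled"]

ratios = {name: applicants_per_enrolled(b) for name, b in buckets.items()}

for name, b in buckets.items():
    print(f"{name:16s} total applicants: {b['applicants']:>9,}  "
          f"per enrolled student: {ratios[name]:.0f}")
```

In this made-up example the least selective bucket has three times the total applicants simply because it contains many more colleges, while the per-student ratio reverses the ranking.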

As the following chart shows, the number of applicants has doubled across the board in 15 years. This raises an intriguing question: why would a college that accepts pretty much all applicants need more applicants than enrolled students?

Depending on whether you are a school administrator or a student, a virtuous (or vicious) cycle has been realized. The four most selective groups of colleges have been able to progressively attract more applicants. Since class size did not expand appreciably, more applicants result in an ever-lower admit rate. A lower admit rate reduces the chance of getting admitted, which causes prospective students to apply to even more colleges, which further suppresses the admit rate.

## Appreciating population mountains

##### Dec 11, 2018

Tim Harford tweeted about a nice project visualizing the world's distribution of population, and wondered why he likes it so much.

That's the question we'd love to answer on this blog! Charts make us emotional - some we love, some we hate. We like to think that designers can control those emotions, via design choices.

I happen to like the "Population Mountains" project as well. It fits nicely into a geography class.

1. Chart Form

The key feature is to adopt a 3D column chart form, instead of the more conventional choropleth or dot density. The use of columns is particularly effective here because it is natural - cities do tend to expand vertically upwards when ever more people cram into the same amount of surface area.

Imagine the same chart form is used to plot the number of swimming pools per square meter. It just doesn't make the same impact.

2. Color Scale

The designer also made judicious choices on the color scale. The discrete, 5-color scheme is a clear winner over the more conventional, continuous color scale. This was evidently a deliberate choice, because most software by default uses a continuous color scale for continuous data (population density per square meter).

Also, notice that the color intervals in the 5-color scale are not set uniformly because there is a power law in effect - the dense areas are orders of magnitude denser than the sparsely populated areas, and most locations are low-density.

These decisions have a strong influence on the perception of the information: they affect the heights of the peaks, the contrasts between the highs and lows, etc. They also inject a degree of subjectivity into the data visualization exercise that some find offensive.
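To see why uniform intervals fail on power-law data, consider this small Python sketch. The density range (10 to 1,000,000) and the log-spaced break points are assumptions for illustration only; the project's actual break points are not known to me.

```python
import bisect

def uniform_edges(low, high, n_bins):
    """Interior break points spaced evenly between low and high."""
    step = (high - low) / n_bins
    return [low + step * i for i in range(1, n_bins)]

def log_spaced_edges(low, high, n_bins):
    """Interior break points spaced evenly on a log scale."""
    ratio = (high / low) ** (1 / n_bins)
    return [low * ratio ** i for i in range(1, n_bins)]

def color_class(value, edges):
    """Index (0 to 4) of the color assigned to a value."""
    return bisect.bisect_right(edges, value)

# Hypothetical densities spanning several orders of magnitude.
densities = [50, 500, 5_000, 50_000, 500_000]

uniform = [color_class(d, uniform_edges(10, 1_000_000, 5)) for d in densities]
logged = [color_class(d, log_spaced_edges(10, 1_000_000, 5)) for d in densities]

print("uniform intervals:", uniform)
print("log intervals:    ", logged)
```

With uniform intervals, nearly every location collapses into the lowest color; unequal intervals spread the five colors across the range.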

3. Background

The background map is stripped of unnecessary details so that the attention is focused on these "population mountains". No unnecessary labels, roads, relief, etc. This demonstrates an acute awareness of foreground/background issues.

4. Insights on the "shape" of the data

The article makes the following comment:

What stands out is each city’s form, a unique mountain that might be like the steep peaks of lower Manhattan or the sprawling hills of suburban Atlanta. When I first saw a city in 3D, I had a feel for its population size that I had never experienced before.

I'd strike out population size and replace it with population density. In theory, the sum of the volumes of the columns over any given surface area gives you the "population size" but given the fluctuating heights of these columns, and the different surface areas (sprawls) of different cities, it is an Olympian task to estimate the volumes of the population mountains!

The more salient features of these mountains, most easily felt by readers, are the heights of the peak columns, the sprawl of the cities, and the general form of the mass of columns. The volume of the mountain is one of the tougher things to see. Similarly, the taller 3D columns hide what's behind them, and you'd need to spin and rotate the map to really get a good feel.

Here is the contrast between Paris and London, with comparable population sizes. You can see that the population in Paris (and by extension, France) is much more concentrated than in the U.K. This difference is a surprise to me.

5. Sourcing

Some of the other mountains, especially those in India and China, look a bit odd to me, which leads me to wonder about the source of the data. This project has an excellent set of footnotes that not only point to the source of the data but also discuss its limitations, including the possibility of inaccuracies in places like India and China.

***

Check out Population Mountains!

## The merry-go-round of investment bankers

##### Nov 29, 2018

Here is the start of my blog post about the chart I teased the other day:

Today's post deals with the following chart, which appeared recently at Business Insider (hat tip: my sister).

It's immediately obvious that this chart requires a heroic effort to decipher. The question shown in the chart title "How many senior investment bankers left their firms?" is the easiest to answer, as the designer places the number of exits in the central circle of each plot relating to a top-tier investment bank (aka "featured bank"). Note that the visual design plays no role in delivering the message, as readers just scan the data from those circles.

Anyone persistent enough to explore the rest of the chart will eventually discover these features...

***

The entire post including an alternative view of the dataset is a guest blog at the JMP Blog here. This is a situation in which plotting everything will make an unreadable chart, and the designer has to think hard about what s/he is really trying to accomplish.

## Message-first visualization

##### Nov 12, 2018

Sneaky Pete via Twitter sent me the following chart, asking for guidance:

This is a pretty standard dataset, frequently used in industry. It shows a breakdown of a company's profit by business unit, here classified by "state". The profit projection for the next year is measured on both absolute dollar terms and year-on-year growth.

Since those two metrics have completely different scales, in both magnitude and unit, it is common to use dual axes. In the case of the Economist, they don't use dual axes; they usually just print the second data series in its own column.

***

I first recommended looking at the scatter plot to see if there are any bivariate patterns. In this case, the scatter does not provide much insight.

From there, I looked at the data again, and ended up with the following pair of bump charts (slopegraphs):

A key principle I used is message-first. That is to say, the designer should figure out what message s/he wants to convey via the visualization, and then design the visualization to convey that message.

A second key observation is that the business units are divided into two groups, the two large states (A and F) and the small states (B to E). This is a Pareto principle that very often applies to real-world businesses, i.e. a small number of entities contribute most of the revenues (or profits). It is very likely that these businesses are structured to serve the large and small states differently, and so the separation onto two charts mirrors the internal structure.

Then, within each chart, there is a message. For the large states, it looks like state F is projected to overtake state A next year. That is a big deal because we're talking about the largest unit in the entire company.

For the small states, the standout is state B, decidedly more rosy than the other three small states with similar projected growth rates.

Note also I chose to highlight the actual dollar profits, letting the growth rates be implied in the slopes. Usually, executives are much more concerned about hitting a dollar value than a growth rate target. But that, of course, depends on your management's preference.

## Digital revolution in China: two visual takes

##### Jun 13, 2018

The following map accompanied an article in the Economist about China's drive to create a "digital silkroad," roughly defined as making a Silicon Valley.

The two variables plotted are the wealth of each province (measured by GDP per capita) and the level of Internet penetration. The designer made the following choices:

• GDP per capita is presented with less precision than Internet penetration. The former is grouped into five large categories while the latter is given as a percentage to one decimal place.
• The visual design favors GDP per capita, which is encoded as the shade of color of each province. The Internet penetration data appear to have been added on as an afterthought.

If we apply the self-sufficiency test (i.e. by removing the printed data from the chart), it's immediately clear that the visual elements convey zero information about Internet penetration. This is a serious problem for a chart about the "digital silkroad"!

***

If those two variables are chosen, it would seem appropriate to convey to readers the correlation between the two variables. The following sketch is focused on surfacing the correlation.

(Click on the image to see it in full.) Here is the top of the graphic:

The individual maps are not strictly necessary. Just placing provincial names onto the grid is enough, because regional pattern isn't salient here.

The Internet penetration data were grouped into five categories as well, putting them on an equal footing with GDP per capita.
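The grouping step might look like the following Python sketch. Everything specific here is an assumption: the province names are placeholders, the GDP levels and penetration percentages are invented, and the equal-width bins between 30% and 80% are just one way to form five categories.

```python
from collections import defaultdict

def five_level(value, low=30.0, high=80.0):
    """Map a percentage to a category from 1 to 5 using equal-width bins
    between low and high (assumed bounds), clamping values outside them."""
    width = (high - low) / 5
    level = int((value - low) / width) + 1
    return max(1, min(5, level))

# Hypothetical rows: (province, GDP-per-capita category 1-5, Internet penetration %).
provinces = [
    ("Province A", 5, 77.8),
    ("Province B", 1, 39.9),
    ("Province C", 3, 52.0),
]

# Each province lands in a cell of a 5-by-5 grid, with both variables
# measured at the same level of precision.
grid = defaultdict(list)
for name, gdp_level, pct in provinces:
    grid[(gdp_level, five_level(pct))].append(name)

for cell in sorted(grid):
    print(cell, grid[cell])
```

Provinces near the diagonal of the grid support a positive correlation; those far from it are the exceptions worth labeling.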

## Discoloring the chart to re-discover its plot

##### Apr 05, 2018

Today's chart comes from Pew Research Center, and the big question is why the colors?

The data show the age distributions of followers of different religions. It's a stacked bar chart, in which the ages have been grouped into the young (under 15), the old (60 plus) and everyone else. Five religions are afforded their own bars while "folk" religions are grouped as one, as are "other" religions. There is even a bar for the unaffiliated. "World" presumably is the aggregate of all the other bars, weighted by the popularity of each religion group.

So far so good. But what is it that demands 9 colors, and 27 total shades? In other words, one shade for every data point on this chart.

Here is a more restrained view:

***

Let's follow the designer's various decisions. The choice of those age groups indicates that the story is really happening at the "margins": Muslims and Hindus have higher proportions of younger followers while Jews and Buddhists have higher concentrations of older followers.

Therein lies the problem. Because of the lengths, their central locations, and the tints, the middle section of each bar is the most eye-catching: the reader is glancing at the wrong part of the chart.

So, let me fix this by re-ordering the three panels:

Is there really a need to draw those gray bars? The middle age group (a catch-all) only exists to assure readers that everyone who's supposed to be included has been included. Why plot it?

The above chart says "trust me, what isn't drawn here constitutes the remaining population, and the whole adds to 100%."

***

Another issue with these charts, exacerbated by inflexible software defaults, is the forced choice of imbuing one variable with a super status above the others. In the Pew chart, the rows are ordered by decreasing proportion of the young age group, except for the "everyone" group pinned as the bottom row. Therefore, the green bars (old age group) are not in any particular order, making their pattern much harder to comprehend.

In the final version, I break the need to keep bars of the same religion on the same row:

Five colors are used. Three of them are used to cluster similar religions: Muslims and Hindus (in blue) have higher proportions of the young compared to the world average (gray) while the religions painted in green have higher proportions of the old. Christians (in orange) are unusual in that the proportions are higher than average in both young and old age groups. Everyone and unaffiliated are given separate colors.

The colors here serve two purposes: connecting the two panels, and revealing the cluster structure.