Pulling the multi-national story out, step by step

Reader Aleksander B. found this Economist chart difficult to understand.


Given the chart title, the reader is looking for a story about multinationals producing lower return on equity than local firms. The first item displayed indicates that multinationals out-performed local firms in the technology sector.

The pie charts on the right column provide additional information about the share of each sector by the type of firms. Is there a correlation between the share of multinationals, and their performance differential relative to local firms?


We can clean up the presentation. The first changes include using dots in place of pipes, removing the vertical gridlines, and pushing the zero line to the background:


The horizontal gridlines attached to the zero line can also be removed:


Now, we re-order the rows. Start with the aggregate "All sectors". Then, order sectors from the largest under-performance by multinationals to the smallest.


The pie charts focus only on the share of multinationals. Taking away the remainders speeds up our perception:


Help the reader understand the data by dividing the sectors into groups, organized by the performance differential:


For what it's worth, re-sort the sectors from largest to smallest share of multinationals:


Having created groups of sectors by share of multinationals, I simplify further by showing the average pie chart within each group:



To recap all the edits, here is an animated gif: (if it doesn't play automatically, click on it)



Judging from the last graphic, I am not sure there is much correlation between share of multinationals and the performance differentials. It's interesting that in aggregate, local firms and multinationals performed the same. The average hides the variability by sector: in some sectors, local firms out-performed multinationals, as the original chart title asserted.
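
For readers who want to put a number on that impression, here is a minimal sketch of a correlation check. The sector values are hypothetical placeholders that I invented for illustration, not numbers read off the Economist chart, and the `pearson` helper is my own.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values, NOT taken from the chart:
# share of multinationals in each sector (percent), and
# multinational return on equity minus local-firm return on equity (points).
share = [80, 65, 50, 40, 30, 20]
differential = [2.0, -3.0, 1.0, -1.5, 0.5, -2.0]

r = pearson(share, differential)
print(f"correlation between share and differential: {r:.2f}")
```

A value near zero would agree with the visual impression; with so few sectors, any such number should be read loosely.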

This Wimbledon beauty will be ageless


This Financial Times chart paints the picture of the emerging trend in Wimbledon men’s tennis: the average age of players has been rising, and hit 30 years old for the first time ever in 2019.

The chart works brilliantly. Let's look at the design decisions that contributed to its success.

The chart contains a good amount of data and the presentation is carefully layered, with the layers nicely tied to some visual cues.

Readers are drawn immediately to the average line, which conveys the key statistical finding. The blue dot reinforces the key message, aided by the dotted line drawn at 30 years old. The single data label that shows a number also highlights the message.

Next, readers may notice the large font that is applied to selected players. This device draws attention to the human stories behind the dry data. Knowledgeable fans may recall fondly when Borg, Becker and Chang burst onto the scene as teenagers.


Then, readers may pick up on the ticker-tape data that display the spread of ages of Wimbledon players in any given year. There is some shading involved, not clearly explained, but we surmise that it illustrates the range of ages of most of the contestants. In a sense, the range of probable ages and the average age tell the same story. The current trend of rising ages began around 2005.


Finally, a key data processing decision is disclosed in the chart header and sub-header. The chart only plots the players who reached the fourth round (16). Like most decisions involved in data analysis, this choice has both desirable and undesirable effects. I like it because it thins out the data. The chart would have appeared more cluttered otherwise, in a negative way.

The removal of players eliminated in the early rounds limits the conclusion that one can draw from the chart. We are tempted to generalize the finding, saying that the average men’s player has increased in age – that was what I said in the first paragraph. Thinking about that for a second, I am not so sure the general statement is valid.

The overall field might have gone younger or not grown older, even as the older players assert their presence in the tournament. (This article provides side evidence that the conjecture might be true: the author looked at the average age of players in the top 100 ATP ranking versus top 1000, and learned that the average age of the top 1000 has barely shifted while the top 100 players have definitely grown older.)
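
To see how the top of the field can age while the whole field barely moves, here is a small sketch. The ages are made up for illustration (they are not ATP or Wimbledon data), and `mean_age_of_top` is a hypothetical helper.

```python
from statistics import mean

def mean_age_of_top(ages_by_rank, n):
    """Average age of the n best-ranked players (the list is ordered by rank)."""
    return mean(ages_by_rank[:n])

# Hypothetical ages ordered by ranking, invented for illustration only.
earlier_year = [24, 25, 26, 25, 27, 28, 26, 27]
later_year = [31, 32, 30, 33, 22, 21, 23, 22]

top4_change = mean_age_of_top(later_year, 4) - mean_age_of_top(earlier_year, 4)
field_change = mean(later_year) - mean(earlier_year)

print(f"top 4 average age changed by {top4_change:+.2f} years")
print(f"whole field average age changed by {field_change:+.2f} years")
```

In this toy example, the top four players are much older in the later year while the average of the whole field hardly changes, which is the pattern the conjecture describes.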

So kudos to these reporters for writing a careful headline that stays true to the analysis.

I also found this video at FT that discussed the chart.


This chart about Wimbledon players hits the Trifecta. It has an interesting – to some, surprising – message (Q). It demonstrates thoughtful processing and analysis of the data (D). And the visual design fits well with its intended message (V). (For a comprehensive guide to the Trifecta Checkup, see here.)

Too much of a good thing

Several of us discussed this data visualization over Twitter last week. The dataviz by Aero Data Lab is called “A Bird’s Eye View of Pharmaceutical Research and Development”. There is a separate discussion on STAT News.

Here is the top section of the chart:


We faced a number of hurdles in understanding this chart as there is so much going on. The size of the shapes is perhaps the first thing readers notice, followed by where the shapes are located along the horizontal (time) axis. After that, readers may see the color of the shapes, and finally, the different shapes (circles, triangles,...).

It would help to have a legend explaining the sizes, shapes and colors. These were explained within the text. The size encodes the number of test subjects in the clinical trials. The color encodes pharmaceutical companies, of which the graphic focuses on 10 major ones. Circles represent completed trials, crosses inside circles represent terminated trials, triangles represent trials that are still active and recruiting, and squares for other statuses.

The vertical axis presents another challenge. It shows the disease conditions being investigated. As a lay-person, I cannot comprehend the logic of the order. With over 800 conditions, it became impossible to find a particular condition. The search function on my browser skipped over the entire graphic. I believe the order is based on some established taxonomy.


In creating the alternative shown below, I stayed close to the original intent of the dataviz, retaining all the dimensions of the dataset. Instead of the fancy dot plot, I used an enhanced data table. The encoding methods reflect what I’d like my readers to notice first. The color shading reflects the size of each clinical trial. The pharmaceutical companies are represented by their first initials. The status of the trial is shown by a dot, a cross or a square.

Here is a sketch of this concept showing just the top 10 rows.


Certain conditions attracted much more investment. Certain pharmas are placing bets on cures for certain conditions. For example, Novartis is heavily into research on Meningitis, meningococcal while GSK has spent quite a bit on researching "bacterial infections."

It's hot even in Alaska

A Twitter user pointed to the following chart, which shows that Alaska has experienced extreme heat this summer, with the July statewide average temperature shattering the previous record:


This column chart is clear in its primary message: the red column shows that the average temperature this year is quite a bit higher than the next highest temperature, recorded in July 2004. The error bar is useful for statistically-literate people - the uncertainty is (presumably) due to measurement errors. (If a similar error bar is drawn for the July 2004 column, these bars probably overlap a bit.)

The chart violates one of the rules of making column charts - the vertical axis is truncated at 53F, thus the heights or areas of the columns shouldn't be compared. This violation was recently nominated by two dataviz bloggers when asked about "bad charts" (see here).
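
A quick sketch shows why a truncated axis makes column heights misleading. Only the 53F axis start comes from the chart; the two temperatures below are hypothetical stand-ins, and `apparent_height_ratio` is my own helper name.

```python
def apparent_height_ratio(a, b, axis_start):
    """Ratio of the drawn column heights when the axis begins at axis_start."""
    return (a - axis_start) / (b - axis_start)

AXIS_START = 53.0  # where the vertical axis of the chart begins (degrees F)

# Hypothetical temperatures, not the actual Alaska records.
record, runner_up = 58.0, 57.0

ratio = apparent_height_ratio(record, runner_up, AXIS_START)
difference = record - runner_up

print(f"the columns differ by {difference:.1f} degrees, "
      f"yet one is drawn {ratio:.2f} times as tall as the other")
```

The drawn ratio depends entirely on where the axis starts, so it tells the reader nothing reliable about the data.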

Now look at the horizontal axis. These are the years of the top 20 temperature records, ordered from highest to lowest. The months are almost always July except for the year 2004 when all three summer months entered the top 20. I find it hard to make sense of these dates when they are jumping around.

In the following version, I plotted the 20 temperatures on a chronological axis. Color is used to divide the 20 data points into four groups. The chart is meant to be read top to bottom. 



Three estimates, two differences trip up an otherwise good design

Reader Fernando P. was baffled by this chart from the Perception Gap report by More in Common. (link to report)


Overall, this chart is quite good. Its flaws are subtle. There is so much going on, perhaps even the designer found it hard to keep level.

The title is "Democrat's Perception Gap", which actually means the gap between Democrats' perception of Republicans and Republicans' self-reported views. We are talking about two estimates of Republican views. Conversely, in Figure 2 (not shown), the "Republican's Perception Gap" describes two estimates of Democrat views.

The gap is visually shown as the gray bar between the red dot and the blue dot. This is labeled perception gap, and its values are printed on the right column, also labeled perception gap.

Perhaps as an after-thought, the designer added the yellow stripes, which are a third estimate of Republican views, this time by Independents. This little addition wreaks havoc. There are now three estimates - and two gaps. There is a new gap, between Independents' perception of Republican views, and Republicans' self-reported views. This I-gap is hidden in plain sight. The words "perception gap" obstinately stick to the D-gap.


Here is a slightly modified version of the same chart.



The design focuses attention on the two gaps (bars). It also identifies the Republican self-perception as the anchor point from which the gaps are computed.

I have chosen to describe the Republican dot as "self-perception" rather than "actual view," which connotes a form of "truth." Rather than considering the gap as an error of estimation, I like to think of the gap as the difference between two groups of people asked to estimate a common quantity.
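
The arithmetic behind the two gaps can be sketched as follows. The percentages are hypothetical, not values from the report, and `perception_gaps` is a name I made up; the point is only that both gaps are measured from the same anchor.

```python
def perception_gaps(self_reported, estimates):
    """Signed gap of each group's estimate from the Republican self-perception."""
    return {group: estimate - self_reported for group, estimate in estimates.items()}

# Hypothetical percentages for a single issue, invented for illustration.
republican_self_perception = 50
estimates_of_republicans = {"Democrats": 30, "Independents": 38}

gaps = perception_gaps(republican_self_perception, estimates_of_republicans)
for group, gap in gaps.items():
    print(f"{group}: {gap:+d} points from the Republican self-perception")
```

Keeping the sign makes it clear whether a group over- or under-estimates, which the unsigned label "perception gap" hides.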

Also, one should note that on the last two issues, there is virtual agreement.


Aside from the visual, I have doubts about the value of such a study. Only the most divisive issues are being addressed here. Adding a few bipartisan issues would provide controls that can be useful to tease out what is the baseline perception gap.

I wonder whether there is a self-selection in survey response, such that people with extreme views (from each party) will be under-represented. Further, do we believe that all survey respondents will provide truthful answers to sensitive questions that deal with racism, sexism, etc.? For example, if I am a moderate holding racist views, would I really admit to racism in a survey?



Putting the house in order, two Brexit polls

Reader Steve M. noticed an oversight in the Guardian in the following bar chart (link):


The reporter was discussing an important story that speaks to the need for careful polling design. He was comparing two polls, one by Ipsos Mori and one by YouGov, that estimate the support for each party in the future U.K. general election. The bottom line is that the YouGov poll predicts about double the support for the Brexit Party that the Ipsos Mori poll does.

The stacked bar chart should only be used for data that can be added up. Here, we should be comparing the numbers side by side:


I've always found this standard display inadequate. The story here is the gap in the two bar lengths for the Brexit Party. A secondary story is that the support for the Brexit Party might come from voters breaking from Labour. In other words, we really want the reader to see:


Switching to a dot plot helps bring attention to the gaps:


Now, putting the house in order:


Why do these two polls show such different results? As the reporter explained, the answer is in how the question was asked. The Ipsos Mori poll is unprompted, meaning the Brexit Party was not announced to the respondent as one of the choices, while the YouGov poll is prompted.

This last version imposes a direction on the gaps to bring out the secondary message - that the support for Brexit might be coming from voters breaking from Labour.
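
Here is a rough sketch of the data step behind a directed gap plot. The vote shares are hypothetical placeholders, not the published poll results, and `signed_gaps` is my own helper; the idea is to compute prompted minus unprompted for each party and sort by that signed gap.

```python
def signed_gaps(prompted, unprompted):
    """Prompted share minus unprompted share, per party (percentage points)."""
    return {party: prompted[party] - unprompted[party] for party in prompted}

# Hypothetical vote shares (percent), NOT the actual poll numbers.
ipsos_mori = {"Conservative": 26, "Labour": 27, "Brexit Party": 12, "Lib Dem": 22}
yougov = {"Conservative": 24, "Labour": 21, "Brexit Party": 23, "Lib Dem": 20}

gaps = signed_gaps(prompted=yougov, unprompted=ipsos_mori)

# Largest gain under prompting first, largest loss last.
ordered = sorted(gaps.items(), key=lambda item: item[1], reverse=True)
for party, gap in ordered:
    print(f"{party:>12}: {gap:+d}")
```

Sorting by the signed gap puts the party that gains most from prompting at one end and the party that loses most at the other, which is the ordering the secondary message needs.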




Wayward legend takes sides in a chart of two sides, plus data woes

Reader Chris P. submitted the following graph, found on Axios:


From a Trifecta Checkup perspective, the chart has a clear question: are consumers getting what they wanted to read in the news they are reading?

Nevertheless, the chart is a visual mess, and the underlying data analytics fail to convince. So, it’s a Type DV chart. (See this overview of the Trifecta Checkup for the taxonomy.)


The designer did something tricky with the axis but the trick went off the rails. The underlying data consist of two sets of ranks, one for news people consumed and the other for news people wanted covered. With 14 topics included in the study, the two data series contain the same values, 1 to 14. The trick is to collapse both axes onto one. The trouble is that the same value occurs twice, and the reader must differentiate the plot symbols (triangle or circle) to figure out which is which.

It does not help that the lines look like arrows suggesting movement. Without first reading the text, readers may assume that topics change in rank between two periods of time. Some topics moved right, increasing in importance while others shifted left.

The design wisely separated the 14 topics into three logical groups. The blue group comprises news topics for which “want covered” ranking exceeds the “read” ranking. The orange group has the opposite disposition such that the data for “read” sit to the right side of the data for “want covered”. Unfortunately, the legend up top does more harm than good: it literally takes sides!


Here, I've put the data onto a scatter plot:


The two sets of ranks are basically uncorrelated, as the regression line is almost flat, with “R-squared” of 0.02.

The analyst tried to "rescue" the data in the following way. Draw the 45-degree line, and color the points above the diagonal blue, and those below the diagonal orange. Color the points on the line gray. Then, write stories about those three subgroups.
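
The two steps (the flat regression line, then the coloring by the diagonal) can be sketched like this. The ranks are a made-up permutation rather than the Axios data, and I am assuming the "read" rank sits on the horizontal axis and the "want covered" rank on the vertical axis.

```python
from math import sqrt

def r_squared(xs, ys):
    """Square of the Pearson correlation, i.e. R-squared of a simple linear fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return (cov / sqrt(vx * vy)) ** 2

def diagonal_color(read_rank, want_rank):
    """Color by position relative to the 45-degree line (axis orientation assumed)."""
    if want_rank > read_rank:
        return "blue"    # above the diagonal
    if want_rank < read_rank:
        return "orange"  # below the diagonal
    return "gray"        # on the diagonal

# Hypothetical ranks for 14 topics, NOT the published rankings.
read = list(range(1, 15))
want = [8, 3, 12, 4, 1, 14, 7, 10, 9, 2, 13, 6, 11, 5]

colors = [diagonal_color(r, w) for r, w in zip(read, want)]
print(f"R-squared = {r_squared(read, want):.2f}")
print({c: colors.count(c) for c in ("blue", "orange", "gray")})
```

Any set of points, correlated or not, splits into these three subgroups, which is why the grouping by itself does not rescue the analysis.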


Further, the ranking of what was read came from Parse.ly, which appears to be surveillance data (“traffic analytics”) while the ranking of what people want covered came from an Axios/SurveyMonkey poll. As far as I could tell, there was no attempt to establish that the two populations are compatible and comparable.






Pay levels in the U.S.

The Wall Street Journal published a graphic showing the median pay levels at "most" public companies in the U.S. here.


People who attended my dataviz seminar might recognize the similarity with the graphic showing internet download speeds by different broadband technologies. It's a clean, clear way of showing multiple comparisons on the same chart.

You can see the distribution of pay levels of companies within each industry grouping, and the vertical lines showing the sector medians allow comparison across sectors. The median pay levels are quite similar with the energy sector leaning higher, and consumer sector leaning lower.

The consumer sector is extremely heavy on the low side of the pay range. Companies like Universal, Abercrombie, Skechers, Mattel, Gap, etc. all pay at least half their employees less than $6,000. The data is sourced to MyLogIQ. I have no knowledge of how reliable or valid the data are. It's curious to me that Dunkin Brands showed a median of $110K while Starbucks showed $13K.



I like the interactive features.

The window control lets the user zoom in to different parts of the pay range. This is necessary because of the extremely high salaries. The control doubles as a presentation of the overall distribution of median salaries.

The text box can be used to add data labels to specific companies.


See previous discussion of WSJ Graphics.


Measles babies

Mona Chalabi has made this remarkable graphic to illustrate the effect of the anti-vaccine movement on measles cases in the U.S.: (link)


As a form of agitprop, the graphic seizes upon the fear engendered by the defacing red rash of the disease. And it's very effective in articulating its social message.


I wasn't able to find the data except for a specific year or two. So, this post is more inspired by the graphic than a direct response to it.

I think the left-side legend should say "1 case of measles in someone who was not vaccinated" (as opposed to 1 case of measles in aggregate).

The chart encodes the data in the density of the red dots. What does the density of the red dots signify? There are two possibilities: case counts or case rates.

2013 is a year in which I could find data. In 2013, the U.S. saw 187 cases of measles, only 4 of them in someone who was vaccinated. In other words, there are 49 times as many measles cases among the unvaccinated as the vaccinated.

But note that about 90 percent of the population (using 13-17 year olds as a proxy) are vaccinated. The chance of getting measles in the unvaccinated is 0.8 per million, compared to 0.002 per million in the vaccinated - 422 times higher.
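
The step from counts to rates can be sketched using only the rounded figures quoted above. I am assuming that every case not attributed to a vaccinated person counts as unvaccinated. With these rounded inputs, the ratios come out near 46 and 412 rather than exactly 49 and 422; I would guess the quoted figures rest on more precise inputs, but that is my assumption.

```python
total_cases = 187        # U.S. measles cases in 2013, as quoted above
vaccinated_cases = 4     # cases in vaccinated people, as quoted above
vaccinated_share = 0.90  # rough share of the population vaccinated, as quoted above

# Assumption: all remaining cases are treated as unvaccinated.
unvaccinated_cases = total_cases - vaccinated_cases

# Ratio of raw case counts.
count_ratio = unvaccinated_cases / vaccinated_cases

# Ratio of case rates: each count is divided by the size of its own group.
# The total population cancels out, so only the vaccinated share matters.
rate_ratio = count_ratio * vaccinated_share / (1 - vaccinated_share)

print(f"count ratio: {count_ratio:.1f}")
print(f"rate ratio:  {rate_ratio:.1f}")
```

Because the unvaccinated group is roughly nine times smaller, the rate ratio is roughly nine times the count ratio.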

The following chart shows the relative appearance of the dot densities. The bottom row which compares the relative chance of getting measles is the more appropriate metric, and it looks much worse.



Mona's instagram has many other provocative graphics.


Seeking simplicity in complex data: Bloomberg's dataviz on UK gender pay gap

Bloomberg featured a thought-provoking dataviz that illustrates the pay gap by gender in the U.K. The dataset underlying this effort is complex, and the designers did a good job simplifying the data for ease of comprehension.

U.K. companies are required to submit data on salaries and bonuses by gender, and by pay quartiles. The dataset is incomplete, since some companies are slow to report, and the analyst decided not to merge companies that changed names.

Companies are classified into industry groups. Readers who read Chapter 3 of Numbers Rule Your World (link) should ask whether these group differences are meaningful by themselves, without controlling for seniority, job titles, etc. The chapter features one method used by the educational testing industry to take a more nuanced analysis of group differences.


The Bloomberg visualization has two sections. In the top section, each company is represented by the percent difference between average female pay and average male pay. Then the companies within a given industry are shown in a histogram. The histograms provide a view of the disparity between companies within a given industry. The black line represents the relative proportion of companies in a given industry that have no gender pay gap but it’s the weight of the histogram on either side of the black line that carries the graphic’s message.
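
As a rough sketch of the data step behind each histogram: compute one percent difference per company, then count companies per bin. The pay figures are hypothetical, the 10-point bin width is arbitrary, and expressing the difference relative to male pay is my assumption about the convention.

```python
from collections import Counter
from math import floor

def pct_gap(female_avg, male_avg):
    """Percent difference of average female pay relative to average male pay."""
    return (female_avg - male_avg) * 100 / male_avg

def histogram(gaps, bin_width=10):
    """Count companies per bin; each bin is keyed by its lower edge."""
    return Counter(floor(gap / bin_width) * bin_width for gap in gaps)

# Hypothetical (female average, male average) hourly pay for five companies.
companies = [(17, 20), (22, 25), (30, 40), (21, 20), (97, 100)]

gaps = [pct_gap(f, m) for f, m in companies]
bins = histogram(gaps)

for lower_edge in sorted(bins):
    print(f"{lower_edge:+4d} to {lower_edge + 10:+4d}: {'#' * bins[lower_edge]}")
```

Negative bins sit to the left of the zero line, so the weight of the histogram on that side corresponds to companies paying women less on average.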

This is the histogram for arts, entertainment and recreation.


The spread within this industry is very wide, especially on the left side of the black line. A large proportion of these companies pay women less on average than men, and how much less is highly variable. There is one extreme positive value: Chelsea FC Foundation, which pays the average female about 40% more than the average male.

This is the histogram for the public sector.

It is a much tighter distribution, meaning that the pay gaps vary less from organization to organization (this statement ignores the possibility that there are outliers not visible on this graphic). Again, the vast majority of entities in this sector pay women less than men on average.


The second part of the visualization looks at the quartile data. The employees of each company are divided into four equal-sized groups, based on their wages. Think of these groups as the Top 25% Earners, the Second 25%, etc. Within each group, the analyst looks at the proportion of women. If gender is independent of pay, then we should expect the proportions of women to be about the same for all four quartiles. (This analysis considers gender to be the only explainer for pay gaps. This is a problem I've called xyopia, that frames a complex multivariate issue as a bivariate problem involving one outcome and one explanatory variable. Chapter 3 of Numbers Rule Your World (link) discusses how statisticians approach this issue.)
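
The quartile calculation can be sketched as below. The employee records are invented, and the helper assumes a headcount divisible by four; real reporting rules would need to handle ties and remainders.

```python
def female_share_by_quartile(employees):
    """
    employees: list of (wage, is_female) pairs.
    Returns the proportion of women in each pay quartile, top earners first.
    Assumes the number of employees is divisible by four.
    """
    ranked = sorted(employees, key=lambda e: e[0], reverse=True)
    size = len(ranked) // 4
    shares = []
    for q in range(4):
        group = ranked[q * size:(q + 1) * size]
        shares.append(sum(1 for _, is_female in group if is_female) / size)
    return shares

# Hypothetical company with eight employees: (wage, is_female).
staff = [(80, False), (70, False), (60, True), (50, False),
         (40, True), (30, False), (20, True), (10, True)]

shares = female_share_by_quartile(staff)
print(shares)
```

If gender were independent of pay, the four proportions would be roughly equal; here they rise steadily from the top quartile to the bottom one.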

On the right is the chart for the public sector. This is a pie chart used as a container. Every pie has four equal-sized slices representing the four quartiles of pay.

The female proportion is encoded in both the size and color of the pie slices. The size encoding is more precise while the color encoding has only 4 levels so it provides a “binned” summary view of the same data.

For the public sector, the lighter-colored slice shows the top 25% earners, and its light color means the proportion of women in the top 25% earners group is between 30 and 50 percent. As we move clockwise around the pie, the slices represent the 2nd, 3rd and bottom 25% earners, and women form 50 to 70 percent of each of those three quartiles.

To read this chart properly, the reader must first do one calculation. Women represent about 60% of the top 25% earners in the public sector. Is that good or bad? This depends on the overall representation of women in the public sector. If the sector employs 75 percent women overall, then the 60 percent does not look good but if it employs 40 percent women, then the same value of 60% tells us that the female employees are disproportionately found in the top 25% earners.

That means the reader must compare each value in the pie chart against the overall proportion of women, which is learned from the average of the four quartiles.


In the chart below, I make this relative comparison explicit. The overall proportion of women in each industry is shown using an open dot. Then the graphic displays two bars, one for the Top 25% earners, and one for the Bottom 25% earners. The bars show the gap between those quartiles and the overall female proportion. For the top earners, the size of the red bars shows the degree of under-representation of women while for the bottom earners, the size of the gray bars shows the degree of over-representation of women.


The net sum of the bar lengths is a plausible measure of gender inequality.

The industries are sorted from the ones employing the fewest women (at the top) to the ones employing the most women (at the bottom). An alternative is to sort by total bar lengths. In the original Bloomberg chart (the small multiples of pie charts), the industries are sorted by the proportion of women in the bottom 25% pay quartile, from smallest to largest.

In making this dataviz, I elected to ignore the middle 50%. This is not a problem since any quartile above the average must be compensated by a different quartile below the average.
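
Here is a small sketch of that bookkeeping, using hypothetical quartile proportions. Because the quartiles are equal-sized, the overall proportion of women is the simple average of the four, and the deviations from it sum to zero. The inequality measure at the end is only one reading of the sum of the two bar lengths.

```python
def quartile_gaps(shares):
    """
    shares: percent of women in each pay quartile, top earners first.
    Returns the overall percent of women and each quartile's deviation from it.
    """
    overall = sum(shares) / len(shares)  # valid because quartiles are equal-sized
    return overall, [share - overall for share in shares]

# Hypothetical industry: percent women in the top, 2nd, 3rd and bottom quartiles.
shares = [40, 55, 60, 65]

overall, gaps = quartile_gaps(shares)
top_bar = overall - shares[0]      # under-representation among top earners
bottom_bar = shares[-1] - overall  # over-representation among bottom earners

print(f"overall: {overall:.0f}%  gaps: {gaps}")
print(f"sum of the two bars: {top_bar + bottom_bar:.0f} points")
```

Under this reading, the sum of the two bars is simply the bottom-quartile proportion minus the top-quartile proportion.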


The challenge of complex datasets is discovering simple ways to convey the underlying message. This usually requires quite a bit of upfront analytics, data transformation, and lots of sketching.