Clearing a forest of labels

This chart by the Financial Times has a strong message, and I like a lot about it:


The countries are by and large aligned along a diagonal, with the poorer countries growing strongly between 2007 and 2019 while the richer countries suffered negative growth.

A small issue with the chart is the thick forest of text - redundant text. The sub-title, the axis titles, the quadrant labels, and the left-right-half labels all repeat the same things. In the following chart, I simplify the text:


Typically, I don't put axis titles as a sub-header (or, header of the graphic) but as this may be part of the FT style, I respected it.

Re-thinking a standard business chart of stock purchases and sales

Here is a typical business chart.


A possible story here: institutional investors are generally buying AMD stock, except in Q3 2018.

Let's give this chart a three-step treatment.

STEP 1: The Basics

Remove the data labels, which stand sideways awkwardly, and are redundant given the axis labels. If the audience includes people who want to take the underlying data, then supply a separate data table. It's easier to copy and paste from, and doing so removes clutter from the visual.

The value axis is probably created by an algorithm - it is hard to imagine someone deliberately placing axis labels $262 million apart.

The gridlines are optional.


STEP 2: Intermediate

Simplify and re-organize the time axis labels; show the quarter and year structure. The years need not repeat.

Align the vocabulary on the chart. The legend mentions "inflows and outflows" while the chart title uses the words "buying and selling". Inflows is buying; outflows is selling.


STEP 3: Advanced

This type of data presents an interesting design challenge. Arguably the most important metric is the net purchases (or the net flow), i.e. inflows minus outflows. And yet, the chart form leaves this element in the gaps, visually.

The outflows are numerically opposite to inflows. The sign of the flow is encoded in the color scheme. An outflow still points upwards. This isn't a criticism, but rather a limitation of the chart form. If the red bars are made to point downwards to indicate negative flow, then the "net flow" is almost impossible to visually compute!

Putting the columns side by side allows the reader to visually compute the gap, but it is hard to visually compare gaps from quarter to quarter because each gap is hanging off a different baseline.
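To make the arithmetic concrete, here is a minimal sketch in Python of the net-flow calculation. The quarterly figures are invented for illustration; they are not the values behind the chart.

```python
# Hypothetical quarterly flows in $ millions (made-up numbers, not AMD data).
quarters = ["Q1 2018", "Q2 2018", "Q3 2018", "Q4 2018"]
inflows = [900, 1100, 700, 1300]   # gross buying
outflows = [600, 800, 950, 1000]   # gross selling

# Net flow = inflows minus outflows; a negative value means net selling.
net_flows = [buy - sell for buy, sell in zip(inflows, outflows)]

for quarter, net in zip(quarters, net_flows):
    direction = "net buying" if net >= 0 else "net selling"
    print(f"{quarter}: {net:+d} ({direction})")
```

Once the net flows exist as their own series, they can be plotted against a common zero baseline, which is what makes quarter-to-quarter comparison possible.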

The following graphic solves this issue by focusing the chart on the net flows. The buying and selling are still plotted but are deliberately pushed to the side:


The structure of the data is such that the gray and pink sections are "symmetric" around the brown columns. A purist may consider removing one of these columns. In other words:


Here, the gray columns represent gross purchases while the brown columns display net purchases. The reader is then asked to infer the gross selling, which is the difference between the two column heights.

We are almost back to the original chart, except that the net buying is brought to the foreground while the gross selling is pushed to the background.


Pay levels in the U.S.

The Wall Street Journal published a graphic showing the median pay levels at "most" public companies in the U.S. here.


People who attended my dataviz seminar might recognize the similarity with the graphic showing internet download speeds by different broadband technologies. It's a clean, clear way of showing multiple comparisons on the same chart.

You can see the distribution of pay levels of companies within each industry grouping, and the vertical lines showing the sector medians allow comparison across sectors. The median pay levels are quite similar with the energy sector leaning higher, and consumer sector leaning lower.

The consumer sector is extremely heavy on the low side of the pay range. Companies like Universal, Abercrombie, Skechers, Mattel, Gap, etc. all pay at least half their employees less than $6,000. The data is sourced to MyLogIQ. I have no knowledge of how reliable or valid the data are. It's curious to me that Dunkin Brands showed a median of $110K while Starbucks showed $13K.



I like the interactive features.

The window control lets the user zoom in to different parts of the pay range. This is necessary because of the extremely high salaries. The control doubles as a presentation of the overall distribution of median salaries.

The text box can be used to add data labels to specific companies.


See previous discussion of WSJ Graphics.


Measles babies

Mona Chalabi has made this remarkable graphic to illustrate the effect of the anti-vaccine movement on measles cases in the U.S.: (link)


As a form of agitprop, the graphic seizes upon the fear engendered by the defacing red rash of the disease. And it's very effective in articulating its social message.


I wasn't able to find the data except for a specific year or two. So, this post is more inspired by the graphic than a direct response to it.

I think the left-side legend should say "1 case of measles in someone who was not vaccinated" (as opposed to 1 case of measles in aggregate).

The chart encodes the data in the density of the red dots. What does the density of the red dots signify? There are two possibilities: case counts or case rates.

2013 is a year in which I could find data. In 2013, the U.S. saw 187 cases of measles, only 4 of them in someone who was vaccinated. In other words, there are 49 times as many measles cases among the unvaccinated as the vaccinated.

But note that about 90 percent of the population (using 13-17 year olds as a proxy) are vaccinated. The chance of getting measles in the unvaccinated is 0.8 per million, compared to 0.002 per million in the vaccinated - 422 times higher.
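The difference between the two readings comes down to dividing by group size. The Python sketch below uses the rounded figures quoted above (187 cases, 4 among the vaccinated, roughly 90 percent vaccinated). Because these inputs are rounded, and because it assumes every case not in a vaccinated person was in an unvaccinated one, its results will not match the ratios in the text exactly.

```python
def count_ratio(unvax_cases, vax_cases):
    """How many times more cases occur among the unvaccinated (raw counts)."""
    return unvax_cases / vax_cases


def rate_ratio(unvax_cases, vax_cases, vax_share):
    """How many times higher the per-person chance is for the unvaccinated.

    vax_share is the fraction of the population that is vaccinated. The
    population size cancels out of the ratio, so only the share is needed.
    """
    unvax_rate = unvax_cases / (1 - vax_share)
    vax_rate = vax_cases / vax_share
    return unvax_rate / vax_rate


# Rounded inputs from the text; the split into two groups is an assumption.
total_cases, vax_cases, vax_share = 187, 4, 0.90
unvax_cases = total_cases - vax_cases

print(round(count_ratio(unvax_cases, vax_cases), 1))
print(round(rate_ratio(unvax_cases, vax_cases, vax_share), 1))
```

With roughly nine in ten people vaccinated, the rate ratio works out to about nine times the count ratio.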

The following chart shows the relative appearance of the dot densities. The bottom row which compares the relative chance of getting measles is the more appropriate metric, and it looks much worse.



Mona's Instagram has many other provocative graphics.


Elegant way to present a pair of charts

The Bloomberg team has come up with a few goodies lately. I was captivated by the following graphic about the ebb and flow of U.S. presidential candidates across recent campaigns. Link to the full presentation here.

The highlight is at the bottom of the page. This is an excerpt of the chart:


From top to bottom are the sequential presidential races. The far right vertical axis is the finish line. Going right to left is the time before the finish line. In 2008, for example, there are candidates who entered the race much earlier than typical.

This chart presents an aggregate view of the data. We get a sense of when most of the candidates enter the race, when most of them are knocked out, and also a glimpse of outliers. The general pattern across multiple elections is also clear. The design is a stacked area chart with the baseline in the middle, rather than the bottom, of the chart.

Sure, the chart can disappoint those readers who want details and precise numbers. It's not immediately apparent how many candidates were in the race at the height of 2008, nor who the candidates were.

The designer added a nice touch. By clicking on any of the stacks, it transforms into a bar chart, showing the extent of each candidate's participation in the race.


I wish there was a way to collapse the bar chart back to the stack. You can reload the page to start afresh.


This elegant design touch makes the user experience playful. It's also an elegant way to present what is essentially a panel of plots. Imagine the more traditional presentation of placing the stack and the bar chart side by side.

This design does not escape the trade-off between entertainment value and data integrity. Looking at the 2004 campaign, one should expect to see the blue stack halve in size around day 100 when Kerry became the last man standing. That moment is not marked in the stack. The stack can be interpreted as a smoothed version of the count of active candidates.


I suppose some may complain the stack misrepresents the data somewhat. I find it an attractive way of presenting the big-picture message to an audience that mostly spends less than a minute looking at the graphic.

Seeking simplicity in complex data: Bloomberg's dataviz on UK gender pay gap

Bloomberg featured a thought-provoking dataviz that illustrates the pay gap by gender in the U.K. The dataset underlying this effort is complex, and the designers did a good job simplifying the data for ease of comprehension.

U.K. companies are required to submit data on salaries and bonuses by gender, and by pay quartiles. The dataset is incomplete, since some companies are slow to report, and the analyst decided not to merge companies that changed names.

Companies are classified into industry groups. Readers who read Chapter 3 of Numbers Rule Your World (link) should ask whether these group differences are meaningful by themselves, without controlling for seniority, job titles, etc. The chapter features one method used by the educational testing industry to take a more nuanced analysis of group differences.


The Bloomberg visualization has two sections. In the top section, each company is represented by the percent difference between average female pay and average male pay. Then the companies within a given industry are shown in a histogram. The histograms provide a view of the disparity between companies within a given industry. The black line represents the relative proportion of companies in a given industry that have no gender pay gap, but it’s the weight of the histogram on either side of the black line that carries the graphic’s message.

This is the histogram for arts, entertainment and recreation.


The spread within this industry is very wide, especially on the left side of the black line. A large proportion of these companies pay women less on average than men, and how much less is highly variable. There is one extreme positive value: Chelsea FC Foundation that pays the average female about 40% more than the average male.

This is the histogram for the public sector.

It is a much tighter distribution, meaning that the pay gaps vary less from organization to organization (this statement ignores the possibility that there are outliers not visible on this graphic). Again, the vast majority of entities in this sector pay women less than men on average.


The second part of the visualization looks at the quartile data. The employees of each company are divided into four equal-sized groups, based on their wages. Think of these groups as the Top 25% Earners, the Second 25%, etc. Within each group, the analyst looks at the proportion of women. If gender is independent of pay, then we should expect the proportions of women to be about the same for all four quartiles. (This analysis considers gender to be the only explainer for pay gaps. This is a problem I've called xyopia, which frames a complex multivariate issue as a bivariate problem involving one outcome and one explanatory variable. Chapter 3 of Numbers Rule Your World (link) discusses how statisticians approach this issue.)

On the right is the chart for the public sector. This is a pie chart used as a container. Every pie has four equal-sized slices representing the four quartiles of pay.

The female proportion is encoded in both the size and color of the pie slices. The size encoding is more precise while the color encoding has only 4 levels so it provides a “binned” summary view of the same data.

For the public sector, the lighter-colored slice shows the top 25% earners, and its light color means the proportion of women in the top 25% earners group is between 30 and 50 percent. As we move clockwise around the pie, the slices represent the 2nd, 3rd and bottom 25% earners, and women form 50 to 70 percent of each of those three quartiles.

To read this chart properly, the reader must first do one calculation. Women represent about 60% of the top 25% earners in the public sector. Is that good or bad? This depends on the overall representation of women in the public sector. If the sector employs 75 percent women overall, then the 60 percent does not look good but if it employs 40 percent women, then the same value of 60% tells us that the female employees are disproportionately found in the top 25% earners.

That means the reader must compare each value in the pie chart against the overall proportion of women, which is learned from the average of the four quartiles.


In the chart below, I make this relative comparison explicit. The overall proportion of women in each industry is shown using an open dot. Then the graphic displays two bars, one for the Top 25% earners, and one for the Bottom 25% earners. The bars show the gap between those quartiles and the overall female proportion. For the top earners, the size of the red bars shows the degree of under-representation of women while for the bottom earners, the size of the gray bars shows the degree of over-representation of women.


The net sum of the bar lengths is a plausible measure of gender inequality.

The industries are sorted from the ones employing fewer women (at the top) to the ones employing the most women (at the bottom). An alternative is to sort by total bar lengths. In the original Bloomberg chart - the small multiples of pie charts, the industries are sorted by the proportion of women in the bottom 25% pay quartile, from smallest to largest.

In making this dataviz, I elected to ignore the middle 50%. This is not a problem since any quartile above the average must be compensated by a different quartile below the average.
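The calculation behind this chart can be sketched in a few lines of Python. The quartile proportions below are hypothetical, not taken from the Bloomberg dataset. Since the quartiles are equal-sized, the overall proportion is the simple average of the four.

```python
# Hypothetical share of women in each pay quartile of one industry,
# ordered from the top 25% earners to the bottom 25% earners.
female_share = {"top": 0.40, "second": 0.55, "third": 0.60, "bottom": 0.65}

# Quartiles are equal-sized, so the overall share is the plain average.
overall = sum(female_share.values()) / len(female_share)

# Gap of each quartile relative to the overall share:
# negative = under-representation, positive = over-representation.
gaps = {q: share - overall for q, share in female_share.items()}

top_gap = gaps["top"]        # bar for the top earners
bottom_gap = gaps["bottom"]  # bar for the bottom earners

# The four gaps always sum to zero: any quartile above the average is
# offset by another quartile below it.
print(f"overall={overall:.2f} top={top_gap:+.2f} bottom={bottom_gap:+.2f}")
```

In this made-up industry, women are under-represented among the top earners and over-represented among the bottom earners, relative to their overall share.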


The challenge of complex datasets is discovering simple ways to convey the underlying message. This usually requires quite a bit of upfront analytics, data transformation, and lots of sketching.



Morphing small multiples to investigate Sri Lanka's religions

Earlier this month, the bombs in Sri Lanka led to some data graphics in the media, educating us on the religious tensions within the island nation. I like this effort by Reuters using small multiples to show which religions are represented in which districts of Sri Lanka (lifted from their Twitter feed):


The key to reading this map is the top legend. From there, you'll notice that many of the color blocks, especially for Muslims and Catholics, are well short of 50 percent. The absence of the darkest tints of green and blue conveys important information. Looking at the blue map by itself misleads - Catholics are in the minority in every district except one. In this setup, readers are expected to compare between maps, and between map and legend.

The overall distribution at the bottom of the chart is a nice piece of context.


The above design isolates each religion in its own chart, and displays the spatial spheres of influence. I played around with using different ways of paneling the small multiples.

In the following graphic, the panels represent the level of dominance within each district. The first panel shows the districts in which the top religion is practiced by at least 70 percent of the population (if religions were evenly distributed across all districts, we expect 70 percent of each to be Buddhists.) The second panel shows the religions that account for 40 to 70 percent of the district's residents. By this definition, no district can appear on both the left and middle maps. This division is effective at showing districts with one dominant religion, and those that are "mixed".

In the middle panel, the displayed religion represents the top religion in a mixed district. The last panel shows the second religion in each mixed district, and these religions typically take up between 25 and 40 percent of the residents.
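The paneling rule can be written down as a small classification. The following Python sketch uses invented district names and percentages, not the figures behind the maps; the 70 and 40 percent cut-offs are the ones described above.

```python
def classify_district(shares):
    """Assign a district to panels based on its religions' population shares.

    shares maps religion -> percent of residents. Returns a dict saying which
    religion (if any) the district contributes to each of the three panels.
    """
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    (top, top_pct), (second, _) = ranked[0], ranked[1]
    if top_pct >= 70:
        return {"dominant": top, "mixed_top": None, "mixed_second": None}
    if top_pct >= 40:
        return {"dominant": None, "mixed_top": top, "mixed_second": second}
    # A top share below 40 percent is not covered by the description above;
    # such a district is left unassigned in this sketch.
    return {"dominant": None, "mixed_top": None, "mixed_second": None}


# Hypothetical districts (names and numbers are made up).
districts = {
    "District A": {"Buddhist": 85, "Hindu": 5, "Muslim": 6, "Catholic": 4},
    "District B": {"Hindu": 45, "Muslim": 35, "Buddhist": 12, "Catholic": 8},
}
panels = {name: classify_district(s) for name, s in districts.items()}
```

Because the first two conditions are mutually exclusive, a district lands on either the first panel or the middle and last panels, never both.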


The chart shows that other than Buddhism, Hinduism is the only religion that dominates specific districts, concentrated at the northern end of the island. The districts along the east and west coasts and the "neck" are mixed, with the top religion accounting for 40 to 70 percent of the residents. By assimilating the second and the third panels, the reader sees the top and the second religions in each of these mixed districts.


This example shows why in the Trifecta Checkup, the Visual is a separate corner from the Question and the Data. Both maps utilize the same visual design, in terms of forms and colors and so on, but they deliver different experiences to readers by answering different questions, and cutting the data differently.


P.S. [5/7/2019] Corrected spelling of Hindu.

Watching a valiant effort to rescue the pie chart

Today we return to the basics. In a Twitter exchange with Dean E., I found the following pie chart in an Atlantic article about who's buying San Francisco real estate:


The pie chart is great at one thing, showing how workers in the software industry accounted for half of the real estate purchases. (Dean and I both want to see more details of the analysis as we have many questions about the underlying data. In this post, I ignore these questions.)

After that, if we want to learn anything else from the pie chart, we have to read the data labels. This calls for one of my key recommendations: make your charts self-sufficient. The principle of self-sufficiency is that the visual elements of the data graphic should by themselves say something about the data. The test of self-sufficiency is executed by removing the data printed on the chart so that one can assess how much work the visual elements are performing. If the visual elements require data labels to work, then the data graphic is effectively a lookup table.

This is the same pie chart, minus the data:


Almost all pie charts with a large number of slices are packed with data labels. Think of the labeling as a corrective action to fix the shortcoming of the form.

Here is a bar chart showing the same data:



Let's look at all the efforts made to overcome the lack of self-sufficiency.

Here is a zoom-in on the left side of the chart:


Data labels are necessary to help readers perceive the sizes of the slices. But as the slices are getting smaller, the labels are getting too dense, so the guiding lines are being stretched.

Eventually, the designer gave up on labeling every slice. You can see that some slices are missing labels:


The designer also had to give up on sequencing the slices by the data. For example, hardware with a value of 2.4% should be placed between Education and Law. It is shifted to the top left side to make the labeling easier.


Fitting all the data labels to the slices becomes the singular task at hand.


Say it thrice: a nice example of layering and story-telling

I enjoyed the New York Times's data viz showing how actively the Democratic candidates were criss-crossing the nation in the month of March (link).

It is a great example of layering the presentation, starting with an eye-catching map at the most aggregate level. The designers looped through the same dataset three times.


This compact display packs quite a lot. We can easily identify which were the most popular states; and which candidate visited which states the most.

I noticed how they handled the legend. There is no explicit legend. The candidate names are spread around the map. The size legend is also missing, replaced by a short sentence explaining that size encodes the number of cities visited within the state. For a chart like this, having a precise size legend isn't that useful.

The next section presents the same data in a small-multiples layout. The heads are replaced by dots.


This allows more precise comparison of one candidate to another, and one location to another.

This display has one shortcoming. If you compare the left two maps above, those for Amy Klobuchar and Beto O'Rourke, it looks like they have visited a roughly similar number of cities when in fact Beto went to 42 compared to 25. Reducing the size of the dots might work.

Then, in the third visualization of the same data, the time dimension is emphasized. Lines are used to animate the daily movements of the candidates, one by one.


Click here to see the animation.

When repetition is done right, it doesn't feel like repetition.


Form and function: when academia takes on weed

I have a longer article on the sister blog about the research design of a study claiming 420 "cannabis" Day caused more road accident fatalities (link). The blog also has a discussion of the graphics used to present the analysis, which I'm excerpting here for dataviz fans.

The original chart looks like this:


The question being asked is whether April 20 is a special day when viewed against the backdrop of every day of the year. The answer is pretty clear. From this chart, the reader can see:

  • that April 20 is part of the background "noise". It's not standing out from the pack;
  • that there are other days, like July 4, Labor Day, Christmas, etc., that stand out more than April 20.

It doesn't even matter what the vertical axis is measuring. The visual elements did their job. 


If you look closely, you can even assess the "magnitude" of the evidence, not just the "direction." While April 20 isn't special, it nonetheless is somewhat noteworthy. The vertical line associated with April 20 sits on the positive side of the range of possibilities, and appears to sit above most other days.

The chart form shown above is better at conveying the direction of the evidence than its strength. If the strength of the evidence is required, we use a different chart form.

I produced the following histogram, using the same data:


The histogram is produced by first sorting the midpoints# of the vertical lines into buckets, and then counting the number of days that fall into each bucket. (# Strictly speaking, I use the point estimates.)

The midpoints# are estimates of the fatal crash ratio, which is defined as the excess crash fatalities reported on the "analysis day" relative to the "reference days," which are situated one week before and one week after the analysis day. So April 20 is compared to April 13 and 27. Therefore, a ratio of 1 indicates no excess fatalities on the analysis day. And the further the ratio is above 1, the more special is the analysis day. 
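The bucketing step is simple enough to sketch in Python. The ratios below are invented stand-ins for the study's point estimates, and the bucket width of 0.05 is an arbitrary choice for illustration.

```python
import math
from collections import Counter

BUCKET_WIDTH = 0.05  # arbitrary width chosen for this sketch


def bucket_counts(ratios, width=BUCKET_WIDTH):
    """Count how many days fall into each bucket of the given width.

    Buckets are keyed by index: bucket k covers [k * width, (k + 1) * width).
    """
    return Counter(math.floor(r / width) for r in ratios)


# Hypothetical fatal crash ratios, one per analysis day (made-up values).
ratios = [0.97, 1.02, 1.03, 1.12, 0.91, 1.01, 1.26]
counts = bucket_counts(ratios)

# Print a crude text histogram, one row per non-empty bucket.
for k in sorted(counts):
    low = k * BUCKET_WIDTH
    print(f"{low:.2f}-{low + BUCKET_WIDTH:.2f}: {'#' * counts[k]}")
```

Days with ratios close to 1 pile up in the middle buckets, while the unusual days show up as short bars in the tails.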

If we were to pick a random day from the histogram above, we would likely land somewhere in the middle, which is to say, a day of the year in which no excess car crash fatalities could be confirmed in the data.

As shown above, the ratio for April 20 (about 1.12) is located on the right tail, at roughly the 94th percentile, meaning that about 6 percent of analysis days had ratios that were more extreme.
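For readers who want to see how such a percentile is read off, here is a minimal Python sketch. The list of ratios is invented and much shorter than a full year of analysis days, so the result does not reproduce the 94th percentile quoted above.

```python
def percentile_rank(values, x):
    """Percent of values that are less than or equal to x."""
    return 100 * sum(v <= x for v in values) / len(values)


# Hypothetical ratios for 20 analysis days (not the study's estimates).
ratios = [
    0.82, 0.88, 0.91, 0.93, 0.95, 0.96, 0.98, 0.99, 1.00, 1.00,
    1.01, 1.02, 1.03, 1.04, 1.05, 1.07, 1.09, 1.10, 1.15, 1.24,
]

rank = percentile_rank(ratios, 1.12)
share_more_extreme = 100 - rank  # percent of days with a higher ratio
print(f"percentile: {rank:.0f}, more extreme: {share_more_extreme:.0f}%")
```

The share of days with a higher ratio is simply 100 minus the percentile rank, which is how a 94th percentile translates into 6 percent of days being more extreme.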

This is in line with our reading above, that April 20 is noteworthy but not extraordinary.


P.S. [4/27/2019] Replaced the first chart with a newer version from Harper's site. The newer version contains the point estimates inside the vertical lines, which are used to generate the histogram.