This Excel chart looks standard but gets everything wrong

The following CNBC chart (link) shows the trend of global car sales by region (or so we think).

Cnbc zh global car sales

This type of chart is quite common in finance/business circles, and has the fingerprint of Excel. After examining it, I nominate it for the Hall of Shame.

***

The chart has three major components vying for our attention: (1) the stacked columns, (2) the yellow line, and (3) the big red dashed arrow.

The easiest to interpret is the yellow line, which is labeled "Total" in the legend. It displays the annual growth rate of car sales around the globe. The data consist of annual percentage changes in car sales, so the slope of the yellow line represents a change of change, which is not particularly useful.

The big red arrow is making the point that the projected decline in global car sales in 2019 will return the world to the slowdown of 2008-9 after almost a decade of growth.

The stacked columns appear to provide a breakdown of the global growth rate by region. Looked at carefully, you'll soon learn that the visual form has hopelessly mangled the data.

Cnbc_globalcarsales_2006

What is the growth rate for Chinese car sales in 2006? Is it 2.5%, the top edge of China's part of the column? Between 1.5% and 2.5%, the extant of China's section? The answer is neither. Because of the stacking, China's growth rate is actually the height of the relevant section, that is to say, 1 percent. So the labels on the vertical axis are not directly useful to learning regional growth rates for most sections of the chart.

Can we read the vertical axis as global growth rate? That's not proper either. The different markets are not equal in size so growth rates cannot be aggregated by simple summing - they must be weighted by relative size.

The negative growth rates present another problem. Even if we agree to sum growth rates ignoring relative market sizes, we still can't get directly to the global growth rate. We would have to take the total of the positive rates and subtract the total of the negative rates.  

***

At this point, you may begin to question everything you thought you knew about this chart. Remember the yellow line, which we thought measures the global growth rate. Take a look at the 2006 column again.

The global growth rate is depicted as 2 percent. And yet every region experienced growth rates below 2 percent! No matter how you aggregate the regions, it's not possible for the world average to be larger than the value of each region.

For 2006, the regional growth rates are: China, 1%; Rest of the World, 1%; Western Europe, 0.1%; United States, -0.25%. A simple sum of those four rates yields 2%, which is shown on the yellow line.

But this number must be divided by four. If we give the four regions equal weight, each is worth a quarter of the total. So the overall average is the sum of each growth rate weighted by 1/4, which is 0.5%. [In reality, the weights of each region should be scaled to reflect its market size.]

***

tldr; The stacked column chart with a line overlay not only fails to communicate the contents of the car sales data but it also leads to misinterpretation.

I discussed several serious problems of this chart form: 

  • stacking the columns make it hard to learn the regional data

  • the trend by region takes a super effort to decipher

  • column stacking promotes reading meaning into the height of the column but the total height is meaningless (because of the negative section) while the net height (positive minus negative) also misleads due to presumptive equal weighting

  • the yellow line shows the sum of the regional data, which is four times the global growth rate that it purports to represent

 

***

PS. [12/4/2019: New post up with a different visualization.]


This chart tells you how rich is rich - if you can read it

Via twitter, John B. sent me the following YouGov chart (link) that he finds difficult to read:

Yougov_whoisrich

The title is clear enough: the higher your income, the higher you set the bar.

When one then moves from the title to the chart, one gets misdirected. The horizontal axis shows pound values, so the axis naturally maps to "the higher your income". But it doesn't. Those pound values are the "cutoff" values - the line between "rich" and "not rich". Even after one realizes this detail, the axis  presents further challenges: the cutoff values are arbitrary numbers such as "45,001" sterling; and these continuous numbers are treated as discrete categories, with irregular intervals between each category.

There is some very interesting and hard to obtain data sitting behind this chart but the visual form suppresses them. The best way to understand this dataset is to first think about each income group. Say, people who make between 20 to 30 thousand sterling a year. Roughly 10% of these people think "rich" starts at 25,000. Forty percent of this income group think "rich" start at 40,000.

For each income group, we have data on Z percent think "rich" starts at X. I put all of these data points into a heatmap, like this:

Redo_junkcharts_yougovuk_whoisrich

Technical note: in order to restore the horizontal axis to a continuous scale, you can take the discrete data from the original chart, then fit a smoothed curve through those points, and finally compute the interpolated values for any income level using the smoothing model.

***

There are some concerns about the survey design. It's hard to get enough samples for higher-income people. This is probably why the highest income segment starts at 50,000. But notice that 50,ooo is around the level at which lower-income people consider "rich". So, this survey is primarily about how low-income people perceive "rich" people.

The curve for the highest income group is much straighter and smoother than the other lines - that's because it's really the average of a number of curves (for each 10,000 sterling segment).

 

P.S. The YouGov tweet that publicized the small-multiples chart shown above links to a page that no longer contains the chart. They may have replaced it due to feedback.

 

 


Marketers want millennials to know they're millennials

When I posted about the lack of a standard definition of "millennials", Dean Eckles tweeted about the arbitrary division of age into generational categories. His view is further reinforced by the following chart, courtesy of PewResearch by way of MarketingCharts.com.

PewResearch-Generational-Identification-Sept2015

Pew asked people what generation they belong to. The amount of people who fail to place themselves in the right category is remarkable. One way to interpret this finding is that these are marketing categories created by the marketing profession. We learned in my other post that even people who use the term millennial do not have a consensus definition of it. Perhaps the 8 percent of "millennials" who identify as "boomers" are handing in a protest vote!

The chart is best read row by row - the use of stacked bar charts provides a clue. Forty percent of millennials identified as millennials, which leaves sixty percent identifying as some other generation (with about 5 percent indicating "other" responses). 

While this chart is not pretty, and may confuse some readers, it actually shows a healthy degree of analytical thinking. Arranging for the row-first interpretation is a good start. The designer also realizes the importance of the diagonal entries - what proportion of each generation self-identify as a member of that generation. Dotted borders are deployed to draw eyes to the diagonal.

***

The design doesn't do full justice for the analytical intelligence. Despite the use of the bar chart form, readers may be tempted to read column by column due to the color scheme. The chart doesn't have an easy column-by-column interpretation.

It's not obvious which axis has the true category and which, the self-identified category. The designer adds a hint in the sub-title to counteract this problem.

Finally, the dotted borders are no match for the differential colors. So a key message of the chart is buried.

Here is a revised chart, using a grouped bar chart format:

Redo_junkcharts_millennial_id

***

In a Trifecta checkup (link), the original chart is a Type V chart. It addresses a popular, pertinent question, and it shows mature analytical thinking but the visual design does not do full justice to the data story.

 

 


Does this chart tell the sordid tale of TI's decline?

The Hustle has an interesting article on the demise of the TI calculator, which is popular in business circles. The article uses this bar chart:

Hustle_ti_calculator_chart

From a Trifecta Checkup perspective, this is a Type DV chart. (See this guide to the Trifecta Checkup.)

The chart addresses a nice question: is the TI graphing calculator a victim of new technologies?

The visual design is marred by the use of the calculator images. The images add nothing to our understanding and create potential for confusion. Here is a version without the images for comparison.

Redo_junkcharts_hustlet1calc

The gridlines are placed to reveal the steepness of the decline. The sales in 2019 will likely be half those of 2014.

What about the Data? This would have been straightforward if the revenues shown are sales of the TI calculator. But according to the subtitle, the data include a whole lot more than calculators - it's the "other revenues" category in the financial reports of Texas Instrument which markets the TI. 

It requires a leap of faith to believe this data. It is entirely possible that TI calculator sales increased while total "other revenues" decreased! The decline of TI calculator could be more drastic than shown here. We simply don't have enough data to say for sure.

 

P.S. [10/3/2019] Fixed TI.

 

 


Pulling the multi-national story out, step by step

Reader Aleksander B. found this Economist chart difficult to understand.

Redo_multinat_1

Given the chart title, the reader is looking for a story about multinationals producing lower return on equity than local firms. The first item displayed indicates that multinationals out-performed local firms in the technology sector.

The pie charts on the right column provide additional information about the share of each sector by the type of firms. Is there a correlation between the share of multinationals, and their performance differential relative to local firms?

***

We can clean up the presentation. The first changes include using dots in place of pipes, removing the vertical gridlines, and pushing the zero line to the background:

Redo_multinat_2

The horizontal gridlines attached to the zero line can also be removed:

Redo_multinat_3

Now, we re-order the rows. Start with the aggregate "All sectors". Then, order sectors from the largest under-performance by multinationals to the smallest.

Redo_multinat_4

The pie charts focus only on the share of multinationals. Taking away the remainders speeds up our perception:

Redo_multinat_5

Help the reader understand the data by dividing the sectors into groups, organized by the performance differential:

Redo_multinat_6

For what it's worth, re-sort the sectors from largest to smallest share of multinationals:

Redo_multinat_7

Having created groups of sectors by share of multinationals, I simplify further by showing the average pie chart within each group:

Redo_multinat_8

***

To recap all the edits, here is an animated gif: (if it doesn't play automatically, click on it)

Redo_junkcharts_econmultinat

***

Judging from the last graphic, I am not sure there is much correlation between share of multinationals and the performance differentials. It's interesting that in aggregate, local firms and multinationals performed the same. The average hides the variability by sector: in some sectors, local firms out-performed multinationals, as the original chart title asserted.


Women workers taken for a loop or four

I was drawn to the following chart in Business Insider because of the calendar metaphor. (The accompanying article is here.)

Businessinsider_payday

Sometimes, the calendar helps readers grasp concepts faster but I'm afraid the usage here slows us down.

The underlying data consist of just four numbers: the wage gaps between race and gender in the U.S., considered simply from an aggregate median personal income perspective. The analyst adopts the median annual salary of a white male worker as a baseline. Then, s/he imputes the number of extra days that others must work to attain the same level of income. For example, the median Asian female worker must work 64 extra days (at her daily salary level) to match the white guy's annual pay. Meanwhile, Hispanic female workers must work 324 days extra.

There are a host of reasons why the calendar metaphor backfired.

Firstly, it draws attention to an uncomfortable detail of the analysis - which papers over the fact that weekends or public holidays are counted as workdays. The coloring of the boxes compounds this issue. (And the designer also got confused and slipped up when applying the purple color for Hispanic women.)

Secondly, the calendar focuses on Year 2 while Year 1 lurks in the background - white men have to work to get that income (roughly $46,000 in 2017 according to the Census Bureau).

Thirdly, the calendar view exposes another sore point around the underlying analysis. In reality, the white male workers are continuing to earn wages during Year 2.

The realism of the calendar clashes with the hypothetical nature of the analysis.

***

One can just use a bar chart, comparing the number of extra days needed. The calendar design can be considered a set of overlapping bars, wrapped around the shape of a calendar.

The staid bars do not bring to life the extra toil - the message is that these women have to work harder to get the same amount of pay. This led me to a different metaphor - the white men got to the destination in a straight line but the women must go around loops (extra days) before reaching the same endpoint.

Redo_businessinsider_racegenderpaygap

While the above is a rough sketch, I made sure that the total length of the lines including the loops roughly matches the total number of days the women needed to work to earn $46,000.

***

The above discussion focuses solely on the V(isual) corner of the Trifecta Checkup, but this data visualization is also interesting from the D(ata) perspective. Statisticians won't like such a simple analysis that ignores, among other things, the different mix of jobs and industries underlying these aggregate pay figures.

Now go to my other post on the sister (book) blog for a discussion of the underlying analysis.

 

 


Too much of a good thing

Several of us discussed this data visualization over twitter last week. The dataviz by Aero Data Lab is called “A Bird’s Eye View of Pharmaceutical Research and Development”. There is a separate discussion on STAT News.

Here is the top section of the chart:

Aerodatalab_research_top

We faced a number of hurdles in understanding this chart as there is so much going on. The size of the shapes is perhaps the first thing readers notice, followed by where the shapes are located along the horizontal (time) axis. After that, readers may see the color of the shapes, and finally, the different shapes (circles, triangles,...).

It would help to have a legend explaining the sizes, shapes and colors. These were explained within the text. The size encodes the number of test subjects in the clinical trials. The color encodes pharmaceutical companies, of which the graphic focuses on 10 major ones. Circles represent completed trials, crosses inside circles represent terminated trials, triangles represent trials that are still active and recruiting, and squares for other statuses.

The vertical axis presents another challenge. It shows the disease conditions being investigated. As a lay-person, I cannot comprehend the logic of the order. With over 800 conditions, it became impossible to find a particular condition. The search function on my browser skipped over the entire graphic. I believe the order is based on some established taxonomy.

***

In creating the alternative shown below, I stayed close to the original intent of the dataviz, retaining all the dimensions of the dataset. Instead of the fancy dot plot, I used an enhanced data table. The encoding methods reflect what I’d like my readers to notice first. The color shading reflects the size of each clinical trial. The pharmaceutical companies are represented by their first initials. The status of the trial is shown by a dot, a cross or a square.

Here is a sketch of this concept showing just the top 10 rows.

Redo_aero_pharmard

Certain conditions attracted much more investment. Certain pharmas are placing bets on cures for certain conditions. For example, Novartis is heavily into research on Meningnitis, meningococcal while GSK has spent quite a bit on researching "bacterial infections."


Re-thinking a standard business chart of stock purchases and sales

Here is a typical business chart.

Cetera_amd_chart

A possible story here: institutional investors are generally buying AMD stock, except in Q3 2018.

Let's give this chart a three-step treatment.

STEP 1: The Basics

Remove the data labels, which stand sideways awkwardly, and are redundant given the axis labels. If the audience includes people who want to take the underlying data, then supply a separate data table. It's easier to copy and paste from, and doing so removes clutter from the visual.

The value axis is probably created by an algorithm - hard to imagine someone deliberately placing axis labels  $262 million apart.

The gridlines are optional.

Redo_amdinstitution_1

STEP 2: Intermediate

Simplify and re-organize the time axis labels; show the quarter and year structure. The years need not repeat.

Align the vocabulary on the chart. The legend mentions "inflows and outflows" while the chart title uses the words "buying and selling". Inflows is buying; outflows is selling.

Redo_amdinstitution_2

STEP 3: Advanced

This type of data presents an interesting design challenge. Arguably the most important metric is the net purchases (or the net flow), i.e. inflows minus outflows. And yet, the chart form leaves this element in the gaps, visually.

The outflows are numerically opposite to inflows. The sign of the flow is encoded in the color scheme. An outflow still points upwards. This isn't a criticism, but rather a limitation of the chart form. If the red bars are made to point downwards to indicate negative flow, then the "net flow" is almost impossible to visually compute!

Putting the columns side by side allows the reader to visually compute the gap, but it is hard to visually compare gaps from quarter to quarter because each gap is hanging off a different baseline.

The following graphic solves this issue by focusing the chart on the net flows. The buying and selling are still plotted but are deliberately pushed to the side:

Redo_amd_1

The structure of the data is such that the gray and pink sections are "symmetric" around the brown columns. A purist may consider removing one of these columns. In other words:

Redo_amd_2

Here, the gray columns represent gross purchases while the brown columns display net purchases. The reader is then asked to infer the gross selling, which is the difference between the two column heights.

We are almost back to the original chart, except that the net buying is brought to the foreground while the gross selling is pushed to the background.

 


Pay levels in the U.S.

The Wall Street Journal published a graphic showing the median pay levels at "most" public companies in the U.S. here.

Wsj_mediancompanypay

People who attended my dataviz seminar might recognize the similarity with the graphic showing internet download speeds by different broadband technologies. It's a clean, clear way of showing multiple comparisons on the same chart.

You can see the distribution of pay levels of companies within each industry grouping, and the vertical lines showing the sector medians allow comparison across sectors. The median pay levels are quite similar with the energy sector leaning higher, and consumer sector leaning lower.

The consumer sector is extremely heavy on the low side of the pay range. Companies like Universal, Abercrombie, Skechers, Mattel, Gap, etc. all pay at least half their employees less than $6,000. The data is sourced to MyLogIQ. I have no knowledge of how reliable or valid the data are. It's curious to me that Dunkin Brands showed a median of $110K while Starbucks showed $13K.

Wsj_medianpay_dunkinstarbucks

***

I like the interactive features.

The window control lets the user zoom in to different parts of the pay range. This is necessary because of the extremely high salaries. The control doubles as a presentation of the overall distribution of median salaries.

The text box can be used to add data labels to specific companies.

***

See previous discussion of WSJ Graphics.

 


An exercise in decluttering

My friend Xan found the following chart by Pew hard to understand. Why is the chart so taxing to look at? 

Pew_collegeadmissions

It's packing too much.

I first notice the shaded areas. Shading usually signifies "look here". On this chart, the shading is highlighting the least important part of the data. Since the top line shows applicants and the bottom line admitted students, the shaded gap displays the rejections.

The numbers printed on the chart are growth rates but they confusingly do not sync with the slopes of the lines because the vertical axis plots absolute numbers, not rates. 

Pew_collegeadmissions_growthThe vertical axis presents the total number of applicants, and the total number of admitted students, in each "bucket" of colleges, grouped by their admission rate in 2017. On the right, I drew in two lines, both growth rates of 100%, from 500K to 1 million, and from 1 to 2 million. The slopes are not the same even though the rates of growth are.

Therefore, the growth rates printed on the chart must be read as extraneous data unrelated to other parts of the chart. Attempts to connect those rates to the slopes of the corresponding lines are frustrated.

Another lurking factor is the unequal sizes of the buckets of colleges. There are fewer than 10 colleges in the most selective bucket, and over 300 colleges in the largest bucket. We are unable to interpret properly the total number of applicants (or admissions). The quantity of applications in a bucket depends not just on the popularity of the colleges but also the number of colleges in each bucket.

The solution isn't to resize the buckets but to select a more appropriate metric: the number of applicants per enrolled student. The most selective colleges are attracting about 20 applicants per enrolled student while the least selective colleges (those that accept almost everyone) are getting 4 applicants per enrolled student, in 2017.

As the following chart shows, the number of applicants has doubled across the board in 15 years. This raises an intriguing question: why would a college that accepts pretty much all applicants need more applicants than enrolled students?

Redo_pewcollegeadmissions

Depending on whether you are a school administrator or a student, a virtuous (or vicious) cycle has been realized. For the top four most selective groups of colleges, they have been able to progressively attract more applicants. Since class size did not expand appreciably, more applicants result in ever-lower admit rate. Lower admit rate reduces the chance of getting admitted, which causes prospective students to apply to even more colleges, which further suppresses admit rate.