McKinsey thinks the data world needs more dataviz talent

Note about last week: While not blogging, I delivered four lectures on three topics over five days: one on the use of data analytics in marketing for a marketing class at Temple; two on the interplay of analytics and data visualization, at Yeshiva and a JMP Webinar; and one on how to live during the Data Revolution at NYU.

This week, I'm back at blogging.

McKinsey publishes a report confirming what most of us already know or experience - the explosion of data jobs that just isn't stopping.

On page 5, it says something that is of interest to readers of this blog: "As data grows more complex, distilling it and bringing it to life through visualization is becoming critical to help make the results of data analyses digestible for decision makers. We estimate that demand for visualization grew roughly 50 percent annually from 2010 to 2015." (my bolding)

The report contains a number of unfortunate graphics. Here's one:

Mckinseyreport_pageiii

I applied my self-sufficiency test by removing the bottom row of data from the chart. Here is what happened to the second circle, representing the fraction of value realized by the U.S. health care industry.

Mckinseyreport_pageiii_inset

What does the visual say? This is one of the questions in the Trifecta Checkup. We see three categories of things that should add up to 100 percent. With a little more effort, we find the two colored categories are each 10% while the white area is 80%. 

But that's not what the data say, because there is only one thing being measured: how much of the potential has already been realized. The two colors are an attempt to visualize the uncertainty of the estimated proportion, which in this case is described as 10 to 20 percent underneath the chart.

If we have to describe what the two colored sections represent: the dark green section is the lower bound of the estimate while the medium green section is the range of uncertainty. The edge between the two sections is the actual estimated proportion (assuming the uncertainty bound is symmetric around the estimate)!

A first attempt to fix this might be to use line segments instead of colored arcs. 

Redo_mckinseyreport_inset_jc_1

The middle diagram emphasizes the mid-point estimate while the right diagram, the range of estimates. Observe how differently these two diagrams appear from the original one shown on the left.

This design only works if the reader perceives the chart as a "racetrack" chart. You have to see the invisible vertical line at the top, which is the starting line, and measure how far around the track the symbol has gone. I have previously discussed why I don't like racetracks (for example, here and here).

***

Here is a sketch of another design:

Redo_mckinseyreport_jc_2

The center figure will have to be moved and changed to a different shape. This design conveys the sense of a goal (at 100%) and how far one is along the path. The uncertainty is represented by wave-like elements that make the exact location of the pointer arrow appear as wavering.

 

 

 

 


Plotted performance guaranteed not to predict future performance

On my flight back from Lyon, I picked up a French magazine, and found the following chart:

French interest rates chart small

A quick visit to Bing Translate tells me that this chart illustrates the rates of return of different types of investments. The headline supposedly says "Only the risk pays". In many investment brochures, after presenting some glaringly optimistic projections of future returns, the vendor legally protects itself by proclaiming "Past performance does not guarantee future performance."

For this chart, an appropriate warning is PLOTTED PERFORMANCE GUARANTEED NOT TO PREDICT THE FUTURE!

***

Two unusual decisions set this chart apart:

1. The tree ring imagery, which codes the data in the widths of concentric rings around a common core

2. The placement of larger numbers toward the middle, and smaller numbers in the periphery.

When a reader takes in the visual design of this chart, what is s/he drawn to?

The designer evidently hopes the reader will focus on comparing the widths of the rings (A), while ignoring the areas or the circumferences. I think it is more likely that the reader will see one of the following:

(B) the relative areas of the tree rings

(C) the areas of the full circles bounded by the circumferences

(D) the lengths of the outer rings

(E) the lengths of the inner rings

(F) the lengths of the "middle" rings (defined as the average of the outer and inner rings)

Here is a visualization of six ways to "see" what is on the French rates of return chart:

Redo_jc_frenchinterestrates_1

Recall the Trifecta Checkup (link). This is an example where "What does the visual say" and "What does the data say" may be at variance. In case (A), if the reader is seeing the ring widths, then those two aspects are in sync. In every other case, the two aspects are discordant.

The level of distortion is visualized in the following chart:

Redo_jc_frenchinterestrates_2

Here, I normalized everything to the size of the SCPI data. The true data is presented by the ring width column, represented by the vertical stripes on the left. If the comparisons are not distorted, the other symbols should stay close to the vertical stripes. One notices there is always distortion in cases (B)-(F). This is primarily due to the placement of the large numbers near the center and the small numbers near the edge. In other words, the larger the number, the smaller the radius!

 The amount of distortion for most cases ranges from 2 to 6 times. 

While the "ring area" (B) version is least distorted on average, it is perhaps the worst of the six representations. The level of distortion is not a regular function of the size of the data. The "sicav monetaries" (smallest data) is the least distorted while the data of medium value are the most distorted.
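The six readings follow directly from the ring geometry, so the distortion can be sketched in a few lines. This is an illustration only: the ring widths and the core radius below are hypothetical (the magazine's actual numbers are not reproduced here), and the function names are made up for this sketch. Rings are listed from the core outward, with the largest width nearest the center as in the French chart, and each reading is normalized to a reference ring before being compared with the ring width.

```python
import math

def ring_readings(widths, core_radius):
    """Compute the six candidate readings (A)-(F) for each ring.

    widths: ring widths listed from the core outward.
    core_radius: radius of the common core (hypothetical parameter).
    """
    readings = {key: [] for key in "ABCDEF"}
    inner = core_radius
    for width in widths:
        outer = inner + width
        readings["A"].append(width)                            # ring width
        readings["B"].append(math.pi * (outer**2 - inner**2))  # ring area
        readings["C"].append(math.pi * outer**2)               # area of full circle
        readings["D"].append(2 * math.pi * outer)              # outer circumference
        readings["E"].append(2 * math.pi * inner)              # inner circumference
        readings["F"].append(math.pi * (outer + inner))        # "middle" circumference
        inner = outer
    return readings

def distortion(widths, core_radius, ref_index=0):
    """Ratio of each normalized reading to the normalized ring width.

    A value of 1.0 means the reading agrees with the data.
    """
    readings = ring_readings(widths, core_radius)
    width_norm = [w / widths[ref_index] for w in widths]
    result = {}
    for key, values in readings.items():
        norm = [v / values[ref_index] for v in values]
        result[key] = [n / w for n, w in zip(norm, width_norm)]
    return result

# Hypothetical data: largest value nearest the center.
example_widths = [6.0, 4.0, 3.0, 2.0, 1.0]
for key, ratios in distortion(example_widths, core_radius=2.0).items():
    print(key, [round(r, 2) for r in ratios])
```

Row (A) stays at 1.0 by construction; every other row drifts away from 1.0 as the rings move outward, which is the pattern the chart above visualizes.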

***

To improve this chart, take a hint from the headline. Someone recognizes that there is a tradeoff between risk and return. The data series shown, which is an annualized return, only paints the return part of the relationship. 

 

 

 


The French take back cinema but can you see it?

I like independent cinema, and here are three French films that come to mind as I write this post: Delicatessen, The Class (Entre les murs), and 8 Women (8 femmes). 

The French people are taking back cinema. Even though they purchased more tickets to U.S. movies than French movies, the gap has been narrowing in the last two decades. How do I know? It's the subject of this infographic:

DataCinema

How do I know? That's not easy to say, given how complicated this infographic is. Here is a zoomed-in view of the top of the chart:

Datacinema_top

 

You've got the slice of orange, which doubles as the imagery of a film roll. The chart uses five legend items to explain the two layers of data. The solid donut chart presents the mix of ticket sales by country of origin, comparing U.S. movies, French movies, and "others". Then, there are two thin arcs showing the mix of movies by country of origin. 

The donut chart has an unusual feature. Typically, the data are coded in the angles at the donut's center. Here, the data are coded twice: once at the center, and again in the width of the ring. This is a self-defeating feature because it draws even more attention to the areas of the donut slices, even though those areas are highly distorted. If the ratios of the areas are accurate when all three pieces have the same width, then varying those widths causes the ratios to shift from the correct ones!
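A small calculation makes the point concrete. The sketch below assumes two hypothetical slices with shares of 60% and 30%, an inner radius of 1, and (in the second case) a ring width that grows linearly with the share; none of these numbers come from the infographic, and the exact rule its designer used for the widths is not known.

```python
import math

def donut_slice_area(share, inner_radius, width):
    """Area of a donut slice whose central angle is proportional to share."""
    angle = 2 * math.pi * share
    outer_radius = inner_radius + width
    return 0.5 * angle * (outer_radius**2 - inner_radius**2)

def area_ratio(share_a, share_b, width_a, width_b, inner_radius=1.0):
    """Ratio of the two slice areas."""
    return (donut_slice_area(share_a, inner_radius, width_a)
            / donut_slice_area(share_b, inner_radius, width_b))

# Hypothetical shares for two slices; the true ratio is 2 to 1.
share_a, share_b = 0.6, 0.3

# Equal ring widths: the area ratio matches the data ratio.
equal = area_ratio(share_a, share_b, width_a=1.0, width_b=1.0)

# Widths that also grow with the share (an assumed linear rule).
varied = area_ratio(share_a, share_b, width_a=share_a, width_b=share_b)

print(round(equal, 2), round(varied, 2))
```

With equal widths the areas preserve the 2-to-1 ratio; once the width also carries the data, the larger slice is inflated well beyond it.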

The best thing about this chart is found in the little blue star, which adds context to the statistics. The 61% number is unusually high, which demands an explanation. The designer tells us it's due to the popularity of The Lion King.

***

The one donut is for the year 1994. The infographic actually shows an entire time series from 1994 to 2014.

The design is most unusual. The years 1994, 1999, 2004, 2009, 2014 receive special attention. The in-between years are split into two pairs, shrunk, and placed alternately to the right and left of the highlighted years. So your eyes are asked to zig-zag down the page in order to understand the trend. 

To see the change of U.S. movie ticket sales over time, you have to estimate the sizes of the red-orange donut slices from one pie chart to another. 

Here is an alternative visual design that brings out the two messages in this data: that French movie-goers are increasingly preferring French movies, and that U.S. movies no longer account for the majority of ticket sales.

Redo_junkcharts_frenchmovies

A long-term linear trend exists for both U.S. and French ticket sales. The "outlier" values are highlighted and explained by the blockbuster that drove them.

 

P.S.

1. You can register for the free seminar in Lyon here. To register for live streaming, go here.
2. Thanks Carla Paquet at JMP for help translating from French.


Saying no thanks to a box of donuts

As I reported last week, the Department of Education for Delaware is running a survey on dashboard design. The survey link is here.

One of the charts being evaluated is a box of donuts, as shown below:

Delaware_doe

I have written before about the problem with donut charts (see here). A box of donuts is worse than one donut. Here, each donut references a school year. The composition by race/ethnicity of the student body is depicted. In aggregate, the composition has not changed drastically although there are small changes from year to year.

In the following alternative, I use side-by-side line charts, sometimes called slopegraphs, to illustrate the change by race/ethnicity.

Redo_delaware_doe

The key decisions are:

  • using slopes to encode the year-to-year changes, as opposed to having readers compute those changes by measuring and dividing
  • using color to show insights (whether the race/ethnicity has expanded, contracted or remained stable across the three years) as opposed to definitions of the data
  • not showing that the percentages within each year sum to 100% as opposed to explicitly presenting this fact in a circular arrangement
  • placing annual data side by side on the same plot region as opposed to separating them in three charts

***

There is still a further question of how big a change from year to year is considered material.

This is a good example of why there is never "complete data." In theory, the numbers on this chart are "complete," and come from administrative records. Even ignoring the possibility that some of the records are missing or incorrect, you still have the issue that the set of students in the system varies from year to year, so a 1 percent increase in the proportion of Hispanic students may or may not indicate a real demographic trend.

 

 


When design goes awry

One can't accuse the following chart of lacking design. Strong is the evidence of departing from convention but the design decisions appear wayward. (The original link on Money here)

Mc_cellphones_money17

 

The donut chart (right) has nine sections. Eight of the sections (excepting A) have clearly all been bent out of shape. It turns out that section A does not have the right size either. The middle gray circle is not really in the middle, as seen below.

Redo_mc_cellphone

The bar charts (left) suffer from two ills. Firstly, the full width of the chart is at the 50 percent mark, so readers are forced to read the data labels to understand the data. Secondly, only the top two categories are shown, thus the size of the whole is lost. A stacked bar chart would serve better here.

Here is a bardot chart; the "dot" part of it makes it easier to see a Top 2 box analysis.

Redo_jc_mc_cellphone_2

I explain the bardot chart here.

 

 PS. Here is Jamie's version (from the comment below):

Jamie_mc_cellphone

 

 


Layered donuts have excess fats and oils

Via Twitter, Nicholas S. sent this chart:

Usda_donutchart

It's a layered donut. There isn't much context here except that the chart comes from USDA. Judging from the design, I surmise that the key message is the change in proportion by food groups between 1970 and 2014. I am assuming that these food groups are exhaustive so that it makes sense to put them in a donut chart, with all pieces adding up to 100%.

The following small-multiples line chart conveys most of the information:

Redo_usdadonutchart_jc

The story is the big jump in "Added fats and oils".  In the layered donut, the designer highlighted this by a moire effect, something to be avoided.

Note the parenthetical 2010 next to the Added fats and oils label. The data for all other food groups come from 2014 but the number for the most important category is four years older. The chart would be more compelling if they used 2010 data for everything.

One piece of information is ostensibly absent in the line chart version - the growth in the size of the pie. The total of the data increased about 20% from 1970 to 2014. In theory, the layered donut can convey this growth by the perimeters of the circles. But it doesn't appear that the designer saw this as an important insight since the total area of the outer donut is clearly more than 20% larger than the area of the inner donut.

 


An unsuccessful adaptation of a classic

Found this chart in Hemispheres magazine on board a United flight:

United_sfemploy_sm

A quick self-sufficiency test reveals the biggest shortcoming of this visual presentation.

United_sfemployment_sufficiency

What would you guess is the difference in areas between the two white-ish sectors (pointing at 9 o'clock and 2 o'clock)? The actual numbers are 18.3% and 12.5%. So roughly, if one takes the 2-o'clock sector (right), halves it and adds it back to itself, one should obtain the area of the 9-o'clock sector (left). Clearly, the piece on the left is much too big.

The following chart shows the index of exaggeration increasing with the value of the data. (For example, the highest value of 18.3% is about 8 times the lowest value of 2.3% but the ratio of the areas depicted is ~500 times.)

United_employment_exag

The distortion is larger than usual because the designer encodes the data twice, once in the angle of the sector, and again in the radius. Both those quantities contribute to the area of a sector.
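The arithmetic behind the exaggeration can be sketched as follows. The code assumes that both the angle and the radius are exactly proportional to the data value (an assumption about the designer's method, not something confirmed by the magazine); under that assumption the sector area grows with the cube of the value. The function name and scaling constants are made up for this illustration.

```python
import math

def sector_area(value_pct, radius_per_pct=1.0):
    """Area of a sector whose angle AND radius are proportional to the value.

    Assumed encoding: angle = share of the full circle, radius = value * scale.
    """
    angle = 2 * math.pi * value_pct / 100.0
    radius = radius_per_pct * value_pct
    return 0.5 * angle * radius**2

highest, lowest = 18.3, 2.3  # the two values quoted in the post

value_ratio = highest / lowest
area_ratio = sector_area(highest) / sector_area(lowest)

print(round(value_ratio, 2), round(area_ratio))
```

Because the area ratio equals the value ratio cubed, a gap of roughly 8 times in the data turns into a gap of roughly 500 times on the page.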

Readers must look at the data in order to read this chart properly; therefore, the visual elements are not self-sufficient. Further, if readers chose to perceive the relative sizes of the sectors, they would misread the data massively.

***

The designer was probably inspired by the Nightingale rose diagram (link to Wikipedia):

800px-Nightingale-mortality

In the original, Nightingale does not encode data into the angles. The circle is divided evenly into 12 pieces to display the 12 months of the year (She might have taken into account 28-31 days; it's hard to tell by inspection). The data is encoded once along the radial axes.

Another difference between the two charts is the ordering of the data. In Nightingale's version, the order is logically determined by the passing of time. In the Hemispheres chart, the order is chosen based on taste. A more natural order would be by the proportion of employment but I think the resulting chart would look like a snail's shell, or worse. I must say a more balanced "rose diagram" looks nicer but it forces my eyes to jump around to answer a simple question such as which are the top three employment sectors in San Francisco.


Two charts that fail self-sufficiency

My twitter followers have been sending in several howlers.

Twitter (link) made a bunch of bold claims about its own influence by using the number of tweets about the Oscars as fodder. They also adopt the euphemism common to the digital marketing universe, the so-called "view", which, to their credit, they define as "how many times tweets are displayed to users". Yes, you read that right, displaying is the same as viewing in this world - and Twitter is just a follower, not a trend setter, here.

For @dtellom, it is this bubble chart about the Ellen tweet that displeased him:

Twitter_ellenimpressions_0

 

In the meantime, @wilte found this unfortunate donut chart, created by PWC in the Netherlands.

PWCG_donut

Both designers basically appropriated a graphical form and deprived it of data. In one, the designer threw the concept of scale to the wind. In the other, the designer dumped the law of total probability. In either case, the fundamental rationale for the particular graphical form is sacrificed.

Both are examples that fail our self-sufficiency test. This test says if a visual display cannot be understood unless the entire data set is printed on the chart, then why create a visual display? In both charts, if you block out the numbers, you are left with nothing!

***

The PWC chart was submitted by @graphomate, who also submitted the following KPMG chart:

KPMG_donut

The complaint was the total adding up to 101%. I'm not really bothered by this as it is a rounding issue. That said, I like to "hide" such rounding issues. I have never understood why it is necessary to display the imperfection. Flip a coin and remove the decimals from one of the categories!
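For readers who prefer something more systematic than a coin flip, one deterministic alternative is the largest-remainder method: round every share down, then hand the missing points to the categories with the largest decimals. This is a swapped-in technique, not what KPMG or the post prescribes, and the shares below are hypothetical, chosen so that naive rounding adds up to 101.

```python
import math

def round_to_total(shares, total=100):
    """Round shares to integers that sum to `total` (largest-remainder method)."""
    floors = [math.floor(s) for s in shares]
    shortfall = total - sum(floors)
    # Categories ordered by descending fractional part.
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - floors[i],
                   reverse=True)
    for i in order[:shortfall]:
        floors[i] += 1
    return floors

# Hypothetical shares that sum to 100 before rounding.
shares = [40.6, 30.7, 18.8, 9.9]

naive = [round(s) for s in shares]
adjusted = round_to_total(shares)

print(naive, sum(naive))
print(adjusted, sum(adjusted))
```

The category with the smallest decimal part is the one that loses its rounding, which is the same effect as dropping the decimals from one category, only without leaving the choice to chance.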


The exception to the rule against dual axes

Dual axes are almost always a bad idea. But there is one situation in which I'd use them.

***

Last week, Alberto Cairo (link) engaged in a Twitter/blogging debate about a chart that first appeared in Reuters concerning the state of the woman CEO in the Fortune 500 companies. Here is the chart under discussion:

Original_women_ceo_left

This chart already is cleaner and more useful than the original original, which came from a research report from Catalyst (link):

Catalyst_us_ceos

Jonathan Keller re-made the Reuters chart as follows:

Keller_women_ceo_left

 

Cairo Jorge Camões contributed this version:

  Cairo_women_ceo_left

The Voila blog (link) has yet another take:

Voila_women_ceo_left

Then Chris Moore, responding to Cairo, created this view and also left some insightful comments:

Women_ceo_cmoore

***

What's at stake here? There are really three related topics of discussion.

First, there is the matter of the upper limit of the vertical axis. Three solutions were suggested: 100 percent, 50 percent, and 4 percent. (Cairo at one point suggested 25 percent, which can be wrapped into the 50 percent bucket.) In reality, this is an argument over which of two key messages should be emphasized. The first message is that women still comprise a pathetically small proportion of Fortune 500 CEOs. The second message is more hopeful, that the growth in this proportion has been quite rapid since 1995.

All versions of the chart actually display both messages. In the Reuters chart (as well as Moore and Cairo), the message about the absolute proportion of women is given as an annotation while the Keller and Voila versions extend the vertical axis, thus encoding this message directly to the chart. Conversely, the Keller and Voila versions deemphasize the growth in proportions, and so I'd have preferred to see a note about that growth when using their versions.

Voila selects a 50% upper limit because the 50/50 split has an intuitive meaning in the context of gender balance. Because the resulting chart is so visually arresting, and so biased to one of the two key messages, I'd only consider it if the point of the display is to draw attention to the female deficit.

***

The second disagreement is in using absolute counts versus relative proportions. Moore chose absolute counts. I am in this camp as well. This is primarily because we are talking about Fortune 500 and the 500 number is an idee fixe. In Moore's version, I find the data labels distracting since all the numbers are small and insignificant.

Finally, the linkage between the absolute and the relative numbers also produces multiple solutions. Cairo's post pinpoints this issue. His solution is to include an inset pie chart with an arrow to explicitly link the two views. Moore likes the inset idea, but experimented with a donut chart or a partition in place of the pie chart. He also removes the explicit guiding arrow.

***

It turns out this dataset is perfectly made for the dual axes. The absolute counts and relative proportions are in one-to-one correspondence because it's really only one data series expressed twice. This happy situation leads to one line that can be cross-referenced on two axes, one side showing counts and the other side showing proportions. This is shown in my version below (the orange line).

Redo_women_ceo

In addition to having two axes, I have plotted two related data series. The second series (in red) shows the incremental change in the number of women CEOs from the previous year (also shown in both counts and proportions).
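The reason the two axes are harmless here is that one is a fixed rescaling of the other. A minimal sketch of that mapping follows; the function names and tick values are hypothetical. In a plotting library, a forward/inverse pair like this is the kind of thing one would hand to a secondary-axis feature (for instance, matplotlib's `secondary_yaxis` accepts a `functions=(forward, inverse)` argument), though the chart above was not necessarily made that way.

```python
TOTAL_COMPANIES = 500  # the Fortune 500 has a fixed size

def count_to_percent(count):
    """Map a number of CEOs to a percentage of the Fortune 500."""
    return 100.0 * count / TOTAL_COMPANIES

def percent_to_count(percent):
    """Inverse mapping: percentage of the Fortune 500 back to a count."""
    return percent * TOTAL_COMPANIES / 100.0

# Hypothetical tick marks for the count axis and their matching labels
# on the proportion axis.
count_ticks = [0, 5, 10, 15, 20, 25]
percent_ticks = [count_to_percent(c) for c in count_ticks]

for count, pct in zip(count_ticks, percent_ticks):
    print(count, "CEOs =", pct, "percent")
```

Every tick on one side lines up with exactly one tick on the other, so a single line serves both axes without the usual ambiguity of dual-axis charts.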

The first series (the same one everyone plotted) draws attention to the first message, that the growth rate of women CEOs is quite strong since 1995. The second series is a bit of a downer on that message, suggesting that from the absolute count perspective, the progress (only one or two additions per year) has been painfully slow, and not that impressive.

Thanks again to Alberto for making me aware of this discussion. This has been fun!

 

PS. I have left out the other chart and may return to it in a future post.


What's in a cronut? Let me find out

Analyticsseo_ga

Reader Ross S. did not join the line for this cronut, illustrating the popularity of different makers of tracking software on 1.3 million websites.

Original by Analytics SEO is here.

***

The biggest beef I have with this cronut is the quality of the data. As I read their description of the underlying data, I see several red flags.

The analysis is hobbled by ignoring the competitive landscape in tracking software. Google Analytics carves out a huge share of the market by virtue of offering a richly featured product for free. (They justify this by establishing a gigantic spying operation on unsuspecting users.) However, industry insiders know that Omniture (owned by Adobe) is the heavyweight enterprise solution, with a complete feature set.

In other words, most of the 670,000 "customers" of Google Analytics are tiny websites; moreover, a lot of large websites maintain Google Analytics alongside Omniture since the former is free. It would be great if the researcher gave us one of two alternative views of market share: the share of revenues in the tracking software market; and the share of e-commerce revenues represented by the customers of each tracking software vendor. These two views give a fuller picture of the competitive landscape.

You'll notice this is the same game Google is playing in the mobile universe. Android has the most users but Apple makes the bulk of revenues.

***

The SEO agency says the chart is "based on 1.3 million e-commerce websites in May 2013". Are there really 1.3 million websites out there selling us stuff? How do they define e-commerce? Is NYTimes.com an e-commerce website, for example? Or facebook.com for that matter?

In the summary, they made a pretty startling claim--that "a large number of websites have no tracking software at all". The only problem is readers can't find out what proportion of websites don't track users. The data in the cronut excluded sites without tracking, which is a big problem.

***

Here is the link to the annual Top 500 Retailers report by Internet Retailer magazine. In Sep 2011, they found that 217 out of the top 500 use Omniture, 161 use Google Analytics, and 103 use Coremetrics (now owned by IBM).

Another place to look for corroborating evidence is Google Trends, which measures the popularity of search keywords. The relative order of the major vendors (excluding Google Analytics) does not match well with the data shown by Analytics SEO.

Googletrends_on_tracking

Compared to:

Analyticsseo_gatabletop

Coremetrics is way down in the list compiled by Analytics SEO.