A data graphic that solves a consumer problem

Saw this great little sign at Ippudo, the ramen shop, the other day:

Ippudo_board

It's a great example of highly effective data visualization. The names on the board are sake brands. 

The menu (a version of a data table) is the conventional way of displaying this information.

The Question

Customers are selecting a sake. They don't have a favorite, or don't recognize many of these brands. They know a bit about their preferences: I like full-bodied, or I want the dry one. 

The Data

On a menu, the key data are missing. So the first order of business is to find data on full- and light-bodied, and dry and sweet. The pricing data are omitted, possibly because it clutters up the design, or because the shop doesn't want customers to focus on price - or both.

The Visual

The design uses a scatter plot. The customer finds the right quartet, thus narrowing the choices to three or four brands. Then, the positions on the two axes allow the customer to drill down further. 

This user experience is leaps and bounds above scanning a list of names, and asking someone who may or may not be an expert.

Back to the Data

The success of the design depends crucially on selecting the right data. Baked into the scatter plot is the assumption that the designer knows the two factors most influential to the customer's decision. Technically, this is a "variable selection" problem: of all factors determining the brand choice, which two are the most important? 

Think about the downside of selecting the wrong factors. Then, the scatter plot makes it harder to choose the sake compared to the menu. 

 


How to describe really small chances

Reader Aleksander B. sent me to the following chart in the Daily Mail, with the note that "the usage of area/bubble chart in combination with bar alignment is not very useful." (link)

Dailymail-image-a-35_1431545452562

One can't argue with that statement. This chart fails the self-sufficiency test: anyone reading the chart is reading the data printed on the right column, and does not gain anything from the visual elements (thus, the visual representation is not self-sufficient). As a quick check, the size of the risk for "motorcycle" should be about 30 times larger than that of "car"; the size of the risk for "car" should be 100 times larger than that of "airplane". The risk of riding motorcycles then is roughly 3,000 times that of flying in an airplane. 

The chart does not appear to be sized properly as a bubble chart:

Dailymail_travelrisk_bubble

You'll notice that the visible proportion of the "car" bubble is much larger than that of the "motorcycle" bubble, which is one part of the problem.

Nor is it sized as a bar chart:

Dailymail_travelrisk_bar

As a bar chart, both the widths and the heights of the bars vary; and the last row presents a further challenge as the bubble for the airplane does not touch the baseline.

***

Besides the Visual, the Data issues are also quite hard. This is how Aleksander describes it: "as a reader I don't want to calculate all my travel distances and then do more math to compare different ways of traveling."

The reader wants to make smarter decisions about travel based on the data provided here. Aleksandr proposes one such problem:

In terms of probability it is also easier to understand: "I am sitting in my car in strong traffic. At the end in 1 hour I will make only 10 miles so what's the probability that I will die? Is it higher or lower than 1 hour in Amtrak train?"

The underlying choice is between driving and taking Amtrak for a particular trip. This comparison is relevant because those two modes of transport are substitutes for this trip. 

One Data issue with the chart is that riding a motorcycle and flying in a plane are rarely substitutes. 

***

A way out is to do the math on behalf of your reader. The metric of deaths per 1 billion passenger-miles is not intuitive for a casual reader. A more relevant question is what's the chance of dying from the time I spend per year of driving (or riding a plane). Because the chance will be very tiny, it is easier to express the risk as the number of years of travel before I expect to see one death.

Let's assume someone drives 300 days per year, and 100 miles per day so that each year, this driver contributes 30,000 passenger-miles to the U.S. total (which is 3.2 trillion). We convert 7.3 deaths per 1 billion passenger-miles to 1 death per 137 million passenger-miles. Since this driver does 30K per year, it will take (137 million / 30K) = about 4,500 years to see one death on average. This calculation assumes that the driver drives alone. It's straightforward to adjust the estimate if the average occupancy is higher than 1. 

Now, let's consider someone who flies once a month (one outbound trip plus one return trip). We assume that each plane takes on average 100 passengers (including our protagonist), and each trip covers on average 1,000 miles. Then each of these flights contributes 100,000 passenger-miles. In a year, the 24 trips contribute 2.4 million passenger-miles. The risk of flying is listed at 0.07 deaths per 1 billion, which we convert to 1 death per 14 billion passenger-miles. On this flight schedule, it will take (14 billion / 2.4 million) = almost 6,000 years to see one death on average.

For the average person on those travel schedules, there is nothing to worry about. 

***

Comparing driving and flying is only valid for those trips in which you have a choice. So a proper comparison requires breaking down the average risks into components (e.g. focusing on shorter trips). 

The above calculation also suggests that the risk is not evenly spread out throughout the population, despite the use of an overall average. A trucker who is on the road every work day is clearly subject to higher risk than an occasional driver who makes a few trips on rental cars each year.

There is a further important point to note about flight risk, due to MIT professor Arnold Barnett. He has long criticized the use of deaths per billion passenger-miles as a risk metric for flights. (In Chapter 5 of Numbers Rule Your World (link), I explain some of Arnie's research on flight risk.) The problem is that almost all fatal crashes involving planes happen soon after take-off or not long before landing. 

 


Visualizing the 80/20 rule, with the bar-density plot

Through Twitter, Danny H. submitted the following chart that shows a tiny 0.3 percent of Youtube creators generate almost 40 percent of all viewing on the platform. He asks for ideas about how to present lop-sided data that follow the "80/20" rule.

Dannyheiman_twitter_youtubedata

In the classic 80/20 rule, 20 percent of the units account for 80 percent of the data. The percentages vary, so long as the first number is small relative to the second. In the Youtube example, 0.3 percent is compared to 40 percent. The underlying reason for such lop-sidedness is the differential importance of the units. The top units are much more important than the bottom units, as measured by their contribution to the data.

I sense a bit of "loss aversion" on this chart (explained here). The designer color-coded the views data into blue, brown and gray but didn't have it in him/her to throw out the sub-categories, which slows down cognition and adds hardly to our understanding.

I like the chart title that explains what it is about.

Turning to the D corner of the Trifecta Checkup for a moment, I suspect that this chart only counts videos that have at least one play. (Zero-play videos do not show up in a play log.) For a site like Youtube, a large proportion of uploaded videos have no views and thus, many creators also have no views.

***

My initial reaction on Twitter is to use a mirrored bar chart, like this:

Jc_redo_youtube_mirrorbar_lobsided

I ended up spending quite a bit of time exploring other concepts. In particular, I like to find an integrated way to present this information. Most charts, such as the mirrored bar chart, a Bumps chart (slopegraph), and Lorenz chart, keep the two series of percentages separate.

Also, the biggest bar (the gray bar showing 97% of all creators) highlights the least important Youtubers while the top creators ("super-creators") are cramped inside a slither of a bar, which is invisible in the original chart.

What I came up with is a bar-density plot, where I use density to encode the importance of creators, and bar lengths to encode the distribution of views.

Jc_redo_youtube_bar_h_2col

Each bar is divided into pieces, with the number of pieces proportional to the number of creators in each segment. This has the happy result that the super-creators are represented by large (red) pieces while the least important creators by little (gray) pieces.

The embedded tessellation shows the structure of the data: the bottom third of the views are generated by a huge number of creators, producing a few views each - resulting in a high density. The top 38% of the views correspond to a small number of super-creators - appropriately shown by a bar of low density.

For those interested in technicalities, I embed a Voronoi diagram inside each bar, with randomly placed points. (There will be a companion post later this week with some more details, and R code.)

Here is what the bar-density plot looks like when the distribution is essentially uniform:

Jc_redo_youtube_bar_even
The density inside each bar is roughly the same, indicating that the creators are roughly equally important.

 

P.S.

1) The next post on the bar-density plot, with some experimental R code, will be available here.

2) Check out my first Linkedin "article" on this topic.

 

 

 

 

 


Is the visual serving the question?

The following chart concerns California's bullet train project.

California_bullettrain

Now, look at the bubble chart at the bottom. Here it is - with all the data except the first number removed:

Highspeedtrains_sufficiency

It is impossible to know how fast the four other train systems run after I removed the numbers. The only way a reader can comprehend this chart is to read the data inside the bubbles. This chart fails the "self-sufficiency test". The self-sufficiency test asks how much work the visual elements on the chart are doing to communicate the data; in this case, the bubbles do nothing at all.

Another problem: this chart buries its lede. The message is in the caption: how California's bullet train rates against other fast train systems. California's train speed of 220 mph is only mentioned in the text but not found in the visual.

Here is a chart that draws attention to the key message:

Redo_highspeedtrains

In a Trifecta checkup, we improved this chart by bringing the visual in sync with the central question of the chart.


McKinsey thinks the data world needs more dataviz talent

Note about last week: While not blogging, I delivered four lectures on three topics over five days: one on the use of data analytics in marketing for a marketing class at Temple; two on the interplay of analytics and data visualization, at Yeshiva and a JMP Webinar; and one on how to live during the Data Revolution at NYU.

This week, I'm back at blogging.

McKinsey publishes a report confirming what most of us already know or experience - the explosion of data jobs that just isn't stopping.

On page 5, it says something that is of interest to readers of this blog: "As data grows more complex, distilling it and bringing it to life through visualization is becoming critical to help make the results of data analyses digestible for decision makers. We estimate that demand for visualization grew roughly 50 percent annually from 2010 to 2015." (my bolding)

The report contains a number of unfortunate graphics. Here's one:

Mckinseyreport_pageiii

I applied my self-sufficiency test by removing the bottom row of data from the chart. Here is what happened to the second circle, representing the fraction of value realized by the U.S. health care industry.

Mckinseyreport_pageiii_inset

What does the visual say? This is one of the questions in the Trifecta Checkup. We see three categories of things that should add up to 100 percent. With a little more effort, we find the two colored categories are each 10% while the white area is 80%. 

But that's not what the data say, because there is only one thing being measured: how much of the potential has already been realized. The two colors is an attempt to visualize the uncertainty of the estimated proportion, which in this case is described as 10 to 20 percent underneath the chart.

If we have to describe what the two colored sections represent: the dark green section is the lower bound of the estimate while the medium green section is the range of uncertainty. The edge between the two sections is the actual estimated proportion (assuming the uncertainty bound is symmetric around the estimate)!

A first attempt to fix this might be to use line segments instead of colored arcs. 

Redo_mckinseyreport_inset_jc_1

The middle diagram emphasizes the mid-point estimate while the right diagram, the range of estimates. Observe how differently these two diagrams appear from the original one shown on the left.

This design only works if the reader perceives the chart as a "racetrack" chart. You have to see the invisible vertical line at the top, which is the starting line, and measure how far around the track has the symbol gone. I have previously discussed why I don't like racetracks (for example, here and here).

***

Here is a sketch of another design:

Redo_mckinseyreport_jc_2

The center figure will have to be moved and changed to a different shape. This design conveys the sense of a goal (at 100%) and how far one is along the path. The uncertainty is represented by wave-like elements that make the exact location of the pointer arrow appear as wavering.

 

 

 

 


Webinar Wednesday

Lyon_onlinestreaming


I'm delivering a quick-fire Webinar this Wednesday on how to make impactful data graphics for communication and persuasion. Registration is free, at this link.

***

In the meantime, I'm preparing a guest lecture for the Data Visualization class at Yeshiva University Sims School of Management. The goal of the lecture is to emphasize the importance of incorporating analytics into the data visualization process.

Here is the lesson plan:

  1. Introduce the Trifecta checkup (link) which is the general framework for effective data visualizations
  2. Provide examples of Type D data visualizations, i.e. graphics that have good production values but fail due to issues with the data or the analysis
  3. Hands-on demo of an end-to-end data visualization process
  4. Lessons from the demo including the iterative nature of analytics and visualization; and sketching
  5. Overview of basic statistics concepts useful to visual designers

 


Plotted performance guaranteed not to predict future performance

On my flight back from Lyon, I picked up a French magazine, and found the following chart:

French interest rates chart small

A quick visit to Bing Translate tells me that this chart illustrates the rates of return of different types of investments. The headline supposedly says "Only the risk pays". In many investment brochures, after presenting some glaringly optimistic projections of future returns, the vendor legally protects itself by proclaiming "Past performance does not guarantee future performance."

For this chart, an appropriate warning is PLOTTED PERFORMANCE GUARANTEED NOT TO PREDICT THE FUTURE!

***

Two unusual decisions set this chart apart:

1. The tree ring imagery, which codes the data in the widths of concentric rings around a common core

2. The placement of larger numbers toward the middle, and smaller numbers in the periphery.

When a reader takes in the visual design of this chart, what is s/he drawn to?

The designer evidently hopes the reader will focus on comparing the widths of the rings (A), while ignoring the areas or the circumferences. I think it is more likely that the reader will see one of the following:

(B) the relative areas of the tree rings

(C) the areas of the full circles bounded by the circumferences

(D) the lengths of the outer rings

(E) the lengths of the inner rings

(F) the lengths of the "middle" rings (defined as the average of the outer and inner rings)

Here is a visualization of six ways to "see" what is on the French rates of return chart:

Redo_jc_frenchinterestrates_1

Recall the Trifecta Checkup (link). This is an example where "What does the visual say" and "What does the data say" may be at variance. In case (A), if the reader is seeing the ring widths, then those two aspects are in sync. In every other case, the two aspects are disconcordant. 

The level of distortion is visualized in the following chart:

Redo_jc_frenchinterestrates_2

Here, I normalized everything to the size of the SCPI data. The true data is presented by the ring width column, represented by the vertical stripes on the left. If the comparisons are not distorted, the other symbols should stay close to the vertical stripes. One notices there is always distortion in cases (B)-(F). This is primarily due to the placement of the large numbers near the center and the small numbers near the edge. In other words, the radius is inversely proportional to the data!

 The amount of distortion for most cases ranges from 2 to 6 times. 

While the "ring area" (B) version is least distorted on average, it is perhaps the worst of the six representations. The level of distortion is not a regular function of the size of the data. The "sicav monetaries" (smallest data) is the least distorted while the data of medium value are the most distorted.

***

To improve this chart, take a hint from the headline. Someone recognizes that there is a tradeoff between risk and return. The data series shown, which is an annualized return, only paints the return part of the relationship. 

 

 

 


Made in France stereotypes

France is on my mind lately, as I prepare to bring my dataviz seminar to Lyon in a couple of weeks.  (You can still register for the free seminar here.)

The following Made in France poster brings out all the stereotypes of the French.

Made_in_france_small

(You can download the original PDF here.)

It's a sankey diagram with so many flows that it screams "it's complicated!" This is an example of a graphic for want of a story. In a Trifecta Checkup, it's failing in the Q(uestion) corner.

It's also failing in the D(ata) corner. Take a look at the top of the chart.

Madeinfrance_totalexports

France exported $572 billion worth of goods. The diagram then plots eight categories of exports, ranging from wines to cheeses:

Madeinfrance_exportcategories

Wine exports totaled $9 billion which is about 1.6% of total exports. That's the largest category of the eight shown on the page. Clearly the vast majority of exports are excluded from the sankey diagram.

Are the 8 the largest categories of exports for France? According to this site, those are (1) machinery (2) aircraft (3) vehicles (4) electrical machinery (5) pharmaceuticals (6) plastics (7) beverages, spirits, vinegar (8) perfumes, cosmetics.

Compare: (1) wines (2) jewellery (3) perfume (4) clothing (5) cheese (6) baked goods (7) chocolate (8) paintings.

It's stereotype central. Name 8 things associated with the French brand and cherry-pick those.

Within each category, the diagram does not show all of the exports either. It discloses that the bars for wines show only $7 of the $9 billion worth of wines exported. This is because the data only capture the "Top 10 Importers." (See below for why the designer did this... France exports wine to more than 180 countries.)

Finally, look at the parade of key importers of French products, as shown at the bottom of the sankey:

Madeinfrance_topimporters

The problem with interpreting this list of countries is best felt by attempting to describe which countries ended up on this list! It's the list of countries that belong to the top 10 importers of one or more of the eight chosen products, ordered by the total value of imports in those 8 categories only but only including the value in any category if it rises to the top 10 of the respective category.

In short, with all those qualifications, the size or rank of the black bars does not convey any useful information.

***

One feature of the chart that surprised me was no flows in the Wine category from France to Italy or Spain. (Based on the above discussion, you should realize that no flows does not mean no exports.) So I went to the Comtrade database that is referenced in the poster, and pulled out all the wine export data.

How does one visualize where French wines are going? After fiddling around the numbers, I came up with the following diagram:

Redo_jc_frenchwineexports

I like this type of block diagram which brings out the structure of the dataset. The key features are:

  • The total wine exports to the rest of the world was $1.4 billion in 2016
  • Half of it went to five European neighbors, the other half to the rest of the world
  • On the left half, Germany took a third of those exports; the UK and Switzerland together is another third; and the final third went to Belgium and the Netherlands
  • On the right half, the countries in the blue zone accounted for three-fifths with the unspecified countries taking two-fifths.
  • As indicated, the two-fifths (in gray) represent 20% of total wine exports, and were spread out among over 180 countries.
  • The three-fifths of the blue zone were split in half, with the first half going to North America (about 2/3 to USA and 1/3 to Canada) and the second half going to Asia (2/3 to China and 1/3 to Japan)
  • As the title indicates, the top 9 importers of French wine covered 80% of the total volume (in litres) while the other 180+ countries took 20% of the volume

 The most time-consuming part of this exercise was finding the appropriate structure which can be easily explained in a visual manner.

 

 


Big Macs in Switzerland are amazing, according to my friend

Bigmac_chNote for those in or near Zurich: I'm giving a Keynote Speech tomorrow morning at the Swiss Statistics Meeting (link). Here is the abstract:

The best and the worst of data visualization share something in common: these graphics provoke emotions. In this talk, I connect the emotional response of readers of data graphics to the design choices made by their creators. Using a plethora of examples, collected over a dozen years of writing online dataviz criticism, I discuss how some design choices generate negative emotions such as confusion and disbelief while other choices elicit positive feelings including pleasure and eureka. Important design choices include how much data to show; which data to highlight, hide or smudge; what research question to address; whether to introduce imagery, or playfulness; and so on. Examples extend from graphics in print, to online interactive graphics, to visual experiences in society.

***

The Big Mac index seems to never want to go away. Here is the latest graphic from the Economist, saying what it says:

Econ_bigmacindex

The index never made much sense to me. I'm in Switzerland, and everything here is expensive. My friend, who is a U.S. transplant, seems to have adopted McDonald's as his main eating-out venue. Online reviews indicate that the quality of the burger served in Switzerland is much better than the same thing in the States. So, part of the price differential can be explained by quality. The index also confounds several other issues, such as local inflation and exchange rate

Now, on to the data visualization, which is primarily an exercise in rolling one's eyeballs. In order to understand the red and blue line segments, our eyes have to hop over the price bubbles to the top of the page. Then, in order to understand the vertical axis labels, unconventionally placed on the right side, our eyes have to zoom over to the left of the page, and search for the line below the header of the graph. Next, if we want to know about a particular country, our eyes must turn sideways and scan from bottom up.

Here is a different take on the same data:

Redo_jc_econbigmac2018

I transformed the data as I don't find it compelling to learn that Russian Big Macs are 60% less than American Big Macs. Instead, on my chart, the reader learns that the price paid for a U.S. Big Mac will buy him/her almost 2 and a half Big Macs in Russia.

The arrows pointing left indicate that in most countries, the values of their currencies are declining relative to the dollar from 2017 to 2018 (at least by the Big Mac Index point of view). The only exception is Turkey, where in 2018, one can buy more Big Macs equivalent to the price paid for one U.S. Big Mac. compared to 2017.

The decimal differences are immaterial so I have grouped the countries by half Big Macs.

This example demonstrates yet again, to make good data visualization, one has to describe an interesting question, make appropriate transformations of the data, and then choose the right visual form. I describe this framework as the Trifecta - a guide to it is here.

(P.S. I noticed that Bitly just decided unilaterally to deactivate my customized Bitly link that was configured years and years ago, when it switched design (?). So I had to re-create the custom link. I have never grasped  why "unreliability" is a feature of the offering by most Tech companies.)


Education deserts: places without schools still serve pies and story time

I very much enjoyed reading The Chronicle's article on "education deserts" in the U.S., defined as places where there are no public colleges within reach of potential students.

In particular, the data visualization deployed to illustrate the story is superb. For example, this map shows 1,500 colleges and their "catchment areas" defined as places within 60 minutes' drive.

Screenshot-2018-8-22 Who Lives in Education Deserts More People Than You Might Think 2

It does a great job walking through the logic of the analysis (even if the logic may not totally convince - more below). The areas not within reach of these 1,500 colleges are labeled "deserts". They then take Census data and look at the adult population in those deserts:

Screenshot-2018-8-22 Who Lives in Education Deserts More People Than You Might Think 4

This leads to an analysis of the racial composition of the people living in these "deserts". We now arrive at the only chart in the sequence that disappoints. It is a pair of pie charts:

Chronicle_edudesserts_pie

 The color scheme makes it hard to pair up the pie slices. The focus of the chart should be on the over or under representation of races in education deserts relative to the U.S. average. The challenge of this dataset is the coexistence of one large number, and many small numbers.

Here is one solution:

Redo_jc_chronedudesserts

***

The Chronicle made a commendable effort to describe this social issue. But the analysis has a lot of built-in assumptions. Readers should look at the following list and see if you agree with the assumptions:

  • Only public colleges are considered. This restriction requires the assumption that the private colleges pretty much serve the same areas as public colleges.
  • Only non-competitive colleges are included. Precisely, the acceptance rate must be higher than 30 percent. The underlying assumption is that the "local students" won't be interested in selective colleges. It's not clear how the 30 percent threshold was decided.
  • Colleges that are more than 60 minutes' driving distance away are considered unreachable. So the assumption is that "local students" are unwilling to drive more than 60 minutes to attend college. This raises a couple other questions: are we only looking at commuter colleges with no dormitories? Is the 60 minutes driving distance based on actual roads and traffic speeds, or some kind of simple model with stylized geometries and fixed speeds?
  • The demographic analysis is based on all adults living in the Census "blocks" that are not within 60 minutes' drive of one of those colleges. But if we are calling them "education deserts" focusing on the availability of colleges, why consider all adults, and not just adults in the college age group? One further hidden assumption here is that the lack of colleges in those regions has not caused young generations to move to areas closer to colleges. I think a map of the age distribution in the "education deserts" will be quite telling.
  • Not surprisingly, the areas classified as "education deserts" lag the rest of the nation on several key socio-economic metrics, like median income, and proportion living under the poverty line. This means those same areas could be labeled income deserts, or job deserts.

At the end of the piece, the author creates a "story time" moment. Story time is when you are served a bunch of data or analyses, and then when you are about to doze off, the analyst calls story time, and starts making conclusions that stray from the data just served!

Story time starts with the following sentence: "What would it take to make sure that distance doesn’t prevent students from obtaining a college degree? "

The analysis provided has nowhere shown that distance has prevented students from obtaining a college degree. We haven't seen anything that says that people living in the "education deserts" have fewer college degrees. We don't know that distance is the reason why people in those areas don't go to college (if true) - what about poverty? We don't know if 60 minutes is the hurdle that causes people not to go to college (if true).We know the number of adults living in those neighborhoods but not the number of potential students.

The data only showed two things: 1) which areas of the country are not within 60 minutes' driving of the subset of public colleges under consideration, 2) the number of adults living in those Census blocks.

***

So we have a case where the analysis is incomplete but the visualization of the analysis is superb. So in our Trifecta analysis, this chart poses a nice question and has nice graphics but the use of data can be improved. (Type QV)