First ask the right question: the data scientist edition

A reader didn't like this graphic in the Wall Street Journal:

Wsj_datascientist_timeofday

One could turn every panel into a bar chart but unfortunately, the situation does not improve much. Some charts just can't be fixed by altering the visual design.

The chart is frustrating to read: typically, colors are used to signify objects that should be compared. Focus on the brown wedges for a moment: Basic EDA 46%, Data cleaning 31%, Machine learning 27%, etc. Those are proportions of respondents who said they spent 1 to 3 hours a day on the respective tasks. That is one weird way of describing time use. The people who spent 1 to 3 hours a day on EDA do not necessarily overlap with those who spent 1 to 3 hours a day on data cleaning. In addition, there is no summation formula that lets us know how any individual, or the average data scientist, spends his or her time during a typical day.

***

But none of this is the graphics designer's fault.

The trouble with the chart is in the D corner of the Trifecta checkup. The survey question was poorly posed. The data came from a study by O'Reilly Media. They asked questions of this form:

How much time did you spend on basic exploratory data analysis on average?

A. Less than 1 hour a week
B. 1 to 4 hours a week
C. 1 to 3 hours a day
D. 4 or more hours a day

It is not obvious that those four levels are collectively exhaustive. In fact, they aren't. One hour a day for five working days is a total of 5 hours a week. Those who spent between 4 and 5 hours a week have nowhere to go.
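The gap can be checked mechanically. Here is a minimal sketch that converts each answer option to hours per week and looks for uncovered stretches. It assumes a five-day working week (as in the arithmetic above) and treats hours as continuous; the function name `find_gaps` is my own, for illustration only.

```python
# Express each answer option as a range of hours per week.
# Assumption: a 5-day working week, and hours treated as continuous.
WORK_DAYS = 5

options = {
    "A. Less than 1 hour a week": (0, 1),
    "B. 1 to 4 hours a week": (1, 4),
    "C. 1 to 3 hours a day": (1 * WORK_DAYS, 3 * WORK_DAYS),
    "D. 4 or more hours a day": (4 * WORK_DAYS, float("inf")),
}

def find_gaps(ranges):
    """Return the (start, end) stretches of hours per week that no range covers."""
    gaps = []
    covered_up_to = 0
    for low, high in sorted(ranges):
        if low > covered_up_to:
            gaps.append((covered_up_to, low))
        covered_up_to = max(covered_up_to, high)
    return gaps

gaps = find_gaps(options.values())
print(gaps)
```

Under these assumptions the sketch reports the 4-to-5-hours-a-week gap, and also a second one: anyone averaging between 3 and 4 hours a day (15 to 20 hours a week) fits neither C nor D.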

Further, if one had access to individual responses and added up the time each respondent reported across all the tasks, it's likely that many respondents would appear to have worked either too many hours or too few hours.

The panels are separate questions which bear no relationship to each other, even though the tasks are clearly related by the fact that there are only so many working hours in a day.

To fix this chart, one must first fix the data. To fix the data, one must ask the right questions.

 

 


A quick lesson in handling more than one message on one chart

Between teaching two classes, and a seminar, and logging two coast-to-coast flights, I was able to find time to rethink the following chart from the Wall Street Journal: (link to article)

Uk_drinks

I like the right side of this chart, which helps readers interpret what the alcohol consumption guidelines really mean. When we go out and drink, we order beers, or wine, or drinks - we don't think in terms of grams of alcohol.

The left side is a bit clumsy. The biggest message is that the U.K. has tightened its guidelines. This message is delivered by having the U.K. appear twice in the chart, the only country to repeat. In order to make this clear, the designer highlights the U.K. rows. But the style of highlighting used for the two rows differs, because the current U.K. row has to point to the right side, but not the previous U.K. row. This creates a bit of confusion.

In addition, since the U.K. rows are far apart, figuring out how much the guidelines have changed is more work than desired.

The placement of the bars by gender also doesn't help. A side message is that most countries allow men to drink more than women but the U.K., in revising its guidelines, has followed the Netherlands and Guyana in having the same level for both genders.

***

After trying a few ideas, I think the scatter plot works out pretty well. One advantage is that it does not arbitrarily order the data men first, women second as in the original chart. Another advantage is that it shows the male-female balance more clearly.

Redo_ukalcohol_2

An afterthought: I should have added the words "Stricter" and "Laxer" on the two corners of the chart. This chart shows both that the U.K. is getting stricter and that it joins Guyana and the Netherlands as countries which treat men and women equally when it comes to drinking.

 

 


Efficiency in space usage leads to efficiency in comprehension

Consider the following two charts that illustrate the same data. (I deliberately took out the header text to make a point. The original chart came from the Wall Street Journal.)

Redo_luxurystoresbycountry

To me, the line chart gets to the point more quickly: that Burberry stores are more numerous in those places shown on the left and fewer in those places shown on the right, relative to comparable luxury brands (Prada and Louis Vuitton).

The reason why the tiled bar chart is tougher to decipher is its inefficient use of space. Within each country group, the three places are plotted on two levels, one on the upper level, and two on the lower level. Then the two groups of countries are placed top and bottom. Readers have to first size up each individual group of three countries, then make a comparison between the two groups.

***

From a Trifecta checkup perspective, the bigger issue here is the data. The full story seems to be that those two country groups have different currency experiences... Japan and the continental European countries have weakening currencies, which tends to make their goods cheaper for Chinese consumers. This crucial part of the story is not anywhere on the chart.

In addition, the number of stores is not a telling statistic, because stores may have different areas, and certainly the revenues generated by these stores differ, potentially by country. A measure such as change in same-store sales in each country is more informative.

It is also not true that the distribution of stores is purely a matter of business strategy, as Burberry is a British brand, Prada is Italian and Louis Vuitton is French. They each have more stores in their home countries, which seems very logical.


More chart drama, and data aggregation

Robert Kosara posted a response to my previous post.

He raises an important issue in data visualization - the need to aggregate data, and not plot raw data. I have no objection to that point.

What was shown in my original post are two extremes. The bubble chart is high drama at the expense of data integrity. Readers cannot learn any of the following from that chart:

  • the shape of the growth and subsequent decline of the flu epidemic
  • the beginning and ending date of the epidemic
  • the peak of the epidemic*

* The peak can be inferred from the data label, although there appears to be at least one other circle of approximately equal size, which isn't labeled.

The column chart is low drama but high data integrity. To retain some dramatic element, I encoded the data redundantly in the color scale. I also emulated the original chart in labeling specific spikes.

The designer then simply has to choose a position between these two extremes. This will involve some smoothing or aggregation of the data. Robert showed a column chart that has weekly aggregates, and in his view, his version is closer to the bubble chart.
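For readers who want to try this kind of aggregation, here is a minimal sketch of rolling daily counts up into weekly totals. The daily numbers are made up for illustration (they are not the avian flu data), and the choice to start weeks on Monday is my assumption, not necessarily what Robert did.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical daily counts (made-up numbers, not the avian flu data).
start = date(2015, 4, 6)  # a Monday
daily_counts = [0, 3, 0, 0, 12, 1, 0, 7, 0, 25, 4, 0, 0, 9]
daily = {start + timedelta(days=i): n for i, n in enumerate(daily_counts)}

def weekly_totals(daily):
    """Sum daily values into weeks keyed by the Monday that starts each week."""
    totals = defaultdict(int)
    for day, count in daily.items():
        # weekday() is 0 for Monday, so this steps back to the start of the week.
        week_start = day - timedelta(days=day.weekday())
        totals[week_start] += count
    return dict(sorted(totals.items()))

weekly = weekly_totals(daily)
for week_start, total in weekly.items():
    print(week_start.isoformat(), total)
```

The fourteen spiky daily values collapse into two weekly columns. The grand total is preserved, which is the property that matters for data integrity.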

Robert's version indeed strikes a balance between drama and data integrity, and I am in favor of it. Here is the idea (I am responsible for the added color).

Kosara_avianflu2

***

Where I depart from Robert is how one reads a column chart such as the one I posted:

Redo_avianflu2

Robert thinks that readers will perceive each individual line separately, and in so doing, "details hide the story". When I look at a chart like this, I am drawn to the envelope of the columns. The lighter colors are chosen for the smaller spikes to push them into the background. What might be the problem are those data labels identifying specific spikes; they are a holdover from the original chart--I actually don't know why those specific dates are labeled.

***

In summary, the key takeaway is, as Robert puts it:

the point of this [dataset] is really not about individual days, it’s about the grand totals and the speed with which the outbreak happened.

We both agree that the weekly version is the best among these. I don't see how the reader can figure out grand totals and speed with which the outbreak happened by staring at those dramatic but overlapping bubbles.


Is it worth the drama?

Quite the eye-catching chart this:

Wsj_avianflu

The original accompanied this article in the Wall Street Journal about avian flu outbreaks in the U.S.

The point of the chart appears to be the peak in the flu season around May. The overlapping bubbles were probably used for drama.

A column chart, with appropriate colors, attains much of the drama while retaining the ability to read the data.

Redo_avianflu2

 


Raw data and the incurious

The following chart caught my eye when it appeared in the Wall Street Journal this month:

Wsj_fedratehike

This is a laborious design; much sweat has been poured into it. It's a chart that requires the reader to spend time learning how to read it.

A major difficulty for any visualization of this dataset is keeping track of the two time scales. One scale, depicted horizontally, traces the dates of Fed meetings. These meetings seem to occur four times a year except in 2012. The other time scale is encoded in the colors, explained above the chart. This is the outlook by each Fed committee member of when he/she expects a rate hike to occur.

I find it challenging to understand the time scale in discrete colors. Given that time has an order, my expectation is that the colors should be ordered. Adding to this mess is the correlation between the two time scales. As time treads on, certain predictions become infeasible.

Part of the problem is the unexplained vertical scale. Eventually, I realize each cell is a committee member, and there are 19 members, although two or three routinely fail to submit their outlook in any given meeting.

Contrary to expectation, I don't think one can read across a row to see how a particular member changes his/her view over time. This is because the patches of color would not cluster together so neatly otherwise.

***

After this struggle, all I wanted was some learning from this dataset. Here is what I came up with:

Redo_wsjfedratehike

There is actually little of interest in the data. The most salient point is that a shift in view occurred back in September 2012 when enough members pushed back the year of rate hike that the median view moved from 2014 to 2015. Thereafter, there is a decidedly muted climb in support for the 2015 view.
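To show the kind of summary I have in mind, here is a minimal sketch that reduces each meeting's individual projections to a median year and the share of members backing a given year. The projections below are made up for illustration; they are not the Fed's data, and the names `projections` and `summarize` are hypothetical.

```python
from statistics import median

# Hypothetical projections (made-up numbers): for each meeting, the year in
# which each member expects the first rate hike.
projections = {
    "2012-06": [2013, 2014, 2014, 2014, 2014, 2015, 2015],
    "2012-09": [2013, 2014, 2014, 2015, 2015, 2015, 2016],
}

def summarize(years, target_year):
    """Return the median projected year and the share of members naming target_year."""
    share = sum(1 for y in years if y == target_year) / len(years)
    return median(years), share

summary = {meeting: summarize(years, 2015) for meeting, years in projections.items()}
for meeting, (median_year, share_2015) in summary.items():
    print(f"{meeting}: median {median_year}, share expecting 2015 = {share_2015:.0%}")
```

Two numbers per meeting are enough to show a shift in the median view, which is something the grid of individual cells makes the reader work out alone.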

***

This is an example in which plotting elemental data backfires. Raw data is the sanctuary of the incurious.

 

 


Mosquito, shoebox, and an ingenious apartment design

First, I saw Alberto tweet his design for the Wall Street Journal (below is the English version):

Wsj_englishversion

The yellow space is the size of the smallest "livable" apartment in Hong Kong, known as the "mosquito" apartment. Livability is defined by the real estate developers.

If you've lived in a tropical area like Hong Kong, you'll understand the obsession with mosquitoes. The itching for days! The sneaky little things that suck your blood!

In Manhattan, it seems like we prefer saying the shoebox apartment. By comparison, it's not that scary. It's larger in size too.

The graphic is fantastic as it offers various comparisons with everyday spaces, like a NYC parking space and a basketball court, whose proportions many Americans have some sense of.

***

This chart leads me down an unexpected path. I found a set of very powerful photos, commissioned by a humanitarian association in Hong Kong. Overwhelming. Here's one:

Hongkongabove-2

Yes, that is the entire living space for this family. All of forty square feet.

This article describes the project, as well as links to a number of other equally astounding photos.

These photos are unfair competition for any graphic designer.

***

Finally, I came across an inspiring, ingenious design. Gary Chang, who is an architect in Hong Kong, created his own apartment (344 square feet, nearly nine times the size of the one in the photo, and twice as large as the mosquito apartment) in this amazing, space-saving design.

Through a series of movable walls, and beds, his apartment can be configured in 24 different ways. This is a small multiples layout!

Gary-chang-apartment-domestic-transformer-hong-kond-designboom07

Here is an article about his achievement, together with a video tour of his home. Not to be missed. It defines making something out of nothing.

Here is a little graphic describing certain transformations:

Gary-chang-flat-4_1FqTy_48

Here is a different video on Vimeo. And another.


But or because more information

The Wall Street Journal uses this paired bar chart to show the favorable/unfavorable ratings of potential GOP candidates for the 2016 presidential elections. (link to original)

Wsj_gop_candidates1

This chart form is fine. From this chart, we can easily see which candidates have the strongest favorable ratings. This is precisely how the candidates were sorted (green bars).

But this chart form has one weakness. It's trying to compress three dimensions into one. The dimension of detractors is harder to understand. The gray bars are not sorted, implying that the unfavorable ratings are not well correlated with favorable ratings. There is also a third category (unknowns) that is lurking.

A scatter plot would bring out the correlation between favorable and unfavorable more clearly. In the following version, I coded the unknowns in a green color. The lighter the color, the more unknowns.

Redo_wsj_gopcandidates2

Most candidates have somewhat more supporters than detractors. Trump and Christie are clearly in trouble, with more detractors than supporters, and few unknowns (dark green). Fiorina, who just entered the race, is also weak though she could recover by winning over the substantial number of unknowns.
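The three dimensions are linked by simple arithmetic. Here is a minimal sketch, assuming the favorable, unfavorable and unknown shares add up to 100 percent (my assumption about how the poll was tabulated). The candidates and ratings are made up for illustration and are not the numbers in the chart.

```python
# Hypothetical ratings in percent (made-up numbers, not the poll in the chart).
ratings = {
    "Candidate A": {"favorable": 52, "unfavorable": 28},
    "Candidate B": {"favorable": 30, "unfavorable": 55},
    "Candidate C": {"favorable": 20, "unfavorable": 25},
}

def derive(r):
    """Add the share of unknowns and the net rating (favorable minus unfavorable)."""
    # Assumption: the three categories sum to 100 percent.
    unknown = 100 - r["favorable"] - r["unfavorable"]
    net = r["favorable"] - r["unfavorable"]
    return {**r, "unknown": unknown, "net": net}

derived = {name: derive(r) for name, r in ratings.items()}
for name, r in derived.items():
    print(name, r)
```

In this toy example, Candidate B is under water with few unknowns left to win over, while Candidate C is also net negative but has more than half the respondents still undecided.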

The scatter plot takes more effort to understand but, or because, it conveys more information.


Nice chart from the neck down

I was drawn to this Wall Street Journal chart because of the blue columns.

Wsj_jobgrowth2015sm2

The blue color solves a common problem in time-series plots when the time axis is incomplete. The first quarter of 2015 is dangling. The article is about first-quarter economic performance and so it is appropriate to focus attention on the Q1 columns.

The rest of the chart is filled with Tufte goodies: the clean axes and labels and so on.

The online edition shows a slightly different chart ("Slow Job Growth Tests Economy", April 4, 2015):

Wsj_jobgrowthbymonth_online

This one singles out the last column for attention. Readers are invited to compare the most recent month with any of the other months displayed on the chart. By contrast, in the printed version, readers are guided to compare first-quarters across years. The choice of colors leads readers in different directions.

Another difference between the two charts is the portrayal of the missed expectation in the online version. (I am ignoring the vertical line on top of the T, which is just confusing and unnecessary.) This seems to be the main story in the chart. If so, I'd like to see the forecast data displayed for several other months. By doing so, they would drive home the message that this most recent month is uniquely bad.

The footnote is actually very important (I'd place it in a more visible spot). It is because of seasonal adjustment that readers can compare the heights of any column to any column on these two charts. If the data were not adjusted, then it would be difficult to separate the trend from seasonality. For a refresher on seasonal adjustment, see my post here, and the chapters on economic statistics in Numbersense (link).
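To make the idea concrete, here is a minimal sketch of a textbook additive seasonal adjustment: estimate the trend with a centered 12-month moving average, average the deviations by calendar month, and subtract them. The series is synthetic (made-up numbers), and this is a simplified illustration under those assumptions, not the procedure behind the published payroll figures, which is considerably more elaborate.

```python
# Synthetic monthly series (made-up numbers): a steady upward trend plus a
# seasonal pattern that repeats every 12 months and sums to zero.
seasonal_pattern = [-30, -20, 10, 15, 20, 25, 5, 0, -5, 10, -10, -20]
n_months = 48
raw = [200 + 2 * t + seasonal_pattern[t % 12] for t in range(n_months)]

def centered_moving_average(series, t):
    """2x12 centered moving average at position t (needs 6 months on each side)."""
    window = series[t - 6 : t + 7]
    # The two end months get half weight so every calendar month counts equally.
    return (0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]) / 12

def seasonal_effects(series):
    """Average deviation from the moving average for each calendar month."""
    deviations = [[] for _ in range(12)]
    for t in range(6, len(series) - 6):
        deviations[t % 12].append(series[t] - centered_moving_average(series, t))
    return [sum(d) / len(d) for d in deviations]

effects = seasonal_effects(raw)
adjusted = [raw[t] - effects[t % 12] for t in range(n_months)]

print("raw, first year:     ", raw[:12])
print("adjusted, first year:", [round(x, 1) for x in adjusted[:12]])
```

In the raw series, the January value sits well below December purely because of the seasonal pattern. After adjustment, only the trend remains, so any column can be compared with any other.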

***

The printed edition above is a great chart from the neck down. The headline of the chart lets it down. "The U.S. economy has seen several disappointing first quarters since the recovery began."

The blue columns do not stand out as particularly bad, especially if one considers that there should be a margin of error around each number in the chart.

The column chart is purely about "nonfarm payrolls" which is only one aspect of the U.S. economy. The other data series, GDP, tucked below the column chart, show a set of positive annual numbers which do not fit the headline either.


Planned redundancy

The following Wall Street Journal chart caught my eye the other day: (Link to article)

Wsj_cutcable2015

Looking closely, I realize that the four charts are identical, except for the call-outs. This is a kind of small-multiples in which the same data reside in each panel but the labeling changes. It's planned redundancy but I'm afraid I don't see the point.

The chart compares four different ways to save money by cutting cable. Here is an alternative that places the focus on the number of dollars saved:

Redo_wsj_cutcable2015