
First ask the right question: the data scientist edition

A reader didn't like this graphic in the Wall Street Journal:

Wsj_datascientist_timeofday

One could turn every panel into a bar chart but unfortunately, the situation does not improve much. Some charts just can't be fixed by altering the visual design.

The chart is frustrating to read: typically, colors are used to signify objects that should be compared. Focus on the brown wedges for a moment: Basic EDA 46%, Data cleaning 31%, Machine learning 27%, etc. Those are proportions of respondents who said they spent 1 to 3 hours a day on the respective tasks. That is one weird way of describing time use. The people who spent 1 to 3 hours a day on EDA do not necessarily overlap with those who spent 1 to 3 hours a day on data cleaning. In addition, there is no summation formula that lets us know how any individual, or the average data scientist, spends his or her time during a typical day.

***

But none of this is the graphics designer's fault.

The trouble with the chart is in the D corner of the Trifecta checkup. The survey question was poorly posed. The data came from a study by O'Reilly Media. They asked questions of this form:

How much time did you spend on basic exploratory data analysis on average?

A. Less than 1 hour a week
B. 1 to 4 hours a week
C. 1 to 3 hours a day
D. 4 or more hours a day

It is not obvious that those four levels are collectively exhaustive. In fact, they aren't. One hour a day for five working days is a total of 5 hours a week. Those who spent between 4 and 5 hours a week have nowhere to go.
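The gap can be checked mechanically. Here is a minimal sketch that converts the four answer options to hours per week and looks for ranges that no option covers. It assumes a five-day working week, as in the arithmetic above, and the function name `find_gaps` is made up for this illustration. How the survey treated the boundary values is not known, so only strict gaps are reported.

```python
# Answer options from the survey question, converted to hours per week.
# Assumption: a working week has 5 days.
DAYS_PER_WEEK = 5

# (label, lower bound, upper bound) in hours per week; None means unbounded.
options = [
    ("A. Less than 1 hour a week", 0, 1),
    ("B. 1 to 4 hours a week", 1, 4),
    ("C. 1 to 3 hours a day", 1 * DAYS_PER_WEEK, 3 * DAYS_PER_WEEK),
    ("D. 4 or more hours a day", 4 * DAYS_PER_WEEK, None),
]


def find_gaps(options):
    """Return (start, end) ranges of weekly hours that no option covers.

    Options must be listed in increasing order of hours.
    """
    gaps = []
    for (_, _, upper), (_, next_lower, _) in zip(options, options[1:]):
        # A gap exists when the next option starts above where this one ends.
        if upper is not None and next_lower > upper:
            gaps.append((upper, next_lower))
    return gaps


gaps = find_gaps(options)
for start, end in gaps:
    print(f"No option covers between {start} and {end} hours a week")
```

Under these assumptions the check also flags a second hole, between options C and D: someone who spends more than 3 but fewer than 4 hours a day on a task has no answer to pick either.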

Further, if one had access to individual responses and added up each person's answers across tasks, it's likely that many respondents would turn out to have worked either too many hours or too few.

The panels are separate questions which bear no relationship to each other, even though the tasks are clearly related by the fact that there are only so many working hours in a day.

To fix this chart, one must first fix the data. To fix the data, one must ask the right questions.

 

 


Delegate maps need a color treatment

This year's U.S. primary elections have been very entertaining. Delegate maps are a handy way to keep track of the horse race. They provide data to support (or refute) the narratives created by reporters who use words like "landslide", "commanding", etc.

Here’s a delegate map used by the New York Times on the night of Mar 15th when Hillary Clinton won four out of five states, with the fifth (Missouri) being a cliffhanger:

Nyt_delegatemap_2

Other media outlets are using pretty much the same form, with different color schemes.

The typical color scheme encodes two binary variables: one color for each candidate (NYT uses blue for Clinton, green for Sanders in the Democratic race); a lighter shade for who's leading, and a darker shade for the declared winner.

***

These maps are missing one crucial piece of information: the margin of victory. The margin is important because in most of the contests, the delegates are split proportionally.

The same shade of blue was used to describe the decisive victory in Florida (64% to 33%) and the razor-thin victory in Illinois (51% to 49%). This color scheme implies a winner-takes-all criterion.

Nyt_delegatemap_1

Here is a map that includes the margin of victory, computed as the excess number of delegates won in the given state:

Redo_delegatemap_2

For the Democratic race, the narrative is that Clinton built a sizeable lead in pledged delegates in the Southern states; elsewhere, the states have been evenly split or slightly favoring Sanders (within about 10 delegates). Also, the West and Northwest have largely not spoken yet.

Other maps can be created using different measures of the margin, such as the difference in vote proportions.
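As a rough sketch of the vote-proportion version, the snippet below takes the two results quoted earlier (Florida 64% to 33%, Illinois 51% to 49%) and assigns each state a shade level based on the margin in percentage points. The threshold values and the function names `margin_points` and `shade` are hypothetical choices for illustration; they are not taken from the redone map or from any news outlet's legend.

```python
# Vote shares (percent) quoted in the text: (winner, runner-up).
results = {
    "Florida": (64, 33),
    "Illinois": (51, 49),
}


def margin_points(winner_pct, loser_pct):
    """Margin of victory in percentage points."""
    return winner_pct - loser_pct


def shade(margin, thresholds=(5, 15, 30)):
    """Map a margin to a shade level: 0 is lightest, len(thresholds) is darkest.

    The thresholds are hypothetical cut points, chosen only for illustration.
    """
    return sum(margin >= t for t in thresholds)


shades = {state: shade(margin_points(*pcts)) for state, pcts in results.items()}
for state, level in shades.items():
    print(f"{state}: shade level {level}")
```

With these cut points, Florida lands in the darkest shade and Illinois in the lightest, so the two victories no longer look alike on the map.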

I prefer color schemes that reveal the delegate allocation criterion.


Which way to die, the Bard asked #onelesspie

Happy Pi Day! In honor of Xan Gregg, I take aim at another pie chart today.

This monstrosity was found on Vox (link):

Vox_shakespeare_death_chart

 

The data pose a major challenge here: almost all the numbers are equal to one. This could potentially be fixed by aggregation. Or one can pick a more appropriate chart form, like a text cloud:

Redo_jc_shakespeare_deaths

One can grumble about the imprecision of the text cloud, especially when phrases are involved. But which one serves the message more clearly?

 


The state of the art of interactive graphics

Scott Klein's team at ProPublica published a worthy news application called "Hell and High Water" (link). I took some time taking in the experience. It's a project that needs room to breathe.

The setting is Houston, Texas, and the subject is what happens when the next big hurricane hits the region. The reference point is Hurricane Ike, which hit Galveston in 2008.

This image shows the depth of flooding at the height of the disaster in 2008.

Propublica_galveston1

The app takes readers through multiple scenarios. This next image depicts what would happen (according to simulations) if a storm similar to Ike, but with 15 percent stronger winds, were to hit Galveston.

Propublica_galveston2plus

One can also speculate about what might happen if the so-called "Mid Bay" solution is implemented:

Propublica_midbay_sol

This solution is estimated to cost about $3 billion.

***

I am drawn to this project because the designers liberally use some things I praised in my summer talk at the Data Meets Viz conference in Germany.

Here is an example of hover-overs used to annotate text. (My mouse is on the words "Nassau Bay" at the bottom of the paragraph. Much of the Bay would be submerged at the height of this scenario.)

Propublica_nassaubay2

The design has a keen awareness of foreground/background issues. The map uses sparse static labels, indicating the most important landmarks. All other labels are hidden unless the reader hovers over specific words in the text.

I think plotting population density would have been more impactful. With the current set of labels, the perspective is focused on business and institutional impact. I think there is a missed opportunity to highlight the human impact. This can be achieved by coding population density into the map colors. I believe the colors on the map currently represent terrain.

***

This is a successful interactive project. The technical feats are impressive (read more about them here). A lot of research went into the articles, and a huge amount of detail is included in the maps. A narrative flow was carefully constructed, and the linkage between the text and the graphics is among the best I've seen.