
First ask the right question: the data scientist edition

A reader didn't like this graphic in the Wall Street Journal:


One could turn every panel into a bar chart but unfortunately, the situation does not improve much. Some charts just can't be fixed by altering the visual design.

The chart is frustrating to read: typically, colors are used to signify objects that should be compared. Focus on the brown wedges for a moment: Basic EDA 46%, Data cleaning 31%, Machine learning 27%, etc. Those are proportions of respondents who said they spent 1 to 3 hours a day on the respective tasks. That is one weird way of describing time use. The people who spent 1 to 3 hours a day on EDA do not necessarily overlap with those who spent 1 to 3 hours a day on data cleaning. In addition, there is no summation formula that lets us know how any individual, or the average data scientist, spends his or her time during a typical day.
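To see why the separate percentages cannot be combined into a picture of anyone's day, here is a minimal sketch in Python. It takes the two published shares for the "1 to 3 hours a day" answer (46% for basic EDA, 31% for data cleaning) and computes the smallest and largest share of respondents who could have given that answer to both questions. The function name `overlap_bounds` is made up for illustration; the bounds themselves are the standard ones for two proportions measured on the same group.

```python
def overlap_bounds(p_a, p_b):
    """Smallest and largest possible share of respondents in both groups.

    p_a and p_b are proportions (0 to 1) of the same set of respondents.
    """
    lower = max(0.0, p_a + p_b - 1.0)  # forced overlap when shares exceed 100%
    upper = min(p_a, p_b)              # the smaller group sits inside the larger
    return lower, upper


# Shares who answered "1 to 3 hours a day", as read off the chart.
eda = 0.46
cleaning = 0.31

low, high = overlap_bounds(eda, cleaning)
print(f"Share giving this answer for both tasks: {low:.0%} to {high:.0%}")
```

Under these numbers the overlap could be anywhere from none of the respondents to all 31% of the data-cleaning group, and the chart gives no way to tell which.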


But none of this is the graphics designer's fault.

The trouble with the chart is in the D corner of the Trifecta checkup. The survey question was poorly posed. The data came from a study by O'Reilly Media. They asked questions of this form:

How much time did you spend on basic exploratory data analysis on average?

A. Less than 1 hour a week
B. 1 to 4 hours a week
C. 1 to 3 hours a day
D. 4 or more hours a day

It is not obvious that those four levels are collectively exhaustive. In fact, they aren't. One hour a day for five working days is a total of 5 hours a week. Those who spent between 4 and 5 hours a week have nowhere to go.
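The arithmetic can be checked with a rough sketch in Python. It converts each answer option to a range of hours per week, assuming a five-day working week as above, and looks for ranges that no option covers. The helper `find_gaps` and the way the endpoints are treated (adjacent options are assumed to touch) are illustrative choices, not part of the survey.

```python
DAYS_PER_WEEK = 5  # assumption: a five-day working week

# Each option as (low, high) in hours per week.
options = {
    "A": (0.0, 1.0),                                   # less than 1 hour a week
    "B": (1.0, 4.0),                                   # 1 to 4 hours a week
    "C": (1.0 * DAYS_PER_WEEK, 3.0 * DAYS_PER_WEEK),   # 1 to 3 hours a day
    "D": (4.0 * DAYS_PER_WEEK, float("inf")),          # 4 or more hours a day
}


def find_gaps(intervals):
    """Return the (start, end) ranges not covered by any interval."""
    gaps = []
    covered_up_to = 0.0
    for low, high in sorted(intervals):
        if low > covered_up_to:
            gaps.append((covered_up_to, low))
        covered_up_to = max(covered_up_to, high)
    return gaps


gaps = find_gaps(options.values())
for start, end in gaps:
    print(f"No option covers {start:g} to {end:g} hours a week")
```

Under that assumption the check reports two uncovered ranges: 4 to 5 hours a week, and 15 to 20 hours a week (that is, between 3 and 4 hours a day).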

Further, if one had access to individual responses and added up the hours each respondent reported across all the tasks, it's likely that many of the totals would come to either too many or too few working hours to be believable.

The panels are separate questions which bear no relationship to each other, even though the tasks are clearly related by the fact that there are only so many working hours in a day.

To fix this chart, one must first fix the data. To fix the data, one must ask the right questions.






It is also generally accepted that people's recall of how much time they spent on various activities is fairly poor, and the same probably holds for most things. I have worked with nutritional data, and some people changed their average energy intake by a factor of 3 between surveys. Either they had started Olympic-level training (and these were elderly people) or they were not filling in the diary daily and were failing to recall what they ate.

daniel l

I love it when these newfangled "data scientists" generate atrocious shit like this.

I would say on greenfield projects, I'm like 50% ETL, 20% validation/sniff tests, 15% creating a presentation, 10% doing a presentation, and 5% smoking and muttering to myself after being told to go back to the drawing board....
