One piece of advice I give to those wanting to get into data visualization is to trash the defaults (see the last part of this interview with me). Jon Schwabish, an economist with the government, gives a detailed example of how this is done in a guest blog on the Why Axis.
Here are the highlights of his piece.
He starts with a basic chart, published by the Bureau of Labor Statistics. You can see the hallmarks of an Excel chart made with the default settings. The blue, red, green color scheme is most telling.
Just by making small changes, like using tints as opposed to different colors, using columns instead of bars, reordering the industry categories, and placing the legend text next to the columns, Schwabish made the chart more visually appealing and more effective.
The final version uses lines instead of columns, which will outrage some readers. It is usually true that a grouped bar chart should be replaced by overlaid line charts, and this should not be limited to so-called discrete data.
Schwabish included several bells and whistles. The three data points are not evenly spaced in time. The year-on-year difference is separately plotted as a bar chart on the same canvas. I'd consider using a line chart here as well... and lose the vertical axis since all the data are printed on the chart (or else, lose the data labels).
This version is considerably cleaner than the original.
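To make the swap from grouped bars to overlaid lines concrete, here is a minimal matplotlib sketch. The industry names and numbers are hypothetical stand-ins for the BLS data, not the actual figures; the point is the structure: one line per category, uneven years kept at their true positions, and direct labels in place of a legend.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Hypothetical numbers standing in for the BLS industry data
years = [2000, 2007, 2013]  # note: unevenly spaced in time
industries = {
    "Manufacturing": [14.2, 11.8, 10.1],
    "Retail": [11.5, 11.3, 11.2],
    "Health care": [10.8, 12.6, 14.0],
}

fig, ax = plt.subplots()
for name, values in industries.items():
    line, = ax.plot(years, values, marker="o")
    # Label each line directly at its right end, so no separate legend is needed
    ax.annotate(name, (years[-1], values[-1]), xytext=(5, 0),
                textcoords="offset points", va="center", color=line.get_color())

ax.set_xticks(years)  # keep the uneven spacing honest on the axis
ax.set_ylabel("Share of employment (%)")
fig.savefig("grouped_lines.png")
```

Plotting years at their true positions (rather than as evenly spaced categories) is what a grouped bar chart cannot do.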
I noticed that the first person to comment on the Why Axis post said that internal BLS readers resist more innovative charts, claiming "they don't understand it". This is always a consideration when departing from standard chart types.
Another reader likes the "alphabetical order" (so to speak) of the industries. He raises another key consideration: who is your audience? If the chart is only intended for specialist readers who expect to find certain things in certain places, then the designer's freedom is curtailed. If the chart is used as a data store, then the designer might as well recuse him/herself.
Rick (via Twitter) tells me he is baffled by this chart that showed up in Financial Review:
I'm baffled as well. What might the designer have in mind?
Based on cues such as the length of the curves, one would expect the US, Singapore, Japan, etc. to be leaders and India and China to be laggards. But what is being plotted on the vertical axis? It's not explained.
The title of the chart seems to indicate there is a time dimension but it's not on the horizontal axis where you'd expect it. The vertical axis does not appear to be time either, as it runs negative. The length of the lines could encode time but it is counterintuitive since China's line should then be much longer than that of the U.S., given its history.
Finally, how does one explain the placement of the callout box noting China's GDP per capita? It literally points to nowhere.
Notice that I have indexed every metric against the league average. This is shown in the first panel. I use a red dot to warn readers that the direction of this metric is opposite to the others (left of center is a good thing!).
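Indexing against the league average is a one-line computation. The sketch below uses made-up quarterback numbers (the metric names and values are illustrative, not real statistics); a ratio of 1.0 means exactly average, and interceptions are flagged as the one metric where lower is better rather than silently inverted.

```python
# Hypothetical quarterback metrics and league averages (illustrative numbers)
metrics = {"pass_yards": 3200, "td": 22, "interceptions": 10, "rush_yards": 250}
league_avg = {"pass_yards": 3500, "td": 24, "interceptions": 12, "rush_yards": 180}

# Index each metric against the league average: 1.0 = exactly average
indexed = {m: metrics[m] / league_avg[m] for m in metrics}

# For interceptions, lower is better, so flag the reversed direction
# (the red dot in the chart) rather than flipping the ratio.
for m, v in indexed.items():
    direction = "lower is better" if m == "interceptions" else "higher is better"
    print(f"{m}: {v:.2f}x league average ({direction})")
```

Keeping the raw ratio and marking the direction preserves the reader's ability to recover the underlying numbers.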
You can immediately make a bunch of observations:
Alex Smith was quite poor, except for interceptions.
Colin Kaepernick had passing statistics similar to Smith's. His only advantage over Smith was the rushing.
Joe Flacco, as we noted before, is as average as it goes (except for rushing yards).
Tyrod Taylor is here to remind us that we have to be careful about backup players being included in the same analysis.
The second version is a heatmap.
This takes inspiration from the fact that any serious reader of the spider chart will be reading the eight spokes (dimensions) separately. Why not plot these neatly in columns and use color to help us find the best and worst?
Imagine this to be a large table with as many rows as there are quarterbacks. You will be able to locate the red (hot) zones quickly. You can also scan across a row to understand that player's performance relative to the average, on every metric.
I like this visualization best, primarily because it scales beautifully.
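A minimal heatmap along these lines can be sketched in matplotlib. The player list is from the post but the metric names and values below are randomly generated placeholders, not real data; the design choices to notice are the diverging colormap centered on the league average and one metric per column.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical indexed metrics (player x metric); 1.0 = league average
players = ["Smith", "Kaepernick", "Flacco", "Taylor"]
metrics = ["Comp %", "Yds/Att", "TD", "INT", "Rating", "Rush Yds"]
rng = np.random.default_rng(0)
data = rng.normal(1.0, 0.2, size=(len(players), len(metrics)))

fig, ax = plt.subplots()
# Diverging colormap centered near 1.0 so red marks the "hot" zones
im = ax.imshow(data, cmap="RdBu_r", vmin=0.5, vmax=1.5)
ax.set_xticks(range(len(metrics)))
ax.set_xticklabels(metrics, rotation=45, ha="right")
ax.set_yticks(range(len(players)))
ax.set_yticklabels(players)
fig.colorbar(im, ax=ax, label="Ratio to league average")
fig.tight_layout()
fig.savefig("qb_heatmap.png")
```

Because each player is just one more row, this layout scales to the full roster of quarterbacks in a way a spider chart cannot.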
The final version is a profile chart, sometimes called a parallel coordinates plot. While I am an advocate of profile charts, they really only work when you have a small number of things to compare.
I like many aspects of this exercise. This chart displays the results of an experiment conducted by a computer games company to show that the new build ("249") renders frames faster than the older build ("248"). The messages of the chart are clear: the 249 build (blue bars) is substantially faster, with over 80% of frames rendering in 7 milliseconds or fewer under 249 compared with less than 40% under 248; less obviously, the variance of frame times is also significantly smaller.
The slight problem is that readers probably have to read the text to grasp most of the above.
In the text, the author explains how to turn time per frame into frames per second, the more common way of measuring rendering speed. The formula is 1000 divided by time per frame. Wouldn't it be better if the chart plotted fps directly?
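The conversion is trivial to automate. Using the formula from the text (1000 divided by milliseconds per frame), a helper like the following would let the chart plot fps directly:

```python
def fps_from_frame_time(ms_per_frame):
    """Convert milliseconds per frame into frames per second."""
    return 1000.0 / ms_per_frame

# The two frame times read off the chart
print(round(fps_from_frame_time(7), 1))    # 142.9 fps
print(round(fps_from_frame_time(10.5), 1)) # 95.2 fps
```

So the 7 ms vs 10.5 ms comparison becomes roughly 143 fps vs 95 fps in the units readers actually use.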
When it comes to presenting distributions (or variability), the cumulative chart is more useful but it also is harder for readers to comprehend. For example:
The beauty of this chart is that one can take any point on the vertical axis, say the 80% level, and read off the comparative values of 7 milliseconds for the blue line (249) and 10.5 ms for the red (248). That means 80% of the 249 frames were rendered in fewer than 7 ms, compared with 10.5 ms for 248 frames.
Alternatively, taking a point on the horizontal axis, say 5 milliseconds, one can see that only about 8% of 248 frames were rendered within that threshold, compared with about 30% of 249 frames.
The steeper the ascent of the S-curve, the more efficient is the rendering.
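Both ways of reading the S-curve correspond to evaluating an empirical cumulative distribution. The sketch below uses simulated frame times (the normal distributions and their parameters are my assumptions, chosen only to mimic the shape of the two builds' data):

```python
import numpy as np

# Hypothetical frame times (ms) for the two builds, roughly mimicking the chart
rng = np.random.default_rng(1)
frames_248 = rng.normal(11.0, 3.0, 1000)   # older build: slower, more variable
frames_249 = rng.normal(6.5, 1.5, 1000)    # new build: faster, tighter

def ecdf(sample, x):
    """Fraction of frames rendered in x ms or fewer (one point on the S-curve)."""
    return float(np.mean(np.asarray(sample) <= x))

# Reading the chart horizontally: at 7 ms, what share of frames is done?
share_248 = ecdf(frames_248, 7.0)
share_249 = ecdf(frames_249, 7.0)
print(f"248: {share_248:.0%} of frames within 7 ms")
print(f"249: {share_249:.0%} of frames within 7 ms")
```

A steeper S-curve simply means the sample is concentrated in a narrow band of frame times, which is the low-variance claim in visual form.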
Dona Wong asked me to comment on a project by the New York Fed visualizing funding and expenditure at NY and NJ schools. The link to the charts is here. You have to click through to see the animation.
Here are my comments:
I like the "Takeaways" section up front, which uses words to tell readers what to look for in the charts to follow.
I like the stutter steps that are inserted into the animation. This gives me time to process the data. The point of these dynamic maps is to showcase the changes in the data over time.
I really, really want to click on the green boxes (the legend) and have the corresponding school districts highlighted. In other words, turning the legend into something functional. Tool developers, please take note!
The other options on the map are federal, state and local shares of funding, given in proportions. These are controlled by the three buttons above. This is a design decision that privileges showing how federal funds are distributed across districts and across time. The tradeoff is that it's harder to comprehend the mix of sources of funds within each district over time.
I usually like to flip back and forth between actual values and relative values. I find that both perspectives provide information. Here, I'd like to see dollars and proportions.
I also find the line charts to be much clearer but the maps are more engaging. Here is an example of the line chart: (the blue dashed line is the New York state average)
After looking at these charts, I also want to see a bivariate analysis. How is funding per student and expenditure per student related?
Note: The winner of the Book Quiz Round 2 was announced on my book blog. Congratulations to the winners. You can get your own copy of Numbersense here.
A common piece of advice for anyone living in the U.S. is "read the fine print." If you receive a notice or see an ad, and there is an asterisk or some copy in almost invisible font located at the bottom of the page, you better pull out your magnifying glass.
If you are a data analyst, you better have a magnifying glass in your pocket at all times. One of the recurring themes in Numbersense is that details matter... a lot. This is particularly relevant to Chapters 6 and 7 on economic data.
Last week, on the first Friday of the month, the jobs report came out. For the best reporting on the data itself, with succinct commentary but no hand-waving, I go to Calculated Risk blog.
One of the charts highlighted (in this post) is the unemployment rate by educational attainment. This is the chart that leads to horribly misleading statements saying that the solution to the unemployment crisis is more education. I ranted about this before--see here and here.
Taking this chart at face value, you'd say that the unemployment rate is lower, the more education one has. One can also say that the unemployment rate is less volatile, the more education one has.
Bill makes two succinct comments, basically letting his readers know this chart is next to worthless.
1. Although education matters for the unemployment rate, it doesn't appear to matter as far as finding new employment - and the unemployment rate is moving sideways for those with a college degree!
The issue behind this is the "cohort effect". The chart above aggregates everyone from 25 years old and over. This means it treats equally people who just graduated from college last year and people who got their degrees thirty years ago. Why does this matter? A jobs recession hits certain types of people harder than others, and one important determinant is work experience (another would be the industry one works in). The low unemployment rate for all college graduates masks the challenging job market for recent college graduates. The misinterpretation of this chart leads to wrongheaded policies such as making more college graduates.
2. This says nothing about the quality of jobs - as an example, a college graduate working at minimum wage would be considered "employed".
This is where the magnifying glass is critical. You should not assume that your idea of "employed" is the same as the official definition of "employed". Bill raised the issue of minimum wage. Elsewhere, other commentators noted the issue of "part-timers". Part-time employment is not distinguished from full-time employment in the official aggregate statistics.
Taking this further, isn't it plausible that unemployment "trickles down"? As the college graduates grab whatever job they can find, including the minimum-wage ones, they push the high-school graduates out of jobs.
In data, there is often no fine print to be found. In Big Data, this problem is aggravated a thousandfold. Unfortunately, magnifying blank is still blank. So, having the magnifying glass is not enough.
The solution then is to create your own fine print. Spend inordinate amounts of time understanding how data is collected. Dig deeply into how data is defined.
No, this work is not sexy. (PS. If you can't stand it, you really shouldn't be in data science.)
In Chapter 6 of Numbersense, I did this work for you as it relates to jobs data. What I show there is that there is no "right" way to measure employment--it's not as clearcut as you'd like to think. If you were to put forth your definition of "employed" for comment, your definition will absolutely get criticized, in just the same way you're criticizing the government's definition.
PS. Larry at Good Stats, Bad Stats pulled out his magnifying glass and wrote a series of posts about education, employment and income. He mildly disagrees with me.
A reader sent in this amusement. See if you can figure out the chart:
The article is here. It then goes into a lot of numbers about 200 accidents. I didn't pay much attention after that first paragraph, where it said 16% of the accidents were in one year, with 84% in the next year. That implied a growth of more than five times from one year to the next, which suggests an issue with data collection. The author then goes on to aggregate the two years, and reports dozens of findings.
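The growth claim is simple arithmetic worth checking for yourself:

```python
# 16% of accidents in the first year, 84% in the second year:
# the implied year-on-year growth factor is the ratio of the two shares
growth = 84 / 16
print(growth)  # 5.25, i.e. more than five times
```

A jump that large in one year is more likely a change in how accidents were counted than a real change in accidents.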
Anyway, what is the point of the ribbon chart?
Reminder: Contest to win a book is open till Friday. Enter through here.
Abhinav asks me to check out his blog post on a chart on global warming (I prefer the term climate change) featured on Wonkblog. The chart is sourced to a report by the World Meteorological Organization (link to PDF).
Hello, start the axis at zero whenever you are plotting columns. That's as fundamental as only plotting proportions on a pie chart.
There is a reason why the designer didn't want to start the axis at zero. It is this (Abhinav helpfully made all these charts):
The trouble is that for this data set (on global average temperature), the area below 13 is completely useless. It's like plotting body temperature on a scale of 0 - 100 Celsius when all feasible values fall into a tight range, maybe 35-38 Celsius. I recount a similar situation that led to a college president saying something stupid in Chapter 1 of my new book, Numbersense. (Information on the book is here.)
So we understand the desire to get rid of the irrelevant white space. This is accomplished by using a line chart. (I'd prefer to omit the data values, and rely on the axis.)
Abhinav then created various versions of this by compressing and expanding the vertical scales. I don't think there is anything wrong with the above scale. As I mentioned, the scale should focus on the range of values that are feasible.
One of the most important steps in analyzing data is to remove noise. First, we have to identify where the noise is, then we find ways to reduce the noise, which has the effect of surfacing the signal.
The labor force participation rate data, discussed here and here, can be decomposed into two components, known as the trend and residuals. (See right.) The residuals are the raw data minus the trend; in other words, they are the data after removing the trend.
If the purpose of the analysis is to describe the evolution of the labor force participation rate over time, then the trend is the signal we're after.
Our purpose is the opposite. I want to remove the trend in order to surface correlations that are unrelated to time evolution. Thus, the residuals are where the signal is.
Another way to think about the residuals (bottom chart) is that positive values imply the actual data was above trend while negative values imply the actual data was below trend.
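A simple way to sketch this decomposition is a centered moving average as the trend estimate, with residuals as the leftover. The series below is simulated (the level, slope, and noise are my assumptions, loosely mimicking a slowly declining participation rate), and a real analysis would use a proper seasonal decomposition; this only illustrates the trend-plus-residual idea.

```python
import numpy as np

# Hypothetical monthly series standing in for the participation-rate data:
# a slow downward trend plus noise
rng = np.random.default_rng(2)
t = np.arange(120)
series = 66.0 - 0.02 * t + rng.normal(0, 0.3, t.size)

# Estimate the trend with a 12-month moving average
window = 12
kernel = np.ones(window) / window
trend = np.convolve(series, kernel, mode="same")  # edges are distorted

# Residuals = raw data minus trend;
# positive means the data was above trend, negative means below
residuals = series - trend
print(residuals[window:window + 5].round(2))
```

With the trend stripped out, the residuals are what's left to correlate against another detrended series, which is exactly the comparison made with the miles-driven data.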
After decomposing the miles-driven data in the same way, I obtain two sets of residuals. These were plotted in the last post in a scatter plot.
The lack of correlation is also obvious in the plot below. You can see that the periods when one series of residuals was above trend did not line up well with the periods when the other series was above (or below) trend.