One piece of advice I give for those wanting to get into data visualization is to trash the defaults (see the last part of this interview with me). Jon Schwabish, an economist with the government, gives a detailed example of how this is done in a guest blog on the Why Axis.
Here are the highlights of his piece.
He starts with a basic chart, published by the Bureau of Labor Statistics. You can see the hallmarks of the Excel chart using the Excel defaults. The blue, red, green color scheme is most telling.
Just by making small changes, like using tints as opposed to different colors, using columns instead of bars, reordering the industry categories, and placing the legend text next to the columns, Schwabish made the chart more visually appealing and more effective.
The final version uses lines instead of columns, which will outrage some readers. It is usually true that a grouped bar chart is better replaced by overlaid line charts, and this advice applies even to so-called discrete data.
Schwabish included several bells and whistles. The three data points are not evenly spaced in time. The year-on-year difference is separately plotted as a bar chart on the same canvas. I'd consider using a line chart here as well... and lose the vertical axis since all the data are printed on the chart (or else, lose the data labels).
This version is considerably cleaner than the original.
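For readers who want to try the lines-for-grouped-bars swap themselves, here is a minimal sketch in Python. Everything in it is invented for illustration (the industries, years, and rates are not Schwabish's data); note how a numeric year axis automatically handles data points that are unevenly spaced in time.

```python
import matplotlib.pyplot as plt

# Invented rates for three industries at three unevenly spaced years.
years = [2003, 2007, 2012]  # a numeric axis preserves the uneven gaps
data = {
    "Construction": [8.1, 7.4, 6.0],
    "Manufacturing": [7.6, 6.5, 5.8],
    "Retail": [6.9, 6.1, 5.5],
}

for industry, rates in data.items():
    plt.plot(years, rates, marker="o", label=industry)

plt.xticks(years)  # label only the years we actually have
plt.ylabel("Rate (%)")
plt.legend()
plt.show()
```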
I noticed that the first person to comment on the Why Axis post said that internal BLS readers resist more innovative charts, claiming "they don't understand it". This is always a consideration when departing from standard chart types.
Another reader likes the "alphabetical order" (so to speak) of the industries. He raises another key consideration: who is your audience? If the chart is only intended for specialist readers who expect to find certain things in certain places, then the designer's freedom is curtailed. If the chart is used as a data store, then the designer might as well recuse him/herself.
I like many aspects of this exercise. This chart displays the results of an experiment conducted by a computer games company to show that the new build ("249") renders frames faster than the older build ("248"). The messages of the chart are clear: the 249 build (blue bars) is substantially faster (over 80% of frames render in 7 milliseconds or less under 249, compared with less than 40% under 248), and, less obviously, the variance of frame times is also significantly smaller.
The slight problem is that readers probably have to read the text to grasp most of the above.
In the text, the author explains how to turn time per frame into frames per second, the more common way of measuring rendering speed. The formula is 1000 divided by the time per frame in milliseconds. Wouldn't it be better if the chart plotted fps directly?
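The conversion itself is a one-liner; here is a quick sketch (the two sample frame times are illustrative):

```python
# Frames per second from time per frame: 1000 ms in a second,
# so fps = 1000 / milliseconds per frame.
def to_fps(ms_per_frame):
    return 1000.0 / ms_per_frame

print(to_fps(7.0))   # ~142.9 fps, a fast frame
print(to_fps(10.5))  # ~95.2 fps, a slower frame
```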
When it comes to presenting distributions (or variability), the cumulative chart is more useful, but it is also harder for readers to comprehend. For example:
The beauty of this chart is that one can take any point on the vertical axis, say, the 80% level, and read off the comparative values of 7 milliseconds for the blue line (249) and 10.5 ms for the red (248). That means 80% of the 249 frames were rendered in less than 7 ms, compared with 10.5 ms for the 248 frames.
Alternatively, taking a point on the horizontal axis, say 5 milliseconds, one can see that only about 8% of 248 frames met that threshold while about 30% of 249 frames did.
The steeper the ascent of the S-curve, the more efficient is the rendering.
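For anyone who wants to draw this kind of cumulative chart, here is a minimal sketch with simulated frame times standing in for the two builds (the distributions are made up, chosen only so that 249 comes out faster and tighter):

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated frame times (ms); invented stand-ins for the two builds.
rng = np.random.default_rng(0)
build_248 = rng.gamma(shape=9, scale=1.2, size=1000)   # slower, more spread out
build_249 = rng.gamma(shape=9, scale=0.75, size=1000)  # faster, tighter

def ecdf(values):
    """Sorted values and cumulative proportions for an empirical CDF."""
    xs = np.sort(values)
    ys = np.arange(1, len(xs) + 1) / len(xs)
    return xs, ys

for values, label in [(build_248, "248"), (build_249, "249")]:
    xs, ys = ecdf(values)
    plt.plot(xs, ys, label=label)

plt.xlabel("Frame time (ms)")
plt.ylabel("Proportion of frames rendered")
plt.legend()
plt.show()
```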
Reader Steph G. didn't like the effort by WRAL (North Carolina) to visualize the demographics of protestors in Raleigh. It sounds like the citizens of NC are making their voices heard. Maybe my friends in Raleigh can give us some background.
There are definitely problems with the choice of charts. But I rate this effort a solid B. In the Trifecta Checkup, they did a good job describing the central question and compiling an appropriate dataset. I love it when people go out to collect the right data rather than use whatever they could grab. The issue was the execution of the charts.
The first was a map showing where the arrested protestors came from.
Maps are typically used to show geographical distribution. The chosen color scheme (two levels of green and gray) compresses the data so much that we learn almost nothing about the distribution. I clicked on Wake County to learn that there were 178 arrests there. The neighboring Randolph County had only 1 arrest, but you can't tell that from the colors.
The next chart shows the trend of arrests over time. I like the general appearance (except for the shadows). The problem is the even spacing of the columns when the gaps between arrest dates are uneven.
Here's a quick redo, with proper spacing:
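The same fix in code, as a sketch: plot the counts against a true date axis so the gaps stay honest. The dates and counts below are invented, not WRAL's data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented arrest dates and counts; note the unequal gaps between dates.
dates = pd.to_datetime(["2013-04-29", "2013-05-13", "2013-05-20", "2013-06-10"])
arrests = [17, 49, 57, 151]

plt.bar(dates, arrests, width=3)  # width is in days on a date axis
plt.ylabel("Number of arrests")
plt.gcf().autofmt_xdate()  # tilt date labels so they don't overlap
plt.show()
```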
The final set of charts is inspired. They compare the demographics of those arrested protestors against the average North Carolina resident. For example:
For categories like Age with quite a few levels, the pie chart isn't a good choice. It's also hard to compare across pie charts. A column or dot chart works better.
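Here is a sketch of the dot-chart alternative. The age breakdowns below are invented for illustration; the point is that both groups sit on a common axis, so comparisons within and across categories are direct.

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented percentages by age group for the two populations.
ages = ["18-24", "25-34", "35-44", "45-54", "55-64", "65+"]
arrested = [12, 18, 15, 20, 22, 13]
residents = [13, 17, 17, 19, 17, 17]

y = np.arange(len(ages))
plt.plot(arrested, y, "o", label="Arrested protestors")
plt.plot(residents, y, "s", label="NC residents")
plt.yticks(y, ages)
plt.xlabel("Percent of group")
plt.legend()
plt.show()
```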
Luck is not easy to nail down in a number. For the fantasy football league, I have a way of looking at luck. One aspect of luck is which team you are matched up with in any given week. One matter is facing a stronger or a weaker opponent; a different matter is whether you face a given opponent on his/her hot or cold day. Sort of like whether a hitter faces a pitcher on his good or bad day.
As noted before, each FFL owner activates nine of the 14 players on the roster every week, and those nine earn points. There are typically 200-300 feasible choices of nine players. So we can measure how well any FFL owner performs by comparing the points total of the activated squad against the whole distribution of 200-300 options. This was the topic of my earlier post.
Now, if I am lucky, then I tend to face opponents in the weeks in which they perform poorly. And the following chart shows this measure from week to week:
In Week 1, this owner was rather unlucky, in the sense that his opponent pretty much used his best possible squad. On the other hand, in Week 4, his opponent (a different team) played a weak hand, something close to the median squad. (In addition, the entire histogram sits on the left side of the chart, meaning that even his opponent's best possible squad that week would have been easy to beat.)
Luck can be measured over the course of the 13 weeks. If the vertical lines tend to show up on the right tails of these histograms, then this owner isn't lucky. On the other hand, if the lines show up mostly on the left half of the histograms, then this owner is lucky.
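A sketch of how this luck measure might be computed, under assumptions about the data layout (the point totals of every feasible squad for the week, plus the opponent's actual total):

```python
import numpy as np

def luck_percentile(feasible_totals, opponent_actual):
    """Where the opponent's actual squad fell among all feasible squads.
    A low value means the opponent played a weak hand: lucky for you."""
    feasible = np.asarray(feasible_totals)
    return np.mean(feasible <= opponent_actual)

# Invented week: 250 feasible squads, opponent actually scored 88 points.
rng = np.random.default_rng(13)
feasible = rng.normal(loc=95, scale=10, size=250)
print(f"Opponent played at the {luck_percentile(feasible, 88):.0%} level")
```

Averaging this percentile across the 13 weeks would give a single season-long luck score per owner.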
In Chapter 8 of Numbersense, I use such an analysis to figure out the role of luck. This luck factor turned out to be even more important than the owner's own skills!
Special for Junk Charts readers: here is an excerpt from Chapter 8 (link).
The second book giveaway contest is under way on the sister blog. Enter the contest here.
Abhinav asks me to check out his blog post on a chart on global warming (I prefer the term climate change) featured on Wonkblog. The chart is sourced to a report by the World Meteorological Organization (link to PDF).
Hello, start the axis at zero whenever you are plotting columns. That's as fundamental as plotting only proportions on a pie chart.
There is a reason why the designer didn't like to start the axis at zero. It is this (Abhinav helpfully made all these charts):
The trouble is that for this data set (on global average temperature), the area below 13 is completely useless. It's like plotting body temperature on a scale of 0 - 100 Celsius when all feasible values fall into a tight range, maybe 35-38 Celsius. I recount a similar situation that led to a college president saying something stupid in Chapter 1 of my new book, Numbersense. (Information on the book is here.)
So we understand the desire to get rid of the irrelevant white space. This is accomplished by using a line chart. (I'd prefer to omit the data values, and rely on the axis.)
Abhinav then created various versions of this by compressing and expanding the vertical scales. I don't think there is anything wrong with the above scale. As I mentioned, the scale should focus on the range of values that are feasible.
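As a sketch of the line-chart fix, with invented temperature values roughly in the feasible range, the vertical scale covers only plausible values rather than starting at zero:

```python
import matplotlib.pyplot as plt

# Invented decadal global average temperatures (deg C), for illustration.
decades = [1951, 1961, 1971, 1981, 1991, 2001]
temps = [13.9, 14.0, 14.0, 14.2, 14.3, 14.5]

plt.plot(decades, temps, marker="o")
plt.ylim(13.5, 15.0)  # focus on the feasible range; no need to start at zero
plt.ylabel("Global average temperature (deg C)")
plt.show()
```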
Reader omegatron came back with another shocking instance of a pie chart:
Here is the link to the AVERT organization in the U.K. that published the chart and several others.
For the umpteenth time, the pie chart plots proportions. All proportions are percentages but some percentages are not proportions. The data here would appear to be "rate of diagnosis" rather than proportion of diagnoses by age.
The data came from Table 3a of this CDC report (link), and they are clearly labelled "Rate". The footnote even disclosed that the "Rate" is measured per 100,000 people, so these rates are being mislabeled as percentages.
Let's summarize: the numbers add up to much more than 100%; they are clearly not proportions; they are not even percentages; they are rates per 100,000.
omegatron even got confused by the colors. You'd think that the slices would be arranged by age group but no! The order of the slices is by size of the pie slices, with one exception--the lime green slice of 11.4%, which I cannot explain. In practice, this means the order goes from Under 13 to 13-14 to Over 65 to 60-64 to 50-54, etc.
A smarter use of color here would be to stick to one color while varying the tint according to the rate of diagnosis. Using 13 colors for 13 age groups is distracting.
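A sketch of the single-hue idea, with invented rates: map each rate to a tint of one color, so the shading itself carries the data.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

# Invented diagnosis rates per 100,000 for a few age groups.
ages = ["13-14", "15-19", "20-24", "25-29", "30-34", "35-39"]
rates = [0.7, 16.0, 36.8, 35.1, 28.5, 23.4]

# One hue, darker tint for higher rate.
colors = cm.Blues(np.asarray(rates) / max(rates))
plt.bar(ages, rates, color=colors)
plt.ylabel("Diagnoses per 100,000")
plt.show()
```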
As a teacher, I find it shocking that such pie charts continue to see the light of day. It's very disappointing, as I'd assume every teacher who covers the pie chart points out its pitfalls. Why is this happening?
With this chart, I'm mostly baffled by the top corner of the Trifecta Checkup. What is the point of this data? If I understand the "per 100,000 population" definition, these rates are computed as the number of diagnosed divided by the population in each age group. So the diagnosis rate is a function of how many people in each age group are actually infected, how effective the diagnostic procedures are, and whether that effectiveness varies with age. Add to that the completeness of reporting by age group. (The footnote acknowledged that the mathematical model does not account for incomplete reporting; to call a spade a spade, that means the model assumes complete reporting.)
The rate of diagnosis can be low because the rate of infection is low, or because the proportion of the infected who get diagnosed is low. I just can't conceive of a use of data that confounds these factors.
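To spell out the confounding with a toy calculation (all numbers invented): the published rate is the product of the infection rate and the share of the infected who get diagnosed, so very different situations can produce the same number.

```python
# rate per 100,000 = (infected / population) * (diagnosed / infected) * 100,000
high_infection_low_diagnosis = 0.008 * 0.3 * 100_000  # = 240.0
low_infection_high_diagnosis = 0.004 * 0.6 * 100_000  # = 240.0
print(high_infection_low_diagnosis == low_infection_high_diagnosis)  # True
```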
A time series treatment would be interesting, although that addresses a different question.
Today, we review one of the basic principles Ed Tufte very effectively advocated in his famous book: use gridlines and data labels only if absolutely necessary. The enemy is redundancy.
Here is a chart that appeared in the New York Times Real Estate pages (with this article):
The gridlines serve no purpose. Between the axis labels and the data labels, the designer should pick one. If the data labels are used, then the vertical axis can be removed entirely without affecting our ability to understand the data. One can also argue that the data labels do not convey any real information since the average person is unlikely to be able to process 1004 feet versus 1250 feet. Why not remove the data labels and retain only the axis labels?
I'd be willing to go so far as to remove all data from the chart itself. This is because the Empire State Building has been chosen as the reference point. The assumption behind this choice is that the readers have a sense of "tallness" of the Empire State Building. It is then sufficient to just place columns of different heights next to the Empire State Building. To make the comparison a little easier, one can draw a reference line from the top of the Empire State, like this:
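A sketch of that reference-line idea, with invented heights, where the Empire State Building's column anchors the comparison:

```python
import matplotlib.pyplot as plt

# Invented heights (feet); the Empire State Building is the reference.
buildings = ["Empire State", "Tower A", "Tower B", "Tower C"]
heights = [1250, 1004, 1150, 1380]

plt.bar(buildings, heights)
plt.axhline(heights[0], linestyle="--", color="gray")  # line from the ESB's top
plt.yticks([])  # no axis: the comparison is purely visual against the reference
plt.show()
```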
One of the most important steps in analyzing data is to remove noise. First, we have to identify where the noise is, then we find ways to reduce the noise, which has the effect of surfacing the signal.
The labor force participation rate data, discussed here and here, can be decomposed into two components, known as the trend and residuals. (See right.) The residuals are the raw data minus the trend; in other words, they are the data after removing the trend.
If the purpose of the analysis is to describe the evolution of the labor force participation rate over time, then the trend is the signal we're after.
Our purpose is the opposite. I want to remove the trend in order to surface correlations that are unrelated to time evolution. Thus, the residuals are where the signal is.
Another way to think about the residuals (bottom chart) is that positive values imply the actual data was above trend while negative values imply the actual data was below trend.
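Here is a sketch of one simple way to do such a decomposition, using a centered moving average as the trend estimate; the monthly series is simulated, not the actual BLS data.

```python
import numpy as np
import pandas as pd

# Simulated monthly series standing in for the participation rate.
rng = np.random.default_rng(0)
months = np.arange(240)
series = pd.Series(66.0 - 0.01 * months + rng.normal(0, 0.2, size=240))

trend = series.rolling(window=25, center=True).mean()  # smooth trend estimate
residuals = series - trend  # positive = above trend, negative = below trend
print(residuals.describe())
```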
After decomposing the miles-driven data in the same way, I obtain two sets of residuals. These were plotted in the last post in a scatter plot.
The lack of correlation is also obvious in the plot below. You can see that the periods when one series of residuals was above trend do not line up well with the periods when the other series was above (or below) trend.
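To make the check concrete, here is a sketch with two simulated residual series that have no built-in relationship; the correlation comes out near zero, matching what the plot shows.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated stand-ins for the two sets of residuals.
rng = np.random.default_rng(1)
resid_lfpr = rng.normal(0, 0.2, size=200)
resid_miles = rng.normal(0, 0.2, size=200)

print(f"Correlation: {np.corrcoef(resid_lfpr, resid_miles)[0, 1]:.2f}")

plt.scatter(resid_lfpr, resid_miles, s=10)
plt.xlabel("Participation-rate residuals")
plt.ylabel("Miles-driven residuals")
plt.show()
```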
The first problem readers encounter with this image is "What is MMI?" I like to think of any presentation as a set of tearout pages. Even if the image is part of a book, or part of a deck of slides, once it is published, the writer should expect readers to tear a sheet out and pass it along. In fact, you'd love to have people pass along your work. This means that when creating a plot such as this, the designer must explain what MMI is in a footnote. Yes, on every chart, even if every chart in the report deals with MMI.
MMI, I'm told, is some kind of metric of health care cost.
What a mess. They are trying to use the metaphor of "measuring one's temperature", which I suppose is cute because MMI measures health care costs.
Next, the designer chose to plot the index against the national average as opposed to the dollar amount of MMI. This presents a challenge since the thermometer does not have a natural baseline number. This is especially true on the Fahrenheit scale used in the U.S.
Then, a map is introduced to place the major cities. The bulb of each thermometer now doubles as a dot on the map. This step is mind-boggling because the city labels aren't even on the map. So if you know where these cities are, you don't need the map for guidance but if you don't know the locations, you're as hopeless as before.
How the data now gets onto the complex picture requires some deconstruction.
First, start with a bar chart of the relative index (the third column of the table shown above).
Then, chop off the parts below 85 (colored gray).
Next, identify the cities that are below the national average (i.e. index < 100) and color them blue.
You can see this by focusing only on the chart above the map. In other words, this part:
To get from here to the published version, add a guiding line from each bar to the dot on the map for the corresponding city. Notice that a constant-length portion of each bar has been chopped off, and each bar is now augmented by some additional length that varies with the distance between the bar chart and the city's location on the map below. For instance, Miami, which is furthest south, has the biggest distortion.
The choice of 85 as a cutoff is arbitrary and inexplicable. If we really want a "cutoff" of sorts, we can use 100, which represents the national average. By plotting the gap between the city index and the national index (effectively, the percent difference), we can also use the sign of the difference to indicate above/below the national average, thus saving a color.
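A sketch of this suggested redo, with invented index values: plot each city's gap from 100 and let the sign choose the color.

```python
import matplotlib.pyplot as plt

# Invented MMI relative indices (national average = 100).
cities = ["Miami", "New York", "Chicago", "Denver", "Phoenix", "Seattle"]
index = [119, 112, 103, 97, 94, 101]

gaps = [i - 100 for i in index]  # difference from the national average
colors = ["firebrick" if g > 0 else "steelblue" for g in gaps]

plt.bar(cities, gaps, color=colors)
plt.axhline(0, color="black", linewidth=0.8)  # the national-average baseline
plt.ylabel("Index points above/below national average")
plt.show()
```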
One of the most telling signs of a failed chart is the appearance of the entire data set next to the chart. That's the essence of the self-sufficiency test.