One piece of advice I give to those wanting to get into data visualization is to trash the defaults (see the last part of this interview with me). Jon Schwabish, a government economist, gives a detailed example of how this is done in a guest post on the Why Axis.
Here are the highlights of his piece.
He starts with a basic chart, published by the Bureau of Labor Statistics. You can see the hallmarks of an Excel chart made with the default settings; the blue-red-green color scheme is the most telling sign.
Just by making small changes, like using tints as opposed to different colors, using columns instead of bars, reordering the industry categories, and placing the legend text next to the columns, Schwabish made the chart more visually appealing and more effective.
The final version uses lines instead of columns, which will outrage some readers. It is usually true that a grouped bar chart should be replaced by overlaid line charts, and this should not be limited to so-called discrete data.
Schwabish included several bells and whistles. The three data points are not evenly spaced in time. The year-on-year differences are separately plotted as a bar chart on the same canvas. I'd consider using a line chart there as well... and losing the vertical axis since all the data are printed on the chart (or else, losing the data labels).
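For readers who want to experiment, here is a minimal sketch of the grouped-bars-to-lines conversion, with a time axis that respects the uneven spacing. The industry names and numbers below are invented placeholders, not the BLS figures.

```python
import matplotlib.pyplot as plt

years = [2006, 2009, 2011]  # note: unevenly spaced in time
industries = {  # hypothetical values for illustration only
    "Manufacturing": [14.2, 12.8, 12.1],
    "Construction": [10.5, 8.9, 9.3],
    "Retail": [15.1, 14.6, 14.8],
}

fig, ax = plt.subplots()
for name, values in industries.items():
    ax.plot(years, values, marker="o")
    # put the label next to the line, Schwabish-style, instead of a legend box
    ax.annotate(name, (years[-1], values[-1]),
                xytext=(5, 0), textcoords="offset points")

ax.set_xticks(years)  # the axis preserves the uneven gaps between years
ax.set_ylabel("Employment (millions)")
plt.show()
```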
This version is considerably cleaner than the original.
I noticed that the first person to comment on the Why Axis post said that internal BLS readers resist more innovative charts, claiming "they don't understand it". This is always a consideration when departing from standard chart types.
Another reader likes the "alphabetical order" (so to speak) of the industries. He raises another key consideration: who is your audience? If the chart is only intended for specialist readers who expect to find certain things in certain places, then the designer's freedom is curtailed. If the chart is used as a data store, then the designer might as well recuse him/herself.
Notice that I have indexed every metric against the league average. This is shown in the first panel. I use a red dot to warn readers that the direction of this metric is opposite to the others (left of center is a good thing!)
You can immediately make a bunch of observations:
Alex Smith was quite poor, except for interceptions.
Colin Kaepernick had passing statistics similar to Smith's. His only advantage over Smith was the rushing.
Joe Flacco, as we noted before, is about as average as it gets (except for rushing yards).
Tyrod Taylor is here to remind us that we have to be careful about backup players being included in the same analysis.
The second version is a heatmap.
This takes inspiration from the fact that any serious reader of the spider chart will be reading the eight spokes (dimensions) separately. Why not plot these neatly in columns and use color to help us find the best and worst?
Imagine this to be a large table with as many rows as there are quarterbacks. You will be able to locate the red (hot) zones quickly. You can also scan across a row to understand that player's performance relative to the average on every metric.
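For the technically inclined, here is a sketch of how such a heatmap might be built, assuming each metric has been indexed so that 100 is the league average. All values below are made up for illustration, and interceptions are inverted so that higher is better throughout (the heatmap equivalent of the red-dot warning).

```python
import numpy as np
import matplotlib.pyplot as plt

players = ["Smith", "Kaepernick", "Flacco", "Taylor"]
metrics = ["Comp %", "Yds/Att", "TD", "INT (inverted)", "Rush Yds"]
# rows = players, columns = metrics; 100 = league average (made-up numbers)
indexed = np.array([
    [95, 90, 85, 120, 80],
    [96, 92, 88, 110, 140],
    [100, 100, 100, 100, 70],
    [90, 85, 80, 95, 150],
])

fig, ax = plt.subplots()
im = ax.imshow(indexed, cmap="RdYlGn", vmin=60, vmax=140)
ax.set_xticks(range(len(metrics)))
ax.set_xticklabels(metrics, rotation=30, ha="right")
ax.set_yticks(range(len(players)))
ax.set_yticklabels(players)
fig.colorbar(im, label="Index (100 = league average)")
plt.show()
```

Because each player is one row, adding more quarterbacks just makes the table taller; nothing else has to change.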
I like this visualization best, primarily because it scales beautifully.
The final version is a profile chart, sometimes called a parallel coordinates plot. While I am an advocate of profile charts, they really only work when you have a small number of things to compare.
Reader Steph G. didn't like the effort by WRAL (North Carolina) to visualize the demographics of protestors in Raleigh. It sounds like the citizens of NC are making their voices heard. Maybe my friends in Raleigh can give us some background.
There are definitely problems with the choice of charts, but I rate this effort a solid B. In the Trifecta Checkup, they did a good job describing the central question and compiling an appropriate dataset. I love it when people go out to collect the right data rather than use whatever they can grab. The issue was the execution of the charts.
The first was a map showing where the arrested protestors came from.
Maps are typically used to show geographical distribution, but the chosen color scheme (two levels of green plus gray) compresses the data so much that we learn almost nothing about the distribution. I clicked on Wake County to learn that there were 178 arrests there. Neighboring Randolph County had only 1 arrest, but you can't tell that from the colors.
The next chart shows the trend of arrests over time. I like the general appearance (except for the shadows). The problem is the even spacing of the columns when the time gaps between arrests are uneven.
Here's a quick redo, with proper spacing:
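In code, the fix amounts to plotting the columns at their actual dates rather than at evenly spaced category positions. The dates and counts below are invented stand-ins, not WRAL's data.

```python
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from datetime import date

days = [date(2013, 4, 29), date(2013, 5, 6), date(2013, 5, 13),
        date(2013, 6, 3)]  # uneven gaps between dates (hypothetical)
arrests = [17, 30, 49, 84]  # hypothetical counts

fig, ax = plt.subplots()
ax.bar(days, arrests, width=3)  # width is measured in days on a date axis
ax.xaxis.set_major_formatter(mdates.DateFormatter("%b %d"))
ax.set_ylabel("Arrests")
plt.show()
```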
The final set of charts is inspired. They compare the demographics of those arrested protestors against the average North Carolina resident. For example:
For categories like Age with quite a few levels, the pie chart isn't a good choice. It's also hard to compare across pie charts. A column or dot chart works better.
Reader omegatron came back with another shocking instance of a pie chart:
Here is the link to the AVERT organization in the U.K. that published the chart and several others.
For the umpteenth time, the pie chart plots proportions. All proportions are percentages but some percentages are not proportions. The data here would appear to be "rate of diagnosis" rather than proportion of diagnoses by age.
The data came from Table 3a of this CDC report (link), where they are clearly labelled "Rate". The footnote even discloses that the "Rate" is measured per 100,000 people, so the chart mislabels these rates as percentages.
Let's summarize: the numbers add up to much more than 100%; they are clearly not proportions; they are not even percentages; they are rates per 100,000.
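A toy calculation makes the distinction concrete. The numbers below are invented; only the arithmetic matters. Proportions of a total must sum to 100%, while rates per 100,000 need not sum to anything in particular.

```python
# invented diagnosis counts and population sizes by age group
diagnoses = {"13-24": 8000, "25-34": 12000, "35-44": 9000}
population = {"13-24": 50_000_000, "25-34": 42_000_000, "35-44": 40_000_000}

total = sum(diagnoses.values())
for group in diagnoses:
    proportion = diagnoses[group] / total * 100  # these sum to 100%
    rate = diagnoses[group] / population[group] * 100_000  # these do not
    print(f"{group}: {proportion:.1f}% of diagnoses, {rate:.1f} per 100,000")
```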
omegatron was even confused by the colors. You'd think that the slices would be arranged by age group, but no! The slices are ordered by size, with one exception: the lime-green slice of 11.4%, which I cannot explain. In practice, this means the order goes from Under 13 to 13-14 to Over 65 to 60-64 to 50-54, etc.
A smarter use of color here would be to stick to one color while varying the tint according to the rate of diagnosis. Using 13 colors for 13 age groups is distracting.
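One way to implement the one-color idea, sketched here with invented rates: map each age group's value to a tint of a single sequential palette.

```python
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as mcolors

groups = ["13-14", "15-19", "20-24", "25-29"]
rates = [0.5, 7.1, 16.4, 13.6]  # hypothetical rates per 100,000

# darker tint = higher rate; one hue carries all the information
norm = mcolors.Normalize(vmin=0, vmax=max(rates))
colors = [cm.Blues(norm(r)) for r in rates]

plt.bar(groups, rates, color=colors)
plt.ylabel("Diagnosis rate per 100,000")
plt.show()
```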
As a teacher, I find it shocking that such pie charts continue to see the light of day. It's very disappointing, as I'd assume every teacher who covers the pie chart points out its pitfalls. Why does this keep happening?
With this chart, I'm mostly baffled by the top corner of the Trifecta Checkup: what is the point of this data? If I understand the "per 100,000 population" definition, these rates are computed as the number of diagnosed divided by the population of each age group. So the diagnosis rate is a function of how many people in each age group are actually infected, how effective the diagnostic procedures are, and whether that effectiveness varies with age. Add to that the completeness of reporting by age group (the footnote acknowledges that the mathematical model does not account for incomplete reporting; to call a spade a spade, that means the model assumes complete reporting).
The rate of diagnosis can be low because the rate of infection is low, or because the proportion of the infected who get diagnosed is low. I just can't conceive of a use of this data that confounds these factors.
A time-series treatment would be interesting, although that addresses a different question.
The first problem readers encounter with this image is "What is MMI?" I like to think of any presentation as a set of tearout pages. Even if the image is part of a book or a deck of slides, once it is published, the writer should expect readers to tear a sheet out and pass it along. In fact, you'd love for people to pass along your work. This means that when creating a plot such as this, the designer must explain what MMI is in a footnote. Yes, on every chart, even if every chart in the report deals with MMI.
MMI, I'm told, is some kind of metric of health care cost.
What a mess. They are trying to use the metaphor of "measuring one's temperature", which I suppose is cute because MMI measures health care costs.
Next, the designer chose to plot the index against the national average rather than the dollar amount of MMI. This presents a challenge since the thermometer does not have a natural baseline number, especially on the Fahrenheit scale used in the U.S.
Then a map is introduced to place the major cities. The bulb of each thermometer now doubles as a dot on the map. This step is mind-boggling because the city labels aren't even on the map: if you know where these cities are, you don't need the map for guidance, and if you don't know the locations, you're as hopeless as before.
How the data get onto this complex picture requires some deconstruction.
First, start with a bar chart of the relative index (the third column of the table shown above).
Then, chop off the parts below 85 (colored gray).
Next, identify the cities that are below the national average (i.e. index < 100) and color them blue.
You can see this by focusing only on the chart above the map. In other words, this part:
To get from here to the published version, add a guiding line from each bar to the dot on the map for the corresponding city. Notice that a constant-length portion of each bar has been chopped off, and each bar is now augmented by some additional length that varies with the distance between the bar chart and the city's location on the map below. For instance, Miami, which is furthest south, suffers the biggest distortion.
The choice of 85 as a cutoff is arbitrary and inexplicable. If we really want a cutoff, we can use 100, which represents the national average. By plotting the gap between the city index and the national index (effectively, the percent difference), we can also use the sign of the difference to indicate above/below the national average, thus saving a color.
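Here is a sketch of that suggestion, using invented index values: plot each city's gap from 100 and let the sign carry the above/below information.

```python
import matplotlib.pyplot as plt

cities = ["Miami", "New York", "Chicago", "Phoenix", "Seattle"]
index = [124, 118, 103, 96, 92]  # hypothetical MMI indices, 100 = national
gaps = [i - 100 for i in index]  # percent difference from the average

# the sign does the work of a second color scheme
colors = ["firebrick" if g > 0 else "steelblue" for g in gaps]
plt.barh(cities, gaps, color=colors)
plt.axvline(0, color="black", linewidth=0.8)  # the national average
plt.xlabel("Percent above (+) or below (-) national average")
plt.show()
```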
One of the most telling signs of a failed chart is the appearance of the entire data set next to the chart. That's the essence of the self-sufficiency test.
Nick C. on Twitter sent us to the following chart of salaries in Major League Soccer. (link)
This chart is hosted at Tableau, one of the modern visualization software suites, and appears to be a user submission. Alas, more power did not bring more responsibility.
Sorting the bars by total salary would be a start.
The colors and subsections of the bars are intended to unpack the composition of the total salaries, namely which positions took how much of the money. I'm at a loss to explain why those rectangles don't seem to be drawn to scale, or what it means to have rectangles stacked on top of one another. Perhaps it's because I don't know much about how the salary cap works.
Combined with the smaller chart (shown below), the story seems to be that while all teams have similar cap numbers, the actual salaries being paid could differ by multiples.
This is the standard stacked bar chart showing the distribution of salary cap usage by team:
I have never understood the appeal of stacking data. It's not easy to compare the middle segments.
After quite a bit of work, I arrived at the following:
The MLS teams are divided into five groups based on how they use the salary cap. Salary-cap figures are converted into proportions of the total cap. For example, the first cluster includes Chicago, Los Angeles, New York, Seattle and Toronto; these teams spread the wealth among the D, F, and M players while not spending much on the goalie and "others". On the other hand, Groups 2 and 3, and especially Group 3, allocated 30-45% of the cap to the midfield.
Three teams form their own clusters. CLB spends more of its cap on "others" than any other team ("others" are mostly hyphenated positions like D-F, F-M, etc.). DAL and VAN spend a lot less on midfield players than the other teams, and VAN spends a lot on defense.
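For readers who want to try this kind of grouping themselves, here is one possible approach, with k-means standing in for whatever clustering method you prefer; the cap shares below are invented placeholders, not the real MLS data.

```python
import pandas as pd
from sklearn.cluster import KMeans

# rows = teams; columns = share of salary cap by position (each row sums to 1)
shares = pd.DataFrame(
    [[0.30, 0.25, 0.30, 0.05, 0.10],
     [0.20, 0.25, 0.40, 0.05, 0.10],
     [0.25, 0.20, 0.45, 0.04, 0.06],
     [0.15, 0.30, 0.10, 0.05, 0.40],
     [0.40, 0.20, 0.15, 0.05, 0.20]],
    index=["CHI", "NY", "SEA", "CLB", "VAN"],
    columns=["D", "F", "M", "GK", "Other"],
)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(shares)
print(shares.assign(cluster=km.labels_))
```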
My version has many fewer data points (although the underlying data set is the same) but it's easier to interpret.
I tried various chart types, including bar charts and even pie charts, and I still like the profile (line) charts best.
In modern software (I'm using JMP's Graph Builder here), it takes only one click to go from lines to bars, and one more to get to pies.
When we visualize data, we want to expose the information contained within, or to use the terminology Nate Silver popularized, to expose the signal and leave behind the noise.
When graphs are not done right, sometimes they manage to obscure the information.
Reader John H. found a confusing bar chart while studying a paper (link to PDF) in which the authors compared two algorithms used to determine the position of Wi-Fi access points under various settings.
The first reaction might be that the researchers are telling us there is no information here. The most important variable on this chart, which they call "datanum", runs from left to right across the page, and a casual glance across the page gives the impression that nothing much is going on.
Then you look at the row labels and realize that this dataset is very well structured. The target variable (AP position error) is compared along four dimensions: datanum, the algorithm (WCL or GPR+WCL), the number of access points, and the location of those access points (inner, boundary, all).
When the data has a nice structure, there should be better ways to visualize it.
John submitted a much improved version, which he created using ggplot2.
This is essentially a small multiples chart. The key differences between the two charts are (a code sketch follows the list):
Giving more dimensions a chance to shine
Spacing the "datanum" proportional to the sample size (we think "datanum" means the number of sample readings taken from each access point)
Using a profile chart, which also allows the y-axis to start from 2
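Here is a minimal matplotlib sketch in the spirit of John's ggplot2 version, following the three points above. All datanum levels and error values are invented stand-ins for the paper's data.

```python
import matplotlib.pyplot as plt

datanum = [10, 20, 50, 100]  # sample sizes, spaced by their actual values
errors = {  # hypothetical AP position errors by panel and algorithm
    "inner":    {"WCL": [3.1, 2.9, 2.7, 2.6], "GPR+WCL": [4.0, 3.6, 3.2, 3.0]},
    "boundary": {"WCL": [6.5, 6.4, 6.4, 6.3], "GPR+WCL": [5.9, 5.5, 5.1, 4.9]},
    "all":      {"WCL": [5.4, 5.3, 5.2, 5.2], "GPR+WCL": [5.2, 4.9, 4.5, 4.3]},
}

fig, axes = plt.subplots(1, 3, sharey=True, figsize=(9, 3))
for ax, (panel, series) in zip(axes, errors.items()):
    for algo, y in series.items():
        ax.plot(datanum, y, marker="o", label=algo)
    ax.set_title(panel)
    ax.set_xlabel("datanum")
axes[0].set_ylabel("AP position error")  # the axis need not start at zero
axes[0].legend()
plt.show()
```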
When you read this chart, you finally realize that the experiment has yielded several insights:
Increasing the sample size does not affect the aggregate WCL error, but it does reduce the aggregate GPR+WCL error.
The improvement of GPR+WCL comes only from the inner access points.
The WCL algorithm performs really well in inner access points but poorly in outer access points.
The addition of GPR to the WCL algorithm improves the performance at outer access points but degrades it at inner access points. (In aggregate, it improves the performance... but only because there are almost two outer access points for every inner one.)
Now, I don't know anything about this position-estimation problem, but the chart leaves me wondering why they don't just use WCL on inner access points. The performance under that setting is far and away the best of all the tested settings.
The researchers describe their metric as AP Position Error (2drms, 95% confidence). I'm not sure what they mean by that, because when I see "95% confidence", I expect to see a confidence band around the point estimates shown above.
And yet, the data table shows only point estimates -- in fact, estimates to two decimal places of precision. In statistics, the more precision you have, the less confidence.
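To unpack that quip: for a fixed sample, confidence and precision trade off. The generic formula for a confidence interval around a mean (a textbook formula, not anything taken from the paper) is

```latex
\[
\mathrm{CI}_{1-\alpha} = \bar{x} \pm z_{1-\alpha/2}\,\frac{s}{\sqrt{n}}
\]
```

The interval widens as the confidence level rises, so a 95% statement is necessarily less precise than a 50% statement from the same data, and a bare point estimate to two decimal places carries no confidence statement at all.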
Quite a few problems crop up here. The most hurtful is that the context of the chart is left to the text. If you read the paragraph above it, you'll learn that the data represent only a select group of institutions known as the Russell Group, and that Cambridge University was omitted because "it did not provide data in 2005". That omission is a curious decision, as the designer weighed one missing year against one missing institution (and a mighty important one at that). This issue is easily fixed by a few choice words.
You will also learn from the text that the author's primary message is that among the elite institutions, little if any improvement has been observed in the enrollment of (disadvantaged) students from "low participation areas". Yet the chart draws our attention to a tangle of up and down segments, giving the impression that the data are too complicated to yield a clear message.
The decision to use 21 colors for 21 schools is baffling, as surely no one can make out which line is which school. A good tip-off that you have the wrong chart type is needing more than, say, three or four colors.
The order of institutions listed in the legend is approximately the reverse of their order of appearance in the chart. If software can be "intelligent", I'd hope that it could automatically sort the legend entries to match the chart.
If the whitespace were removed (I'm talking about the space between 0% and 2.25% and between 8% and 10%), the lines could be more spread out, and labels could perhaps be placed next to the vertical axes to simplify the presentation. I'd also delete "Univ." with abandon.
The author concludes that nothing has changed among the Russell Group. Here is the untangled version of the same chart, with the schools ordered by their "inclusiveness" from left to right.
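As a sketch of the untangling, with placeholder numbers rather than the actual Russell Group data: compute each school's average proportion, sort by it, and draw small multiples.

```python
import pandas as pd
import matplotlib.pyplot as plt

# rows = years, columns = schools; % of students from low-participation areas
df = pd.DataFrame(
    {"LSE": [3.0, 4.5, 2.8], "Oxford": [2.6, 2.7, 2.5], "Leeds": [7.0, 7.2, 7.5]},
    index=[2005, 2008, 2011],
)
order = df.mean().sort_values(ascending=False).index  # most inclusive first

fig, axes = plt.subplots(1, len(order), sharey=True, figsize=(8, 2.5))
for ax, school in zip(axes, order):
    ax.plot(df.index, df[school], marker="o")
    ax.set_title(school)
axes[0].set_ylabel("% low-participation")
plt.show()
```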
This is a case where the "average" obscures a lot of differences between institutions and even within institutions from year to year (witness LSE).
In addition, I see a negative reputation effect: the proportion of students from low-participation areas decreases as reputation increases (I'm basing reputation on name recognition). Perhaps UK readers can confirm whether this is correct. If so, it's a big miss in terms of interesting features in this dataset.