An anonymous reader sent in a Type V critique of the following map of July unemployment rates by state. The map was published by the Bureau of Labor Statistics (BLS), and used in a recent article in Vox.
Matt @ Vox took the BLS's bait, and singled out Mississippi as the worst in the nation. Our reader-contributor is none too pleased with this conclusion.
He noted that the red state stands out only because of the high "out of sample" top range of the legend. Three out of the seven colors are not found on the map at all! This is kind of like the white space problem when doing a line plot with large values and an axis starting at zero (for example, here), but the opposite. All the states are compressed into four colors, three of which are shades of orange.
The reader investigated, and reported back:
The top end of the legend seems to be set by Puerto Rico's 13.1%. Puerto Rico is omitted from the Vox map as well as from the BLS publication (link to PDF).
Mississippi only has the bare minimum, 8.0%, to qualify for the red color. Georgia is a 7.8; Michigan, Nevada, and Rhode Island are all 7.7.
24 (of the 50 States plus DC) are in the 6-8% band, and 21 are in the 4-6% band, with the remaining 5 under 4%.
None of the above is obvious when looking at the map.
In the Trifecta Checkup, this is a Type V chart. The data is accurate. The question being asked is clear but the visual construction is problematic.
[I'm seizing back the mike.] While the map is often not the best choice for showing geographic data, something we frequently cover on this blog, in this particular case, there is a strong regional pattern. Of course, with the compressed choice of colors, this regional pattern is not easily observed in the original.
Matthew Yglesias, writing for Vox, cited the following chart from a World Bank project:
His comment was: "We can see that while China has overtaken Germany and Japan to become the world's second-largest economy (i.e., total area of the rectangle) its citizens are nowhere near being as rich as those of those countries or even Mexico."
Yes, the chart encodes the size of the economy in a rectangular area, with one side being the per-capita GDP and the other being the population. I am not sure about the "we can see". I am not confident that the short and wide rectangle for China is larger than the thin and tall ones for Japan and for Germany. Perhaps Matthew is relying on knowledge in his head, rather than knowledge on the chart, to come to this conclusion.
This is the trouble with rectangular area charts: they have a nerdy appeal, since side × side = area, but as a communication device, they fail.
Here are some problems with the chart:
it's difficult to compare rectangular areas
the columns can only be sorted in one way (I'd have chosen to order it by population)
colors are necessitated by the chart type, not the data
the cumulative horizontal axis makes no sense unless the vertical axis is cumulative GDP (or cumulative GDP per capita)
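On the first point, a bit of arithmetic shows why readers need knowledge in their heads rather than on the chart. A minimal sketch, using rough circa-2013 figures (my approximations, not numbers read off the chart):

```python
# Rough illustration of why rectangular areas are hard to eyeball.
# Figures are approximate circa-2013 values, for illustration only.
economies = {
    # name: (GDP per capita in USD, population in millions)
    "China":   (7_000, 1_360),
    "Japan":   (38_500, 127),
    "Germany": (46_000, 81),
}

for name, (per_capita, pop_millions) in economies.items():
    # area of the rectangle = per-capita GDP x population = total GDP
    gdp_trillions = per_capita * pop_millions * 1e6 / 1e12
    print(f"{name}: about {gdp_trillions:.1f} trillion USD")
```

The short, wide China rectangle is roughly twice the area of the tall, thin Japan one, but few readers could verify that visually.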
Matthew should also have mentioned PPP (Purchasing Power Parity). If GDP is used as a measure of "wellbeing", then costs of living should be taken into account in addition to incomes. The cost of living in China is much lower than in Japan or Germany and using the prevailing exchange rates disguises this point.
Try your hand at fixing this one. There are no easy solutions. Does interactivity help? How about multiple charts? You will learn why I classify it as QDV instead of just DV.
[Update, 8/18/2014:] Xan Gregg created a scatter plot version of the chart. He also added, "There is still the issue of what the question is, but I'm assuming it's along the lines of 'How do economies compare regarding GDP, population, and GDP/capita?' I'm using the PPP-based GDP, but I didn't read the report carefully enough to figure out if another measure was better."
This sort of chart is, unfortunately, quite common in business circles. Just about the only thing one can read readily from this chart is the overall growth in the plug-in vehicle market (the heights of the columns).
To fix this chart, start subtracting. First, we can condense the monthly data to quarterly:
This version is a bit less busy but there are still too many colors, and too many things to look at.
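The monthly-to-quarterly condensation is simple aggregation. A sketch with hypothetical monthly sales counts (not the actual plug-in data):

```python
# Condense monthly unit counts to quarterly totals.
# The sales figures here are made up for illustration.
from collections import defaultdict

monthly = {  # (year, month) -> units sold (hypothetical)
    (2012, 1): 600, (2012, 2): 1000, (2012, 3): 2300,
    (2012, 4): 1500, (2012, 5): 1700, (2012, 6): 1800,
}

quarterly = defaultdict(int)
for (year, month), units in monthly.items():
    quarter = (month - 1) // 3 + 1  # months 1-3 -> Q1, 4-6 -> Q2, etc.
    quarterly[(year, quarter)] += units

print(dict(quarterly))  # {(2012, 1): 3900, (2012, 2): 5000}
```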
Next, we can condense the makes of the vehicles and focus on the manufacturers:
This version is even less busy and more readable. We can now see that Chevrolet, Nissan, Toyota, Ford, and Tesla are the five biggest manufacturers in this category. All the small brands have been aggregated into the "Others" category. The stacked column chart still makes it hard to know what's going on with each individual brand's share, other than the one brand situated at the bottom of the stack.
This shows the growth in the overall market, as well as several interesting developments:
The growth in the number of competitors in the market especially since 2012
The fragmentation of the market. Before mid-2012, Chevrolet dominated the market; since then, five or six brands have been splitting it
The first-to-market brands have not been able to sustain their advantage
A smoothed version of the line chart is even more readable:
Graphics is a discipline that often rewards subtracting. Less is more.
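For the smoothing step, a centered moving average is one simple option (the original chart does not specify its smoothing method, so this is an assumption); the share figures below are made up:

```python
# A simple centered moving average, the kind of smoothing that makes
# a noisy market-share line easier to read.
def moving_average(values, window=3):
    """Centered moving average; the ends use shorter windows."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

# Hypothetical quarterly share of the leading brand
share = [0.52, 0.47, 0.55, 0.31, 0.28, 0.30, 0.22]
print(moving_average(share))
```

A wider window gives a smoother line at the cost of muting genuine turning points.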
In the above discussion, I focused on the Visual aspect of the Trifecta Checkup. This dataset is really difficult to interpret, and I'd not want to visualize it directly.
The real question we are after is to assess which manufacturer is leading the pack in plug-in vehicles.
There are a number of obstacles in our path. Different makes are being launched at different times, and it takes many months for a new make to establish itself in the market. Thus, comparing one make that just launched with another that has been in the market for twelve months is a problem.
Also, makes are of different vehicle types: compacts, SUVs, sedans, etc. More expensive vehicles will have fewer sales whether they are plug-ins or not.
Thirdly, population grows over time. The analyst would need to establish growth that is above the level of population growth.
Note to New York metro readers: I'm an invited speaker at NYU's "Art and Science of Brand Storytelling" summer course which starts tomorrow. I will be speaking on Thursday, 12-1 pm. You can still register here.
The home run data set, compiled by ESPN and visualized by Mode Analytics, is pretty rich. I took a quick look at one aspect of the data. The question I ask is what differences exist among the 10 hitters that are highlighted in the previous visualization. (I am not quite sure how those 10 were picked because they are not the Top 10 home run hitters in the dataset for the current season.)
The following chart focuses on two metrics: the total number of home runs by this point in the season; and the "true" distances of those home runs. I split the data by whether the home run was hit on a home field or an away stadium, on the hunch that we'd need to correct for such differences.
The hitters are sorted by total number of home runs. Because I am using a single season, my chart doesn't suffer from a cohort bias. If you go back to the original visualization, it is clear that some of these hitters are veterans with many seasons of baseball in them while others are newbies. This cohort bias explains the difference in dot densities of those plots.
Having not been following baseball recently, I don't know many of these names on the list. I have to look up Todd Frazier - does he play in a hitter-friendly ballpark? His home to away ratio is massive. Frazier plays for Cincinnati, at the Great American Ballpark. That ballpark has the third highest number of home runs hit of all ballparks this season, although up till now, opponents have hit more home runs there than home players. For reference, Troy Tulowitzki's home field is Colorado's Coors Field, which is a hitter's paradise. Giancarlo Stanton, who also hits quite a few more home runs at home, plays for Miami at Marlins Park, which is below the median in terms of home run production; thus his achievement is probably the most impressive amongst those three.
Josh Donaldson is the odd man out, as he has hit more away home runs than home runs at home. His O.co Coliseum is middle-of-the-road in terms of home runs.
In terms of how far the home runs travel (bottom part of the chart), there are some interesting tidbits. Brian Dozier's home runs are generally the shortest, regardless of home or away. Yasiel Puig and Giancarlo Stanton generate deep home runs. Adam Jones, Josh Donaldson, and Yoenis Cespedes have hit the ball quite a bit deeper away from home. Giancarlo Stanton is one of the few who has hit the home-run ball deeper at his home stadium.
The baseball season is still young, and the sample sizes at the individual hitter's level are small (~15-30 total), thus the observed differences at the home/away level are mostly statistically insignificant.
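One way to check such a home/away difference is a permutation test on the distance samples. The distances below are simulated, not the ESPN data, just to show the mechanics at these sample sizes:

```python
# Permutation test on hypothetical "true distance" samples, illustrating
# how a home/away difference is assessed with ~15-30 home runs per hitter.
import random

random.seed(7)
home = [random.gauss(400, 25) for _ in range(15)]  # feet, simulated
away = [random.gauss(395, 25) for _ in range(12)]  # feet, simulated

observed = sum(home) / len(home) - sum(away) / len(away)

pooled = home + away
extreme = 0
trials = 5000
for _ in range(trials):
    random.shuffle(pooled)  # reassign home/away labels at random
    perm_home, perm_away = pooled[:len(home)], pooled[len(home):]
    diff = sum(perm_home) / len(perm_home) - sum(perm_away) / len(perm_away)
    if abs(diff) >= abs(observed):
        extreme += 1
p_value = extreme / trials

print(f"observed difference: {observed:.1f} ft, p = {p_value:.2f}")
```

With samples this small, even a real difference of a few feet is typically indistinguishable from noise.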
The prior post on the original graphic can be found here.
Reader Joe D. tipped me about a nice visualization project by a pair of grad students at WPI (link). They displayed data about the Boston subway system (i.e. the T).
The project has many components, one of which is the visualization of the location of every train in the Boston T system on a given day. This results in a very tall chart, the top of which I clipped:
I recall that Tufte praised this type of chart in one of his books. It is indeed an exquisite design, attributed to Marey. It provides data on both time and space dimensions in a compact manner. The slope of each line is positively correlated with the velocity of the train (I say correlated because the chart spaces the stations evenly, whereas the actual distances between stations vary). The authors acknowledge the influence of Tufte in their credits, and I recognize a couple of signatures:
For once, I like how they hide the names of the intermediate stations along each line while retaining the names of the key stations. Too often, modern charts banish all labels to hover-overs, which is a practice I dislike. When you move the mouse horizontally across the chart, you will see the names of the unnamed stations.
The text annotations on the right column are crucial to generating interest in this tall, busy chart. Without those hints, readers may get confused and lost in the tapestry of schedules. If you scroll to the middle, you find an instance of train delay caused by a disabled train. Even with the hints, I find that it takes time to comprehend what the notes are saying. This is definitely a chart that rewards patience.
Clicking on a particular schedule highlights that train, pushing all the other lines into the background. The side panel provides a different visual of the same data, using a schematic subway map.
Notice that my mouse is hovering over the 6:11 am moment (represented by the horizontal guide on the right side). This generates a snapshot of the entire T system shown on the left. This map shows the momentary location of every train in the system at 6:11 am. The circled dot is the particular Red Line train I have clicked on before.
This is a master class in linking multiple charts and using interactivity wisely.
You may feel that the chart using the subway map is more intuitive and much easier to comprehend. It also becomes very attractive when the dots (i.e., trains) are animated and shown to move through the system. That is the image that the project designers have blessed with the top position of their Github page.
However, the image above allows us to see why the Marey diagram is the far superior representation of the data.
What are some of the questions you might want to answer with this dataset? (The Q of our Trifecta Checkup)
Perhaps figure out which trains were behind schedule on a given day. We can define behind-schedule as slower than the average train on the same route.
It is impossible to figure this out on the subway map. The static version presents a snapshot while the dynamic version has moving dots, from which readers are challenged to estimate their velocities. The Marey diagram shows all of the other schedules, making it easier to find the late trains.
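The behind-schedule definition above is straightforward to compute once each trip's duration is known. A sketch with hypothetical trips (route names and durations are made up):

```python
# Flag trips slower than the average trip on the same route.
# Trip data here is hypothetical.
from collections import defaultdict

trips = [  # (route, trip_id, duration in minutes)
    ("Red", "R1", 38), ("Red", "R2", 41), ("Red", "R3", 52),
    ("Orange", "O1", 30), ("Orange", "O2", 29),
]

totals = defaultdict(lambda: [0, 0])  # route -> [total minutes, trip count]
for route, _, duration in trips:
    totals[route][0] += duration
    totals[route][1] += 1
averages = {route: total / count for route, (total, count) in totals.items()}

late = [trip_id for route, trip_id, d in trips if d > averages[route]]
print(late)  # ['R3', 'O1']
```

The Marey diagram shows exactly this comparison visually: a late trip is a line noticeably shallower than its neighbors on the same route.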
Another question you might ask is how a delay in one train propagates to other trains. Again, the subway map doesn't show this at all but the Marey diagram does - although here one can nitpick and say even the Marey diagram suffers from overcrowding.
On that last question, the project designers offer up an alternative Marey. Think of this as an indexed view. Each trip is indexed to its starting point. The following setting shows the morning rush hour compared to the rest of the day:
I think they could utilize this display better by showing hourly averages instead of every single schedule. Rather than letting readers play with the time scale, they should pre-compute the most interesting periods, which, according to the text, are the morning rush, afternoon rush, midday lull and evening lull.
The trouble with showing every line is that the density of lines is affected by the frequency of trains. The rush hours have more trains, causing the lines to be denser. The density gradient competes with the steepness of the lines for our attention, and completely overwhelms it.
There really is a lot to savor in this project. You should definitely spend some time reviewing it. Click here.
Also, there is still time to sign up for my NYU chart-making workshop, starting on Saturday. For more information, see here.
Carl Bialik used to be the Numbers Guy at Wall Street Journal - he's now with FiveThirtyEight. Apparently, he left a huge void. John Eppley sent me to this set of charts via Twitter.
This chart about Citibike is very disappointing.
Using the Trifecta checkup, I first notice that it addresses a stale question and produces a stale answer. The caption below the chart says "the peak times ... seem to be around 9 am and 6 pm." What a shock!
I sense a degree of meekness in using "seem to be". There is not much to inspire confidence in the data: rather than the full statistics, which you'd think someone at Citibike has, the chart is based on "a two-day sample last autumn". The number of days is less concerning than the question of whether those two autumn days are representative of the year. Curious readers might want to know what data was collected, how it was collected, and the sample size.
Finally, the graph makes a mess of the data. While the black line appears to be data-rich, it is not. In fact, the blue dots might as well be randomly scattered and connected. As you can see from the annotations below, the scale of the chart makes no sense.
Plus, the execution is sloppy, with a missing data label.
The next chart is not much better.
The biggest howler is the choice of pie charts to illustrate three numbers that are not that different.
But I have to say the chart raises more questions than it answers. I am not an expert in pregnancy but doesn't a pregnant woman's weight include the weight of the baby she's carrying? So the more weight the woman gains, on average, the heavier is her baby. What a shock!
The last and maybe the least is this chart about basketball players in the playoffs.
It's the dreaded bubble chart. The players are arranged in a perplexing order. I wonder if there is a natural numbering system for basketball positions (center = #1, etc.), like there is in soccer. Even if there is such a natural numbering system, I still question the decision to confound that system with a complicated ranking of current-year playoff players against all-time players.
Above all, the question being asked is uninteresting, and so the chart is uninformative. A more interesting question to me is whether the best players are playing in this year's playoff. To answer this question, the designer should be comparing only currently active players, and showing the all-time ranks of those players who are playing in the playoffs versus those who aren't.
Reading Alberto Cairo’s fabulous book, The Functional Art, feels like reading my own work. It’s staggering how closely aligned our sensibilities are, notwithstanding our disparate backgrounds, he a data journalist by training, and I a statistician. We probably can finish each other’s sentences—and did at this recent Analytically Speaking webcast (link to clip).
Cairo currently teaches data visualization at the University of Miami; this is after a distinguished career as a data/visual journalist, having won many awards.
The Functional Art is divided into halves, which can be read independently.
The front part is a terrific overview of data visualization concepts. Cairo’s interest is in principles, rather than recipes. The field of data visualization has developed separately under three academic disciplines: design, computer science, and statistics. Inevitably, the work products contain contradictions and much re-invention. Cairo achieves a synthesis of these schools of thought, and this book is the clarion call for more work on unifying the key intellectual threads of the field.
The second half contains a series of interviews with industry luminaries. This section is a unique contribution to the literature, offering a glance behind the scenes of the craft. Practitioners will find these short pieces illuminating and profitable. It is often a long journey to arrive at the graphic in print. The selection of designers emphasizes mainstream media outlets, although the interviewees have wide-ranging views.
Included in these pages are plenty of published data graphics, frequently work that Cairo produced while working for the Brazilian publication, Epoca. These graphics are elaborate and ambitious, and nicely reproduced in color images. They reward detailed study, with attention to composition, narrative structure, chart types, selection of statistics, etc.
There are plenty of books on the market about how to do graphics (Dona Wong, Naomi Robbins, and Nathan Yau come to mind). Cairo’s book is not about doing, but about thinking about charts. Trust me, time spent thinking about charts will much improve your charts.
I will now describe some sections of the book that particularly hold my interest:
In Chapter 3, Cairo explains the “visualization wheel,” a nice way to visualize the decisions that designers make when creating charts. Each decision is presented as a trade-off between two extremes. For example, a chart can be “light” or “dense.” This axis evokes Tufte’s data-ink ratio. Devices such as this wheel are useful for integrating the diverse viewpoints that coexist in our field. Frequently, these trade-off decisions are made implicitly—but they can really benefit from explicit consideration.
Figure 4.11 is one of the Epoca charts narrating a Brazilian election. Just recently, I linked to Cairo’s blog post about a similar chart. In both, a spider (radar) plot features prominently. On the same chart, you’ll find a nice demonstration of the small-multiples principle. I applaud the publisher of Epoca for supporting such deep data graphics.
Chapter 8 is invaluable in documenting the chart-making process. Trial and error is a key element of this process. Here, Cairo shows some of the earlier drafts of projects that eventually went to publication. This material is similar to what Kevin Quealy shows at his ChartNThings blog about New York Times graphics.
Chapter 9 is one of the more mature discussions of interactive graphics I have seen. Too often, interactivity is reduced to a feature that is layered onto any dataset. It should rightfully be seen as a problem of design.
Figure 10.1 is not strictly speaking a “data” graphic but I love John Grimwade’s visual explanation of the “transatlantic superhighway”.
I highlighted the columns for 1993 and 1996. Visually, the height of one column is twice that of the other column. And yet the axis labels tell us that the difference is 65% versus 62.5%.
The reason for the start-at-zero rule is to avoid exaggerating meaningless differences.
To judge whether a change is meaningful or not, in time-series data like this, we have to use history to understand the general variability in college enrollment rates. Based on what we can see in this data (about 20 years), the college enrollment rate hovers between 60 and 70 percent. There is no data between 0 and 60 percent. Those are irrelevant values for this data series. This is why starting at zero is counterproductive.
Here is the line chart starting at zero:
This display has the unintended effect of squashing meaningful changes over time by inserting a lot of empty space below the line.
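A little arithmetic with the chart's 62.5% and 65% figures makes both distortions concrete:

```python
# Apparent height ratio of the 65% column to the 62.5% column,
# depending on where the axis starts.
def column_ratio(small, large, baseline):
    """How many times taller the larger column looks above the baseline."""
    return (large - baseline) / (small - baseline)

print(column_ratio(62.5, 65, 60))  # axis starting at 60: looks 2x as tall
print(column_ratio(62.5, 65, 0))   # axis starting at 0: ratio of 1.04
```

Neither baseline is honest by default: the 60 baseline doubles a 2.5-point gap visually, while the zero baseline buries it. The right choice depends on the historical variability of the series, as argued above.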