This chart published in Harvard Magazine has won my heart.
It is well executed in many ways. The chart illustrates a study of time spent by assistant and associate professors. It focuses specifically on time spent working versus time spent on household chores. One of the obvious questions of the study is whether female professors are disadvantaged when they have family obligations.
The general visual framework is the profile chart. Four segments of professors are arranged left to right from single with no children to married, with children and both parents working or single parent. The chart makes these points clear:
Having children adds about 15-30 hours to time spent on household duties, per partner
Household duties are not evenly split by gender, with the expected bias. (Of course, this observation must be carefully vetted. The men and women are not married to each other, even on the right side of the chart. But I presume the usual interpretation should hold.)
Male professors with kids do spend more time on household chores than those without but not as much as female professors with kids
In the meantime, the amount of time spent working is about the same for all four segments, raising a side question: what other activities got displaced? The juxtaposition of the lines allows us to see that the displaced hours are almost 50 percent of the total time spent working! What did they do less of?
I especially like the explicit depiction and labeling of the "gender gap" (the orange vertical lines). I also like the use of median hours instead of average hours.
My one little complaint is that the designer forgot to tell us the hours are on a weekly basis (I'm guessing here). Just adding "per week" after "median hours" would have fixed this.
One simple chart cannot address all possible questions on such a complicated subject. I like the restraint the designer exercised in not saddling the chart with too many questions.
I will just mention one tricky statistical issue. Getting tenure and making babies are both activities that occur within some time window in a professor's life, if at all. So there is a survivorship bias. The professors who receive tenure drop out of the picture. If you are older, and still in the pool, you probably are less "accomplished" from the perspective of the tenure-granting process. The longer you stay in that pool, the more likely you will have gotten married and/or have children--thus, there is an age bias going from left to right, as well as a survivorship bias. This implies that the characteristics of the professors in the four groups are likely to be different not just on their marital and child-rearing statuses but also on age and probability of tenure.
An anonymous reader sent in a Type V critique of the following map of July unemployment rates by state. The map was published by the Bureau of Labor Statistics (BLS), and used in a recent article in Vox.
Matt @ Vox took the BLS's bait, and singled out Mississippi as the worst in the nation. Our reader-contributor is none too pleased with this conclusion.
He noted that the red state stands out only because of the high "out of sample" top range of the legend. Three out of the seven colors are not found on the map at all! This is kind of like the white space problem when doing a line plot with large values and an axis starting at zero (for example, here), but the opposite. All the states are compressed into four colors, three of which are shades of orange.
The reader investigated, and reported back:
The top end of the legend seems to be set by Puerto Rico's 13.1%. Puerto Rico is omitted from the Vox map as well as from the BLS publication (link to PDF).
Mississippi only has the bare minimum, 8.0%, to qualify for the red color. Georgia is a 7.8; Michigan, Nevada, and Rhode Island are all 7.7.
24 (of the 50 States plus DC) are in the 6-8% band, and 21 are in the 4-6% band, with the remaining 5 under 4%.
None of the above is obvious when looking at the map.
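To make the reader's point concrete, here is a minimal sketch that assigns each rate to a legend band and lists the bands that end up empty. The band edges are an assumption (seven bands, mostly two points wide, reaching past Puerto Rico's 13.1%); the exact breaks on the BLS legend are not given above. The input is limited to the five rates the reader quoted.

```python
from bisect import bisect_right
from collections import Counter

# Assumed legend breaks, not taken from the BLS map: seven bands in total.
EDGES = [4.0, 6.0, 8.0, 10.0, 12.0, 14.0]
LABELS = ["<4", "4-6", "6-8", "8-10", "10-12", "12-14", "14+"]


def band(rate):
    """Return the legend band for a rate; lower bounds are inclusive."""
    return LABELS[bisect_right(EDGES, rate)]


def band_usage(rates):
    """Count states per band, keeping bands that contain no state."""
    counts = Counter(band(r) for r in rates.values())
    return {label: counts.get(label, 0) for label in LABELS}


# Only the rates quoted in the reader's notes.
quoted = {
    "Mississippi": 8.0,
    "Georgia": 7.8,
    "Michigan": 7.7,
    "Nevada": 7.7,
    "Rhode Island": 7.7,
}

usage = band_usage(quoted)
unused = [label for label, n in usage.items() if n == 0]
```

Under these assumed breaks, Mississippi lands in a different band from Georgia even though the two are only 0.2 points apart, and every band above 10% stays empty unless Puerto Rico is included.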
In the Trifecta Checkup, this is a Type V chart. The data is accurate. The question being asked is clear but the visual construction is problematic.
[I'm seizing back the mike.] While the map is often not the best choice for showing geographic data, something we frequently cover on this blog, in this particular case, there is a strong regional pattern. Of course, with the compressed choice of colors, this regional pattern is not easily observed in the original.
Rescheduling Notice: I have been informed by the organizers that the Meetup tonight has to be rescheduled due to an unexpected problem with the venue. When a new date is set, I will let you know.
Since I am not working on the slides for the Meetup, I have a little time to follow up on the post about the World Bank graphic.
One common response, also expressed on Twitter, is to "fix" it by using a scatter plot. Xan helpfully drew one up, which I added to the post.
I mentioned, cryptically, that if you try making improvements, you will find that the chart is a Type QD, not a Type D. There are clearly problems with the data but this chart cannot be "fixed" until one clarifies what the message of the chart really is.
The original chart plots (y=) GDP per capita against (x=) cumulative proportion of the world's population with countries ordered from lowest to highest GDP per capita. Embedded in the rectangular areas is total GDP.
Xan's chart plots (y=) total GDP in PPP terms against (x=) population. The per-capita PPP GDP is readable through diagonal gridlines.
Xan's chart is undoubtedly less confusing, and more direct. But it won't answer the cumulative question that the World Bank seems to be asking. That question is: how much of the world's wealth (measured in GDP) is held by the poorest X% of the population. This isn't something you can find on the scatter plot.
Now, the "cumulative" question is nice to think about but it is ill-posed for the kinds of data available. Each country ends up being represented by its average (per capita) wealth, but there is rampant wealth inequality within countries. Even though Nigeria is in the bottom 15%, it is certainly not true that the entire population of Nigeria belongs to the world's poorest 15%.
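The cumulative calculation itself can be sketched in a few lines: sort countries by GDP per capita, then accumulate population share and GDP share. The figures below are made-up placeholders, not World Bank data, and the function name is hypothetical. Note that the calculation treats every resident of a country as having that country's average, which is exactly the weakness described above.

```python
def cumulative_shares(countries):
    """countries: list of (name, population, total_gdp) tuples.

    Returns (name, cumulative population share, cumulative GDP share),
    ordered from lowest to highest GDP per capita.
    """
    ordered = sorted(countries, key=lambda c: c[2] / c[1])
    total_pop = sum(c[1] for c in ordered)
    total_gdp = sum(c[2] for c in ordered)
    rows = []
    pop_so_far = 0.0
    gdp_so_far = 0.0
    for name, population, gdp in ordered:
        pop_so_far += population
        gdp_so_far += gdp
        rows.append((name, pop_so_far / total_pop, gdp_so_far / total_gdp))
    return rows


# Placeholder numbers for illustration only.
toy = [("A", 100, 1000), ("B", 300, 600), ("C", 100, 400)]
rows = cumulative_shares(toy)
```

In this toy example, country B has the lowest GDP per capita, so the first row says the poorest 60% of the population holds 30% of total GDP.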
When a reader tweeted that a scatter plot is the solution, I asked: "Which two variables?" Here are just a few candidates:
total GDP
GDP per capita
total GDP PPP
PPP GDP per capita
cumulative total GDP, ordered by per-capita GDP
cumulative total GDP, ordered by total GDP
cumulative total GDP, ordered by total population
cumulative total GDP, ordered by population growth
cumulative total GDP PPP, ordered by per-capita GDP PPP
cumulative total GDP PPP, ordered by total GDP PPP
cumulative total GDP PPP, ordered by total population
cumulative total GDP PPP, ordered by population growth
cumulative total population
cumulative GDP per capita
cumulative GDP PPP per capita
population
working population
total GDP growth
total GDP PPP growth
total GDP per capita growth
total GDP PPP per capita growth
total population growth
total working population growth
median GDP
median GDP PPP
Different charts address different questions, some of which are more meaningful and some of which have better data. There may be a few interesting questions, in which case a set of scatter plots may work better.
The New York Times Upshot team came up with a dataviz that is worth your time. This is a set of maps that gives a perspective on migration patterns within the US. The metric being portrayed is the birthplace of current residents of each state.
Here is the chart for California:
I see a few smart ideas, starting with the little map on the bottom left. It serves multiple functions. It is a legend mapping colors to four regions of the US. It serves as a visual guide to the definition of regions. It serves as an interactive tool to select states. Readers might remember the use of a pie chart as a legend in my remake of one of the Wikipedia pie charts (link).
The aggregation up to regions is what really makes this chart work. This aggregation reduces the number of pieces from about 50 to about 10.
They also did a great job with the axes and gridlines. Many of the data labels are hidden but the most important numbers are retained. These include the proportion of residents who were born in their home state, the proportion of residents who were born outside the U.S., and any state(s) that contribute a significant portion of residents. In the California example, we see that the proportion of Midwest-born people living in California has declined by a lot over time.
Users can interactively hover over the gridlines to uncover the data labels.
As you scroll through the states, there are some recurring patterns.
Some states clearly have become more desirable over time. Georgia, for instance, has seen strong in-migration (colored pieces) especially from non-Southern states:
This pattern is repeated in other southeastern states, including Virginia, North Carolina and Tennessee.
By contrast, some states are not getting the migrants. As a result, the share of residents born in the home state has increased over time. The Midwestern states have this problem. For instance, Minnesota:
I also find a few states with special features. Nevada has always been a state of migrants:
Wyoming, on the other hand, has become popular with migrants over time but the composition has shifted away from Midwestern states.
I'd have preferred presenting the charts in clusters based on patterns.
I haven't been able to figure out the multi-color spaghetti. I think the undulations are purely for aesthetic reasons.
One way to read the chart, then, is to first see three big patches (light gray for born in current state; white patch for born in other U.S. states; dark gray for born outside the U.S.). Within the white patch, we are looking for the shift between the colors (i.e. regions).
Matthew Yglesias, writing for Vox, cited the following chart from a World Bank project:
His comment was: "We can see that while China has overtaken Germany and Japan to become the world's second-largest economy (i.e., total area of the rectangle) its citizens are nowhere near being as rich as those of those countries or even Mexico."
Yes, the chart encodes the size of the economy in a rectangular area, with one side being the per-capita GDP and the other being the population. I am not sure about the "we can see". I am not confident that the short and wide rectangle for China is larger than the thin and tall ones for Japan and for Germany. Perhaps Matthew is relying on knowledge in his head, rather than knowledge on the chart, to come to this conclusion.
This is the trouble with rectangular area charts: they have a nerdy appeal since side x side = area but as a communications device, they fail.
Here are some problems with the chart:
it's difficult to compare rectangular areas
the columns can only be sorted in one way (I'd have chosen to order it by population)
colors are necessitated by the chart type not the data
the cumulative horizontal axis makes no sense unless the vertical axis is cumulative GDP (or cumulative GDP per capita)
Matthew should also have mentioned PPP (Purchasing Power Parity). If GDP is used as a measure of "wellbeing", then costs of living should be taken into account in addition to incomes. The cost of living in China is much lower than in Japan or Germany and using the prevailing exchange rates disguises this point.
Try your hand at fixing this one. There are no easy solutions. Does interactivity help? How about multiple charts? You will learn why I classify it as QDV instead of just DV.
[Update, 8/18/2014:] Xan Gregg created a scatter plot version of the chart. He also added, "There is still the issue of what the question is, but I'm assuming it's along the lines of 'How do economies compare regarding GDP, population, and GDP/capita?' I'm using the PPP-based GDP, but I didn't read the report carefully enough to figure out if another measure was better."
Thanks to the ~200 or so people who showed up at last week's Data Scientist Meetup in Cambridge, Mass., hosted by John Baker. I gave a brief introduction to the concept of "numbersense", and was part of a panel of "chief data scientists" talking about how to run data teams. Thanks to those who asked questions.
This month, I am back in New York, and will be giving two talks.
First up is the Data Visualization New York Meetup organized by Paul Trowbridge. The link to register is here but it looks like all slots have been taken within days. You should get on the wait list as some registrants will eventually drop out. This event is on Aug 20 (Wed).
On Aug 26 (Tues), I am giving the "thought leader" presentation for the Optimizely Experience. I will be talking about statistical testing for online marketing aka A/B testing. The title of the talk is "Five Questions About Testing You Wanted to Ask But Didn't" unless I come up with something better. You can register here.
This will be a brand-new presentation, and I look forward to sharing my ten-plus years of running online experiments. See you there!
Also, please let the organizers at SXSW know you want to hear me and other data viz experts talk about visualizing data in Austin. Jon Schwabish has put together a fabulous panel with people from different parts of the spectrum, and it promises to be an engaging conversation.
This sort of chart is, unfortunately, quite common in business circles. Just about the only thing one can read readily from this chart is the overall growth in the plug-in vehicle market (the heights of the columns).
To fix this chart, start subtracting. First, we can condense the monthly data to quarterly:
This version is a bit less busy but there are still too many colors, and too many things to look at.
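As a rough sketch of this first subtraction, monthly counts can be rolled up to quarters in a few lines of Python. The record layout (year, month, model, units) and the sample rows are hypothetical; they are not the dataset behind the chart.

```python
from collections import defaultdict


def to_quarterly(monthly):
    """monthly: iterable of (year, month, model, units) records.

    Returns {(year, quarter, model): units}, summing the months in each quarter.
    """
    totals = defaultdict(int)
    for year, month, model, units in monthly:
        quarter = (month - 1) // 3 + 1  # months 1-3 -> Q1, 4-6 -> Q2, ...
        totals[(year, quarter, model)] += units
    return dict(totals)


# Made-up rows for illustration only.
sample = [
    (2013, 1, "model_a", 10),
    (2013, 2, "model_a", 20),
    (2013, 3, "model_a", 30),
    (2013, 4, "model_a", 40),
    (2013, 1, "model_b", 5),
]
quarterly = to_quarterly(sample)
```

Each model now contributes four columns per year instead of twelve, which cuts the number of columns on the chart to a third.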
Next, we can condense the makes of the vehicles and focus on the manufacturers:
This version is still less busy and more readable. We can now see Chevrolet, Nissan, Toyota, Ford and Tesla being the five biggest manufacturers in this category. All the small brands have been aggregated into the "Others" category. The stacked column chart still makes it hard to know what's going on with each individual brand's share, other than the one brand situated at the bottom of the stack.
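The second subtraction, rolling individual vehicles up to manufacturers and pooling the small ones into "Others", might look like the sketch below. The vehicle-to-manufacturer mapping, the `top_n` cutoff, and the sample numbers are all assumptions for illustration. The function also returns each brand's share of the total, which is the quantity that is hard to read off a stacked column chart.

```python
from collections import defaultdict


def by_manufacturer(sales, maker_of, top_n=5):
    """sales: {vehicle: units}; maker_of: {vehicle: manufacturer}.

    Keeps the top_n manufacturers by units and pools the rest as "Others".
    Returns {manufacturer: (units, share_of_total)}.
    """
    totals = defaultdict(int)
    for vehicle, units in sales.items():
        totals[maker_of[vehicle]] += units
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    kept = dict(ranked[:top_n])
    others = sum(units for _, units in ranked[top_n:])
    if others:
        kept["Others"] = others
    grand_total = sum(kept.values())
    return {maker: (units, units / grand_total) for maker, units in kept.items()}


# Hypothetical vehicles and counts.
maker_of = {"v1": "Maker A", "v2": "Maker A", "v3": "Maker B",
            "v4": "Maker C", "v5": "Maker D"}
sales = {"v1": 30, "v2": 20, "v3": 30, "v4": 15, "v5": 5}
result = by_manufacturer(sales, maker_of, top_n=2)
```

With `top_n=2`, the two smallest makers in this toy data are merged into a single "Others" entry holding 20% of the units.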
A line chart shows the growth in the overall market, as well as several interesting developments:
The growth in the number of competitors in the market especially since 2012
The fragmentation of the market. Before mid 2012, Chevrolet was dominating the market. Since then, five or six brands have been splitting the market
The first-to-market brands have not been able to sustain their advantage
A smoothed version of the line chart is even more readable:
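The smoothing method behind the chart is not stated. One common choice is a trailing moving average, sketched here with an assumed window of three periods and made-up values.

```python
def moving_average(values, window=3):
    """Trailing moving average.

    Early points are averaged over however many values are available so far,
    so the output has the same length as the input.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Made-up series with a one-period spike.
series = [10, 20, 60, 20, 10]
smooth = moving_average(series)
```

A wider window flattens spikes further, at the cost of reacting more slowly to real changes in a brand's sales.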
Graphics is a discipline that often rewards subtracting. Less is more.
In the above discussion, I focused on the Visual aspect of the Trifecta Checkup. This dataset is really difficult to interpret, and I'd not want to visualize it directly.
The real question we are after is to assess which manufacturer is leading the pack in plug-in vehicles.
There are a number of obstacles in our path. Different makes are being launched at different times, and it takes many months for a new make to establish itself in the market. Thus, comparing one make that just launched with another that has been in the market for twelve months is a problem.
Also, makes are of different vehicle types: compacts, SUVs, sedans, etc. More expensive vehicles will have fewer sales whether they are plug-ins or not.
Thirdly, population grows over time. The analyst would need to establish growth that is above the level of population growth.