Alberto Cairo just gave a wonderful talk to my workshop, in which he complains about the state of dataviz teaching. So, it's quite opportune that reader Maja Z. sent in a couple of examples from a recent course on data visualization for academics. She was surprised to see these held out as examples of good work. I'll discuss one chart today, and the other one some other day.
The instructor for the course praised this chart for this principle: "always try to find a graphic that relates to your subject, like the bullets here representing military spending, and use it in the chart."
Students who take my class learn the opposite lesson: I like to say that imagery often backfires. I do like charts with imagery that makes the data come alive, but more often than not, the designer falls in love with the imagery and lets the data down.
This chart presumably shows the top 10 military spenders in the world by total amount spent in 2013. You'd think that the Chinese spent a bit more than half what the Americans did. But the data labels say $640 billion vs $188 billion, only about 30%. Next, the Russian spend is 46% of the Chinese according to the data, etc. So, is this really a data visualization or just some pictures with numbers printed next to them?
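The mismatch is quick to check with arithmetic. A minimal sketch in Python, using only the two data labels quoted above (the variable names are mine, for illustration):

```python
# Data labels quoted in the post, in billions of U.S. dollars (2013).
us_spend = 640
china_spend = 188

# Ratio that a faithful chart should convey through the column heights.
ratio = china_spend / us_spend
print(f"China / U.S. = {ratio:.0%}")  # about 29%, nowhere near "a bit more than half"
```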
It's possible that the data is encoded in the surface areas or the volumes of these warheads but in reality, this is a glorified column chart, so most readers will respond to the heights of the columns.
Perhaps the shadows are there to demonstrate shadow spending.
The designer seems to appreciate that total spending is not necessarily a great metric. Spending as a proportion of GDP is provided as a secondary metric. I'm not so sure what to make of this though: should we expect richer nations to need/want to spend more building bombs and such? It just doesn't seem very logical to me.
Instead, a more meaningful metric might be military spending per capita. Controlling for population seems somewhat logical; the more people you have to protect, the more money you have to spend.
In the end, I made this scatter plot that tries to have it both ways:
(The percentages are of GDP.)
Here, we can see that Saudi Arabia and the U.S. are particularly aggressive spenders, spending over $2000 per person per year. The respective two dots are way above the average line (for the top 10 spenders). At the richer end of the scale, the American spending is way above the international average. On the other hand, Japan and Germany both spend significantly less than would be predicted by their GDP per capita levels.
Of note, readers more easily relate to the per-capita numbers than the aggregate figures in the original chart. They learn, for instance, that Saudi Arabia's average GDP was $27,000 per head, of which $2,500 went to arming itself up.
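The two metrics on the scatter plot are tied together by a simple identity: spending per person equals the share of GDP spent on the military times GDP per person. A rough check in Python using the rounded Saudi figures quoted above (the function name is my own, for illustration):

```python
def military_share_of_gdp(spend_per_capita, gdp_per_capita):
    """Share of GDP devoted to military spending, from two per-capita figures."""
    return spend_per_capita / gdp_per_capita

# Rounded Saudi Arabia figures from the post, in U.S. dollars per person.
saudi_share = military_share_of_gdp(2_500, 27_000)
print(f"{saudi_share:.1%}")  # roughly 9% of GDP
```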
Rescheduling Notice: I have been informed by the organizers that the Meetup tonight has to be rescheduled due to an unexpected problem with the venue. When a new date is set, I will let you know.
Since I am not working on the slides for the Meetup, I have a little time to follow up on the post about the World Bank graphic.
One common response, also expressed on Twitter, is to "fix" it by using a scatter plot. Xan helpfully drew one up, which I added to the post.
I mentioned, cryptically, that if you try making improvements, you will find that the chart is a Type QD, not a Type D. There are clearly problems with the data but this chart cannot be "fixed" until one clarifies what the message of the chart really is.
The original chart plots (y=) GDP per capita against (x=) cumulative proportion of the world's population with countries ordered from lowest to highest GDP per capita. Embedded in the rectangular areas is total GDP.
Xan's chart plots (y=) total GDP in PPP terms against (x=) population. The per-capita PPP GDP is readable through diagonal gridlines.
Xan's chart is undoubtedly less confusing, and more direct. But it won't answer the cumulative question that the World Bank seems to be asking. That question is: how much of the world's wealth (measured in GDP) is held by the poorest X% of the population. This isn't something you can find on the scatter plot.
Now, the "cumulative" question is nice to think about but it is ill-posed for the kinds of data available. Each country ends up being represented by its average (per capita) wealth, but there is rampant wealth inequality within countries. Even though Nigeria is in the bottom 15%, it is certainly not true that the entire population of Nigeria belongs to the world's poorest 15%.
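To make the cumulative question concrete, here is a sketch of the calculation in Python. The four countries and their figures are invented for illustration, and each country is treated as a single block at its average GDP per capita, which is precisely the simplification criticized above.

```python
# Hypothetical countries: (name, population in millions, total GDP in $ billions).
countries = [
    ("A", 200, 100),
    ("B", 50, 500),
    ("C", 150, 300),
    ("D", 100, 1100),
]

def cumulative_shares(rows):
    """Order countries from poorest to richest by GDP per capita, then
    accumulate their shares of world population and world GDP."""
    ordered = sorted(rows, key=lambda r: r[2] / r[1])
    total_pop = sum(r[1] for r in ordered)
    total_gdp = sum(r[2] for r in ordered)
    shares, pop, gdp = [], 0, 0
    for name, p, g in ordered:
        pop += p
        gdp += g
        shares.append((name, pop / total_pop, gdp / total_gdp))
    return shares

for name, pop_share, gdp_share in cumulative_shares(countries):
    print(f"up to {name}: poorest {pop_share:.0%} of people hold {gdp_share:.0%} of GDP")
```

A scatter plot of total GDP against population shows the same inputs, but the running totals in this loop are what the cumulative question actually asks for.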
When a reader tweeted that a scatter plot is the solution, I asked: "Which two variables?" Here are just a few candidates:
total GDP
GDP per capita
total GDP PPP
PPP GDP per capita
cumulative total GDP, ordered by per-capita GDP
cumulative total GDP, ordered by total GDP
cumulative total GDP, ordered by total population
cumulative total GDP, ordered by population growth
cumulative total GDP PPP, ordered by per-capita GDP PPP
cumulative total GDP PPP, ordered by total GDP PPP
cumulative total GDP PPP, ordered by total population
cumulative total GDP PPP, ordered by population growth
cumulative total population
cumulative GDP per capita
cumulative GDP PPP per capita
population
working population
total GDP growth
total GDP PPP growth
total GDP per capita growth
total GDP PPP per capita growth
total population growth
total working population growth
median GDP
median GDP PPP
Different charts address different questions, some of which are more meaningful and some of which have better data. There may be a few interesting questions, in which case a set of scatter plots may work better.
Making data graphics interactive should improve the user experience. In practice, interactivity too often becomes overhead, making it harder for users to understand the data on the graph.
Reader Joe D. (via Twitter) admires the statistical sophistication behind this graphic about home runs in Major League Baseball. This graphic does present interesting analyses, as opposed to acting as a container for data.
For example, one can compare the angle and distance of the home runs hit by different players:
One can observe patterns: most of these highlighted players have more home runs on the left side than the right side. However, for this chart to be more telling, additional information should be provided. Knowing whether the hitter is left-handed, right-handed, or a switch hitter would be key to understanding the angles. Information about the home ballpark, and indeed the distinction between home and away home runs, is also critical to making sense of this data. (One strange feature of baseball fields is that they all have different dimensions and shapes.)
But back to my point about interactivity. The original chart does not present the data in small multiples. Instead, the user must "interact" with the chart by clicking successively on each player (listed above the graphic).
Given that the graphic only shows one player at a time, the user must use his or her memory to make the comparison between one player and the next.
The chosen visual form discourages readers from making such comparisons, which defeats one of the primary goals of the chart.
The Facebook data science team has put together a great course on EDA at Udacity.
EDA stands for exploratory data analysis. It is the beginning of any data analysis when you have a pile of data (or datasets) and you need to get a feel for what you're looking at. It's when you develop some intuition about what sort of methodology would be appropriate to analyze the data.
Not surprisingly, graphical methods form a big part of EDA. You will commonly see histograms, boxplots, and scatter plots. The scatterplot matrix (see my discussion of this) makes an appearance here as well.
The course uses R and in particular, Hadley's ggplot package throughout. I highly recommend the course for anyone who wants to become an expert in ggplot. ggplot does use quite a bit of proprietary syntax. This EDA course offers a lot of instruction in coding. You do have to work hard, but you will learn a lot. By working hard, I mean reading supplementary materials, and doing the exercises throughout the course. As good instruction goes, they expect students to discover things, and do not feed you bullet points.
This course is free, plus the quality of the instruction is head and shoulders above other MOOCs out there. The course is designed from the ground up for online instruction, and it shows. If you have tried other online courses, you will immediately notice the difference in quality. For example, the people in these videos talk directly to you, and not to a bunch of tuition-paying students in some remote classroom.
Sign up before they get started at Udacity. Disclaimer: No one paid me to write this post.
Some graphics are made to inform, some to amuse, some to delight. But the following scatter plot makes one wonder why why why...
What does the designer want to say?
I saw this chart inside an infographic titled "Where in the World are the Best Schools and the Happiest Kids?", via the Cool Infographics blog. The horizontal axis is happiness and the vertical axis is average test score.
So it appears that happy kids can get the best and the worst test scores, and kids with the best test scores can be both happy and sad.
That means the happiness of kids does not depend on their test scores.
The financial media, ranging from the Wall Street Journal to Zero Hedge, blogged about the geographical distribution of U.S. millionaires. The stories came with a map, and in the case of the latter, two data tables ranked by ascending and descending prevalence of millionaires. The map looks like this:
The talking point lifted from the press release of Phoenix Marketing, the source of the data, focuses improbably on North Dakota. For example, the WSJ blog began with:
The state making the fastest climb up the millionaire rankings doesn’t have a single Tiffany or Saks Fifth Avenue store. The closest BMW dealership is a six-hour drive from the capital.
Welcome to North Dakota, which jumped 14 spots in the annual rankings of millionaire households per capita released by Phoenix Marketing International.
The trouble is, you can't pick North Dakota out of the map; it just doesn't stand out. The map uses a different methodology of ordering the states, by groupings of the prevalence of millionaires, that is, the proportion of households in each state who are labeled "millionaires" by Phoenix Marketing.
The text, by contrast, draws attention to the change in the rank of states using the proportion of households who are millionaires as the ranking criterion. This data is two steps removed from the data used for the map (start with the map data, compute the year-to-year change, then convert to ranks).
State-level averages pose a challenge: state population varies a lot, and this leads to variability in the estimates for smaller states. You are likely to find smaller states over-represented at the top and bottom of state ranking charts. I talked about a similar situation relating to interpreting high school test data (see this post, and the Prologue of Numbersense).
Instead of using proportion of households who are millionaires, I prefer to use the number of millionaires per 1,000 households. Mathematically, these two are equivalent. If we plot that metric versus the size of states (number of households), we see the familiar pattern:
I labeled the North Dakota data point to show how unremarkable it is. While it may have risen in "rank", it is still ranked below the median in terms of number of millionaires per 1,000 households. Also notice that among states with a similar number of households, the millionaires metric ranges wildly from 40 to 70 per 1,000 households.
An interpretation of these state average millionaire metrics has to account for state population size.
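One way to see why small states swing more is to look at the standard error of a proportion. The sketch below is in Python and assumes, purely for illustration, that each state's figure behaves like an estimate based on a sample of n households with the same underlying rate. The sample sizes and the 5% rate are made up, and I am not describing Phoenix Marketing's actual methodology.

```python
import math

def per_thousand(proportion):
    """Convert a proportion of households into a count per 1,000 households."""
    return proportion * 1000

def standard_error_per_thousand(rate, n_households):
    """Standard error of an estimated proportion, expressed per 1,000 households."""
    return per_thousand(math.sqrt(rate * (1 - rate) / n_households))

rate = 0.05  # hypothetical: 50 millionaire households per 1,000
for n in (400, 40_000):  # hypothetical sample sizes for a small and a large state
    se = standard_error_per_thousand(rate, n)
    print(f"n = {n:>6}: {per_thousand(rate):.0f} per 1,000, give or take {se:.1f}")
```

Under these assumptions, a hundredfold increase in households shrinks the noise by a factor of ten, which is the funnel shape one expects when the metric is plotted against state size.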
The following map illustrates the ups and downs between 2007 and 2013 by state. (I found 2007 data but not the 2012 data.)
Think of an accounting equation. In this view, the positive changes must balance out the negative changes since I am only concerned about any shift in mix. What this map shows is that Texas, California, New York, and Washington have the top net gains in the number of millionaires while Florida and Michigan have the biggest net losses. North Dakota is again in the middle of the bunch.
This view ignores the total net change in millionaires as it focuses on the mix by state. You'd need to figure out what is the relevant question before you can come up with a good visualization of this (or any) data.
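A sketch of the accounting behind this view, in Python. The state names and counts below are invented for illustration; they are not the Phoenix Marketing figures. Each state's expected count assumes it merely kept its earlier share of the national total. The gap between actual and expected is the shift in mix, and these gaps sum to zero by construction.

```python
# Hypothetical millionaire-household counts (thousands) in two years.
before = {"State A": 400, "State B": 300, "State C": 300}
after = {"State A": 480, "State B": 310, "State C": 310}

def mix_shift(before, after):
    """Actual count minus the count each state would have had if it had
    kept its earlier share of the national total."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    return {
        state: after[state] - total_after * before[state] / total_before
        for state in before
    }

shifts = mix_shift(before, after)
for state, shift in shifts.items():
    print(f"{state}: {shift:+.0f}")
```

In this made-up example, State B adds households in absolute terms yet shows a negative shift, because it grew more slowly than the total.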
Joe D., a long time reader, points us to a few blogs that have been active creating redesigns of charts, similar to how we do it here.
First up, here are some examples from Storytelling With Data (link).
This example transformed a grouped bar chart into a line chart, something that I have long advocated. I'm still waiting for the day when market research companies start to switch from bars to lines.
Jorge Camoes, also a long-time reader, produced a redesign of a chart on military spending first printed in Time magazine. (link)
Dual-axis plots have been pilloried here often, especially when the two axes have different and incompatible units, as in here. As usual, transforming to a scatter plot is a good first step, which is what Jorge has done here. He then connected the dots to indicate the time evolution of the relationship. This is a smart move here just because the pattern is so stark.
The chart now illustrates an "inflexion point" in 2000. Prior to 2000, troop size was decreasing while the budget was stable. After 2000, budget increased sharply while troop size remained relatively stable.
Now peer back at the original chart. You can discern the sharp decrease in troop size over time, and the sharp increase in budget over time, but separately. The chart teases a cross-over point around 1995 which turned out to be misleading. This is a great illustration of why dual-axis plots are dangerous.
On Twitter, Joe D. disliked the following chart on the Information is Beautiful blog:
The chart carries a long list of flaws.
The column labeled "%" is probably the most jarring. The meaning of these numbers changes with the color. When pink, they give the proportion of females; when blue, the proportion of males. As the stated purpose of the chart is to explore the male-female balance at different websites, it is a bad decision to fold two dimensions into one. While you're thinking about what I just said, what do you think the percentages in gray mean? Your guess is as good as mine.
Now, I appreciate that the designer uses a margin of error (implicitly), and separated these three sites as representing "equality", even though only one of them has the exact 50/50 split.
Wait, for Orkut (second row), it's 51 percent female, and for Foursquare, it's 52 percent male. The gender is coded in the figurines. You can check that with your magnifying glass.
It gets better.
The list of websites is ordered by increasing polarity but only within the three sections. Logically, the three "equality" sites should sit between the "matriarchy" and the "patriarchy". Pinterest and Reddit, the two most polarized sites, should stand on the edges. On the diagram shown right, I simulated a reader who wants to scan through the list of websites from the most female-oriented (Pinterest) to the most male-oriented (Reddit). It's quite the obstacle course.
Let's get to Joe D.'s issue with the chart. How many people does each figurine represent? It's quite a mouthful. Each figurine represents one percent of the unique visitors at the specific website, but only the percentage points in excess of fifty percent are drawn. In effect, the Facebook figurine represents a huge number of people compared to the figurine of a less popular website like Tagged. The designer did not explain the inclusion criteria for websites.
If you didn't get that definition, just ignore the figurines and think of this chart as a bar chart in which the bars start at 50 percent (rather than zero as it should). A standard population pyramid appears to do a better job - just add bars to the left of the diagram and properly align the male and female sections.
As I said before, read the fine print.
Here's the fine print:
If I am not mistaken, the designer applied the gender proportions to the traffic totals to obtain the rightmost column, labeled "million more monthly female or male visitors". The trouble is one number pertains to U.S. visitors while the other pertains to worldwide traffic. By multiplying them, the designer makes an assumption: that gender ratio is equivalent inside and outside the U.S., for every website.
Just to give you a sense of scale, according to this chart, Facebook has an excess of 155 million female visitors per month. According to Comscore, the key provider of such data, Facebook has about 145 million total U.S. visitors in June, 2013. It's not a small deal to mix up the geographies.
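Here is my reading of that calculation as a Python sketch. The formula (share of one gender minus share of the other, times total traffic) is my guess at what the designer did, and the site figures are hypothetical. The point is only that the answer scales with whichever traffic total gets plugged in.

```python
def excess_visitors(majority_share, monthly_visitors_millions):
    """Millions of visitors by which the majority gender outnumbers the other."""
    minority_share = 1 - majority_share
    return (majority_share - minority_share) * monthly_visitors_millions

# Hypothetical site: a 55% female share measured on U.S. visitors.
share_female_us = 0.55
us_traffic = 100          # hypothetical U.S. monthly visitors, millions
worldwide_traffic = 700   # hypothetical worldwide monthly visitors, millions

print(excess_visitors(share_female_us, us_traffic))         # about 10
print(excess_visitors(share_female_us, worldwide_traffic))  # about 70
```

Applying a U.S. gender split to worldwide traffic inflates the "excess" sevenfold in this example, and it is only valid if the split is the same everywhere.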
This example illustrates what I call "use at your own peril". It's like the Surgeon General's warning in restaurants in the U.S.: we warn you that drinking alcohol while pregnant could lead to birth defects, but you are free to do whatever you want with this information.
As of this writing, the original chart has thousands of Facebook likes, hundreds of shares on Linkedin and Pinterest, etc.
It appears that a lot of people are enjoying the chart more than Joe and I do.
Finally, here is a sketch of how I would plot this type of data. (U.S. traffic data from Comscore, various months of 2012, where I can find them. Comscore is a fee-based service so it is not easy to find data for the smaller sites unless you have a subscription.)