I'd like to start 2015 on a happy note. I enjoyed reading the piece by Steven Rattner in the New York Times called "The Year in Charts". (link)
I particularly like the crisp headers, and unfussy language, placing the charts at the center. The components of the story flow nicely.
Here are my notes on some of the charts:
This chart is missing context, which is performance against population growth or potential. Changing the context also changes the implicit yardstick. The implied metric here is more-than-zero growth or continued growth.
It took me a while to find the titles to know what each section depicts. I'd prefer to put the titles back to the top or the top left corner. The "information in my head" is making me look at the "wrong" places. But otherwise, this is Tufte goodness.
This innocent thing prompts a host of questions. First, how could a "median" be found to have so many values within one population? It would appear that this is an exercise in isolating each quintile (decile in the case of the top 20%) and computing the median within each segment. In other words, the data represent these income percentiles: 95th, 85th, 75th, 50th, 30th and 10th. Given that the income data have already been grouped, computing group averages makes more sense than calculating group medians. This is especially so when comparing changes over time. The robust median suppresses changes.
The bucketing of income presents another challenge. All buckets except the very top one are essentially bounded. All the central buckets have minimum and maximum values. The bottom bucket is bounded below by zero. The top bucket, however, is basically unbounded, so important features of this data could be lost by summarizing the top bucket by its median.
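To make the point about the median concrete, here is a minimal sketch. The incomes are invented for illustration and are not Federal Reserve survey data. It assumes a top bucket of five households in which only the two highest earners see gains between two survey years.

```python
from statistics import mean, median

# Hypothetical incomes (in $ thousands) for a top bucket in two survey years.
# These numbers are invented for illustration; they are not survey data.
top_bucket_before = [200, 220, 250, 300, 400]
top_bucket_after = [200, 220, 250, 600, 2000]  # only the two highest earners gain


def summarize(incomes):
    """Return (median, mean) of a list of incomes."""
    return median(incomes), mean(incomes)


median_before, mean_before = summarize(top_bucket_before)
median_after, mean_after = summarize(top_bucket_after)

print("median:", median_before, "->", median_after)
print("mean:  ", mean_before, "->", mean_after)
```

In this made-up case the group median does not move at all while the group mean more than doubles, which is the kind of change an unbounded top bucket can hide.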
A third problem surfaces if one were to inquire how the survey collects its data. According to the Federal Reserve description, the data concern "usual income" as opposed to "actual income". Respondents are told to ignore "temporary" conditions in describing their "usual incomes". It is likely that people think income increases are permanent while getting laid off is temporary. So while usual income solves one problem (the long-term planner's problem), it creates a different problem (short-term bias). In particular, I don't think it is a good metric for assessing changes around a recession/recovery.
I also wonder about the imputation of missing data. I'd assume that possibly there is a preponderance of missing values for unemployed people. If the imputation cannot predict the employment status of those people, then it would surely have inflated incomes.
I wonder if any of my readers knows details about some of these potential problems. Would love to hear how the Fed's statisticians deal with these issues.
On this chart, the author has found an excellent story, and the graphic is effective. I prefer to see the horizontal axis labelled "More Unequal" as opposed to "Less Equal" because of the convention that "more" is usually placed to the right of "less" on the horizontal axis. Here is a scatter plot version of the data:
It shows the U.S. is a bit more extreme than all others.
This is another great chart. I like the imagery of the emptying middle. I find the labels a bit too long and requiring too much interpreting. I prefer this:
Found this chart in the magazine that Charles Schwab sends to customers:
When there are two variables, and their correlation is of interest, a scatter plot is usually recommended. But not here!
The text labels completely dominate this chart. The designer tried very hard to place them, but a careful look reveals that some boxes are placed above the dots while others are placed to their right, and the label for "Short Treasuries" takes refuge quite a distance away from its dot. This means the locations of the text boxes do not substitute for the dots.
Here is a different view of this data:
I am using a bumps-style chart, which allows the labels to be written horizontally outside the canvas. Instead of plotting all categories on the same chart, I use a small multiples setup to differentiate three types of risk-return relationships.
Alberto Cairo just gave a wonderful talk to my workshop, in which he complains about the state of dataviz teaching. So, it's quite opportune that reader Maja Z. sent in a couple of examples from a recent course on data visualization for academics. She was surprised to see these held out as examples of good work. I'll discuss one chart today, and the other one some other day.
The instructor for the course praised this chart for this principle: "always try to find a graphic that relates to your subject, like the bullets here representing military spending, and use it in the chart."
Students who take my class learn the opposite lesson: I like to say imagery often backfires. I do like charts with imagery that makes the data come alive but more often than not, the designer falls in love with the imagery and lets the data down.
This chart presumably shows the top 10 military spenders in the world by total amount spent in 2013. You'd think that the Chinese spent a bit more than half what the Americans did. But the data labels say $640 billion vs $188 billion, only about 30%. Next, the Russian spend is 46% of the Chinese according to the data, etc. So, is this really a data visualization or just some pictures with numbers printed next to them?
It's possible that the data is encoded in the surface areas or the volumes of these warheads but in reality, this is a glorified column chart, so most readers will respond to the heights of the columns.
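As a quick check on the arithmetic, here is a small sketch using only the two data labels quoted above. The reference height of 100 units for the American bullet is an arbitrary assumption of mine.

```python
# Spending figures quoted on the chart's data labels, in billions of dollars.
us_spend = 640
china_spend = 188

ratio = china_spend / us_spend
print(f"China / U.S. = {ratio:.0%}")

# If bullet height encoded the data and the U.S. bullet were 100 units tall
# (an arbitrary reference height), the Chinese bullet would be this tall.
expected_height = 100 * ratio
print(f"expected height of the Chinese bullet: {expected_height:.0f} units")
```

A bullet drawn at a bit more than half the American height overstates the Chinese figure by a wide margin relative to this ratio.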
Perhaps the shadows are there to demonstrate shadow spending.
The designer seems to appreciate that total spending is not necessarily a great metric. Spending as a proportion of GDP is provided as a secondary metric. I'm not so sure what to make of this though: should we expect richer nations to need/want to spend more building bombs and such? It just doesn't seem very logical to me.
Instead, a more meaningful metric might be military spending per capita. Controlling for population seems somewhat logical; the more people you have to protect, the more money you have to spend.
In the end, I made this scatter plot that tries to have it both ways:
(The percentages are of GDP.)
Here, we can see that Saudi Arabia and the U.S. are particularly aggressive spenders, spending over $2000 per person per year. The respective two dots are way above the average line (for the top 10 spenders). At the richer end of the scale, the American spending is way above the international average. On the other hand, Japan and Germany both spend significantly less than would be predicted by their GDP per capita levels.
Of note, readers more easily relate to the per-capita numbers than the aggregate figures in the original chart. They learn, for instance, that Saudi Arabia's average GDP was $27,000 per head, of which $2,500 went to arming itself up.
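The two per-head figures quoted for Saudi Arabia can be turned into a share of GDP with one division. This is just a sketch using the rounded numbers from the text, so the result is approximate.

```python
# Per-head figures for Saudi Arabia as quoted in the text, in dollars (rounded).
gdp_per_capita = 27_000
military_spend_per_capita = 2_500

share_of_gdp = military_spend_per_capita / gdp_per_capita
print(f"{share_of_gdp:.1%} of GDP per head goes to military spending")
```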
Rescheduling Notice: I have been informed by the organizers that the Meetup tonight has to be rescheduled due to an unexpected problem with the venue. When a new date is set, I will let you know.
Since I am not working on the slides for the Meetup, I have a little time to follow up on the post about the World Bank graphic.
One common response, also expressed on Twitter, is to "fix" it by using a scatter plot. Xan helpfully drew one up, which I added to the post.
I mentioned, cryptically, that if you try making improvements, you will find that the chart is a Type QD, not a Type D. There are clearly problems with the data but this chart cannot be "fixed" until one clarifies what the message of the chart really is.
The original chart plots (y=) GDP per capita against (x=) cumulative proportion of the world's population with countries ordered from lowest to highest GDP per capita. Embedded in the rectangular areas is total GDP.
Xan's chart plots (y=) total GDP in PPP terms against (x=) population. The per-capita PPP GDP is readable through diagonal gridlines.
Xan's chart is undoubtedly less confusing, and more direct. But it won't answer the cumulative question that the World Bank seems to be asking. That question is: how much of the world's wealth (measured in GDP) is held by the poorest X% of the population. This isn't something you can find on the scatter plot.
Now, the "cumulative" question is nice to think about but it is ill-posed for the kinds of data available. Each country ends up being represented by its average (per capita) wealth, but there is rampant wealth inequality within countries. Even though Nigeria is in the bottom 15%, it is certainly not true that the entire population of Nigeria belongs to the world's poorest 15%.
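Here is a minimal sketch of how the "cumulative" question could be computed from country-level data. The four countries and their figures are invented for illustration and are not World Bank data. Note that the calculation treats every resident of a country as having the country's average, which is exactly the weakness described above.

```python
# Hypothetical countries: (name, population in millions, GDP per capita in $).
# The figures are invented for illustration; they are not World Bank data.
countries = [
    ("A", 100, 1_000),
    ("B", 50, 5_000),
    ("C", 30, 20_000),
    ("D", 20, 50_000),
]


def cumulative_shares(rows):
    """Order countries by GDP per capita, then accumulate population and GDP shares."""
    ordered = sorted(rows, key=lambda r: r[2])
    total_pop = sum(pop for _, pop, _ in ordered)
    total_gdp = sum(pop * gdp_pc for _, pop, gdp_pc in ordered)
    out, cum_pop, cum_gdp = [], 0, 0
    for name, pop, gdp_pc in ordered:
        cum_pop += pop
        cum_gdp += pop * gdp_pc
        out.append((name, cum_pop / total_pop, cum_gdp / total_gdp))
    return out


shares = cumulative_shares(countries)
for name, pop_share, gdp_share in shares:
    print(f"up to {name}: poorest {pop_share:.0%} of people hold {gdp_share:.1%} of GDP")
```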
When a reader tweeted that a scatter plot is the solution, I asked: "Which two variables?" Here are just a few candidates:
total GDP
GDP per capita
total GDP PPP
PPP GDP per capita
cumulative total GDP, ordered by per-capita GDP
cumulative total GDP, ordered by total GDP
cumulative total GDP, ordered by total population
cumulative total GDP, ordered by population growth
cumulative total GDP PPP, ordered by per-capita GDP PPP
cumulative total GDP PPP, ordered by total GDP PPP
cumulative total GDP PPP, ordered by total population
cumulative total GDP PPP, ordered by population growth
cumulative total population
cumulative GDP per capita
cumulative GDP PPP per capita
population
working population
total GDP growth
total GDP PPP growth
total GDP per capita growth
total GDP PPP per capita growth
total population growth
total working population growth
median GDP
median GDP PPP
Different charts address different questions, some of which are more meaningful and some of which have better data. There may be a few interesting questions, in which case a set of scatter plots may work better.
Making data graphics interactive should improve the user experience. In practice, interactivity too often becomes overhead, making it harder for users to understand the data on the graph.
Reader Joe D. (via Twitter) admires the statistical sophistication behind this graphic about home runs in Major League Baseball. This graphic does present interesting analyses, as opposed to acting as a container for data.
For example, one can compare the angle and distance of the home runs hit by different players:
One can observe patterns: most of these highlighted players have more home runs on the left side than the right side. However, for this chart to be more telling, additional information should be provided. Knowing whether the hitter is left-handed, right-handed or a switch hitter would be key to understanding the angles. Information about the home ballpark, and indeed differentiating between home and away home runs, is also critical to making sense of this data. (One strange feature of baseball fields is that they all have different dimensions and shapes.)
But back to my point about interactivity. The original chart does not present the data in small multiples. Instead, the user must "interact" with the chart by clicking successively on each player (listed above the graphic).
Given that the graphic only shows one player at a time, the user must use his or her memory to make the comparison between one player and the next.
The chosen visual form discourages readers from making such comparisons, which defeats one of the primary goals of the chart.
The Facebook data science team has put together a great course on EDA at Udacity.
EDA stands for exploratory data analysis. It is the beginning of any data analysis when you have a pile of data (or datasets) and you need to get a feel for what you're looking at. It's when you develop some intuition about what sort of methodology would be appropriate to analyze the data.
Not surprisingly, graphical methods form a big part of EDA. You will commonly see histograms, boxplots, and scatter plots. The scatterplot matrix (see my discussion of this) makes an appearance here as well.
The course uses R and in particular, Hadley's ggplot package throughout. I highly recommend the course for anyone who wants to become an expert in ggplot. ggplot does use quite a bit of proprietary syntax. This EDA course offers a lot of instruction in coding. You do have to work hard, but you will learn a lot. By working hard, I mean reading supplementary materials, and doing the exercises throughout the course. As good instruction goes, they expect students to discover things, and do not feed you bullet points.
This course is free, plus the quality of the instruction is head and shoulders above other MOOCs out there. The course is designed from the ground up for online instruction, and it shows. If you have tried other online courses, you will immediately notice the difference in quality. For example, the people in these videos talk directly to you, and not to a bunch of tuition-paying students in some remote classroom.
Sign up before they get started at Udacity. Disclaimer: No one paid me to write this post.
Some graphics are made to inform, some to amuse, some to delight. But the following scatter plot makes one wonder why why why...
What does the designer want to say?
I saw this chart inside an infographic titled "Where in the World are the Best Schools and the Happiest Kids?", via the Cool Infographics blog. The horizontal axis is happiness and the vertical axis is average test score.
So it appears that happy kids can get the best and the worst test scores, and kids with the best test scores can be both happy and sad.
That means the happiness of kids does not depend on their test scores.
The financial media, ranging from Wall Street Journal to Zero Hedge, blogged about the geographical distribution of U.S. millionaires. The stories came with a map, and in the case of the latter, two data tables ranked by ascending and descending prevalence of millionaires. The map looks like this:
The talking point lifted from the press release of Phoenix Marketing, which is the source of the data, focuses improbably on North Dakota. For example, the WSJ blog began with:
The state making the fastest climb up the millionaire rankings doesn’t have a single Tiffany or Saks Fifth Avenue store. The closest BMW dealership is a six-hour drive from the capital.
Welcome to North Dakota, which jumped 14 spots in the annual rankings of millionaire households per capita released by Phoenix Marketing International.
The trouble is, you can't pick North Dakota out of the map; it just doesn't stand out. The map uses a different methodology of ordering the states, by groupings of the prevalence of millionaires, that is, the proportion of households in each state who are labeled "millionaires" by Phoenix Marketing.
The text, by contrast, draws attention to the change in the rank of states using the proportion of households who are millionaires as the ranking criterion. This data is two steps removed from the data used for the map (start with the map data, compute the year-to-year change, then convert to ranks).
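To see how far the rank-change metric sits from the map data, here is a sketch under one reading of the press release: rank the states within each year, then take the difference in rank. The four states and their prevalence figures are invented for illustration.

```python
# Hypothetical millionaire households per 1,000 households in two years.
# These states and numbers are invented; they are not Phoenix Marketing data.
prevalence = {
    "State A": (70, 72),
    "State B": (50, 49),
    "State C": (40, 55),
    "State D": (60, 58),
}


def ranks(values):
    """Map each state to its rank (1 = highest prevalence)."""
    ordered = sorted(values, key=values.get, reverse=True)
    return {state: i + 1 for i, state in enumerate(ordered)}


rank_before = ranks({s: v[0] for s, v in prevalence.items()})
rank_after = ranks({s: v[1] for s, v in prevalence.items()})

# Positive means the state climbed the rankings.
rank_change = {s: rank_before[s] - rank_after[s] for s in prevalence}
print(rank_change)
```

In this made-up example, State C is the biggest climber yet still ranks third out of four in the later year, so a map colored by prevalence would not make it stand out.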
State-level averages pose a challenge: state population varies a lot, and this leads to variability in the estimates for smaller states. You are likely to find smaller states over-represented at the top and bottom of state ranking charts. I talked about a similar situation relating to interpreting high school test data (see this post, and the Prologue of Numbersense link.)
Instead of using proportion of households who are millionaires, I prefer to use the number of millionaires per 1,000 households. Mathematically, these two are equivalent. If we plot that metric versus the size of states (number of households), we see the familiar pattern:
I labeled the North Dakota data point to show how unremarkable it is. While it may have risen in "rank", it is still ranked below the median in terms of number of millionaires per 1000 households. Also notice that among states with similar numbers of households, the millionaires metric ranges wildly from 40 to 70 per 1000 households.
An interpretation of these state average millionaire metrics has to account for state population size.
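The effect of state size on these estimates can be illustrated with a small simulation. This is a sketch under strong assumptions of mine: every state has the same true rate of 50 millionaires per 1,000 households, and each estimate comes from a number of households that stands in for the state's size. The sizes of 200 and 20,000 are arbitrary.

```python
import random
from statistics import pstdev

rng = random.Random(0)  # fixed seed so the run is repeatable

TRUE_RATE = 0.05  # assumed: 50 millionaire households per 1,000 in every state


def estimated_rate_per_1000(n_households):
    """Estimate millionaires per 1,000 households from n_households random draws."""
    hits = sum(rng.random() < TRUE_RATE for _ in range(n_households))
    return 1000 * hits / n_households


def spread(n_households, n_states=50):
    """Standard deviation of the estimate across n_states equally sized states."""
    estimates = [estimated_rate_per_1000(n_households) for _ in range(n_states)]
    return pstdev(estimates)


spread_small = spread(200)
spread_large = spread(20_000)
print(f"small states: {spread_small:.1f} per 1,000")
print(f"large states: {spread_large:.1f} per 1,000")
```

Even though all the simulated states share one true rate, the small ones scatter much more widely, which is why they tend to crowd the top and bottom of ranking tables.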
The following map illustrates the ups and downs between 2007 and 2013 by state. (I found 2007 data but not the 2012 data.)
Think of an accounting equation. In this view, the positive changes must balance out the negative changes since I am only concerned about any shift in mix. What this map shows is that Texas, California, New York, and Washington have the top net gains in the number of millionaires while Florida and Michigan have the biggest net losses. North Dakota is again in the middle of the bunch.
This view ignores the total net change in millionaires as it focuses on the mix by state. You'd need to figure out what is the relevant question before you can come up with a good visualization of this (or any) data.
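Here is one way the shift in mix could be computed: convert each year's counts to shares of the national total and take the difference. The three states and their counts are invented for illustration, and this may not match exactly how the map above was built.

```python
# Hypothetical millionaire household counts (in thousands) for 2007 and 2013.
# These states and numbers are invented for illustration.
counts_2007 = {"State A": 400, "State B": 300, "State C": 300}
counts_2013 = {"State A": 500, "State B": 330, "State C": 270}


def shares(counts):
    """Convert raw counts into each state's share of the total."""
    total = sum(counts.values())
    return {state: count / total for state, count in counts.items()}


share_2007 = shares(counts_2007)
share_2013 = shares(counts_2013)

# Change in each state's share; by construction these sum to zero.
mix_shift = {s: share_2013[s] - share_2007[s] for s in counts_2007}
for state, shift in mix_shift.items():
    print(f"{state}: {shift:+.1%}")
```

In this example State B adds 10% more millionaires but shows no shift, because the national total also grew by 10%. That is the sense in which the mix view ignores the total net change.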