People are happier in some parts of the country as Labor Day nears

An anonymous reader sent in a Type V critique of the following map of July unemployment rates by state. The map was published by the Bureau of Labor Statistics (BLS), and used in a recent article in Vox.

Vox_bls_stateunemploy

Matt @ Vox took the BLS's bait, and singled out Mississippi as the worst in the nation. Our reader-contributor is none too pleased with this conclusion.

He noted that the red state stands out only because of the high "out of sample" top range of the legend. Three out of the seven colors are not found on the map at all! This is kind of like the white space problem when doing a line plot with large values and an axis starting at zero (for example, here), but the opposite. All the states are compressed into four colors, three of which are shades of orange.

The reader investigated, and reported back:

The top end of the legend seems to be set by Puerto Rico's 13.1%. Puerto Rico is omitted from the Vox map as well as from the BLS publication (link to PDF).

Mississippi only has the bare minimum, 8.0%, to qualify for the red color. Georgia is a 7.8; Michigan, Nevada, and Rhode Island are all 7.7.
 
24 (of the 50 States plus DC) are in the 6-8% band, and 21 are in the 4-6% band, with the remaining 5 under 4%.
 
None of the above is obvious when looking at the map.
 
In the Trifecta Checkup, this is a Type V chart. The data is accurate. The question being asked is clear but the visual construction is problematic.
 
***
[I'm seizing back the mike.] While the map is often not the best choice for showing geographic data, something we frequently cover on this blog, in this particular case, there is a strong regional pattern. Of course, with the compressed choice of colors, this regional pattern is not easily observed in the original.
 
The following small-multiples set of maps makes clear the regional pattern.
 
Redo_voxblsstateunemploy

Happy Labor Day!

 


Law of small numbers, in action

Loyal reader John M. expressed dismay over Twitter about 538's excessive use of bubble charts. Here's the picture that pushed John over the edge:

538-morris-datalab-trout

The associated article is here.

The question on the table is motivated by the extraordinary performance of a young baseball player, Mike Trout. The early success can be interpreted either as evidence of future potential or as evidence of a future drought. As an analogy, suppose someone wins a lottery. You can argue that the odds are so low that winning again is impossible. Or you can argue that winning once indicates that this person is "lucky" and lucky people might win again.

The chart shows the proportion of players who performed even better after the initial success, given the age at which they first broke out. One way to read this chart is to mentally replace the bubbles with dots (or columns), and then interpret the size of the bubbles as the statistical significance of the corresponding probability estimate. The legend says number of players, which is the sample size, which governs the error bar associated with that particular number.

This bubble chart is no different from others: it is impossible to judge the relative sizes of bubbles. Even though the legend provides us two reference points (a nice enough idea on its own), it is still impossible to know, for example, what proportion of players did better later in life when they first peaked at age 24. The bubble for age 23 looks like it's exactly five players but I still cannot figure out how many players the adjacent bubble represents.

The designer should have just replaced each bubble with an error bar, and the chart is instantly more readable. (I have another version of this at the end of the post.)

The rest of the design elements are clean and well-done, particularly use of notes to point out interesting aspects of the data.

***

From a Trifecta checkup perspective, I am uncertain about the nature of the data used to investigate the interesting question posed above.

Readers should note that the concepts of "early success" and "later success" are not universally defined. The author here selects two proxies. Reaching an early peak is equated to "batters first posting 15+ WAR over two seasons". Next, reversion to the mean is defined as not having a better two-year span subsequent to the aforementioned early peak.

Why two seasons? Why WAR and not a different metric? Why 15 as the cutoff? These are all design decisions made while working with the data.

One can make reasonable arguments to justify each of these choices. A bigger head-scratcher relates to the horizontal axis, which identifies the first time a player reaches his "early peak," as defined above. The way the above chart is set up, it is almost preordained to exhibit a negative slope. The older the player is when he reaches the first peak, the fewer years are left in his playing career to try to emulate or surpass that feat.
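The arithmetic behind that built-in negative slope can be sketched in a few lines. This is only an illustration under made-up assumptions: every remaining year gives the player one independent chance to beat his early peak, each chance succeeds with the same probability q, and everyone retires at the same age. Neither q nor the retirement age comes from the 538 data.

```python
def chance_of_topping_peak(peak_age, retire_age=36, q=0.05):
    """Probability of at least one better span after the first peak.

    Hypothetical assumptions, for illustration only:
    - one later chance per remaining year (a crude count)
    - each chance beats the peak independently with probability q
    - the career ends at retire_age
    """
    remaining_chances = max(retire_age - peak_age, 0)
    return 1 - (1 - q) ** remaining_chances


for age in range(21, 31):
    print(age, round(chance_of_topping_peak(age), 3))
```

Even with no aging effect at all in this toy setup, the probability falls as the age of the first peak rises, simply because fewer chances remain.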

This last point is nicely illustrated in the next chart of the article:

538-morris-datalab-trout2

 This chart is excellent on many levels. It's not clear, though, whether it says anything other than aging.

***

Near the end of the post, the author rightfully pointed out that "there’s not really enough data to demonstrate this effect". Going back to the first chart, it appears that no single bubble contains a double-digit count of players. So every sample size is between one and, say, seven. We should be wary of conclusions based on so little data.

It's always fun to find examples of the Law of Small Numbers, courtesy of Kahneman & Tversky.
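To put a rough number on "so little data", here is a minimal sketch that computes an approximate 95% Wilson score interval for a proportion. The Wilson interval is my choice for the illustration, not something used in the 538 article, and the sample sizes below are simply the range guessed at above.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Half of the players improving, at several sample sizes.
for n in (2, 4, 50):
    low, high = wilson_interval(n // 2, n)
    print(n, round(low, 2), round(high, 2))
```

With four players and two of them improving, the interval runs from roughly 15% to 85%, which is to say the bubble's vertical position tells us very little.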

***

Here is a sketch of how I might re-make the first chart (I made up data; see the note below).

Redo_538_miketrout

While making this chart, I realized another issue with the original bubble chart. When the proportion of players improving on their early peak is zero percent, the number of players who did not make it is quite hidden. In the revised chart, this data is clearly seen (look at age 22).

Note: I wonder if I totally missed the point of the original chart.... I actually had trouble eyeballing the data so I ended up making up numbers. The bubble at age 22 looks like it should stand for 5 players and yet it sits at precisely 50%, which would map to 2.5 players. If I assume the 22 bubble to be 4 players, then I don't know what the 26 bubble is. If it is 4 players also, then the minimum non-zero proportion should have been 1/4, but the bubble clearly lies below 25%. If it is 3 players, the minimum non-zero proportion is 1/3, which should be at 33%.
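The arithmetic in that note can be checked by listing every proportion a bubble of n players could possibly sit at. This is just a sketch, and the function name is my own.

```python
from fractions import Fraction


def achievable_proportions(n_players):
    """All proportions k/n that a group of n_players can produce."""
    return [Fraction(k, n_players) for k in range(n_players + 1)]


for n in (3, 4, 5):
    values = achievable_proportions(n)
    print(n, [f"{float(v):.0%}" for v in values])
```

A bubble of five players can never sit at exactly 50%, and the smallest non-zero value is 1/4 for four players and 1/3 for three.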

 


When a chart does nothing for the story

Pixardeclineexcel

There is some banter on Twitter about a chart that appeared in The Atlantic on "Pixar's Sad Decline--in One Chart". (@thewhyaxis, @jschwabish, @tealtan).

Link to article

***

It's a bit horrible but not the worst chart ever.

The most offensive aspect is the linear regression line. It's clearly an inappropriate model for this dataset.

I also don't like charts that include impossible values on the axis. In this case, the Rotten Tomatoes score never goes above 100%.

If the chart is turned on its side, the movie titles can be read horizontally.

Redo_pixar

***
I am compelled by the story but the chart doesn't help at all. Of course, it would be better if they could find data on the profitability of each movie. Readers should ask how correlated the Rotten Tomatoes score is with box office, and also what the relative costs of producing these different movies are. Jon has the score against profit chart (link).

 


Breaking every limb is very painful

This Financial Times chart is a big failure:

Ft_hb1_locations

Look at the axis. Usually a break in the axis is reserved for outliers. If there is one bar in a bar chart that extends way beyond the rest of the data, then you would sever that bar to let readers know that the scale is broken. Here, the designer broke every bar in the entire chart. It's as if the designer knows we'll complain about not starting the chart at zero -- so the bars all start at zero except they jump from zero to 70 right away.

***

Trifecta_checkup

The biggest issue with this chart is not its graphical element. It's the other two corners of the Trifecta checkup: what is the question being asked? And what data should be used to address that question?

The accompanying article complains about the dearth of H-1B visas for technical talent at businesses. But it never references the data being plotted.

It's hard for me to even understand what the chart is saying. I think it is saying that in Bloomington-Normal, IL, 94.8 percent of its H-1B visa requests are science related. There is no way to interpret this number without knowing the percentage for the entire country. It is most likely true that H-1B visas are primarily used to recruit technical talent from overseas, and the proportion of such requests that are STEM related is high everywhere. In this sense, it's not clear that the proportion of H-1B requests is a useful indicator of the dearth of technical talent.

Secondly, it is highly unlikely that the decimal point is meaningful. Given the highly variable total number of requests across different locations, the decimal point would represent widely varying numbers of requests.
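A quick way to see why the decimal is suspect is to ask how many requests a tenth of a percentage point represents at different totals. The totals below are hypothetical, picked only to span a small and a large location; the chart does not give the real ones.

```python
def requests_per_tenth_point(total_requests):
    """Number of requests represented by 0.1 percentage point of the total."""
    return total_requests / 1000


# Hypothetical totals, not taken from the FT chart.
for total in (200, 5_000, 100_000):
    print(total, requests_per_tenth_point(total))
```

When the total is a few hundred, a tenth of a point is a fraction of one request, so the last digit carries no information; at a hundred thousand, the same digit stands for a hundred requests.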

I'd prefer to look at absolute number of requests for this type of analysis, given that Silicon Valley has orders of magnitude more technical jobs than most of the other listed locations. Requests aren't even a good indicator of labor shortage. Typically H-1B visas run up against the quota sometime during the year, and companies will stop requesting new visas since there is no chance of getting approved. This is a form of survivorship bias. Wouldn't it be easier to collect data on the number of vacant technical jobs in each location?

 

 


Interpreting some charts about guns

Felix linked to a set of charts about guns in the U.S. (and elsewhere). The original charts, by Liz Fosslien, are found here.

I like the clean style used by Fosslien. Some of the charts are thought-provoking. Many of them may raise more questions than they answer. Here are a few that caught my eye.

Handguns_1

A simplistic interpretation would claim that banning handguns is futile, and may even have an adverse impact on murder rate. However, this chart does not reveal the direction of causality. Did some countries ban handguns because they were reacting to higher violence? If that is the case, this chart is confirming that the countries with handgun bans are a self-selected group.

***

Handguns_2

The U.S. is an outlier, both in terms of firearm ownership and firearm homicides. This makes the analysis much harder because the U.S. is really in a class of its own. It's not at all clear whether there is a positive correlation in the cluster below, and even if there is, it is dubious whether we can extend a straight line from that cluster up to the U.S. dot.

***

Handguns_3

Fosslien is being cheeky to deny us the identity of the other outlier, the country with few firearms but even higher death rate from intentional homicide. These scatter plots are great by the way to show bivariate distributions.

***

Handguns_4

I'd still prefer a line chart for this type of data but this particular paired bar chart works for me as well. The contents of this chart are a shock to me.

***

Handguns_5

I just don't get this one. Why is there a fan?


Mountain, molehill

Reader Jordan G. wasn't impressed by an attempt to visualize medal counts by country and sport in the Olympics over 112 years by Christian Gross at Visualizing.org (link).

Vis_gb_1988

The author chose to use the metaphor of "mountains" to portray the cumulative medals earned by each country. Each country is treated as a unit in a small-multiples-style presentation (see right). The bars represent different sports, and they are arranged as if arranging lanes in a swimming contest, with the largest haul in the middle, and second largest on the left, third largest on the right, etc.

This exercise highlights two important considerations from the designer's perspective.

The first is scaling. You'll notice that the first page (for Athens, 1896; excerpted below) is essentially unreadable. This is because the designer uses the same scale for every single page, and because he is plotting the cumulative number of medals over time. These two decisions mean that the initial pages have much lower values than the later pages.

Vis_snap_1896

It also means that on other pages, the extreme values walk off the edge of the chart area. (I think the reason is that if the scale had been tailored to present these extreme values, then pretty much every chart that doesn't contain extreme values would have become unreadable.)

Vis_us_1988

The choice of making countries units (discussed further below) makes for some awkwardness in later years as the medals became more spread out among more countries. In the first Olympics, only 10 countries won any medals but in 2008, 127 different countries won at least one medal, among which 50 countries or so had never won more than 20 medals in all sports combined. This skewed distribution causes the designer to break one of the cardinal rules of small multiples, which is that the design of each unit must be the same, with only the data varying. Here, the top countries have their data plotted on a different scale from that of the other countries, as we can see from the different sized squares. When you mouse over a particular bar for a particular country, that sport is now colored red and corresponding bars are highlighted for every other country. The problem is that the scales are not the same, so the lengths of the bars give us misleading comparisons.

***

The second consideration is pagination. The data set has three dimensions: country, sport and time. In this presentation, the designer places sport within country within time. Put differently, the time categories are placed furthest apart - in fact, the reader must load a different page to see the evolution from one Olympics to another. The sport categories are placed closest to each other, in the same chart unit, and so it requires the least effort for the reader to compare the number of medals won by the US in athletics (say) with the number won in gymnastics.

This goes back to the top corner of our Trifecta Checkup. What is the most important question the designer is trying to address? If it is the evolution over time, then the time dimension should not be placed furthest apart. If it is comparison across sports, then the sport dimension should be placed innermost.

For me the country dimension is the least important because everyone knows the US typically wins the most medals, and the top 5-10 countries are quite stable. Within a sport, I might wonder if certain countries are dominant in certain periods, and if certain countries started developing particular sports from a certain time period onwards. In this case, I'd place country and time within sport. 

The following gives an idea of an alternative way of visualizing this data:

Redo_medals

Apologies for not completing the dataset. Both charts are missing countries as well as years of history. But you can see where I'm going with this. There would be one chart per sport.

In gymnastics, we see that the US and China are latecomers, Russia has been the superpower until recently while Japan and Germany have stagnated.


Look what I found: two amazing charts

While doing some research for my statistics blog, I came across a beauty by Lane Kenworthy from almost a year ago (link) via this post by John Schmitt (link).

How embarrassing is the cost effectiveness of U.S. health care spending?

Americasinefficienthealthcaresystem-figure1-version2

When a chart is executed well, no further words are necessary.

I'd only add that the other countries depicted are "wealthy nations".

***

Even more impressive is this next chart, which plots the evolution of cost effectiveness over time. An important point to note is that the U.S. started out in 1970 similar to the other nations.

Americasinefficienthealthcaresystem-figure2-version5

Let's appreciate this beauty:

  • Let the data speak for itself. Time goes from bottom left to upper right. As more money is spent, life expectancy goes up. However, the slope of the line is much smaller for the US than the other countries. There is no need to add colors, data labels, interactivity, animation, etc.
  • Recognize what's important, what's not. The US line is in a different color, much thicker and properly made the foreground of the chart.
  • Rather than clutter up the chart, the other 19 lines are anonymized. They all have the same color and thickness, and all given one aggregate label. This is an example of overcoming loss aversion (see this post for more): it is ok to suppress some of the data.
  • The axis labeling is superb. Tufte preaches this clean style. There is no need to use regularly-spaced axis labels... use data-informed labels. Unfortunately, software is way behind on this issue. You can do this in R but that's about it.

 


Someone submits a good infographic

Reader Chris P. sent me to this Mint infographic showing the income distribution in the U.S. (link). I found the second section more interesting so this post will focus on that one chart. But I want to let Chris have his word also, so we have a double post. To see Chris's comment on the chart, see here.

Here is the chart from the second section:

Mint_stateincome

What do I like about this chart?

It tells a story without appealing directly to the data.  I see only 7x2 = 14 numbers on the chart, all embedded into the legend/scale. So many charts of this type send readers immediately into a twister by bombarding our eyes with data.

In the middle of the chart, for instance, states like MD and MA contrast with states like MI and MS. Poorer people are in the yellow segments while richer people are in the greener segments. So we can see that in MD and MA, the green part extends below the first horizontal gridline while in MI and MS, that gridline cuts into the orange. The implication is that there are more rich people in MD and MA than in MI and MS.

The horizontal gridlines are subtle but surprisingly functional, allowing readers to pick out the information. The gridlines divide each column into 4 equal parts so each part is a quarter (quartile) of the state population. In MD and MA, at least the top 25% of their populations are considered rich by national standards. Rich, as defined by the green segments in the legend, means household incomes greater than $75,000. In both those states, the top 25% earn at least $100,000.

Similarly, by looking at the color of the segment that crosses the lowest horizontal gridline, we know how much the bottom 25% earn in each state. The poorest segment seems to be smaller in AK than in other states.

The row of state boundaries at the bottom of the chart is very cute. And it encodes information, which is a wonderful touch. I believe (though haven't verified) the color of the state map tells us the mean household income within the state.

***

A few improvements would make this column chart better. One shouldn't place the national average above the chart horizontally using a different scale. Just place it as an additional column next to the other 50+ columns, with a slight offset and proper labeling. This allows direct lookup of how a state compares to the national average.

Also, try ordering by income inequality. The alphabetical order does the reader no favors. The ordering is particularly important because the main finding of the chart is that income distribution exhibits only moderate variability by state - most states look alike.

***

Given the low variability, the challenge is how to bring out the mild differences: which parts of the income distribution of which state show variance against the national average?

In the following attempt, we plot the "excess" proportion relative to the national average by state. 

For example, in the most "unequal" "state", District of Columbia (first chart), we find that it has a shortage (negative excess) of people earning below $75,000, and an excess of people earning above $75,000 when compared to the national income distribution. The proportion of "excess" increases with each higher income bracket (moving from left to right of the chart).
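The "excess" calculation itself is just a subtraction, bracket by bracket. Here is a minimal sketch; the bracket boundaries and the shares are invented for illustration and are not the numbers behind the Mint chart.

```python
def excess_by_bracket(state_shares, national_shares):
    """State share minus national share per income bracket, in percentage points."""
    if state_shares.keys() != national_shares.keys():
        raise ValueError("state and national brackets must match")
    return {
        bracket: round(state_shares[bracket] - national_shares[bracket], 1)
        for bracket in national_shares
    }


# Made-up shares (percent of households); each set sums to 100.
national = {"<25k": 25.0, "25-50k": 25.0, "50-75k": 18.0,
            "75-100k": 12.0, "100-200k": 16.0, "200k+": 4.0}
dc_like = {"<25k": 18.0, "25-50k": 20.0, "50-75k": 16.0,
           "75-100k": 13.0, "100-200k": 21.0, "200k+": 12.0}

excess = excess_by_bracket(dc_like, national)
print(excess)
```

Because both sets of shares sum to 100, the excesses always sum to zero: a shortage in the lower brackets has to show up as an excess somewhere higher.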

Redo_stateincome2

I have grouped and ordered the states by the orientation of the line plots. The first group of states, boxed in red, are all similar to DC, in the sense that they have a shortage of low earners and an excess of high earners.

Some states, like Texas, Pennsylvania and Georgia, have an income distribution that almost exactly mirrors the national average. Then, those states boxed in aquamarine have a small excess of poor people and a shortage of rich people compared to the national average. Not unexpectedly, Puerto Rico is on its own.

***

One has to be careful with this type of data because the income distributions are highly skewed. How are the income brackets determined?

Lumping everyone in the top 4% or so (earning $200,000 or more) into one bracket obscures the tremendous income inequality even within that bracket. In fact, for my chart above, I have to decide where to put the last data point, i.e. the people earning $200,000 or more, because $200,000 or more is not a point on the horizontal axis but an open-ended range. I just used $300,000 but the better thing to do is to find out the average income within that top bracket and place the point there.

 


The meaning of pretty pictures and the case of 15 scales

When we call something a "pretty picture", what do we mean? 

Based on the evidence out there, it would seem like "pretty" means one or more of the following:

  • unusual: not your Grandma's bar chart or line chart
  • visually appealing: say, have irregular shapes, lots of colors, curved lines and so on
  • complex: if you don't get the point right away, the chart must be smart, and must contain a lot of information
  • data-rich: a variant of complex

***

I pondered that question while staring at this chart, reprinted in the NYT Magazine, in which they pitched a new book by Craig Robinson called "Flip Flop Fly Ball". According to the editors, the book is a "beautiful, number-crunched (sic) combination of statistical and graphic-design geekery". So here's Exhibit A:

Nytm_flipflop

This chart is supposed to tell us whether big payroll equals success in Major League Baseball, and success is measured variously by making the playoffs, making the championship series or winning the championship. It nicely uses a relatively long time horizon of 15 years.


The problem: how are we supposed to learn the answer to the question?

To learn it, we have to go through these steps:

Read the fine print under the title that tells us the vertical scale is the rank by payroll, so within each season, the top spender is at the top, and the bottom spender at the bottom. (Strictly speaking, there are 15 different scales, see discussion below.)

Figure out that the black row has all of the championship teams aligned at the same vertical level.

Realize that the more teams that are listed below the black line, the bigger the payroll of the championship team in that season.

Alternatively, the more teams that are found above the black line, the smaller the payroll is of the winning team that year.

From that, we see that for almost every season in the last 15 years, the winner comes from a relatively free-spending team. Florida in 2003 is a big outlier.

***

Maybe that isn't too bad. Now, try to interpret the blue boxes, which label all the playoff teams in every season. Is it that playoff teams also are bigger spenders than non-playoff teams?

To learn this, try the following step:

Ignore the relative height of the columns from season to season, and focus only on the relative positions of the blue slots within each column.

Are these blue slots more likely to be crowded towards the top of the column than the bottom?

The answer should be obvious but why does it feel so hard?

***

You may be confused by the vertical scale. Is it the case that in 2003, the entire league decided to splurge on spending? Does the protruding tower in 2003 indicate especially high payrolls?

No, it doesn't. It turns out there are really 15 separate vertical scales on this one chart; each column has to be viewed separately. There is a ranking within each column but the relative height  from one column to the next means nothing. Each column is hinged to the black row which is the rank by payroll of the championship team in that season.

The decision to anchor the columns in this way is what dooms this chart. In the junkart version below, I reversed this decision and ended up with a much clearer picture:

Redo_flipflop

It's now clear that almost all the playoff teams come from the top quartile or top third of the table in terms of payroll. In more recent years, the correlation between spending and success seems less assured - perhaps it's partly a result of the analytics revolution, as nicely portrayed in Moneyball. It is still true that any team in the bottom third of the payroll scale has little chance of making the playoffs; however, once a smaller-payroll team makes the playoffs, it appears to do well: in three of the last four seasons, a small-payroll team has made the finals.

Note that I grayed out the four cells at the bottom left. There were only 28 teams before 1997. I also removed the names of the teams that didn't make the playoffs, which serves no purpose in a chart like this.

***

That's the descriptive statistics. It's really hard to draw robust conclusions from such data. You can say it's harder for small-payroll teams to have consistently great performance in the regular season but easier in a short playoff series - so in a sense, we are looking at luck, not skill.

But could it be that those small-payroll teams, given that they made the playoffs, must have had some unusual success in that season, perhaps because they discovered some young talent that cost next to nothing, and so the fact that they made the playoffs despite the smaller payroll is a good predictor that they would do well in the playoffs?

The other important issue to realize is that by plotting the rank of payroll, rather than true payroll, the scale of payroll differences has been taken out of the picture. The team listed at the median rank most likely spent much less than half of what the team listed at the top of the table spent. If you grab the actual payroll amounts, there is much more you can do to display this data.

 


A good question deserves good data


The last chart in the infographic on OECD education data asks another intriguing question: do countries that pay teachers more achieve better test scores?

Soshable_payperf

This chart suffers from the same ill as the one previously discussed (here): the data is not suitable to address the question. It is mighty hard to see any pattern in the set of bar charts on offer. This lack of correlation can be confirmed by displaying the data in a scatter plot:

Redo_payperf
The scatter on the left presents the data as shown in the original, with a regression line drawn in that appears to indicate a positive correlation of higher spending and higher achievement.

Here, spending is measured by the ratio of primary teacher pay after 15 years of service to average GDP while achievement is indicated by the proportion of students who attain a "top" level of proficiency in any or all of the three test subjects.

But notice the solitary point sitting on the top right corner (labelled "1"). That point is Korea, which has both the highest achievement and the highest spending (by far). Korea is an outlier (known as a leverage point). The chart on the right is the same as the one on the left with Korea removed. What appears to be a moderate positive correlation vanishes. (The numbers plotted are the ranking of countries by the proportion of students attaining top proficiency, the metric on the vertical axis.)
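The effect of a single leverage point on a correlation is easy to reproduce. The sketch below uses invented numbers, deliberately exaggerated and not the OECD figures, with the first point playing the role of Korea.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: pay ratio and percent of top students; first point is the outlier.
pay = [2.2, 1.0, 1.1, 1.2, 1.3, 1.4]
top = [20.0, 10.0, 12.0, 9.0, 12.0, 10.0]

r_with_outlier = pearson_r(pay, top)
r_without_outlier = pearson_r(pay[1:], top[1:])
print(round(r_with_outlier, 2), round(r_without_outlier, 2))
```

In this toy data the correlation is strongly positive with the outlier and essentially zero without it, which is the same pattern as the pair of scatter plots, only more extreme.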

So, either the message is that achievement and spending are uncorrelated (for every country except Korea), or that we have a measurement problem. I think the latter is more likely, and would defer to psychometricians to say what are acceptable measures for spending and for achievement. Do primary teachers with 15 years or more of service represent "education spending"? Do top students adequately capture general achievement in the education system?

***

Soshable_payperf_closeup

The original chart contains a serious misinterpretation of the data (source: Education at a Glance 2009, OECD). It falsely assumes that the proportion of students attaining top proficiency in each subject is additive. In fact, because the same student could be top in one or more subjects, the base of such a sum would not be 100%.

In my version, the metric used is the proportion of students who attain top proficiency in 1, 2 or all 3 subjects. This metric is computed off a 100% base.
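A tiny example with sets shows why adding the per-subject proportions overstates things. The ten students and their results are hypothetical.

```python
# Hypothetical class of 10 students; each set holds the top performers in one subject.
n_students = 10
top_reading = {"s1", "s2"}
top_math = {"s1", "s3"}
top_science = {"s1", "s2", "s4"}

# Adding the three proportions counts student s1 three times and s2 twice.
sum_of_shares = (len(top_reading) + len(top_math) + len(top_science)) / n_students

# Proportion of students who are top in one, two or all three subjects.
share_top_in_any = len(top_reading | top_math | top_science) / n_students

print(sum_of_shares, share_top_in_any)
```

The sum comes to 70% while only 40% of the students are top in at least one subject; only the second number sits on a 100% base.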

I also removed the breakdown by gender. This creates clutter, and I can't find any interest in the male or female data.

 ***

See also our first post on this infographic.