## Different pictures of unemployment

##### Aug 23, 2010

Unemployment and job losses being such a worrying social problem in the U.S., one can find many attempts to visualize the predicament. In this post, I will look at two widely circulated charts, and some design decisions behind these charts.

First up, Slate uses an interactive map. (Click on the link for interactivity.)

Here, county-level data is plotted, with the size of the bubbles indicating the number of jobs, red for jobs lost and blue for jobs gained, all computed year on year for a given month.

As you play with this display, think about the first question of the Trifecta checkup: what is the practical issue being addressed by this chart? What is the message the designer wants to convey?

Most likely, the answer will be something like the progress of job losses between 2007 and 2009, or which parts of the country are most affected by job losses.

Is this display the best at illuminating these issues? The designer has chosen the map to illustrate geography, and interactivity to illustrate time. These are not controversial -- but they should be controversial.

Maps are over-used objects. The biggest circles almost always appear in California, along the Eastern seaboard and in the Great Lakes region. What we are seeing is the distribution of population across the U.S. What we are not seeing is how job losses affect different regions on the right scale. The bubbles in California are almost always larger than those in the Midwest because there are more people in California.
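
To make the scale problem concrete, here is a minimal sketch in Python. The two counties and all their numbers are made up for illustration (they are not BLS data); the point is only that ranking by raw job losses and ranking by losses relative to population can give opposite answers.

```python
# Hypothetical counties and figures, for illustration only (not real data).
counties = {
    "Big County": {"population": 10_000_000, "jobs_lost": 200_000},
    "Small County": {"population": 100_000, "jobs_lost": 8_000},
}


def loss_rate(county):
    """Jobs lost as a fraction of the county's population."""
    return county["jobs_lost"] / county["population"]


# Ranking by raw counts mostly reproduces the population ranking.
by_count = sorted(counties, key=lambda name: counties[name]["jobs_lost"], reverse=True)

# Ranking by rate shows which county is hit harder relative to its size.
by_rate = sorted(counties, key=lambda name: loss_rate(counties[name]), reverse=True)

print("Largest bubble:", by_count[0])
print("Hardest hit per resident:", by_rate[0])
```

A bubble sized by `jobs_lost` would make the big county dominate the map, even though the small county loses a larger share of its residents' jobs in this made-up example.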

***

On the time dimension, the designer has chosen to use monthly data but only for the three years 2007-2009. However, when this is multiplied hundreds of times by the county dimension, it is simply impossible for readers to grasp any trends from the interactive chart. We can learn the aggregate trajectory, such as when job losses start to pile up and when the recession deepens, but since you are living through this recession, you don't need this map to tell you that.

It is in fact all right for the designer to collapse the time dimension! Look at the following chart used by the Calculated Risk blog, which displays a similar data set (unemployment rate rather than jobs gained/lost).

Notice that this designer collapsed both the time and geography dimensions. Time is partially present inside the boxes, since the maximum and minimum unemployment levels being plotted correspond to particular years in the past. The max and min are picked from data stretching back to 1976, a much longer period than the Slate chart covers. Geography is at the state level, rather than the county level (even though county-level data is available). The states are sorted by the current level (July 2010) of unemployment.

The purpose of this designer is much easier to identify. For states like Nevada and California, the current situation is the worst on record, while the Dakotas have seen much worse before.
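
The comparison this chart invites can be reduced to one number per state: where the current rate sits between the historical minimum and maximum. The sketch below is my own illustration, not how the Calculated Risk chart was built, and the state names and rates are hypothetical.

```python
def position_in_range(current, low, high):
    """Return 0.0 when current equals the historical minimum, 1.0 at the maximum."""
    return (current - low) / (high - low)


# Hypothetical unemployment rates in percent (not real data).
states = {
    "State X": {"min": 4.0, "max": 14.0, "current": 14.0},
    "State Y": {"min": 2.0, "max": 8.0, "current": 3.5},
}

positions = {
    name: position_in_range(s["current"], s["min"], s["max"])
    for name, s in states.items()
}

for name, pos in positions.items():
    print(f"{name}: {pos:.0%} of the way from historical low to high")
```

A state at 100% is at its historical worst; a state near 0% has seen much worse before.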

If, for example, we want to know whether different regions in the U.S. show discernible patterns, all we need to do is color the boxes differently for different regions.

***

A problem with using the range (maximum and minimum) is that either extreme could be an outlier. Put differently, the blue boxes shown above, while containing all unemployment rates going back to 1976, may not tell us much about the typical unemployment rate. What we might want to know is what the unemployment rate is like for most years.

For this, we can convert the max-min boxes into Tukey's boxplots.

In a boxplot, the box (gray area) contains the middle half of the historical data. So if you look at DC (third from the bottom), unemployment in a typical year is narrowly confined to about 6 to 8 percent, although the max-min range is from under 5 to above 12.

For this chart, I sorted the states by median unemployment (black line inside the box) and the blue asterisks indicate the current level of unemployment (June 2010). Data comes from the BLS website.
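
For readers who want to see the numbers behind a box, here is a minimal Python sketch of the summary statistics. It is not the code used for the chart above; the histories are hypothetical, and the quartiles come from the standard library's `statistics.quantiles`, which can differ slightly from Tukey's hinges. The 1.5 × IQR rule for flagging outliers is the usual boxplot convention, assumed here.

```python
import statistics


def box_summary(rates):
    """Five-number summary plus outliers beyond 1.5 * IQR from the box."""
    q1, median, q3 = statistics.quantiles(rates, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [r for r in rates if r < low_fence or r > high_fence]
    return {
        "min": min(rates),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(rates),
        "outliers": outliers,
    }


# Hypothetical annual unemployment rates in percent (not real BLS data).
history = {
    "State A": [4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 12.0],
    "State B": [3.0, 3.2, 3.5, 3.8, 4.0, 4.2, 4.5, 4.8, 5.0],
}

summaries = {state: box_summary(rates) for state, rates in history.items()}

# Sort states by median, as in the chart above.
order = sorted(summaries, key=lambda state: summaries[state]["median"])

for state in order:
    s = summaries[state]
    print(f"{state}: box {s['q1']}-{s['q3']}, median {s['median']}, "
          f"range {s['min']}-{s['max']}, outliers {s['outliers']}")
```

In this made-up data, State A's range stretches to 12 percent because of a single year, while its box stays between 5 and 7 percent. That is the difference between the max-min boxes and the boxplot.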

Again, if regional differences need to be exposed, the boxes can be colored differently.

The outliers are plotted as dots on these boxplots; that too is data that may be considered extraneous to our purpose for this chart.

***

Is it a horrible thing for the designer to collapse dimensions like this? The data is available, so shouldn't all of it be used?

The truth is one can never cram all the data into a single chart. Even the Slate chart has collapsed some dimensions, namely the unemployment rates by demographics (age, gender, race, etc.) and by industry sector. Arguably those dimensions are as interesting as time and geography.

The bottom line: don't try to use every piece of data. You can't anyway. You will be making choices as to which dimensions to expose and which to hide, so choose wisely.

***

Thanks to Aleks for pointing to the Visualizing Economics blog, which collects graphs about the economy and is where I found these charts.


I like your "trifecta checkup" as a quick way to see if a chart meets some basic standards, but I'm not sure if it goes far enough. A designer of data graphics needs to consider how the different elements of the chart blend together to create a “big picture.”

As you point out, a chart designer must choose which elements to highlight, which elements to keep in the background, and which elements to exclude altogether. In some cases “background” information can scream into the foreground of the chart, obscuring the more important information the data (and your chart) are trying to convey. In your third graph I think you create a lot of noise that masks what the data are saying and hides the practical question.

First, you can increase the actual information available to the chart reader. Instead of an asterisk or a line or a mark, it looks like there is enough space for the whole number. If your focus is on the median and current unemployment rate, could you actually plot those numbers on the graph as if they were the markers?

Second, I think some background elements overshadow the more important information. The box-and-whisker plot can get very muddy quickly: lots of heavy black lines and dots distract from the real focus of the chart. But it’s pretty easy to fix this (and, as Tufte would say, “maximize the data-ink”) while still providing the background information. Start by getting rid of the box’s outline: a gray bar is sufficient visual. Similarly, the black dotted lines for Q1 and Q4 are heavy. Even though each only represents a small quartile, it overpowers the more important data-point: the current unemployment rate. A thin gray line accomplishes the same task without being so loud. Likewise, the little o’s for outlier observations almost overpower the whole thing; use a very light single dot for outlier observations instead of the heavy black “o.”

Finally, it seems you're saying the practical question is, “how much higher is the current unemployment rate above the median unemployment rate?” So why did you choose to sort on the median only? It might make more sense to sort on the difference between median and current unemployment rate?

I hope you don’t mind me making a few comments. I’ve only just come across your blog, but I think I’ll be adding it to my regulars. Thanks!

Hi Kaiser, Chris Wilson here--I designed the Slate map you discuss in this thoughtful post. Naturally, I disagree on a few points:

-While charts must first and foremost be clear, I believe they must also be visually arresting if one hopes to engage a viewer. While the two examples you include may fit your criteria better than my map, they are also boring as sin. This is particularly important when the chart is in the service of journalism and published in a general interest magazine.

-Charts must convey the scale of the data they represent. Since humans are not trained to think logarithmically, it is very difficult to grasp the magnitude of a problem. Was it 100,000 barrels spilled in the Gulf or 100 million or 100 billion? This map clearly depicts the extraordinary scope and breadth of the recession in a way that the other two do not.

-I agree that visualizing raw numbers of people geographically often mimics a population density map, and if time were no object I would include numbers proportional to the population as well. But in this case, the raw numbers are quite telling. First, the pure number of people without jobs in a location is significant and important. Second, you do not in fact see the red affect the country uniformly. Poor Detroit is losing jobs from the first slide while the rest of the country is still flush. As early as March of 2008 you see lost jobs along the western coast of Florida, which has a lot to do with the second-home industry.

-While more information is by no means always better, this map has over 100,000 data points compared to a far more meager fare in yours. This means any user can get information for his or her own county. (I'll be the first to admit I could use a search and zoom function.) This shows both local and national trends in a compelling way.

I love Junk Charts and agree with most of what you write on the blog, but I'll stack my chart up against your examples any day.

-CW

Hi Chris. I agree that a chart must be visually arresting if one hopes to engage a viewer. It should also strive to tell the best story that the data can.

I think your map does a better job of engagement. But I think the Calculated Risk blog chart does better on the story front...purely because it used unemployment rate. This means we get a meaningful answer to the question "compared to what?"

Perhaps a good option would be a button that lets you toggle between unemployment and unemployment rate.

I'm not a fan of maps...I wrote up a guest post on such maps at http://chandoo.org/wp/2009/07/24/medicare-chart-critique/ if you're interested. But I do prefer your map over Kaiser's boxplot given the audience isn't statisticians. Although your map does take up a lot of space. So I'd be inclined to print a smaller map, and include a 2nd chart that ranks the entire US values from smallest to largest, and highlights where the currently selected state falls on that map. I'd keep the headline numbers of people who have lost their jobs nationwide. And I'd color code your legend so that it matches the colors on the map.

Interesting discussion here. I would tend to side with the map faction, given the audience, but I also agree that the unemployment rate is more interesting than the absolute numbers. I think a heatmap of the unemployment rate or the percentage change in the employment rate would do the best job to accommodate both.

Chris: Thanks for the generous remarks. As always, I am a great admirer of all the work that is being done out there. I know how much time and effort goes into each of these creations. That's one of the reasons why the re-made charts never look very good because I can't spend that amount of time on each post!

Some of the posts here, like this one, are presented to provoke thinking about charts. Sometimes it's clear that a different presentation would be clearer or more engaging but sometimes I just want to present alternative ideas. I certainly do not intend to suggest that you should have published the third chart.

The issue of visual engagement versus message clarity has often been raised on this blog, and I even got a question about this at the Ed Lab seminar recently. A summary of my thoughts is here.

All of us would agree that engagement plus clarity is the holy grail, but too often it seems like they clash. In fact, the more data one stuffs into a chart, the less clear it is likely to become, but the more engaging.

I believe there is a way out of this knot, and the key is a more relaxed interpretation of Tufte's data ink ratio:

In a box plot, the numbers being plotted are medians, maxima, minima, etc. These statistics are computed from thousands of data points--the chart could not be created without having processed all that data. So even though only a small number of statistics show up in the plot, a huge amount of data lies just beneath the surface. With fewer things on the chart, it has a good chance to be clear, so now if one can make it also engaging, one will hit the bull's eye.

I have a question about it though. I am, after all, a chart n00b.

To me, you'd use this chart if you're trying to show regional trends/patterns in the data. That is, after all, the only reason you'd ever use a map.

Those patterns/trends could just as easily be illustrated with the data summed by state. So the county data becomes fog - unnecessary detail - that just muddles what you can take from the chart.

I just found this blog today. I loved it.
Now, one question: What software did you use to produce the box-plot chart? R? Any tip about how I can do it myself? Packages?

Thanks again for the post and blog. It is already in my Google Reader.

Manoel: I used R to generate that boxplot. Unfortunately many software packages don't do boxplots (e.g. Excel). Overlaying the blue asterisks is a separate step. But everything else is pretty standard. Well, except ordering the states requires a bit of work.

Wong: If you email me, I can try to explain more to you. Too bad I don't know how to type in Chinese, otherwise I can translate the parts that are causing problems.
