Five-value summaries of distributions

BG commented on my previous post, describing her frustration with the “stacked range chart”:

A stacked graph visualizes cubes stacked one on top of the other. So you can't use it for negative numbers, because there's no such thing [as] "negative data". In graphs, a "minus" sign visualizes the opposite direction of one series from another. Doing average plus average plus average plus average doesn't seem logical at all.

***

I have already planned a second post to discuss the problems of using a stacked column chart to show markers of a numeric distribution.

I tried to replicate how the Youtuber generated his “stacked range chart” by appropriating Excel’s stacked column chart, but failed. I think there are some missing steps not mentioned in the video. At around 3:33 of the video, he shows a “hack” involving adding 100 degrees (any large enough value) to all values (already converted to ranges). Then, the next screen displays the resulting chart. Here is the dataset on the left and the chart on the right.

Minutephysics_londontemperature_datachart

Afterwards, he replaces the axis labels with new labels, effectively shifting the axis. But something is missing from the narrative. Since he’s using a stacked column chart, the values in the table are encoded in the heights of the respective blocks. The total stacked heights of each column should be in the hundreds since he has added 100 to each cell. But that’s not what the chart shows.

***

In the rest of the post, I’ll skip over how to make such a chart in Excel, and talk about the consequences of inserting “range” values into the heights of the blocks of a stacked column chart.

Let’s focus on London, Ontario; the five temperature values, corresponding to various average temperatures, are -3, 5, 9, 14, 24. Just throwing those numbers into a stacked column chart in Excel results in the following useless chart:

Stackedcolumnchart_londonontario

The temperature averages are cumulatively summed, which makes no sense, as noted by reader BG. [My daily temperature data differ somewhat from those in the Youtube. My source is here.]

We should ignore the interiors of the blocks, and instead interpret the edges of these blocks. There are five edges corresponding to the five data values. As in:

Junkcharts_redo_londonontariotemperatures_dotplot

The average temperature in London, Ontario (during Spring 2023-Winter 2024) is 9 C. This overall average hides seasonal as well as diurnal variations in temperature.

If we want to acknowledge that night-time temperatures are lower than day-time temperatures, we draw attention to the two values bracketing 9 C, i.e. 5 C and 14 C. The average daytime (max) temperature is 14 C while the average night-time (min) temperature is 5 C. Furthermore, Ontario experiences seasons, so that the average daytime temperature of 14 C is subject to seasonal variability; in the summer, it goes up to 24 C. In the winter, the average night-time temperature goes down to -3 C, compared to 5 C across all seasons. [For those paying closer attention, daytime/max and night-time/min form congruous pairs because the max temperature occurs during daytime while the min temperature occurs during night-time. Thus, the average of maximum temperatures is the same as the average of daytime maximum temperatures.]

The above dotplot illustrates this dataset adequately. The Youtuber explained why he didn’t like it – I couldn’t quite make sense of what he said. It’s possible he thinks the gaps between those averages are more meaningful than the averages themselves, and therefore he prefers a chart form that draws our attention to the ranges, rather than the values.

***

Our basic model of temperature can be thought of as: temperature on a given day = overall average + adjustment for seasonality + adjustment for diurnality.

Take the top three values 9, 14, 24 from above list. Starting at the overall average of 9 C, the analyst gets to 14 if he hones in on max daily temperatures, and to 24 if he further restricts the analysis to summer months (which have the higher temperatures). The second gap is 10 C, twice as large as the first gap of 5 C. Thus, the seasonal fluctuations have larger magnitude than daily fluctuations. Said differently, the effect of seasons on temperature is bigger than that of hour of day.

In interpreting the “ranges” or gaps between averages, narrow ranges suggest low variability while wider ranges suggest higher variability.

Here's a set of boxplots for the same data:

Junkcharts_redo_londonontariotemperatures

The boxplot "edges" also demarcate five values; they are not the same five values as defined by the Youtuber but both sets of five values describe the underlying distribution of temperatures.

 

P.S. For a different example of something similar, see this old post.


What is this "stacked range chart"?

Long-time reader Aleksander B. sent me to this video (link), in which a Youtuber ranted that most spreadsheet programs do not make his favorite chart. This one:

Two questions immediately come to mind: a) what kind of chart is this? and b) is it useful?

Evidently, the point of the above chart is to tell readers there are (at least) three places called “London”, only one of which features red double-decker buses. He calls this a “stacked range chart”. This example has three stacked columns, one for each place called London.

What can we learn from this chart? The range of temperatures is narrowest in London, England while it is broadest in London, Ontario (Canada). The highest temperature is in London, Kentucky (USA) while the lowest is in London, Ontario.

But what kind of “range” are we talking about? Do the top and bottom of each stacked column indicate the maximum and minimum temperatures as we’ve interpreted them to be? In theory, yes, but in this example, not really.

Let’s take one step back, and think about the data. Elsewhere in the video, another version of this chart contains a legend giving us hints about the data. (It's the chart on the right of the screenshot.)

Each column contains four values: the average maximum and minimum temperatures in each place, the average maximum temperature in summer, and the average minimum temperature in winter. These metrics are mouthfuls of words, because the analyst has to describe what choices were made while aggregating the raw data.

The raw data comprise daily measurements of temperatures at each location. (To make things even more complex, there are likely multiple measurement stations in each town, and thus, the daily temperatures themselves may already be averages; or else, the analyst has picked a representative station for each town.) From this single sequence of daily data, we extract two subsequences: the maximum daily, and the minimum daily. This transformation acknowledges that temperatures fluctuate, sometimes massively, over the course of each day.

Each such subsequence is aggregated to four representative numbers. The first pair of max, min is just the averages of the respective subsequences. The remaining two numbers require even more explanation. The “summer average maximum temperature” should be the average of the max subsequence after filtering it down to the “summer” months. Thus, it’s a trimmed average of the max subsequence, or the average of the summer subsequence of the max subsequence. Since summer temperatures are the highest of the four seasons, this number suggests the maximum of the max subsequence, but it’s not the maximum daily maximum since it’s still an average. Similarly, the “winter average minimum temperature” is another trimmed average, computed over the winter months, which is related to but not exactly the minimum daily minimum.

Thus, the full range of each column is the difference between the trimmed summer average and the trimmed winter average. I assume weather scientists use this metric instead of the full range of max to min temperature because it’s less affected by outlier values.

***

Stepping out of the complexity, I’ll say this: what the “stacked range chart” depicts are selected values along the distribution of a single numeric data series. In this sense, this chart is a type of “boxplot”.

Here is a random one I grabbed from a search engine.

Analytica_tukeyboxplotA boxplot, per its inventor Tukey, shows a five-number summary of a distribution: the median, the 25th and 75th percentile, and two “whisker values”. Effectively, the boxplot shows five percentile values. The two whisker values are also percentiles, but not fixed percentiles like 25th, 50th, and 75th. The placement of the whiskers is determined automatically by a formula that determines the threshold for outliers, which in turn depends on the shape of the data distribution. Anything contained within the whiskers is regarded as a “normal” value of the distribution, not an outlier. Any value larger than the upper whisker value, or lower than the lower whisker value, is an outlier. (Outliers are shown individually as dots above or below the whiskers - I see this as an optional feature because it doesn't make sense to show them individually for large datasets with lots of outliers.)

The stacked range chart of temperatures picks off different waypoints along the distribution but in spirit, it is a boxplot.

***

This discussion leads me to the answer to our second question: is the "stacked range chart" useful?  The boxplot is indeed useful. It does a good job describing the basic shape of any distribution.

I make variations of the boxplot all the time, with different percentiles. One variation commonly seen out there replaces the whisker values with the maximum and minimum values. Thus all the data live within the whiskers. This wasn’t what Tukey originally intended but the max-min version can be appropriate in some situations.

Most statistical software makes the boxplot. Excel is the one big exception. It has always been a mystery to me why the Excel developers are so hostile to the boxplot.

 

P.S. Here is the official manual for making a box plot in Excel. I wonder if they are the leading promoter of the max-min boxplot that strays from Tukey's original. It is possible to make the original whiskers but I suppose they don't want to explain it, and it's much easier to have people compute the maximum and minimum values in the dataset.

The max-min boxplot is misleading if the dataset contains true outliers. If the maximum value is really far from the 75th percentile, then most of the data between the 75th and 100th percentile could be sitting just above the top of the box.

 

P.S. [1/9/2025] See the comments below. Steve made me realize that the color legend of the London chart actually has five labels, the last one is white which blends into the white background. Note that, in the next post in this series, I found that I could not replicate the guy's process to produce the stacked column chart in Excel so I went in a different direction.


Graphical advice for conference presenters - demo

Yesterday, I pulled this graphic from a journal paper, and said one should not copy and paste this into an oral presentation.

Example_presentation_graphic

So I went ahead and did some cosmetic surgery on this chart.

Redo_example_conference_graphic

I don't know anything about the underlying science. I'm just interpreting what I see on the chart. It seems like the key message is that the Flowering condition is different from the other three. There are no statistical differences between the three boxplots in the first three panels but there is a big difference between the red-green and the purple in the last panel. Further, this difference can be traced to the red-green boxplots exhibiting negative correlation under the Flowering condition - while the purple boxplot is the same under all four conditions.

I would also have chosen different colors, e.g. make red-green two shades of gray to indicate that these two things can be treated as the same under this chart. Doing this would obviate the need to introduce the orange color.

Further, I think it might be interesting to see the plots split differently: try having the red-green boxplots side by side in one panel, and the purple boxplots in another panel.

If the presentation software has animation, the presenter can show the different text blocks and related materials one at a time. That also aids comprehension.

***

Note that the plot is designed for an oral presentation in which you have a minute or two to get the message across. It's debatable as to whether journal editors should accept this style for publications. I actually think such a style would improve reading comprehension but I surmise some of you will disagree.


Various ways of showing distributions

The other day, a chart about the age distribution of Olympic athletes caught my attention. I found the chart on Google but didn't bookmark it and now I couldn't retrieve it. From my mind's eye, the chart looks like this:

Age_olympics_stackedbars

This chart has the form of a stacked bar chart but it really isn't. The data embedded in each bar segment aren't proportions; rather, they are counts of athletes along a standardized age scale. For example, the very long bar segment on the right side of the bar for alpine skiing does not indicate a large proportion of athletes in that 30-50 age group; it's the opposite: that part of the distribution is sparse, with an outlier at age 50.

The easiest way to understand this chart is to transform it to histograms.

Redo_age_olympics_histo2

In a histogram, the counts for different age groups are encoded in the heights of the columns. Instead, encode the counts in a color scale so that taller columns map to darker shades of blue. Then, collapse the columns to the same heights. Each stacked bar chart is really a collapsed histogram.

***

The stacked bar chart reminds me of boxplots that are loved by statisticians.

Redo_age_olympics_boxplot2b

In a boxplot, the box contains the middle 50% of the athletes in each sport (this directly maps to the dark blue bar segments from the chart above). Outlier values are plotted individually, which gives a bit more information about the sparsity of certain bar segments, such as the right side of alpine skiing.

The stacked bar chart can be considered a nicer-looking version of the boxplot.

 

 


Revisiting the home run data

Note to New York metro readers: I'm an invited speaker at NYU's "Art and Science of Brand Storytelling" summer course which starts tomorrow. I will be speaking on Thursday, 12-1 pm. You can still register here.

***

The home run data set, compiled by ESPN and visualized by Mode Analytics, is pretty rich. I took a quick look at one aspect of the data. The question I ask is what differences exist among the 10 hitters that are highlighted in the previous visualization. (I am not quite sure how those 10 were picked because they are not the Top 10 home run hitters in the dataset for the current season.)

The following chart focuses on two metrics: the total number of home runs by this point in the season; and the "true" distances of those home runs. I split the data by whether the home run was hit on a home field or an away stadium, on the hunch that we'd need to correct for such differences.

Jc_top10hitters_homeaway_splits

The hitters are sorted by total number of home runs. Because I am using a single season, my chart doesn't suffer from a cohort bias. If you go back to the original visualization, it is clear that some of these hitters are veterans with many seasons of baseball in them while others are newbies. This cohort bias explains the difference in dot densities of those plots.

Having not been following baseball recently, I don't know many of these names on the list. I have to look up Todd Frazier - does he play in a hitter-friendly ballpark? His home to away ratio is massive. Frazier plays for Cincinnati, at the Great American Ballpark. That ballpark has the third highest number of home runs hit of all ballparks this season although up till now, opponents have hit more home runs there than home players. For reference, Troy Tulowitzki's home field is Colorado's Coors Field, which is hitter's paradise. Giancarlo Stanton, who also hits quite a few more home runs at home, plays for Miami at Marlins Park, which is below the median in terms of home run production; thus his achievement is probably the most impressive amongst those three.

Josh Donaldson is the odd man out, as he has hit more away home runs than home runs at home. His O.co Coliseum is middle-of-the-road in terms of home runs.

In terms of how far the home runs travel (bottom part of the chart), there are some interesting tidbits. Brian Dozier's home runs are generally the shortest, regardless of home or away. Yasiel Puig and Giancarlo Stanton generate deep home runs. Adam Jones Josh Donaldson, and Yoenis Cespedes have hit the ball quite a bit deeper away from home.  Giancarlo Stanton is one of the few who has hit the home-run ball deeper at his home stadium.

The baseball season is still young, and the sample sizes at the individual hitter's level are small (~15-30 total), thus the observed differences at the home/away level are mostly statistically insignificant.

The prior post on the original graphic can be found here.

 


An overused chart, why it fails, and how to fix it

Reader and tipster Chris P. found this "death spiral" chart dizzying (link).

Piomas_image

It's one of those charts that has conceptual appeal but does not do the data justice. As the name implies, the designer has a strong message, that the arctic sea ice volume has dramatically declined over time. This message is there in the chart but the reader has to work hard to find it.

Why doesn't this spider chart work? We can be more precise.

  • A big problem is the lack of scalability. This chart looks different every year. If you add an extra year to the chart, you either have to increase the density of the years or you have to drop the earliest year.
  • Years are not circular or periodic so the metaphor doesn't quite work.
  • This chart type requires way too many gridlines.
  • Axis labeling is also awkward. Because of the polar coordinates, the axes are radiating so the numbers run up toward the top but run down toward the bottom.
  • This specific instance of spider chart benefits from the well-behaved data: the between-year variability is much lower than the within-year variability. As a result, the lines don't cross each other much. If the variability from year to year fluctuates a lot, we would have seen a bunch of noodles.

This is a pity because the designer did very well in aligning two corners of the Trifecta Checkup, namely what is the question and what does the data show? It is a great idea to control for month of year, and look at year to year changes. (A more typical view would be to look at month to month changes and plot one line per year.)

This is an example of a chart that does well on one side of the checkup but the failure is that the graph isn't in tune with the data or the question being addressed.

Whenever I see a spider chart, I want to unroll the spiral and see if a line chart is better. Thus:

Redo_piomas1

The dramatic decrease in Arctic ice volume (no matter the month) is clear as day. You can actually read off the magnitude of the drop. (Try doing that in the spider chart, say between 1978 and 1995.)

This chart still has issues, namely too many colors. One can color the lines by season of the year, like this:

Redo_piomas_season1

Or switch to a small-multiples set up with three lines per chart and one chart per season.

The seasonal arrangement is not arbitrary. You can see the effect of season by looking at side by side boxplots:

Redo_piomas2

 The pattern is UP-DOWN-DOWN-UP.

In fact, a side-by-side boxplot of the data provides a very informative look:

Redo_piomas3

The monthly series is obscured in this view, built into the vertical variability, which we can see is quite stable. The idea of controlling for month is to make it irrelevant. This view emphasizes the year on year decline of the entire distribution.

If you're worried that dropping too much information, the data can be grouped by season as before in a small-multiples setup like this:

Redo_piomas4

Regardless of season, the trend is down.

 

PS. Alberto reminds me of his post about one example of a spider chart (radar chart) that works. Here's the link. It works because the graphical element is more in tune with the data. While the ice cap data has a linear trend over time, the voting data is all about differences in distribution. Also, the designer is expecting readers to care about the high-level pattern, not about the specifics.


Vanity heights and scary charts

Sometimes I wonder if I should just become a chart doctor. Andrew recently wrote that journals should have graphical editors. Businesses also need those, judging from this submission through Twitter (@francesdonald). Link is here.

You don't know whether to laugh or cry at this pie chart:

Quartz_tallbuildings2

The author of the article complains that all the tall buildings around the world are cheats: vanity height is defined as the height above which the floors are unoccupied. The sample proportions aren't that different between countries, ranging from 13% to 19% (of the total heights). Why are they added together to make a whole?

The following boxplot illustrates both the average and the variation in vanity heights by region, and tells a more interesting story:

Redo_tallbuildings

Recall that in a boxplot, the gray box contains the middle 50% of the data and the white line inside the box indicates the median value. UAE has a tendency to inflate the heights more while the other three regions are not much different.

***

The other graphic included in the same article is only marginally better, despite a much more attractive exterior:

Quartz_ten_tallest-1

This chart misrepresents the actual heights of the buildings. At first glance, I thought there must be a physical limit to the number of occupied floors since the grayed out sections are equal heights. If the decision has been made to focus on the vanity height, then just don't show the rest of the buildings.

Also, it's okay to assume a minimal intelligence on the part of readers - I mean, is there a need to repeat the "non-occupiable height" label 10 times? Similarly, the use of 10 sets of double asterisks is rather extravagant.

 


Book quiz data geekery, plus another free book

The winner of the Numbersense Book Quiz has been announced. See here.

GOOD NEWS: McGraw-Hill is sponsoring another quiz. Same format. Another chance to win a signed book. Click here to go directly to the quiz.

***

Numbersense_quiz1_timingI did a little digging around the quiz data. The first thing I'd like to know is when people sent in responses.

This is shown on the right. Not surprisingly, Monday and Tuesday were the most popular days, combining for 70 percent of all entries. The contest was announced on Monday so this is to be expected.

There was a slight bump on Friday, the last day of the contest.

I'm at a loss to explain the few stray entries on Saturday. This is very typical of real-world data; strange things just happen. In the software, I set the stop date to be Saturday, 12:00 AM, and I was advised that they abide by Pacific Standard Time. This doesn't seem to be the case, unless... the database itself is configured to a different time standard!

The last entry was around 7 am on Saturday. Pacific Time is about 8 hours behind Greenwich Mean Time, which is also the ISO 8601 standard used by a lot of web servers.

That's my best guess. I can't spend any more time on this investigation.

***

The next question that bugs me is how could only about 80% of the entries contained 3 correct answers. The quiz was designed to pose as low a barrier as possible, and I know based on interactions on the blog that the IQ of my readers is well above average.

I start with a hypothesis. Perhaps the odds of winning the book is rather low (even though it's much higher than any lottery), and some people are just not willing to invest the time to answer 3 questions, and they randomly guessed. What would the data say?

Numbersense_quiz_durationeligiblesHaha, these people are caught red-handed. The boxplots (on the left) show the time spent completing the quiz.

Those who have one or more wrong answers are labelled "eligible = 0" and those who have all 3 answers are labelled "eligible = 1".

There is very strong evidence that those who have wrong answers spent significantly less time doing the quiz. In fact, 50 percent of these people sent in their response less than 1 minute after starting the quiz! (In a boxplot, the white line inside the box indicates the median.)

Also, almost everyone who have one or more wrong answers spent less time filling out the quiz than the 25th-percentile person who have three correct answers.

As with any data analysis, one must be careful drawing conclusions. While I think these readers are unwilling to invest the time, perhaps just checking off the answers at random, there are other reasons for not having three correct answers. Abandonment is one, maybe those readers were distracted in the middle of the quiz. Maybe the system went down in the middle of the process (I'm not saying this happened, it's just a possibility.)

***

Finally, among those who got at least one answer wrong, were they more likely to enter the quiz at the start of the week or at the end?

Numbersense_quiz1_eligiblebyday There is weak evidence that those who failed to answer all 3 questions correctly were more likely to enter the contest on Friday (last day of the quiz) while those who entered on Wednesday or Thursday (the lowest response days of the week) were more likely to have 3 correct answers. It makes sense that those readers were more serious about wanting the book.

***

Now, hope you have better luck in round 2 of the Numbersense book quiz. Enter the quiz here.

 

 

 


A gift from the NY Times Graphics team

This post is long over-due. I have been meaning to write about this blog for a long time but never got around to it. It's like the email response you postponed because you want to think before you fire it off. But I received two mentions of it within the last few days, which reminded me I have to get to work on this one.

One of the best blogs to read - that is similar in spirit to Junk Charts - is ChartNThings. This is the behind-the-scenes blog of the venerable New York Times graphics department. They talk about the considerations that go into making specific charts that subsequently showed up in the newspaper. You get to see their sketches. Kind of like my posts here, except with the graphics professional's perspective.

As Andrew Gelman said in his annotated blog roll (link), ChartNThings is "the ultimate graphics blog. The New York Times graphics team presents some great data visualizations along with the stories behind them. I love this sort of insider’s perspective."

***

The other mention is from a friend who reviewed something I wrote about fantasy football. He pointed me to this particular post from the ChartNThings blog that talks about luck and skill in NFL.

They have a perfect illustration of how statistics can help make charts better.

Start with the following chart that shows the value of players picked organized by the round in which they are picked.

Chartsnthings_nfl1

Think of this as plotting the raw data. A pattern is already apparent, which is that on average, the players picked in earlier rounds (on the left) have produced higher value for their clubs. However, there is quite a bit of noise on the page. One problem with dot plots is over-plotting when the density of points is high, as is here. Our eyes cannot judge density properly especially in the presence of over-plotting.

What the NYT team did next is to take the average value for all players picked in each round in each year, and plot those instead. This drastically reduces the number of dots per round, and cleans up the canvass a great deal.

Chartsnthings_nfl2
It's amazing how much more powerful is this chart than the previous one. Instead of the average value, one can also try the median value, or plot percentiles to showcase the distribution. (They later offered a side-by-side box plot, which is also an excellent idea.)

The post then goes into exploring a paper by some economists who wanted to ignore the average and focus on the noise. I'll make some comments on that analysis on my other blog. (The post is now live.)

 ***

One behind-the-scenes thing I'd add about this behind-the-scenes blog is that the authors must have spent quite a bit of time organizing the materials and creating the streamlined stories for us to savor. Graphical creation involves a lot of sketching and exploration, so there are lots of dead ends, backtracking, stuff you throw away. There will be lots of charts with little flaws that you didn't care to correct because it's not your final version. There will be lots of charts which will only be intelligible to the creator since they are missing labels, scales, etc., again because those were supposed to be sketch work. There will even be charts that the creator can't make sense of because the train of thought has been lost by the end of the project.

So we should applaud what the team has done here for the graphics community.


Different pictures of unemployment

Unemployment and job losses being such a worrying social problem in the U.S., one can find many attempts to visualize the predicament. In this post, I will look at two widely circulated charts, and some design decisions behind these charts.

Slate_jobsAug09 First up, Slate uses an interactive map. (Click on the link for interactivity.)

Here, county-level data is being plotted, with the size of the bubbles indicating the number of jobs, red for jobs lost, blue for jobs gained, all of which computed year on year for a given month.

As you play with this display, think about the first question of the Trifecta checkup: what is the practical issue being addressed by this chart? What is the message the designer wants to convey?

Most likely, the answer will be something like the progress of job losses between 2007 and 2009, or which parts of the country are most affected by job losses.

Is this display the best at illuminating these issues? The designer has chosen the map to illustrate geography, and interactivity to illustrate time. These are not controversial -- but they should be controversial.

Maps are over-used objects. We see the biggest circles always in California, along the Eastern seaboard and in the lake region. This is true pretty much 90% of the time. What we are seeing is the distribution of population across the U.S. What we are not seeing is how job losses affect different regions on the right scale. The bubbles in California are almost always larger than those in the Midwest because there are more people in California.

***

On the time dimension, the designer has chosen to use monthly data but only for three years 2007-9. However, when this is multiplied hundreds of times by the county dimension, it is simply impossible for readers to grasp any trends from the interactive chart. We can learn the aggregate trajectory of when job losses start to pile up, when the recession deepens, etc. but since you are living through this recession, you don't need this map to tell you that.

It is in fact alright for the designer to collapse the time dimension! Look at the following chart used by the Calculated Risk blog, which displays a similar data set (unemployment rate rather than jobs gained/lost).

StateUnemploymentRateJuly2010

Notice that this designer collapsed both the time and geography dimensions. Time is partially present inside the boxes, as the maximum, minimum and current unemployment levels being plotted correspond to certain years in the past. The max and min are picked from data stretching back to 1976, a much longer period than the Slate chart. Geography is at the state level, rather than the county level (even though county-level data is available.) The states are sorted by the current level (July 2010) of unemployment.

The purpose of this designer is much easier to identify. For states like Nevada and California, the current situation is at the historical worst while for the Dakotas, they have seen much worse before.

If, for example, we want to know if different regions in the U.S. show discernable patterns, all we need to do is to use different colors of the boxes for different regions.

***

A problem with using the range (maximum and minimum) is outliers. The maximum or minimum values could be outliers. Put differently, the blue boxes shown above, while containing all unemployment rates going back to 1976, may not tell us much about the typical unemployment rate. What we might want to know is what the unemployment rate is like for most years.

For this, we can convert the max-min boxes into Tukey's boxplots.

Jc_StateJobs_boxplot In a boxplot, the box (gray area) contains half of the historical data. So if you look at DC (third from the bottom), unemployment in most years are narrowly constrained to about 6 to 8 percent although the max-min range is from under 5 to above 12.

For this chart, I sorted the states by median unemployment (black line inside the box) and the blue asterisks indicate the current level of unemployment (June 2010). Data comes from the BLS website.

Again, if regional differences need to be exposed, the boxes can be colored differently.

The outliers are plotted as dots on these boxplots; that too is data that may be considered extraneous to our purpose for this chart.

***

Is it a horrible thing for the designer to collapse dimensions like this? The data is available, and shouldn't all of them be used?

The truth is one can never cram all the data into a single chart. Even the Slate chart has collapsed some dimensions. Namely, the unemployment rates by demographics (age, gender, race, etc.) and by industry sector. Arguably those dimensions are as interesting as time and geography. 

The bottom line: don't try to use every piece of data, you can't anyway, you will be making choices as to which dimensions to expose and which to hide, choose wisely.

***

Thanks to Aleks for pointing to the Visualizing Economics blog, which collects graphs about the economy, from where I found these charts.