Graphical advice for conference presenters - demo

Yesterday, I pulled this graphic from a journal paper and said one should not copy and paste it into an oral presentation.

[Figure: Example_presentation_graphic]

So I went ahead and did some cosmetic surgery on this chart.

[Figure: Redo_example_conference_graphic]

I don't know anything about the underlying science; I'm just interpreting what I see on the chart. The key message seems to be that the Flowering condition is different from the other three. There are no statistical differences among the three boxplots in the first three panels, but there is a big difference between the red-green pair and the purple in the last panel. Further, this difference can be traced to the red-green boxplots exhibiting negative correlation under the Flowering condition, while the purple boxplot is the same under all four conditions.

I would also have chosen different colors, e.g. making red and green two shades of gray to indicate that those two series can be treated as the same in this chart. Doing so would obviate the need to introduce the orange color.

Further, I think it might be interesting to see the plots split differently: try having the red-green boxplots side by side in one panel, and the purple boxplots in another panel.

If the presentation software has animation, the presenter can show the different text blocks and related materials one at a time. That also aids comprehension.

***

Note that the plot is designed for an oral presentation in which you have a minute or two to get the message across. It's debatable whether journal editors should accept this style in publications. I actually think such a style would improve reading comprehension, but I surmise some of you will disagree.


Various ways of showing distributions

The other day, a chart about the age distribution of Olympic athletes caught my attention. I found the chart on Google but didn't bookmark it, and now I can't retrieve it. In my mind's eye, the chart looks like this:

[Figure: Age_olympics_stackedbars]

This chart has the form of a stacked bar chart, but it really isn't one. The data embedded in each bar segment aren't proportions; rather, they are counts of athletes along a standardized age scale. For example, the very long bar segment on the right side of the bar for alpine skiing does not indicate a large proportion of athletes in the 30-50 age group; quite the opposite: that part of the distribution is sparse, with an outlier at age 50.

The easiest way to understand this chart is to transform it to histograms.

[Figure: Redo_age_olympics_histo2]

In a histogram, the counts for different age groups are encoded in the heights of the columns. Instead, encode the counts in a color scale, so that taller columns map to darker shades of blue. Then collapse all the columns to the same height. Each bar of the stacked bar chart is really a collapsed histogram.
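To make the collapsing step concrete, here is a minimal sketch in Python with matplotlib, using simulated ages rather than the actual Olympics data: the top panel is an ordinary histogram, and the bottom panel re-encodes the same counts as shades of blue on equal-height segments.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated ages for one sport; the real data would be one age per athlete.
rng = np.random.default_rng(0)
ages = rng.normal(26, 4, 300)

# Ordinary histogram: counts encoded as column heights.
counts, edges = np.histogram(ages, bins=np.arange(15, 51))

fig, (top, bottom) = plt.subplots(2, 1, figsize=(8, 3), sharex=True)
top.bar(edges[:-1], counts, width=1, align="edge")
top.set_ylabel("count")

# Collapsed histogram: the same counts, now encoded as shades of blue
# on equal-height segments -- one bar of the "stacked bar" chart.
bottom.bar(edges[:-1], np.ones_like(counts), width=1, align="edge",
           color=plt.cm.Blues(counts / counts.max()))
bottom.set_yticks([])
bottom.set_xlabel("age")
plt.show()
```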

***

The stacked bar chart reminds me of the boxplots beloved by statisticians.

[Figure: Redo_age_olympics_boxplot2b]

In a boxplot, the box contains the middle 50% of the athletes in each sport (this maps directly to the dark blue bar segments in the chart above). Outlier values are plotted individually, which gives a bit more information about the sparsity of certain bar segments, such as the right side of alpine skiing.
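A matching boxplot sketch, with the same caveat that the ages are simulated: the box spans the middle 50%, and the fliers mark individual outliers, like the age-50 alpine skier.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated ages for three sports; the lone 50-year-old mimics the
# alpine-skiing outlier discussed above.
rng = np.random.default_rng(1)
sports = {
    "Alpine Skiing": np.append(rng.normal(26, 4, 120), 50.0),
    "Gymnastics": rng.normal(20, 3, 120),
    "Equestrian": rng.normal(38, 8, 60),
}

fig, ax = plt.subplots(figsize=(7, 3))
# Horizontal boxes echo the layout of the stacked bar chart above.
ax.boxplot(list(sports.values()), vert=False, labels=list(sports.keys()))
ax.set_xlabel("age")
plt.show()
```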

The stacked bar chart can be considered a nicer-looking version of the boxplot.


Revisiting the home run data

Note to New York metro readers: I'm an invited speaker at NYU's "Art and Science of Brand Storytelling" summer course which starts tomorrow. I will be speaking on Thursday, 12-1 pm. You can still register here.

***

The home run data set, compiled by ESPN and visualized by Mode Analytics, is pretty rich. I took a quick look at one aspect of the data. The question I ask is: what differences exist among the 10 hitters highlighted in the previous visualization? (I am not quite sure how those 10 were picked, because they are not the top 10 home run hitters in the dataset for the current season.)

The following chart focuses on two metrics: the total number of home runs by this point in the season, and the "true" distances of those home runs. I split the data by whether the home run was hit at the home field or at an away stadium, on the hunch that we'd need to correct for such differences.

[Figure: Jc_top10hitters_homeaway_splits]

The hitters are sorted by total number of home runs. Because I am using a single season, my chart doesn't suffer from a cohort bias. If you go back to the original visualization, it is clear that some of these hitters are veterans with many seasons of baseball in them while others are newbies. This cohort bias explains the difference in dot densities of those plots.
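For anyone who wants to build a similar split view, here is a rough sketch of the top panel in Python with pandas and matplotlib. The counts are invented for illustration; the real numbers come from the ESPN data set.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Invented home/away counts for a few hitters (not the actual ESPN data).
hr = pd.DataFrame({
    "hitter": ["Frazier", "Tulowitzki", "Stanton", "Donaldson"],
    "home": [14, 12, 11, 6],
    "away": [5, 7, 8, 10],
})
hr["total"] = hr["home"] + hr["away"]
hr = hr.sort_values("total")  # sort hitters by total home runs

# Dot plot with the home/away split on a shared axis.
fig, ax = plt.subplots(figsize=(6, 3))
y = np.arange(len(hr))
ax.scatter(hr["home"], y, marker="o", label="home")
ax.scatter(hr["away"], y, marker="s", label="away")
ax.set_yticks(y)
ax.set_yticklabels(hr["hitter"])
ax.set_xlabel("home runs")
ax.legend()
plt.show()
```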

Not having followed baseball recently, I don't know many of the names on this list. I had to look up Todd Frazier: does he play in a hitter-friendly ballpark? His home-to-away ratio is massive. Frazier plays for Cincinnati, at the Great American Ballpark. That ballpark has the third-highest number of home runs hit of all ballparks this season, although up till now opponents have hit more home runs there than home players. For reference, Troy Tulowitzki's home field is Colorado's Coors Field, which is a hitter's paradise. Giancarlo Stanton, who also hits quite a few more home runs at home, plays for Miami at Marlins Park, which is below the median in terms of home run production; thus his achievement is probably the most impressive among those three.

Josh Donaldson is the odd man out: he has hit more home runs away than at home. His home field, the O.co Coliseum, is middle-of-the-road in terms of home runs.

In terms of how far the home runs travel (bottom part of the chart), there are some interesting tidbits. Brian Dozier's home runs are generally the shortest, regardless of home or away. Yasiel Puig and Giancarlo Stanton generate deep home runs. Adam Jones, Josh Donaldson, and Yoenis Cespedes have hit the ball quite a bit deeper away from home. Giancarlo Stanton is one of the few who has hit the home-run ball deeper at his home stadium.

The baseball season is still young, and the sample sizes at the individual hitter's level are small (~15-30 home runs each), so the observed differences at the home/away level are mostly statistically insignificant.

The prior post on the original graphic can be found here.


An overused chart, why it fails, and how to fix it

Reader and tipster Chris P. found this "death spiral" chart dizzying (link).

[Figure: Piomas_image]

It's one of those charts that has conceptual appeal but does not do the data justice. As the name implies, the designer has a strong message: that Arctic sea ice volume has declined dramatically over time. The message is in the chart, but the reader has to work hard to find it.

Why doesn't this spider chart work? We can be more precise.

  • A big problem is the lack of scalability. The chart looks different every year: if you add an extra year, you either have to increase the density of the rings or drop the earliest year.
  • Years are not circular or periodic so the metaphor doesn't quite work.
  • This chart type requires way too many gridlines.
  • Axis labeling is also awkward. Because of the polar coordinates, the axes radiate from the center, so the numbers run up toward the top of the chart but down toward the bottom.
  • This specific instance of the spider chart benefits from well-behaved data: the between-year variability is much lower than the within-year variability, so the lines don't cross each other much. If the year-to-year variability fluctuated more, we would be looking at a bunch of noodles.

This is a pity, because the designer did very well in aligning two corners of the Trifecta Checkup: what is the question, and what does the data show? It is a great idea to control for month of year and look at year-to-year changes. (A more typical view would be to look at month-to-month changes and plot one line per year.)

This is an example of a chart that does well on one side of the checkup, but fails because the graph isn't in tune with the data or the question being addressed.

Whenever I see a spider chart, I want to unroll the spiral and see if a line chart is better. Thus:

[Figure: Redo_piomas1]

The dramatic decrease in Arctic ice volume (no matter the month) is clear as day. You can actually read off the magnitude of the drop. (Try doing that in the spider chart, say between 1978 and 1995.)
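The unrolling itself is mechanical. Here is a sketch using a made-up stand-in for the PIOMAS series (rows are years, columns are months, with a seasonal cycle plus a downward trend): instead of wrapping the months around a circle, plot one line per month against year.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in for the ice-volume data: rows = years, columns = months.
years = np.arange(1979, 2014)
rng = np.random.default_rng(2)
seasonal = 10 + 8 * np.cos(2 * np.pi * (np.arange(12) - 3) / 12)
trend = -0.25 * (years - years[0])
volume = pd.DataFrame(
    seasonal + trend[:, None] + rng.normal(0, 0.5, (len(years), 12)),
    index=years, columns=range(1, 13))

# Unroll the spiral: one line per month, plotted against year.
fig, ax = plt.subplots(figsize=(7, 4))
for month in volume.columns:
    ax.plot(volume.index, volume[month], color=plt.cm.viridis(month / 12), lw=1)
ax.set_xlabel("year")
ax.set_ylabel("ice volume")
plt.show()
```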

The line chart still has issues, namely too many colors. One can color the lines by season of the year, like this:

[Figure: Redo_piomas_season1]

Or switch to a small-multiples setup, with three lines per chart and one chart per season.

The seasonal arrangement is not arbitrary. You can see the effect of season by looking at side by side boxplots:

[Figure: Redo_piomas2]

The pattern is UP-DOWN-DOWN-UP.

In fact, a side-by-side boxplot of the data provides a very informative look:

[Figure: Redo_piomas3]

The monthly series is obscured in this view, absorbed into the vertical variability, which we can see is quite stable. The idea of controlling for month is to make it irrelevant. This view emphasizes the year-on-year decline of the entire distribution.

If you're worried that this drops too much information, the data can be grouped by season as before in a small-multiples setup, like this:

[Figure: Redo_piomas4]

Regardless of season, the trend is down.
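Here is a sketch of that seasonal small-multiples view, using the same made-up stand-in data as in the earlier snippet: one panel per season, one boxplot per year.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Same made-up stand-in for the ice-volume data as before.
years = np.arange(1979, 2014)
rng = np.random.default_rng(2)
seasonal = 10 + 8 * np.cos(2 * np.pi * (np.arange(12) - 3) / 12)
volume = pd.DataFrame(
    seasonal - 0.25 * (years - years[0])[:, None]
    + rng.normal(0, 0.5, (len(years), 12)),
    index=years, columns=range(1, 13))

# One small multiple per season, one boxplot per year within each panel.
seasons = {"Winter": [12, 1, 2], "Spring": [3, 4, 5],
           "Summer": [6, 7, 8], "Autumn": [9, 10, 11]}
fig, axes = plt.subplots(2, 2, figsize=(9, 6), sharey=True)
for ax, (name, months) in zip(axes.flat, seasons.items()):
    ax.boxplot([volume.loc[y, months] for y in years],
               positions=years, widths=0.6, manage_ticks=False)
    ax.set_title(name)
    ax.set_xlim(years[0] - 1, years[-1] + 1)
    ax.set_xticks(years[::10])
plt.tight_layout()
plt.show()
```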


PS. Alberto reminds me of his post about an example of a spider chart (radar chart) that works. Here's the link. It works because the graphical element is more in tune with the data: while the ice-cap data has a linear trend over time, the voting data is all about differences in distribution. Also, the designer expects readers to care about the high-level pattern, not the specifics.


Vanity heights and scary charts

Sometimes I wonder if I should just become a chart doctor. Andrew recently wrote that journals should have graphical editors. Businesses also need them, judging from this submission via Twitter (@francesdonald). Link is here.

You don't know whether to laugh or cry at this pie chart:

[Figure: Quartz_tallbuildings2]

The author of the article complains that all the tall buildings around the world are cheats: vanity height is defined as the height above which the floors are unoccupied. The sample proportions aren't that different between countries, ranging from 13% to 19% of the total heights. So why are they added together to make a whole?
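Since the percentages are not parts of a whole, the most direct fix is a bar chart. A sketch (the values are approximations within the article's 13%-19% range, and the region labels are placeholders, not the article's exact categories):

```python
import matplotlib.pyplot as plt

# Approximate vanity-height shares by region; placeholder labels and values.
regions = ["UAE", "China", "USA", "Rest of world"]
vanity_pct = [19, 15, 14, 13]

# These numbers don't sum to anything meaningful, so no pie:
# a bar chart lets readers compare them directly.
fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(regions, vanity_pct)
ax.set_ylabel("vanity height (% of total height)")
plt.show()
```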

The following boxplot illustrates both the average and the variation in vanity heights by region, and tells a more interesting story:

[Figure: Redo_tallbuildings]

Recall that in a boxplot, the gray box contains the middle 50% of the data, and the white line inside the box indicates the median value. The UAE has a tendency to inflate heights more, while the other three regions are not much different from one another.

***

The other graphic included in the same article is only marginally better, despite a much more attractive exterior:

[Figure: Quartz_ten_tallest-1]

This chart misrepresents the actual heights of the buildings. At first glance, I thought there must be a physical limit to the number of occupied floors, since the grayed-out sections are all of equal height. If the decision has been made to focus on vanity height, then just don't show the rest of the buildings.

Also, it's okay to assume some minimal intelligence on the part of readers. I mean, is there really a need to repeat the "non-occupiable height" label 10 times? Similarly, the use of 10 sets of double asterisks is rather extravagant.


Book quiz data geekery, plus another free book

The winner of the Numbersense Book Quiz has been announced. See here.

GOOD NEWS: McGraw-Hill is sponsoring another quiz. Same format. Another chance to win a signed book. Click here to go directly to the quiz.

***

[Figure: Numbersense_quiz1_timing]

I did a little digging around the quiz data. The first thing I'd like to know is when people sent in responses.

This is shown on the right. Not surprisingly, Monday and Tuesday were the most popular days, combining for 70 percent of all entries. The contest was announced on Monday, so this is to be expected.

There was a slight bump on Friday, the last day of the contest.

I'm at a loss to explain the few stray entries on Saturday. This is very typical of real-world data: strange things just happen. In the software, I set the stop date to Saturday, 12:00 AM, and I was advised that the system abides by Pacific Standard Time. That doesn't seem to be the case, unless... the database itself is configured to a different time standard!

The last entry was around 7 am on Saturday. Pacific Time is about 8 hours behind Greenwich Mean Time (essentially UTC), which is the default clock for a lot of web servers.
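One can check this guess by converting a server timestamp to Pacific time. A sketch with a made-up timestamp (not the actual database record):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

# Hypothetical: an entry the quiz system stamps "Saturday, 7:00 AM".
# If the database clock runs on UTC rather than Pacific time, that entry
# actually arrived on Friday evening Pacific time, before the deadline.
stamp_utc = datetime(2013, 1, 26, 7, 0, tzinfo=ZoneInfo("UTC"))
stamp_pacific = stamp_utc.astimezone(ZoneInfo("America/Los_Angeles"))
print(stamp_pacific)  # 2013-01-25 23:00:00-08:00 -- Friday, 11 PM Pacific
```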

That's my best guess. I can't spend any more time on this investigation.

***

The next question that bugs me is how only about 80% of the entries contained 3 correct answers. The quiz was designed to pose as low a barrier as possible, and I know from interactions on the blog that the IQ of my readers is well above average.

I start with a hypothesis. Perhaps the odds of winning the book are rather low (even though they're much higher than in any lottery), and some people are just not willing to invest the time to answer 3 questions, so they guessed randomly. What would the data say?

[Figure: Numbersense_quiz_durationeligibles]

Haha, these people are caught red-handed. The boxplots (on the left) show the time spent completing the quiz.

Those who had one or more wrong answers are labelled "eligible = 0" and those who got all 3 answers correct are labelled "eligible = 1".

There is very strong evidence that those who had wrong answers spent significantly less time on the quiz. In fact, 50 percent of these people sent in their responses less than 1 minute after starting the quiz! (In a boxplot, the white line inside the box indicates the median.)

Also, almost everyone who had one or more wrong answers spent less time filling out the quiz than the 25th-percentile person among those with three correct answers.

As with any data analysis, one must be careful drawing conclusions. While I think these readers were unwilling to invest the time, perhaps just checking off answers at random, there are other explanations for not having three correct answers. Abandonment is one: maybe those readers were distracted in the middle of the quiz. Or maybe the system went down mid-quiz. (I'm not saying this happened; it's just a possibility.)

***

Finally, among those who got at least one answer wrong, were they more likely to enter the quiz at the start of the week or at the end?

[Figure: Numbersense_quiz1_eligiblebyday]

There is weak evidence that those who failed to answer all 3 questions correctly were more likely to enter the contest on Friday (the last day of the quiz), while those who entered on Wednesday or Thursday (the lowest-response days of the week) were more likely to have 3 correct answers. It makes sense that those readers were more serious about wanting the book.

***

Now, I hope you have better luck in round 2 of the Numbersense book quiz. Enter the quiz here.



A gift from the NY Times Graphics team

This post is long overdue. I have been meaning to write about this blog for a long time but never got around to it. It's like the email response you postpone because you want to think before you fire it off. But I received two mentions of it within the last few days, which reminded me to get to work on this one.

One of the best blogs to read - similar in spirit to Junk Charts - is chartsnthings, the behind-the-scenes blog of the venerable New York Times graphics department. They talk about the considerations that go into making specific charts that subsequently show up in the newspaper. You get to see their sketches. It's kind of like my posts here, except from the graphics professional's perspective.

As Andrew Gelman said in his annotated blog roll (link), chartsnthings is "the ultimate graphics blog. The New York Times graphics team presents some great data visualizations along with the stories behind them. I love this sort of insider’s perspective."

***

The other mention came from a friend who reviewed something I wrote about fantasy football. He pointed me to this particular post on the chartsnthings blog about luck and skill in the NFL.

They have a perfect illustration of how statistics can help make charts better.

Start with the following chart, which shows the value of players picked, organized by the round in which they were picked.

[Figure: Chartsnthings_nfl1]

Think of this as plotting the raw data. A pattern is already apparent: on average, players picked in earlier rounds (on the left) have produced higher value for their clubs. However, there is quite a bit of noise on the page. One problem with dot plots is over-plotting when the density of points is high, as it is here. Our eyes cannot judge density properly, especially in the presence of over-plotting.

What the NYT team did next was to take the average value of all players picked in each round in each year, and plot those averages instead. This drastically reduces the number of dots per round and cleans up the canvas a great deal.

[Figure: Chartsnthings_nfl2]

It's amazing how much more powerful this chart is than the previous one. Instead of the average value, one could also plot the median value, or percentiles to showcase the distribution. (They later offered a side-by-side boxplot, which is also an excellent idea.)
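The aggregation step is a one-liner in a tool like pandas. A sketch with simulated draft data (round, year, and a value metric per player), showing the raw dots next to the round-year averages:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Simulated draft data: one row per player, with round, year, and career value.
rng = np.random.default_rng(3)
picks = pd.DataFrame({
    "round": rng.integers(1, 8, 2000),
    "year": rng.integers(1995, 2012, 2000),
})
picks["value"] = rng.exponential(40 / picks["round"])

fig, (left, right) = plt.subplots(1, 2, figsize=(9, 3.5), sharey=True)
# Raw dots: heavy over-plotting obscures the pattern.
left.scatter(picks["round"], picks["value"], s=5, alpha=0.2)
left.set_title("every player")

# One dot per round per year: averaging collapses the noise.
means = picks.groupby(["round", "year"])["value"].mean().reset_index()
right.scatter(means["round"], means["value"], s=15)
right.set_title("round-year averages")
for ax in (left, right):
    ax.set_xlabel("draft round")
plt.show()
```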

The post then goes on to explore a paper by some economists who wanted to ignore the average and focus on the noise. I'll make some comments on that analysis on my other blog. (The post is now live.)

***

One behind-the-scenes thing I'd add about this behind-the-scenes blog is that the authors must have spent quite a bit of time organizing the materials and creating the streamlined stories for us to savor. Graphical creation involves a lot of sketching and exploration, so there are lots of dead ends, backtracking, and stuff you throw away. There will be lots of charts with little flaws that you didn't care to correct because they weren't the final version. There will be lots of charts intelligible only to their creator because they're missing labels, scales, and so on - again, they were supposed to be sketch work. There will even be charts the creator can no longer make sense of, because the train of thought was lost by the end of the project.

So we should applaud what the team has done here for the graphics community.


Different pictures of unemployment

With unemployment and job losses being such a worrying social problem in the U.S., one can find many attempts to visualize the predicament. In this post, I look at two widely circulated charts, and some of the design decisions behind them.

[Figure: Slate_jobsAug09]

First up, Slate uses an interactive map. (Click on the link for interactivity.)

Here, county-level data is plotted, with the size of the bubbles indicating the number of jobs affected: red for jobs lost, blue for jobs gained, all computed year over year for a given month.

As you play with this display, think about the first question of the Trifecta Checkup: what is the practical issue being addressed by this chart? What message does the designer want to convey?

Most likely, the answer will be something like the progress of job losses between 2007 and 2009, or which parts of the country are most affected by job losses.

Is this display the best way to illuminate these issues? The designer has chosen a map to show geography, and interactivity to show time. These choices seem uncontroversial -- but they should be controversial.

Maps are over-used objects. We see the biggest circles always in California, along the Eastern seaboard, and in the Great Lakes region. This is true pretty much 90% of the time. What we are really seeing is the distribution of population across the U.S. What we are not seeing is how job losses affect different regions on the right scale: the bubbles in California are almost always larger than those in the Midwest simply because more people live in California.
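One simple correction is to divide by each county's labor force, so the bubbles encode rates rather than raw counts. A sketch with invented figures (the county names are real, the numbers are not):

```python
import pandas as pd

# Invented county figures; only the normalization step matters here.
counties = pd.DataFrame({
    "county": ["Los Angeles, CA", "Cook, IL", "Elkhart, IN"],
    "jobs_lost": [120_000, 60_000, 19_000],
    "labor_force": [4_900_000, 2_600_000, 102_000],
})

# Raw counts track population; a per-capita rate puts counties on the same scale.
counties["loss_rate"] = counties["jobs_lost"] / counties["labor_force"]
print(counties.sort_values("loss_rate", ascending=False))
```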

***

On the time dimension, the designer has chosen monthly data, but only for the three years 2007-9. However, when this is multiplied hundreds of times across counties, it is simply impossible for readers to grasp any trends from the interactive chart. We can learn the aggregate trajectory - when job losses started to pile up, when the recession deepened, and so on - but since you are living through this recession, you don't need this map to tell you that.

It is in fact alright for the designer to collapse the time dimension! Look at the following chart from the Calculated Risk blog, which displays a similar data set (unemployment rate rather than jobs gained/lost).

[Figure: StateUnemploymentRateJuly2010]

Notice that this designer collapsed both the time and geography dimensions. Time is partially present inside the boxes, as the maximum, minimum, and current unemployment levels plotted correspond to particular years in the past. The max and min are picked from data stretching back to 1976, a much longer period than the Slate chart covers. Geography is at the state level rather than the county level (even though county-level data is available). The states are sorted by the current (July 2010) level of unemployment.

This designer's purpose is much easier to identify: for states like Nevada and California, the current situation is the historical worst, while the Dakotas have seen much worse before.

If, for example, we want to know whether different regions of the U.S. show discernible patterns, all we need to do is color the boxes by region.

***

A problem with using the range (maximum and minimum) is outliers: the maximum or minimum values could themselves be outliers. Put differently, the blue boxes shown above, while containing all unemployment rates going back to 1976, may not tell us much about the typical unemployment rate. What we might want to know is what the unemployment rate is like in most years.

For this, we can convert the max-min boxes into Tukey's boxplots.

[Figure: Jc_StateJobs_boxplot]

In a boxplot, the box (gray area) contains half of the historical data. So if you look at DC (third from the bottom), unemployment in most years is narrowly constrained to about 6 to 8 percent, although the max-min range runs from under 5 to above 12.

For this chart, I sorted the states by median unemployment (the black line inside each box), and the blue asterisks indicate the current level of unemployment (June 2010). The data comes from the BLS website.
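To reproduce this kind of chart, sort the groups by median, draw the boxplots, then overlay the current values. A compact sketch with simulated state series and made-up "current" rates (not the BLS data):

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated monthly unemployment series for a few states, 1976-2010.
rng = np.random.default_rng(4)
states = {"NV": rng.normal(6.5, 2.0, 414), "ND": rng.normal(3.5, 0.8, 414),
          "DC": rng.normal(7.0, 1.0, 414), "CA": rng.normal(7.5, 1.8, 414)}
current = {"NV": 14.3, "ND": 3.6, "DC": 9.8, "CA": 12.3}  # made-up "June 2010"

# Sort the states by median unemployment, then overlay the current rate.
order = sorted(states, key=lambda s: np.median(states[s]))
fig, ax = plt.subplots(figsize=(7, 3))
ax.boxplot([states[s] for s in order], labels=order)
ax.plot(range(1, len(order) + 1), [current[s] for s in order],
        "b*", markersize=10, label="current rate")
ax.set_ylabel("unemployment rate (%)")
ax.legend()
plt.show()
```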

Again, if regional differences need to be exposed, the boxes can be colored differently.

The outliers are plotted as dots on these boxplots; that too is data that may be considered extraneous to our purpose for this chart.

***

Is it a horrible thing for the designer to collapse dimensions like this? The data is available; shouldn't all of it be used?

The truth is, one can never cram all the data into a single chart. Even the Slate chart collapses some dimensions, namely the unemployment rates by demographics (age, gender, race, etc.) and by industry sector. Arguably, those dimensions are as interesting as time and geography.

The bottom line: you can't use every piece of data anyway, so don't try. You will be making choices as to which dimensions to expose and which to hide; choose wisely.

***

Thanks to Aleks for pointing to the Visualizing Economics blog, which collects graphs about the economy, from where I found these charts.


Eye heart this

Dan at Eye Heart New York has a fantastic post on the recent release of restaurant health inspection data by New York City. The release has caused a furor among restaurant owners, because they are now required to display their A/B/C grades front and center. Dan collected some data (which he also posted), made some charts, and reported some interesting insights.

Here is an overview chart showing the distribution of scores (the higher the score, the lower the grade). He called it a "scatter plot," but it is really a histogram with a bucket size of 1, except for the rightmost bucket.

[Figure: Chart-scores-colored-nycfood]

I like the use of green, yellow, and red to indicate (without words) the conversion scale from scores (violation points) to grades (A/B/C). The legend "Count" is an Excel monstrosity. I'd have used a bucket size of at least 5, which would smooth out the gyrations in the green zone.
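Rebinning is a one-liner once you have the raw scores. A sketch with simulated scores; the grade cutoffs (A up to 13 points, B up to 27) reflect the NYC scheme as I understand it, so treat them as assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated violation scores, one per restaurant.
rng = np.random.default_rng(5)
scores = np.clip(rng.gamma(4, 5, 5000), 0, 120)

fig, ax = plt.subplots(figsize=(7, 3))
# A bucket size of 5 smooths out the bin-to-bin gyrations of a width-1 histogram.
ax.hist(scores, bins=np.arange(0, 125, 5), edgecolor="white")
# Assumed grade cutoffs: A up to 13 points, B up to 27, C above.
for cutoff, color in [(13, "green"), (27, "orange")]:
    ax.axvline(cutoff, color=color, linestyle="--")
ax.set_xlabel("violation points")
ax.set_ylabel("count")
plt.show()
```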

A more typical way to summarize numeric data in groups is Tukey's boxplot, as shown below.

[Figure: Tukey_boxplot]

I used Dan's raw data for this chart. 1 = A, 2 = B, 3 = C. What is group 4?

It turns out Dan removed this group from all of his analysis. A little research shows that group 4 consists of restaurants that have been closed by the Dept of Health. Interestingly, the scores of these restaurants are spread widely, so the DOH appears to be closing restaurants not just for health violations. (In the rest of this post, I have removed group 4.)

For those not familiar with boxplots: the box contains the middle 50% of the data (in this case, the scores of the middle half of the restaurants in each group); the line inside the box is the median score; the dots above (or below, though nonexistent here) the vertical lines are outliers. As Dan pointed out, group C has lots of outliers at the high end of the score.

[Figure: Score111]

Just for fun, I pulled the violation history of the highest-scoring restaurant (111 violation points). What I find intriguing is the huge fluctuation in its scores over the last 5 inspections. Does this happen to other restaurants too? What does that say about the grading system?


***

Next, Dan attempted to address two questions: did scores vary across the 5 boroughs? And did scores vary across cuisine groups? This is the concept covered in Chapter 1 of my book: always look at the variation around averages; that's where the most interesting stuff is.

He calculated the means and standard deviations of different subgroups. It is simpler to visualize the data, again using boxplots.

Here's one dealing with boroughs, and it is clear that there is not much to pick between them. You could possibly say Staten Island is better than the other 4 boroughs.

[Figure: Redo_scorebyborough]

Here's one dealing with cuisine groups, using Dan's definitions.

[Figure: Redo_scorebycuisinggroups]

The cuisine groups are ordered by median score, from lowest on the left to highest on the right. Again, there is no drastic difference. It is certainly not the case that Asian or Latin American restaurants are worse than, say, European or American ones.

About half of the restaurants under desserts, drinks, misc., African, and others received A's, while a bit less than half of the other cuisine groups got A's. Some of the cuisine groups had few egregious violators (African, Middle Eastern), but this data is perhaps skewed by the removal of the "closed" restaurants.

One shortcoming of the traditional boxplot is that it omits how large each group is. For groups that are too small, it is difficult to draw any statistical conclusions. We know from Dan's table, for instance, that there were only 17 restaurants classified as "African". (See the sketch below for one way to fix this.)

(Unfortunately, Excel does not have built-in capability for generating boxplots.)
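In a scriptable plotting tool, the fix is straightforward: print each group's size under its box. A minimal sketch with simulated scores (the n=17 mirrors Dan's African group):

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated scores for cuisine groups of very different sizes.
rng = np.random.default_rng(6)
groups = {"American": rng.gamma(4, 4, 800),
          "Asian": rng.gamma(4.5, 4, 600),
          "African": rng.gamma(4, 4, 17)}  # tiny group, like Dan's n=17

fig, ax = plt.subplots(figsize=(7, 3))
ax.boxplot(list(groups.values()), labels=list(groups.keys()))
# Annotate each group's size so readers can judge statistical reliability.
for i, (name, vals) in enumerate(groups.items(), start=1):
    ax.annotate(f"n={len(vals)}", xy=(i, ax.get_ylim()[0]),
                xytext=(0, 5), textcoords="offset points", ha="center")
ax.set_ylabel("violation points")
plt.show()
```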


Self-sufficient charts

A good example showed up in the New York Times recently of a chart that fails the self-sufficiency test I often speak about here. First, the doctored chart (with the data removed):

[Figure: Redo_hometeampies]
And for comparison, the chart as originally printed (the chart appeared only in the paper edition, not online):

[Figure: Nyt_homefield_sm]
There is little doubt that the second version, with the data -- all four numbers -- printed on the chart, is much more effective, and that is why the designer thought to include them.

This shows that readers gravitate to the data rather than the graphical constructs, which is why I consider these types of charts not self-sufficient: the graphical constructs can't stand on their own.

***

The choice of pie charts in a small-multiples arrangement is a mistake for this data set. While in theory the winning percentage could range from 0 to 100%, in practice winning percentages are rather narrowly dispersed (with the exception of the NFL, which has only a 16-game regular season).

Just quickly looking up the 2009 regular seasons: MLB teams ranged from 36% (Nationals) to 65% (Yankees); NHL ranged from 32% (Islanders) to 65% (Bruins); NBA from 21% (Sacramento) to 81% (Cleveland).

In order to judge whether 60% or 52% is a large or small number, readers need to have a sense of how teams are dispersed around those averages. A side-by-side boxplot brings this out pretty well (the data is from the 2009 seasons).

[Figure: Redo_homewins]

The "box" in a boxplot contains the middle 50% of the teams in each league while the line inside the box depicts the median team (in terms of winning percentage).

NBA teams showed much higher variability in winning percentages than NHL or MLB teams. Given this, a difference in average winning percentage of, say, 2% or 5% from one league to the next is not remarkable.

(The original article did not really pertain to such a comparison so the reason for this chart is not clear.)