
Book review: Interactive Graphics for Data Analysis

I am happy to provide the following review of this interesting book by Martin and Simon, who are readers of Junk Charts. Martin also publishes a blog, and he's the one who has created bumps charts for the Tour de France races (which also appear in the book).

Interactive Graphics for Data Analysis is an advanced book written by two researchers who have deep experience developing graphics software. People who like to go beyond the basics will find it a useful addition to the literature.

To give you an idea of the level of sophistication, just in Chapter 1 (titled Interactivity), the two authors utilize set operations, SQL statements, and parallel coordinate plots. They assume you have some sense of what those are. That said, those sections can be skipped without interrupting the flow of the book.

The following key messages from these authors are worth repeating:

  • There is a distinction between statistical graphics and data graphics. Underlying trends and patterns in the data are often made clear by performing statistical analyses on the data, with the results added to charts (e.g. loess lines). When dealing with very large data sets, statistical charts (such as box plots) are found to be much more scalable, precisely because they do not attempt to put every data point onto the page.
  • The authors stress the need to look at a variety of charts when doing exploratory data analysis. This is because most chart types do certain things well but not others.
  • Throughout the book, they make much hay of the problem of "over-plotting", that is, overlapping data. This happens when data is abundant, or when values are concentrated in a narrow range. A great illustration of this problem is the parallel coordinates plot, which can look entirely different depending on which lines are plotted on top of which other lines. (The charts on the right are identical except for the order in which the lines are plotted.) Common strategies include "jittering" and varying transparency. Many of these strategies have issues of their own.
  • They also point out that the look of many multivariate charts (such as mosaic charts) depends on the sorting of the data. This is a key weakness of many such plots. Just think about this the next time you create a stacked column chart.
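To make the "jittering" strategy concrete, here is a minimal sketch in Python. The function name `jitter`, the noise width, and the sample values are all assumptions made for illustration; none of them come from the book.

```python
import random


def jitter(values, width=0.2, seed=0):
    """Add small uniform noise so identical values no longer sit on top of each other."""
    rng = random.Random(seed)  # fixed seed so the same chart is drawn every time
    return [v + rng.uniform(-width, width) for v in values]


# Hypothetical data: three points at 3 and two at 4 would overplot exactly.
points = [3, 3, 3, 4, 4]
jittered = jitter(points)
```

The trade-off is visible in the code: every plotted position is now slightly wrong, which is one of the issues this strategy brings with it.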

The book is divided into two sections: Principles and Examples. The second half, the Examples section, consists of case studies in which the authors show how to investigate the structure of a given data set.

The example of using the fatty-acid contents of Italian olive oils to deduce their regional origin is a good visualization of how the statistical technique of classification trees works. Here is the telling diagram:

Igdaimg002

Notice that data with the same color are oils from the same region, the rectangular sections are results of the statistical classification procedure, and we would like to see most (if not all) of the data within each section having the same color.

***

Without a doubt, graphics designers should be aware of the issues raised by these authors. The book appears to be written for students who are creating statistical software (complete with end-of-chapter exercises). I'm left wondering what users of graphics software can do with this information because much of this material relates to the design of graphics software. Knowing these issues makes you want to do things the software may not be designed to do efficiently. For example, most software packages I have used do not have a simple toggle to sort categorical variables by various means (alphabetical, increasing or decreasing frequency, increasing or decreasing value of another variable, etc.).
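For what it's worth, the sorting toggle described above is not hard to sketch outside a charting package. The following Python is only an illustration: the function `order_categories`, its parameter names, and the toy data are hypothetical, and "value of another variable" is interpreted here as the mean of that variable within each category.

```python
from collections import Counter, defaultdict


def order_categories(labels, by="alpha", values=None, descending=False):
    """Return the distinct categories in the requested order.

    by="alpha"     -> alphabetical
    by="frequency" -> number of rows in each category
    by="value"     -> mean of a second variable within each category
    """
    if by == "alpha":
        key = lambda c: c
    elif by == "frequency":
        counts = Counter(labels)
        key = lambda c: (counts[c], c)  # category name breaks ties
    elif by == "value":
        if values is None:
            raise ValueError("by='value' needs a values list")
        grouped = defaultdict(list)
        for label, value in zip(labels, values):
            grouped[label].append(value)
        means = {c: sum(v) / len(v) for c, v in grouped.items()}
        key = lambda c: (means[c], c)
    else:
        raise ValueError(f"unknown ordering: {by}")
    return sorted(set(labels), key=key, reverse=descending)


# Hypothetical data: one label per row, plus a second variable.
labels = ["b", "a", "b", "c", "b", "a"]
values = [1, 10, 1, 5, 1, 10]
alphabetical = order_categories(labels)
by_frequency = order_categories(labels, by="frequency", descending=True)
by_value = order_categories(labels, by="value", values=values)
```

The returned list can then be handed to whatever draws the chart as the category order.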


Further views of unemployment

Instead of looking at unemployment rates across the 50 states plus D.C., we can look at patterns in the ranking of the states. Such rankings are most effectively visualized as multi-period bumps charts.

Funny thing... the variations in rankings over time are very severe! So much so that if all 50 states are plotted on the same chart, we get a completely entangled mess.

Jc_unemploy_all_bumps

Here, I restricted the time period to 2000 onwards, and only the January unemployment rate for each year is plotted. Otherwise, the mess gets messier still.
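The data preparation behind a bumps chart is just a per-year ranking. Here is a rough Python sketch of that step; the function name `ranks_by_year`, the tie-breaking rule, and the rates are assumptions for illustration (the numbers are made up, not BLS figures).

```python
def ranks_by_year(rates):
    """Map {year: {state: rate}} to {state: [rank in each year]}.

    Rank 1 is the lowest unemployment rate; ties are broken by state name.
    """
    ranks = {}
    for year in sorted(rates):
        ordered = sorted(rates[year], key=lambda s: (rates[year][s], s))
        for position, state in enumerate(ordered, start=1):
            ranks.setdefault(state, []).append(position)
    return ranks


# Made-up January rates for three placeholder states.
january_rates = {
    2000: {"State A": 4.0, "State B": 5.0, "State C": 3.0},
    2001: {"State A": 6.0, "State B": 5.5, "State C": 3.5},
}
trajectories = ranks_by_year(january_rates)
```

Each state's list of ranks is one line on the bumps chart, with the years along the horizontal axis.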

But don't give up! The value of such a chart instantly appreciates just by adding a color, as in the following, which sets the Western states apart from the rest of the union:

Jc_unemploy_west_bumps

In these charts, the worst ranks (higher unemployment) are placed higher. We see that Utah has climbed down the rankings during the recession, indicating that its employment situation has improved relative to other states. On the other hand, California has been a laggard pretty much the entire decade -- while its current rank is bad, it isn't that much worse than earlier in the decade.

It doesn't really matter which chart type one uses; it is a certainty that the designer must make choices as to which data to expose. Instead of plotting every state, here is a manageable chart that takes 10 randomly chosen states, comparing the trajectories of their unemployment rankings in the last decade:

Jc_unemploy_10rand_bumps

What do we see here? Little North Dakota has been a star throughout most of this decade. Michigan has rapidly declined and is lingering at the back of the pack for three straight years. Florida experienced big ups and downs, with Alabama following a very similar trajectory. Poor Mississippi has been behind throughout the decade.

***

I love it when I write a post, and the chart designer pops in and provides his/her point of view. That's one of the things that keep me going. I appreciate the very substantive comments on my last post, and will respond soon with further comments. Thanks for reading!


Different pictures of unemployment

Unemployment and job losses being such a worrying social problem in the U.S., one can find many attempts to visualize the predicament. In this post, I will look at two widely circulated charts, and some design decisions behind these charts.

First up, Slate uses an interactive map. (Click on the link for interactivity.)

Slate_jobsAug09

Here, county-level data is being plotted, with the size of the bubbles indicating the number of jobs, red for jobs lost, blue for jobs gained, all computed year on year for a given month.

As you play with this display, think about the first question of the Trifecta checkup: what is the practical issue being addressed by this chart? What is the message the designer wants to convey?

Most likely, the answer will be something like the progress of job losses between 2007 and 2009, or which parts of the country are most affected by job losses.

Is this display the best at illuminating these issues? The designer has chosen the map to illustrate geography, and interactivity to illustrate time. These are not controversial -- but they should be controversial.

Maps are over-used objects. We see the biggest circles always in California, along the Eastern seaboard and in the lake region. This is true pretty much 90% of the time. What we are seeing is the distribution of population across the U.S. What we are not seeing is how job losses affect different regions on the right scale. The bubbles in California are almost always larger than those in the Midwest because there are more people in California.
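One possible reading of "the right scale" is to divide by population before comparing regions. The Python below is a sketch under that assumption; the helper `change_per_thousand` and the two counties are invented for illustration, and dividing by the number of jobs a year earlier would be an equally reasonable choice.

```python
def change_per_thousand(job_change, population):
    """Jobs gained (+) or lost (-) per 1,000 residents."""
    return 1000 * job_change / population


# Hypothetical counties; the numbers are invented for illustration.
counties = {
    "Big County": {"job_change": -50_000, "population": 5_000_000},
    "Small County": {"job_change": -5_000, "population": 100_000},
}
scaled = {
    name: change_per_thousand(c["job_change"], c["population"])
    for name, c in counties.items()
}
hardest_hit = min(scaled, key=scaled.get)
```

On raw counts the big county would get the bigger bubble; on the scaled measure the small county is hit five times as hard.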

***

On the time dimension, the designer has chosen to use monthly data but only for the three years 2007-9. However, when this is multiplied hundreds of times by the county dimension, it is simply impossible for readers to grasp any trends from the interactive chart. We can learn the aggregate trajectory of when job losses start to pile up, when the recession deepens, etc. but since you are living through this recession, you don't need this map to tell you that.

It is in fact alright for the designer to collapse the time dimension! Look at the following chart used by the Calculated Risk blog, which displays a similar data set (unemployment rate rather than jobs gained/lost).

StateUnemploymentRateJuly2010

Notice that this designer collapsed both the time and geography dimensions. Time is partially present inside the boxes, as the maximum, minimum and current unemployment levels being plotted correspond to certain years in the past. The max and min are picked from data stretching back to 1976, a much longer period than the Slate chart. Geography is at the state level, rather than the county level (even though county-level data is available). The states are sorted by the current level (July 2010) of unemployment.

The purpose of this designer is much easier to identify. For states like Nevada and California, the current situation is at the historical worst while for the Dakotas, they have seen much worse before.

If, for example, we want to know if different regions in the U.S. show discernible patterns, all we need to do is to color the boxes differently for different regions.

***

A problem with using the range (maximum and minimum) is outliers. The maximum or minimum values could be outliers. Put differently, the blue boxes shown above, while containing all unemployment rates going back to 1976, may not tell us much about the typical unemployment rate. What we might want to know is what the unemployment rate is like for most years.

For this, we can convert the max-min boxes into Tukey's boxplots.

Jc_StateJobs_boxplot

In a boxplot, the box (gray area) contains half of the historical data. So if you look at DC (third from the bottom), unemployment in most years is narrowly constrained to about 6 to 8 percent, although the max-min range is from under 5 to above 12.

For this chart, I sorted the states by median unemployment (black line inside the box) and the blue asterisks indicate the current level of unemployment (June 2010). Data comes from the BLS website.

Again, if regional differences need to be exposed, the boxes can be colored differently.

The outliers are plotted as dots on these boxplots; that too is data that may be considered extraneous to our purpose for this chart.
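The difference between a max-min box and a boxplot comes down to which summary numbers are kept. Here is a small Python sketch using the standard library; the function name `five_number_summary` and the rate history are hypothetical, and the "inclusive" quartile method is just one of several conventions.

```python
from statistics import quantiles


def five_number_summary(rates):
    """Return min, quartiles and max; the q1-to-q3 box holds the middle half."""
    q1, med, q3 = quantiles(rates, n=4, method="inclusive")
    return {"min": min(rates), "q1": q1, "median": med, "q3": q3, "max": max(rates)}


# Hypothetical annual unemployment rates (percent) for one state.
history = [5, 6, 6, 7, 7, 7, 8, 8, 12]
summary = five_number_summary(history)
```

In this made-up series the range runs from 5 to 12, but the box only spans 6 to 8, so a single bad year no longer dominates the picture.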

***

Is it a horrible thing for the designer to collapse dimensions like this? The data is available, so shouldn't all of it be used?

The truth is one can never cram all the data into a single chart. Even the Slate chart has collapsed some dimensions, namely the unemployment rates by demographics (age, gender, race, etc.) and by industry sector. Arguably those dimensions are as interesting as time and geography.

The bottom line: don't try to use every piece of data. You can't anyway. You will be making choices as to which dimensions to expose and which to hide, so choose wisely.

***

Thanks to Aleks for pointing to the Visualizing Economics blog, which collects graphs about the economy, from where I found these charts.


Some links

If you have submitted links to me in the past few months, you will see them posted in the next few weeks; I just spent some time looking at all the submissions.

***

Here are some links that are slightly off-topic (though still interesting), and others I don't intend on writing full posts about:

  

Slate_wardead

Daniel L. sent us to Slate, where they posted this chart counting up the human cost of the Afghan War. Applying the Trifecta checkup, he gave this evaluation:

What is the practical question:  I have no idea
What does the chart say:  I have no idea
What does the data say:  I have no idea

The time series thing coupled with poor use of color obscures whatever patterns you could pick up. 

Daniel is right about the last point - by plotting the disaggregated data, readers are forced to stare at the variability of casualties over time, and the progress of the war, which distracts from the idea of "accounting for the dead".

Daniel also argues, and I agree, that this math is meaningless even if done properly.

  

Pagerank

Understanding Google PageRank - Nick calls this an infographic but it contains zero data. Not the kind of thing for this blog but it does a decent job explaining PageRank.

The part about circular links canceling each other out confuses me; it would seem like good blogs should be able to link to each other without being penalized.


Assistedliving

The Ins and Outs of Assisted Living Homes - Ellen G. created this "infographic" explaining what "assisted living homes" are like. Again, not stuff for this blog, as the two bar charts are just tag-alongs that are not well integrated with the rest.

In terms of the charts, please remove 3-D, remove the colors, order the data from largest to smallest, consider a horizontal bar chart with data labels on the left, and title it "the top needs for assisted living residents".








Stone-age graphic

This Economist chart on the history of world GDP throws the art of graphics back several hundred years. (Thanks Tyler A. for the link.)

Econ_china

And I can't really re-make it since I can't make heads or tails of it.

  • How are the columns sorted? (on second thought, maybe the 70 should read 1870, 13 is 1913, and so on?)
  • Why are there differing gaps between columns?
  • Italy was not a country, and the US was definitely not in existence in AD1 so what does it mean to have values for those on the chart? If this is created by taking current-day boundaries and projecting back in time, why are today's boundaries treated as sacrosanct?
  • If the columns are sorted chronologically, a line chart would be much more readable. At the minimum, it will reduce the number of colors to 1. Note that multiple colors are necessitated by the choice of a stacked column chart.
  • A stacked column chart with percentages should always extend to 100%. The current chart is very misleading if we want to know the percentage of world GDP produced by "other countries".
  • How are the countries ordered within a column? It's neither alphabetical, nor by the starting or ending distributions.
  • Don't challenge readers by having vertically stacked categories and a horizontal legend.
  • It would also be much better if there are annotations to help the reader understand the chart, e.g. collapse of the Roman Empire, Renaissance, Great Depression, Big Fire, etc.

PS. [8/18/10] Dustin linked to a line-chart version of this chart, from the World Bank site, via Chartporn.

New_worldgdp

I think the evidence is right here as to why the Economist execution leaves a ton to be desired. The use of lines allows the reader to easily trace the rise and fall of different economies, which is the point of the data set. The stacked-column chart draws attention to a point-in-time distribution of GDP among different countries, which is of secondary importance.

There are other differences: this plots the share of "growth" as opposed to the share of total GDP. It also plots regions rather than countries (well, except for China and Japan). It does not presuppose that the US was in existence before its founding. It could have (should have) included a "rest of the world" line.

The spacing of the years is still problematic, but it's an Excel inconvenience, really. It's ok to stretch the axis on a line chart; it's a problem to do it with a column chart, as demonstrated above. The gaps between columns should be proportional to the years between the data but this is impossible to do in a column chart.


Unrecoverable error

This belongs to the light entertainment category. My former classmate Alan decided to give me an impossible challenge: how to improve this hopeless chart from a Chinese publication... a study of the consumption patterns of Chinese born after 1980.

Here it is (with my translation):

Cyzone2b 

In the article, they stated that "Singles born after 1980 have different values from those born before, mainly in three areas: desire of independence, seeking adventure, and valuing friendship". (I guess they decided to call people born before 1980 traditionals.)


Eye heart this

Dan at Eye Heart New York has a fantastic post relating to the recent release of restaurant health inspection data by New York City. This has caused a furor among the restaurant owners because they are now required to wear their A/B/C badges front and center. Dan collected some data (which he also posted), made some charts, and reported some interesting insights.

Here is an overview chart that shows the distribution of scores (the higher the score, the lower the grade). He called it a "scatter plot" but it is really a histogram where the bucket size is 1 except for the rightmost bucket.

Chart-scores-colored-nycfood
 

I like the use of green, yellow and red colors to indicate (without words) the conversion scale from scores (violation points) to grades (A/B/C). The legend "Count" is an Excel monstrosity. I'd have used a bucket size of at least 5, which would smooth out the gyrations in the green zone.
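To show what a wider bucket does, here is a minimal Python sketch of the binning step. The function name `bucket_counts` and the scores are assumptions for illustration; they are not Dan's data.

```python
from collections import Counter


def bucket_counts(scores, width=5):
    """Count scores per bucket; each key is the lower edge of its bucket."""
    counts = Counter((s // width) * width for s in scores)
    return dict(sorted(counts.items()))


# Made-up violation points for eight restaurants.
scores = [0, 2, 4, 5, 9, 13, 13, 27]
by_five = bucket_counts(scores)          # buckets 0-4, 5-9, 10-14, ...
by_one = bucket_counts(scores, width=1)  # one bucket per score value
```

With a width of 1, every small quirk in the scoring shows up as its own bar; a width of 5 pools neighboring scores and evens those out.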

A more typical way to summarize numeric data in groups is Tukey's boxplot, as shown below.

Tukey_boxplot 

I use Dan's raw data on this chart. 1 = A, 2 = B, 3 = C. What is group 4?

It turns out Dan has removed this group from all of his analysis. A little research shows that group 4 consists of restaurants that have been closed by the Dept of Health. Interestingly, the scores of these restaurants are spread widely so the DOH appears to be closing restaurants not just for health violations. (In the rest of this post, I have removed group 4.)

For those not familiar with box plots, the box contains the middle 50% of the data (in this case, the scores of the middle half of the restaurants in the respective group); the line inside the box is the median score; the dots above (or below, though nonexistent here) the vertical lines are outliers. As Dan pointed out, group C has lots of outliers on the high end of the score.
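Which points get drawn as outlier dots depends on a rule. A common convention, assumed here, is Tukey's 1.5 x IQR fences; software packages differ in how they compute quartiles, so this Python sketch may not reproduce the chart above exactly. The function name `tukey_outliers` and the scores are hypothetical.

```python
from statistics import quantiles


def tukey_outliers(scores, k=1.5):
    """Return the scores lying beyond k * IQR from the box edges."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [s for s in scores if s < low or s > high]


# Made-up scores for one grade group; 70 sits far above the rest.
group_scores = [28, 29, 30, 31, 32, 33, 34, 35, 70]
dots = tukey_outliers(group_scores)
```

Here the box runs from 30 to 34, the fences sit at 24 and 40, and only the score of 70 would be plotted as a dot.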

Score111

Just for fun, I pulled the violations of the highest scoring restaurant (111 violation points). What I find intriguing is the huge fluctuation in scores over the last 5 inspections. Does this happen to other restaurants too? What does that say about the grading system?

 


***

Next, Dan attempted to address the questions: did scores vary across the 5 boroughs? and did scores vary across cuisine groups? This is the concept covered in Chapter 1 of my book: always look at the variation around averages, that's where the most interesting stuff is.

He calculated the means and standard deviations of different subgroups. It is simpler to visualize the data, again using boxplots.

Here's one dealing with boroughs, and it is clear that there is not much to pick between them. You could possibly say Staten Island is better than the other 4 boroughs.

Redo_scorebyborough

Here's one dealing with cuisine groups, using Dan's definitions.

Redo_scorebycuisinggroups

The order of the cuisine groups is by median score from lowest on the left to highest on the right. Again, there is no drastic difference. It is certainly not the case that Asian/Latin American restaurants are worse than say European or American ones.

About half of the restaurants under desserts, drinks, misc., african, and others received As while a bit less than half of the other cuisine groups got As. Some of the cuisine groups had few egregious violators (African, Middle East) - but this data is perhaps skewed by the removal of the "closed" restaurants.

One shortcoming of the traditional boxplot is the omission of how large each group is. For groups that are too small, it is difficult to draw any statistical conclusions. We know from Dan's table, for instance, that there were only 17 restaurants classified as "African".
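One way around this shortcoming is to report the group size next to each box. A minimal Python sketch follows; the function name `summarize_groups`, the record layout, and the values are assumptions for illustration, not Dan's table.

```python
from statistics import median


def summarize_groups(records):
    """Map (group, score) pairs to {group: {"n": size, "median": median score}}."""
    groups = {}
    for group, score in records:
        groups.setdefault(group, []).append(score)
    return {g: {"n": len(s), "median": median(s)} for g, s in groups.items()}


# Made-up records: three restaurants in one group, a single one in another.
records = [("Group X", 10), ("Group X", 12), ("Group X", 14), ("Group Y", 30)]
summary_by_group = summarize_groups(records)
```

Printing the n under each category label warns the reader off over-interpreting a box built from a handful of restaurants.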

(Unfortunately, Excel does not have built-in capability for generating boxplots.)


Update: Seminar at Columbia

This is a cross-post from the book blog.

For those in the New York area, I will be giving a talk tomorrow (Aug 11, Wed) at noon at Columbia's EdLab. The talk will cover a topic from the book, and what about it is not typically discussed in statistics courses. See here for an abstract.


Light entertainment: a world of spikes

Something light to start the week... this infographic poster submitted by Curtis R. has to be seen to be believed. Prepared by the Rate Rush website, it compares Digg and Reddit, two services that rank and track the popularity of web pages. They were in the news a few years back; do people still use Digg or Reddit?

Here is a section of the chart:

Diggvsreddit 

Oh, and if you scroll down further, the designers received some appropriate feedback and re-did this chart:

Raterush-digg-reddit-number-per-hour 

I have to commend them for responding to reader suggestions. This line chart is obviously much better, and we can see that Digg has more front-page stories than Reddit at any time of the day. (Please put the data-series labels next to the lines on the right. And fix the nonsensical decimals on the gridlines.)

The fact that there is a huge gap between Digg and Reddit during the early morning hours could indicate that Reddit users tend to visit the site at work, or it could indicate that Reddit's algorithm realizes that there is no need to update the front page as often when the traffic is slow, or it could be some data processing error. It's something worth investigating.

Here's one more for further amusement:

Raterush_pies

Curtis reacts to this spiky chart:

2 pie charts, tilted at different angles (making it impossible to accurately judge the size of each sector), with a color legend that switched from chart to chart (e.g. imgur is blue in the reddit chart, gray in the digg chart).