
Auditory aid

This effort by the NYT graphics team is breathtaking.


They use dot plots to visualize the closeness of the finishes at many of the Winter Olympics races.

A small improvement is to organize the plots into two columns (men, women) so that readers can compare men and women across a row, and compare different events within a gender down a column.

What really sets this chart apart is the appeal to auditory aid.  Click Play and see what I mean.


Reference: "Fractions of a second: an Olympics musical", New York Times, Feb 26 2010.

Cousin misfit

Stef, who had a hand in the inkblot charts that many loved, sent in the following chart, with the note that he hadn't seen these line/area charts before.

This chart is interesting indeed. The objective of the chart is to compare the state of drinking water in different regions of the developing world. It tries to emphasize the amount of improvement attained between 1990 and 2006.

I can't quite figure out how the regions are ordered. It's not by any of the proportions depicted in the chart, nor by position on the map.

Next, with the areas catching much attention, I wanted to figure out what the areas mean.

To help in this exercise, I computed the key piece of information, i.e. the increase or decrease in proportion of each water source, and placed it in each piece of area, as shown below.



Based on this evidence, one has to conclude that the area has nothing to do with the change in proportions over time. The brown areas (unimproved sources) are negative changes while the blue and light blue areas are positive changes.  Negative area is not a visually depictable concept, unfortunately.

Also, note the dark blue areas of Latin America vs Western Asia.  The Western Asia area is a bit larger than the Latin America one, but the changes in proportions run exactly the other way, 7% against 23%.

Is this a new type of chart? It took me a few days to figure it out.




How is the following chart related to the above chart?


The original chart is a cousin misfit of the above chart, as we can see below.

The key piece of data is embedded in the slopes of the connecting lines, and this cousin of the column chart with connecting lines draws our attention away from those lines and to the areas. The colored areas are in no way proportional to the slopes of the connecting lines and so the information has been distorted.
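To see why the areas cannot stand in for the slopes, here is a minimal sketch. It treats each colored band as a trapezoid between the 1990 and 2006 levels, which is my simplification of the chart, and the two regions and their levels are made-up numbers chosen only so that the changes come out to +7 and +23 points. The function names are mine.

```python
def band_area(start, end, width=1.0):
    """Area of a band whose height runs linearly from start to end (a trapezoid)."""
    return (start + end) / 2 * width


def change(start, end):
    """The quantity the reader actually cares about: the change over the period."""
    return end - start


# Hypothetical shares of population (percent) in 1990 and 2006.
regions = {
    "Region A": (80.0, 87.0),  # high level, small gain (+7)
    "Region B": (55.0, 78.0),  # lower level, large gain (+23)
}

for name, (y1990, y2006) in regions.items():
    print(name, "area:", band_area(y1990, y2006), "change:", change(y1990, y2006))
```

The area depends on the average of the two levels, while the change depends on their difference, so a region can have the bigger area and the smaller improvement at the same time.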

Nice-looking chart but needs rethinking.


PS. Some commentators seem to think that I suggested that the paired column charts would be a better alternative than the original. No -- I am using charts to analyze charts. An improved chart would be like the following, in which the areas are de-emphasized in favor of the lines. (Please imagine the vertical axis.)



The graphs in this BBC article comparing several recent earthquakes hit us like aftershocks.

This chart tries to inform us that the quake in China was by far the largest. (The Richter scale is logarithmic: each whole-number step corresponds to a tenfold increase in amplitude.)

The spirals feel like the Austin Powers time machine, disorienting, and also distracting because the bubble chart uses the entire area of the circles to represent magnitude.  Try to guess what the relative amplitudes are before I disclose them below. (The red spiral for Italy was arbitrarily chosen as the index, with relative amplitude 1.) Bubbles are just horrible constructs, and for such a simple chart, they are worse than printing the data.

Amazingly, this is a double-axes bubble chart! The spirals hide the fact that the three gray circles are of different sizes, presumably color-coded to fit the "Strength" of the quakes.  The other axis is "Relative Amplitude" represented by the red circles. Even though the two metrics are on hugely different scales, both the gray circles and the red spirals were anchored off the Italy red spiral (area = 1).

The following junkart version, which places the three quakes relative to the underlying relationship between strength and amplitude, is more informative with less fuss.



In the next chart, the Italians are shown to have no math skills (when in fact they have a strong tradition in math). How is it that 295 and 2000 have equal-sized bars? That's because the selected scale does not fit the data.

It's a mystery why Deaths and Injuries make friends while they ostracize the Homeless.  The three series (deaths, injuries and homeless) can be displayed separately.

A simple data table, with appropriate highlighting, gets this information across without the confusion.


This next chart is decent.  It would be more effective if they made the Italy and Haiti blocks 20 across (same as the China blocks), stacking them one over the other. By doing so, the chart reduces to one dimension and we do not need to judge areas.

I think there is a calculation error with the Italy numbers. If 1 in every 190 affected died, then the number of affected is 190 x deaths, which from the above bar chart, equals 56,000. If only 56,000 were affected, how could 1.5 million be left homeless? (Wikipedia said 65,000 were made homeless.)
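The arithmetic behind that check, as a quick sketch. I am assuming that 295 is the Italy death count shown in the bar chart (190 x 295 gives roughly the 56,000 quoted above); the variable names are mine.

```python
deaths = 295               # assumed: Italy deaths as read off the bar chart
affected_per_death = 190   # "1 in every 190 affected died"
claimed_homeless = 1_500_000

# If 1 in 190 of the affected died, the affected population is 190 x deaths.
implied_affected = deaths * affected_per_death
print(implied_affected)

# How many times larger the homeless figure is than everyone affected.
print(round(claimed_homeless / implied_affected, 1))
```

The homeless count comes out more than twenty times larger than the implied number of people affected, which cannot be right.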

Overlapping non-concentric bubbles are also in need of rescue. Bubbles encode data in areas, and an area is a square function of the radius, the distance from the center to the circumference. When circles are not concentric, the centers do not coincide. This makes judging radii harder, which makes judging areas harder.

Look at Haiti vs. Italy. According to the printed data, the light gray area is about 60, which would be 40% of the dark gray circle. Who would have guessed? (I checked the areas, and indeed the Haiti area was 40% larger than the Italy area.)

By the way, in the first chart, the relative amplitudes were 40, 1 and 5.  Who would have guessed?
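A small sketch of why those numbers are so hard to guess from bubbles. It assumes the three values map to China, Italy and Haiti in that order (China being the largest and Italy the index), and it takes the Richter scale as base-10 logarithmic in amplitude. The function names are mine.

```python
import math

# Relative amplitudes quoted above; the mapping to countries is my reading.
relative_amplitude = {"China": 40, "Italy": 1, "Haiti": 5}


def radius_for_area(value, base_radius=1.0):
    """Radius of a circle whose AREA is proportional to value."""
    return base_radius * math.sqrt(value)


def magnitude_gap(amplitude_ratio):
    """Difference in magnitude implied by an amplitude ratio on a log10 scale."""
    return math.log10(amplitude_ratio)


for place, amp in relative_amplitude.items():
    print(place, "radius:", round(radius_for_area(amp), 2),
          "magnitude gap vs Italy:", round(magnitude_gap(amp), 2))
```

A 40-fold difference in the data shrinks to a radius only about 6.3 times larger, and to a gap of about 1.6 points of magnitude, so the eye gets little help from either the circles or the "Strength" numbers.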

Andrew Gelman recently talked about good graphics being hard. Graphs are easy to make but hard to perfect. These examples show the need for care.

Reference: "Why did so many people die in Haiti's quake?", BBC News, 14 February, 2010.

The flattened staircases

Elizabeth left a comment on a previous post, pointing to this NYT chart comparing this snow season with last on the East Coast.


Quite a lot can be read from this small multiples chart of cumulative snowfall:

  • Washington, Baltimore and Philadelphia endured extraordinary amounts of snow that ranged from 4 to 8 times those of the last season
  • New York pretty much had the same amount of snow up to this point in time although the snow came earlier this season
  • Boston had a milder season. By the way, because of regional differences, it is just wrong to conclude that there is "global cooling". Evidence of that must take into account what is happening across the globe, not just in the east, or southeast, of the U.S.
  • Snow also came much earlier in Washington, Baltimore and Philadelphia although not so in Boston.

I find the placement of the city names odd. But the one thing keeping this a good chart, not a great chart, is the fumbling of the average. They need to plot the historical snowfall accumulation, not the total average snowfall.  As it is, the red line makes the unconvincing case that historically, the entire season's snow came in November. 

The chart can be fixed by either turning the red lines into red dots at the end of the time window, or by plotting another series of staircase charts which can be directly contrasted with the blue lines. Best yet, plot the past 10 to 20 seasons' staircases in the background to show both the historical average and variance.
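Here is a rough sketch of what "historical snowfall accumulation" would mean in data terms. The snowfall figures are invented for illustration (three past seasons, five periods each), and the function names are mine; the real chart would use the actual records.

```python
from itertools import accumulate


def staircase(snowfall_by_period):
    """Cumulative snowfall: the running total that draws the staircase."""
    return list(accumulate(snowfall_by_period))


def mean_staircase(seasons):
    """Average of the cumulative totals across seasons, period by period."""
    stairs = [staircase(season) for season in seasons]
    return [sum(values) / len(values) for values in zip(*stairs)]


# Made-up snowfall (inches) per period for three past seasons.
past_seasons = [
    [0, 2, 0, 5, 1],
    [1, 0, 0, 3, 4],
    [0, 0, 6, 0, 2],
]

historical = mean_staircase(past_seasons)
print(historical)
```

The averaged staircase climbs through the season and only reaches the season total at the end. A flat line drawn at that total from the first day is what makes it look as if all the snow fell in November.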


Welcome to the new Junk Charts. 

Apart from the facelift (thanks to my friend Amanda), I have added a sister blog which will focus on statistical thinking in everyday life, the theme of my book. The original Junk Charts will continue to examine graphics and data presentation in the mass media.


Color coordination

Brad DeLong called this "Graph of the Day", and it's been circulating in the blogosphere (Jeff Weintraub, Washington Post).


This innocent-looking thing does a good job hiding its defects.

Readability can be much improved by merely moving the subgroup labels ("All", "Democrats", etc.) to the left border of the chart.

The proportion of respondents who are Democrats, Independents and Republicans, if printed on the chart, will help us understand how the bottom three bars relate to the top one. As it is, we can reason that there were roughly equal proportions of Democrats and Republicans (the Independents are somewhat like the overall average, and the overall average is roughly at the midpoint between the Republican and the Democrat numbers).
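The reasoning here is just a weighted average, sketched below. The 88% and 42% are the shares of Democrats and Republicans favoring reform as quoted elsewhere in this post; the Independents' rate and all the party shares are placeholders I made up, since the chart does not print them.

```python
def overall_rate(shares, rates):
    """Overall percentage as the share-weighted average of subgroup percentages."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(shares[group] * rates[group] for group in shares)


# 88 and 42 come from the post; 65 for Independents is an assumption
# (set at the midpoint, i.e. "somewhat like the overall average").
rates = {"Democrats": 88.0, "Independents": 65.0, "Republicans": 42.0}

# Hypothetical party shares: equal numbers of Democrats and Republicans.
balanced = {"Democrats": 0.35, "Independents": 0.30, "Republicans": 0.35}
# Hypothetical party shares: many more Democrats than Republicans.
skewed = {"Democrats": 0.50, "Independents": 0.30, "Republicans": 0.20}

print(round(overall_rate(balanced, rates), 1))
print(round(overall_rate(skewed, rates), 1))
```

With equal numbers of Democrats and Republicans the overall figure sits at the midpoint of the two party numbers; tilt the mix and it moves toward the larger party. That is why an overall bar near the midpoint suggests a roughly even split.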

And the colors!! Where to begin?

It used to be that color was banished from "good" graphics because it was considered unnecessary, and more trouble than it's worth.  We now embrace color with moderate enthusiasm.  This chart shows why colors should be used judiciously.

If asked to provide a color key to this chart, one may come up with the following:


It appears that 8% of Democrats have been banned from wearing party colors because of their white flag on health care; ditto the 42% of Republicans who wanted health care reform.

The colors should be coordinated with the structure of the data. Here, the data has three dimensions: the answer to the question, the party affiliation, and party v. overall average.  For this graph, I reckon the answer to the question is the most important dimension.

Finally, I think by highlighting the 88% of Democrats and 55% of Republicans, the chart actually missed the information in the data: that 42% of Republicans actually supported health care reform -- my impression is that the 55/42 split among Republicans makes this a much more bipartisan issue than most other issues in U.S. politics.  In other words, the fact that most Democrats support the liberal position and most Republicans support the conservative position is just not news.

Reference: "Poll: bipartisanship popular, compromise tricky", Washington Post, Feb 9, 2010.

Four challenges, five principles

At one level, Numbers Rule Your World (see here) is a primer on statistical thinking. If you are reading Junk Charts, you already know its importance.

In putting together the book, I gave myself these four challenges:

  • No equations

In order to make the book accessible to as many as possible, I borrowed the story-telling style of Freakonomics and The Tipping Point. Transforming numbers into words comes naturally to me after working in business analytics for a long time; readers of Junk Charts will recognize how I always look for the message behind the numbers. And no equations means no equations.

  • No toy problems
No Monty Hall problem, no birthday problem, no urn problems, no St. Petersburg paradox.  These topics have been covered well by others, and while they are good for teaching, they are ultimately unrealistic.  I want to cover statisticians who have harnessed real data to make socially important decisions, such as telling us what makes us sick, setting insurance rates, evaluating SAT questions, catching thieves.
  • Long-form stories
The book is organized around five statistical principles, with a pair of stories illuminating broad aspects of each principle. Each story is developed in rich detail, bringing out the players, the background, the numbers, the conflicts that form the process of applying science. In this way, the book's structure is different from The Tipping Point and Freakonomics, which can sometimes feel episodic.
  • Contemporary examples

Nearly everything in the book occurred in the last two or so decades.  Statistical thinking is evergreen.  I left the standard historical examples on the cutting-room floor -- Fisher's tea-tasting ladies, Student's brewery experiments, Galton's regression studies, etc. (For those who enjoy history, and can read math, Stigler's two books on the history of statistics are not to be missed.)


Here are the five key principles:

1. The discontent of being averaged: Always ask about variability

2. The virtue of being wrong: Pick useful over true

3. The dilemma of being together: Compare like with like

4. The sway of being asymmetric: Heed the give-and-take of two errors

5. The power of being impossible: Don't believe what is too rare to be true

Come back for more on these, or get the book.


Book available from Amazon, B&N, Borders, and also worldwide

The Book

One of the reasons I started Junk Charts five years ago was to create an outlet for my writing, to manufacture a reason to write regularly.  Writing has been important to me since I was young, but for a long time, I did not have the discipline to sit down and produce.  Once the blog found its legs, the need to satisfy a responsive, attentive, demanding audience dissolved the problem.

Then I got more ambitious: the blog experiment soon parlayed into a book project.  The book project started close to Junk Charts but eventually took its own path.  While Junk Charts focuses on how data is, and should be, presented in the mass media, and how the media influences public opinion through such presentation, the book takes a more expansive look at how data-driven decision-making occurs in our world: how statistical thinking shapes our work life, our health, our economy, our education, etc., often in hidden and surprising ways.

The title is "Numbers Rule Your World: The hidden influence of probability and statistics on everything you do."  It is published by McGraw-Hill; the official publication date is February 19.  It is already available from various online retailers. (Amazon, B&N, Borders)  My neighborhood bookstore has it out already.


This is a timely topic.  So many current events are influenced strongly by statistical thinking:
  • Counter-terrorism: how do statistical technologies, such as profiling and prediction via data mining, work? what are the benefits and costs of such methods? what is the risk of dying in a terrorist attack relative to other risks?
  • Toyota recall: do recalls work? how many deaths do they prevent, and how much economic loss do they cause? what is the risk of dying due to the brake defect relative to other risks? how can one establish the brake defect as the cause of prior accidents?
  • Mammogram guidelines: what is the statistical science behind the revised guidelines? why is this not a black and white issue? how should we evaluate research studies of this kind?
  • Economic statistics: what is the "seasonal adjustment" used to modify employment data? what is the birth death model? should one look at quarter to quarter comparisons or year to year? why is it useless to compare same-store sales statistics for retailers in the past few years? 
  • Efficacy of anti-depressants: what is the placebo effect? do anti-depressants work for some but not for others? how are clinical trials conducted, funded and reported? 
  • Swine flu response: what is the risk of dying from swine flu? how should one evaluate the level of risk? How do vaccines work? what is "herd immunity"? what is the "precautionary principle"? why do responses in different countries differ?
  • Autism clusters: how do statisticians tell apart "something small" from "nothing at all"? how can one draw confident conclusions with only a few samples? do they know the cause of such clusters?

Numbers Rule Your World explains, using stories (no formulas), how we can use statistical thinking to think smartly about these issues and more.

In the acknowledgement, I wrote:

throughout this project, I was inspired by fans of my Junk Charts blog 

So you are the first to learn about the book. 

In the next few days, I will offer perspectives on how the book came into being, why I wrote it, and who I wrote it for.  I hope you'll be as excited about the book as I am!

Convention and function 2

Reader Nicholas tried a best-of-both-worlds approach by putting maps and line charts side by side. This will satisfy those who cannot bear to drop the geographical graph paper.

The overlapping lines bring out nicely which states within a given region behave differently from the others.

This is work in progress.  Obviously, the cluster labels (A, G, F, etc.) can be replaced by region names.  The uninformative axis labels can be removed from the maps.

Also, if each state can be colored within a map, and coordinated with the lines on the right, we will learn more about individual states.

Please email me if you would like to get his code.

The tweeting crowd

This work by David McCandless, via the Innovations in Newspapers blog, is fantastic.


Much of its power comes from the delightful use of short, precise data labels: "20 dead", "50 lazy", "5 loud mouths". And I love the "subjective" title.

A few considerations.  The current choice of color, and to some extent the location of subgroups, makes the pinks (dead) and the blues (5% with over 100 followers) stand out. Probably not the intent. The grays are not labeled - not a big deal here since they are not the focus of the chart, and there won't be any short, precise labels for the grays (perhaps the average). Because of the color choice, the grays appear as if they don't belong.

What might work better is to have darker colors on the right side of the chart, and have the colors fade out towards the left (the lazy and the dead).

Also try a 5x20 grid with five blocks. This allows the height of the chart to represent the relative proportions.
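A text-only sketch of that grid, assuming it is 5 squares wide and 20 rows tall with the groups stacked vertically (my reading of the suggestion). The counts for the dead, the lazy and the loud mouths come from the labels quoted above; lumping the remaining 25 into a single gray group is a placeholder, as are the one-letter symbols and the function name.

```python
def stacked_grid(groups, width=5):
    """Stack groups in a grid `width` squares across; each group fills whole rows."""
    rows = []
    for symbol, count in groups:
        if count % width:
            raise ValueError("count must be a multiple of the grid width")
        rows.extend([symbol * width] * (count // width))
    return rows


# D = dead, L = lazy, g = gray remainder (assumed), M = loud mouths.
groups = [("D", 20), ("L", 50), ("g", 25), ("M", 5)]

grid = stacked_grid(groups)
print("\n".join(grid))
```

Because every row holds the same five people, a group's height in rows is its share of the 100, so the reader compares lengths instead of irregular areas.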

David has recently published a beautiful-looking book, only available in the UK currently.  An older book - on visualizing trivia - is available in the US. He has done work for the Guardian and Wired.