Light entertainment: impossible is nothing!

Sep 26, 2012

Reader Julien D. sent in this image, which apparently has been in circulation. This company has a whole series of these ads running.

More illusions

Sep 25, 2012

A reader sent over another example of an optical illusion that chart designers should think about. Here's a 3D column chart:

And the illusion:

Gelman remakes a grouped bar chart

Sep 21, 2012

Like Gelman, I also cannot stand grouped bar charts. In this recent post, he tries making the chart several ways.

A general lesson here is that there really is no one-size-fits-all chart. Multiple line charts often work well as a replacement for grouped bar charts, but in this instance, Andrew didn't like the lines being too crowded together. If the lines (i.e. the data) look sufficiently different from one another, then the line chart might suffice. The point is: just make the chart, look at it, and decide if it's good enough.

I left a comment suggesting that his last chart might be more telling if the data were turned into indices, so that every line starts from the same level and we can observe the percent changes over time.
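To make the indexing idea concrete, here is a minimal sketch in Python. The function name `to_index` and the numbers are my own, made up for illustration; they are not Gelman's data. Each series is rescaled so that its first value equals 100, after which every later value reads directly as a percent of the starting level.

```python
def to_index(series, base=100.0):
    """Rescale a series so that its first value equals `base`."""
    first = series[0]
    if first == 0:
        raise ValueError("cannot index a series whose first value is zero")
    return [base * value / first for value in series]


# Hypothetical values, for illustration only (not the data in Gelman's post).
group_a = [20.0, 22.0, 25.0]
group_b = [200.0, 210.0, 260.0]

indexed_a = to_index(group_a)  # starts at 100, ends 25% above the start
indexed_b = to_index(group_b)  # starts at 100, ends 30% above the start
```

On the raw scale, the second series sits ten times higher than the first, and the growth of the first is hard to see; once indexed, both lines start at 100 and the comparison is about relative change only.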

Insufficiency and illusions

Sep 15, 2012

This WSJ graphic gives me a reason to talk about the self-sufficiency test. Go ahead and block out the data labels on the chart: you are left with concentric circles but no way to learn anything from the chart, neither the absolute dollar values nor the relative dollar values. In other words, the only way to read this chart is to look at the data labels.

The online article does not include the graphic. It's an article talking about Neil Armstrong's death. Here's the same data using bar charts:

The chart would be much improved if a longer time series were included, giving us values for each year. It's pretty clear that this data is subject to sudden jumps (e.g. Armstrong's death), so picking arbitrary years will likely cause us to miss important events.

***

Circles are also subject to various types of optical illusion. Before you use bubble plots, give the following a look:

Can we judge the size of circles in relation to other circles? (credit)

Can we judge the relative distance between circles? (credit)

Can we judge the relative sizes of circles within circles? (credit)

Speaking analytics

Sep 11, 2012

(This is a cross-post from my other blog, as it also relates to data graphics.)

I was a guest on the Analytically Speaking series, organized by JMP. In this webcast (link, registration required), I talk about the coexistence of data science and statistics, why my blog is called "Junk Charts", what I look for in an analytics team, the tension between visualization and machine algorithms, two modes of statistical modeling, and other things analytical.

A winning graphic of early voting

Sep 10, 2012

This pair of WSJ charts I like very much.

The article talks about the effect of early voting during Presidential elections in the States. People are allowed to mail in their votes as early as 2 months before the November 6 election.

The chart on the right identifies all the states that allow early voting, and in particular, it highlights (in orange) the seven battleground states that allow early voting. This shows that the designer is keenly aware of what's important and what's not important on the chart. The states are ordered by the first date of voting, instead of alphabetically. (I do have a question about why several of the gray lines towards the bottom of the chart do not reach November 6. Probably because mail-in voting is closed prior to Election Day in some states...)

If the data were available, a nice addition to this chart would be the distribution of early votes over time. It's useful to see if North Carolina voters tend to spread their mail-in votes evenly over the two-month period, or if most of them get sent close to Election Day, or if there is some other pattern. Changing the bar chart to a dot plot and using the density of dots to indicate frequency would work fine here.

Instead of the first date of voting, the chart would be more informative if it plotted the average date of voting (among mail-in voters). This is because the first date of voting is an extreme value, and there may be few voters who vote on that day. If we have to pick one number to represent all early voters, we should pick the average (or median) voting date. Again, this is constrained by whether such data is publicly released.
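Here is a small sketch of why the first date can mislead. The ballot timings below are invented for illustration (I don't have the real data), and `days_before` is a hypothetical list giving, for each ballot, how many days before the November 6, 2012 election it was cast.

```python
from datetime import date, timedelta
from statistics import median

ELECTION_DAY = date(2012, 11, 6)

# Hypothetical data: days before Election Day on which each ballot was cast.
# One voter mails in on the first possible day; most wait until the last week.
days_before = [60, 30, 10, 7, 5, 3, 2]

# The first date of voting is the extreme value (largest lead time).
first_date = ELECTION_DAY - timedelta(days=max(days_before))

# The median date describes the typical early voter instead.
median_date = ELECTION_DAY - timedelta(days=median(days_before))
```

With these made-up numbers, the first date lands in early September while the median lands a week before Election Day, so a bar that starts at the first date overstates how early the typical ballot arrives.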

***

The chart on the left is also well executed. The title should include the additional fact that only battleground states are depicted. I'd also extend the vertical axis to 100% since the data are proportions. The beauty of this presentation is that it functions on several levels, whether you are interested in knowing that not much changed in Iowa from 2004 to 2008, or the fact that almost 8 of 10 mail-in votes in Colorado were early votes, or that in both Colorado and North Carolina, the proportion of mail-in votes more than doubled between 2004 and 2008.

Neither of these is a fancy chart, but they pack quite a bit of useful information.

Mountain, molehill

Sep 07, 2012

Reader Jordan G. wasn't impressed by an attempt to visualize medal counts by country and sport in the Olympics over 112 years by Christian Gross at Visualizing.org (link).

The author chose to use the metaphor of "mountains" to portray the cumulative medals earned by each country. Each country is treated as a unit in a small-multiples-style presentation (see right). The bars represent different sports, and they are arranged as if arranging lanes in a swimming contest, with the largest haul in the middle, and second largest on the left, third largest on the right, etc.
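The lane-style ordering is easy to sketch. This is my reading of the layout, not the designer's actual code, and the function name `lane_order` is made up: sort the values from largest to smallest, then alternately place them to the left and to the right of the center.

```python
from collections import deque


def lane_order(values):
    """Arrange values so the largest is in the middle, the second largest
    immediately to its left, the third largest to its right, and so on."""
    ordered = sorted(values, reverse=True)
    lanes = deque()
    for rank, value in enumerate(ordered):
        if rank % 2 == 1:
            lanes.appendleft(value)  # 2nd, 4th, ... largest go on the left
        else:
            lanes.append(value)      # 1st, 3rd, 5th, ... go on the right
    return list(lanes)


# Hypothetical medal counts for five sports within one country.
peak = lane_order([1, 9, 5, 3, 7])
```

For the five hypothetical counts above, the result is `[3, 7, 9, 5, 1]`, which is the "mountain" profile: bar heights rise to the center and fall off on both sides.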

This exercise highlights two important considerations from the designer's perspective.

The first is scaling. You'll notice that the first page (for Athens, 1896; excerpted below) is essentially unreadable. This is because the designer uses the same scale for every single page, and because he is plotting the cumulative number of medals over time. Together, these two decisions mean that the early pages have much lower values than the later pages.

It also means that on other pages, the extreme values walk off the edge of the chart area. (I think the reason is that if the scale were tailored to present these extreme values, then pretty much every chart that doesn't contain extreme values would become unreadable.)
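The arithmetic behind the scaling problem can be sketched in a few lines. The per-Games counts here are invented, not taken from the Olympic data; the point is only how a running total interacts with a single shared axis.

```python
from itertools import accumulate

# Hypothetical medals won by one country at six successive Games.
per_games = [2, 3, 5, 10, 20, 40]

# The designer plots the running total, which can only grow over time.
cumulative = list(accumulate(per_games))

# With one shared scale for every page, the axis must fit the final total,
# so each page's bar is drawn as a fraction of that maximum.
shared_max = max(cumulative)
bar_fraction = [total / shared_max for total in cumulative]
```

In this made-up case the first page's bar occupies 2.5% of the axis, which is about as readable as the Athens page. Shrinking the axis to rescue the early pages is what sends the later totals off the edge of the chart.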

The choice of making countries the units (discussed further below) makes for some awkwardness in later years, as the medals became more spread out among more countries. In the first Olympics, only 10 countries won any medals, but in 2008, 127 different countries won at least one medal, among which 50 countries or so had never won more than 20 medals in all sports combined. This skewed distribution causes the designer to break one of the cardinal rules of small multiples, which is that the design of each unit must be the same, with only the data varying. Here, the top countries have their data plotted on a different scale from that of the other countries, as we can see from the differently sized squares. When you mouse over a particular bar for a particular country, that sport is colored red and the corresponding bars are highlighted for every other country. The problem is that the scales are not the same, so the lengths of the bars give us misleading comparisons.

***

The second consideration is pagination. The data set has three dimensions: country, sport and time. In this presentation, the designer places sport within country within time. Put differently, the time categories are placed furthest apart; in fact, the reader must load a different page to see the evolution from one Olympics to another. The sport categories are placed closest to each other, in the same chart unit, so it requires the least effort for the reader to compare the number of medals won by the US in athletics (say) with the number won in gymnastics.

This goes back to the top corner of our Trifecta Checkup. What is the most important question the designer is trying to address? If it is the evolution over time, then the time dimension should not be placed furthest apart. If it is comparison across sports, then the sport dimension should be placed innermost.

For me the country dimension is the least important because everyone knows the US typically wins the most medals, and the top 5-10 countries are quite stable. Within a sport, I might wonder if certain countries are dominant in certain periods, and if certain countries started developing particular sports from a certain time period onwards. In this case, I'd place country and time within sport.
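In data terms, choosing the nesting order is just choosing which dimension becomes the outermost grouping key. Below is a rough sketch of "country and time within sport"; the records, country labels and counts are placeholders I made up, not real medal tallies.

```python
from collections import defaultdict

# Hypothetical records: (country, sport, year, cumulative medals).
records = [
    ("Country A", "Gymnastics", 1896, 1),
    ("Country A", "Gymnastics", 1900, 3),
    ("Country B", "Gymnastics", 1900, 2),
    ("Country A", "Athletics", 1896, 4),
]


def nest_by_sport(rows):
    """Group rows as sport -> country -> year, i.e. one panel per sport
    holding one time series per country."""
    panels = defaultdict(lambda: defaultdict(dict))
    for country, sport, year, medals in rows:
        panels[sport][country][year] = medals
    return panels


panels = nest_by_sport(records)
gymnastics_lines = panels["Gymnastics"]  # one line per country on this chart
```

Each key of `panels` corresponds to one chart, and each country inside it to one line over time, so the comparisons that sit closest together are the ones across countries and years within a single sport.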

The following gives an idea of an alternative way of visualizing this data:

Apologies for not completing the dataset. Both charts are missing countries as well as years of history. But you can see where I'm going with this. There would be one chart per sport.

In gymnastics, we see that the US and China are latecomers, that Russia was the superpower until recently, and that Japan and Germany have stagnated.