Note: I contributed the following post to Statistics Forum, which is a new blog sponsored by the American Statistical Association (ASA), curated by Andrew Gelman.
A reader from Twitter @meprieb suggested that I discuss a particular set of "infographics", one of which is shown below:
This chart is unquestionably easy on the eyes, and engages our brain cells. The use of a real-world object to simulate a pie chart is cute and even ingenious. According to the description at PSFK, the data tell how "Danish people feel about publicly wearing religious symbols".
The key to reading this chart is to read it as an illustration, an art piece. The fact that this is described as "infographics" reveals a wide divide between the artists and the scientists who work in visualizing data.
This chart fails completely as data graphics. The size of the pie quadrants has no relationship with the data at all, and the four percentages on the chart add up to much more than 100%, so they are obviously not proportions of a whole. The same problem plagues every one of the charts in the set.
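The "adds up to much more than 100%" check is easy to automate before reaching for a pie chart. Below is a minimal sketch; the function name, the tolerance, and the sample values are all assumptions made for illustration, and the numbers are not the figures printed on the chart.

```python
def looks_like_parts_of_whole(percentages, tolerance=1.0):
    """Return True if the values could plausibly be shares of one whole.

    A pie chart only makes sense when the values are non-negative and
    sum to roughly 100 (the tolerance allows for rounding).
    """
    total = sum(percentages)
    non_negative = all(p >= 0 for p in percentages)
    return non_negative and abs(total - 100.0) <= tolerance


# Hypothetical survey percentages, invented for this example only.
survey = [60.0, 55.0, 45.0, 35.0]

print("sum of percentages:", sum(survey))
print("usable as pie slices:", looks_like_parts_of_whole(survey))
```

When the check fails, the numbers are most likely answers to separate questions (or a multiple-choice question allowing several answers), and a bar chart is the safer form.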
Further reading: Andrew Gelman has recently made comments about the divide between the statistical graphics and infographics communities here.
Not sure how Tetris-shaped pieces are better than a standard stacked bar chart or a line chart.
Adding a one-liner for each analysis summarizing the key insight is essential, and much more engaging than dry titles like "by gender"
Ordering each section of the poster in a sensible way would help bring out the message; maintaining the same order in all four sections has little benefit but adds to the confusion
Many of the corporate logos are not popular enough to yield recognition; they do not resemble their company names enough to elicit free association
But the chart also fails to ask the right question. In thinking about "who uses which sites?", it would be much more informative to cut the data in a different way -- tell us among males, what proportion uses Digg v. Stumbleupon v. Facebook, etc. The problem with the current graphic is that it offers no information about scale. For example, Ning may have 1/1000-th of the total traffic compared to Facebook (I made this up) but you wouldn't know since everything is expressed as a proportion of each site's user base.
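To make the point about scale concrete, here is a small sketch contrasting the two ways of cutting the data. The user counts are made up (in the same spirit as the Ning example above), and the function names are hypothetical.

```python
# Hypothetical user counts, invented for illustration only.
users = {
    "Facebook": {"male": 450_000, "female": 550_000},
    "Ning": {"male": 400, "female": 600},
}


def share_within_site(counts):
    """The poster's view: each site's own user base split by gender."""
    return {
        site: {group: n / sum(by_group.values()) for group, n in by_group.items()}
        for site, by_group in counts.items()
    }


def share_within_group(counts, group):
    """The alternative cut: among one group, which site do they use?"""
    total = sum(by_group[group] for by_group in counts.values())
    return {site: by_group[group] / total for site, by_group in counts.items()}


print(share_within_site(users))           # the two sites look comparable
print(share_within_group(users, "male"))  # the difference in scale shows up
```

In the first view, both sites show a similar gender split and nothing hints that one is a thousand times larger. The second view keeps the scale information because every site is divided by the same total.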
Besides, what is the objective behind asking the question, who uses which sites? Are readers asked to draw conclusions about the relative viability of the business models of these companies? Is there some significance associated with an elderly skew or female skew?
Finally, the chart hits the trifecta! It also fails from the data collection perspective. While it discloses the source of the data as "Google Ad Planner", it is impossible for readers to make sense of the data. How reliable is this data? Did the income levels come from surveys of users (self-reported and probably biased)? Or from users associated with a specific advertising campaign? Did they come from matching users' IP addresses to Census data? If so, how much actual household-level data are used? Or perhaps a statistical model was built to predict income levels? Of which period is the data representative? Does that period generalize to other periods? Were there any (or many) missing values? Were these values imputed or set to the average? If a sample was used, how do we know that it is unbiased?
In this form, the infographics poster is nothing more than a done-up data dump.
Received a wonderful link via reader Lonnie P. to this website that presents a historical reconstruction of W.E.B. DuBois's exhibit of the "American negro" at the 1900 Paris Expo. Amusingly, DuBois presented a large series of data graphics to educate the world on the state (plight) of blacks in America over a century ago.
You can really spend a whole afternoon examining these charts (and more); too bad the charts have poor resolution and it is often hard to make out the details.
Judging from this evidence, we must face up to the fact that data graphics have made little progress during these eleven decades. Ideas, good or bad, get reinvented. Disappointingly, we haven't learned from the worst ones.
Mint produced a set of charts (they call this an infographic) about the state of our retail sector prior to entering the all-important 4th quarter. There are things I like, and things I don't.
What they did well is to produce a separate chart for each key message.
Let's start with the last chart on the poster, which is the simplest and so the easiest to grasp.
The Bumps chart (shown on the right) compares the growth performance of four luxury retailers over the first three quarters of 2010. The text correctly summarizes the pattern as a "slow decline in growth over 3 quarters".
The chart focuses on the change of a change, the "second derivative"; thus, a downward-sloping line above zero indicates positive growth that decreases in magnitude. Just like the CO2 emissions charts (see Stefan's comment), the designer makes a conscious choice to invert our conventional sense that up is positive and down is negative, and would have done well by readers if he had slipped a brief note under the chart.
There is one extra detail -- better to label the vertical axis as "annualized growth rate (seasonally-adjusted)". The seasonal adjustment, as explained here recently, is needed so that every number on this chart can be directly compared to every other number. (The simplest form of seasonal adjustment is taking the difference of this year from the last year, which is what they did here; thus, Neiman Marcus's Q1 sales grew by 40% in 2010, relative to their Q1 sales in 2009.)
The current label, "change in spending per user '09-'10", juxtaposed with the time axis going from Q1 to Q3 2010, confuses and does not communicate.
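The year-over-year calculation, and the "change of a change" reading of the lines, can be sketched in a few lines. The sales figures below are hypothetical and indexed so that the first quarter reproduces the 40% growth mentioned above; the other quarters are invented, and the function name is an assumption.

```python
def yoy_growth(this_year, last_year):
    """Growth of each quarter relative to the same quarter one year earlier."""
    return [(now - before) / before for now, before in zip(this_year, last_year)]


# Hypothetical quarterly sales (Q1..Q3), indexed to 100 in 2009.
sales_2009 = [100.0, 100.0, 100.0]
sales_2010 = [140.0, 130.0, 120.0]

growth = yoy_growth(sales_2010, sales_2009)

# Quarter-to-quarter change in the growth rate: what the slope of the line shows.
change_in_growth = [later - earlier for earlier, later in zip(growth, growth[1:])]

print("year-over-year growth:", growth)
print("change in growth:", change_in_growth)
```

Every growth rate here is positive while every change in growth is negative: sales are still rising, just more slowly, which is the pattern a downward-sloping line above zero describes.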
How can we make a world of difference with a minor shuffle? Put the retailer names in the same order as the lines on the chart. Like this:
At the top of the poster sits a similarly-styled, slightly busier chart, comparing the change in growth rates of retail sales by type of retailer.
Four little things stick out for improvement.
Placing the line labels beside each line eliminates the need for a legend, and also removes the need to use six colors on the same page.
The 3% increment on the sales axis is unusual, as if to defy convention for the sake of defiance. I'd stay glued to a 5% increment.
The text sings a different tune from that of the chart. The only year visible to the reader is 2010 while the text refers to 2009 and 2008. This forces readers to crawl around the plumbing (to use Andrew Gelman's term); to see the connection to 2009, readers have to know the formula for seasonal adjustment. There is no linkage to 2008 as far as I can tell. Better to hide the pipes.
And then we have the little issue of the negative growth rate. In a chart like this, the negative numbers should shake up readers because they represent not just declining growth but contraction. Shoving this into a little corner does not do it justice.
The same chart with minor changes:
Thanks to reader Chris P. for sending in the link. Chris suggests that a longer time horizon be plotted.
In the next post (second mint), I will deal with the interpretation of these charts.
Daniel L. submitted this infographic with some positive comments:
- There's quite a bit of information. The objects on the chart have some depth beyond dots on a page.
- I'm kind of willing to overlook the size of the infographic because I think there's pretty good use of the page. Yeah....I know...a lot of blank space in there, but I think that's the price of doing the 3 column schtick - and I think it works in this case.
One weird thing: Top to bottom it goes from 2010 back to 2001.
I am less impressed with this chart. (The full version is here. It's big.) There are many problems:
It has all the signs of having tried too hard. There is indeed a trove of information. We are presented with each of Google's acquisitions, the time of each deal, the value of each deal, whether the deal happened in a busy or not busy period, the type of deal by synergy with Google, the type of business the acquired company is in, and the impact on Google's financials. As if this is not enough, the chart includes the months without any acquisitions in small gray letters.
But the designer seems to have no idea what the plot is. I can't figure out what I am supposed to read from this poster. While I often dislike the graphical details on this genre of posters, I usually can enjoy the attempt to tell a story using the data but on this one, I just don't know what to make of it.
Also notice that the foremost dimension on this chart is the chronology. To me, it is the least important dimension. If one wants to understand what Google's acquisition strategy is, for example, the chronology is not important - it would have been much more informative to group the acquisitions by type of business, or impact on financials, or any number of other dimensions.
This announcement arrived in my inbox. Although I have criticized any number of such infographics, there may be some of you who would like to enter this contest. You get a set of Tufte books if you win 2nd place.
If you have submitted links to me in the past few months, you will see them posted in the next few weeks; I just spent some time looking at all the submissions.
Here are some links that are slightly off-topic (though still interesting), and others I don't intend on writing full posts about:
Daniel L. sent us to Slate, where they posted this chart counting up the human cost of the Afghan War. Applying the Trifecta checkup, he gave this evaluation:
What is the practical question: I have no idea
What does the chart say: I have no idea
What does the data say: I have no idea
The time series thing coupled with poor use of color obscures whatever patterns you could pick up.
Daniel is right about the last point - by plotting the disaggregated data, readers are forced to stare at the variability of casualties over time, and the progress of the war, which distracts from the idea of "accounting for the dead".
Daniel also argues, and I agree, that this math is meaningless even if done properly.
Understanding Google PageRank - Nick calls this an infographic but it contains zero data. Not the kind of thing for this blog but it does a decent job explaining PageRank.
The part about circular links canceling each other out confuses me; it would seem like good blogs should be able to link to each other without being penalized.
The Ins and Outs of Assisted Living Homes - Ellen G. created this "infographic" explaining what "assisted living homes" are like. Again, not stuff for this blog, as the two bar charts are just tag-alongs that are not well integrated with the rest.
In terms of the charts, please remove 3-D, remove the colors, order the data from largest to smallest, consider a horizontal bar chart with data labels on the left, and title it "the top needs for assisted living residents".
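As a rough sketch of that advice, the snippet below orders categories from largest to smallest and draws a flat, single-color horizontal bar chart in plain text with the labels on the left. The category names and values are placeholders, not data from the infographic, and the function assumes a non-empty dataset with positive values.

```python
def horizontal_bars(data, width=40):
    """Render a text bar chart: sorted largest first, labels on the left."""
    ordered = sorted(data.items(), key=lambda item: item[1], reverse=True)
    label_width = max(len(label) for label, _ in ordered)
    largest = ordered[0][1]
    lines = []
    for label, value in ordered:
        # Scale each bar relative to the largest value.
        bar = "#" * round(width * value / largest)
        lines.append(f"{label:>{label_width}} | {bar} {value}")
    return lines


# Placeholder categories and values, invented for illustration.
needs = {"Need A": 20, "Need B": 50, "Need C": 35}

print("The top needs for assisted living residents")
for line in horizontal_bars(needs):
    print(line)
```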
Something light to start the week... this infographics poster submitted by Curtis R. has to be seen to be believed. Prepared by the Rate Rush website, it compares Digg and Reddit, two services that rank and track the popularity of web pages. They were in the news a few years back; do people still use Digg or Reddit?
Here is a section of the chart:
Oh, and if you scroll down further, the designers received some appropriate feedback and re-did this chart:
I have to commend them for responding to reader suggestions. This line chart is obviously much better, and we can see that Digg has more front-page stories than Reddit at any time of the day. (Please put the data-series labels next to the lines on the right. And fix the nonsensical decimals on the gridlines.)
The fact that there is a huge gap between Digg and Reddit during the early morning hours could indicate that Reddit users tend to visit the site at work, or it could indicate that Reddit's algorithm realizes that there is no need to update the front page as often when the traffic is slow, or it could be some data processing error. It's something worth investigating.
Here's one more for further amusement:
Curtis reacts to this spiky chart:
2 pie charts, tilted at different angles (making it impossible to accurately judge the size of each sector), with a color legend that switched from chart to chart (e.g. imgur is blue in the reddit chart, gray in the digg chart).