I feel a little weird about featuring this item. Helen E., who created the chart/poster, urged me to write about it. The link seems to connect to a commercial site but doesn't look too commercial -- and since the iPad fever is upon us, I thought why not.
There are two elements on this poster that qualify it as data graphics.
The surreptitious blue bars that are sized to match the inflation-adjusted prices... except for the Apple Lisa, where the bar was chopped prematurely.
And the image equation gimmick at the bottom.
The 43 iPads = 1 Lisa visualization is definitely effective. I'm not so sure if anyone should care about this particular comparison though.
The blue bars are a super, light touch on a chart that could be otherwise quite boring. The varying heights of the bars exaggerate the larger prices, and can cause confusion. For example, Apple III, costing $11,412.88, is a midget next to Macintosh Portable, costing $11,358.59.
The choice of which products to feature images and bolded text appears somewhat arbitrary... are those keystone products? Nice use of foreground/background though.
I sent a note back to Helen about the (ab)use of decimals, indicating that dropping decimals and rounding off to the nearest $10 would improve readability.
She replied, saying "moving forward, they will be a little cleaner without such a paranoid focus on accuracy".
She explained further:
We were really concerned with accuracy as we knew that the Apple fan base would be tough on us if our calculations were even a little out.
So legendary Apple fanboys, take your decimals with you on your way out!
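For the record, the rounding I suggested to Helen takes only a few lines. This is a minimal sketch, not anything from the poster itself: the function name `round_to_nearest_ten` is my own, and I am assuming halves round up (Python's built-in `round` rounds halves to even, so I use `decimal` instead). The two prices are the ones quoted above.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_to_nearest_ten(price: str) -> int:
    """Round a dollar amount to the nearest $10, with halves rounding up."""
    # Divide by 10, round to a whole number, then scale back up.
    tens = (Decimal(price) / 10).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    return int(tens) * 10


# The two inflation-adjusted prices quoted in the post.
prices = {"Apple III": "11412.88", "Macintosh Portable": "11358.59"}

for name, price in prices.items():
    print(f"{name}: ${round_to_nearest_ten(price):,}")
```

After rounding, the two machines differ by $50 on a base of over $11,000, which is the comparison a reader actually needs.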
The gulf between infographics and statistical graphics, that is.
Stan at Mashable praised "5 Amazing Infographics for the Health Conscious". They belong to the class of "pretty things" that are touted all over the Web but from a statistical graphics perspective, they are dull.
Reader Mike L. poked me about the snake oil chart (right) while I was writing up this post. The snake oil chart is by David McCandless whose Twitter chart I liked quite a bit.
This one, not very much.
If the location and cluster membership of the substances depicted have some meaning, I might even feel ok about the effervescence. But I don't think so.
I continue to love his pithy text labels though; the "worth it line", truly.
The data (if verified) is pretty useful though since there are so many health supplements out there, and as a consumer, it's impossible to know which ones are sham. (Ben Goldacre's site may help.)
Now, let's run through the low lights of the rest:
I'm still trying to figure out what plus-minus means in the Dirty Water graphic.
The fact that the four buildings are not considered one complete unit also trips me up. The Truckee Meadows is depicted as 7 buildings, not divisible by 4. In addition, if 2 short buildings + 1 tall + 1 medium = 200,000 people, how many people live in 2 tall + 1 medium + 4 short buildings?
The obesity charts are pinatas.
The cost of health care chart is boring, just a prettied up data table. Why are life expectancy statistics expressed in 2 decimal places, and not in years and months?
Why 78.11 years and not 78 years (or 78 years, 1 month)?
The scatter chart relating survival rates of people with various ailments and the survival rates of viruses/bacteria left outside our bodies is alright but do we care about this correlation?
I hate to be so negative but I can't believe these are examples of good infographics.
My appeal for readers to send in positive examples still stands!
The biggest strength of this book is the material on data collection and selection, which is an overlooked aspect of statistical graphics. The content of p.103, for example, is not typically found in similar books: on this page, Wong works through how to determine the scales for two stock-price charts in such a way that the distances represent relative changes in stock prices (rather than absolute changes). Chapter 3 ("Ready Reference"), which covers this type of material, is almost as big as Chapter 2, which runs through basic rules of making graphs that should be familiar to our readers. Her philosophy, then, leans toward Tukey's as espoused in his seminal book EDA, although Wong keeps to the most basic elements (percentages, indices, log scales, etc.), obviously aiming for a different audience than Tukey.
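To make the idea of "distances represent relative changes" concrete, here is one way it could be done. I am not reproducing Wong's procedure from p.103; this is a sketch under my own assumptions: both charts are drawn at the same height on log-scaled axes, and the helper `shared_ratio_limits` (a made-up name) picks axis limits so that the ratio of top to bottom is identical for the two stocks. The prices are invented.

```python
import math


def shared_ratio_limits(series_a, series_b, pad=1.05):
    """Return [(lo_a, hi_a), (lo_b, hi_b)] with hi/lo equal for both series.

    On log-scaled axes of equal height, equal hi/lo ratios mean that a
    given vertical distance is the same percentage change on both charts.
    """
    # Use the larger of the two max/min spans, padded slightly for margin.
    ratio = max(max(s) / min(s) for s in (series_a, series_b)) * pad
    half = math.sqrt(ratio)
    limits = []
    for s in (series_a, series_b):
        center = math.sqrt(max(s) * min(s))  # geometric midpoint of the data
        limits.append((center / half, center * half))
    return limits


# Invented prices: a $20-30 stock and a $380-440 stock.
stock_a = [20.0, 22.0, 25.0, 30.0]
stock_b = [400.0, 420.0, 380.0, 440.0]

for name, (lo, hi) in zip(("A", "B"), shared_ratio_limits(stock_a, stock_b)):
    print(f"Stock {name}: axis from {lo:.2f} to {hi:.2f} (ratio {hi / lo:.3f})")
```

The low-priced stock's 50% run-up then visibly dwarfs the high-priced stock's roughly 16% range, which absolute-dollar axes would hide.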
The guidelines relating to making charts are prescriptive and concise. The following snippet (pp.72-73) is typical of the style:
Wong focuses on saying what to do, but (usually) not why. Perhaps for this reason, the book has no references or notes, except for mentioning Ed Tufte as Wong's thesis adviser. Almost all the best practices described in the book would meet with our approval. One that has not been featured much on this blog is the preference for shades of the same color to many different colors of the same shade.
Despite the title, the book actually discusses statistical graphics (same as Junk Charts), not "infographics" (as covered by Information Aesthetics, for example). Almost all the graphical examples are conceptual, and not based on real-life examples. This editorial decision has the advantage of sharpening the educational message but the disadvantage of being less engaging.
A unique feature of Wong's book is Chapter 5 ("Charting Your Course"), which covers business charts used to organize operational data, rather than present insights -- things like Gantt charts (which she calls work plans), org charts, flow charts, 2-by-2 matrices, and so on. Things that are in the toolkit of management consultants. This is an under-studied area, and deserves more attention. I am reminded of Tufte's re-design of bus schedules. This type of chart differs in the need to print all pieces of data onto the chart, the prevalence of text data (and the difficulty of incorporating them into charts), and efficient search as a primary goal. And it is in this chapter that the decision to stay conceptual diminishes the impact: it would be very valuable for readers to see a complete Gantt chart based on a real project, and how it evolves over the course of the project. I have always found these types of charts to start out nicely but gradually sink as details and detours pile up.
There is one chart on p.59 I would like to discuss.
Here, Wong allows the use of double axes in certain cases, basically when the two data series have linearly-related scales. She appends the advice: "Adhere to the correct chart type for each series -- lines for continuous data and bars for discrete quantities... The only exception is when both data series call for a chart with vertical bars. In such instances, convert one to a line." (Regular readers know I don't think much of this rule.)
Based on the chart above, Wong either considers both revenue and market share to be discrete quantities, or considers revenue to be discrete and market share to be continuous. In my mind, both series are continuous data and a chart with two lines is appropriate here.
Much of its power comes from the delightful use of short, precise data labels: "20 dead", "50 lazy", "5 loud mouths". And I love the "subjective" title.
A few considerations. The current choice of color, and to some extent the location of subgroups, makes the pinks (dead) and the blues (5% with over 100 followers) stand out. Probably not the intent. The grays are not labeled - not a big deal here since they are not the focus of the chart, and there won't be any short, precise labels for the grays (perhaps the average). Because of the color choice, the grays appear as if they don't belong.
What might work better is to have darker colors on the right side of the chart, and have the colors fade out towards the left (the lazy and the dead).
Also try a 5x20 grid with five blocks. This allows the height of the chart to represent the relative proportions.
David has recently published a beautiful-looking book, only available in the UK currently. An older book - on visualizing trivia - is available in the US. He has done work for the Guardian and Wired.
Stefan S. at the UNEP GEO Data Portal sent me some intriguing charts, made from data about the environment. The following shows the amount of CO2 emissions by country, both in aggregate and per capita. We looked at some of their other charts before.
These "inkblot" charts are visually appealing, and have some similarities with word clouds. It's pretty easy to find the important pieces of data; and while in general we should not sort things alphabetically, here, as in word clouds, the alphabetical order is actually superior as it spaces out the important bits. If these were sorted by size, we would end up with all the big blots on top, and a bunch of narrow lines at the bottom - and it would look very ugly.
The chart also breaks another rule. Each inkblot is a mirror image about a horizontal line. This arrangement is akin to arranging a bar chart with the bars centered (this has been done before, here). It works here because there is no meaningful zero point (put differently, many zero points) on the vertical scale, and the data is encoded in the height of each inkblot at any given time.
Breaking such a rule has an unintended negative. The change over time within each country is obscured: the slope of the upper envelope now only contains half of the change, the other half exists in the lower envelope's slope. Given that the more important goal is cross-country comparison, I think the tradeoff is reasonable.
Colors are chosen to help readers shift left and right between the per capita data and the aggregate data. Gridlines and labels are judicious.
As with other infographics, this chart does well to organize and expose interesting bits of data but doesn't address the next level of questions, such as why some countries contribute more pollution than others.
One suggestion: restrict the countries depicted to satisfy both rules (per capita emissions > 1000 kg AND total emissions > 10 million tonnes). In this version, a country like Albania is found only on one chart but not the other. This disrupts the shifting back and forth between the two charts.
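The suggested rule is a simple AND filter. The sketch below uses invented figures and placeholder country names (not the UNEP data), and the field names are my own; only the two thresholds come from the text above.

```python
# Invented figures for illustration only; these are not the UNEP data.
emissions = [
    {"country": "Country A", "per_capita_kg": 1200, "total_mt": 4},
    {"country": "Country B", "per_capita_kg": 9500, "total_mt": 520},
    {"country": "Country C", "per_capita_kg": 800, "total_mt": 95},
    {"country": "Country D", "per_capita_kg": 5100, "total_mt": 61},
]

PER_CAPITA_MIN_KG = 1000  # per capita emissions threshold, in kg
TOTAL_MIN_MT = 10         # total emissions threshold, in million tonnes


def on_both_charts(rows):
    """Keep only the countries that pass BOTH thresholds."""
    return [
        r["country"]
        for r in rows
        if r["per_capita_kg"] > PER_CAPITA_MIN_KG and r["total_mt"] > TOTAL_MIN_MT
    ]


print(on_both_charts(emissions))
```

Country A passes the per capita rule only and Country C the total rule only, so both would drop out, and every remaining country would appear on both charts.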
Jeff W made some astute comments on the New York Times Netflix visualization, which I praised in the last post. He pointed out that there is so much more to the underlying data than what can be shown within the confines of these maps. For example, he wanted to know the relationship between Metacritic scores and Netflix ranks (or rentals), explore the heavy-tailed distribution of titles, expose regional differences, etc.
What he is hitting on is the shortcoming of the current approach to infographics... an approach which is about putting order to messy data, rather than summarizing, extracting and generalizing. And it is also the difference between "data graphics" and "statistical graphics".
This is related to the modelers versus non-modelers dichotomy Andrew Gelman just discussed in this blog post. (He cites Hal Stern as the source of the quote.)
Basically, non-modelers have the same philosophy as infographics designers - they want to make as few assumptions as possible, to rely exclusively on the data set. By contrast, modelers want to reduce the data; their instinct is to generalize. The stuff that Jeff wanted all requires statistical modeling. As I mentioned before (say, here), I believe infographics has to eventually move in this direction to be successful.
Take the correlation between Metacritic score and Netflix ranking... the designers actually thought about this and they tried to surface the correlation, in a way that is strait-jacketed by the infographics aesthetics. What they did was to allow the movies to be sorted by Netflix ranking, or by Metacritic score, using the controls on the top right. And when the Netflix ranking is chosen for sorting, the Metacritic score is printed next to the map, so as the reader scrolls along, he or she can mentally evaluate the correlation. Of course, this is very inefficient and error-prone but we should give the designers props for trying.
Building a model for this data is no simple matter either because multiple factors are at play to determine the Netflix ranking. A good model is one that can somewhat accurately predict the Netflix ranking (color) based on various factors included in the model, such as the type of movie, the cost of the movie, the number of screens it played on, any affinity of a movie to a locale (witness "New in Town"), regions (at different levels of specificity), recency of the movie, whether it's been released on multiple formats, etc. etc.
Jeff's other point about ranking vs number of rentals raises another interesting statistical issue. I suspect that it is precisely because the number of rentals is highly skewed with a long tail that the analyst chose to use rank orders. If an untransformed number of rentals is used, the top few blockbuster films will dominate pretty much every map.
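A toy example shows why ranks tame a long-tailed measure. The rental counts and titles below are invented, and `rank_descending` is my own helper; I have no knowledge of how the NYT analysts actually computed their ranks.

```python
def rank_descending(counts):
    """Map each title to its rank (1 = most rented); ties keep input order."""
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {title: i + 1 for i, title in enumerate(ordered)}


# Invented rental counts with a long tail.
rentals = {
    "Blockbuster A": 90_000,
    "Blockbuster B": 60_000,
    "Drama C": 900,
    "Indie D": 300,
    "Documentary E": 120,
}

ranks = rank_descending(rentals)
total = sum(rentals.values())

for title, count in rentals.items():
    print(f"{title:15s} share of rentals = {count / total:6.1%}   rank = {ranks[title]}")
```

On the raw scale the two blockbusters take up nearly the whole range and the other three titles are indistinguishable; on the rank scale the titles are evenly spaced, at the price of throwing away the size of the gaps.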
This graphic feature is the best from the NYT team yet. I particularly love the two columns on the right which allow us to see regional differences. For example, this "New in Town" movie was much more popular in Minneapolis than any of the other metropolitan areas, and was particularly unwatched in New York. Also, note the choice of sorting allowed on the top right.
I had not been fully convinced of the direction of infographics until now -- I find too narrow the focus on organizing, structuring and visualizing large datasets; we get pretty pictures with extremely high data-ink ratios but, more often than not, these very dense graphics fail to speak directly to readers. We see a lot of information; we find hardly any insights.
I think I have seen the future. My friend Adam has been working on a web service called Empirasign, which I will describe as a form of data democracy - he takes boatloads of financial data, runs all sorts of analyses and models, and presents these results in a variety of formats, including on-line reports and tweets. He does not attempt to visualize all the data, or all possible relationships. Each analysis or model focuses on specific matters and he presents the result in tables and charts.
For example, a business problem might be as follows (timely for the year-end): in my portfolio, I am carrying some loser stock which I'd like to sell by year end so I can take a tax deduction on the loss, perhaps to cover some investment gains I have realized last year; however, I also believe that the loser stock may be near bottom, and if I sell now, I'd want to buy it back in short order - alas, this may be considered a "wash sale" and prohibited. What if one can find a hedge (another stock or a portfolio of stocks) that replicates the performance of the loser stock so now I can get the best of both worlds - I sell the loser stock for the tax deduction, but keep the performance by taking a position in the hedge, then unwind when the regulation allows me to buy back into the loser stock? (If you are interested in this trade, you should consult the experts: Adam's tutorial or wikipedia on "wash sale" or IRS-ese (pdf file).)
There are lots of stocks out there, and lots of possible hedges. An unsophisticated investor like myself would have to spend a lot of effort to find the right hedge. Also, it's very unlikely that staring at an infographics chart will uncover such hedges. What Adam has done is he has collected all the required data and run analyses to find the right hedge for pretty much every (loser) stock out there. And instead of presenting all the underlying data, he presents the results. See below.
These data displays are not sexy - and can be improved (the explanation for the columns of the table is found on a separate page, e.g.), but for the target audience looking for trade ideas, they get to the point. This is the gift of statistical data reduction.
What is also worth noting is that, through the magic of R and Web technologies, Adam makes all this run automatically, so the insights from the data are uncovered in real time. The wash sale avoidance strategy is not the only analysis he provides; there are tons more on the website that implement all sorts of other techniques (of which I am no expert) but it appears that users can pick and choose whatever strategy they like to follow, and Empirasign saves them any of the analytical work.
As I said at the start of this post, I see this as a promising direction for infographics, moving from visualizing data to visualizing insights.
P.S. As with previous years, I have updated my Amazon wish list (click on button on top right). If you'd like to show your support for this blog, please help me build out my library. Thanks to those readers who have contributed in past years - since Amazon does not always provide me your contact information, I have not been able to thank each of you personally. Happy holidays!
Stefan pointed us to his work for the UN GEO (United Nations Global Environment Outlook) data portal. This set of information posters highlights a vexing issue that crops up on Junk Charts from time to time, that is, the proper balance between information and entertainment value of data displays. While this blog concerns itself primarily with the former, it does not mean that we are blind to the flashier side of the enterprise.
Let's take Stefan's recycling spiral chart as an example. One must admit that visually this presentation is more appealing than either a data table or a set of bar charts. The reader can obtain the primary piece of information, which is the ranking of different countries in terms of the proportion of collected waste that is recycled.
And if the reader is curious enough, the chart also provides the data on the per-capita amount of waste collected in each of these countries. (Like the table and bar chart, this display also has the problem that it is one-dimensional, thus the countries can be sorted by proportion of recycling but then the waste collected data will be out of order.)
For those readers who would like to understand the data better, they would want to know some of the following:
Is there a relationship between amount of waste collected and amount of waste recycled?
Are there differences in culture resulting in different recycling rates?
Is the level of development of a country predictive of its recycling rate?
Why are some countries recycling more of their waste, and others less?
To address these types of questions, one can start with the following scatter plot.
With the exception of South Korea, there is a general pattern of positive correlation: the more waste collected per capita, the larger the proportion of such waste recycled. Any dots that are not in the bottom left or top right quadrant are exceptions to the rule. These countries are labeled in red or blue, the former indicating that the amount of collection is above average while the rate of recycling is below average.
Because there is sampling error, dots that are close to the average dot (the center of this scatter plot) are probably just average. Roughly speaking, dots in the gray circle are close enough to the center that I would not consider them exceptional cases. That leaves Spain and Iceland in the red corner, and South Korea in the blue corner. If both data series are considered together, these three countries should merit attention; if only the proportion of recycling is considered, then one would pay attention to Italy, Turkey and Slovak Republic on the lower end and South Korea on the high end.
Scatter plots are very versatile. The following one explores the issue of development level. Surprisingly, the level of recycling seems to have little to do with development; the countries are quite widely scattered.
Technical note: The data on both axes are expressed in "standardized" units. So the zeroes represent the average per-capita waste collected, and the average proportion of waste recycled (only of those countries depicted in the original chart). +1 indicates an amount that is one standard deviation above the average. Think of "standardized units" as measuring how extreme a particular country is with respect to the average.
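For readers who want to try this at home, the standardization and the quadrant labeling can be sketched as follows. Several things here are my assumptions for illustration: the use of the sample standard deviation, the radius of 1.0 for the "close to average" circle, the reading of blue as the mirror image of red, and all the function names. Plug in the real waste and recycling figures to reproduce the plot.

```python
import math
from statistics import mean, stdev


def standardize(values):
    """Convert raw values to standardized units: (value - average) / std dev."""
    m, s = mean(values), stdev(values)  # sample standard deviation (assumed)
    return [(v - m) / s for v in values]


def classify(z_waste, z_recycle, radius=1.0):
    """Label a country by its position on the standardized scatter plot."""
    # Dots near the center are treated as average (radius is hypothetical).
    if math.hypot(z_waste, z_recycle) <= radius:
        return "about average"
    if z_waste > 0 and z_recycle < 0:
        return "red"   # collects more than average, recycles less
    if z_waste < 0 and z_recycle > 0:
        return "blue"  # collects less than average, recycles more
    return "follows the pattern"


# A tiny check of the arithmetic: 1, 2, 3 has mean 2 and std dev 1.
print(standardize([1, 2, 3]))
print(classify(1.5, -1.5), "|", classify(0.2, 0.1), "|", classify(-1.2, 1.4))
```

The choice of radius is a judgment call: a larger circle flags fewer countries as exceptional.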