Today we look at an example of a powerful visualization of some unstructured data. The data team at the Guardian (UK) organized the Wikileaks data concerning reported incidence of IEDs in Afghanistan.
A scatter plot on a map provides an overview of the intensity of attacks from a spatial perspective. (A part of this map is shown on the right.) The background data -- the relief map of Afghanistan, and the major thoroughfares -- add to our understanding of why attacks were concentrated in certain parts of the country. It is always a great idea to add (con)textual data to help readers grasp the information shown on the chart.
Readers may want to understand the temporal pattern of attacks as well. The designer chose a small-multiples format to show this data, disaggregated by year of occurrence. This graphical construct is very versatile, and illustrates this data well... even though there has been little change over time, apart from a general increase in the number of reported attacks across the country.
It is a good idea to track the total number of attacks over time -- but not with those bubbles! The bubble chart almost always fails the self-sufficiency test; our eyes are not equipped to read relative areas of circles, and so any information we obtain about the aggregate number of attacks comes from reading the data directly. Switching to a bar chart, or removing the bubbles, leaving just the data, is recommended.
The major problem with a dataset like this is reporting bias: only attacks that were reported by U.S. personnel were included. The following chart helps close the gap a little by also showing the number of defused attacks, reported in the U.S. database. I'd have preferred a stacked column chart here since the total of defused (gray) and detonated (red) IEDs is an interesting statistic.
A stock trading volume type chart would also be nice, something like this:
Reader Tyson A. serves dessert for dinner, and stacked pancakes are on the menu!
According to the St. Louis Beacon, which published these charts (and more):
These pie charts take the individual states' percentages, split them up and then stack them. In this way, you can see how the proportion of taxes in each category collected by each state compares with the states around it.
This presentation fails our self-sufficiency test: one would be completely lost if the entire data set were not printed on the chart itself.
The pie pieces apparently lost shape as they got stacked on top of each other. The top green slice labeled Tennessee represents 2.1% but look at the difference between the green Nebraska (40%) and the green Kansas (40.8%), for example.
Also, the red pieces and the green pieces are ordered on their own so that the Tennessee red is near the bottom of the stack while the Tennessee green is at the top.
This data can be shown clearly in a pair of line charts.
To really learn something about the data, we can create a scatter plot.
From this plot, we see that most of these states (clustered in the middle) have similar taxation policies.
The exceptions are Illinois and Tennessee, and to a lesser extent, Missouri.
Starting 2011 with shark attacks courtesy of Julien D.
It seems like the good chart did not survive a shark attack if one were to judge from what's left of it.
It's a distorted pie chart with some kind of 3D hemispherical add-on, or it's a cross-sectional chart with the top of a sphere lopped off.
Charts of the USA Today variety do not usually feature here but this one has the aspiration to inform readers -- the chart appears on a web page that purports to correct some myths about shark attacks.
The biggest casualty of the shark attack is the ordering of the data labels. It is a brain teaser as to what criterion was used to order the pie slices: it's not the total number of accidents, nor the number of deaths, nor the death rate, nor alphabetical.
It's also unclear why the data labels were made vertical. The palette of colors is, however, typical of pie charts.
Since a death rate is usually defined as deaths / total accidents, not the other way round, even the numerical data labels are harder to read than necessary.
There are two primary questions this chart is intended to address: the prevalence of different types of shark attacks, and the death rate of each type of attack (as a proportion of reported accidents).
We try a scatter plot showing these two metrics, adding notes to point out where the interesting data is.
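For readers who want to reproduce the two metrics behind such a scatter plot, here is a minimal sketch. The category names and counts are made up for illustration and are not the figures from the shark chart; `death_rate` and `points` are names I invented.

```python
# Hypothetical counts for illustration only; NOT the figures from the shark chart.
attacks = {
    "activity A": {"accidents": 120, "deaths": 6},
    "activity B": {"accidents": 40, "deaths": 10},
    "activity C": {"accidents": 15, "deaths": 3},
}


def death_rate(deaths, accidents):
    """Deaths as a proportion of reported accidents (not the other way round)."""
    if accidents == 0:
        raise ValueError("no reported accidents, so the rate is undefined")
    return deaths / accidents


total_accidents = sum(row["accidents"] for row in attacks.values())

# One (prevalence, death rate) pair per category: the two axes of the scatter plot.
points = {
    name: (
        row["accidents"] / total_accidents,
        death_rate(row["deaths"], row["accidents"]),
    )
    for name, row in attacks.items()
}

for name, (prevalence, rate) in points.items():
    print(f"{name}: {prevalence:.1%} of accidents, death rate {rate:.1%}")
```

Prevalence goes on one axis and death rate on the other, so a rare but deadly category separates cleanly from a common but mild one.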
In a prior post, I showed a chart of Pisa test scores that can be used to investigate differences between any pair of countries. At least one reader found it confusing, containing too much data. I then realized that if the objective of the chart is re-stated as "How the UK fared relative to other OECD countries", which was the intent of the original Guardian chart, the chart could be presented in the following simplified fashion:
Simplification can be achieved in many ways, one of which is simplifying the objective. In fact, I'd not be opposed to showing just the left side of the chart, which addresses an even more general question, which is how the countries fared in a general sense.
While the lines in the Guardian chart display correlations of math, reading and science scores within specific countries, essentially a parallel coordinates plot, the same correlation can be visualized in a scatterplot matrix (see this post).
Each scatter plot here relates the scores of two subject areas as indicated by the axis labels. The simplest observation is the high degree of positive correlation on all three panels: in other words, countries in general do well in all three subjects, or poorly in all three subjects.
This pattern confirms why it isn't very productive to focus readers' attention on this set of correlations when dealing with this data set.
You'll notice the use of colored dots on the scatter plots. Imagine that I have put the countries into groups based on overall scores (rather than just reading scores) as in my earlier analysis. The dots of the same color represent countries that are deemed to have performed similarly. The black cross indicates the "average country".
Focusing on the colors for the moment, you can confirm yet again that a country doing well in one subject is highly predictive of it doing well in the other subjects.
As I pointed out at the start of the prior post, using a little statistical technique allows us to understand the data better, and plotting summaries of the data allows us to draw more interesting conclusions than putting all the data, unperturbed, onto a canvas.
Stefan S., whose team created the inkblot charts featured here, has updated his page of charts with a bunch of new ones, including some bubble charts. I had fun looking at a few of his experiments, and learnt a bit from them.
This chart deals with the geographical distribution of CO2 emissions and of wealth, and their correlation. The "standard" format would be to put pairs of columns onto a world map. That format has various weaknesses, and it is great to see Stefan try to reimagine the chart.
I especially love how he broke up the map into continents and subregions and arranged these pieces in a clean manner. He also recognized that stacked columns are not great for comparisons as the two pieces being compared don't usually sit at the same level. So in this chart, most pieces are level at their base. The solution is not perfect though, as for instance in the European section, it was very hard to put everything level without introducing white space.
Stefan also realized that you can't make it easy to compare distribution within a continent and across the globe in one chart, so he created the right-side column to solve this problem. Again, it's a good effort, not entirely successful but a very good start.
Another chart I like is the inkblot chart that deals with all levels of the data simultaneously. These, I think, manage to both engage our brains and entertain us.
*** Stefan specifically asked for feedback on this bubble chart:
As a whole, the scatter plot is effective at showing the positive correlation between development and per-capita emissions.
The effort is enormously ambitious in terms of stuffing as many dimensions as possible onto the same chart. I feel like the data can benefit from being shown in a set of charts, rather than one.
For example, having the continental averages plotted with all the individual countries doesn't work for me. I'd rather see individual plots for each continent, with the continental average plotted against a background of all the individual countries. This proposed view is reminiscent of the Gapminder presentation from what seems like eons ago.
Also, many metrics are found here: population, per-capita emissions, total emissions, human development index. Anyone familiar with this business knows that there are controversies around different pairs of metrics, e.g. per-capita versus total emissions, total emissions v. population, per-capita emissions v. population. Thus, a panel of scatter plots that focuses on different pairs of metrics would encourage readers to ponder these practical questions. Otherwise, we may feel like we have been dumped in the deep end.
I'm sure Stefan would appreciate other comments on this or other charts.
Dave S. achieved a rare feat, which is to send in a great-looking set of charts. This post at Asymco is worth reading in its entirety; the author Horace discusses the process by which he worked through several charts, arriving at the one he's most happy with.
The secret to the success here is the careful framing of the question, and the collection of the appropriate data to address that question. The question is the competition between wireless phone vendors in the last three years. It was established that the right way to view this competition is in two dimensions: share of revenues, and share of profits. Note the word "share". Share of profits is not a metric that is often discussed but it is the right metric to compare to the share of revenues -- getting both numbers onto the same comparable scale is what makes this work.
Needless to say, the raw data one would collect come from the financial statements of the eight individual vendors. Plotting these numbers directly would be a mistake. So you take the numbers, making sure that you're really counting wireless revenues and wireless profits, and then compute the shares. (I am not actually sure that they have wireless profit data because large companies like Apple and Nokia typically don't break out their profit shares, even if they provide the revenue shares by line of business.)
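The step from raw financials to shares can be sketched as follows. The vendors and figures are hypothetical, not data from the Asymco post, and the `shares` helper is my own naming.

```python
# Hypothetical wireless revenue and profit per vendor (same currency, same period).
# These are made-up figures, NOT data from the Asymco post.
financials = {
    "Vendor A": {"revenue": 50.0, "profit": 5.0},
    "Vendor B": {"revenue": 30.0, "profit": 12.0},
    "Vendor C": {"revenue": 20.0, "profit": 3.0},
}


def shares(table, key):
    """Divide each vendor's figure by the total across all vendors."""
    total = sum(row[key] for row in table.values())
    if total == 0:
        raise ValueError(f"total {key} is zero, so shares are undefined")
    return {name: row[key] / total for name, row in table.items()}


revenue_share = shares(financials, "revenue")
profit_share = shares(financials, "profit")

for name in financials:
    print(f"{name}: {revenue_share[name]:.0%} of revenue, "
          f"{profit_share[name]:.0%} of profit")
```

Both metrics now live on the same 0-to-100% scale, which is what makes them comparable. One caveat: if any vendor reports a loss, profit shares can fall below zero or exceed 100%, and the analyst has to decide how to treat that case.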
Horace also avoided the plague of plotting all time-series data as line charts (similar to the plague of plotting all geographic data on maps). By plotting revenues and profits simultaneously, he no longer can plot time (years) on one of the two axes, and that is a good thing.
This is the final graph Horace landed on. It puts all the vendors at the origin in 2007 and then tells us where they landed in 2010 in terms of revenue and profit share growth/decline.
It would be even better if he made the scales work harder: e.g. have equal lengths for the 10% change along both the vertical and horizontal axes. Alternatively, you can scale it such that each unit on either axis represents equal dollars.
This is a very focused chart that answers the question about the relative change in positioning of each vendor. What it doesn't answer is the starting position or ending position of each. Note, while Nokia is depicted as losing share on both revenues and profits, Nokia still has twice the revenue share of the other vendors, and out-earns everyone except Apple!
I am not saying this is a bad chart. It is designed to answer the relative question, not the absolute question. That's all.
There is one way to have the cake and eat it too. Horace almost created that chart. He showed two scatter plots, one for 2007 and one for 2010.
If he just overlays one on the other, and uses lines to connect the dots for each phone vendor, he will have a chart that shows absolute and relative values all at once. Here's a crude illustration of this: (missing the labels to show that the arrow end of the line represents 2010 positions)
I like this kind of chart a lot. It is great for showing dynamics in a set of variables, without actually making the chart dynamic.
(Even on this chart, it is better to harmonize the two scales.)
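To see why the overlaid version carries both the absolute and the relative information, consider this small sketch of the data behind each arrow. The positions are invented for illustration and are not Horace's numbers; the `Arrow` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Arrow:
    """One vendor's path on the (revenue share, profit share) plane, in percent."""
    vendor: str
    tail: tuple  # position in the first year (absolute)
    head: tuple  # position in the last year (absolute)

    @property
    def change(self):
        """Percentage-point change, i.e. what an origin-anchored chart would plot."""
        return (self.head[0] - self.tail[0], self.head[1] - self.tail[1])


# Hypothetical positions; NOT the figures from the Asymco charts.
position_2007 = {"Vendor A": (40.0, 55.0), "Vendor B": (5.0, 10.0)}
position_2010 = {"Vendor A": (30.0, 25.0), "Vendor B": (20.0, 50.0)}

arrows = [Arrow(v, position_2007[v], position_2010[v]) for v in position_2007]

for a in arrows:
    print(f"{a.vendor}: from {a.tail} to {a.head}, change {a.change}")
```

The tail and head give the starting and ending positions, while the arrow's length and direction give the change. Moving every tail to (0, 0) keeps the change but discards the positions.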
The scatter-plot matrix is one of the lesser known graphical tools beloved by statisticians. A scatter plot displays the correlation between a pair of variables. Given a set of n variables, there are n-choose-2 pairs of variables, and thus the same number of scatter plots. These scatter plots can be organized into a matrix, making it easy to look at all pairwise correlations in one place.
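As a rough numerical companion to the matrix, one can enumerate the n-choose-2 pairs and compute a correlation coefficient for each. This is only a sketch: the factor names and scores are invented, and the Pearson coefficient is my choice of summary, not something the chart itself prescribes.

```python
from itertools import combinations
from math import comb, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five items on three variables (illustration only).
scores = {
    "housing": [1, 2, 3, 4, 5],
    "transit": [5, 4, 3, 2, 1],
    "dining": [2, 1, 4, 3, 5],
}

# Every unordered pair of variables: one panel of the matrix per pair.
pairs = list(combinations(scores, 2))
assert len(pairs) == comb(len(scores), 2)

correlations = {(a, b): pearson(scores[a], scores[b]) for a, b in pairs}

for (a, b), r in correlations.items():
    print(f"{a} vs {b}: r = {r:+.2f}")
```

With 12 variables the same loop yields 66 pairs, which is why a compact matrix layout earns its keep.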
Since Nate Silver's feature article about New York neighborhoods came out, I have been working on capturing the data because so much was left unsaid in that article. His ranking formula takes 12 factors (housing affordability, transit, green space, nightlife, etc.) and combines individual scores into an overall score based on chosen weights (e.g. housing affordability counted for 25%). Scores are then converted to ranks.
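The mechanics of such a ranking can be sketched in a few lines. Only the 25% weight on housing affordability comes from the article; the other weights, the three-factor subset, and all neighborhood names and scores below are invented for illustration.

```python
# Only the 0.25 weight for housing affordability is from the article; the other
# weights are hypothetical, and this uses 3 factors rather than the full 12.
weights = {"housing affordability": 0.25, "transit": 0.35, "green space": 0.40}

# Hypothetical factor scores per neighborhood (NOT New York magazine's data).
neighborhoods = {
    "Neighborhood X": {"housing affordability": 80, "transit": 60, "green space": 50},
    "Neighborhood Y": {"housing affordability": 40, "transit": 90, "green space": 70},
    "Neighborhood Z": {"housing affordability": 60, "transit": 60, "green space": 60},
}


def overall_score(factor_scores, weights):
    """Weighted sum of the individual factor scores."""
    return sum(weights[f] * factor_scores[f] for f in weights)


overall = {name: overall_score(fs, weights) for name, fs in neighborhoods.items()}

# Convert scores to ranks: 1 is the highest overall score (ties not handled here).
ordered = sorted(overall, key=overall.get, reverse=True)
ranks = {name: position for position, name in enumerate(ordered, start=1)}

for name in ordered:
    print(f"{ranks[name]}. {name}: {overall[name]:.1f}")
```

Changing the weights can reorder the list, which is one reason to look at the individual factors and not just the final ranks.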
Silver's discussion focuses on explaining which factors caused which neighborhoods to be ranked high (or low). I'm interested in whether the individual factors are correlated. For example, do neighborhoods with more expensive housing also tend to have higher-quality housing? What about better schools? Are more diverse neighborhoods also more creative? And so on. There is really a treasure trove of information locked up in this data.
A scatter-plot matrix neatly organizes all of the pairwise correlation information. See below.
Each small chart shows the correlation between the given pair of variables (one listed on the right, the other listed below). The dots represent the neighborhoods. The pink patch contains the "middle 75%" of the neighborhoods, and we can use the orientation of these patches to get a sense of whether the two variables are positively, negatively or not correlated.
There is a lot to see in this chart. I just picked a few things at random for illustration:
In the top left corner, the slant shows that the more affordable the homes are, the worse is the transit.
The better the shopping, the better the dining.
Interestingly, more diversity seems to mean lower creative capital (although the correlation is only moderate).
Wellness scores fall within a rather narrow range compared to other categories, and they seem to be almost completely unrelated to any of the other factors.
(Note: I used JMP to generate this matrix. Excel unfortunately does not make scatter-plot matrices natively. JMP is great for such exploration... if the developers are reading this, please make it easier to man-handle the category labels! I made a mess of rotating the text on the right.)
P.S. I had an adventure processing the data from New York magazine. There appear to have been quite a few typos. For more, see my writeup on the book blog.
The gulf between infographics and statistical graphics, that is.
Stan at Mashable praised "5 Amazing Infographics for the Health Conscious". They belong to the class of "pretty things" that are touted all over the Web but from a statistical graphics perspective, they are dull.
Reader Mike L. poked me about the snake oil chart (right) while I was writing up this post. The snake oil chart is by David McCandless whose Twitter chart I liked quite a bit.
This one, not very much.
If the location and cluster membership of the substances depicted have some meaning, I might even feel ok about the effervescence. But I don't think so.
I continue to love his pithy text labels though; the "worth it line", truly.
The data (if verified) is pretty useful though since there are so many health supplements out there, and as a consumer, it's impossible to know which ones are sham. (Ben Goldacre's site may help.)
Now, let's run through the low lights of the rest:
I'm still trying to figure out what plus-minus means in the Dirty Water graphic.
The fact that the four buildings are not considered one complete unit also trips me up. The Truckee Meadows is depicted as 7 buildings, not divisible by 4. In addition, if 2 short buildings + 1 tall + 1 medium = 200,000 people, how many people live in 2 tall + 1 medium + 4 short buildings?
The obesity charts are pinatas.
The cost of health care chart is boring, just a prettied up data table. Why are life expectancy statistics expressed in 2 decimal places, and not in years and months?
Why 78.11 years and not 78 years (or 78 years, 1 month)?
The scatter chart relating survival rates of people with various ailments and the survival rates of viruses/bacteria left outside our bodies is alright but do we care about this correlation?
I hate to be so negative but I can't believe these are examples of good infographics.
My appeal for readers to send in positive examples still stands!
Stefan pointed us to his work for the UN GEO (United Nations Global Environment Outlook) data portal. This set of information posters highlights a vexing issue that crops up on Junk Charts from time to time, that is, the proper balance between information and entertainment value of data displays. While this blog concerns itself primarily with the former, it does not mean that we are blind to the flashier side of the enterprise.
Let's take Stefan's recycling spiral chart as an example. One must admit that visually this presentation is more appealing than either a data table or a set of bar charts. The reader can obtain the primary piece of information, which is the ranking of different countries in terms of the proportion of collected waste that is recycled.
And if the reader is curious enough, the chart also provides the data on the per-capita amount of waste collected in each of these countries. (Like the table and bar chart, this display also has the problem that it is one-dimensional, thus the countries can be sorted by proportion of recycling but then the waste collected data will be out of order.)
Readers who would like to understand the data better would want to know some of the following:
Is there a relationship between amount of waste collected and amount of waste recycled?
Are there differences in culture resulting in different recycling rates?
Is the level of development of a country predictive of its recycling rate?
Why are some countries recycling more of their waste, and others less?
To address these types of questions, one can start with the following scatter plot.
With the exception of South Korea, there is a general pattern of positive correlation: the more waste collected per capita, the larger the proportion of such waste recycled. Any dots that are not in the bottom left or top right quadrant are exceptions to the rule. These countries are labeled in red or blue, the former indicating that the amount of collection is above average while the rate of recycling is below average.
Because there is sampling error, dots that are close to the average dot (the center of this scatter plot) are probably just average. Roughly speaking, dots in the gray circle are close enough to the center that I would not consider them exceptional cases. That leaves Spain and Iceland in the red corner, and South Korea in the blue corner. If both data series are considered together, these three countries should merit attention; if only the proportion of recycling is considered, then one would pay attention to Italy, Turkey and Slovak Republic on the lower end and South Korea on the high end.
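The labeling logic described in the last two paragraphs might be sketched as below. It assumes both coordinates are already centered on the average (zero means average), and the circle radius is a hypothetical parameter, since I have not stated the exact radius of the gray circle.

```python
from math import hypot


def classify(z_waste, z_recycling, radius=1.0):
    """Label a country by its position relative to the average dot at (0, 0).

    `radius` is a hypothetical cutoff standing in for the gray circle.
    """
    if hypot(z_waste, z_recycling) <= radius:
        return "near average"      # inside the gray circle: not exceptional
    if z_waste > 0 and z_recycling < 0:
        return "red"               # above-average collection, below-average recycling
    if z_waste < 0 and z_recycling > 0:
        return "blue"              # below-average collection, above-average recycling
    return "follows pattern"       # bottom-left or top-right (or exactly on an axis)


# Made-up coordinates for illustration; not the values plotted on the chart.
examples = {
    "Country P": (0.3, -0.4),
    "Country Q": (1.5, -1.2),
    "Country R": (-1.0, 2.0),
    "Country S": (1.2, 1.1),
}

for name, (zw, zr) in examples.items():
    print(f"{name}: {classify(zw, zr)}")
```

The distance check comes first so that dots close to the center are never flagged, whichever quadrant they technically fall in.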
Scatter plots are very versatile. The following one explores the issue of development level. Surprisingly, the level of recycling seems to have little to do with development; the countries are quite widely scattered.
Technical note: The data on both axes are expressed in "standardized" units. So the zeroes represent the average per-capita waste collected, and the average proportion of waste recycled (only of those countries depicted in the original chart). +1 indicates an amount that is one standard deviation above the average. Think of "standardized units" as measuring how extreme a particular country is with respect to the average.
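For the curious, standardizing a series is a short computation. The sketch below uses the population standard deviation, which is an assumption on my part (the sample version would give slightly different values), and the input numbers are invented.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert raw values to standardized units: (value - average) / std deviation.

    Uses the population standard deviation; swapping in statistics.stdev
    would give the sample version.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical, so they cannot be standardized")
    return [(v - mu) / sigma for v in values]


# Made-up raw values for illustration (average 5, standard deviation 2).
raw = [2, 4, 4, 4, 5, 5, 7, 9]
z = standardize(raw)

for value, score in zip(raw, z):
    print(f"{value} -> {score:+.2f}")
```

A value equal to the average maps to 0, and a value one standard deviation above it maps to +1, regardless of the original units. That is what lets two series measured in different units share one plot.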