For information on upcoming meetings in which I am presenting, see this post on the sister blog.
Via Twitter, Antonio Rinaldi sent the following chart that accompanied a New York Times piece talking about the CPI (inflation index). The article concerns a very important topic--that many middle- to lower-income households have barely any savings after spending on necessities--and only touches upon the issue raised by this chart, which is that the official CPI is an average of prices of a basket of goods, and there is much variability in the price changes of different categories of goods.
I cover this subject in much greater detail in Chapter 7 of Numbersense (link). There are many reasons why the official inflation rate seems to diverge from our own experiences. One of the reasons is that we tend to notice and worry about price increases, but we either don't notice price decreases or take them for granted. In the book, I cover the fascinating subject of the psychology of remembering prices. Obviously, this is a subject of utmost importance if we are to use surveys to understand perceived prices.
The price of a T-shirt (unbranded) has remained the same or may have declined in the last decades. Besides, the chart reveals that phones and accessories, computers and televisions have all enjoyed deflation over the last decade. Actually, much of the "deflation" is due to a controversial adjustment known as "hedonics". The claim is that part of any price change should be attributed to product or technology improvements. So, if you pay the same price today for an HDTV as you paid in the past for a standard-definition TV, then in reality, the price you pay today is lower than the price you paid in the past.
That adjustment is reasonable only to a certain extent. For instance, my cell phone company stuffs my plan with hundreds of unused and unusable minutes so on a per-minute basis, I am sure prices have come down substantially but on a per-used-minute basis, I'm not so sure.
Let's get to what we care about on this blog... the visual. There is one big puzzle embedded in this chart. Look at the line for televisions. It dipped below -100 percent! Like Antonio, many readers should be scratching their heads--did the price of television go negative? did the hedonic adjustment go bonkers?
As an aside, I don't like the current NYT convention of hiding too many axis labels. What period of time is this chart depicting? You'd only find out by reading the label of the vertical axis! I mentioned something similar the other day.
The key to understanding a chart like this is to learn what is being plotted. The first instinct is to think the change in prices over time. A quick glance at the vertical axis label would correct that misunderstanding. It said "Change in prices relative to a 23% increase in price for all items, 2005-2014".
This label is doing a lot of work--probably too much for its inconspicuous location and unbolded, uncolored status.
Readers have to know that the official CPI is a weighted average of changes in prices of a specified basket of goods. Some but not all of the components are being graphed.
Then readers have to understand that there is an index of an index. The prices of each "item" (i.e. category or component of the CPI) are indexed to 1984 levels. So the price of televisions is first re-indexed to 2005 as the baseline. This establishes a growth trajectory for televisions. But this is not what is being depicted.
The blue line reflects the 23% average increase in prices in that 10-year period. Notice that the red line does not exhibit any weirdness--television prices have gone down by 90 percent. It's not negative.
What the designer tried to do is to index this data another time. Think of pulling the blue line down to the horizontal axis, and then see what happens to the gray and red lines.
Now, even this index on an index should not present a mathematical curiosity. If all items moved to 1.23 while apparel moved to 1.10, you might compute 110%/123%, which is roughly 0.9. You'd say the apparel index is 90% of the way to where the all-item index went. Similarly for TVs, you would compute 10%/123%, which is 0.08. That would be saying the TV index ended up at 8% of where the all-item index landed.
That still doesn't yield -100%. The clue here is that the baseline is zero percent, not 100, not 1.0, etc. So if there is an item that moved in sync with all items, its trajectory would have been horizontal at zero percent. That means that the second index is not a division but a subtraction. So for TV, it's -90% - 23% = -113%. For apparel, it's +10%-23% = -13%.
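To make the arithmetic concrete, here is a minimal Python sketch of the two ways of indexing against the all-item line. The function names are mine, not anything the NYT designers used; the inputs are the figures quoted above (a 23% rise for all items, a 90% drop for televisions, a 10% rise for apparel).

```python
# Two ways to express an item's price change relative to the all-item change.
# Function names are hypothetical; the numbers come from the chart discussed above.

def relative_by_subtraction(item_change, all_items_change):
    """Difference in percentage changes (what the chart appears to plot)."""
    return item_change - all_items_change


def relative_by_division(item_change, all_items_change):
    """Ratio of the two index levels, each starting from a baseline of 1.0."""
    return (1 + item_change) / (1 + all_items_change)


ALL_ITEMS = 0.23  # all-item prices rose 23% over 2005-2014
items = {"televisions": -0.90, "apparel": 0.10}

for name, change in items.items():
    diff = relative_by_subtraction(change, ALL_ITEMS)
    ratio = relative_by_division(change, ALL_ITEMS)
    print(f"{name}: subtraction {diff:+.0%}, division {ratio:.2f}")
```

The subtraction version can fall below -100% because it is not bounded by the item's own baseline, while the ratio of index levels can never go below zero.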
Even though I reverse-engineered the chart, I don't understand the reason for using subtraction rather than division for the second layer of indexing. It's strange to me to add or subtract two indices that have different baseline quantities.
Here is the same chart but using division:
I usually avoid telescoping indices. They are more trouble than they're worth. Here is an old post on the same subject.
Back in 2009, I wrote about a failed attempt to visualize regional dialects in the U.S. (link). The raw data came from Bert Vaux's surveys. I recently came across some fantastic maps based on the same data. Here's one:
These maps are very pleasing to look at, and also very effective at showing the data. We learn that Americans use three major words to describe what others might call "soft drinks". The regional contrast is the point of the raw data, and Joshua Katz, who created these maps while a grad student at North Carolina State, did wonders with the data. (Looks like Katz has been hired by the New York Times.)
The entire set of maps can be found here.
What more evidence do we need that effective data visualization brings data alive... the corollary being bad data visualization takes the life out of data!
Look at the side by side comparisons of two ways to visualize the same data. This is the "soft drinks" question:
And this is the "caramel" question:
The set of maps referred to in the 2009 post can be found here.
Now, the maps on the left are more truthful to the data (at the zip code level) while Katz applies smoothing liberally to achieve the pleasing effect.
Katz has a poster describing the methodology -- at each location on the map, he averages the closest data. This is why the white areas on the left-side maps disappear from Katz's maps.
The dot notation on the left-side maps has a major deficiency, in that it is a binary element: the dot is either present or absent. We lose the granularity of how strongly the responses are biased toward that answer. This may be the reason why in both examples, several of the heaviest patches on Katz's maps correspond to relatively sparse regions on the left-side maps.
Katz also tells us that his maps use only part of the data. For each point on his maps, he only uses the most frequent answer; in reality, there are proportions of respondents for each of the available choices. Dropping the other responses is not a big deal if the responses are highly concentrated on the top choice but if the responses are evenly split, or well-balanced say among the top two choices, then using only the top choice presents a problem.
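A small Python sketch shows what is lost when only the top answer is kept. The response shares below are made up for illustration (they are not from Vaux's surveys), and the helper name is hypothetical.

```python
# Illustration of the information lost by keeping only the most frequent answer.
# The shares below are invented for this example, not taken from the survey data.

def top_choice_with_margin(shares):
    """Return the most frequent answer and its lead over the runner-up."""
    ranked = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)
    (winner, top_share), (_, runner_up_share) = ranked[0], ranked[1]
    return winner, top_share - runner_up_share


concentrated = {"soda": 0.85, "pop": 0.10, "coke": 0.05}  # hypothetical location A
evenly_split = {"soda": 0.46, "pop": 0.44, "coke": 0.10}  # hypothetical location B

for label, shares in [("A", concentrated), ("B", evenly_split)]:
    winner, margin = top_choice_with_margin(shares)
    print(f"location {label}: top answer = {winner}, lead = {margin:.0%}")
```

Both locations would be painted as "soda" territory on a top-choice map, even though the lead is overwhelming in one and razor-thin in the other.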
Reader and tipster Chris P. found this "death spiral" chart dizzying (link).
It's one of those charts that has conceptual appeal but does not do the data justice. As the name implies, the designer has a strong message, that the arctic sea ice volume has dramatically declined over time. This message is there in the chart but the reader has to work hard to find it.
Why doesn't this spider chart work? We can be more precise.
This is a pity because the designer did very well in aligning two corners of the Trifecta Checkup, namely what is the question and what does the data show? It is a great idea to control for month of year, and look at year to year changes. (A more typical view would be to look at month to month changes and plot one line per year.)
This is an example of a chart that does well on one side of the checkup but the failure is that the graph isn't in tune with the data or the question being addressed.
Whenever I see a spider chart, I want to unroll the spiral and see if a line chart is better. Thus:
The dramatic decrease in Arctic ice volume (no matter the month) is clear as day. You can actually read off the magnitude of the drop. (Try doing that in the spider chart, say between 1978 and 1995.)
This chart still has issues, namely too many colors. One can color the lines by season of the year, like this:
Or switch to a small-multiples set up with three lines per chart and one chart per season.
The seasonal arrangement is not arbitrary. You can see the effect of season by looking at side by side boxplots:
The pattern is UP-DOWN-DOWN-UP.
In fact, a side-by-side boxplot of the data provides a very informative look:
The monthly series is obscured in this view, built into the vertical variability, which we can see is quite stable. The idea of controlling for month is to make it irrelevant. This view emphasizes the year on year decline of the entire distribution.
If you're worried about dropping too much information, the data can be grouped by season as before in a small-multiples setup like this:
Regardless of season, the trend is down.
PS. Alberto reminds me of his post about one example of a spider chart (radar chart) that works. Here's the link. It works because the graphical element is more in tune with the data. While the ice cap data has a linear trend over time, the voting data is all about differences in distribution. Also, the designer is expecting readers to care about the high-level pattern, not about the specifics.
For someone who doesn't know genetics, it is very hard to make sense of this chart. It seems like there are five characteristics that each unit of analysis can have (listed on the left column) and each unit possesses one or more of these characteristics.
There is one glaring problem with this visual display. The area of each subset is not proportional to the count it represents. Look at the two numbers in the middle of the chart, each accounting for a large chunk of the area of the green tree. One side says 5,724 while the other says 13, even though both sides have the same area.
In this respect, Venn diagrams are like maps. The area of a country or state on a map is not related to the data being plotted (unless it's a cartogram).
If you know how to interpret the data, please leave a comment. I'm guessing some kind of heatmap will work well with this data.
A quick search on Google reveals the extent of this pie pollution. Click this link to check it out!
So, I confess I don't know much about editing Wikipedia but it is easy!
Find a chart. Make your chart. Create a Wikipedia account. Use the upload wizard to get an image tag. Go to the Wiki page, click Edit, and paste the image tag there. And you're done.
There are way too many colors; the labels on the smaller pie pieces bleed onto their neighbors and the separation between natural and anthropogenic sources isn't as clear as it could be.
Here's what my revised version looks like:
In this case, I merely shrank the pie, using it as a legend. I also fixed the typo in the word anthropogenic.
You can do it too. Get started now! #onelesspie
The class practised doing critiques on the famous Wind Map by Fernanda Viegas and Martin Wattenberg.
Click here for a real-time version of the map.
I selected this particular project because it is a heartless person indeed who does not see the "beauty" in this thing.
Beauty is a word that is thrown around a lot in data visualization circles. What do we mean by beauty?
The discussion was very successful and the most interesting points of discussion were these:
Readers: would love to hear what you think.
Jens M., a long-time reader, submits a good graphic! This small-multiples chart (via Quartz) compares the consumption of liquor from selected countries around the world, showing both the level of consumption and the change over time.
What they did right:
The nicest feature was the XL scale applied only to South Korea. This destroys the small-multiples principle but draws attention to the top left corner, where the designer wants our eyes to go. I would have used smaller fonts throughout.
Having done so much work to simplify the data and expose the patterns, it's time to look at whether we can add some complexity without going overboard. I'd suggest using a different color to draw attention to curves that are strangely shaped -- the Ukraine comes to mind, so does Brazil.
I'd also consider adding the top liquor in each country... the writeup made a big deal out of the fact that most of the drinking in South Korea is of Soju.
One way to appreciate the greatness of the chart is to look at alternatives.
Here, the Economist tries the lazy approach of using a map: (link)
For one thing, they have to give up the time dimension.
A variation is a cartogram in which the physical size and shape of countries are mapped to the underlying data. Here's one on Worldmapper (link):
One problem with this transformation is what to do with missing data.
Wikipedia has a better map with variations of one color (link):
The Atlantic realizes that populations are not evenly distributed on the map so instead of coloring countries, they put bubbles on top of the map (link):
Unfortunately, they scaled the bubbles to the total consumption rather than the per-capita consumption. You guessed it: China gets the biggest bubble, much larger than anywhere else, but from a per-capita standpoint, China is behind many other countries depicted on the map.
PS. A note on submissions. I welcome submissions, especially if you have a good chart to offer. Please ping me if I don't reply within a few weeks. I may have just missed your email. Also, realize that submissions take even more time to research, since they are likely to be in an area I have little knowledge about, and mostly because you sent it to me hoping I'll research it. Sometimes I give up since it's taking too much time. If you ping me again, I'll let you know if I'm working on it.
The above does not apply to emails from people who are building traffic for their infographics.
PPS. Andrew Gelman chimes in with his take on small multiples.
Let's start with the tab labeled "Regional Price" which contains a well-executed map of the average gas prices by county:
The color scale is wonderful. It's just one color and yet the gradations are easily discerned. The general spatial pattern jumps out at you, with prices being higher on the Pacific coast, and lower from New England all the way down south. The Lakes region also has higher prices, as do New Mexico, Colorado and Hawaii.
The legend is just superb. Take a closer look:
What sets this legend apart is the varying lengths of the segments. In particular, the darkest blue corresponds to a wide range of prices (3.45-3.94). One can also easily figure out the lowest and highest price in the nation--the designers located exactly in which counties those prices were recorded, which is another nice touch.
To determine the breakpoints on the legend, one can use a statistical methodology: a standardized scale anchored on both sides of the national average price (from the other chart, the average price was $3.22). Then, we have each color mapping to the length of one standard deviation of prices in both directions. What this does is to put counties into standardized groups: for example, all counties whose prices were within one standard deviation above the average are given one tint while those that were one to two standard deviations above the average get a darker blue, and so on. In effect, we would have created a contour map.
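Here is a rough Python sketch of that standardized scale. The $3.22 average is the figure quoted above; the standard deviation of $0.20 is purely an assumption for illustration (I don't have the county-level data), and the function names are hypothetical.

```python
import math

# Standard-deviation-based legend breakpoints.
# MEAN_PRICE is the national average quoted in the post; ASSUMED_SD is an
# invented value used only to illustrate the mechanics.
MEAN_PRICE = 3.22
ASSUMED_SD = 0.20


def sd_breakpoints(mean, sd, n_bands_each_side):
    """Cut points at mean + k * sd for k = -n..n."""
    ks = range(-n_bands_each_side, n_bands_each_side + 1)
    return [round(mean + k * sd, 2) for k in ks]


def sd_band(price, mean, sd):
    """Band index: 0 is 'within one sd above the mean', -1 is 'within one sd below'."""
    return math.floor((price - mean) / sd)


print(sd_breakpoints(MEAN_PRICE, ASSUMED_SD, 2))
for price in (3.10, 3.30, 3.50):
    print(price, "-> band", sd_band(price, MEAN_PRICE, ASSUMED_SD))
```

Each band index would then map to one tint, with darker tints for bands further above the average.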
I see the designers' intention in clearly labeling the areas where they do not have data, with the diagonal stripes on white. My own preference is to put those areas in a mild gray, in effect blending them into the surroundings. In this way, the missing data do not distract the average reader, while the fastidious reader can still figure out where the data holes are.
This is a key learning for most research scientists. We have a tendency to train our eyes on the outliers and the data holes because they are like imperfections in diamonds. This leads us to the tendency of highlighting the least important message up front. And it's a bad habit.
In the following, I put the county and state level views side by side. The NYT graphic allows users to switch between the two views via a tab.
Much like the recent post on the age of buildings in Brooklyn, the state aggregates tell a simpler story but still capture almost all of the spatial pattern. The average prices per state are now printed directly on the chart. The question the designer should ask is what the readers want to learn from such a chart, and which view better meets those needs. It's possible the Times is catering to two types of readers. Perhaps one can strike a middle ground, which is to break out certain states like Texas into contiguous "regions".
One of the dangers of "Big Data" is the temptation to get lost in the details. You become so absorbed in the peeling of the onion that you don't realize your tear glands have dried up.
Hans Rosling linked to a visualization of tobacco use around the world from Twitter (link to original). The setup is quite nice for exploration. I'd call this a "tool" rather than a visual.
I appreciate the designer's concept -- the typical visualization of this type of data is looking at relative rates, which obscures the fact that China and India have far and away the most smokers even if their rates are middling (24% and 13% respectively).
This circular chart is supposed to show the absolute distribution of smokers across so-called "super-regions" of the world.
Unfortunately, the designer decided to pile on additional details. The concentric circles present a geography lesson, in effect. For example, high-income super-region is composed of high-income North America, Western Europe, high-income Asia Pacific, etc. and then high-income North America is composed of USA, Canada, etc.
Notice something odd? The further out you go, the larger the circular segments but the smaller the number of people they represent! There are more people in the super-region of high-income worldwide than in high-income North America and in turn, there are more people in the high-income North American region than in the USA. But the size of the graphical elements is reversed.
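The geometry behind this is easy to check. Below is a minimal Python sketch, assuming rings of equal width with made-up radii (I have not measured the actual chart); the function name is hypothetical. The area of a ring segment grows with distance from the center even when the angle stays the same.

```python
import math

# Area of a ring segment (annular sector) for a given angle.
# The radii below are assumed for illustration; they are not measured from the chart.

def ring_segment_area(r_inner, r_outer, angle_radians):
    """Area swept between two radii over the given angle."""
    return 0.5 * angle_radians * (r_outer**2 - r_inner**2)


angle = math.pi / 2  # a quarter of the circle
rings = [(1, 2), (2, 3), (3, 4)]  # three rings of equal width, inner to outer

for r_in, r_out in rings:
    area = ring_segment_area(r_in, r_out, angle)
    print(f"ring {r_in}-{r_out}: area = {area:.2f}")
```

With equal ring widths, the areas stand in the ratio 3 : 5 : 7 from the inside out, so the outermost ring gets the most ink for the same angle.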
In principle, the "bumps"-like chart used to show the evolution of tobacco prevalence in individual countries makes for a nice visual. In fact, Rosling marvelled that the global rate of consumption has fallen in recent years.
However, I'm often irritated when the designer pays no attention to what not to show. There are probably well above 200 lines densely packed into this chart. It is almost for sure that over-plotting will cause some of these lines to literally never see the light of day. Try hovering over these lines and see for yourself.
The same chart with say 10 judiciously chosen lines (countries or regions) provides the reader with a lot more profit.
The discerning reader figures out that the best visual actually does not even show up on the dashboard. Go ahead, and click on the tab called "Data" on top of the page. You now see a presentation of each country's "data" by age group and by gender. This is where you can really come up with stories for what is going on in different countries.
For example, the British have really done extremely well in reducing tobacco use. Look at how steep the declines are across the board for British men (in most parts of the world, the prevalence of smoking is much higher among men than women).
Bulgaria on the other hand shows a rather odd pattern. It is one of the few countries in the bumps chart that showed a climb in smoking rates, at least in the early 2000s. Here the data for men is broken down into age groups.
This chart exposes a weakness of the underlying data. The error bars indicate to us that what is being plotted is not actual data but modeled data. The error bars here are enormous. With the average at about 40% to 50% for many age groups, the confidence interval is also 40% wide. Further, note that there were only three or four observations (purple dots) and curves are being fitted to these three or four dots, plus extrapolation outside the window of observation. The end result is that the apparent uplift in smoking in the early 2000s is probably a figment of the modeler's imagination. You'd want to understand if there are changes in methodologies around that time.
As a responsible designer of data graphics, you should focus less on comprehensiveness and more on highlighting the good data. I'm a firm believer in "no data is better than bad data".