Darin M. points us to this speedometer chart, produced by IBM (larger version here). They call it the "Commuter Pain Index". I call it a prickly eyebrow eyelashes chart. You be the judge.
The "eyebrows" on this chart are purely ornaments. The only way to read this chart is to read the data labels, so it is a great example of failing the self-sufficiency test.
The simplest way to fix this chart is to unwrap the arc, turning this into a bar chart. The speedometer is a cute idea but very difficult to pull off because the city names are long text fields, and variable in length.
First, we must fix the vertical scale. For column charts, one must start at zero, without exception. The effect of not starting at zero is to chop off an equal-length piece from the bottom of each column, and in doing so, the relative lengths/areas of the columns are distorted. The amount of distortion can be very severe. For example, look at the fourth set of columns as shown below:
In both charts, I made the length of the first column the same so we are staring at comparative charts. The data plotted is exactly the same; the only difference is that the left chart starts the axis at zero. Notice that the huge difference seen on the right chart for the 4th pair of columns does not appear as extraordinary when the proper scale is used.
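The size of the distortion is easy to put a number on. Here is a minimal sketch, using made-up column values (not the data plotted above) and a hypothetical helper called apparent_ratio: the ratio of the drawn column lengths matches the ratio of the data only when the axis starts at zero.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of drawn column lengths when the axis starts at `baseline`."""
    if baseline >= min(a, b):
        raise ValueError("baseline must be below both values")
    return (a - baseline) / (b - baseline)

# Hypothetical pair of columns, for illustration only.
tall, short = 100.0, 90.0

true_ratio = apparent_ratio(tall, short)            # axis starts at zero
chopped_ratio = apparent_ratio(tall, short, 80.0)   # axis starts at 80

print(f"zero-based axis: {true_ratio:.2f}x")
print(f"axis from 80:    {chopped_ratio:.2f}x")
```

With these made-up values, an 11% difference in the data is drawn as a column twice as long once the bottom 80 units are chopped off.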
A multitude of other problems exist, not least that this chart is highly redundant. The same data (10 numbers) show up three times: once as data labels, once as column lengths (distorted), and once as levels on the vertical scale.
An alternative way to look at this data is the Bumps chart. Like this:
What this chart brings out is the variability of the estimated vehicle densities. In theory, the density estimate should be quite accurate for the "today" numbers. You'd think that in surveying 2,000+ people about how many vehicles they currently own, most people should be able to provide accurate counts.
The data paint a different picture. From quarter to quarter, the estimated "today" density shows a range of 1.90x to 2.00x in the 5 periods analyzed, which is roughly 5%, a difference which, according to the analyst, equates to 5 million vehicles! Given current vehicle sales of about 13 million per year, 5 million is almost 40% of the market.
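The arithmetic behind those percentages can be checked in a few lines. This is only a sketch that uses the figures quoted above; the variable names are mine.

```python
low, high = 1.90, 2.00          # range of "today" density estimates
swing_vehicles = 5_000_000      # the analyst's equivalent of that range
annual_sales = 13_000_000       # approximate vehicle sales per year

swing_pct_of_low = (high - low) / low * 100
swing_pct_of_high = (high - low) / high * 100
share_of_sales = swing_vehicles / annual_sales * 100

print(f"swing relative to the low estimate:  {swing_pct_of_low:.1f}%")
print(f"swing relative to the high estimate: {swing_pct_of_high:.1f}%")
print(f"swing as a share of annual sales:    {share_of_sales:.1f}%")
```

Whichever end of the range is used as the base, the swing is about 5%, and 5 million vehicles comes to a bit under 40% of a 13-million-vehicle year.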
So, one wonders how this survey was done, and one wants to know how large the margin of error of this estimate is. I also want to know whether the survey produces estimates of the number of households as well, since the vehicles-per-household metric has two variable components.
(Here's something especially for those like me who are stuck in their homes in the Northeast USA this weekend.)
A few readers weren't impressed by Nielsen's presentation of the smartphone marketplace:
This chart type is very popular, both among business consultants and statisticians. Consultants call them "marimekko charts" while statisticians call them "mosaic charts". It's got multiple names as it has been reinvented multiple times. I have nightmares from having to produce this sort of chart in PowerPoint by hand (deconstructing and reconstructing column charts), and I have written before about my dislike of them (see here, and here).
Supporters point to two advantages of this type of chart:
Equality: it puts the two dimensions of the market place -- operating system/software, and producer/brand -- on equal footing. As an added bonus, the areas of the rectangles are meaningful: they correspond to the relative market shares.
Structure: the chart often reveals interesting aspects of the structure of the data. For instance, here it shows that certain smartphones have "closed" systems where the OS and producer form a one-to-one relationship, while some producers like HTC make phones with different operating systems.
A little thought exposes these as false promises.
The two dimensions are, in fact, not equal. Look at the one contiguous column for Apple versus two separate sections for HTC. In order to know the market share of HTC, the reader needs to do additions... in his/her head. While this is not so hard when HTC appears only twice, your reader would not be amused if HTC appears seven times on the same mosaic. It is a limitation of this chart type that one cannot get the column sections to be of one piece without destroying the one-piece structure of the row sections.
In addition, I don't think it is easy to compare the areas of fat rectangles versus narrow rectangles, or squares versus long strips, etc. On consulting style charts, you almost always find the entire data set printed, which is to say, this chart is rendered not self-sufficient. On statistical charts, you typically find axis labels; this is not much better because of the difficulty in estimating relative areas.
The extent to which one can learn the structure of the data is restricted by our ability to estimate and sum areas.
In the junkart version, I use a flow chart. Special attention is paid to expressing as clearly as possible the structure of the marketplace, thus the separate sections for the "open" versus "closed" systems, as well as the many-to-many relationships among the "open"-system players.
The thickness of the flows is proportional to the market shares. I added a few data points to anchor the scale. The two dimensions of the data are treated symmetrically.
There is also no need to startle readers with a kaleidoscope of colors so typical of marimekkos.
Craig N. sent us to this infographic from Fast Company about MTV's 30th anniversary, nominating it as the worst infographic ever.
Apply the self-sufficiency test to this chart. Wish away the printed data. Now, does the chart convey any message? Where is the data embedded? Is it in the white dot, the black dot, the gold ring, the gold disc, the black ring, the eye-white? All of the above?
Now, do the same test on this chart (I removed the sales data, replacing it with years):
How would one compare the white to the orange? If one measures the lengths of the sides, the ratio of white to orange is about 1.32. If one compares areas of the squares, then the ratio is 1.73. Note that this requires the reader to see through the orange area to size up the area of the large white square. Alternatively, we can compute the ratio of the white area as observed to the orange square, and that ratio is 0.73.
The real ratio between 1980 and 2010 sales is given as 3.9/2.7 = 1.44. Given rounding errors, it seems like the designer may have used a ratio of lengths of the sides.
The problem is the same whether sides or areas are used. Can the reader figure out that 1980 sales were about 40% higher than 2010 sales?
I suspect that most of us react primarily to the visible areas, which means that we'd have gotten the direction of the change wrong, let alone the magnitude.
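The three candidate readings can be set side by side. This is a sketch only: it takes the measured side ratio of 1.32 as given and assumes the orange square sits entirely inside the white one, so the visible white area is the full white square minus the orange square.

```python
side_ratio = 1.32                  # white side / orange side, as measured above
area_ratio = side_ratio ** 2       # full white square / orange square
visible_ratio = area_ratio - 1     # uncovered white area / orange square
data_ratio = 3.9 / 2.7             # 1980 sales / 2010 sales, as printed

print(f"sides:         {side_ratio:.2f}")
print(f"full areas:    {area_ratio:.2f}")
print(f"visible areas: {visible_ratio:.2f}")
print(f"actual data:   {data_ratio:.2f}")
```

Squaring the rounded 1.32 gives roughly 1.74 rather than the 1.73 quoted above, a gap that rounding of the measurement would explain. None of the three readings lands on 1.44, and the visible-area reading falls below 1, which flips the direction of the change.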
Craig really dislikes this one. It's a variant of the racetrack chart. As any athlete knows, inner tracks are shorter than outer tracks. Could it be that days have gotten longer in the last 30 years? Apparently, the editors at Fast Company think so.
Felix Salmon spoke highly of this Wall Street Journal chart, and I agree.
Why do I like this? Although it's a basic chart, they did many little things well.
They are brave enough to not print any of the actual data on the chart. In other words, no loss aversion.
The legend is integrated onto the chart, not banished to some corner or border, requiring readers to stray from the graph. For added effect, the A, B, C labels imitate the actual signs posted outside the restaurants.
Sensible scales. It's even better if they would thin out the horizontal scale for the C rating, say make it 10-point intervals instead of 5-point intervals. Although this is hard to accomplish using conventional software, an axis with different intervals in different regions is surprisingly effective.
Using pencil-thin columns. The same chart with thicker columns would be both uglier and less effective.
(I'm not sure I like the up and right arrows on the axis titles. Is it better to remove the arrows and center the text?)
The New York Times Magazine published an article about marriage infidelities, which I didn't read, but it was popular enough that they did an online poll to obtain some instant feedback from readers. The result was shown in this cutesy graphic:
Note that they plotted the number of responses rather than proportion of responders even though all the numbers are between 0 and 100 and could easily have been misread as percentages.
This chart is another good illustration of the self-sufficiency principle. There is no need to create a chart if all the data are printed onto the chart, and readers must look at the data to learn anything from it. Imagine the above chart without the data, and you'll see why the data labels are critical to this chart.
Below is a version in which I removed all the data labels, replacing them with an axis:
The two pink slabs were thrown in for a little chart-check. According to the designer, 6+6+6 is larger than 20. How is this so? Look at a blow-up of the "God says otherwise" bar of hearts:
The one whole heart in each bar ruins the string of half hearts. Little things can introduce infidelities into charts.
Also from Consumer Strategist magazine comes the following chart about "PotatoPacks". (To their credit, the magazine uses a lot of charts, most of which are completely harmless.)
This is a good example of what Tufte calls a low data-to-ink ratio. There are exactly 10 pieces of data on this chart, the number of potato packs and the market share for a five-year period.
Many resources have been thrown at the problem of showing growth: it's a surround-sound treatment with loud speakers. The potatoes, the gridlines, the axis, the data labels. And yet, it's unclear what the message is.
According to the title, "both sales volume and market share have steadily increased since the introduction of the PotatoPack." It would have been a very nice touch to add a little arrow letting us know when PotatoPack was introduced. Was it in 2006 the starting point of the data set? Or was it in 2007 when the sales volume started to increase?
*** There is an unintended message. It's that all potatoes are not born equal!
Take a look at the two stacks labeled 572 and 493. How is it that 572 gets us 4 potatoes, and 493 gets us 3 potatoes? So for 2006, each potato is worth 143 packs while in 2007, it's worth 164 packs.
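The exchange rate between potatoes and packs takes one line each to compute. A small sketch using the two stacks quoted above (the dictionary layout is mine):

```python
# year -> (packs shown on the label, potatoes drawn in the stack)
stacks = {2006: (572, 4), 2007: (493, 3)}

packs_per_potato = {
    year: packs / potatoes for year, (packs, potatoes) in stacks.items()
}

for year, rate in packs_per_potato.items():
    print(f"{year}: one potato = {rate:.0f} packs")

# At the 2006 rate, the 493 stack would need a fractional potato.
potatoes_at_2006_rate = 493 / packs_per_potato[2006]
print(f"493 packs at the 2006 rate: {potatoes_at_2006_rate:.2f} potatoes")
```

At the 2006 rate, the 493 stack calls for about three and a half potatoes, which whole-potato icons cannot show.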
For 2010, they plotted projected data, which is exactly how it should be done.
The following chart shows the year-to-year growth rate of the PotatoPack sales relative to the growth rate of the entire market. This may be the more interesting aspect of this data set.
Phil, over at the Gelman blog, nominates this jaw-dropping graphic as the worst of the year. I have to agree:
Should we complain about the "pie chart"/4 quadrants representation with no reference to the underlying data? Or the "pie within a pie within a pie" invention, again defiantly not to scale? Or the creative license to exaggerate the smallest numbers in the chart ($2 billion, $0.3 billion), making them disproportionate to the other pieces? Or the complete usurping of proportions (e.g. the $0.2 billion green strip on the top right quadrant compared to the $0.3 billion tiny blue arc on the top left quadrant)?
Or the random sprinkling of labels and numbers around the circle even though if one takes the time, one notices that the entire chart contains only 8 numbers, as follows:
Instead, we can display the data with a small multiples layout showing readers how the data is structured along two dimensions.
Guess what the designer at Nielsen wanted to tell you with this chart:
Reader Steven S. couldn't figure it out, and chances are neither can you.
The smartphone (OS) market is dominated by three top players (Android, Apple and Blackberry) each having roughly 30% share, while others split the remaining 10%.
The age-group mix for each competitor is similar (or is it?)
Maybe those are the messages; if so, there is no need to present a bivariate plot (the so-called "mosaic" plot, or in consulting circles, the Marimekko). Having two charts carrying one message each would accomplish the job cleanly.
Trying to do too much in one chart is a disease; witness the side effects.
The two columns, counting from the right, contain rectangles that appear to be of different sizes, and yet the data labels claim each piece represents 1%, and in some cases "< 1%". The simultaneous manipulation of both the height and the width plays mind tricks.
Also, while one would ordinarily applaud the dropping of decimals from a chart like this, doing so actually creates the ugly problem that the five pieces of 1% (on the left column shown here) have the same width but clearly varying heights!
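To see how identical labels can sit on visibly different rectangles, consider a sketch with made-up unrounded shares (these are not Nielsen's figures):

```python
# Hypothetical unrounded shares, in percent of all smartphone users.
shares = [0.6, 0.8, 1.0, 1.2, 1.4]

labels = [round(s) for s in shares]       # what gets printed on the chart
size_spread = max(shares) / min(shares)   # what the eye actually compares

print("printed labels:", labels)
print(f"largest piece is {size_spread:.1f}x the smallest")
```

With these made-up values, every piece is labeled 1%, yet the largest rectangle is more than twice the size of the smallest.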
What about this section of the plot shown on the left? Does the smaller green box look like it's less than 1/3 the size of the longer green box? This chart is clearly not self-sufficient, and as such one might prefer a simple data table.
The downfall of the mosaic plot is that it gives the illusion of having two dimensions but only an illusion: in fact, the chart is dominated by one dimension, as all proportions are relative to the grand total.
For instance, the chart says that 6% of all smartphone users are between the ages of 18 and 24 AND use an Android phone. It also tells us that 2% of all smartphone users are between 35 and 44 AND use a Palm phone. Those are not two numbers anyone would desire to compare. There are hardly any practical questions that require comparing them.
Sometimes, the best way to handle two dimensions is not to use two dimensions.
The original article notes that "Of the three most popular smartphone operating systems, Android seems to attract more young consumers." In the chart shown below, we assume that the business question is the relative popularity of phone operating systems across age groups.
The right metric for comparison is the market share of each OS within an age group.
For example, tracing the black line labeled "Android", this chart tells us that Android has 37% of the 18-24 market while it has about 20% of the 65 and up market.
Android has an overall market share of about 30%, and that average obscures a youth bias that is linear with age.
On the other hand, the iPhone (green line) also has an average market share of about 30%, but its profile is pretty flat in all age groups except 65 and up, where it has considerable strength.
Further, the gap between Android and iPhone at the older age group actually opens up at 55 years and up. In the 55-64 age group, the iPhone holds a market share that is similar to its overall average while the Android performs quite a bit worse than its average. We note that Palm OS has some strength in the older age groups as well while the Blackberry also significantly underperforms in 65 and over.
Why aren't all these insights visible in the mosaic chart? It is all because the chosen denominator of the entire market (as opposed to each age group) makes a lot of segments very small, and then the differences between small segments become invisible when placed beside much larger segments.
Now, the reconstituted chart gives no information about the relative sizes of the age groups. The market size for the older groups is quite a bit smaller than the younger groups. This information should be provided in a separate chart, or as a little histogram tucked under the age-group axis.
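The change of denominator is a simple calculation. Below is a minimal sketch, assuming a table of shares of all smartphone users keyed by OS and age group; the numbers are invented for illustration and the function names are hypothetical. It produces both the within-age-group shares and the age-group sizes that would feed the little histogram.

```python
# Hypothetical shares of ALL smartphone users, in percent (not Nielsen's data).
share_of_total = {
    "Android":    {"18-24": 6.0, "65+": 1.0},
    "iPhone":     {"18-24": 5.0, "65+": 2.0},
    "Blackberry": {"18-24": 4.0, "65+": 1.0},
    "Other":      {"18-24": 1.0, "65+": 1.0},
}

def age_group_sizes(table):
    """Sum the shares of all OSes within each age group."""
    sizes = {}
    for by_age in table.values():
        for age, pct in by_age.items():
            sizes[age] = sizes.get(age, 0.0) + pct
    return sizes

def share_within_age_group(table):
    """Re-express each cell as a percent of its own age group."""
    sizes = age_group_sizes(table)
    return {
        os_name: {age: 100 * pct / sizes[age] for age, pct in by_age.items()}
        for os_name, by_age in table.items()
    }

sizes = age_group_sizes(share_of_total)
within = share_within_age_group(share_of_total)

print("age-group sizes:", sizes)
print("Android within each age group:", within["Android"])
```

In this made-up table, a difference of a few points of the grand total (6% versus 1%) becomes a gap that is easy to read once each age group is its own denominator.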
That sounds like a silly question. Isn't the answer self-evident? Am I suggesting that we banish the discipline of charting?
Maybe I won't go so far. But it's difficult not to have such a destructive thought when one stares at charts like this:
Now, compare the above with this version shown on the right ... and it's clear all the squares and bubbles and colors gave us nothing. Readers have to read the fine print in order to take in the unequal distribution of income. This chart violates the notion of self-sufficiency we often speak about.
Peering back at the original chart, we find that the entire square grid edifice only serves to explain that 0.01% is one-tenth of 0.1%, which is one-tenth of 1%, etc. On the other hand, the part that has a chance of conveying the main message -- the relative size of the biggest bubble versus the smallest bubble -- is shoved off the screen. The gigantic yellow bubble being mostly off the chart, readers are essentially asked to read the data labels.
The same article (via Yahoo!) contains other charts that are well executed.
This one, for instance, shows the increasing inequality very well. (The legend is on the left panel, which I did not include here: the top red line is the top 1%, the other five lines are the quintiles or 20% buckets.) At least four-fifths of the country is worse off now than in 1980 in terms of their share of after-tax income.