
Chart of the day improved, needs better data

Felix Salmon chose the following chart as his chart of the day. The chart originally appeared in this American Banker article.

The grouped column chart is one of the most over-used, least effective charts out there. While the message is simple -- that large banks sharply cut back on offering free checking accounts between 2009 and 2011 -- this chart doesn't make it easy on the reader.

You have to jump over two columns to compare 96% and 34.6%. Then you have to find the legend to decode the colors. Then you have to jump through columns again to see the relative change for the other two types of institutions. (By the way, the practice of using font color in a legend is cute but perilous. There are lots of people who still use grayscale printers out there!)


A simple change to a line chart almost always solves the problem.

There is no way to miss the message looking at this chart:


There is an even better way to convey this information. Instead of grouping by year, group by type of institution. Like this:



Now moving to a different corner of our Trifecta checkup, we find that the wrong data was used to analyze the situation.

The numbers given on the chart are proportions of institutions offering free checking accounts. So, if there are 10,000 banks in the U.S., and 8,000 offer free checking, the proportion is 80%.

Well, not all banks are created equal. In fact, we have "too big to fail" banks, and we have lots of small banks. This table here tells us that there are only 34 banks that have assets over $50 billion, and each of these banks likely has millions of checking accounts. This Bloomberg article tells us Chase has 10.8 million checking accounts under one roof. If Chase doesn't offer free checking, that affects 10.8 million accounts, while a local credit union not offering free checking may only affect thousands of accounts.

So, to paint the proper picture, we'd need to divide the number of free checking accounts by the total number of checking accounts, or the number of customers who have at least one free checking account by the total number of customers who have at least one checking account.

Since it is the mega banks who are rushing to take away free checking, this chart based on number of banks rather than number of accounts severely under-reports the trend.
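To see how far the two measures can diverge, here is a minimal sketch in Python. Every bank name and account count in it is made up for illustration; none of it comes from the American Banker data. It also assumes, as a rough proxy, that every account at a bank offering free checking counts as a free account.

```python
# Hypothetical figures for illustration only; not taken from the article.
banks = [
    # (name, checking_accounts, offers_free_checking)
    ("Mega Bank A", 10_000_000, False),
    ("Mega Bank B", 8_000_000, False),
    ("Community Bank C", 5_000, True),
    ("Community Bank D", 4_000, True),
    ("Credit Union E", 3_000, True),
]

def share_of_institutions(banks):
    """Fraction of institutions that offer free checking (what the chart plots)."""
    offering = sum(1 for _, _, is_free in banks if is_free)
    return offering / len(banks)

def share_of_accounts(banks):
    """Fraction of accounts held at institutions that offer free checking.

    Proxy assumption: all accounts at a free-checking bank are free accounts.
    """
    total_accounts = sum(accounts for _, accounts, _ in banks)
    free_accounts = sum(accounts for _, accounts, is_free in banks if is_free)
    return free_accounts / total_accounts

by_institution = share_of_institutions(banks)
by_account = share_of_accounts(banks)
print(f"{by_institution:.0%} of institutions, {by_account:.2%} of accounts")
```

With these invented numbers, a majority of institutions offer free checking while only a sliver of accounts sit at those institutions, which is the kind of gap the count-of-banks metric hides.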


False promises of equality and structure

(Here's something especially for those like me who are stuck in their homes in the Northeast USA this weekend.)

A few readers weren't impressed by Nielsen's presentation of the smartphone marketplace:


This chart type is very popular, both among business consultants and statisticians. Consultants call them "marimekko charts" while statisticians call them "mosaic charts". It has multiple names because it has been reinvented multiple times. I have nightmares from having to produce this sort of chart in PowerPoint by hand (deconstructing and reconstructing column charts), and I have written before about my dislike of them (see here, and here).


Supporters point to two advantages of this type of chart:

  • Equality: it puts the two dimensions of the market place -- operating system/software, and producer/brand -- on equal footing. As an added bonus, the areas of the rectangles are meaningful: they correspond to the relative market shares.
  • Structure: the chart often reveals interesting aspects of the structure of the data. For instance, here it shows that certain smartphones have "closed" systems where the OS and producer form a one-to-one relationship, while some producers like HTC make phones with different operating systems.

A little thought exposes these as false promises.

The two dimensions are, in fact, not equal. Look at the one contiguous column for Apple versus two separate sections for HTC. In order to know the market share of HTC, the reader needs to do additions... in his/her head. While this is not so hard when HTC appears only twice, your reader would not be amused if HTC appears seven times on the same mosaic. It is a limitation of this chart type that one cannot get the column sections to be of one piece without destroying the one-piece structure of the row sections.

In addition, I don't think it is easy to compare the areas of fat rectangles versus narrow rectangles, or squares versus long strips, etc. On consulting-style charts, you almost always find the entire data set printed, which is to say, the chart is not self-sufficient. On statistical charts, you typically find axis labels; this is not much better because of the difficulty in estimating relative areas.

The extent to which one can learn the structure of the data is restricted by our ability to estimate and sum areas.
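The additions the reader is asked to do mentally can be spelled out in a short Python sketch. The shares below are invented for illustration and are not Nielsen's figures; the point is only that a producer split across operating systems has to be summed from separate rectangles.

```python
# Hypothetical market shares (percent); NOT the numbers from the Nielsen chart.
shares = {
    # (operating_system, producer): share
    ("Android", "HTC"): 14,
    ("Android", "Samsung"): 8,
    ("Windows", "HTC"): 6,
    ("iOS", "Apple"): 28,
}

def totals(shares, axis):
    """Sum shares along one dimension: axis=0 for OS, axis=1 for producer."""
    out = {}
    for key, value in shares.items():
        out[key[axis]] = out.get(key[axis], 0) + value
    return out

by_os = totals(shares, 0)
by_producer = totals(shares, 1)

# Apple sits in one contiguous block, so its share can be read off directly.
# HTC appears under two operating systems, so its share must be added up.
print(by_producer["Apple"], by_producer["HTC"], by_os["Android"])
```

On the mosaic, only one of these two sets of totals can be drawn as contiguous pieces; the other has to be reconstructed by the reader.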


In the junkart version, I use a flow chart. Special attention is paid to expressing as clearly as possible the structure of the marketplace, thus the separate sections for the "open" versus "closed" systems, as well as the many-to-many relationships among the "open"-system players.


The thickness of the flows is proportional to the market shares. I added a few data points to anchor the scale. The two dimensions of the data are treated symmetrically.

There is also no need to startle readers with a kaleidoscope of colors so typical of marimekkos.






Did days get longer in the last 30 years? Fast Company thinks so.

Craig N. sent us to this infographic from Fast Company about MTV's 30th anniversary, nominating it as the worst infographic ever.


Apply the self-sufficiency test to this chart. Wish away the printed data. Now, does the chart convey any message?  Where is the data embedded? Is it in the white dot, the black dot, the gold ring, the gold disc, the black ring, the eye-white? All of the above?

Now, do the same test on this chart (I removed the sales data, replacing it with years):


How would one compare the white to the orange? If one measures the lengths of the sides, the ratio of white to orange is about 1.32. If one compares areas of the squares, then the ratio is 1.73. Note that this requires the reader to see through the orange area to size up the area of the large white square. Alternatively, we can compute the ratio of the white area as observed to the orange square, and that ratio is 0.73.

The real ratio between 1980 and 2010 sales is given as 3.9/2.7 = 1.44. Given rounding errors, it seems like the designer may have used a ratio of lengths of the sides.

The problem is the same whether sides or areas are used. Can the reader figure out that 1980 sales were about 40% higher than 2010 sales?

I suspect that most of us react primarily to the visible areas, which means that we'd have gotten the direction of the change wrong, let alone the magnitude.
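The competing readings of the two squares can be checked with a few lines of Python. This is a sketch that starts from the measured side ratio of 1.32 quoted above and assumes the orange square lies entirely inside the white one; squaring the rounded side ratio gives about 1.74, close to the 1.73 obtained by measuring areas directly.

```python
# Measured on the graphic (from the post): white side / orange side.
side_ratio = 1.32

# Full white square versus orange square, if the reader compares areas.
area_ratio = side_ratio ** 2

# White area actually visible (not covered by orange) versus orange square.
# Assumes the orange square sits entirely inside the white square.
visible_ratio = area_ratio - 1

# Ratio in the underlying data: 1980 sales / 2010 sales.
data_ratio = 3.9 / 2.7

print(f"sides: {side_ratio:.2f}, areas: {area_ratio:.2f}, "
      f"visible: {visible_ratio:.2f}, data: {data_ratio:.2f}")
```

Only the visible-area reading falls below 1, which is why a reader relying on it would conclude that sales went up rather than down.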



Craig really dislikes this one. It's a variant of the racetrack chart. As any athlete knows, inner tracks are shorter than outer tracks. Could it be that days have gotten longer in the last 30 years? Apparently, the editors at Fast Company think so.
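The geometry behind the racetrack complaint is simple enough to sketch in Python. The radii below are arbitrary assumptions, not measurements from the Fast Company graphic; the sketch only shows that the same fraction of a circle produces a longer arc on an outer ring.

```python
import math

def arc_length(radius, fraction_of_circle):
    """Length of an arc that covers the given fraction of a full circle."""
    return 2 * math.pi * radius * fraction_of_circle

# Two rings showing the same value (half a circle) at assumed radii.
inner = arc_length(radius=1.0, fraction_of_circle=0.5)
outer = arc_length(radius=2.0, fraction_of_circle=0.5)

# Same data value, but the outer arc is drawn twice as long.
print(f"outer/inner length: {outer / inner:.1f}")
```

A reader who compares the lengths of the arcs, rather than their angles, will see the outer ring as larger even when the values are identical.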

Nielsen's cross-platform crossing diagram crosses up readers

My friend Augustine F., who's a data-savvy guy, couldn't figure out what's going on with this chart in Nielsen's cross-platform report.


It's a case of a Bumps chart done poorly.

The reader must first read the opening pages of the report to get their bearings. The two charts are supposed to investigate the correlation between streaming video and regular TV. What causes the confusion is that the populations being analyzed are different between the two charts.

In the left chart, they exclude anyone who does not watch streaming video (35% of the sample), and then divide those who watch streaming video into five equal-sized segments based on how much they watch. Then, they look at how much regular TV each segment watches on average.

In the right chart, they exclude anyone who does not watch regular TV (just 0.5% of the sample), and then divide those who watch regular TV into five equal-sized segments based on how much they watch. Then, they look at how much online streaming video each segment watches on average.


What crosses us up is the relative scales. The scale for regular TV viewing is tightly clustered between 212 and 247 daily minutes on the left chart but has a wide range between 24 and 522 on the right chart. The impression given by the designer is that the same population (18-34 year olds) is divided into five groups (quintiles) for each chart, albeit using different criteria. It just doesn't make sense that the group averages do not match.

The reason for this mismatch is the hugely divergent rates of exclusion as described above. What the chart seems to be saying is that the 65% who use streaming video have very similar TV viewing behavior (about 220 daily minutes). In other words, we surmise that most of those people on the left chart map to groups 2 and 3 on the right chart.

Who are the people in groups 1, 4 and 5 on the right chart? It appears that they are the 35% who don't watch streaming video. Thus, the real insight of this chart is that there are two types of people who don't watch streaming video: those who watch very little regular TV at all, and those who watch twice the average amount of regular TV.


Here's another puzzle: Nielsen claims that high streaming = low TV and low streaming = high TV. Is it really true that high streaming = low TV? Take the segment of highest streaming (#1 on the left chart). This group, which is 13% of the survey population, accounts for 83% of the streaming minutes -- almost 71,000 out of 86,000 minutes. Now look at the right chart. It turns out that the streaming minutes are quite evenly distributed among those TV-based quintiles, ranging from 15,000 minutes to 23,000 minutes each.

So, it is impossible to fit all of the top streaming quintile into any one TV quintile - they have too many streaming minutes. In fact, the top streaming quintile must be quite spread out among the TV quintiles since each of the TV quintiles is 1.5 times the size of a streaming quintile!

So, we must conclude that customers who stream a lot include both fervent TV fans as well as those who watch little TV.
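The arithmetic behind this argument can be laid out in a short Python sketch. It takes the rounded figures quoted above at face value (65% streamers, 99.5% TV watchers, roughly 71,000 of 86,000 streaming minutes in the top streaming quintile, and at most 23,000 streaming minutes in any TV quintile) and assumes the minutes on the two charts are measured in the same units.

```python
import math

# Shares of the survey population left after each chart's exclusion.
streaming_share = 1 - 0.35   # left chart drops the 35% who never stream
tv_share = 1 - 0.005         # right chart drops the 0.5% who never watch TV

# Each chart splits its remaining population into five equal groups.
streaming_quintile = streaming_share / 5
tv_quintile = tv_share / 5
size_ratio = tv_quintile / streaming_quintile

# Concentration of streaming minutes in the top streaming quintile.
top_minutes = 71_000
total_minutes = 86_000
top_share = top_minutes / total_minutes

# No TV quintile holds more than about 23,000 streaming minutes, so the top
# streamers' minutes must be spread over at least this many TV quintiles.
max_minutes_per_tv_quintile = 23_000
min_tv_quintiles = math.ceil(top_minutes / max_minutes_per_tv_quintile)

print(f"streaming quintile: {streaming_quintile:.1%} of population")
print(f"TV quintile is {size_ratio:.2f}x a streaming quintile")
print(f"top streaming quintile holds {top_share:.0%} of streaming minutes")
print(f"spread over at least {min_tv_quintiles} TV quintiles")
```

Under these assumptions the heaviest streamers have to show up in most of the TV quintiles, which is what rules out a clean "high streaming = low TV" story.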


In a return-on-effort analysis, this is a high-effort, low-reward chart.


The return on effort in data graphics

I contributed the following post to the Statistics Forum. They are having a discussion comparing information visualization and statistical graphics. I use the following matrix to classify charts in terms of how much work they make readers do, and how much value readers get out of doing said work.



To read the rest of it, click here.