
The state of charting software

Andrew Wheeler took the time to write code (in SPSS) to create the "Scariest Chart ever" (link). I previously wrote about my own attempt to remake the famous chart in grayscale. I complained that this chart is easier to make in the much-maligned Excel paradigm than in a statistical package: "I find it surprising how much work it would be to use standard tools like R to do this."

Andrew disagreed, saying "anyone saavy with a statistical package would call bs". He goes on to do the "Junk Charts challenge," which has two parts: remake the original Calculated Risk chart, and then, make the Junk Charts version of the chart.

I highly recommend reading the post. You'll learn a bit of SPSS and R (ggplot2) syntax, and the philosophy behind these languages. You can compare and contrast different ways of creating the charts, as well as the output that the various programs generate.

I'll leave you to decide whether the programs he created are easier than Excel.


Unfortunately, Andrew skipped over one of the key challenges that I envision for anyone trying to tackle this problem. The data set he started with, which he found from the Minneapolis Fed, is post-processed data. (It's a credit to him that he found a more direct source of data.) The Fed data is essentially the spreadsheet that sits behind the Calculated Risk chart. One can just highlight the data, and create a plot directly in Excel without any further work.

What I started with was the employment level data from BLS. What such data lacks is the definition of a recession, that is, the starting year and ending year of each recession. The data also comes in calendar months and years, and transforming that to "months from start of recession" is not straightforward. If we don't want to "hard code" these details, that is, if we want the definition of a recession to stay flexible so that the code works as a more general application, the challenge is more severe.
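To make the transformation concrete, here is a minimal Python sketch of re-indexing a monthly series by "months from start of recession". It is not Andrew's code or mine. The function names, the choice of the start month as the baseline, and the toy employment levels are all assumptions for illustration; the recession start is passed in as a parameter rather than hard-coded.

```python
from datetime import date


def months_between(start, current):
    """Whole calendar months from `start` to `current`."""
    return (current.year - start.year) * 12 + (current.month - start.month)


def align_to_recession(levels, start):
    """Re-index a monthly series by months since `start`.

    levels: dict mapping date (first of each month) -> employment level
    start:  date of the recession start (a parameter, so the
            definition of a recession stays flexible)
    Returns a list of (months_since_start, pct_change_from_start).
    """
    base = levels[start]
    aligned = []
    for d in sorted(levels):
        if d < start:
            continue  # drop months before the recession began
        pct = 100.0 * (levels[d] - base) / base
        aligned.append((months_between(start, d), pct))
    return aligned


# Hypothetical toy levels, NOT real BLS figures.
toy_levels = {
    date(2000, 11, 1): 990,
    date(2000, 12, 1): 1000,
    date(2001, 1, 1): 990,
    date(2001, 2, 1): 980,
}
print(align_to_recession(toy_levels, date(2000, 12, 1)))
# [(0, 0.0), (1, -1.0), (2, -2.0)]
```

Even in this stripped-down form, the recession dates have to come from somewhere outside the employment data, which is the step the post-processed Fed data had already taken care of.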


Another detail that Andrew skimmed over is the uneven length of the data series. One of the nice things about the Calculated Risk chart is that each line terminates upon reaching the horizontal axis. Even though more data is available for out years, that part of the time series is deemed extraneous to the story. This creates an awkward dataset where some series have say 25 values and others have only 10 values. While most software packages will handle this, more code needs to be written either during the data processing phase or during the plotting.

By contrast, in Excel, you just leave the cells blank where you want the lines to terminate.
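In code, the equivalent of leaving cells blank has to be spelled out. The sketch below is one way it might be done in Python, under my own assumptions: each series is a list of percent changes from the start of a recession, a line "terminates" the first time it climbs back to zero or above after having dipped, and blanks are represented by None. The function names are made up for this example.

```python
def truncate_at_recovery(pct_changes):
    """Keep values up to and including the first return to zero
    (the horizontal axis) after the series has gone negative."""
    kept = []
    dipped = False
    for value in pct_changes:
        kept.append(value)
        if value < 0:
            dipped = True
        elif dipped:
            break  # back at or above the axis: stop here
    return kept


def pad_with_blanks(series, length):
    """Pad a series with None so all series share one length,
    mimicking blank spreadsheet cells."""
    return series + [None] * (length - len(series))


# Hypothetical series, for illustration only.
recovered = truncate_at_recovery([0.0, -1.0, -2.0, -0.5, 0.3, 1.2])
ongoing = truncate_at_recovery([0.0, -1.5, -3.0])
print(recovered)                    # [0.0, -1.0, -2.0, -0.5, 0.3]
print(pad_with_blanks(ongoing, 5))  # [0.0, -1.5, -3.0, None, None]
```

None of this is hard, but it is extra code that the spreadsheet user never has to write.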


In the last section, Andrew did a check on how well the straight lines approximate the real data. You can see that the approximation is extremely good. (The two panels where there seems to be a difference are due to a disagreement in the data as to when the recession started. If you look at 1974 instead of 1973, and also follow Calculated Risk's convention of having a really short recession in 1980, separate from that of 1981, then the straight lines match superbly.)



I'm the last person to say Excel is the best graphing package out there. That's not the point of my original post. If you're a regular reader, you will notice I make my graphs using various software, including R. I came across a case where I think current software packages are inferior, and would like the community to take notice.

Mix percent metaphors, add average confusion, and serve

Sometimes, a chart just strains your mind. Such is the case with the following, a tip from Augustine F. (@acfou)


There are just so many percentages on the chart it's really hard to figure out which is which.

Under the title, it hints that they are showing results from a poll. The legend implies that the poll asks for estimates of budget and revenue allocations: one imagines the questions were "what proportion of your marketing budget is allocated to digital?" and "what proportion of your revenues is attributed to digital?" On top of the bars are some percentages, presumably percentages of respondents. Perhaps, or perhaps not. The column labels clearly add up to over 100% since there are two columns in the 30-35% range.

Under the axis, we have buckets of percentages. Are they percentages of people, of budgets or of revenues? Why and how are they bucketed?

My best guess is that the survey is a multiple-choice with 11 choices corresponding to the groups of columns. The axis labels refer to both percentage of budget and percentage of revenues, depending on which column you're looking at.

What is maximally confusing is the last set of columns, labeled "Average", with values in the 35% range. It is most likely not a choice in the survey. They somehow came up with an average based on the responses. So maybe I was wrong about the multiple-choice format: if the raw data comes in buckets like 61 to 70%, there is no easy way to average these responses. Maybe they asked for two exact percentages, and then grouped them afterwards.
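To see why bucketed responses resist averaging, consider this small Python sketch. It assumes, purely for illustration, that every respondent in a bucket sits at the bucket midpoint; that is one common convention, not something the chart tells us. The buckets and shares are made up, and the function names are mine.

```python
def weighted_average(points_and_shares):
    """Average of points weighted by share of respondents."""
    total = sum(share for _, share in points_and_shares)
    return sum(point * share for point, share in points_and_shares) / total


def bucket_average_range(buckets):
    """buckets: list of ((low, high), share_of_respondents).

    Returns (lowest possible, midpoint-based, highest possible)
    averages consistent with the bucketed data.
    """
    lowest = weighted_average([(lo, s) for (lo, _), s in buckets])
    middle = weighted_average([((lo + hi) / 2, s) for (lo, hi), s in buckets])
    highest = weighted_average([(hi, s) for (_, hi), s in buckets])
    return lowest, middle, highest


# Hypothetical survey: half the respondents in each of two buckets.
toy_buckets = [((0, 10), 50), ((11, 20), 50)]
print(bucket_average_range(toy_buckets))
# (5.5, 10.25, 15.0)
```

With buckets ten points wide, the "average" can move by almost ten points depending on where respondents sit inside their buckets, so a single printed average hides a strong assumption.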


To sum all that up, the percentages on top of the columns are percentages of respondents, except in the last set of columns, where they are percentages of budget (or revenues). The percentages of budget (or revenues) are sitting on the horizontal axis, except in the last label, called "Average", where it means the average respondent.


There is a problem with my interpretation. It makes the chart completely worthless!

What use is it to learn that "16% of the respondents say they allocate 11-20% of their budget on digital while 12% of the respondents say they derive 11-20% of their revenues from digital"?

You might be interested in whether there is a return on investment to the money spent on digital marketing. You'd then need to know, for a given company, what proportion of budget was spent on digital marketing versus what proportion of revenues was attributed to that marketing. In this chart, there is no linkage -- the companies who say they spend 11-20% on digital may or may not be the same set of companies who say they derive 11-20% from digital spend.
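The missing linkage can be demonstrated with a toy example. In the Python sketch below, two made-up sets of four respondents produce exactly the same column heights for budget and for revenues, yet tell opposite stories about how budget relates to revenues. All names and data are hypothetical.

```python
from collections import Counter


def marginals(responses):
    """responses: list of (budget_bucket, revenue_bucket), one per respondent.
    Returns the two separate tallies that a column chart would show."""
    budget_counts = Counter(budget for budget, _ in responses)
    revenue_counts = Counter(revenue for _, revenue in responses)
    return budget_counts, revenue_counts


# Respondents whose revenue share matches their budget share.
aligned = [
    ("11-20%", "11-20%"), ("11-20%", "11-20%"),
    ("21-30%", "21-30%"), ("21-30%", "21-30%"),
]
# Respondents whose revenue share is the opposite of their budget share.
crossed = [
    ("11-20%", "21-30%"), ("11-20%", "21-30%"),
    ("21-30%", "11-20%"), ("21-30%", "11-20%"),
]

print(marginals(aligned) == marginals(crossed))  # True
print(aligned == crossed)                        # False
```

Any chart built only from the two tallies cannot distinguish these two worlds.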

If the survey asked for exact percentages, then I'd prefer to see a scatter plot, showing proportion of budget on one axis, and proportion of revenues on the other axis, each dot representing a respondent.


A final note: it is worth asking what types of people answer this survey. Pretty much the only people in a company who can answer this question accurately are the heads of marketing. If you are working for the head of marketing, you likely know the details of a particular segment of marketing but not the aggregate numbers. If you work in a different department, there is little to no chance that you have any useful knowledge about marketing budgets and revenue allocations.

One would also appreciate it if all such pictures include the sample size.

Getting inside my head

[This is a cross-post from the sister blog, Numbers Rule Your World]

Some interviews with me or snippets of such have surfaced recently. Here is a list:

Kate Meersschaert interviewed me for New Learning Times (link; registration required). I talked about my teaching philosophy, and why I write books.

Jay Ulfelder, a political scientist who keeps an interesting blog, recommends Numbers Rule Your World, and a few other books for political scientists (link).

If you haven't heard already, 2013 is the International Year of Statistics. I was one of the talking heads here.

Here, I talk about the history of Junk Charts, and the new paradigm of interactively building graphs, as opposed to the template paradigm popularized by Excel.


Cat and dog food, for thought

My friend Rhonda (@RKDrake) sends me to this pair of charts (in BusinessWeek). They are fun to look at, and to ponder.

Here's the first chart:

Should the countries be colored according to the distance from the Equator?

Is this implying that cats and dogs have different preferential habitats?

Is there a lurking variable that is correlated with distance from equator?

What is the relationship between cat and dog owners?

Is there any significance to countries sitting on that diagonal, whereby the proportion of households owning dogs is the same as that owning cats?

In particular, what proportion of these households have both dogs and cats?

If 20% of households have cats, and 20% of households have dogs, how many of these households are the same ones?

How are the countries selected?

Where does the data come from?

The data provider is named but is the data coming from surveys? Are those randomized surveys?

Are the criteria used to collect data the same across all these countries?
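On the question above about households with both cats and dogs, the two separate percentages only bound the overlap; they do not pin it down. A minimal Python sketch of that arithmetic follows (the function name is mine, and the 20% figures are the hypothetical ones from the question, not data from the chart).

```python
def overlap_bounds(pct_cats, pct_dogs):
    """Smallest and largest possible percentage of households
    owning BOTH, given only the two separate percentages."""
    lower = max(0.0, pct_cats + pct_dogs - 100.0)
    upper = min(pct_cats, pct_dogs)
    return lower, upper


print(overlap_bounds(20, 20))  # (0.0, 20)
print(overlap_bounds(70, 60))  # (30.0, 60)
```

So with 20% and 20%, anywhere from none to all of the cat households could also be dog households.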

The other chart is about cat and dog food. Again, nice aesthetics, clean execution. Lots of questions but worth looking at. Enjoy.


Italy burning, by poor timing

Here's a chart from one of the Italian dailies I picked up in Rome last August. It apparently plots the number of hectares of farmland that were burnt during various fires over time.


While the chart is clean and pleasing to the eye, it has a malformed time axis. In the side-by-side comparison shown below, you can see how the evenly-spaced time axis completely distorts the cadence of the data.


In fact, the data should be put into a bar chart, rather than a line chart. Lines are used primarily to denote trends, and sometimes to compare profiles. Neither of these cases applies here.


The bar chart also requires proper spacing, so that it shows the years in which no hectares were burnt by fires.
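Getting the spacing right usually means filling in the missing years before plotting. Here is a minimal Python sketch, assuming that a year absent from the data means zero hectares burnt (an assumption worth checking against the source). The numbers are made up and do not come from the Italian chart.

```python
def fill_missing_years(hectares_by_year):
    """Return (year, hectares) for every year in the range,
    inserting 0 for years that are absent from the data."""
    first, last = min(hectares_by_year), max(hectares_by_year)
    return [(year, hectares_by_year.get(year, 0))
            for year in range(first, last + 1)]


# Hypothetical data with gaps in 2004 through 2006.
toy_fires = {2003: 10, 2007: 40, 2008: 5}
print(fill_missing_years(toy_fires))
# [(2003, 10), (2004, 0), (2005, 0), (2006, 0), (2007, 40), (2008, 5)]
```

Feeding the filled series to a bar chart gives one slot per year, so the gaps show up as gaps instead of being squeezed out.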


Blowing the whistle at bubble charts

The bubble chart is one of the most hopeless data graphics ever invented. It is sometimes useful for conceptual charts but trying to express data with it is a lost cause.

The Wall Street Journal used a bubble chart to show the trend in whistle-blower lawsuits in the U.S. The original chart looks like this:


Focus on the top part of the chart. Now apply the self-sufficiency test (link), as follows:


First, cover up the data labels. You'll notice that no information is conveyed by the bubbles in and of themselves.

Second, give yourself a hint. The size of the first bubble corresponds to 363 suits. What does that tell you about the second bubble? Unfortunately, the answer is still nothing.

Third, give yourself two hints. The second bubble from the left has size 311. Now try to estimate the size of the rightmost bubble given those two pieces of data. This exercise is still extremely taxing.
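Part of the difficulty is the square-root relationship between value and size. The Python sketch below assumes the bubbles encode the data by area, which is the usual convention but not something I can confirm for this chart; the function names are mine.

```python
import math


def radius_ratio(value_a, value_b):
    """Radius of bubble B relative to bubble A when AREA is
    proportional to the value being plotted."""
    return math.sqrt(value_b / value_a)


def estimate_value(known_value, known_diameter, other_diameter):
    """What a reader must do mentally: square the ratio of
    diameters, then scale the known value."""
    return known_value * (other_diameter / known_diameter) ** 2


# The two labeled bubbles: 363 suits and 311 suits.
print(round(radius_ratio(363, 311), 3))  # 0.926

# A bubble half as wide represents a quarter of the value.
print(estimate_value(363, 10, 5))        # 90.75
```

Under this assumption, a 14% drop in suits (363 to 311) shrinks the radius by only about 7%, and reading a value back requires squaring a ratio of diameters in your head.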

Thus, the conclusion about bubble charts is:


That is to say, it fails the self-sufficiency test (link). The chart cannot exist without the data labels. The graphical elements do not provide any additional value.

Which software is responsible for this?

@guitarzan wants us to see this chart from north of the border, and read the comments. Please hold your nose first.



Here's one insightful comment: "I think it's insane to debate the ages 18 or 19. Why not cap it off at the much more rounded and sensible numbers 18.2 or 19.4??"

Reminds me of signs that say this elevator holds 13 people, or this auditorium holds 147 people safely.


I mean, which software package enables this chart?

For the vertical axis, it appears that the major gridlines are set 0.4 apart, with minor gridlines 0.2 apart. The lower limit of the vertical axis was specifically set to 17, which violates the start-at-zero rule for bar charts.

The software also allows the x-axis labels to be printed twice, once in super tiny font in the expected locations, and once turned sideways and printed into the bars.

And Canadians, please tell us why the provinces were ordered in this way.


This data calls for a simple map, with two colors.