The state of charting software

Andrew Wheeler took the time to write code (in SPSS) to create the "Scariest Chart ever" (link). I previously wrote about my own attempt to remake the famous chart in grayscale. I complained that this chart is easier to make in the much-maligned Excel paradigm than in a statistical package: "I find it surprising how much work it would be to use standard tools like R to do this."

Andrew disagreed, saying "anyone savvy with a statistical package would call bs". He goes on to do the "Junk Charts challenge," which has two parts: remake the original Calculated Risk chart, and then make the Junk Charts version of the chart.

I highly recommend reading the post. You'll learn a bit of SPSS and R (ggplot2) syntax, and the philosophy behind these languages. You can compare and contrast different ways of creating the charts, and compare the output of the various programs used to generate them.

I'll leave you to decide whether the programs he created are easier than Excel.

***

Unfortunately, Andrew skipped over one of the key challenges that I envision for anyone trying to tackle this problem. The data set he started with, which he found at the Minneapolis Fed, is post-processed data. (It's a credit to him that he found a more direct source of data.) The Fed data is essentially the spreadsheet that sits behind the Calculated Risk chart. One can just highlight the data and create a plot directly in Excel without any further work.

What I started with was the employment level data from BLS. What such data lacks is the definition of a recession, that is, the starting year and ending year of each recession. The data also comes in calendar months and years, and transforming that to "months from start of recession" is not straightforward. If we don't want to "hard code" the details, that is, if we want the definition of a recession to stay flexible and the application to be more general, the challenge is even greater.
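To give a sense of the work, here is a minimal sketch in R of the time-shifting step. It assumes two hypothetical data frames: payroll, holding the BLS series with columns date and employment, and recessions, holding the start and end date of each recession. These names, and the code, are mine, for illustration only.

    # payroll: data frame with columns date, employment (the BLS series)
    # recessions: data frame with columns start, end (one row per recession)
    # Both are hypothetical; the point is the per-recession month counter.
    align_recessions <- function(payroll, recessions) {
      segments <- lapply(seq_len(nrow(recessions)), function(i) {
        seg <- subset(payroll, date >= recessions$start[i] & date <= recessions$end[i])
        seg$recession <- format(recessions$start[i], "%Y")   # label each segment by its starting year
        seg$month_from_start <- seq_len(nrow(seg)) - 1       # month counter, reset to zero per recession
        seg
      })
      do.call(rbind, segments)
    }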

***

Another detail that Andrew skimmed over is the uneven length of the data series. One of the nice things about the Calculated Risk chart is that each line terminates upon reaching the horizontal axis. Even though more data is available for out years, that part of the time series is deemed extraneous to the story. This creates an awkward dataset where some series have, say, 25 values and others have only 10. While most software packages will handle this, more code needs to be written, either during the data-processing phase or during plotting.

By contrast, in Excel, you just leave the cells blank where you want the lines to terminate.
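In R, one way to mimic the blank cells is to overwrite the post-recovery values with NA, since most plotting routines break the line at missing values. A sketch, assuming each segment carries a month_from_start counter and a pct_loss column of percentage job losses (hypothetical names, computed elsewhere):

    # Blank out everything after the first month in which the series climbs
    # back to the axis, so the plotted line terminates there.
    truncate_at_recovery <- function(seg) {
      recovered <- which(seg$pct_loss >= 0 & seg$month_from_start > 0)
      if (length(recovered) > 0) {
        seg$pct_loss[seg$month_from_start > seg$month_from_start[min(recovered)]] <- NA
      }
      seg
    }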

***

In the last section, Andrew did a check on how well the straight lines approximate the real data. You can see that the approximation is extremely good. (The two panels where there seems to be a difference are due to a disagreement between the data sources as to when the recession started. If you look at 1974 instead of 1973, and also follow Calculated Risk's convention of having a really short recession in 1980, separate from that of 1981, then the straight lines match superbly.)

  Wheeler_JunkChallenge4

***

I'm the last person to say Excel is the best graphing package out there. That's not the point of my original post. If you're a regular reader, you will notice I make my graphs using various software, including R. I came across a case where I think current software packages are inferior, and would like the community to take notice.


Which software is responsible for this?

@guitarzan wants us to see this chart from north of the border, and read the comments. Please hold your nose first.

Cbc-drinking-graph

 

Here's one insightful comment: "I think it's insane to debate the ages 18 or 19. Why not cap it off at the much more rounded and sensible numbers 18.2 or 19.4??"

Reminds me of signs that say this elevator holds 13 people, or this auditorium holds 147 people safely.

***

I mean, which software package enables this chart?

For the vertical axis, it appears that the major gridlines are spaced 0.4 apart, with minor gridlines every 0.2. The lower limit of the vertical axis was specifically set to 17, which violates the start-at-zero rule for bar charts.

The software also allows the axis labels to be printed twice, once in a super tiny font in the expected locations, and again turned sideways and printed into the bars.

And Canadians, please tell us why the provinces were ordered in this way.

***

This data calls for a simple map, with two colors.

 


Remaking a great chart

One of the best charts depicting our jobs crisis is the one popularized by the Calculated Risk blog (link). This one:

JobLossesJan2013

I think a lot of readers have seen this one. It's a very effective chart.

The designer had to massage the data in order to get this look. The data published by the government typically gives an estimated employment level for each month of each year. The designer needs to find the beginning and ending months of each previous recession. Then the data needs to be broken up into unequal-length segments. A month counter now needs to be set up for each segment, resetting to zero for each new recession. All this creates the effect of time-shifting.

And we're not done yet. The vertical axis shows the percentage job losses relative to the peak of the prior cycle! This means that for each recession, he has to look back to the prior cycle and extract the peak employment level, which is then used as the base to compute the percentage that is being plotted.
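A sketch of that vertical-axis transformation in R, assuming seg is one recession's segment with an employment column, and peak is the peak employment level of the prior cycle (hypothetical names):

    # Express each month's employment as a percentage loss relative to the
    # peak of the prior cycle; this is the value plotted on the vertical axis.
    add_pct_loss <- function(seg, peak) {
      seg$pct_loss <- 100 * (seg$employment - peak) / peak
      seg
    }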

One thing you'll learn quickly from doing this exercise is that this is a task ill-suited for a computer (so-called artificial intelligence)! The human brain together with Excel can do this much faster. I'm not saying you can't create a custom-made application just for the purpose of creating this chart. That can be done and it would run quickly once it's done. But I find it surprising how much work it would be to use standard tools like R to do this.

***

Let me get to my point. While this chart works wonders on a blog, it doesn't work on the printed page. There are too many colors, and it's hard to see which line refers to which recession, especially if the printed page is grayscale. So I asked CR for his data, and re-made the chart like this:

FIGURE 5A-1

You'd immediately notice that I have liberally applied smoothing. I modeled every curve as a V shape with two linear segments: the left arm shows the average rate of decline leading to the bottom of the recession, while the right arm shows the average rate of growth taking us out of the doldrums. If you look at the original chart carefully, you'd notice that these two arms suffice to represent pretty much every jobs trend... all the other jitter is just noise.

I also chose a small-multiples layout to separate the curves into groups by decade. When you only have one color, you can't have ten lines plotted on top of one another.
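For the curious, here is a rough sketch in R (ggplot2) of both ideas, the V-shaped approximation and the small multiples. It assumes a data frame aligned with one row per recession-month and columns recession, decade, month_from_start and pct_loss; these are hypothetical names, and this is not the code behind the actual figure.

    library(ggplot2)

    # Keep only three points per recession -- the start (0% at month 0), the
    # trough, and the last plotted month -- so straight segments stand in for
    # the actual series.
    v_approx <- function(seg) {
      seg <- seg[order(seg$month_from_start), ]
      keep <- unique(c(1, which.min(seg$pct_loss), nrow(seg)))
      data.frame(recession = seg$recession[1],
                 decade    = seg$decade[1],
                 month     = seg$month_from_start[keep],
                 pct_loss  = seg$pct_loss[keep])
    }
    v_data <- do.call(rbind, lapply(split(aligned, aligned$recession), v_approx))

    # Small multiples by decade, a single color throughout
    ggplot(v_data, aes(month, pct_loss, group = recession)) +
      geom_line(colour = "grey30") +
      facet_wrap(~ decade) +
      labs(x = "Months from start of recession", y = "% job losses from peak")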

One can extend the 2007 recession line to where it hits the 0% axis, which would really make the point that the jobs crisis is unprecedented and inexplicably not getting any kind of crisis management.

(Meanwhile, New York City calls a crisis with every winter storm... It's baffling.)


The coming commoditization of infographics

An email lay in my inbox with the tantalizing subject line: "How to Create Good Infographics Quickly and Cheaply?" It's a half-spam from one of the marketing sites that I signed up for a long time ago. I clicked on the link, which led me to a landing page that required yet another click to get to the real thing (link). (Now, you wonder why marketers keep putting things in your inbox!)

Easelly_walkway

The article was surprisingly sane. The author, Carrie Hill, suggests that the first thing to do is to ask "who cares?" This is the top corner of my Trifecta Checkup, asking what's the point of the chart. Some of us not so secretly hope that the answer to "who cares?" is no one.

Carrie then lists a number of resources for creating infographics "quickly and cheaply".

Easel.ly caught my eye. This website offers templates for creating infographics. You want time-series data depicted as a long, hard road ahead, you have this on the right.

You want several sections of multi-colored bubble charts, you have this theme:

Easelly_angel

 

In total, they have 15 ready-made templates that you can use to make infographics. I assume paid customers will have more.

infogr.am is another site with similar capabilities, apparently aimed at those who already have some data in hand.

***

Based on this evidence, the avalanche of infographics is not about to pass. In fact, we are going to see the same styles repeatedly. It's like looking at someone's PowerPoint presentation and realizing that they are using the "Advantage" theme (one of the less ugly themes loaded by default). In the same way, we will have a long, winding road of civil rights, and a long, winding road of Argentina's economy, and a long, winding road of Moore's Law, etc.

But I have long been an advocate of drag-and-drop interfaces for producing statistical charts. So I hope the vendors out there learn from these websites and make their products ten times better, so that it is as "quick and cheap" to make nice statistical charts as it is to make infographics.

 


Look what I found: two amazing charts

While doing some research for my statistics blog, I came across a beauty by Lane Kenworthy from almost a year ago (link) via this post by John Schmitt (link).

How embarrassing is the cost effectiveness of U.S. health care spending?

Americasinefficienthealthcaresystem-figure1-version2

When a chart is executed well, no further words are necessary.

I'd only add that the other countries depicted are "wealthy nations".

***

Even more impressive is this next chart, which plots the evolution of cost effectiveness over time. An important point to note is that the U.S. started out in 1970 similar to the other nations.

Americasinefficienthealthcaresystem-figure2-version5

Let's appreciate this beauty:

  • Let the data speak for itself. Time goes from bottom left to upper right. As more money is spent, life expectancy goes up. However, the slope of the line is much smaller for the US than the other countries. There is no need to add colors, data labels, interactivity, animation, etc.
  • Recognize what's important, and what's not. The US line is in a different color, much thicker, and properly placed in the foreground of the chart.
  • Rather than clutter up the chart, the other 19 lines are anonymized. They all have the same color and thickness, and all given one aggregate label. This is an example of overcoming loss aversion (see this post for more): it is ok to suppress some of the data.
  • The axis labeling is superb. Tufte preaches this clean style. There is no need for regularly spaced axis labels... use data-informed labels. Unfortunately, software is way behind on this issue. You can do this in R (see the sketch below) but that's about it.
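For the record, here is what that last bullet point looks like in base R, with hypothetical vectors spend and life_exp for a single country: suppress the default, regularly spaced ticks and place the labels at values taken from the data itself.

    # spend and life_exp are hypothetical vectors for one country
    plot(spend, life_exp, type = "l", axes = FALSE,
         xlab = "Health spending per person", ylab = "Life expectancy")
    axis(1, at = range(spend), labels = signif(range(spend), 3))        # label only the data extremes
    axis(2, at = range(life_exp), labels = signif(range(life_exp), 3))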

 


The "data" corner of the Trifecta

Trifecta

In the JunkCharts Trifecta checkup, we reserve a corner for "data". The data used in a chart must be in harmony with the question being addressed, as well as the chart type being selected. When people think about data, they often think of cleaning the data and processing the data, but what comes before that is collecting the data -- specifically, collecting data that directly address the question at hand.

Our previous post on the smartphone app crashes focused on why the data was not trustworthy. The same problem plagues this "spider chart", submitted by Marcus R. (link to chart here)

Qlikview_Performance

Despite the title, it is impossible to tell how QlikView is "first" among these brands. In fact, with several shades of blue, I find it hard to even figure out which part refers to QlikView.

The (radial) axis is also a great mystery because it has labels (0, 0.5, 1, 1.5). I have never seen surveys with such a scale.

The symmetry of this chart is its downfall. These "business intelligence" software packages are ranked along 10 dimensions. There may not be a single decision-maker who would assign equal weight to each of these criteria. It's hard to imagine that "project length" is as important as "product quality", for example.

Take one step back. This data came from respondents to a survey (link). There is very little information about the composition of the respondents. Are they asked to rate all 10 products along 10 dimensions? Do they only rate the products they are familiar with? Or only the products they actively use? If the latter, how are responses for different products calibrated so that a 1 rating from QlikView users equals a 1 rating from MicroStrategy users? Given that each of these products has broad but not completely overlapping coverage, and users typically deploy only a part of the solution, how does the analysis address the selection bias?

***

The "spider chart" is, unfortunately, most often associated with Florence Nightingale, who created the following chart:

Nightingale

This chart isn't my cup of tea either.

***

Also note that the spider chart has so much over-plotting that it is impossible to retrieve the underlying data.

 

 


Want a signed book?

JMP is giving away signed copies of Numbers Rule Your World.  See details here.

JMP is a great piece of software for those who like to point and click, drag things around, and interactively build models. People I hire who are analytical but don't have proper statistical training seem to enjoy using it and produce good work with it. There are other similar packages on the market; I haven't tried them, so I don't know whether they are better or worse, but I can say I have had a pleasant time with JMP.

***

Speaking of which, if you haven't already, do subscribe to my sister blog, where I discuss the  statistical thinking behind everything that's happening around us.

The RSS feed: here. The twitter feed combines the two blogs.

 


Quick fix for word clouds: area not length

Here is one of my several suggestions for word-cloud design: encode the data in the area enclosed by each word, not the length of each word.

Every word cloud out there contains a distortion because readers are sizing up the areas, not the lengths of words. The extent of the distortion is illustrated in this example:

Wordcloud_distortion

The word "promise" is about 3.5 times the size of "McCain" but the ratio of frequency of occurrence is only 1.6 times.

This is a quick fix that Wordle and other word cloud software can implement right away. There are other more advanced issues I bring up in my presentation (see here).

 


Who are you talking to?

Reader Daniel L. points us to this "dashboard" of statistics concerning downloads of a piece of software presumably called "maven". This sort of presentation has unfortunately become standard fare.

Maven-stats

Daniel was shocked by the pie chart. Just for laughs, here is the pie chart:

Maven_pie
Something else is worth noting -- ever wondered who the chart designer is talking to?

Is it an accountant who cares about every single download (thus needing the raw data)?

Is it a product manager who cares about the current run rates, and the mix of components downloaded?

Is it an analyst who is examining trends over time, aggregated across all components?

***

In other words, the first order of business is to identify the user. 


Answering an open call

Dan Goldstein, who writes the Decision Science News blog, relays an internal debate at Yahoo! about the relative merits of some simple charts. From what I can tell, they used three methods (known as "Search", "Baseline", and "Combined") on four sample data sets with different subjects ("flu", "movies", "music", "games") and compared the performance of the methods. I imagine the underlying practical question to be: does having search data improve the performance of some kind of predictive model that can be applied to the different data sets? There is an existing baseline model that does not use search data.

(I noticed that Dan has since put up the final version of the chart they decided to use for publication. I will ignore that for the moment, and put up my response. Their final version is similar to my revised version.)

Dsn_charts

I'd like to use this data to reiterate a couple of principles that I have championed here over the years.

First, we must start a bar chart at zero. There was some back and forth on Dan's blog about whether this should be an iron-clad rule, and some comments about it being not a big deal. It is a big deal; just take a look:

Redo_barcharts2
The left chart has the full scale while the right chart chops off the parts below 0.4. The result of the chopping is that the relative lengths of the bars are distorted. For the music data, the search method appears (on the right) to be half as effective as the baseline, which is far from the reality shown in the left chart.

I used R to generate these charts, and was pleasantly surprised that the barplot function automatically assumes that bars start at zero. If you try to start the vertical axis above zero, the bars would literally walk off the chart, making it extremely ugly! (I had to pull some tricks to create the version shown above.)
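For anyone who wants to reproduce the effect, here is a small demonstration with made-up numbers (the method names come from the post; the values do not):

    perf <- c(Search = 0.62, Baseline = 0.71, Combined = 0.78)    # made-up values
    op <- par(mfrow = c(1, 2))
    barplot(perf, ylim = c(0, 1), main = "Full scale")            # bars start at zero
    barplot(perf, ylim = c(0.4, 1), main = "Axis starts at 0.4")  # bars walk off the chart
    par(op)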

***

Andrew Gelman suggested using a line chart. He also recently wrote that he has become a fan of line charts. Long-time readers know I am a fan of line charts, too... and I have tifosi who come here to complain about my over-use of line charts, especially when we have categorical data (as here!).

In particular, I have written about grouped bar charts before, and most of the time, they can be made into line charts and made clearer and better. (See here, or here.)

Some of the readers of Dan's blog complained that the dot plot makes it difficult to compare the performance of say the search method across different subjects (data sets). They think the bar charts do this better.

If comparing across subjects is the key activity for the reader of this chart, then a line chart is even better. Imagine you are reading the bar chart and comparing across subjects. Follow your eyes. You are essentially tracing lines across the top of the bars. The line chart makes this explicit. That, to me, is the key argument for using line charts in place of grouped bar charts.
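To make the comparison concrete, here is a sketch in R that draws the same numbers both ways: as a grouped bar chart and as a line chart with one line per method traced across the subjects. The subject and method names come from the post; the values are invented for illustration.

    perf <- matrix(c(0.80, 0.62, 0.55, 0.70,    # Search   (made-up values)
                     0.85, 0.75, 0.78, 0.74,    # Baseline (made-up values)
                     0.90, 0.80, 0.82, 0.81),   # Combined (made-up values)
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("Search", "Baseline", "Combined"),
                                   c("flu", "movies", "music", "games")))
    op <- par(mfrow = c(1, 2))
    barplot(perf, beside = TRUE, legend.text = TRUE, ylab = "Performance")  # grouped bars
    matplot(t(perf), type = "b", pch = 19, lty = 1, xaxt = "n",
            xlab = "", ylab = "Performance")    # one line per method across subjects
    axis(1, at = 1:4, labels = colnames(perf))
    par(op)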

I have made this argument before. Here is an illustration of the argument. The broken red lines are the same as the lines in the line chart.

Redo_dsn_charts0

***

On the line chart shown above, it's easy to see that the Combined method has the best average performance, and is never worse than the other two methods for any of the subjects. It also shows that the music subject differentiates the three methods most, primarily because the search data was not adding much to the effort. There is also no need to add colors, which can quickly make the bar charts unwieldy and disorienting.

In the final chart, shown on the right below, I flipped the two axes, changed the plot characters, used colors, shifted emphasis slightly to dots rather than lines, and started the chart at 0.5 (!)

Redo_dsn_charts
 

Line charts are more flexible in that they can make sense even when the axis does not start at zero. In particular, when the point of the chart is to make comparisons, that is, to look at the gaps between dots or lines, rather than the absolute values, then it is fine to start the axis at some place other than zero.

Take again the example of the performance on music by the three methods (red line). The drop in performance between combined and baseline and that between baseline and search are indeed roughly equal. The vertical distances to the bottom of the chart are still distorted, as in the bar chart, but in a line chart readers are less likely to be distracted by those distances because the bars are not there.