## The state of charting software

##### Mar 28, 2013

Andrew Wheeler took the time to write code (in SPSS) to create the "Scariest Chart ever" (link). I previously wrote about my own attempt to remake the famous chart in grayscale. I complained that this is a chart that is easier to make in the much-maligned Excel paradigm, than in a statistical package: "I find it surprising how much work it would be to use standard tools like R to do this."

Andrew disagreed, saying "anyone saavy with a statistical package would call bs". He goes on to do the "Junk Charts challenge," which has two parts: remake the original Calculated Risk chart, and then, make the Junk Charts version of the chart.

I highly recommend reading the post. You'll learn a bit of SPSS and R (ggplot2) syntax, and the philosophy behind these languages. You can compare and contrast different ways to creating the charts. You can compare the output of various programs to generate the charts.

I'll leave you to decide whether the programs he created are easier than Excel.

***

Unfortunately, Andrew skipped over one of the key challenges that I envision for anyone trying to tackle this problem. The data set he started with, which he found from the Minneapolis Fed, is post-processed data. (It's a credit to him that he found a more direct source of data.) The Fed data is essentially the spreadsheet that sits behind the Calculated Risk chart. One can just highlight the data, and create a plot directly in Excel without any further work.

What I started with was the employment level data from BLS. What such data lacks is the definition of a recession, that is, the starting year and ending year of each recession. The data also comes in calendar months and years, and transforming that to "months from start of recession" is not straightforward. If we don't want to "hard code" the details, i.e. allowing the definition of a recession to be flexible, and make this a more general application, the challenge is more severe.

***

Another detail that Andrew skimmed over is the uneven length of the data series. One of the nice things about the Calculated Risk chart is that each line terminates upon reaching the horizontal axis. Even though more data is available for out years, that part of the time series is deemed extraneous to the story. This creates an awkward dataset where some series have say 25 values and others have only 10 values. While most software packages will handle this, more code needs to be written either during the data processing phase or during the plotting.

By contrast, in Excel, you just leave the cells blank where you want the lines to terminate.

***

In the last section, Andrew did a check on how well the straight lines approximate the real data. You can see that the approximation is extremely well. (The two panels where there seems to be a difference are due to a disagreement between the data as to when the recession started. If you look at 1974 instead of 1973, and also follow Calculated Risk's convention of having a really short recession in 1980, separate from that of 1981, then the straight lines match superbly.)

***

I'm the last person to say Excel is the best graphing package out there. That's not the point of my original post. If you're a regular reader, you will notice I make my graphs using various software, including R. I came across a case where I think current software packages are inferior, and would like the community to take notice.