## The state of charting software

##### Mar 28, 2013

Andrew Wheeler took the time to write code (in SPSS) to create the "Scariest Chart ever" (link). I previously wrote about my own attempt to remake the famous chart in grayscale. I complained that this is a chart that is easier to make in the much-maligned Excel paradigm than in a statistical package: "I find it surprising how much work it would be to use standard tools like R to do this."

Andrew disagreed, saying "anyone savvy with a statistical package would call bs". He goes on to do the "Junk Charts challenge," which has two parts: remake the original Calculated Risk chart, and then make the Junk Charts version of the chart.

I highly recommend reading the post. You'll learn a bit of SPSS and R (ggplot2) syntax, and the philosophy behind these languages. You can compare and contrast different ways of creating the charts, as well as the output of the various programs.

I'll leave you to decide whether the programs he created are easier than Excel.

***

Unfortunately, Andrew skipped over one of the key challenges that I envision for anyone trying to tackle this problem. The data set he started with, which he found at the Minneapolis Fed, is post-processed data. (It's a credit to him that he found a more direct source of data.) The Fed data is essentially the spreadsheet that sits behind the Calculated Risk chart. One can just highlight the data and create a plot directly in Excel without any further work.

What I started with was the employment level data from BLS. What such data lacks is the definition of a recession, that is, the starting year and ending year of each recession. The data also comes in calendar months and years, and transforming that to "months from start of recession" is not straightforward. If we don't want to hard-code the details (that is, if we allow the definition of a recession to be flexible and make this a more general application), the challenge is more severe.
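To make the alignment step concrete, here is a minimal sketch in Python (for illustration only; the same logic would be written in R or SPSS). The function names, the toy employment numbers, and the recession date are all made up; real inputs would be the BLS series and the NBER peak dates.

```python
def months_between(start, ym):
    """Number of months from (year, month) start to (year, month) ym."""
    return (ym[0] - start[0]) * 12 + (ym[1] - start[1])

def align_to_recessions(employment, recession_starts):
    """employment: dict mapping (year, month) -> employment level.
    recession_starts: list of (year, month) peaks, e.g. as dated by NBER.
    Returns dict mapping each start to a list of
    (months_from_start, pct_change_from_peak) pairs."""
    aligned = {}
    for start in recession_starts:
        peak = employment[start]
        series = []
        for ym in sorted(employment):
            m = months_between(start, ym)
            if m < 0:
                continue  # ignore months before the recession began
            series.append((m, 100.0 * (employment[ym] - peak) / peak))
        aligned[start] = series
    return aligned

# Tiny fabricated example: employment peaks at 1000, dips, then recovers.
emp = {(2001, 1): 1000, (2001, 2): 990, (2001, 3): 985,
       (2001, 4): 992, (2001, 5): 1001}
out = align_to_recessions(emp, [(2001, 1)])
# out[(2001, 1)] == [(0, 0.0), (1, -1.0), (2, -1.5), (3, -0.8), (4, 0.1)]
```

The point of the sketch is that the recession definitions live in one list, so the same code handles any set of recessions rather than hard-coding each one.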

***

Another detail that Andrew skimmed over is the uneven length of the data series. One of the nice things about the Calculated Risk chart is that each line terminates upon reaching the horizontal axis. Even though more data is available for the out years, that part of the time series is deemed extraneous to the story. This creates an awkward dataset where some series have, say, 25 values and others only 10. While most software packages will handle this, more code needs to be written, either during the data processing phase or during the plotting.

By contrast, in Excel, you just leave the cells blank where you want the lines to terminate.
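The truncation itself can be sketched in a few lines; the Python below (names and data are hypothetical, for illustration) keeps each series only up to the first point where it climbs back to the axis, mimicking the effect of leaving the remaining Excel cells blank.

```python
def truncate_at_recovery(series):
    """series: list of (months_from_start, pct_change_from_peak).
    Keep points up to and including the first non-negative value
    after month 0, dropping the extraneous out-year data."""
    kept = []
    for m, pct in series:
        kept.append((m, pct))
        if m > 0 and pct >= 0:
            break  # the line has reached the horizontal axis
    return kept

series = [(0, 0.0), (1, -1.0), (2, -1.5), (3, -0.8), (4, 0.1), (5, 0.5)]
# truncate_at_recovery(series) drops the final (5, 0.5) point
```

In a plotting package the same effect usually requires this explicit pre-processing step; in Excel the "code" is simply the blank cells.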

***

In the last section, Andrew did a check on how well the straight lines approximate the real data. You can see that the approximation is extremely good. (The two panels where there seems to be a difference are due to a disagreement between the data sources as to when the recession started. If you look at 1974 instead of 1973, and also follow Calculated Risk's convention of having a really short recession in 1980, separate from that of 1981, then the straight lines match superbly.)

***

I'm the last person to say Excel is the best graphing package out there. That's not the point of my original post. If you're a regular reader, you will notice I make my graphs using various software, including R. I came across a case where I think current software packages are inferior, and would like the community to take notice.


Thanks for taking the time to respond. Exactly which chart is scary? Part of the reason I used the pre-processed data from the Minn. Fed is that I don't know what tables from BLS you are talking about! (this isn't my field, so I'm largely ignorant, and gave up after looking for a few minutes on the BLS website)

As long as you can define in numerical terms what a recession is you can do the data processing you suggest. The spreadsheet I used determined the start of recessions as defined by NBER. I don't know what definition you or the calculated risk blog are using.

I feel conflicted about the flexibility remark. Flexibility is good for ad-hoc analysis, I'm not sure if recessions should be defined in an ad-hoc manner though!

Knowing SPSS code really well, I don't feel that SPSS is inferior to Excel for this particular task or really any data manipulation task (ditto for the majority of major stat packages). I can fully admit though it took probably 4 years of using SPSS on a daily basis to get that savvy. I can confidently say I'm a much more productive analyst for it though. There are some things Excel is better at (such as a well-designed interactive dashboard), but this IMO isn't one of them.

Thanks again for taking the time to respond!

PS - Having the lines terminate when they reach zero values is pretty trivial, so that certainly isn't a reason to prefer Excel. The "more code" you talk about is two lines in SPSS. I would roughly guess any data manipulation you perform in Excel amounts to between 1 and 5 lines of code in SPSS.

In software development, good programmers know that there are different tools for different jobs. Just because you can do something in language X doesn't mean you should. Those of us in data analysis should try to look at software the same way.

Excel gets a bad rap because lots of people misuse it to create bad graphs or use it to just store text information in a grid. But when used properly, Excel has some great features. I work in R and Excel almost every day. I find that Excel is great at reshaping, sorting, and subsetting data. It's also where I create most simple bar or line graphs.

I'm happy to know I'm not alone! You, Nate Silver and I'm sure many, many other good data analysts use Excel too.

Andy: Great to see you here. The "scary" moniker is applied to the job losses chart. It's just a name you see in the blogosphere. I was not referring to your charts.

The raw data from BLS comes from the CES survey. Here is the link for anyone interested.
