The state of charting software

Andrew Wheeler took the time to write code (in SPSS) to create the "Scariest Chart ever" (link). I previously wrote about my own attempt to remake the famous chart in grayscale. I complained that this chart is easier to make in the much-maligned Excel paradigm than in a statistical package: "I find it surprising how much work it would be to use standard tools like R to do this."

Andrew disagreed, saying "anyone saavy with a statistical package would call bs". He goes on to do the "Junk Charts challenge," which has two parts: remake the original Calculated Risk chart, and then, make the Junk Charts version of the chart.

I highly recommend reading the post. You'll learn a bit of SPSS and R (ggplot2) syntax, and the philosophy behind these languages. You can compare and contrast different ways of creating the charts, and you can compare the output of the various programs that generate them.

I'll leave you to decide whether the programs he created are easier than Excel.

***

Unfortunately, Andrew skipped over one of the key challenges that I envision for anyone trying to tackle this problem. The data set he started with, which he found from the Minneapolis Fed, is post-processed data. (It's a credit to him that he found a more direct source of data.) The Fed data is essentially the spreadsheet that sits behind the Calculated Risk chart. One can just highlight the data, and create a plot directly in Excel without any further work.

What I started with was the employment level data from BLS. What such data lacks is the definition of a recession, that is, the starting year and ending year of each recession. The data also comes in calendar months and years, and transforming that to "months from start of recession" is not straightforward. If we don't want to "hard code" the details, that is, if we want the definition of a recession to stay flexible so that this becomes a more general application, the challenge is more severe.
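To make the transformation concrete, here is a minimal R sketch. It is not Andrew's code or mine: the employment numbers are made up, the recession start date is an assumption for illustration, and `months_between` and `reindex` are hypothetical helper names. The recession definitions sit in their own table so that they can be changed without touching the calculation.

```r
# Hypothetical monthly employment levels (made-up numbers, in thousands).
emp <- data.frame(
  date  = seq(as.Date("2007-10-01"), by = "month", length.out = 8),
  level = c(1000, 1002, 1001, 998, 995, 990, 986, 983)
)

# Recession definitions live in their own table rather than in the code;
# the start date here is an assumption for illustration.
recessions <- data.frame(
  name  = "2007",
  start = as.Date("2007-12-01"),
  stringsAsFactors = FALSE
)

# Whole calendar months between two dates.
months_between <- function(from, to) {
  f <- as.POSIXlt(from)
  t <- as.POSIXlt(to)
  12 * (t$year - f$year) + (t$mon - f$mon)
}

# Re-index one recession: months since its start, and percent change in
# employment relative to the level in the start month.
reindex <- function(emp, start) {
  keep <- emp$date >= start
  base <- emp$level[emp$date == start]
  data.frame(
    months_in  = months_between(start, emp$date[keep]),
    pct_change = 100 * (emp$level[keep] / base - 1)
  )
}

series <- lapply(seq_len(nrow(recessions)),
                 function(k) reindex(emp, recessions$start[k]))
names(series) <- recessions$name
series[["2007"]]
```

Adding a row to `recessions` produces another re-indexed series. Deciding what those rows should be is the hard part, and this sketch does not address it.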

***

Another detail that Andrew skimmed over is the uneven length of the data series. One of the nice things about the Calculated Risk chart is that each line terminates upon reaching the horizontal axis. Even though more data is available for out years, that part of the time series is deemed extraneous to the story. This creates an awkward dataset where some series have say 25 values and others have only 10 values. While most software packages will handle this, more code needs to be written either during the data processing phase or during the plotting.

By contrast, in Excel, you just leave the cells blank where you want the lines to terminate.
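For comparison, here is a minimal R sketch of the extra step described above, using made-up series. The closest analogue to a blank cell is `NA`: each series is padded to a common length, and `matplot()` stops drawing a line where the values are missing. The helper `pad` is a hypothetical name.

```r
# Two hypothetical series of unequal length (made-up percent changes).
short <- c(0, -1.0, -1.8, -1.2, -0.3, 0)
long  <- c(0, -0.5, -1.5, -2.8, -3.5, -3.1, -2.2, -1.0, 0)

# Pad every series with NA up to the longest length, then bind as columns.
pad <- function(x, n) c(x, rep(NA, n - length(x)))
n <- max(length(short), length(long))
wide <- cbind(short = pad(short, n), long = pad(long, n))

# matplot() skips NA values, so the shorter line simply ends early.
matplot(0:(n - 1), wide, type = "l", lty = 1, col = c("grey50", "black"),
        xlab = "Months from start of recession", ylab = "Percent change")
abline(h = 0, lty = 3)
```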

***

In the last section, Andrew did a check on how well the straight lines approximate the real data. You can see that the approximation is extremely good. (The two panels where there seems to be a difference are due to a disagreement between the data as to when the recession started. If you look at 1974 instead of 1973, and also follow Calculated Risk's convention of having a really short recession in 1980, separate from that of 1981, then the straight lines match superbly.)

  Wheeler_JunkChallenge4

***

I'm the last person to say Excel is the best graphing package out there. That's not the point of my original post. If you're a regular reader, you will notice I make my graphs using various software, including R. I came across a case where I think current software packages are inferior, and would like the community to take notice.


Speaking analytics

(This is a cross-post from my other blog, as it also relates to data graphics.)

I was a guest on the Analytically Speaking series, organized by JMP. In this webcast (link, registration required), I talk about the coexistence of data science and statistics, why my blog is called "Junk Charts", what I look for in an analytics team, the tension between visualization and machine algorithms, two modes of statistical modeling, and other things analytical.


Three lessons from Jobs

I feel like I know Steve Jobs even though I don't know him. I know him through the Apple products I have used through the years.


My first exposure to Apple coincided with coming to the States for college. Before the move, I had only ever used PCs, assembled by my Dad. The first week of college, I found myself in a room of Macintoshes: in those days, they were off-white cubic blocks, slightly smaller than shoeboxes, with black-and-white, low-resolution screens. A "happy Mac" was always there to greet you. It only took 15 or 20 minutes to fall in love. In this time, I figured out how to use a mouse, the difference between single and double clicking, minimizing windows, file directories, etc. When my friends tell me today that their six-month-old baby could instinctively learn to start their favorite game on the iPad, I believe them. I believe them because I experienced it myself.

By all accounts, Apple products bear the fingerprints of Steve Jobs's dogged vision. His vision offers three important lessons for graphics designers:

1) Never take your eyes off the user experience.

The product is in service of the user. Charts serve readers. What are the key questions to answer? How can we help deliver their needs effortlessly?

2) Maintain the producer's control.

Knowing the user does not mean relinquishing control. Apple products are very tightly designed. The email application on the iPhone works beautifully out of the box but it doesn't try to replicate every feature available online. It doesn't have to. Good graphics are never neutral; their producers have a point of view.

3) Balance form and function.

Detractors often mock Apple for false "innovations": they ask, why should a white iPhone cost more than a black one? How can rainbow-color iPods be considered an innovation? But we all react to beauty, to form. One shouldn't elevate form at the expense of function but function without form is hardly enough. The same holds for graphics.


The return on effort in data graphics

I contributed the following post to the Statistics Forum. They are having a discussion comparing information visualization and statistical graphics. I use the following matrix to classify charts in terms of how much work they make readers do, and how much value readers get out of doing said work.

Returnoneffort

 

To read the rest of it, click here.


Have data graphics progressed in the last century?

Received a wonderful link via reader Lonnie P. to this website that presents a historical reconstruction of W.E.B. DuBois's exhibit of the "American negro" at the 1900 Paris Expo. Amusingly, DuBois presented a large series of data graphics to educate the world on the state (plight) of blacks in America over a century ago.

You can really spend a whole afternoon examining these charts (and more); too bad the charts have poor resolution and it is often hard to make out the details.

***

Judging from this evidence, we must face up to the fact that data graphics have made little progress during these eleven decades. Ideas, good or bad, get reinvented. Disappointingly, we haven't learned from the worst ones.

Exhibit A 

  Dubois_a

(see discussion here)

Exhibit B

Dubois_b

 (see discussion here)

Exhibit C 

  Dubois_c

(See discussion here.)

Exhibit D

Dubois_dd
 (see the Vampire chart here)

Exhibit E

Dubois_e
(see the discussion here.)

Exhibit F

Dubois_f
(see discussion here.)


Audio bookmarks

I look at a fair number of online videos, especially those embedded on blogs. But I haven't seen this feature implemented broadly. It is a wow feature.

Look at the dots above the progress bar: they tell you what topic is being discussed and allow you to jump back and forth between segments. (The particular dot I moused over said "Randy Moss".) The video I saw came from this link.

Audio_bookmarks2

This simple-looking feature is immensely useful to users. You can efficiently search through the audio file and find the segments you're interested in. It's like bookmarks students might put on pages of a textbook for easy reference, except these are audio bookmarks.

Why isn't this feature more prevalent? I think it's because of the amount of manual effort needed to set this up. Imagine how the data has to be processed. In the digital age, the audio file is a bunch of bits (ones and zeroes), so neither a computer nor a human can identify topics directly from data stored in that way. So, someone would need to listen to the audio file, mark off the segments manually, and tag the segments. Then, the audio bookmarks can be plotted on the progress bar... basically a dot plot with time on the horizontal axis.
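Once the tagging is done, the plotting step is small. Here is a rough R sketch under loud assumptions: the start times and the clip length are made up, only the "Randy Moss" label comes from the video, and `current_topic` is a hypothetical helper.

```r
# Hypothetical hand-tagged segments; the start times are made up, and only
# the "Randy Moss" label comes from the video discussed in the post.
bookmarks <- data.frame(
  start_sec = c(0, 95, 260, 410),
  topic     = c("Intro", "Randy Moss", "Topic C", "Topic D"),
  stringsAsFactors = FALSE
)
duration_sec <- 600  # assumed total length of the clip

# A one-row dot plot: time on the horizontal axis, one dot per segment.
plot(bookmarks$start_sec, rep(1, nrow(bookmarks)), pch = 19,
     xlim = c(0, duration_sec), ylim = c(0.5, 1.5),
     yaxt = "n", xlab = "Time (seconds)", ylab = "")
segments(0, 1, duration_sec, 1)  # the progress bar itself
text(bookmarks$start_sec, 1.2, bookmarks$topic, cex = 0.8)

# Finding the segment for a playback position is a lookup into the
# sorted start times.
current_topic <- function(t) {
  bookmarks$topic[findInterval(t, bookmarks$start_sec)]
}
current_topic(100)
```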

In theory, you can train a computer to listen to an audio file and approximate this task. The challenge is to attain the required accuracy so you don't need to hire an army of people to correct mistakes.

A very simple concept but immensely functional. Great job!


Showing dynamics on a business chart

Dave S. achieved a rare feat, which is to send in a great-looking set of charts. This post at Asymco is worth reading in its entirety; the author Horace discusses the process by which he worked through several charts, arriving at the one he's most happy with.

***

The secret to the success here is the careful framing of the question, and the collection of the appropriate data to address that question. The question is the competition between wireless phone vendors in the last three years. It was established that the right way to view this competition is in two dimensions: share of revenues, and share of profits. Note the word "share". Share of profits is not a metric that is often discussed but it is the right metric to compare to the share of revenues -- getting both numbers onto the same comparable scale is what makes this work.

Needless to say, the raw data one would collect come from the financial statements of the eight individual vendors. Plotting these numbers directly would be a mistake. So you take the numbers, making sure that you're really counting wireless revenues and wireless profits, and then compute the shares. (I am not actually sure that they have wireless profit data because large companies like Apple and Nokia typically don't break out their profits by line of business, even if they provide the revenues.)
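The share computation itself is simple arithmetic. Here is a minimal R sketch with made-up figures for three hypothetical vendors; these are not Horace's data.

```r
# Hypothetical wireless revenues and profits for three vendors
# (made-up figures; real numbers would come from financial statements).
vendors <- data.frame(
  vendor  = c("A", "B", "C"),
  revenue = c(40, 25, 35),
  profit  = c(2, 6, 2),
  stringsAsFactors = FALSE
)

# Convert both measures to shares of the industry total so that they sit
# on the same 0-100% scale.
vendors$rev_share    <- 100 * vendors$revenue / sum(vendors$revenue)
vendors$profit_share <- 100 * vendors$profit  / sum(vendors$profit)
vendors
```

Note that a share of profits is only well behaved when the profits are positive. A vendor running at a loss would need separate treatment, which this sketch ignores.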

Horace also avoided the plague of plotting all time-series data as line charts (similar to the plague of plotting all geographic data on maps). By plotting revenues and profits simultaneously, he no longer can plot time (years) on one of the two axes, and that is a good thing.

***

This is the final graph Horace landed on. It puts all the vendors at the origin in 2007 and then tells us where they landed in 2010 in terms of revenue and profit share growth/decline.

It would be even better if he made the scales work harder, e.g. by giving a 10% change the same length along both the vertical and horizontal axes. Alternatively, you can scale it such that each unit on either axis represents equal dollars.

This is a very focused chart that answers the question about the relative change in positioning of each vendor. What it doesn't answer is the starting position or ending position of each. Note, while Nokia is depicted as losing share on both revenues and profits, Nokia still has twice the revenue share of the other vendors, and out-earns everyone except Apple!

I am not saying this is a bad chart. It is designed to answer the relative question, not the absolute question. That's all.


***

There is one way to have the cake and eat it too.  Horace almost created that chart. He showed two scatter plots, one for 2007 and one for 2010.

If he just overlays one on the other, and uses lines to connect the dots for each phone vendor, he will have a chart that shows absolute and relative values all at once. Here's a crude illustration of this (missing the labels to show that the arrow end of the line represents the 2010 positions):

Redo_wirelessrevpro

I like this kind of chart a lot. It is great for showing dynamics in a set of variables, without actually making the chart dynamic.

(Even on this chart, it is better to harmonize the two scales.)
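For readers who want to try this chart type, here is a minimal base-R sketch. The shares are made up for three hypothetical vendors, so the picture will not match Horace's data. Setting `asp = 1` with a common axis range is one way to harmonize the two scales.

```r
# Hypothetical revenue and profit shares (in %) for three vendors; made up.
shares <- data.frame(
  vendor = c("A", "B", "C"),
  rev07  = c(45, 5, 20),  prof07 = c(55, 3, 15),
  rev10  = c(30, 25, 18), prof10 = c(25, 45, 10),
  stringsAsFactors = FALSE
)

# A common range for both axes, plus asp = 1, so that a 10-point change
# covers the same distance horizontally and vertically.
rng <- range(shares[, c("rev07", "prof07", "rev10", "prof10")])
plot(NA, xlim = rng, ylim = rng, asp = 1,
     xlab = "Share of revenues (%)", ylab = "Share of profits (%)")

# 2007 positions as open dots; arrows point to the 2010 positions.
points(shares$rev07, shares$prof07, pch = 1)
arrows(shares$rev07, shares$prof07, shares$rev10, shares$prof10, length = 0.1)
text(shares$rev10, shares$prof10, shares$vendor, pos = 4)
```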

 


Book review: Interactive Graphics for Data Analysis

I am happy to provide the following review of this interesting book by Martin and Simon, who are readers of Junk Charts. Martin also publishes a blog, and he's the one who has created bumps charts for the Tour de France races (which also appear in the book).

Interactive Graphics for Data Analysis is an advanced book written by two researchers who have deep experience developing graphics software. People who like to go beyond the basics will find it a useful addition to the literature.

To give you an idea of the level of sophistication, just in Chapter 1 (titled Interactivity), the two authors utilize set operations, SQL statements, and parallel coordinate plots. They assume you have some sense of what those are. That said, those sections can be skipped without interrupting the flow of the book.

The following key messages from these authors are worth repeating:

  • There is a distinction between statistical graphics and data graphics. Underlying trends and patterns in the data are often made clear by performing statistical analyses on the data, with the results added to charts (e.g. loess lines). When dealing with very large data sets, statistical charts (such as box plots) are found to be much more scalable, precisely because they do not attempt to put every data point onto the page.
  • The authors stress the need to look at a variety of charts when doing exploratory data analysis. This is because most chart types do certain things well but not others.
  • Throughout the book, they make much hay of the problem of "over-plotting", that is, overlapping data. This happens when data is abundant, or when values are concentrated in a narrow range. A great illustration of this problem is the parallel coordinates plot, which can look entirely different depending on which lines are plotted on top of which other lines. (The charts on the right are identical except for the order in which the lines are plotted.) Common strategies include "jittering", and varying transparency. Many of these strategies have issues of their own.
  • They also point out that the look of many multivariate charts (such as mosaic charts) depends on the sorting of the data. This is a key weakness of many such plots. Just think about this the next time you create a stacked column chart.
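The two over-plotting strategies mentioned in the list above are easy to try in base R. This is a minimal sketch on made-up data, not an example from the book; `jitter()` and `adjustcolor()` are standard R functions.

```r
set.seed(1)  # reproducible made-up data

# Hypothetical data with heavy overlap: many points on a few discrete values.
x <- sample(1:5, 2000, replace = TRUE)
y <- sample(1:5, 2000, replace = TRUE)

par(mfrow = c(1, 3))
# Raw: 2000 points collapse onto at most 25 distinct positions.
plot(x, y, pch = 19, main = "Raw")
# Jittered: small random noise spreads the tied points apart.
plot(jitter(x), jitter(y), pch = 19, main = "Jittered")
# Transparent: overlapping points accumulate, so density shows as darkness.
plot(x, y, pch = 19, main = "Transparent",
     col = adjustcolor("black", alpha.f = 0.02))
```

Both fixes have costs of their own: jittering moves points away from their true values, and transparency hides isolated points.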

The book is divided into two sections: Principles and Examples. The second half, the Examples section, consists of case studies in which the authors show examples of how to investigate the structure of a given data set.

The example of using the fatty-acid contents of Italian olive oils to deduce their regional origin is a good visualization of how the statistical technique of classification trees works. Here is the telling diagram:

 Notice that data with the same color are oils from the same region, the rectangular sections are results of the statistical classification procedure, and we would like to see most (if not all) of the data within each section having the same color.

***

Without a doubt, graphics designers should be aware of the issues raised by these authors. The book appears to be written for students who are creating statistical software (complete with end-of-chapter exercises). I'm left wondering what users of graphics software can do with this information because much of this material relates to the design of graphics software. Knowing these issues makes you want to do things the software may not be designed to do efficiently. For example, most software packages I have used do not have a simple toggle to sort categorical variables by various means (alphabetical, increasing or decreasing frequency, increasing or decreasing value of another variable, etc.).
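In R, for instance, these orderings are available, but through code rather than a toggle. Here is a minimal sketch on made-up data; the variable names are hypothetical.

```r
# Hypothetical categorical data (made up).
d <- data.frame(
  type  = c("b", "c", "a", "c", "b", "c"),
  value = c(1, 9, 5, 8, 2, 7),
  stringsAsFactors = FALSE
)

# Alphabetical order is the default for factor levels.
alpha <- factor(d$type)

# Decreasing frequency: order the levels by how often each one occurs.
by_freq <- factor(d$type,
                  levels = names(sort(table(d$type), decreasing = TRUE)))

# Increasing value of another variable: reorder() sorts the levels by the
# mean of that variable within each level.
by_value <- reorder(factor(d$type), d$value)

levels(alpha)
levels(by_freq)
levels(by_value)

# Charts follow the level order, so this one is sorted by frequency.
barplot(table(by_freq))
```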


Hoisted from the archives: a revolution

In October 2007, I wrote about the "canvass" metaphor for graphing software. This was what I said:

With the advent of AJAX and other interactive technologies, one can only hope that new graphing software will use the "canvass" metaphor.  If we want to reduce the spacing between bars, we should be able to grab the bars and move them together.  If we want to change the ordering, we should be able to mouse over some menu and select a pre-defined ordering scheme, or to drag and move bars around as we please. etc. etc.

To push this metaphor further, this kind of software should facilitate the "exploratory" stage of graph-making. I blogged about this stage of making sketches before. One longs for software that allows one to flip through many different chart types quickly, to settle on the desired type, and then to make the nitty-gritty changes to the axes, colors, dots, etc.

The revolution has arrived in the form of JMP's Graph Builder function. It is not perfect yet, as even the example I use will show, but I'm excited because we are getting closer to that "canvass" metaphor.

***

I'm going to re-make this inedible pair of donuts from an otherwise quite nice infographic on the growth and nature of spam in the last 10 years. (New Scientist)

I have often pointed out the biggest shortcoming of donut charts: the most important clue to the size of each sector of the underlying pie chart, that is, the angle at the center of the pie, has been cut out of the chart and often, as here, obscured by a number.

There are dramatic shifts in proportions of spam types during the last decade but the effect is underwhelming as depicted.

In the Graph Builder, I can push around the data and create different chart types.  First, I made a small-multiples bar chart.

Bars_sm_multiple

By clicking on the word "Year" and dragging it to a box called "Overlay", I made a paired bar chart:

Paired bars

What about a dot plot instead? This change requires a right click but easy enough:

Dots

Here's where I encountered a little inconvenience. It's probably ignorance on my part since I didn't read the manual. I couldn't figure out how to increase the dot size for all dots at once, only one at a time.

In any case, I'm still searching.  I want to do a small-multiples line chart. For this, I drag the word "Year" into the bottom of the chart labelled "X", and then right-click to add a line to the dot chart.

Lines_sm_multi

This is close to a desired chart type for this data.  The change from year to year is highly apparent, and the increased and decreased spam types are also obvious. I would color the increases differently from the decreases if I had the time.

I had a very difficult time getting the year labels to say 1999 and 2009, which are the logical points for this data, and in the end I failed. JMP seems to have a mind of its own.

Since it takes no time, I experimented some more.  By moving "Category" to "Wrap", I reproduced the above chart but in a matrix form:

Lines_sm_multi_wrapped

Finally, I made the "Category" an "overlay" which resulted in this chart.  This is kind of like the Bumps chart but obviously a bad idea for this data: (I'm not even showing the really ugly legend).

Lines_overlay_category

So, my dream toy -- the "canvass" style graph maker -- is here! It only takes a few minutes to move the data around this canvass, and see these different chart types.

***
I indicated that this goes a long way but isn't perfect. Right now, sketching and exploring is easy but refining and detailing is not as easy.

What I would like to see: once the general form of the chart is chosen, maybe a second canvass is needed, with Photoshop as a metaphor, in which we can chisel out the nitty-gritty details, like the axis labels, dot sizes, line widths and so on.

Also, the number of chart types can, and I presume will, be increased over time. For instance, I don't think the current version allows a profile chart; it seems to adhere to the overly-rigid rule that a categorical data series should not be connected by a line.

(I should say that in the current release, one way to accomplish this is to save the resulting graph-sketch as a "JMP script" and then go into the code and change things around. But since we are doing point and click, and visual interaction, why not go all the way?)

Most existing graphing software falls into one of two extremes: the Excel style which is super-rigid, or the R style which allows minute control over every little thing. This, I think, is the third way.

 


Playthings in the unreal world 3

Some readers may be interested in the R code used to generate the small multiples charts.  The code also highlights one of the virtues of R, which is "elegance": because it natively handles vector and matrix computations, the programmer can (if he or she chooses to) reduce the use of (inelegant) looping.  (Yes, coding elegance is a kind of romantic ideal, and inelegant code has many practical advantages: it is easier to debug, easier to manage, easier to collaborate on with others, etc.)


# reading in data

bigmac = read.csv("bigmac.csv", header=TRUE)

# initializing empty matrix

bigmac2 = matrix(0,nrow(bigmac),nrow(bigmac))
colnames(bigmac2) = bigmac$Country
rownames(bigmac2) = bigmac$Country


# main computation

for(i in colnames(bigmac2)) bigmac2[,i] = bigmac$Price/bigmac$Price[bigmac$Country==i]
bigmac3=round(bigmac2-1,3)


# this matrix holds the colors of each bar to be plotted

bigmaccol = ifelse(bigmac3>0, "black", "white")


# graphical parameters

par(mar=c(3,7,3,1), mfrow=c(2,2),cex.axis=0.8, cex=1)


# plotting

for (i in c("US","EuroArea","Japan","China")) {
    barplot(rev(bigmac3[,i]), horiz=TRUE, xlim=c(-1,3), las=2, col=rev(bigmaccol[,i]), main=paste("Relative to", i))
}


In the main computation step, the one formula takes the original vector of prices (the left column in the Economist's chart), computes relative prices 23 times using successively each country's price as the standard, and deposits all 23 vectors into a matrix.  Then, the next step takes the entire matrix, subtracts 1 from each entry, and rounds each entry to 3 decimal places.
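If one wants to drop the remaining loop as well, base R's `outer()` can build the whole matrix in one call. This is a sketch of an alternative, not the code used for the charts. The prices are made up and stand in for bigmac.csv so that the example runs on its own, and `rel` is a hypothetical name.

```r
# Hypothetical stand-in for bigmac.csv (made-up prices), so the sketch runs
# without the original file.
bigmac <- data.frame(
  Country = c("US", "EuroArea", "Japan", "China"),
  Price   = c(4.0, 5.0, 3.0, 2.0),
  stringsAsFactors = FALSE
)

# outer(x, y, "/")[r, c] is x[r] / y[c], so column c holds every price
# relative to country c's price: the same matrix that the loop fills in.
rel <- outer(bigmac$Price, bigmac$Price, "/")
dimnames(rel) <- list(bigmac$Country, bigmac$Country)

# Same final step as before: subtract 1 and round to 3 decimal places.
round(rel - 1, 3)
```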

In plotting, the default options are not satisfactory. The changes I made included switching from columns to bars, reversing the order of plotting, setting the left and right edges of the value axis, turning the country labels on their sides (this also turns the value labels; there is a way to set this for one axis and not the other, but I did not bother), making the positive bars black and the negatives white, and supplying titles that are dynamically assigned according to the reference country.

Also, it is almost always true that the global graphical parameters need to be adjusted.  Here, I controlled the amount of white space on each side of the plotting window, set up a 2x2 grid of charts, and changed the font size on the axes.