Update on Dataviz Workshop 2

The class practised doing critiques on the famous Wind Map by Fernanda Viegas and Martin Wattenberg.


Click here for a real-time version of the map.

I selected this particular project because it is a heartless person indeed who does not see the "beauty" in this thing.

Beauty is a word that is thrown around a lot in data visualization circles. What do we mean by beauty?


The discussion was very successful. The most interesting points were these:

  • Something that is beautiful should take us to some truth.
  • If we take this same map but corrupt all the data (e.g. reverse all wind directions), is the map still beautiful?
  • What is the "truth" in this map? What is its utility?
  • The emotional side of beauty is separate from the information side.
  • "Truth" comes before the emotional side of beauty.

Readers: would love to hear what you think.


PS. Click here for class syllabus. Click here for first update.

Graph redesign is hot

Joe D., a long-time reader, points us to a few blogs that have been actively creating redesigns of charts, similar to how we do it here.

First up, here are some examples from Storytelling With Data (link).

This example transformed a grouped bar chart into a line chart, something that I have long advocated. I'm still waiting for the day when market research companies start to switch from bars to lines.



Jorge Camoes, also a long-time reader, produced a redesign of a chart on military spending first printed in Time magazine. (link)


Dual-axis plots have been pilloried here often, especially when the two axes have different and incompatible units, as they do here. As usual, transforming to a scatter plot is a good first step, which is what Jorge has done. He then connected the dots to indicate the time evolution of the relationship. This is a smart move here just because the pattern is so stark.

The chart now illustrates an "inflexion point" in 2000. Prior to 2000, troop size was decreasing while the budget was stable. After 2000, budget increased sharply while troop size remained relatively stable.

Now peer back at the original chart. You can discern the sharp decrease in troop size over time, and the sharp increase in budget over time, but separately. The chart teases a cross-over point around 1995 which turned out to be misleading. This is a great illustration of why dual-axis plots are dangerous.

The state of charting software

Andrew Wheeler took the time to write code (in SPSS) to create the "Scariest Chart ever" (link). I previously wrote about my own attempt to remake the famous chart in grayscale. I complained that this is a chart that is easier to make in the much-maligned Excel paradigm, than in a statistical package: "I find it surprising how much work it would be to use standard tools like R to do this."

Andrew disagreed, saying "anyone saavy with a statistical package would call bs". He goes on to do the "Junk Charts challenge," which has two parts: remake the original Calculated Risk chart, and then, make the Junk Charts version of the chart.

I highly recommend reading the post. You'll learn a bit of SPSS and R (ggplot2) syntax, and the philosophy behind these languages. You can compare and contrast different ways of creating the charts, as well as the output of the various programs that generate them.

I'll leave you to decide whether the programs he created are easier than Excel.


Unfortunately, Andrew skipped over one of the key challenges that I envision for anyone trying to tackle this problem. The data set he started with, which he found from the Minneapolis Fed, is post-processed data. (It's a credit to him that he found a more direct source of data.) The Fed data is essentially the spreadsheet that sits behind the Calculated Risk chart. One can just highlight the data, and create a plot directly in Excel without any further work.

What I started with was the employment level data from BLS. What such data lacks is the definition of a recession, that is, the starting year and ending year of each recession. The data also comes in calendar months and years, and transforming that to "months from start of recession" is not straightforward. If we don't want to "hard code" the details, that is, if we allow the definition of a recession to be flexible and make this a more general application, the challenge is more severe.
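To make the re-indexing step concrete, here is a minimal Python sketch. It is not Andrew's code or mine from the original post. It assumes the vertical axis is the percent change in employment relative to the level in the month the recession starts, and it takes the recession start dates as an argument so that the definition of a recession is not hard-coded. The function names and the sample numbers are made up for illustration; they are not BLS data.

```python
from datetime import date


def months_between(start: date, current: date) -> int:
    """Whole calendar months from start to current (day of month is ignored)."""
    return (current.year - start.year) * 12 + (current.month - start.month)


def align_to_recessions(series, recession_starts):
    """Re-index a monthly series by months elapsed since each recession start.

    series: dict mapping a date (first of the month) to an employment level.
    recession_starts: list of dates, supplied by the caller rather than
        hard-coded, so the definition of a recession stays flexible.
    Returns {start_date: [(months_since_start, pct_change_from_start), ...]}.
    """
    aligned = {}
    for start in recession_starts:
        base = series[start]  # assumption: the start month exists in the data
        rows = []
        for month in sorted(series):
            if month >= start:
                pct = 100.0 * (series[month] - base) / base
                rows.append((months_between(start, month), pct))
        aligned[start] = rows
    return aligned


# Made-up levels for illustration only (not BLS data).
levels = {
    date(2000, 1, 1): 100.0,
    date(2000, 2, 1): 99.0,
    date(2000, 3, 1): 98.0,
    date(2000, 4, 1): 101.0,
}
result = align_to_recessions(levels, [date(2000, 2, 1)])
```

Changing the list of start dates is all it takes to try a different definition of a recession, which is the flexibility I was asking for.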


Another detail that Andrew skimmed over is the uneven length of the data series. One of the nice things about the Calculated Risk chart is that each line terminates upon reaching the horizontal axis. Even though more data is available for out years, that part of the time series is deemed extraneous to the story. This creates an awkward dataset where some series have, say, 25 values and others have only 10 values. While most software packages will handle this, more code needs to be written either during the data processing phase or during the plotting.

By contrast, in Excel, you just leave the cells blank where you want the lines to terminate.
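For comparison, the extra code might look something like the following Python sketch. It assumes each series is a list of (months since start, percent change) pairs in time order, cuts the series off at the first month it climbs back to the horizontal axis, and pads the shorter series with None, which plays the role of the blank Excel cell. Both function names are hypothetical.

```python
def truncate_at_recovery(points):
    """Keep points up to and including the first return to zero or above.

    points: list of (months_since_start, pct_change) in time order.
    The series must dip below zero before a return to zero counts.
    """
    kept = []
    dipped = False
    for month, pct in points:
        kept.append((month, pct))
        if pct < 0:
            dipped = True
        elif dipped:
            break  # back at the horizontal axis: the rest is extraneous
    return kept


def pad(points, length):
    """Return only the values, padded with None up to a common length."""
    values = [pct for _, pct in points]
    return values + [None] * (length - len(values))


# Made-up series for illustration only.
series = [(0, 0.0), (1, -1.0), (2, -0.5), (3, 0.2), (4, 1.0)]
shortened = truncate_at_recovery(series)
row = pad(shortened, 6)
```

Whether a plotting library treats None as a gap in the line depends on the library, so that part is left out of the sketch.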


In the last section, Andrew did a check on how well the straight lines approximate the real data. You can see that the approximation is extremely good. (The two panels where there seems to be a difference are due to a disagreement between the data as to when the recession started. If you look at 1974 instead of 1973, and also follow Calculated Risk's convention of having a really short recession in 1980, separate from that of 1981, then the straight lines match superbly.)



I'm the last person to say Excel is the best graphing package out there. That's not the point of my original post. If you're a regular reader, you will notice I make my graphs using various software, including R. I came across a case where I think current software packages are inferior, and would like the community to take notice.

Speaking analytics

(This is a cross-post from my other blog, as it also relates to data graphics.)

I was a guest on the Analytically Speaking series, organized by JMP. In this webcast (link, registration required), I talk about the coexistence of data science and statistics, why my blog is called "Junk Charts", what I look for in an analytics team, the tension between visualization and machine algorithms, two modes of statistical modeling, and other things analytical.

Three lessons from Jobs

I feel like I know Steve Jobs even though I don't know him. I know him through the Apple products I have used through the years.

My first exposure to Apple coincided with coming to the States for college. Before the move, I had only ever used PCs, assembled by my Dad. The first week of college, I found myself in a room of Macintoshes: in those days, they were off-white cubic blocks, slightly smaller than shoeboxes, with black-and-white, low-resolution screens. A "happy Mac" was always there to greet you. It only took 15 or 20 minutes to fall in love. In this time, I figured out how to use a mouse, the difference between single and double clicking, minimizing windows, file directories, etc. etc. When my friends tell me today that their six-month-old baby could instinctively learn to start their favorite game on the iPad, I believe them. I believe them because I experienced it myself.

By all accounts, Apple products bear the fingerprints of Steve Jobs's dogged vision. His vision offers three important lessons for graphics designers:

1) Never take your eyes off the user experience.

The product is in service of the user. Charts serve readers. What are the key questions to answer? How can we help deliver their needs effortlessly?

2) Maintain the producer's control.

Knowing the user does not mean relinquishing control. Apple products are very tightly designed. The email application on the iPhone works beautifully out of the box but it doesn't try to replicate every feature available online. It doesn't have to. Good graphics are never neutral; their producers have a point of view.

3) Balance form and function.

Detractors often mock Apple for false "innovations": they ask, why should a white iPhone cost more than a black one? How can rainbow-color iPods be considered an innovation? But we all react to beauty, to form. One shouldn't elevate form at the expense of function, but function without form is hardly enough. The same holds for graphics.

The return on effort in data graphics

I contributed the following post to the Statistics Forum. They are having a discussion comparing information visualization and statistical graphics. I use the following matrix to classify charts in terms of how much work they make readers do, and how much value readers get out of doing said work.



To read the rest of it, click here.

Have data graphics progressed in the last century?

Received a wonderful link via reader Lonnie P. to this website that presents a historical reconstruction of W.E.B. DuBois's exhibit of the "American negro" at the 1900 Paris Expo. Amusingly, DuBois presented a large series of data graphics to educate the world on the state (plight) of blacks in America over a century ago.

You can really spend a whole afternoon examining these charts (and more); too bad the charts have poor resolution and it is often hard to make out the details.


Judging from this evidence, we must face up to the fact that data graphics have made little progress during these eleven decades. Ideas, good or bad, get reinvented. Disappointingly, we haven't learned from the worst ones.

Exhibit A 


(see discussion here)

Exhibit B


 (see discussion here)

Exhibit C 


(See discussion here.)

Exhibit D

 (see the Vampire chart here)

Exhibit E

(see the discussion here.)

Exhibit F

(see discussion here.)

Audio bookmarks

I look at a fair number of online videos, especially those embedded on blogs. But I haven't seen this feature implemented broadly. It is a wow feature.

Look at the dots above the progress bar: they tell you what topic is being discussed and allow you to jump back and forth between segments. (The particular dot I moused over said "Randy Moss".) The video I saw came from this link.


This simple-looking feature is immensely useful to users. You can efficiently search through the audio file and find the segments you're interested in. It's like bookmarks students might put on pages of a textbook for easy reference, except these are audio bookmarks.

Why isn't this feature more prevalent? I think it's because of the amount of manual effort needed to set this up. Imagine how the data has to be processed. In the digital age, the audio file is a bunch of bits (ones and zeroes), so neither a computer nor a human can identify topics directly from data stored in that way. So, someone would need to listen to the audio file, mark off the segments manually, and tag the segments. Then, the audio bookmarks can be plotted on the progress bar... basically a dot plot with time on the horizontal axis.
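Once the tagging is done by hand, the data behind the dots is tiny. Here is a minimal Python sketch of what it might look like; the start times and all topics other than "Randy Moss" are invented, and I have no knowledge of how the site actually stores its bookmarks.

```python
from bisect import bisect_right

# Hypothetical hand-tagged segments: (start time in seconds, topic).
bookmarks = [
    (0, "Intro"),
    (95, "Randy Moss"),
    (260, "Listener questions"),
]


def topic_at(bookmarks, seconds):
    """Return the topic being discussed at a given playback time."""
    starts = [start for start, _ in bookmarks]
    index = bisect_right(starts, seconds) - 1
    return bookmarks[max(index, 0)][1]


def dot_positions(bookmarks, duration):
    """Place each dot as a fraction of the progress bar's width."""
    return [(start / duration, topic) for start, topic in bookmarks]


dots = dot_positions(bookmarks, 380)
```

The list must be sorted by start time for the lookup to work. The expensive part is not this code but producing the list in the first place.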

In theory, you can train a computer to listen to an audio file and approximate this task. The challenge is to attain the required accuracy so you don't need to hire an army of people to correct mistakes.

A very simple concept but immensely functional. Great job!

Showing dynamics on a business chart

Dave S. achieved a rare feat, which is to send in a great-looking set of charts. This post at Asymco is worth reading in its entirety; the author Horace discusses the process by which he worked through several charts, arriving at the one he's most happy with.


The secret to the success here is the careful framing of the question, and the collection of the appropriate data to address that question. The question is the competition between wireless phone vendors in the last three years. It was established that the right way to view this competition is in two dimensions: share of revenues, and share of profits. Note the word "share". Share of profits is not a metric that is often discussed but it is the right metric to compare to the share of revenues -- getting both numbers onto the same comparable scale is what makes this work.

Needless to say, the raw data one would collect come from the financial statements of the eight individual vendors. Plotting these numbers directly would be a mistake. So you take the numbers, making sure that you're really counting wireless revenues and wireless profits, and then compute the shares. (I am not actually sure that they have wireless profit data because large companies like Apple and Nokia typically don't break out their profit shares, even if they provide the revenue shares by line of business.)
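The share computation itself is simple, as this Python sketch shows. The vendors and dollar figures are placeholders, not numbers from the Asymco post or from any financial statement.

```python
def shares(amounts):
    """Convert raw amounts per vendor into shares of the total."""
    total = sum(amounts.values())
    return {vendor: value / total for vendor, value in amounts.items()}


# Placeholder figures for three unnamed vendors (illustration only).
revenue = {"Vendor A": 50.0, "Vendor B": 30.0, "Vendor C": 20.0}
profit = {"Vendor A": 6.0, "Vendor B": 3.0, "Vendor C": 1.0}

revenue_share = shares(revenue)
profit_share = shares(profit)
```

Both results are now on the same 0-to-1 scale, which is what allows revenues and profits to be compared on one chart. One caveat of this sketch: if any vendor reports a loss, shares of total profit stop behaving like proportions, and a different treatment is needed.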

Horace also avoided the plague of plotting all time-series data as line charts (similar to the plague of plotting all geographic data on maps). By plotting revenues and profits simultaneously, he no longer can plot time (years) on one of the two axes, and that is a good thing.


This is the final graph Horace landed on. It puts all the vendors at the origin in 2007 and then tells us where they landed in 2010 in terms of revenue and profit share growth/decline.

It would be even better if he made the scales work harder: e.g. have equal lengths for the 10% change along both the vertical and horizontal axes. Alternatively, you can scale it such that each unit on either axis represents equal dollars.

This is a very focused chart that answers the question about the relative change in positioning of each vendor. What it doesn't answer is the starting position or ending position of each. Note, while Nokia is depicted as losing share on both revenues and profits, Nokia still has twice the revenue share of the other vendors, and out-earns everyone except Apple!

I am not saying this is a bad chart. It is designed to answer the relative question, not the absolute question. That's all.


There is one way to have the cake and eat it too.  Horace almost created that chart. He showed two scatter plots, one for 2007 and one for 2010.

If he just overlays one on the other, and uses lines to connect the dots for each phone vendor, he will have a chart that shows absolute and relative values all at once. Here's a crude illustration of this: (missing the labels to show that the arrow end of the line represents 2010 positions)


I like this kind of chart a lot. It is great for showing dynamics in a set of variables, without actually making the chart dynamic.

(Even on this chart, it is better to harmonize the two scales.)
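The data behind such an overlay can be sketched in a few lines of Python. Each vendor gets a start point, an end point, and the change between them; the change is what the origin-based chart shows, while the start and end points carry the absolute positions. The vendor names and values (in percentage points) are placeholders, not the Asymco data.

```python
def arrows(start, end):
    """Pair each vendor's start and end positions and compute the change.

    start, end: dicts mapping vendor -> (revenue_share, profit_share).
    """
    result = {}
    for vendor in start:
        x0, y0 = start[vendor]
        x1, y1 = end[vendor]
        result[vendor] = {
            "from": (x0, y0),                # absolute position in the first year
            "to": (x1, y1),                  # absolute position in the last year
            "change": (x1 - x0, y1 - y0),    # relative movement between the two
        }
    return result


# Placeholder shares in percentage points (illustration only).
shares_start = {"Vendor A": (40, 50), "Vendor B": (5, 10)}
shares_end = {"Vendor A": (30, 35), "Vendor B": (20, 45)}

segments = arrows(shares_start, shares_end)
```

Drawing one arrow per entry, from "from" to "to", gives the overlay; drawing "change" from the origin gives the relative-only view.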


Book review: Interactive Graphics for Data Analysis

I am happy to provide the following review of this interesting book by Martin and Simon, who are readers of Junk Charts. Martin also publishes a blog, and he's the one who has created bumps charts for the Tour de France races (which also appear in the book).

Interactive Graphics for Data Analysis is an advanced book written by two researchers who have deep experience developing graphics software. People who like to go beyond the basics will find it a useful addition to the literature.

To give you an idea of the level of sophistication, just in Chapter 1 (titled Interactivity), the two authors utilize set operations, SQL statements, and parallel coordinate plots. They assume you have some sense of what those are. That said, those sections can be skipped without interrupting the flow of the book.

The following key messages from these authors are worth repeating:

  • There is a distinction between statistical graphics and data graphics. Underlying trends and patterns in the data are often made clear by performing statistical analyses on the data, with the results added to charts (e.g. loess lines). When dealing with very large data sets, statistical charts (such as box plots) are found to be much more scalable, precisely because they do not attempt to put every data point onto the page.
  • The authors stress the need to look at a variety of charts when doing exploratory data analysis. This is because most chart types do certain things well but not others.
  • Throughout the book, they make much hay of the problem of "over-plotting", that is, overlapping data. This happens when data is abundant, or when values are concentrated in a narrow range. A great illustration of this problem is the parallel coordinates plot, which can look entirely different depending on which lines are plotted on top of which other lines. (The charts on the right are identical except for the order in which the lines are plotted.) Common strategies include "jittering", and varying transparency. Many of these strategies have issues of their own.
  • They also point out that the look of many multivariate charts (such as mosaic charts) depends on the sorting of the data. This is a key weakness of many such plots. Just think about this the next time you create a stacked column chart.
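As an illustration of the "jittering" strategy mentioned in the list above, here is a minimal Python sketch. It is not taken from the book; the function name, the uniform noise, and the width parameter are my own assumptions about one simple way to do it.

```python
import random


def jitter(values, width, seed=0):
    """Add uniform noise of total width `width` so identical values separate.

    The seed is fixed so that the same chart is drawn every time.
    """
    rng = random.Random(seed)
    half = width / 2
    return [v + rng.uniform(-half, half) for v in values]


# Three identical values and two more: unjittered, they would overlap.
raw = [1, 1, 1, 2, 2]
jittered = jitter(raw, 0.2)
```

This also shows one of the issues with the strategy: the plotted positions are no longer the true values, so the width has to stay small relative to the gaps between real values.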

The book is divided into two sections: Principles and Examples. The second half, the Examples section, consists of case studies in which the authors show examples of how to investigate the structure of a given data set.

The example of using the fatty-acid contents of Italian olive oils to deduce their regional origin is a good visualization of how the statistical technique of classification trees works. Here is the telling diagram:

 Notice that data with the same color are oils from the same region, the rectangular sections are results of the statistical classification procedure, and we would like to see most (if not all) of the data within each section having the same color.
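The idea of "most of the data within each section having the same color" can be put into a number. This Python sketch computes, for one rectangular section, the most common region and the fraction of oils in the section that belong to it. The labels are invented for illustration and the function is not from the book.

```python
from collections import Counter


def section_purity(labels):
    """Return the majority label in a section and its share of the section."""
    counts = Counter(labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(labels)


# Invented region labels for the oils falling into one section.
section = ["South", "South", "South", "North"]
majority, purity = section_purity(section)
```

A purity of 1.0 means every oil in the section comes from the same region, which is the ideal the diagram is meant to convey.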


Without a doubt, graphics designers should be aware of the issues raised by these authors. The book appears to be written for students who are creating statistical software (complete with end-of-chapter exercises). I'm left wondering what users of graphics software can do with this information because much of this material relates to the design of graphics software. Knowing these issues makes you want to do things the software may not be designed to do efficiently. For example, most software packages I have used do not have a simple toggle to sort categorical variables by various means (alphabetical, increasing or decreasing frequency, increasing or decreasing value of another variable, etc.).
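The kind of toggle I have in mind is not hard to describe in code. Here is a minimal Python sketch of a hypothetical helper that returns the category order for a chart; the function name, its parameters, and the sample labels are all made up for illustration.

```python
from collections import Counter


def order_categories(values, by="alphabetical", key_values=None, descending=False):
    """Return the distinct categories in the requested order.

    by: "alphabetical", "frequency", or "value".
    key_values: dict of category -> number, required when by="value".
    """
    counts = Counter(values)
    if by == "alphabetical":
        key = lambda c: c
    elif by == "frequency":
        key = lambda c: (counts[c], c)  # ties broken alphabetically
    elif by == "value":
        key = lambda c: key_values[c]
    else:
        raise ValueError(f"unknown ordering: {by}")
    return sorted(counts, key=key, reverse=descending)


# Invented category labels for illustration.
regions = ["South", "North", "South", "Sardinia", "South", "North"]
alphabetical = order_categories(regions)
by_frequency = order_categories(regions, by="frequency", descending=True)
by_other = order_categories(
    regions, by="value", key_values={"North": 3, "Sardinia": 1, "South": 2}
)
```

The resulting list can be handed to whatever plotting tool is in use as the order of the bars, columns, or mosaic tiles.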