
Chance to ask me a question this Friday

I will be at Book Expo this Friday signing books at the McGraw-Hill booth. If you're in NYC, drop by and say hi between 11 and 12.

Yes, it's a new book!  The title is Numbersense: How to Use Big Data to Your Advantage (link). If you read my blogs, you already know where I'm going with this. How can we be smart consumers of data analyses in a world overflowing with data? It will be in stores in July. Between now and then, you can come back here to learn more.

Also, at 12:30, I'll be interviewed at the Shindig event by Peggy Sanservieri, who blogs at Huffington Post on book marketing. This is an online live chat event. Go to their site to register, and you'll have the opportunity to ask me questions.

(This is cross-posted on both blogs.)

Superimposing time series is the biggest source of silly theories

Business Insider (link) published the following chart and declared "the end of the car age in one chart". The chart superimposed the monthly motor vehicle miles driven per capita and the labor force participation rate.


This is the conclusion of the post:

There's a logical connection between the two. Not in the workforce? You're less inclined to drive.

It's strange that they chose to show a time series going back to the 1970s. The conclusion is logical only for the last five years of the data. Looking back even another decade, to the previous recession (2001), one finds the exact opposite conclusion: as the labor force participation rate fell, the per-capita miles driven went up.
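Here is a minimal sketch of how the apparent relationship between two superimposed series can flip sign depending on the window you look at. The numbers are made up for illustration and are not the data behind the Business Insider chart; the `pearson` helper and the variable names are my own.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up values for illustration only; NOT the data behind the chart.
participation = [67.0, 66.8, 66.5, 66.2, 66.0, 66.0, 65.6, 65.0, 64.5, 64.0]
miles_driven = [9.0, 9.2, 9.4, 9.6, 9.8, 9.8, 9.6, 9.4, 9.2, 9.0]

# Earlier window: participation falls while miles driven rise (negative).
early = pearson(participation[:5], miles_driven[:5])
# Later window: both series fall together (positive).
late = pearson(participation[5:], miles_driven[5:])

print(f"early window: {early:+.2f}, late window: {late:+.2f}")
```

The same pair of series supports opposite stories depending on where the window starts, which is why the choice of time span matters.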

The other problem is causation creep, about which I have written on the sister blog (link). This chart merely shows correlation (and that is questionable). The conclusion of cause and effect is purely theory. Another theory would be the rise in telecommuting and work-from-home situations. A counter-theory would be that the unemployed may have more free time to drive. Another theory is that gas prices have gone up:


Any time series you can find that has a peak during the 2000s can be similarly interpreted as having caused people to stop driving. Here's a chart of real house prices from Calculated Risk.


Falling house prices cause people to stop driving. Or perhaps falling house prices cause people to lose jobs.

Lose the base, connect the dot, and confuse the message

Reader Jack S. sent over this chart (link):

The first problem readers encounter with this image is "What is MMI?" I like to think of any presentation as a set of tearout pages. Even if the image is part of a book, or part of a deck of slides, once it is published, the writer should expect readers to tear a sheet out and pass it along. In fact, you'd love to have people pass along your work. This means that when creating a plot such as this, the designer must explain what MMI is in the footnote. Yes, on every chart, even if every chart in the report deals with MMI.

MMI, I'm told, is some kind of metric of health care cost.


What a mess. They are trying to use the metaphor of "measuring one's temperature", which I suppose is cute because MMI measures health care costs.

Next, the designer chose to plot the index against the national average as opposed to the dollar amount of MMI. This presents a challenge since the thermometer does not have a natural baseline number. This is especially true on the Fahrenheit scale used in the U.S.

Then, a map is introduced to place the major cities. The bulb of each thermometer now doubles as a dot on the map. This step is mind-boggling because the city labels aren't even on the map. So if you know where these cities are, you don't need the map for guidance but if you don't know the locations, you're as hopeless as before.

How the data now gets onto the complex picture requires some deconstruction.

First, start with a bar chart of the relative index (the third column of the table shown above).

Then, chop off the parts below 85 (colored gray).


Next, identify the cities that are below the national average (i.e. index < 100) and color them blue.

You can see this by focusing only on the chart above the map. In other words, this part:


To get from here to the version published, add a guiding line from each bar to the dot on the map for the corresponding city. Notice that a constant-length portion of each bar has been chopped off, and now each bar is augmented by some additional length that varies with the distance of the bar chart from the geographical location of the city as shown on the map below. For instance, Miami, which is furthest south, has the biggest distortion.


The choice of 85 as a cutoff is arbitrary and inexplicable. If we really want to create a "cutoff" of sorts, we can use 100, which represents the national average. By plotting the gap between the city index and the national index, effectively, the percent difference, we also can use the sign of the difference to indicate above/below the national average, thus saving a color.
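A minimal sketch of that transformation, assuming an index where 100 is the national average. The city names and index values are hypothetical placeholders, not the figures from the chart.

```python
NATIONAL_INDEX = 100.0  # the national average, by construction of the index

# Hypothetical index values; NOT the figures from the published chart.
city_index = {"City A": 112.0, "City B": 94.5, "City C": 100.0}

def gap_from_national(index, national=NATIONAL_INDEX):
    """Percent difference between a city's index and the national index."""
    return (index - national) / national * 100.0

gaps = {city: gap_from_national(value) for city, value in city_index.items()}

# Sort from highest to lowest; the sign alone says above or below average.
for city, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    if gap > 0:
        side = "above"
    elif gap < 0:
        side = "below"
    else:
        side = "at"
    print(f"{city}: {gap:+.1f}% ({side} the national average)")
```

Bars drawn from these gaps start at zero, so no arbitrary cutoff is needed and bar direction does the job that color did before.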



One of the most telling signs of a failed chart is the appearance of the entire data set next to the chart. That's the essence of the self-sufficiency test.

More power brings more responsibility

Nick C. on Twitter sent us to the following chart of salaries in Major League Soccer. (link)


This chart is hosted at Tableau, which is one of the modern visualization software suites. It appears to be a user submission. Alas, more power did not bring more responsibility.

Sorting the bars by total salary would be a start.

The colors and subsections of the bars were intended to unpack the composition of the total salaries, namely, which positions took how much of the money. I'm at a loss to explain why those rectangles don't seem to be drawn to scale, or what it means to have rectangles stacked on top of each other. Perhaps it's because I don't know much about how the cap works.

Combined with the smaller chart (shown below), the story seems to be that while all teams have similar cap numbers, the actual salaries being paid could differ by multiples.


This is the standard stacked bar chart showing the distribution of salary cap usage by team:


I have never understood the appeal of stacking data. It's not easy to compare the middle segments.

After quite a bit of work, I arrived at the following:


The MLS teams are divided into five groups based on how they used the salary cap. Salary cap figures are converted into proportion of total cap. For example, the first cluster includes Chicago, Los Angeles, New York, Seattle and Toronto, and these teams spread the wealth among the D, F, and M players while not spending much on goalie and "others". On the other hand, Groups 2 and 3, especially Group 3, allocated 30-45% of the cap to the midfield.

Three teams form their own clusters. CLB spends more of its cap on "others" than any other team (others are mostly hyphenated positions like D-F, F-M, etc.). DAL and VAN spend a lot less on midfield players than other teams. VAN spends a lot on defense.
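The conversion to proportions mentioned above can be sketched as follows. The team names and dollar amounts are hypothetical, and this covers only the proportion step, not the grouping of teams.

```python
# Hypothetical cap figures by position; NOT actual MLS salary data.
cap_by_position = {
    "Team X": {"D": 900_000, "F": 1_200_000, "M": 600_000, "GK": 200_000, "Other": 100_000},
    "Team Y": {"D": 500_000, "F": 400_000, "M": 900_000, "GK": 150_000, "Other": 50_000},
}

def to_shares(amounts):
    """Convert dollar amounts by position into shares of the team's total."""
    total = sum(amounts.values())
    return {position: amount / total for position, amount in amounts.items()}

shares = {team: to_shares(amounts) for team, amounts in cap_by_position.items()}

for team, team_shares in shares.items():
    profile = ", ".join(f"{pos} {share:.0%}" for pos, share in team_shares.items())
    print(f"{team}: {profile}")
```

Once every team is on the same 0 to 100% scale, profiles can be compared directly even when total spending differs by multiples.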

My version has many fewer data points (although the underlying data set is the same) but it's easier to interpret.


I tried various chart types like bar charts, and even pie charts. I still like the profile (line) charts best.


In modern software (I'm using JMP's Graph Builder here), it's only one click to go from line to bar, and one click to go to pie.



Rotating circle, loose ends, on-line dashboards and charts

There is a tendency when producing dashboards to go for the cutesy-cutesy. Reader Daniel L. came across an attempt by Facebook to document its data center metrics (link). They chose this circular, spiraling design:


Notice that the lines of equal distance on a circular plot are the concentric circles. Thus, when they connect different points in a continuous way, as if it were a standard line chart, the line segments between data points are distorted. The diagram below shows the problem:
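The distortion can also be checked numerically. In this sketch (the function names are mine), two readings have the same value but sit a quarter turn apart on the circle. A standard line chart would show a flat line between them; the straight segment on the circular plot dips toward the center.

```python
import math

def polar_to_xy(radius, angle):
    """Convert a (radius, angle in radians) point to Cartesian coordinates."""
    return radius * math.cos(angle), radius * math.sin(angle)

def chord_midpoint_radius(r1, angle1, r2, angle2):
    """Distance from the origin to the midpoint of the straight segment
    joining two points given in polar coordinates."""
    x1, y1 = polar_to_xy(r1, angle1)
    x2, y2 = polar_to_xy(r2, angle2)
    return math.hypot((x1 + x2) / 2, (y1 + y2) / 2)

# Two readings with the same value (1.0), a quarter turn apart.
r_mid = chord_midpoint_radius(1.0, 0.0, 1.0, math.pi / 2)
print(f"radius at the midpoint of the segment: {r_mid:.3f}")  # about 0.707
```

Halfway along the segment, the plotted value reads about 29% lower than either endpoint, even though nothing changed in the data.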


One potential advantage (although not worthwhile) of wrapping the data into a circle is that the 24 hours become a continuous line. Except that it isn't the case here! Weirdly, the purple and blue lines show a huge discontinuity at the ray that points vertically upwards from the origin. This leads to an even more fascinating find.

The circle actually rotates! It's like a rotating restaurant. The time shown vertically pointing upwards keeps changing as I write this post. This makes the discontinuity even more baffling. You'd think the previous data point just shifts anti-clockwise but apparently not. If any of you can figure this out, please leave a comment.


As Daniel pointed out, the traditional line charts shown in the bottom half of the page would have done the job with less fuss. Not as eye-catching, but not as baffling either.



One innovation of on-line charts is the replacement of axis labels with mouse-over effects. Mousing over the chart here produces the underlying data values. This is elegance.

One horrible trend with on-line charts is the horrendous choice of scale. Look at the top two charts, especially the orange line chart about power usage. It makes no sense to choose a scale that completely annihilates the underlying fluctuations.

I have found the same problems with many Google charts. It looks as if nothing is happening, but when you look more closely, you learn that a tiny distance represents a big percentage shift in the underlying data.
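One rough way to quantify the scale problem is to ask what fraction of the vertical axis the data actually occupies. The readings and axis limits below are hypothetical, and both helper functions are my own; the sketch assumes the values are not all identical.

```python
def axis_fill(values, axis_min, axis_max):
    """Fraction of the axis height spanned by the data's own range."""
    return (max(values) - min(values)) / (axis_max - axis_min)

def padded_limits(values, pad=0.1):
    """Axis limits that hug the data, with a margin of `pad` times the range
    added on each side. Assumes the values are not all identical."""
    low, high = min(values), max(values)
    margin = (high - low) * pad
    return low - margin, high + margin

# Hypothetical readings; NOT taken from the dashboard.
readings = [1.08, 1.09, 1.07, 1.10, 1.08]

wide = axis_fill(readings, 0.0, 2.0)   # axis running from 0 to 2
low, high = padded_limits(readings)
fitted = axis_fill(readings, low, high)

print(f"wide axis: data fills {wide:.1%} of the height")
print(f"fitted axis: data fills {fitted:.1%} of the height")
```

On the wide axis the fluctuations are squeezed into a sliver of the chart, which is the flat-line effect described above.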


A gift from the NY Times Graphics team

This post is long over-due. I have been meaning to write about this blog for a long time but never got around to it. It's like the email response you postponed because you want to think before you fire it off. But I received two mentions of it within the last few days, which reminded me I have to get to work on this one.

One of the best blogs to read - that is similar in spirit to Junk Charts - is ChartNThings. This is the behind-the-scenes blog of the venerable New York Times graphics department. They talk about the considerations that go into making specific charts that subsequently showed up in the newspaper. You get to see their sketches. Kind of like my posts here, except with the graphics professional's perspective.

As Andrew Gelman said in his annotated blog roll (link), ChartNThings is "the ultimate graphics blog. The New York Times graphics team presents some great data visualizations along with the stories behind them. I love this sort of insider’s perspective."


The other mention is from a friend who reviewed something I wrote about fantasy football. He pointed me to this particular post from the ChartNThings blog that talks about luck and skill in the NFL.

They have a perfect illustration of how statistics can help make charts better.

Start with the following chart that shows the value of players picked organized by the round in which they are picked.


Think of this as plotting the raw data. A pattern is already apparent, which is that on average, the players picked in earlier rounds (on the left) have produced higher value for their clubs. However, there is quite a bit of noise on the page. One problem with dot plots is over-plotting when the density of points is high, as it is here. Our eyes cannot judge density properly, especially in the presence of over-plotting.

What the NYT team did next is to take the average value for all players picked in each round in each year, and plot those instead. This drastically reduces the number of dots per round, and cleans up the canvas a great deal.
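The aggregation step can be sketched in a few lines. The records are hypothetical (draft year, round, player value), not the NYT data, and the function name is mine.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (draft_year, round, value) records; NOT the NYT data.
picks = [
    (2005, 1, 40), (2005, 1, 60), (2005, 2, 30), (2005, 2, 10),
    (2006, 1, 50), (2006, 1, 70), (2006, 2, 20), (2006, 2, 20),
]

def average_by_round_and_year(rows):
    """Collapse individual picks into one average per (year, round)."""
    groups = defaultdict(list)
    for year, draft_round, value in rows:
        groups[(year, draft_round)].append(value)
    return {key: mean(values) for key, values in groups.items()}

averages = average_by_round_and_year(picks)

for (year, draft_round), avg in sorted(averages.items()):
    print(f"{year} round {draft_round}: average value {avg}")
```

Eight raw dots become four, one per round per year. Swapping `mean` for `statistics.median` gives the median version of the same summary.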

It's amazing how much more powerful this chart is than the previous one. Instead of the average value, one can also try the median value, or plot percentiles to showcase the distribution. (They later offered a side-by-side box plot, which is also an excellent idea.)

The post then goes into exploring a paper by some economists who wanted to ignore the average and focus on the noise. I'll make some comments on that analysis on my other blog. (The post is now live.)


One behind-the-scenes thing I'd add about this behind-the-scenes blog is that the authors must have spent quite a bit of time organizing the materials and creating the streamlined stories for us to savor. Graphical creation involves a lot of sketching and exploration, so there are lots of dead ends, backtracking, stuff you throw away. There will be lots of charts with little flaws that you didn't care to correct because it's not your final version. There will be lots of charts which will only be intelligible to the creator since they are missing labels, scales, etc., again because those were supposed to be sketch work. There will even be charts that the creator can't make sense of because the train of thought has been lost by the end of the project.

So we should applaud what the team has done here for the graphics community.