## Misguided warheads in the classroom

##### Oct 28, 2014

Alberto Cairo just gave a wonderful talk to my workshop, in which he complained about the state of dataviz teaching. So it's quite opportune that reader Maja Z. sent in a couple of examples from a recent course on data visualization for academics. She was surprised to see these held out as examples of good work. I'll discuss one chart today, and the other one some other day.

The original is from a Korean newspaper.

The instructor for the course praised this chart for this principle: "always try to find a graphic that relates to your subject, like the bullets here representing military spending, and use it in the chart."

Students who take my class learn the opposite lesson: I like to say imagery often backfires. I do like charts with imagery that makes the data come alive, but more often than not, the designer falls in love with the imagery and lets the data down.

This chart presumably shows the top 10 military spenders in the world by total amount spent in 2013. You'd think that the Chinese spent a bit more than half what the Americans did. But the data labels say \$640 billion vs \$188 billion, only about 30%. Next, the Russian spend is 46% of the Chinese according to the data, etc. So, is this really a data visualization or just some pictures with numbers printed next to them?

It's possible that the data is encoded in the surface areas or the volumes of these warheads but in reality, this is a glorified column chart, so most readers will respond to the heights of the columns.
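As a rough check on the area and volume possibilities, here is a minimal Python sketch. It assumes the warheads are uniformly scaled copies of one shape, so that encoding a value in area or volume shrinks the visible height ratio to a square or cube root of the data ratio. The function name `implied_height_ratio` is made up for illustration; the two dollar figures are the data labels quoted above.

```python
def implied_height_ratio(value_ratio: float, dimensions: int) -> float:
    """Height ratio a reader would see if the data were encoded in
    `dimensions` dimensions of uniformly scaled shapes
    (1 = height, 2 = area, 3 = volume)."""
    if value_ratio <= 0 or dimensions < 1:
        raise ValueError("value_ratio must be positive and dimensions >= 1")
    return value_ratio ** (1.0 / dimensions)


# Data labels from the chart, in billions of dollars.
US_SPEND = 640.0
CHINA_SPEND = 188.0

value_ratio = CHINA_SPEND / US_SPEND  # the ratio the data actually support

# Assumption: uniformly scaled shapes, so height follows a root of the data ratio.
height_ratios = {
    name: implied_height_ratio(value_ratio, dims)
    for name, dims in [("height", 1), ("area", 2), ("volume", 3)]
}

for name, ratio in height_ratios.items():
    print(f"encoded in {name}: China's warhead is {ratio:.0%} as tall as the U.S. one")
```

Under these assumptions, an area encoding puts the Chinese warhead at roughly 54 percent of the American one's height, which is at least consistent with the "bit more than half" impression. That does not prove the designer intended it, and it does not change how readers will respond to the heights.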

***

The designer seems to appreciate that total spending is not necessarily a great metric. Spending as a proportion of GDP is provided as a secondary metric. I'm not so sure what to make of this though: should we expect richer nations to need/want to spend more building bombs and such? It just doesn't seem very logical to me.

Instead, a more meaningful metric might be military spending per capita. Controlling for population seems somewhat logical; the more people you have to protect, the more money you have to spend.

In the end, I made this scatter plot that tries to have it both ways:

(The percentages are of GDP.)

Here, we can see that Saudi Arabia and the U.S. are particularly aggressive spenders, spending over \$2000 per person per year. Their two dots sit way above the average line (for the top 10 spenders). At the richer end of the scale, the American spending is way above the international average. On the other hand, Japan and Germany both spend significantly less than would be predicted by their GDP per capita levels.

Of note, readers more easily relate to the per-capita numbers than the aggregate figures in the original chart. They learn, for instance, that Saudi Arabia's average GDP was \$27,000 per head, of which \$2,500 went to arming itself up.
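The two metrics are linked by simple arithmetic, which a short Python sketch makes explicit. The function names are hypothetical, and the only data used are the rounded Saudi figures quoted above, so the result is approximate.

```python
def spend_per_capita(total_spend: float, population: float) -> float:
    """Military spending per person, in the same currency unit as total_spend."""
    if population <= 0:
        raise ValueError("population must be positive")
    return total_spend / population


def share_of_gdp(spend_per_head: float, gdp_per_head: float) -> float:
    """Fraction of GDP going to the military, from two per-capita figures."""
    if gdp_per_head <= 0:
        raise ValueError("gdp_per_head must be positive")
    return spend_per_head / gdp_per_head


# Rounded Saudi Arabia figures quoted in the post (dollars per person).
SAUDI_GDP_PER_HEAD = 27_000
SAUDI_SPEND_PER_HEAD = 2_500

saudi_share = share_of_gdp(SAUDI_SPEND_PER_HEAD, SAUDI_GDP_PER_HEAD)
print(f"Saudi Arabia: about {saudi_share:.1%} of GDP per head goes to the military")
```

Because population cancels out, the per-capita spending divided by GDP per capita is the same as total spending divided by total GDP, which is why the scatter plot can carry both views at once.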

## The class pondering Big Data

##### Oct 23, 2014

Note: I'm traveling a lot lately and it is affecting my ability to post on a regular basis.

It's three weeks into my chart-building workshop (link) at NYU and we are starting to discuss individual projects. One of the major discussion points this week is the quality of the underlying data being visualized.

One student is visualizing movie data from IMDB. He showed a chart comparing the year of a movie's release and the number of votes it has received. Do people talk more about new or old movies? Not surprisingly, the distribution is highly skewed with recent movies getting a lot more votes. The consensus in the room is that you never just want to see the pattern; the natural question to ask is why are we seeing such a pattern.

The easiest response is that people tend to vote on recent movies. This is the availability heuristic: you tend to talk about things that are top of mind. But there is a lot more to it. Perhaps movies of specific genres get discussed more often. Perhaps movies with larger marketing budgets get more buzz. And so on. If any of these factors is important, a good data visualization should bring it out.

Another factor that isn't obvious is that IMDB only started recently relative to the history of movies. The start date of data collection is highly informative here. Imagine a database that was created five years ago versus one that was created five decades ago. The former dataset is not a random sample of the latter, far from it. The availability heuristic matters here. Also, the movie industry has been growing in the meantime, so the number of movies is changing. Internet access is also growing, so the number of votes is changing. Finally, all students agree that anyone caring to comment on older movies is probably someone who likes those movies, and thus they expect the average rating of older movies to be higher than that of more recent ones... we'd have to verify this hypothesis using the data.

A lot of Big Data have these characteristics. The starting date of data collection matters a lot. Averaging data without accounting for these timing issues leads to wrong conclusions.

***

The dynamics of people rating/commenting on movies is a topic I'm interested in. If you go to Amazon and pull up Freakonomics, published 6 years ago, it has over 1800 reviews, of which over 800 are five stars, and 1300 are four or five stars, and yet the most recent reviews submitted are dated 3, 5, 6, etc. days ago. Why do people keep writing reviews? For example, two of the reviews written this week just said "great!" and "great book". Another said "Outstanding take on the odd correlations between things in our culture. Definitley makes you think outside the lines." That comment has probably been repeated hundreds of times already by the preceding reviewers. Has anyone studied this yet?

##### Oct 15, 2014

I had the pleasure of visiting the Facebook data science team last week, and we spent some time chatting about visual communication, something they care as much about as I do. Solomon reported on our conversation in this blog post. One topic is stacked bar charts, which are useful in limited situations, such as when the categorical variable has two or three levels.

Solomon used stacked bars in his fascinating post about how candidates from the two political parties are using Facebook messages in the run-up to mid-term elections. Be sure to read about it here. This is an example of good data journalism in which the outcomes of the analysis are presented simply, hiding the amount of technical work that went into its production.

This stacked bar chart is effective at pointing out the differences in the types of messages being sent out by party:

I do have one question, which is the placement of the 50-percent line. The line is very important to this chart, and I like the way it looks. When the line sits at 50 percent, it implies that the Republican and Democratic candidates were issuing about equal numbers of Facebook messages. If the share of total messages is not 50/50, then the reference line should sit elsewhere.
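To make the point concrete, here is a minimal Python sketch of where the reference line would go. The message counts are invented purely for illustration (they are not Facebook's data), and the function name `reference_share` is hypothetical.

```python
def reference_share(counts: dict[str, tuple[int, int]]) -> float:
    """Overall Republican share of all messages across topics.

    `counts` maps each topic to (republican_messages, democratic_messages).
    This is where the reference line belongs if the parties did not
    send equal numbers of messages overall.
    """
    total_rep = sum(rep for rep, _ in counts.values())
    total_dem = sum(dem for _, dem in counts.values())
    if total_rep + total_dem == 0:
        raise ValueError("no messages to compare")
    return total_rep / (total_rep + total_dem)


# Hypothetical counts, NOT taken from the Facebook analysis.
example_counts = {
    "economy": (300, 200),
    "health care": (100, 200),
    "education": (200, 100),
}

line = reference_share(example_counts)
print(f"reference line at {line:.1%} rather than 50%")

for topic, (rep, dem) in example_counts.items():
    share = rep / (rep + dem)
    side = "above" if share > line else "below"
    print(f"{topic}: Republican share {share:.0%}, {side} the reference line")
```

With these made-up numbers the line lands near 55 percent, so a topic split 52/48 would look Republican-leaning against a 50-percent line when it is actually below that party's overall share.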

They later split the races by toss-ups versus uncompetitive, and use confidence intervals to communicate both the expected rate and the uncertainty of the estimate. The uncertainty bars in effect tell readers that there are many more uncompetitive elections than toss-ups.

The choice of the chart form is fine. But it makes me pull out my Tufte book. The data-ink ratio on this chart needs a little help. The gridlines can go. Even the 250 label on the x-axis can go. I might even go with just labelling the midpoints.

Lastly, this next chart is enlightening. Seems like older adults are much more likely to comment and/or like such political messages; men are more likely to comment while women are more likely to like. The small-multiples format helps us grasp the three-way analysis without much suffering.

## An infographic showing up here for the right reason

##### Oct 09, 2014

Infographics do not have to be "data ornaments" (link). Once in a blue moon, someone finds the right balance of pictures and data. Here is a nice example from the Wall Street Journal, via ThumbsUpViz.

What makes this work is that the picture of the running back serves a purpose here, in organizing the data. Contrast this with the airplane from Consumer Reports (link), which did a poor job of providing structure. The alternative, a bar chart, is clearly inferior and much less engaging.

***

I went ahead and experimented with it:

I fixed the self-sufficiency issue, always present when using bubble charts. In this case, I don't think it matters whether the readers know the exact number of injuries so I removed all of the data from the chart.

Here are three temptations that I did not implement:

• Not include the legend
• Not include the text labels, which are rendered redundant by the brilliant idea of using the running guy
• Hide the bar charts behind a mouseover effect

## Data decorations, ornaments, chartjunk, and all that

##### Oct 07, 2014

Alberto Cairo left a comment about "data decorations". This is a name he's using to describe something like the windshield-wiper chart I discussed the other day. It seems like the visual elements were purely ornamental and added nothing to the experience--one might argue that the experience was worse than just staring at the data table.

It just happens that I have another example of such a chart, submitted by Xan. This one is from Consumer Reports, and illustrates some findings from a recent survey on what things air travellers hate most. Good luck figuring all this out!

A few of these ideas work, such as the complaints about leg room being tied to the seated passengers inside the plane. But then, the data about people hating middle seats is placed on the upper left corner between the left wing and the tail. All of the atypically shaped charts (the cloud, the triangle, the octagon) seem to use the oft-criticized convention of coding the data onto just one dimension of these multi-dimensional objects. I just find the organization of the text confusing and poorly structured.

Xan pulled something from a much older Consumer Reports. And they dared to use a boring bar chart:

A nice compromise would be to create some subsections under Airlines to group different types of complaints (stuff relating to seating, stuff about service, stuff about punctuality, etc.). Ask a designer to draw some icons (remember the NYT dog graphic!).

## A patently pointless picture

##### Oct 03, 2014

I am mystified by the intention behind this chart, published in NYT Magazine (Sept 14, 2014).

It is not a data visualization since the circles were not placed to scale. The 650 and 660 should have been further to the right on a horizontal time scale. And if we were to take the radial time axis literally, the 390 circle would be closest to the center.

It is not a work of art. It doesn’t look particularly appealing. Sometimes, designers are inspired by imagery. The accompanying article concerns windshield wipers, and I’m not seeing the imagery.

***

The arrangement of the circles actually interferes with the reader’s comprehension. Here is a straightforward version of the data as a column chart.

Now, let’s turn it on its side, with time running vertically instead of horizontally (the convention).

Then, we need to invert convention once again by making the vertical axis run in reverse so that time runs from top to bottom, instead of bottom to top.

Finally, distort the frequency axis, replace the bars with circles, and you have essentially replicated the original.

The point is each step obscures the pattern more. In this case, following conventions makes a better chart.

***

I have a pet peeve about presenting partial data next to complete data, even if it is labeled correctly. On this chart, the number 390 cannot be compared against any of the other numbers because we are not even halfway into the decade of the 2010s. Instead of plotting the total number of patents per decade, it would have been more useful to plot the number of patents per year in each decade (43, 26, 65, 41, etc.). For the 2010s, I am assuming they have data for 3.5 years.

A simple column chart looks like this:

The per-year view shows that the 2010s are unusual. Of course, I should add a footnote to the chart to make it clear that we only have partial data for the 2010s, and that the assumption behind the averaging is that the pace of patents will remain the same on average for the remainder of the decade.
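The per-year conversion is simple enough to sketch in a few lines of Python. The decade totals of 650 and 390 come from the chart as discussed above; the 3.5 years of coverage for the 2010s is the stated assumption, not a known figure, and the function name is made up for illustration.

```python
def patents_per_year(total: float, years_observed: float = 10.0) -> float:
    """Average patents per year over the portion of the decade observed."""
    if years_observed <= 0:
        raise ValueError("years_observed must be positive")
    return total / years_observed


# A complete decade: divide the total by 10 years.
full_decade_rate = patents_per_year(650)

# The 2010s: only part of the decade is in the data.
# ASSUMPTION: 3.5 years of coverage, as guessed in the text.
partial_decade_rate = patents_per_year(390, years_observed=3.5)

print(f"complete decade with 650 patents: {full_decade_rate:.0f} per year")
print(f"2010s so far with 390 patents: {partial_decade_rate:.0f} per year")
```

On a per-year basis, the partial decade comes out at about 111 patents a year, well above the 65 for the complete decade, which is the opposite of the impression left by comparing 390 against 650.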

In the Trifecta Checkup, this is Type DV.