Book Preview: How Charts Lie, by Alberto Cairo

Howchartslie_cover

If you’re like me, your first exposure to data visualization was as a consumer. You may have run across a pie chart, or a bar chart, perhaps in a newspaper or a textbook. Thanks to the power of the visual language, you got the message quickly, and moved on. Few of us learned how to create charts from first principles. No one taught us about axes, tick marks, gridlines, or color coding in science or math class. There is a famous book in our field called The Grammar of Graphics, by Leland Wilkinson, but it’s not a For Dummies book. This void is now filled by Alberto Cairo’s soon-to-appear new book, titled How Charts Lie: Getting Smarter about Visual Information.

As a long-time fan of Cairo’s work, I was given a preview of the book, and I thoroughly enjoyed it and recommend it as an entry point to our vibrant discipline.

In the first few chapters of the book, Cairo describes how to read a chart. Some may feel that there is not much to it, but if you’re here at Junk Charts, you probably agree with Cairo’s goal. Indeed, it is easy to mis-read a chart. It’s also easy to miss the subtle and brilliant design decisions when one doesn’t pay close attention. These early chapters cover all the fundamentals to become a wiser consumer of data graphics.

***

How Charts Lie will open your eyes to how everyone uses visuals to push agendas. The book is an offshoot of a lecture tour Cairo has taken over the last year or so, which has drawn large crowds. He collected plenty of examples of politicians and others playing fast and loose with their visual designs. After reading this book, you can’t look at charts with a straight face!

***

In the second half of his book, Cairo moves beyond purely visual matters into analytical substance. In particular, I like the example on movie box office from Chapter 4, titled “How Charts Lie by Displaying Insufficient Data”. Visual analytics of box office receipts seems to be a perennial favorite of job-seekers in data-related fields.

The movie data is a great demonstration of why one needs to statistically adjust data. Cairo explains why Marvel’s Black Panther is not the third highest-grossing film of all time in the U.S., as reported in the media. That is because gross receipts should be inflation-adjusted. A ticket that costs $15 today cost $5 some time ago.

This discussion features a nice-looking graphic, a staircase chart showing how long each #1 movie stayed in the top position before being replaced by the next higher-grossing film.

Cairo_howchartslie_movies

Cairo’s discussion goes further, exploring the number of theaters as a “lurking” variable. For example, Jaws opened in about 400 theaters while Star Wars: The Force Awakens debuted in 10 times as many. A chart showing per-screen inflation-adjusted gross receipts looks very different from the original chart shown above.
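To make the adjustment concrete, here is a quick sketch in Python. The grosses and ticket prices below are ballpark figures I am assuming for illustration; they are not numbers from the book.

```python
def adjusted_per_screen(gross, ticket_price_then, ticket_price_now, screens):
    """Inflate gross to today's dollars, then divide by the opening screen count."""
    inflated = gross * (ticket_price_now / ticket_price_then)
    return inflated / screens

# Ballpark inputs: Jaws opened in about 400 theaters;
# The Force Awakens in roughly 10 times as many.
jaws = adjusted_per_screen(gross=260e6, ticket_price_then=2.05,
                           ticket_price_now=9.00, screens=400)
tfa = adjusted_per_screen(gross=936e6, ticket_price_then=8.43,
                          ticket_price_now=9.00, screens=4000)
print(jaws > tfa)  # per-screen and inflation-adjusted, the older film leads
```

The point is not the exact figures but the reversal: once you control for both inflation and screen count, the ranking of old and new blockbusters can flip.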

***

Another highlight is Cairo’s analysis of the “cone of uncertainty” chart frequently referenced in anticipation of impending hurricanes in Florida.

Cairo_howchartslie_hurricanes

Cairo and his colleagues have found that “nearly everybody who sees this map reads it wrongly.” The casual reader interprets the “cone” as a sphere of influence, showing which parts of the country will suffer damage from the impending hurricane. In other words, every part of the shaded cone will be impacted to a larger or smaller extent.

That isn’t the designer’s intention! The cone embodies uncertainty, showing which parts of the country have what chance of being hit by the impending hurricane. In the aftermath, the hurricane would have traced one specific path, and that path would have run through the cone if the predictive models were accurate. Most of the shaded cone would have escaped damage.

Even experienced data analysts are likely to mis-read this chart: as Cairo explains, the cone has a “confidence level” of 68%, not the more conventional 95%. Areas outside the cone still have a chance of being hit.
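For intuition on what the 68% figure implies: if forecast-track errors were roughly normal (a simplification of the real forecast models, assumed here just for illustration), a 95% cone would be nearly twice as wide as the 68% cone actually drawn.

```python
from statistics import NormalDist

# Half-width of a central interval, in standard deviations, at each coverage level.
z = NormalDist().inv_cdf
half_width_68 = z(0.5 + 0.68 / 2)  # about 1.0 sd
half_width_95 = z(0.5 + 0.95 / 2)  # about 1.96 sd
print(round(half_width_95 / half_width_68, 2))  # nearly double the width
```

So the cone shown on TV is considerably narrower than the interval most statistically trained readers would assume.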

This map clinches the case for why you need to learn how to read charts. And Alberto Cairo, who is a master visual designer himself, is a sure-handed guide for the start of this rewarding journey.

***

Here is Alberto introducing his book.


The French take back cinema, but can you see it?

I like independent cinema, and here are three French films that come to mind as I write this post: Delicatessen, The Class (Entre les murs), and 8 Women (8 femmes). 

The French people are taking back cinema. Even though they purchased more tickets to U.S. movies than to French movies, the gap has been narrowing over the last two decades. How do I know? It's the subject of this infographic.

DataCinema

How do I know? That's not easy to say, given how complicated this infographic is. Here is a zoomed-in view of the top of the chart:

Datacinema_top

 

You've got the slice of orange, which doubles as the imagery of a film roll. The chart uses five legend items to explain the two layers of data. The solid donut chart presents the mix of ticket sales by country of origin, comparing U.S. movies, French movies, and "others". Then, there are two thin arcs showing the mix of movies by country of origin. 

The donut chart has an unusual feature. Typically, the data are coded in the angles at the donut's center. Here, the data are coded twice: once in the angles, and again in the width of the ring. This is a self-defeating feature because it draws even more attention to the areas of the donut slices, and those areas are highly distorted. If the ratios of the areas are accurate when all three pieces have the same width, then varying those widths shifts the ratios away from the correct ones!
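To see why varying the ring width breaks the encoding, note that the area of a donut slice is half the angle (in radians) times the difference of the squared radii. A quick sketch, with made-up angles and radii:

```python
import math

def ring_slice_area(angle_deg, inner_r, outer_r):
    """Area of a donut slice spanning angle_deg degrees."""
    theta = math.radians(angle_deg)
    return 0.5 * theta * (outer_r**2 - inner_r**2)

# Two slices encoding a 2:1 ratio in their angles (240 vs 120 degrees).
# Equal ring widths: the areas preserve the 2:1 ratio.
a = ring_slice_area(240, inner_r=1.0, outer_r=2.0)
b = ring_slice_area(120, inner_r=1.0, outer_r=2.0)
print(a / b)  # 2.0

# Re-encode the same data in the widths too, and the area ratio inflates.
a2 = ring_slice_area(240, inner_r=1.0, outer_r=3.0)  # wider ring for the bigger value
b2 = ring_slice_area(120, inner_r=1.0, outer_r=2.0)
print(a2 / b2)  # well above 2
```

In other words, the double encoding exaggerates the very comparison it is supposed to convey.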

The best thing about this chart is found in the little blue star, which adds context to the statistics. The 61% number is unusually high, which demands an explanation. The designer tells us it's due to the popularity of The Lion King.

***

The donut shown above is for the year 1994. The infographic actually shows an entire time series from 1994 to 2014.

The design is most unusual. The years 1994, 1999, 2004, 2009, 2014 receive special attention. The in-between years are split into two pairs, shrunk, and placed alternately to the right and left of the highlighted years. So your eyes are asked to zig-zag down the page in order to understand the trend. 

To see the change in U.S. movie ticket sales over time, you have to estimate the sizes of the red-orange slices from one donut to the next.

Here is an alternative visual design that brings out the two messages in this data: that French movie-goers are increasingly preferring French movies, and that U.S. movies no longer account for the majority of ticket sales.

Redo_junkcharts_frenchmovies

A long-term linear trend exists for both U.S. and French ticket sales. The "outlier" values are highlighted and explained by the blockbuster that drove them.

 

P.S.

1. You can register for the free seminar in Lyon here. To register for live streaming, go here.
2. Thanks Carla Paquet at JMP for help translating from French.


An enjoyable romp through the movies

Chris P. tipped me about this wonderful webpage containing an analysis of high-grossing movies. The direct link is here.

First, a Trifecta checkup: This thoughtful web project integrates beautifully rendered, clearly articulated graphics with the commendable objective of bringing data to the conversation about gender and race issues in Hollywood, an ambitious goal that it falls short of achieving because the data only marginally address the question at hand.

There is some intriguing just-beneath-the-surface interplay between the Q (question) and D (data) corners of the Trifecta, which I will get to in the lower half of this post. But first, let me talk about the Visual aspect of the project, which for the most part, I thought, was well executed.

The leading chart is simple and clear, setting the tone for the piece:

Polygraphfilm_bars

I like the use of color here. The colored chart titles are inspired. I also like the double color coding - notice that the proportion data are coded not just in the lengths of the bar segments but also in the opacity. There is some messiness in the right-hand-side labeling of the first chart, but it's probably just a bug.

This next chart also contains a minor delight: upon scrolling to the following dot plot, the reader finds that one of the dots has been labeled; this is a signal to readers that they can click on the dots to reveal the "tooltips". It's a little thing but it makes a world of difference.

Polygraphfilm_dotplotwithlabel

I also enjoy the following re-imagination of those proportional bar charts from above:

Polygraphfilm_tinmen_bars

This form fits well with the underlying data structure (a good example of setting the V and the D in harmony). The chart shows the proportion of words spoken by male versus female actors over the course of a single movie (Tin Men from 1987 is the example shown here). The chart is centered in an unusual way, making it easy to read exactly when the females are allowed to have their say.

There is again a possible labeling hiccup. The middle label says 40th minute, which would imply the entire movie is only 80 minutes long. (A quick check shows Tin Men is 110 minutes long.) It seems the authors are only concerned with dialogue, ignoring all moments of soundtrack or silence. The visualization would be even more interesting if those non-dialogue moments were presented.

***

The reason the music and silence are missing has more to do with practicality than will. The raw material (Data) is movie scripts. The authors, much to their merit, acknowledge many of the problems that come with this data, starting with the fact that directors make edits to the scripts. It is also not clear how to locate each line along the duration of the movie. An assumption about the speed of dialogue seems to be required.
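Here is a sketch of what that timing assumption might look like in practice. The speaking rate and the sample lines are invented for illustration; the project does not describe its actual method.

```python
WORDS_PER_MINUTE = 150  # assumed constant speaking rate (a guess, not from the project)

def estimate_timestamps(lines):
    """Assign each (speaker, text) pair an estimated start time in minutes."""
    t, out = 0.0, []
    for speaker, text in lines:
        out.append((round(t, 2), speaker))
        t += len(text.split()) / WORDS_PER_MINUTE
    return out

# Invented sample lines, just to show the mechanics:
script = [("DANNY", "Good morning, nice day for aluminum siding."),
          ("TILLEY", "If you say so.")]
timestamps = estimate_timestamps(script)
print(timestamps)
```

Any such constant-rate assumption stretches or compresses scenes with unusual pacing, which is one more reason the minute labels should be read loosely.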

I have now moved to the Q corner of the Trifecta checkup. The article is motivated by the #OscarsSoWhite controversy from a year or two ago, although by the second paragraph, the race angle has already been dropped in favor of gender; by the end of the project, readers will have learned about ageism as well, but the issue of race never returns. Race didn't come back because it is not easily discerned from a movie script, nor is it clearly labeled in a resource such as IMDB. So the designers provided a better solution to a lesser problem, instead of a lesser solution to a better problem.

In the last part of the project, the authors tackle ageism. Here we find another pretty picture:

Polygraphfilm_ageanalysis

At the high level, the histograms tell us that movie producers prefer younger actresses (in their 20s) and middle-aged actors (forties and fifties). It is certainly not my experience that movies have a surplus of older male characters. But one must be very careful interpreting this analysis.

The importance of actors and actresses is being measured by the number of words in the scripts while the ages being analyzed are the real ages of the actors and actresses, not the ages of the characters they are playing.

Tom Cruise is still making action movies, and he's playing characters much younger than he is. A more direct question to ask here is: does Hollywood prefer to put younger rather than older characters on screen?

Since the raw data are movie scripts, the authors took the character names, and translated those to real actors and actresses via IMDB, and then obtained their ages as listed on IMDB. This is the standard "scrape-and-merge" method executed by newsrooms everywhere in the name of data journalism. It often creates data that are only marginally relevant to the problem.



February talks, and exploratory data analysis using visuals

News:

In February, I am bringing my dataviz lecture to various cities: Atlanta (Feb 7), Austin (Feb 15), and Copenhagen (Feb 28). Click on the links for free registration.

I hope to meet some of you there.

***

On the sister blog about predictive models and Big Data, I have been discussing aspects of a dataset containing IMDB movie data. Here are previous posts (1, 2, 3).

The latest installment contains the following chart:

Redo_scorebytitleyear_ans

The general idea is that the average rating of films on IMDB declines with release year, from about 7.5 to 6.5... but this does not mean that IMDB users like oldies more than recent movies. The problem is a bias in the IMDB user base. Since IMDB's website launched only in 1990, users are much more likely to be reviewing movies released after 1990 than before. Further, if users review oldies, they likely review oldies that they like and go back to, rather than the horrible movie they watched 15 years ago.
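A toy simulation makes the selection bias vivid. The numbers are entirely made up: both eras share the same true quality distribution, but pre-1990 movies only get rated when someone liked them enough to seek them out.

```python
import random

random.seed(1)

TRUE_MEAN, TRUE_SD = 6.5, 1.5  # same quality distribution in both eras

def observed_average(era, n=100_000):
    """Average rating among movies that actually get reviewed."""
    ratings = []
    for _ in range(n):
        quality = random.gauss(TRUE_MEAN, TRUE_SD)
        # Disliked oldies rarely get revisited and rated at all.
        if era == "oldie" and quality < 7:
            continue
        ratings.append(quality)
    return sum(ratings) / len(ratings)

recent = observed_average("recent")
oldie = observed_average("oldie")
print(round(recent, 1), round(oldie, 1))  # the oldie average is inflated
```

Even with identical underlying quality, the observed oldie average floats well above the truth, which is exactly the pattern in the chart.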

Modelers should be exploring and investigating their datasets before building their models. Same thing for anyone doing data visualization! You need to understand the origin of the data, and its biases in order to tell the proper story.

Click here to read the full post.



Batmen not as interesting as they seem

When this post appears, I will be on my way to Seattle. Maybe I will meet some of you there. You can still register here.

I held onto this tip from a reader for a while. I think it came from Twitter:

20160326_woc432_1 batman

The Economist found a fun topic but what's up with the axis not starting at zero?

The height x weight gimmick seems cool but, on second thought, weight is not the same as girth, so it doesn't make much sense!

In the re-design, I use bubbles to indicate weight and vertical location to indicate height. The data aren't as interesting as one might think. All the actors pretty much stayed true to the comic-book ideal, with Adam West being the closest. I also changed the order of the actors.

Redo_batman

I left out the Lego, as it creates a design challenge that does not justify the effort.



Three axes or none

Catching up on some older submissions. Reader Nicholas S. saw this mind-boggling chart about Chris Nolan movies when Interstellar came out:

Vulture_chris_nolan_by_numbers

This chart was part of an article by Vulture (link).

It may be the first time I have seen not one, not two, but three different scales on the same chart.

First we have Rotten Tomatoes score for each movie in proportions:

Vulture_chrisnolan_score

The designer chopped off 49% of each column. So the heights of the columns are not proportional to the data.
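Here is a quick illustration of how chopping the bottom off each column distorts the visual comparison. The two scores are hypothetical, in the range of Rotten Tomatoes percentages:

```python
def apparent_ratio(a, b, chop):
    """Ratio of drawn column heights after removing `chop` points from each value."""
    return (a - chop) / (b - chop)

true_ratio = apparent_ratio(94, 71, chop=0)    # ~1.32
drawn_ratio = apparent_ratio(94, 71, chop=49)  # ~2.05
print(true_ratio, drawn_ratio)
```

A modest difference between two movies is drawn as one column nearly twice the height of the other.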

Next we see the running time of movies in minutes (dark blue columns):

Vulture_chrisnolan_runtime

For this series, the designer hid 40 minutes' worth of each movie below the axis. So again, the heights of the columns do not convey the relative lengths of the movies.

Thirdly, we have light blue columns representing box office receipts:

  Vulture_chrisnolan_boxoffice

Or maybe not. I can't figure out what scale is used here. The same-size chunks shown above display $45,000 in one case, and $87 million in another!

So the designer kneaded together three flawed axes. Or perhaps the designer just banished the idea of an axis. But this experiment floundered.

***

Here is the data in three separate line charts:

Redo_chrisnolanfilms

***

In a Trifecta Checkup (link), the Vulture chart falls into Type DV. The question might be the relationship between running time and box office, and between Rotten Tomatoes Score and box office. These are very difficult to answer.

The box office number here refers to the lifetime gross ticket receipts from theaters. The movie industry insists on publishing these unadjusted numbers, which are completely useless. At the minimum, these numbers should be adjusted for inflation (ticket prices) and for population growth, if we are to use them to measure commercial success.
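One metric that can be compared across eras is tickets sold per capita, which folds in both adjustments at once. A minimal sketch; the prices and populations below are placeholders, not real series:

```python
def tickets_per_capita(gross, avg_ticket_price, population):
    """Estimated admissions per person, comparable across eras."""
    return (gross / avg_ticket_price) / population

# Placeholder figures for a 1970s release vs. a 2010s release:
old = tickets_per_capita(gross=200e6, avg_ticket_price=2.0, population=215e6)
new = tickets_per_capita(gross=700e6, avg_ticket_price=9.0, population=330e6)
print(old > new)  # the older film reached a larger share of the population
```

With these invented inputs, the smaller nominal gross corresponds to a larger share of the population buying tickets, which is the opposite of what the raw dollars suggest.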

The box office number is also suspect because it ignores streaming, digital, syndication, and other forms of revenues. This is a problem because we are comparing movies across time.

You might have noticed that both running time and box office numbers have gone up over time. (That is to say, running time and box office numbers are highly correlated.) Do you think that is because moviegoers are motivated to see longer films, or because movies are just getting longer?


PS. [12/15/2014] I will have a related discussion on the statistics behind this data on my sister blog. Link will be active Monday afternoon.


The need to think about what you're seeing: an incomplete geography lesson

If your chart is titled "The Most Popular TV Show Set in Every State," what would you expect the data to look like?

You'd think the list would be dominated by hit shows like The Walking Dead and Downton Abbey, and you might guess that there are probably only four or five unique shows on the list.

But then it's easy to miss the word "set" in the title. They are looking for the most popular show given that it is set in a particular state. Now this is a completely different question -- and by construction, it guarantees that there will be 50 different shows for the 50 states, assuming that one show can't be set in multiple states. It is also, computationally, a much more complex question. Some states, like New York, Massachusetts (Boston), and Illinois (Chicago), are many times more likely to be the setting of a TV show than others. This means one might need to go back many years to find the "popular" shows in the less attention-grabbing states.
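The computational difference is the difference between one global ranking and a maximum within each group. A toy version, with invented popularity scores:

```python
# Toy data: (show, state, popularity score) -- the scores are invented.
shows = [("The Walking Dead", "GA", 95), ("Breaking Bad", "NM", 99),
         ("Fargo", "MN", 60), ("Portlandia", "OR", 40)]

# Overall popularity: a single sort, dominated by the big hits.
overall = sorted(shows, key=lambda s: -s[2])

# Most popular show *given* the state: a maximum within each group,
# which guarantees one (possibly obscure) show per state.
best_by_state = {}
for name, state, score in shows:
    if state not in best_by_state or score > best_by_state[state][1]:
        best_by_state[state] = (name, score)
print(best_by_state)
```

The conditional version forces a winner out of every state, no matter how thin the field, which is exactly why low-popularity shows end up on the map.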

I put quotation marks around the word "popular" because if one has to dig deep into history for a specific state, it is possible that the selected show was not popular in the aggregate! This is not unlike the question of whether having your kids pick up a popular sport (like basketball) or instrument (like violin) is better or worse than an unpopular one (like squash or trombone). The latter route is potentially the shorter path to standing out, but the achievement will be known only to a niche audience.

***

This brings me to how one should look at a map like this one in Business Insider (link):

  Bi_map_populartvshowset

The first thing that strikes you is the colors, colors that signify nothing. Since each state has its own TV show, by definition each piece of information is unique. As far as I can tell, the choice of which states share the same color is entirely up to the designer.

As I have remarked in the past, too often the designer uses the map as a lesson in geography. The only information presented to readers through the map form is where each state is located in the union. Without the state names, even this lesson is incomplete. We learn nothing about the relative popularity of these shows, their longevity, the years in which they went on air, etc.

Geographical data should not automatically be placed on a map.

***

Is there any "data" in this map? It depends on how you see it. Here's how the author described pairing each state with a TV show:

To qualify, we looked at television series as opposed to reality shows.* Selections were based on each show’s longevity, audience and critical acclaim using info from IMDB/Metacritic, awards, and lasting impact on American culture and television... *When there wasn't a famous enough series to choose from, we selected a more popular reality show. That happens once on this list (IA).

 


When a chart does nothing for the story

Pixardeclineexcel

There is some banter on Twitter about a chart that appeared in The Atlantic on "Pixar's Sad Decline--in One Chart". (@thewhyaxis, @jschwabish, @tealtan).

Link to article

***

It's a bit horrible but not the worst chart ever.

The most offensive aspect is the linear regression line. It's clearly an inappropriate model for this dataset.

I also don't like charts that include impossible values on the axis; in this case, the Rotten Tomatoes score never goes above 100%.

If the chart is turned on its side, the movie titles can be read horizontally.

Redo_pixar

***
I am compelled by the story but the chart doesn't help at all. Of course, it would be better if they could find data on the profitability of each movie. Readers should ask how correlated the Rotten Tomatoes score is with box office, and also what the relative costs of producing these different movies are. Jon has the score-against-profit chart (link).

 


Nielsen's cross-platform crossing diagram crosses up readers

My friend Augustine F., who's a data-savvy guy, couldn't figure out what's going on with this chart in Nielsen's cross-platform report.

Nielsen_streamingtv

It's a case of a Bumps chart done poorly.

Readers must first study the beginning pages of the report to find their bearings. The two charts are supposed to investigate the correlation between streaming video and regular TV viewing. What causes the confusion is that the populations being analyzed differ between the two charts.

In the left chart, they exclude anyone who does not watch streaming video (35% of the sample), and then divide those who do into five equal-sized segments based on how much they watch. Then, they look at how much regular TV each segment watches on average.

In the right chart, they exclude anyone who does not watch regular TV (just 0.5% of the sample), and then divide those who do into five equal-sized segments based on how much they watch. Then, they look at how much online streaming video each segment watches on average.

***

What crosses us up is the relative scales. The scale for regular TV viewing is tightly clustered between 212 and 247 daily minutes in the left chart but spans a wide range, from 24 to 522, in the right chart. The impression given by the designer is that the same population (18-34 year olds) is divided into five groups (quintiles) for each chart, albeit using different criteria. It just doesn't make sense that the group averages do not match.

The reason for this mismatch is the hugely divergent rates of exclusion as described above. What the chart seems to be saying is that the 65% who use streaming video have very similar TV viewing behavior (about 220 daily minutes). In other words, we surmise that most of those people on the left chart map to groups 2 and 3 on the right chart.

Who are the people in groups 1, 4 and 5 on the right chart? It appears that they are the 35% who don't watch streaming video. Thus, the real insight of this chart is that there are two types of people who don't watch streaming video: those who watch very little regular TV at all, and those who watch twice the average amount of regular TV.

***

Here's another puzzle: Nielsen claims that high streaming = low TV and low streaming = high TV. Is it really true that high streaming = low TV? Take the segment of highest streaming (#1 on the left chart). This group, which is 13% of the survey population, accounts for 83% of the streaming minutes -- almost 71,000 out of 86,000 minutes. Now look at the right chart. It turns out that the streaming minutes are quite evenly distributed among those TV-based quintiles, ranging from 15,000 minutes to 23,000 minutes each.

So, it is impossible to fit all of the top streaming quintile into any one TV quintile - they have too many streaming minutes. In fact, the top streaming quintile must be quite spread out among the TV quintiles since each of the TV quintiles is 1.5 times the size of a streaming quintile!
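The arithmetic can be checked directly with the figures quoted above (the 23,000 figure is the top of the range reported in the right chart):

```python
top_streaming_quintile_minutes = 71_000   # from the left chart
largest_tv_quintile_minutes = 23_000      # max of the right chart's range

# The top streamers alone hold more streaming minutes than any single TV
# quintile displays, so they cannot all sit inside one TV quintile.
print(top_streaming_quintile_minutes > largest_tv_quintile_minutes)  # True

# "1.5 times the size": streaming quintiles exclude the 35% non-streamers,
# while TV quintiles exclude only 0.5%.
streaming_quintile_share = (1 - 0.35) / 5   # 13% of the survey population
tv_quintile_share = (1 - 0.005) / 5         # ~19.9%
ratio = tv_quintile_share / streaming_quintile_share
print(round(ratio, 2))  # ~1.53
```

Both checks confirm the conclusion: the heaviest streamers must be scattered across the TV quintiles rather than concentrated in one.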

So, we must conclude that customers who stream a lot include both fervent TV fans as well as those who watch little TV.

***

In a return-on-effort analysis, this is a high-effort, low-reward chart.