Transforming the data to fit the message

A short time ago, there were reports that some theme-park goers were not happy about the latest price hike by Disney. One of these report, from the Washington Post (link), showed a chart that was intended to convey how much Disney park prices have outpaced inflation. Here is the chart:

Wapo_magickingdom_price_changes

I had a lot of trouble processing this chart. The two lines are labeled "original price" and "in 2014 dollars". The lines show a gap back in the 1970s, which completely closes up by 2014. This gives the reader an impression that the problem has melted away - which is the opposite of the designer intended.

The economic concept being marshalled here is the time value of money, or inflation. The idea is that $3.50 in 1971 is equivalent to a much higher ticket price in "2014 dollars" because by virtue of inflation, putting that $3.50 in the bank in 1971 and holding till 2014 would make that sum "nominally" higher. In fact, according to the chart, the $3.50 would have become $20.46, an approx. 7-fold increase.

The gap thus represents the inflation factor. The gap melting away is a result of passing of time. The closer one is to the present, the less the effect of cumulative inflation. The story being visualized is that Disney prices are increasing quickly whether or not one takes inflation into account. Further, if inflation were to be considered, the rate of increase is lower (red line).

What about the alternative story - Disney's price increases are often much higher than inflation? We can take the nominal price increase, and divide it into two parts, one due to inflation (of the prior-period price), and the other in excess of inflation, which we will interpret as a Disney premium.

The following chart then illustrates this point of view:

Redo_disneypricehikes

Most increases are small, and stay close to the inflation rate. But once in a while, and especially in 2010s, the price increases have outpaced inflation by a lot.

Note: since 2013, Disney has introduced price tiers, starting with two and currently at four levels. In the above chart, I took the average of the available prices, making the assumption that all four levels are equally popular. The last number looks like a price decrease because there is a new tier called "Low". The data came from AllEars.net.


Made in France stereotypes

France is on my mind lately, as I prepare to bring my dataviz seminar to Lyon in a couple of weeks.  (You can still register for the free seminar here.)

The following Made in France poster brings out all the stereotypes of the French.

Made_in_france_small

(You can download the original PDF here.)

It's a sankey diagram with so many flows that it screams "it's complicated!" This is an example of a graphic for want of a story. In a Trifecta Checkup, it's failing in the Q(uestion) corner.

It's also failing in the D(ata) corner. Take a look at the top of the chart.

Madeinfrance_totalexports

France exported $572 billion worth of goods. The diagram then plots eight categories of exports, ranging from wines to cheeses:

Madeinfrance_exportcategories

Wine exports totaled $9 billion which is about 1.6% of total exports. That's the largest category of the eight shown on the page. Clearly the vast majority of exports are excluded from the sankey diagram.

Are the 8 the largest categories of exports for France? According to this site, those are (1) machinery (2) aircraft (3) vehicles (4) electrical machinery (5) pharmaceuticals (6) plastics (7) beverages, spirits, vinegar (8) perfumes, cosmetics.

Compare: (1) wines (2) jewellery (3) perfume (4) clothing (5) cheese (6) baked goods (7) chocolate (8) paintings.

It's stereotype central. Name 8 things associated with the French brand and cherry-pick those.

Within each category, the diagram does not show all of the exports either. It discloses that the bars for wines show only $7 of the $9 billion worth of wines exported. This is because the data only capture the "Top 10 Importers." (See below for why the designer did this... France exports wine to more than 180 countries.)

Finally, look at the parade of key importers of French products, as shown at the bottom of the sankey:

Madeinfrance_topimporters

The problem with interpreting this list of countries is best felt by attempting to describe which countries ended up on this list! It's the list of countries that belong to the top 10 importers of one or more of the eight chosen products, ordered by the total value of imports in those 8 categories only but only including the value in any category if it rises to the top 10 of the respective category.

In short, with all those qualifications, the size or rank of the black bars does not convey any useful information.

***

One feature of the chart that surprised me was no flows in the Wine category from France to Italy or Spain. (Based on the above discussion, you should realize that no flows does not mean no exports.) So I went to the Comtrade database that is referenced in the poster, and pulled out all the wine export data.

How does one visualize where French wines are going? After fiddling around the numbers, I came up with the following diagram:

Redo_jc_frenchwineexports

I like this type of block diagram which brings out the structure of the dataset. The key features are:

  • The total wine exports to the rest of the world was $1.4 billion in 2016
  • Half of it went to five European neighbors, the other half to the rest of the world
  • On the left half, Germany took a third of those exports; the UK and Switzerland together is another third; and the final third went to Belgium and the Netherlands
  • On the right half, the countries in the blue zone accounted for three-fifths with the unspecified countries taking two-fifths.
  • As indicated, the two-fifths (in gray) represent 20% of total wine exports, and were spread out among over 180 countries.
  • The three-fifths of the blue zone were split in half, with the first half going to North America (about 2/3 to USA and 1/3 to Canada) and the second half going to Asia (2/3 to China and 1/3 to Japan)
  • As the title indicates, the top 9 importers of French wine covered 80% of the total volume (in litres) while the other 180+ countries took 20% of the volume

 The most time-consuming part of this exercise was finding the appropriate structure which can be easily explained in a visual manner.

 

 


Education deserts: places without schools still serve pies and story time

I very much enjoyed reading The Chronicle's article on "education deserts" in the U.S., defined as places where there are no public colleges within reach of potential students.

In particular, the data visualization deployed to illustrate the story is superb. For example, this map shows 1,500 colleges and their "catchment areas" defined as places within 60 minutes' drive.

Screenshot-2018-8-22 Who Lives in Education Deserts More People Than You Might Think 2

It does a great job walking through the logic of the analysis (even if the logic may not totally convince - more below). The areas not within reach of these 1,500 colleges are labeled "deserts". They then take Census data and look at the adult population in those deserts:

Screenshot-2018-8-22 Who Lives in Education Deserts More People Than You Might Think 4

This leads to an analysis of the racial composition of the people living in these "deserts". We now arrive at the only chart in the sequence that disappoints. It is a pair of pie charts:

Chronicle_edudesserts_pie

 The color scheme makes it hard to pair up the pie slices. The focus of the chart should be on the over or under representation of races in education deserts relative to the U.S. average. The challenge of this dataset is the coexistence of one large number, and many small numbers.

Here is one solution:

Redo_jc_chronedudesserts

***

The Chronicle made a commendable effort to describe this social issue. But the analysis has a lot of built-in assumptions. Readers should look at the following list and see if you agree with the assumptions:

  • Only public colleges are considered. This restriction requires the assumption that the private colleges pretty much serve the same areas as public colleges.
  • Only non-competitive colleges are included. Precisely, the acceptance rate must be higher than 30 percent. The underlying assumption is that the "local students" won't be interested in selective colleges. It's not clear how the 30 percent threshold was decided.
  • Colleges that are more than 60 minutes' driving distance away are considered unreachable. So the assumption is that "local students" are unwilling to drive more than 60 minutes to attend college. This raises a couple other questions: are we only looking at commuter colleges with no dormitories? Is the 60 minutes driving distance based on actual roads and traffic speeds, or some kind of simple model with stylized geometries and fixed speeds?
  • The demographic analysis is based on all adults living in the Census "blocks" that are not within 60 minutes' drive of one of those colleges. But if we are calling them "education deserts" focusing on the availability of colleges, why consider all adults, and not just adults in the college age group? One further hidden assumption here is that the lack of colleges in those regions has not caused young generations to move to areas closer to colleges. I think a map of the age distribution in the "education deserts" will be quite telling.
  • Not surprisingly, the areas classified as "education deserts" lag the rest of the nation on several key socio-economic metrics, like median income, and proportion living under the poverty line. This means those same areas could be labeled income deserts, or job deserts.

At the end of the piece, the author creates a "story time" moment. Story time is when you are served a bunch of data or analyses, and then when you are about to doze off, the analyst calls story time, and starts making conclusions that stray from the data just served!

Story time starts with the following sentence: "What would it take to make sure that distance doesn’t prevent students from obtaining a college degree? "

The analysis provided has nowhere shown that distance has prevented students from obtaining a college degree. We haven't seen anything that says that people living in the "education deserts" have fewer college degrees. We don't know that distance is the reason why people in those areas don't go to college (if true) - what about poverty? We don't know if 60 minutes is the hurdle that causes people not to go to college (if true).We know the number of adults living in those neighborhoods but not the number of potential students.

The data only showed two things: 1) which areas of the country are not within 60 minutes' driving of the subset of public colleges under consideration, 2) the number of adults living in those Census blocks.

***

So we have a case where the analysis is incomplete but the visualization of the analysis is superb. So in our Trifecta analysis, this chart poses a nice question and has nice graphics but the use of data can be improved. (Type QV)

 

 

 


Visualizing the Thai cave rescue operation

The Thai cave rescue was a great story with a happy ending. It's also one that lends itself to visualization. A good visualization can explain the rescue operation more efficiently than mere words.

A good visual should bring out the most salient features of the story, such as:

  • Why the operation was so daunting?
  • What were the tactics used to overcome those challenges?
  • How long did it take?
  • What were the specific local challenges that must be overcome?
  • Were there any surprises?

In terms of what made the rescue challenging, some of the following are pertinent:

  • How far in they were?
  • How deep were they trapped?
  • How much of the caves were flooded? Why couldn't they come out by themselves?
  • How much headroom was there in different sections of the cave "tunnel"?

There were many attempts at visualizing the Thai cave rescue operation. The best ones I saw were: BBC (here, here), The New York Times (here), South China Morning Post (here) and Straits Times (here). It turns out each of these efforts focuses on some of the aspects above, and you have to look at all of them to get the full picture.

***

BBC's coverage began with a top-down view of the route of the rescue, which seems to be the most popular view adopted by news organizations. This is easily understood because of the standard map aesthetic.

Bbc_102494059_caves_976

The BBC map is missing a smaller map of Thailand to place this in a geographical context.

While this map provides basic information, it doesn't address many of the elements that make the Thai cave rescue story compelling. In particular, human beings are missing from this visualization. The focus is on the actions ("diving", "standing"). This perspective also does not address the water level, the key underlying environmental factor.

***

Another popular perspective is the sideway cross-section. The Straits Times has one:

Straittimes_thai rescue_part

The excerpt of the infographic presents a nice collection of data that show the effort of the rescue. The sideway cross-sectional section shows the distance and the up-and-down nature of the journey, the level of flooding along the route, plus a bit about the headroom available at different points. Most of these diagrams bring out the "horizontal" distance but somehow ignore the "vertical" distance. One possibility is that the real trajectory is curvy - but if we can straighten out the horizontal, we should be able to straighten out the vertical too.

The NYT article gives a more detailed view of the same perspective, with annotations that describe key moments along the rescue route.

Nyt_detailed_thairescueroute

If, like me, you like to place humans into this picture, then you have to go back to the Straits Times, where they have an expanded version of the sideway cross-section.

  Straitstimes_riskyroute_thairescue

This is probably my most favorite single visualization of the rescue operation.

There are better cartoons of the specific diving actions, though. For example, the BBC has this visual that shows the particularly narrow part of the route, corresponding to the circular inset in the Straits Times version above.

Bbc_thairescue_tightspace

The drama!

NYT also has a set of cartoons. Here's one:

Nyt_thairescue_divers

***

There is one perspective that curiously has been underserved in all of the visualizations - this is the first-person perspective. Imagine the rescuer (or the kids) navigating the rescue route. It's a cross-section from the front, not from the side.

Various publications try to address this by augmenting the top-down route view with sporadic cross-sectional diagrams. Recall the first map we showed from the BBC. On the right column are little annotations of this type (here):

Bbc_thaicaverescue_crosssection

I picked out this part of the map because it shows that the little human figure serves two potentially conflicting purposes. In the bottom diagram, the figurine shows that there is limited headroom in this part of the cave, plus the actual position of the figurine on the ledge conveys information about where the kids were. However, on the top cross-section, the location of the figure conveys no information; the only purpose of the human figure is to show how tall the cave is at that site.

The South China Morning Post (here - site appears to be down when I wrote this) has this wonderful animation of how the shape of the headroom changed as they navigated the route. Please visit their page to see the full animation. Here are two screenshots:

Scmp_caveshape_1

Scmp_caveshape_2

This little clip adds a lot to the story! It'd be even better if the horizontal timeline at the bottom is replaced by the top-down route map.

Thank you all the various dataviz teams for these great efforts.

 

 

 


Several problems with stacked bar charts, as demonstrated by a Delta chart designer

In the Trifecta Checkup (link), I like to see the Question and the Visual work well together. Sometimes, you have a nice message but you just pick the wrong Visual.

An example is the following stacked column chart, used in an investor presentation by Delta.

Delta_aircraft

From what I can tell, the five types of aircraft are divided into RJ (regional jet) and others (perhaps, larger jets). With each of those types, there are two or three subtypes. The primary message here is the reduction in the RJ fleet and the expansion of Small/Medium/Large.

One problem with a stacked column chart with five types is that it takes too much effort to understand the trends of the middle types.

The two types on the edges are not immune to confusion either. As shown below, both the dark blue (Large) type and the dark red (50-seat RJ) type are associated with downward sloping lines except that the former type is growing rapidly while the latter is vanishing from the mix!

Redo_delta_aircraft

 In this case, the slopegraph (Bumps-type chart) can overcome some of the limitations.

Redo_deltaaircraft_2

***

This example was used in my new dataviz workshop, launched in St. Louis yesterday. Thank you to the participants for making it a lively session!


A gem among the snowpack of Olympics data journalism

It's not often I come across a piece of data journalism that pleases me so much. Here it is, the "Happy 700" article by Washington Post is amazing.

Wpost_happy700_map2

 

When data journalism and dataviz are done right, the designers have made good decisions. Here are some of the key elements that make this article work:

(1) Unique

The topic is timely but timeliness heightens both the demand and supply of articles, which means only the unique and relevant pieces get the readers' attention.

(2) Fun

The tone is light-hearted. It's a fun read. A little bit informative - when they describe the towns that few have heard of. The notion is slightly silly but the reader won't care.

(3) Data

It's always a challenge to make data come alive, and these authors succeeded. Most of the data work involves finding, collecting and processing the data. There isn't any sophisticated analysis. But a powerful demonstration that complex analysis is not always necessary.

(4) Organization

The structure of the data is three criteria (elevation, population, and terrain) by cities. A typical way of showing such data might be an annotated table, or a Bumps-type chart, grouped columns, and so on. All these formats try to stuff the entire dataset onto one chart. The designers chose to highlight one variable at a time, cumulatively, on three separate maps. This presentation fits perfectly with the flow of the writing. 

(5) Details

The execution involves some smart choices. I am a big fan of legend/axis labels that are informative, for example, note that the legend doesn't say "Elevation in Meters":

Wpost_happy700_legend

The color scheme across all three maps shows a keen awareness of background/foreground concerns. 


Two nice examples of interactivity

Janie on Twitter pointed me to this South China Morning Post graphic showing off the mighty train line just launched between north China and London (!)

Scmp_chinalondonrail

Scrolling down the page simulates the train ride from origin to destination. Pictures of key regions are shown on the left column, as well as some statistics and other related information.

The interactivity has a clear purpose: facilitating cross-reference between two chart forms.

The graphic contains a little oversight ... The label for the key city of Xian, referenced on the map, is missing from the elevation chart on the left here:

Scmp_chinalondonrail_xian

 ***

I also like the way New York Times handled interactivity to this chart showing the rise in global surface temperature since the 1900s. The accompanying article is here.

Nyt_surfacetemp

When the graph is loaded, the dots get printed from left to right. That's an attention grabber.

Further, when the dots settle, some years sink into the background, leaving the orange dots that show the years without the El Nino effect. The reader can use the toggle under the chart title to view all of the years.

This configuration is unusual. It's more common to show all the data, and allow readers to toggle between subsets of the data. By inverting this convention, it's likely few readers need to hit that toggle. The key message of the story concerns the years without El Nino, and that's where the graphic stands.

This is interactivity that succeeds by not getting in the way. 

 

 

 


Choosing the right metric reveals the story behind the subway mess in NYC

I forgot who sent this chart to me - it may have been a Twitter follower. The person complained that the following chart exaggerated how much trouble the New York mass transit system (MTA) has been facing in 2017, because of the choice of the vertical axis limits.

Streetsblog_mtatraffic

This chart is vintage Excel, using Excel defaults. I find this style ugly and uninviting. But the chart does contain some good analysis. The analyst made two smart moves: the chart controls for month-to-month seasonality by plotting the data for the same month over successive years; and the designation "12 month averages" really means moving averages with a window size of 12 months - this has the effect of smoothing out the short-term fluctuations to reveal the longer-term trend.

The red line is very alarming as it depicts a sustained negative trend over the entire year of 2017, even though the actual decline is a small percentage.

If this chart showed up on a business dashboard, the CEO would have been extremely unhappy. Slow but steady declines are the most difficult trends to deal with because it cannot be explained by one-time impacts. Until the analytics department figures out what the underlying cause is, it's very difficult to curtail, and with each monthly report, the sense of despair grows.

Because the base number of passengers in the New York transit system is so high, using percentages to think about the shift in volume underplays the message. It's better to use actual millions of passengers lost. That's what I did in my version of this chart:

Redo_jc_mtarevdecline

The quantity depicted is the unexpected loss of revenue passengers, measured against a forecast. The forecast I used is the average of the past two years' passenger counts. Above the zero line means out-performing the forecast but of course, in this case, since October 2016, the performance has dipped ever farther below the forecast. By April, 2017, the gap has widened to over 5 million passengers. That's a lot of lost customers and lost revenues, regardless of percent!

The biggest headache is to investigate what is the cause of this decline. Most likely, it is a combination of factors.


Sorting out what's meaningful and what's not

A few weeks ago, the New York Times Upshot team published a set of charts exploring the relationship between school quality, home prices and commute times in different regions of the country. The following is the chart for the New York/New Jersey region. (The article and complete data visualization is here.)

Nyt_goodschoolsaffordablehomes_nyc

This chart is primarily a scatter plot of home prices against school quality, which is represented by average test scores. The designer wants to explore the decision to live in the so-called central city versus the decision to live in the suburbs, hence the centering of the chart about New York City. Further, the colors of the dots represent the average commute times, which are divided into two broad categories (under/over 30 minutes). The dots also have different sizes, which I presume measures the populations of each district (but there is no legend for this).

This data visualization has generated some negative reviews, and so has the underlying analysis. In a related post on the sister blog, I discuss the underlying statistical issues. For this post, I focus on the data visualization.

***

One positive about this chart is the designer has a very focused question in mind - the choice between living in the central city or living in the suburbs. The line scatter has the effect of highlighting this particular question.

Boy, those lines are puzzling.

Each line connects New York City to a specific school district. The slope of the line is, nominally, the trade-off between home price and school quality. The slope is the change in home prices for each unit shift in school quality. But these lines don't really measure that tradeoff because the slopes span too wide a range.

The average person should have a relatively fixed home-price-to-school-quality trade-off. If we could estimate this average trade-off, it should be represented by a single slope (with a small cone of error around it). The wide range of slopes actually undermines this chart, as it demonstrates that there are many other variables that factor into the decision. Other factors are causing the average trade-off coefficient to vary so widely.

***

The line scatter is confusing for a different reason. It reminds readers of a flight route map. For example:

BA_NYC_Flight_Map

The first instinct may be to interpret the locations on the home-price-school-quality plot as geographical. Such misinterpretation is reinforced by the third factor being commute time.

Additionally, on an interactive chart, it is typical to hide the data labels behind mouseovers or clicks. I like the fact that the designer identifies some interesting locales by name without requiring a click. However, one slight oversight is the absence of data labels for NYC. There is nothing to click on to reveal the commute/population/etc. data for central cities.

***

In the sister blog post, I mentioned another difficulty - most of the neighborhoods are situated to the right and below New York City, challenging the notion of a "trade-off" between home price and school quality. It appears as if most people can spend less on housing and also send kids to better schools by moving out of NYC.

In the New York region, commute times may be the stronger factor relative to school quality. Perhaps families chose NYC because they value shorter commute times more than better school quality. Or, perhaps the improvement in school quality is not sufficient to overcome the negative of a much longer commute. The effect of commute times is hard to discern on the scatter plot as it is coded into the colors.

***

A more subtle issue can be seen when comparing San Francisco and Boston regions:

Nyt_goodschoolsaffordablehomes_sfobos

One key insight is that San Francisco homes are on average twice as expensive as Boston homes. Also, the variability of home prices is much higher in San Francisco. By using the same vertical scale on both charts, the designer makes this insight clear.

But what about the horizontal scale? There isn't any explanation of this grade-level scale. It appears that the central cities have close to average grade level in each chart so it seems that each region is individually centered. Otherwise, I'd expect to see more variability in the horizontal dots across regions.

If one scale is fixed across regions, and the other scale is adapted to each region, then we shouldn't compare the slopes across regions. The fact that the lines are generally steeper in the San Francisco chart may be an artifact of the way the scales are treated.

***

Finally, I'd recommend aggregating the data, and not plot individual school districts. The obsession with magnifying little details is a Big Data disease. On a chart like this, users are encouraged to click on individual districts and make inferences. However, as I discussed in the sister blog (link), most of the differences in school quality shown on these charts are not statistically meaningful (whereas the differences on the home-price scale are definitely notable). 

***

If you haven't already, see this related post on my sister blog for a discussion of the data analysis.