« September 2015 | Main | November 2015 »

More chart drama, and data aggregation

Robert Kosara posted a response to my previous post.

He raises an important issue in data visualization - the need to aggregate data, and not plot raw data. I have no objection to that point.

What was shown in my original post are two extremes. The bubble chart is high drama at the expense of data integrity. Readers cannot learn any of the following from that chart:

  • the shape of the growth and subsequent decline of the flu epidemic
  • the beginning and ending date of the epidemic
  • the peak of the epidemic*

* The peak can be inferred from the data label, although there appears to be at least one other circle of approximately equal size, which isn't labeled.

The column chart is low drama but high data integrity. To retain some dramatic element, I encoded the data redundantly in the color scale. I also emulated the original chart in labeling specific spikes.

The designer then simply has to choose a position along these two extremes. This will involve some smoothing or aggregation of the data. Robert showed a column chart that has weekly aggregates, and in his view, his version is closer to the bubble chart.

Robert's version indeed strikes a balance between drama and data integrity, and I am in favor of it. Here is the idea (I am responsible for the added color).

Kosara_avianflu2

***

Where I depart from Robert is how one reads a column chart such as the one I posted:

Redo_avianflu2

Robert thinks that readers will perceive each individual line separately, and in so doing, "details hide the story". When I look at a chart like this, I am drawn to the envelope of the columns. The lighter colors are chosen for the smaller spikes to push them into the background. What might be the problem are those data labels identifying specific spikes; they are a holdover from the original chart--I actually don't know why those specific dates are labeled.

***

In summary, the key takeaway is, as Robert puts it:

the point of this [dataset] is really not about individual days, it’s about the grand totals and the speed with which the outbreak happened.

We both agree that the weekly version is the best among these. I don't see how the reader can figure out grand totals and speed with which the outbreak happened by staring at those dramatic but overlapping bubbles.


Is it worth the drama?

Quite the eye-catching chart this:

Wsj_avianflu

The original accompanied this article in the Wall Street Journal about avian flu outbreaks in the U.S.

The point of the chart appears to be the peak in the flu season around May. The overlapping bubbles were probably used for drama.

A column chart, with appropriate colors, attains much of the drama but retains the ability to read the data.

Redo_avianflu2

 


Raw data and the incurious

The following chart caught my eye when it appeared in the Wall Street Journal this month:

Wsj_fedratehike

This is a laborious design; much sweat has been poured into it. It's a chart that requires the reader to spend time learning to read.

A major difficulty for any visualization of this dataset is keeping track of the two time scales. One scale, depicted horizontally, traces the dates of Fed meetings. These meetings seem to occur four times a year except in 2012. The other time scale is encoded in the colors, explained above the chart. This is the outlook by each Fed committee member of when he/she expects a rate hike to occur.

I find it challenging to understand the time scale in discrete colors. Given that time has an order, my expectation is that the colors should be ordered. Adding to this mess is the correlation between the two time scales. As time treads on, certain predictions become infeasible.

Part of the problem is the unexplained vertical scale. Eventually, I realize each cell is a committee member, and there are 19 members, although two or three routinely fail to submit their outlook in any given meeting.

Contrary to expectation, I don't think one can read across a row to see how a particular member changes his/her view over time. This is because the patches of color would be less together otherwise.

***

After this struggle, all I wanted is some learning from this dataset. Here is what I came up with:

Redo_wsjfedratehike

There is actually little of interest in the data. The most salient point is that a shift in view occurred back in September 2012 when enough members pushed back the year of rate hike that the median view moved from 2014 to 2015. Thereafter, there is a decidedly muted climb in support for the 2015 view.

***

This is an example in which plotting elemental data backfires. Raw data is the sanctuary of the incurious.

 

 


Bewildering baseball math

Over Twitter, someone asked me about this chart:

Mlb_pipeline

It's called the MLB pipeline. The text at the top helpfully tells us what the chart is about: how the playoff teams in baseball are built. That's the good part.

It then took me half a day to understand what is going on below. There are four ways for a player to be on a team: homegrown, trades and free agents, wherein homegrown includes drafted players or international players.

Each row is a type of player. You can look up which teams have exactly X players of a specific type. It gets harder if you want to know how many players team Y has of a given type. It is even harder if you don't know the logos of every team (e.g. Toronto Blue Jays).

Some fishy business is going on with the threesomes and foursomes. Here is the red threesome:

Mlb_threesome1

Didn't know baseball employs half a player. The green section has a different way to play threesomes:

Mlb_threesome2

The blue section takes inspiration from both and shows us a foursome:

Mlb_foursome

I was stuck literally in the middle for quite a while:

Mlb_middlesection

Eventually, I realized that this is a summary of the first two sections on the page. I still don't understand why there is no gap between 11 and 14 but then the 14 and 15 arrows are twice as large as 9, 10 and 11 even though every arrow contains exactly one team.

***

The biggest problem in the above chart is the hidden base: each team's roster has a total of 25 players.

Here is a different view of the data:

Redo_mlb_pipeline

With this chart, I want to emphasize two points: first, addressing the most interesting question of which team(s) emphasize which particular player acquisition tactic; second, providing the proper reference level to interpret the data.

Regarding the vertical, reference lines: take the top left chart about players arriving through trade. If every team equally emphasizes this tactic, then each team should have the same number of traded players on the 25-person roster. This would mean every team has approximately 11 traded players. This is clearly not the case. Several teams, especially Cubs and Blue Jays, utilized trades more often than teams like Mets and Royals.

 

 


A data visualization that is invariant to the data

This map appeared in Princeton Alumni Weekly:

Pu_region_orig_sm

Here is another map I created:

Pu_region_orig_area

If you think they look basically the same, you got the point. Now look at the data on the maps. The original map displays the proportion of graduates who ended up in different regions of the country. The second map displays the proportion of land mass in different regions of the country.

The point is that this visual design is not self-sufficient. If you cover up the data printed on the map, there is nothing else to see. Further, if you swap in other data series (anything at all), nothing on the map changes. Yes, this map is invariant to the data!

This means the only way to read this map is to read the data directly.

***

Maps also have the other issue. The larger land areas draw the most attention. However, the sizes of the regions are in inverse proportion to the data being depicted. The smaller the values, the larger the areas on the map. This is the scatter plot of the proportion of graduates (the data) versus the proportion of land mass:

Pu_region_scatter

***

One quick fix is to use a continuous color scale. In this way, the colors encode the data. For example:

Pu_regions_continuous1_labels

The dark color now draws attention to itself.

Of course, one should think twice before using a map.

***

One note of curiosity: Given the proximity to NYC, it is not surprising that NYC is the most popular destination for Princeton graduates. Strangely enough, a move from Princeton to New York is considered out of region, by the way the regions are defined. New Jersey is lumped with Pennsylvania, Maryland, Virginia, etc. into the Mid-Atlantic region while New York is considered Northeast.

 


After seeing this chart, my mouth needed a rinse

The credit for today's headline goes to Andrew Gelman, who said something like that when I presented the following chart at his Statistical Graphics class yesterday:

Fidelityad_consumerstaples_adj_smWith this chart (which appeared in a large ad in the NY Times), Fidelity Investment wants to tell potential customers to move money into the consumer staples category because of "greater return" and "lower risk". You just might wonder what a "consumer staple" is. Toothbrushes, you see.

There are too many issues with the chart to fit into one blog post. My biggest problem concerns the visual trickery used to illustrate "greater" and "lower". The designer wants to focus readers on the two orange brushes: return for consumer staples is higher, and risk is lower, you see.

The "greater" (i.e. right-facing) toothbrush is associated with longer brushes and higher elevation; the "lower" (left-facing) toothbrush, with shorter brushes and lower elevation.

But looking carefully at the scales reveals that the return ranges from 6% to 14% and the risk ranges from 10% to 25%. So larger numbers are depicted by shorter brushes and lower elevation, exactly the opposite of one's expectation. The orange brushes happen to  represent the same value of 14.3% but the one on the right is at least four times as large as the one on the left. As the dentist says, time to rinse out!

The vertical axis represents ranking of the investment categories in terms of decreasing return and/or risk so on both toothbrushes, the axis should run from 1 to 10.

***

How would the dentist fix this?

The first step is to visit the Q corner of the Trifecta Checkup. The purpose of this chart is for investors to realize that (using the chosen metrics) consumer durables have the best combination of risk and return. In finance, risk is measured as the volatility of return. So, in effect, all the investors care about is the probability of getting a certain level of return.

The trouble with any chart that shows both risk and return is that readers have no way of going from the pair of numbers to the probability of getting a certain level of return.

The fix is to plot the probability of returns directly.

Redo_fidelity_staples

In the above sketch, I just assumed a normal probability model, which is incorrect; but it is not hard to substitute this with an empirial distribution, if you obtain the raw data.

Unlike the original chart, it does not appear that consumer staples is a clearcut winner.