Reimagining the league table

The reason for the infrequent posting is my travel schedule. I spent the past week in Seattle at JSM. This is an annual meeting of statisticians. I presented some work on fantasy football data that I started while writing Numbersense.

For my talk, I wanted to present the ubiquitous league table in a more useful way. The league table is a table of results and relevant statistics, at the team level, in a given sports league, usually ordered by the current winning percentage. Here is an example of ESPN's presentation of the NFL end-of-season league table from 2014.


If you want to know weekly results, you have to scroll to each team's section, and look at this format:


For the graph that I envisioned for the talk,  I wanted to show the correlation between Points Scored and winning/losing. Needless to say, the existing format is not satisfactory. This format is especially poor if I want my readers to be able to compare across teams.


The graph that I ended up using is this one:


 The teams are sorted by winning percentage. One thing should be pretty clear... the raw Points Scored are only weakly associated with winning percentage. Especially in the middle of the Points distribution, other factors are at play determining if the team wins or loses.

The overlapping dots present a bit of a challenge. I went through a few other drafts before settling on this.

The same chart but with colored dots, and a legend:


Only one line of dots per team instead of two, and also requiring a legend:


 Jittering is a popular solution to separating co-located dots but the effect isn't very pleasing to my eye:


Small multiples is another frequently prescribed solution. Here I separated the Wins and Losses in side-by-side panels. The legend can be removed.



As usual, sketching is one of the most important skills in data visualization; and you'd want to have a tool that makes sketching painless and quick.

Where are the millionaires? Where's the news?

The financial media, ranging from Wall Street Journal to Zero Hedge, blogged about the geographical distribution of U.S. millionaires. The stories came with a map, and in the case of the latter, two data tables ranked by ascending and descending prevalence of millionaires. The map looks like this:


The talking point lifted from the press release of Phoenix Marketing, who is the origin of the data, focuses improbably on North Dakota. For example, the WSJ blog began with:

The state making the fastest climb up the millionaire rankings doesn’t have a single Tiffany or Saks Fifth Avenue store. The closest BMW dealership is a six-hour drive from the capital.

Welcome to North Dakota, which jumped 14 spots in the annual rankings of millionaire households per capita released by Phoenix Marketing International.

The trouble is, you can't pick North Dakota out of the map; it just doesn't stand out. The map uses a different methodology of ordering the states, by groupings of the prevalence of millionaires, that is, the proportion of households in each state who are labeled "millionaires" by Phoenix Marketing.

The text, by contrast, draws attention to the change in the rank of states using the proportion of households who are millionaires as the ranking criterion. This data is two steps removed from the data used for the map (start with the map data, compute the year-to-year change, then convert to ranks).


State-level averages pose a challenge: state population varies a lot, and this leads to variability in the estimates of smaller states. You are likely to find smaller states over-represented in the top and bottom of state ranking charts. I talked about a similar situation relating to interpreting high schools test data (see this post, and Prologue of Numbersense link.)

Instead of using proportion of households who are millionaires, I prefer to use the number of millionaires per 1,000 households. Mathematically, these two are equivalent. If we plot that metric versus the size of states (number of households), we see the familiar pattern:


I labeled the North Dakota data point to show how unremarkable it is. While it may have risen in "rank", it is still ranked below median in terms of number of millionaires per 1000 households. Also notice that of states with similar number of households, the millionaires metric ranges wildly from 40 to 70 per 1000 households.

An interpretation of these state average millionaire metrics has to account for state population size.


The following map illustrates the ups and downs between 2007 and 2013 by state.  (I found 2007 data but not the 2012 data.)


Think of an accounting equation. In this view, the positive changes must balance out the negative changes since I am only converned about any shift in mix. What this map shows is that Texas, California, New York, and Washington have the top net gains in the number of millionaires while Florida, and Michigan have the biggest net losses. North Dakota is again in the middle of the bunch.

This view ignores the total net change in millionaires as it focuses on the mix by state.  You'd need to figure out what is the relevant question before you can come up with a good visualization of this (or any) data.



A straight line going nowhere fast, despite tweets and likes

Ken B., another Australian reader, wasn't too proud of this effort, apparently excerpted from an HSBC report by the Sydney Morning Herald (link):


Ken: If you plot ranking by ranking it magically turns into a straight line.


There are a few other annoyances. Gridlines, data labels, double-edged arrow, bars all based on the same data, which can easily be conveyed with a ranked table. In fact, just turn the chart 90 degrees clockwise, get rid of everything else except the names of countries, and you have a much more readable figure.

The completely unnecessary legend is an Excel special. If only one data series is plotted, it should be automatic to suppress the legend.

The three-letter acronyms for different currencies is a futile educational lesson kind of like plotting geographical data on maps (in many cases). For most readers, the message of the chart does not require knowing the names of the currencies, nor their acronyms. For those who care about acronyms, say currency traders, they most likely already know those letters.


Just like I don't understand how we can define "over-rated" or "under-rated" restaurants (see this post and this), I also don't understand how we can define "over-valued" or "under-valued" currencies given the impossiblity of knowing the "true value" of any currency. 


I just had to point your attention to the fact that 123 people tweeted this article, and 221 liked this item on Facebook. And these actions form part of the so-called Big Data revolution.

As good as Bolt

The accomplished graphics team at NYT outdid themselves with this feature on the 100m dash through Olympic history (link). You should really go and check out the full presentation.


About_100m_winnersThey start with a data table like the one shown on the right. It's a boring list of names and winning times by year and by medal type. What can one do to animate this data? The NYT team found many ways.

The presentation consists of a static dot plot plus a short movie.

They found many ways to convey the meaning of the tenths and hundredths of a second that separate the top performers. In the dot plot, for example, they did not draw the actual winning times. Instead, they converted the differences in winning times into distances. Here is the right section of the chart:


We are drawn into compressing time and place, having Usain Bolt race all of the former winners and assuming everyone ran the same race they did in real life. The dot plot tells us how far ahead of each past winner Bolt is.

Some time ago, I wrote about the "audiolization" of duration data, in another piece about a NYT chart (link). They deployed this strategy beautifully at the end of the short film. The runners were aligned like keys on a piano, and the resulting sound is like playing a scale across the keyboard. Lovely, that is to say.



The authors bring in a number of other data points to create reference points for understanding this data. For example, if you blink, you might miss the national jerseys worn by each winner in the hypothetical competition:


 Later, the dominance of American runners is plainly shown via white lanes:


 The perspective hides the relative impotency of American sprinters in recent Olympics. This view of the surge of Caribbean runners makes up for it:



Next, they compared the times for U.S. age group record holders to Olympic winning times. This is a fun way to look at the data. (Pardon the strutting Play button.)


They play with foreground/background here in an effective way. The 15- and 16-year-old age-group record holder is said to be "good enough for a bronze as recently as 1980".

Fun aside, think twice before you repeat this "insight". It falls into the category of those things that sound impressive but are quite meaningless. For one thing, the gap between the two runners is affected by a multitude of factors: the age of the runner (which is elevated here over and above other factors), the nationality of the runner, and the time of the run. This last point is key: if we compare the 15-to-16-year-old 100m record time from 1980 to the winning times of Olympic medalists from that year, the gap would be much wider.

Also, pay attention to the distribution of runners. It gets very crowded very quickly near the top end of the scale. In other words, while the gap as measured in part-seconds may seem small, the gap as measured in individual athletes would be very wide -- we'd find loads of athletes whose times fit into the gap illustrated here.


According to the dot plot, in some years, like the 1950s, there were no gold medalists. Looking at the data here, I think this is an overplotting effect, where two times were so close that the dots were literally on top of each other. This creates the situation where one of the dots will be on top of the other, and which one is on top is a feature of the software you're using. Jittering is one common strategy to deal with this problem, or we can just place the gold, silver and bronze dots on their own levels. The latter strategy would look exactly like the over-the-top view used in the short film:


(We'll also note that this view has time running left to right, which is perhaps more natural than time running bottom up, as in the dot plot. However, we are used to seeing runners cross the finish line from left to right on a TV screen so this is a case of eight ounces and half a pound.)

In the short film, I find the gigantic play/pause button at the center of the screen an annoyance, ruining my enjoyment. (I'm using Firefox and a Mac.)


Now, go check out the entire feature (link), and applaud the effort.

No sorting and lack of structure undermine a chart


Reader Daniel L. isn't impressed with this page of charts about gay rights in the U.S., from the Guardian paper (London). (link)


The use of circles to organize data has a long history, stretching back at least to the Nightingale rose, which turns the time dimension into a circle. Andrew doesn't like this concept (e.g. here), neither do I.  Here is something similar by McCandless (link) that has appeared on this blog.


Take the following set of charts showing the legislative differences by region of the country.


Since states within region are categories with no order, there is no easy way to order the states. This is made worse by the categorical nature of the other variable: the legislative posture on marriage, civil union and domestic partnership is very messy data with no order either.

The regions can be sorted reasonably by the "average" permissiveness but this chart shows no concern over sorting at all.

About the only easy read from this set of charts is the observation that the Northeast states are most permissive while the Southeast statements are most restrictive. Anyone who has casual exposure to this social issue knows this without needing a chart.


The key to clarifying this chart is to clarify the underlying structure, particularly the structure of the permissiveness variable. Dissecting the data reveals that there are only five possible postures (Banned all three rights, Banned marriage but allows one of the other rights, Allow civil unions, Allow marriage, and No information).  The following data table conveys the data with minimal fuss:




9-9-9 data deserving a chart

Herman Cain's 9-9-9 plan has attracted lots of attention. Paul Krugman cites a Tax Policy Center study that shows the plan penalizes middle and lower income households.


The main idea is that there are people who will pay more taxes and those who will pay less under every income group. The impact of the tax policy depends on the mix between those paying more and those paying less.

The data table is far too cumbersome to bring out the message. Here is a visualization:


For example, for the top 20% income group, about 50 % will see an average tax cut of 3900. But about 80 to 90% of the other 4 quintiles will experience tax hikes.

Rank confusion

This chart, found in Princeton Alumni Weekly, only partially scanned here, supposedly gave reasons for "Princeton's top-rated [Ph.D.] programs" "to celebrate". My alma mater has outstanding academic departments, but it would be difficult to know from this chart!


Due to the color scheme, the numbers that jump out at you are the ones in the bright orange background, which refers to how many other departments are ranked equal to Princeton's in those subjects. It takes some effort to realize that the more zeroes there are in the top buckets (fading orange), the better.

The editor started with a nice idea, which is to convert raw rankings into clusters of rankings. She recognized that in this type of rankings (see a related post on my book blog here), it is meaningless to gloat about #1 versus #2 because they are probably statistically the same. For instance, in the ranking of Architecture departments (ARC), 37 schools (including Princeton) all belonged to the same cluster as Princeton, judged to be a statistical tie.

One of the main reasons why this chart looks so confusing is its failing the self-sufficiency test. It really is a disguised data table, with some colorful background and shadows; the graphical elements add nothing to the data at all. If one covered up all the data, there is nothing left to see!

In the following rework, I emphasize the cluster structure. Each subject has three possible clusters, schools ranked above, equal to, and below Princeton. Instead of plotting raw numbers, the chart shows proportions of schools in each category. The order is roughly such that the departments with the relatively higher standing float to the top. Because a bar chart is used, the department names could be spelt out in their entirety and placed horizontally.


If one has access to the raw data, it would be even better to reveal the entire cluster structure. It is quite possible that the clusters above and below Princeton can be further subdivided into more clusters. This will allow readers to understand better what the cluster ranks mean.


Game over, Tetris

This is the image in my head at the moment (link here):


That's when @yawnathan pointed me to the following infographics via twitter:


Here are some issues with the choice of graphics:

  • Not sure how tetris-shaped pieces are better than a standard stacked bar chart, or a line chart
  • Adding a one-liner for each analysis summarizing the key insight is essential, and much more engaging than dry titles like "by gender"
  • Ordering each section of the poster in a sensible way would help bring out the message; maintaining the same order in all four sections has little benefit but adds to the confusion
  • Many of the corporate logos are not popular enough to yield recognition; they do not resemble their company names enough to elicit free association

But the chart also fails to ask the right question. In thinking about "who uses which sites?", it would be much more informative to cut the data in a different way -- tell us among males, what proportion uses Digg v. Stumbleupon v. Facebook, etc. The problem with the current graphic is that it offers no information about scale. For example, Ning may have 1/1000-th of the total traffic compared to Facebook (I made this up) but you wouldn't know since everything is expressed as a proportion of each site's user base.

Besides, what is the objective behind asking the question, who uses which sites? Are readers asked to draw conclusions about the relative viability of the business models of these companies? Is there some significance associated with an elderly skew or female skew?

Finally, the chart hits the trifecta! It also fails from the data collection perspective. While it discloses the source of the data as "Google Ad Planner", it is impossible for readers to make sense of the data. How reliable is this data? Did the income levels come from surveys of users (self-reported and probably biased)? Or from users associated with a specific advertising campaign? Did they come from matching users' IP addresses to Census data? If so, how much actual household-level data are used? Or perhaps a statistical model was built to predict income levels?  Of which period is the data representative? Does that period generalize to other periods? Were there any (or many) missing values? Were these values imputed or set to the average? If a sample was used, how do we know that it is unbiased?


In this form, the infographics poster is nothing more than a done-up data dump.

The best way to handle two dimensions may be to not use two dimensions

Guess what the designer at Nielsen wanted to tell you with this chart:

Reader Steven S. couldn't figure it out, and chances are neither can you.

What about...

  • The smartphone (OS) market is dominated by three top players (Android, Apple and Blackberry) each having roughly 30% share, while others split the remaining 10%.
  • The age-group mix for each competitor is similar (or are they?)

Maybe those are the messages; if so, there is no need to present a bivariate plot (the so-called "mosaic" plot, or in consulting circles, the Marimekko). Having two charts carrying one message each would accomplish the job cleanly.


Trying to do too much in one chart is a disease; witness the side effects.  Smartphone_sm1

The two columns, counting from the right, contain rectangles that appear to be of different sizes, and yet the data labels claim each piece represents 1%, and in some cases "< 1%".  The simultaneous manipulation of both the height and the width plays mind tricks.

Also, while one would ordinarily applaud the dropping of decimals from a chart like this, doing so actually creates the ugly problem that the five pieces of 1% (on the left column shown here) have the same width but clearly varying heights!

Smartphone_sm2 What about this section of the plot shown on the left? Does the smaller green box look like it's less than 1/3 the size of the longer green box? This chart is clearly not self-sufficient, and as such one might prefer a simple data table.

The downfall of the mosaic plot is that it gives the illusion of having two dimensions but only an illusion: in fact, the chart is dominated by one dimension, as all proportions are relative to the grand total.

For instance, the chart says that 6% of all smartphone users are between the ages of 18 and 24 AND uses an Android phone. It also tells us that 2% of all smartphone users are between 35 and 44 AND uses a Palm phone. Those are not two numbers anyone would desire to compare. There are hardly any practical questions that require comparing them.

Sometimes, the best way to handle two dimensions is not to use two dimensions.


 The original article notes that "Of the three most popular smartphone operating systems, Android seems to attract more young consumers." In the chart shown below,  Redo_phoneos we assume that the business question is the relative popularity of phone operating systems across age groups. 

The right metric for comparison is the market share of each OS within an age group.

For example, tracing the black line labeled "Android", this chart tells us that Android has 37% of the 18-24 market while it has about 20% of the 65 and up market. 

Android has an overall market share of about 30%, and that average obscures a youth bias that is linear with age.

On the other hand, the iPhone (green line) has also an average market share of about 30% but its profile is pretty flat in all age groups except 65 and up where it has considerable strength.

Further, the gap between Android and iPhone at the older age group actually opens up at 55 years and up. In the 55-64 age group, the iPhone holds a market share that is similar to its overall average while the Android performs quite a bit worse than its average. We note that Palm OS has some strength in the older age groups as well while the Blackberry also significantly underperforms in 65 and over.

Why aren't all these insights visible in the mosaic chart? It all because the chosen denominator of the entire market (as opposed to each age group) makes a lot of segments very small, and then the differences between small segments become invisible when placed beside much larger segments.

Now, the reconstituted chart gives no information about the relative sizes of the age groups. The market size for the older groups is quite a bit smaller than the younger groups. This information should be provided in a separate chart, or as a little histogram tucked under the age-group axis.