Enhanced tables, and supercharged spreadsheets with in-cell tech

Old-timer Chris P. sent me to this Bloomberg article about Vanguard ETFs and low-cost funds (link). The article itself is interesting, and I will discuss it on the sister blog some time in the future.

Chris is impressed with this table included with the article:


This table indeed presents the insight clearly. Those fund sectors in which Vanguard does not compete have much higher costs than the fund sectors in which Vanguard is a player. The author calls this the "Vanguard effect."

This is a case where finding a visual design to beat this table is hard.

For a certain type of audience, namely financial, the spreadsheet is like rice or pasta; you simply can't live without it. The Bloomberg spreadsheet does one better: the bands of blue contrast with the white cells, which neatly divides those funds into two groups.

If you use spreadsheets a lot, you should definitely look into in-cell charts. Perhaps Tufte's sparkline is the most famous but use your imagination. I also wish vendors would support in-cell charts more eagerly.

Here is a vision of what in-cell technology can do with the above spreadsheet. (The chart is generated in R.)




Rethinking the index data, with modesty and clarity in mind

I discussed the rose chart used in the Environmental Performance Index (EPI) report last week. This type of data is always challenging to visualize.

One should start with an objective. If the goal is a data dump, that is to say, all you want is to deliver the raw data in its full glory to the user, then you should just print a set of data tables. This has traditionally been the delivery mechanism of choice.

If, on the other hand, your interest is communicating insights, then you need to ask some interesting questions. One such question is how do different regions and/or countries compare with each other, not just in the overall index but also in the major sub-indices?

Learning to ask such a question requires first understanding the structure of the data. As described in the previous post, the EPI is a weighted average of a bunch of sub-indices. Each sub-index measures "distance to a target," which is then converted into a scale from 0 to 100. This formula guarantees that at the aggregate level, the EPI is not going to be 0 or 100: a country would have to score 100 on all sub-indices to attain EPI perfection!

Here is a design sketch to address the question posed above:


For a print version, I chose several reference countries listed at the bottom that span the range of common values. In the final product, hovering over a stripe should disclose a country and its EPI. Then the reader can construct comparisons of the type: "Thailand has a value of 53, which places it between Brazil and China."

The chart reveals a number of insights. Each region stakes out its territory within the EPI scale. There are no European countries with EPI lower than 45 while there are no South Asian countries with EPI higher than 50 or so. Within each region, the distribution is very wide, and particularly so in the East Asia and Pacific region. Europe is clearly the leading region, followed by North America.

The same format can be replicated for every sub-index.

This type of graph addresses a subset of the set of all possible questions and it does so in a clear way. Modesty in your goals often helps.


Reimagining the league table

The reason for the infrequent posting is my travel schedule. I spent the past week in Seattle at JSM. This is an annual meeting of statisticians. I presented some work on fantasy football data that I started while writing Numbersense.

For my talk, I wanted to present the ubiquitous league table in a more useful way. The league table is a table of results and relevant statistics, at the team level, in a given sports league, usually ordered by the current winning percentage. Here is an example of ESPN's presentation of the NFL end-of-season league table from 2014.


If you want to know weekly results, you have to scroll to each team's section, and look at this format:


For the graph that I envisioned for the talk,  I wanted to show the correlation between Points Scored and winning/losing. Needless to say, the existing format is not satisfactory. This format is especially poor if I want my readers to be able to compare across teams.


The graph that I ended up using is this one:


 The teams are sorted by winning percentage. One thing should be pretty clear... the raw Points Scored are only weakly associated with winning percentage. Especially in the middle of the Points distribution, other factors are at play determining if the team wins or loses.

The overlapping dots present a bit of a challenge. I went through a few other drafts before settling on this.

The same chart but with colored dots, and a legend:


Only one line of dots per team instead of two, and also requiring a legend:


 Jittering is a popular solution to separating co-located dots but the effect isn't very pleasing to my eye:


Small multiples is another frequently prescribed solution. Here I separated the Wins and Losses in side-by-side panels. The legend can be removed.



As usual, sketching is one of the most important skills in data visualization; and you'd want to have a tool that makes sketching painless and quick.

Where are the millionaires? Where's the news?

The financial media, ranging from Wall Street Journal to Zero Hedge, blogged about the geographical distribution of U.S. millionaires. The stories came with a map, and in the case of the latter, two data tables ranked by ascending and descending prevalence of millionaires. The map looks like this:


The talking point lifted from the press release of Phoenix Marketing, who is the origin of the data, focuses improbably on North Dakota. For example, the WSJ blog began with:

The state making the fastest climb up the millionaire rankings doesn’t have a single Tiffany or Saks Fifth Avenue store. The closest BMW dealership is a six-hour drive from the capital.

Welcome to North Dakota, which jumped 14 spots in the annual rankings of millionaire households per capita released by Phoenix Marketing International.

The trouble is, you can't pick North Dakota out of the map; it just doesn't stand out. The map uses a different methodology of ordering the states, by groupings of the prevalence of millionaires, that is, the proportion of households in each state who are labeled "millionaires" by Phoenix Marketing.

The text, by contrast, draws attention to the change in the rank of states using the proportion of households who are millionaires as the ranking criterion. This data is two steps removed from the data used for the map (start with the map data, compute the year-to-year change, then convert to ranks).


State-level averages pose a challenge: state population varies a lot, and this leads to variability in the estimates of smaller states. You are likely to find smaller states over-represented in the top and bottom of state ranking charts. I talked about a similar situation relating to interpreting high schools test data (see this post, and Prologue of Numbersense link.)

Instead of using proportion of households who are millionaires, I prefer to use the number of millionaires per 1,000 households. Mathematically, these two are equivalent. If we plot that metric versus the size of states (number of households), we see the familiar pattern:


I labeled the North Dakota data point to show how unremarkable it is. While it may have risen in "rank", it is still ranked below median in terms of number of millionaires per 1000 households. Also notice that of states with similar number of households, the millionaires metric ranges wildly from 40 to 70 per 1000 households.

An interpretation of these state average millionaire metrics has to account for state population size.


The following map illustrates the ups and downs between 2007 and 2013 by state.  (I found 2007 data but not the 2012 data.)


Think of an accounting equation. In this view, the positive changes must balance out the negative changes since I am only converned about any shift in mix. What this map shows is that Texas, California, New York, and Washington have the top net gains in the number of millionaires while Florida, and Michigan have the biggest net losses. North Dakota is again in the middle of the bunch.

This view ignores the total net change in millionaires as it focuses on the mix by state.  You'd need to figure out what is the relevant question before you can come up with a good visualization of this (or any) data.



A straight line going nowhere fast, despite tweets and likes

Ken B., another Australian reader, wasn't too proud of this effort, apparently excerpted from an HSBC report by the Sydney Morning Herald (link):


Ken: If you plot ranking by ranking it magically turns into a straight line.


There are a few other annoyances. Gridlines, data labels, double-edged arrow, bars all based on the same data, which can easily be conveyed with a ranked table. In fact, just turn the chart 90 degrees clockwise, get rid of everything else except the names of countries, and you have a much more readable figure.

The completely unnecessary legend is an Excel special. If only one data series is plotted, it should be automatic to suppress the legend.

The three-letter acronyms for different currencies is a futile educational lesson kind of like plotting geographical data on maps (in many cases). For most readers, the message of the chart does not require knowing the names of the currencies, nor their acronyms. For those who care about acronyms, say currency traders, they most likely already know those letters.


Just like I don't understand how we can define "over-rated" or "under-rated" restaurants (see this post and this), I also don't understand how we can define "over-valued" or "under-valued" currencies given the impossiblity of knowing the "true value" of any currency. 


I just had to point your attention to the fact that 123 people tweeted this article, and 221 liked this item on Facebook. And these actions form part of the so-called Big Data revolution.

As good as Bolt

The accomplished graphics team at NYT outdid themselves with this feature on the 100m dash through Olympic history (link). You should really go and check out the full presentation.


About_100m_winnersThey start with a data table like the one shown on the right. It's a boring list of names and winning times by year and by medal type. What can one do to animate this data? The NYT team found many ways.

The presentation consists of a static dot plot plus a short movie.

They found many ways to convey the meaning of the tenths and hundredths of a second that separate the top performers. In the dot plot, for example, they did not draw the actual winning times. Instead, they converted the differences in winning times into distances. Here is the right section of the chart:


We are drawn into compressing time and place, having Usain Bolt race all of the former winners and assuming everyone ran the same race they did in real life. The dot plot tells us how far ahead of each past winner Bolt is.

Some time ago, I wrote about the "audiolization" of duration data, in another piece about a NYT chart (link). They deployed this strategy beautifully at the end of the short film. The runners were aligned like keys on a piano, and the resulting sound is like playing a scale across the keyboard. Lovely, that is to say.



The authors bring in a number of other data points to create reference points for understanding this data. For example, if you blink, you might miss the national jerseys worn by each winner in the hypothetical competition:


 Later, the dominance of American runners is plainly shown via white lanes:


 The perspective hides the relative impotency of American sprinters in recent Olympics. This view of the surge of Caribbean runners makes up for it:



Next, they compared the times for U.S. age group record holders to Olympic winning times. This is a fun way to look at the data. (Pardon the strutting Play button.)


They play with foreground/background here in an effective way. The 15- and 16-year-old age-group record holder is said to be "good enough for a bronze as recently as 1980".

Fun aside, think twice before you repeat this "insight". It falls into the category of those things that sound impressive but are quite meaningless. For one thing, the gap between the two runners is affected by a multitude of factors: the age of the runner (which is elevated here over and above other factors), the nationality of the runner, and the time of the run. This last point is key: if we compare the 15-to-16-year-old 100m record time from 1980 to the winning times of Olympic medalists from that year, the gap would be much wider.

Also, pay attention to the distribution of runners. It gets very crowded very quickly near the top end of the scale. In other words, while the gap as measured in part-seconds may seem small, the gap as measured in individual athletes would be very wide -- we'd find loads of athletes whose times fit into the gap illustrated here.


According to the dot plot, in some years, like the 1950s, there were no gold medalists. Looking at the data here, I think this is an overplotting effect, where two times were so close that the dots were literally on top of each other. This creates the situation where one of the dots will be on top of the other, and which one is on top is a feature of the software you're using. Jittering is one common strategy to deal with this problem, or we can just place the gold, silver and bronze dots on their own levels. The latter strategy would look exactly like the over-the-top view used in the short film:


(We'll also note that this view has time running left to right, which is perhaps more natural than time running bottom up, as in the dot plot. However, we are used to seeing runners cross the finish line from left to right on a TV screen so this is a case of eight ounces and half a pound.)

In the short film, I find the gigantic play/pause button at the center of the screen an annoyance, ruining my enjoyment. (I'm using Firefox and a Mac.)


Now, go check out the entire feature (link), and applaud the effort.

No sorting and lack of structure undermine a chart


Reader Daniel L. isn't impressed with this page of charts about gay rights in the U.S., from the Guardian paper (London). (link)


The use of circles to organize data has a long history, stretching back at least to the Nightingale rose, which turns the time dimension into a circle. Andrew doesn't like this concept (e.g. here), neither do I.  Here is something similar by McCandless (link) that has appeared on this blog.


Take the following set of charts showing the legislative differences by region of the country.


Since states within region are categories with no order, there is no easy way to order the states. This is made worse by the categorical nature of the other variable: the legislative posture on marriage, civil union and domestic partnership is very messy data with no order either.

The regions can be sorted reasonably by the "average" permissiveness but this chart shows no concern over sorting at all.

About the only easy read from this set of charts is the observation that the Northeast states are most permissive while the Southeast statements are most restrictive. Anyone who has casual exposure to this social issue knows this without needing a chart.


The key to clarifying this chart is to clarify the underlying structure, particularly the structure of the permissiveness variable. Dissecting the data reveals that there are only five possible postures (Banned all three rights, Banned marriage but allows one of the other rights, Allow civil unions, Allow marriage, and No information).  The following data table conveys the data with minimal fuss:




9-9-9 data deserving a chart

Herman Cain's 9-9-9 plan has attracted lots of attention. Paul Krugman cites a Tax Policy Center study that shows the plan penalizes middle and lower income households.


The main idea is that there are people who will pay more taxes and those who will pay less under every income group. The impact of the tax policy depends on the mix between those paying more and those paying less.

The data table is far too cumbersome to bring out the message. Here is a visualization:


For example, for the top 20% income group, about 50 % will see an average tax cut of 3900. But about 80 to 90% of the other 4 quintiles will experience tax hikes.

Rank confusion

This chart, found in Princeton Alumni Weekly, only partially scanned here, supposedly gave reasons for "Princeton's top-rated [Ph.D.] programs" "to celebrate". My alma mater has outstanding academic departments, but it would be difficult to know from this chart!


Due to the color scheme, the numbers that jump out at you are the ones in the bright orange background, which refers to how many other departments are ranked equal to Princeton's in those subjects. It takes some effort to realize that the more zeroes there are in the top buckets (fading orange), the better.

The editor started with a nice idea, which is to convert raw rankings into clusters of rankings. She recognized that in this type of rankings (see a related post on my book blog here), it is meaningless to gloat about #1 versus #2 because they are probably statistically the same. For instance, in the ranking of Architecture departments (ARC), 37 schools (including Princeton) all belonged to the same cluster as Princeton, judged to be a statistical tie.

One of the main reasons why this chart looks so confusing is its failing the self-sufficiency test. It really is a disguised data table, with some colorful background and shadows; the graphical elements add nothing to the data at all. If one covered up all the data, there is nothing left to see!

In the following rework, I emphasize the cluster structure. Each subject has three possible clusters, schools ranked above, equal to, and below Princeton. Instead of plotting raw numbers, the chart shows proportions of schools in each category. The order is roughly such that the departments with the relatively higher standing float to the top. Because a bar chart is used, the department names could be spelt out in their entirety and placed horizontally.


If one has access to the raw data, it would be even better to reveal the entire cluster structure. It is quite possible that the clusters above and below Princeton can be further subdivided into more clusters. This will allow readers to understand better what the cluster ranks mean.