Raw data and the incurious

The following chart caught my eye when it appeared in the Wall Street Journal this month:


This is a laborious design; much sweat has been poured into it. It's a chart that requires the reader to spend time learning how to read it.

A major difficulty for any visualization of this dataset is keeping track of the two time scales. One scale, depicted horizontally, traces the dates of Fed meetings. These meetings seem to occur four times a year except in 2012. The other time scale is encoded in the colors, explained above the chart: each Fed committee member's outlook on when he or she expects a rate hike to occur.

I find it challenging to understand a time scale rendered in discrete colors. Given that time has an order, my expectation is that the colors should be ordered too. Adding to this mess is the correlation between the two time scales: as time marches on, certain predictions become infeasible.

Part of the problem is the unexplained vertical scale. Eventually, I realize each cell is a committee member, and there are 19 members, although two or three routinely fail to submit their outlook in any given meeting.

Contrary to expectation, I don't think one can read across a row to see how a particular member changes his/her view over time. The cells appear to be arranged to keep the patches of color contiguous, not to track individual members.


After this struggle, all I wanted was to learn something from this dataset. Here is what I came up with:


There is actually little of interest in the data. The most salient point is that a shift in view occurred back in September 2012 when enough members pushed back the year of rate hike that the median view moved from 2014 to 2015. Thereafter, there is a decidedly muted climb in support for the 2015 view.
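That pivotal observation, the median view shifting from 2014 to 2015, can be sketched as a simple calculation. The outlook counts below are invented for illustration (the actual member-by-member data would come from the Fed's projections); only the median shift mirrors the story in the data.

```python
from statistics import median

# Hypothetical distributions of 19 committee members' expected year of the
# first rate hike, before and after the September 2012 meeting. The counts
# are made up; what matters is that enough members pushed back their year.
before = [2013] * 4 + [2014] * 8 + [2015] * 7
after = [2013] * 3 + [2014] * 6 + [2015] * 10

print(median(before), median(after))  # → 2014 2015
```

With just a few members pushing their forecast back a year, the middle of the distribution jumps a full year, which is the one salient event in the whole chart.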


This is an example in which plotting elemental data backfires. Raw data is the sanctuary of the incurious.



Circular but insufficient

One of my students analyzed the following Economist chart for her homework.


I looked for it online, and found an interactive version that is a bit different (link). Here are three screenshots from the online version for years 2009, 2013 and 2018. The first and last snapshots correspond to the years depicted in the print version.


The online version provides a self-sufficiency test for the print version. In testing self-sufficiency, we want to see if the visual elements (i.e. the circular sectors on the print version) pull their own weight. The quick answer is no. The reader can't tell how much in sales each sector represents, nor can they reliably estimate the relative scales of print versus ebook (pink/red vs yellow/orange) or year-to-year growth rates.

As usual, when we see the entire data set printed on the chart itself, it is a giveaway that the visual elements are mere ornaments.

The online version does not have labels unless you hover over the hemispheres. But again it is a challenge to learn anything from the picture.

In the Trifecta checkup, this is a Type V chart.


This particular dataset is made for the bumps-style chart:





The missing Brazil effect, and BYOC charts

Announcement: I'm giving a free public lecture on telling and finding stories via data visualization at NYU on 7/15/2014. More information and registration here.


The Economist states the obvious: the current World Cup is atypically high-scoring (or poorly defended, for anyone who's never been bothered by the goal count). They dubiously dub it the Brazil effect (link).

Perhaps in a sly vote of dissent, the graphic designer came up with this effort:


(Thanks to Arati for the tip.)

The list of problems with this chart is long but let's start with the absence of the host country and the absence of the current tournament, both conspiring against our ability to find an answer to the posed question: did Brazil make them do it?


Turns out that without 2014 on the chart, the only other year in which Brazil hosted a tournament was 1950. But 1950 is not even comparable to the modern era. In 1950, there was no knock-out stage. The group stage had four groups, but of unequal sizes: two groups of four, one group of three and one group of two. Then, four teams were selected to play a round-robin final stage. This format is so different from today's that I find it silly to try to place them on the same chart.

These data simply provide no clue as to whether there is a Brazil effect.


The chosen design is a homework assignment for the fastidious reader. The histogram plots the absolute number of drawn matches. The number of matches played has tripled from 16 to 48 over those years so the absolute counts are highly misleading. It's worse than nothing because the accompanying article wants to make the point that we are seeing fewer draws this World Cup compared to the past. The visual presents exactly the opposite message! (Hint: Trifecta Checkup)

That is, unless you treat it as a homework assignment. You can take the row of numbers listed below the Cup years and compute the proportion of draws yourself. BYOC (Bring Your Own Calculator). Now, pay attention, because you want to use the numbers in parentheses (the number of matches), not the first number (the number of teams).

Further, don't get too distracted by the typos: in both 1982 and 1994, there were 24 teams playing, not 16 or 32. The number of matches (52 in each case) is correctly stated.


Wait, the designer provides the proportions at the bottom of the chart, via this device:


As usual, the bubble chart does a poor job conveying the data. I deliberately cropped out the data labels to demonstrate that the bubble element cannot stand on its own. This element fails my self-sufficiency test.


I find the legend challenging as well. The presentation should be flipped: look at the proportion of ties within each round, instead of looking at the overall proportion of ties and then breaking those ties down by round.

The so-called "knockout round" has taken many formats over the years. In early years, there were often two round-robin stages, followed by a smaller knockout round. Presumably the second round-robin stage has been classified as the "knockout stage".

Also notice the footnote, stating that third-place games are excluded from the histogram. This is exactly how I would do it too, because the third-place match is a dead rubber, in which no rational team would want to play extra time and a penalty shootout.

The trouble is inconsistency. The number of matches shown underneath the chart includes that third-place match so the homework assignment above actually has a further wrinkle: subtract one from the numbers in parentheses. The designer gets caught in this booby trap. The computed proportion of draws displayed at the bottom of the chart includes the third-place match, at odds with the histogram.
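The full BYOC exercise, with the third-place-match wrinkle, amounts to one division per tournament. A minimal sketch: the match totals of 52 for 1982 and 1994 come from the chart's footnotes, but the draw counts here are invented placeholders, since that is precisely the data the designer made us compute ourselves.

```python
# The "homework assignment": proportion of drawn matches per World Cup,
# consistent with the histogram, which excludes the third-place game.
tournaments = {
    # year: (matches listed under the chart, draws excluding third-place game)
    1982: (52, 15),  # draw count is a made-up illustration
    1994: (52, 12),  # draw count is a made-up illustration
}

for year, (matches, draws) in tournaments.items():
    # Subtract one match (the third-place game) so the denominator matches
    # the histogram's convention, unlike the proportions printed on the chart.
    proportion = draws / (matches - 1)
    print(f"{year}: {proportion:.1%} of matches drawn")
```

The designer's booby trap is exactly that `matches - 1` step: dividing by the printed 52 instead of 51 silently mixes the two conventions.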


Here is a revised version of the chart:



A few observations are in order:

  • The proportion of ties has been slowly declining over the last few Cups.
  • The drop in proportion of ties in 2014 is not drastic.
  • While the proportion of ties has dropped in the 2014 World Cup, the proportion of 0-0 ties has increased. (The gap between the two lines shows the ties with goals.)
  • In later rounds, since the 1980s, the proportion of ties has been fairly stable, between 20 and 35 percent.

Another reason for separate treatment is that the knockout stage had not yet started when this chart was published during the 2014 tournament. Instead of removing all of 2014, as the Economist did, I can include the group stage for 2014 but exclude 2014 from the knockout round analysis.

In the Trifecta Checkup, this is Type DV. The data do not address the question being posed, and the visual conveys the wrong impression.


Finally, there is one glaring gap in all of this. Some time ago (the football fans can fill in the exact timing), FIFA decided to award three points for a win instead of two. This was a deliberate effort to increase the point differential between winning and drawing, supposedly to reduce the chance of ties. Any time-series exploration of the frequency of ties would clearly have to look into this issue.


Update on Dataviz Workshop 1

Happy to report on the dataviz workshop, a first-time offering at NYU. I previously posted the syllabus here.

I made minor changes to the syllabus, adding Alberto Cairo's book, The Functional Art (link), as optional reading, some articles from the recent debate in the book review circle about the utility of "negative reviews" (start here), and some blog posts by Stephen Few.

The Cairo and Few readings, together with Tufte, are closest to what I want to accomplish in the first two classes, before we start discussing individual projects: encouraging students to adopt the mentality of the course, that is to say, to think of dataviz as an art form. An art form implies many things, one of which is a seriousness about the output, and another is the recognition that the work has an audience.

The field of data visualization is sorely lacking high-level theory, immersed as so many of us are in tools, data, and rules of thumb. It is my hope that these workshop discussions will lead to a crystallization of the core principles of the field.

We went on a tour of many dataviz blogs, and documented various styles of criticism. In the next class, we will discuss what style we'd adopt in the course.


The composition of the class brings me great excitement. There are 12 enrolled students, which is probably the maximum for a class of this type.  One student subsequently dropped out, after learning that the workshop is really not for true beginners.

The workshop participants come from all three schools of dataviz: computer science, statistics, and design. Amongst us are an academic economist trained in statistical methods, several IT professionals, and an art director. This should make for rewarding conversation, as inevitably there will be differences in perspective.


REQUEST FOR HELP: A variety of projects have been proposed; several are using this opportunity to explore data sets from their work. That said, some participants are hoping to find certain datasets. If you know of good sources for the following, please write a comment below and link to them:

  • Opening-day ratings from sites like Rotten Tomatoes
  • New York City water quality measures by county (or other geographical unit), probably from an environmental agency
  • Data about donors/donations to public media companies


Since this is a dataviz blog, I want to include a chart with this post. I did a poll of the enrolled students, and one of the questions was about what dataviz tools they use to generate charts. I present here two views of the same data.

The first is a standard column chart, plotting the number of students who include a particular tool in their toolset (each student is allowed to name more than one tool). This presents a simple piece of information simply: Excel is the most popular, although the long tail indicates the variety of tools people use in practice.


What the first option doesn't bring out is the correlation between tools, that is, which tools tend to be used by the same participant. The second option makes this clear, with each column representing a student. This chart is richer, as it also provides information on how many tools the average student uses, and the relationship between different tools.
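The difference between the two views is really a difference in how the same survey responses are aggregated. Here is a sketch with invented responses (the actual poll data aren't published): the first aggregation feeds the column chart, the second surfaces the co-occurrence that only the per-student view reveals.

```python
from collections import Counter
from itertools import combinations

# Hypothetical responses: each student lists the tools in his or her toolset.
responses = [
    ["Excel", "R"],
    ["Excel"],
    ["Excel", "Tableau", "R"],
    ["Python", "R"],
    ["Excel", "Python"],
]

# View 1: how many students use each tool (input to the column chart).
tool_counts = Counter(tool for tools in responses for tool in tools)

# View 2: which tools co-occur in the same toolset, and how many tools
# each student uses -- information the first view throws away.
pair_counts = Counter(
    pair for tools in responses for pair in combinations(sorted(tools), 2)
)
avg_tools = sum(len(tools) for tools in responses) / len(responses)

print(tool_counts.most_common())
print(pair_counts.most_common(3))
print(avg_tools)
```

Note that `tool_counts` can be derived from the per-student data but not vice versa, which is why the second view is the richer one.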


The tradeoff is that the reader has to work a little more to understand the relative importance of the different tools, a message that is very clear in the first option. 

This second option is also not scalable. If there are thousands of students, the chart will lose its punch (although it will undoubtedly be called beautiful).

Which version do you like? Are there even better ways to present this information?


Hate the defaults

One piece of advice I give to those wanting to get into data visualization is to trash the defaults (see the last part of this interview with me). Jon Schwabish, an economist with the government, gives a detailed example of how this is done in a guest blog on the Why Axis.

Here are the highlights of his piece.


He starts with a basic chart, published by the Bureau of Labor Statistics. You can see the hallmarks of the Excel chart using the Excel defaults. The blue, red, green color scheme is most telling.



Just by making small changes, like using tints as opposed to different colors, using columns instead of bars, reordering the industry categories, and placing the legend text next to the columns, Schwabish made the chart more visually appealing and more effective.


 The final version uses lines instead of columns, which will outrage some readers. It is usually true that a grouped bar chart should be replaced by overlaid line charts, and this should not be limited to so-called discrete data.


Schwabish included several bells and whistles. The three data points are not evenly spaced in time. The year-on-year difference is separately plotted as a bar chart on the same canvas. I'd consider using a line chart here as well... and lose the vertical axis since all the data are printed on the chart (or else, lose the data labels).

This version is considerably cleaner than the original.


I noticed that the first person to comment on the Why Axis post said that internal BLS readers resist more innovative charts, claiming "they don't understand it". This is always a consideration when departing from standard chart types.

Another reader likes the "alphabetical order" (so to speak) of the industries. He raises another key consideration: who is your audience? If the chart is only intended for specialist readers who expect to find certain things in certain places, then the designer's freedom is curtailed. If the chart is used as a data store, then the designer might as well recuse him/herself.


Experiments with multiple dimensions

Reader (and author) Bernard L. sends us to the Economist (link), where they walked through a few charts they sketched to show data relating to the types of projects that get funded on Kickstarter. The three metrics collected were total dollars raised, average dollars per project, and the success rate of different categories of projects.

Here's the published version, which is a set of bar charts, ranked by individual metrics, and linked by colors.


This bar chart does the job. The only challenge is the large number of colors. But otherwise, it's not hard to see that fashion projects have the worst success rate and raised relatively little money overall, although the average pledge per project tended to be higher than in other categories.

The following chart used more of a Bumps chart aesthetic. It dropped the average pledge per project metric, which I think is a reasonable design choice. The variance in pledge amounts is probably pretty high, and thus the average may not be a good metric anyway. The Bumps format, though, suffers because there are too many categories and the two metrics are rather uncorrelated, resulting in a spider web. Instead of using colors as a link, this format uses explicit lines as links between the metrics.


The following version combines features from both. It requires no colors. It drops the third metric, while adopting the bar chart format. The two charts retain the same order of categories so that one can read across to learn about both metrics.



PS. Readers want to see a scatter plot:


The overall pattern is clearer on a scatter plot. When there are so many categories, it's a pain to put the data labels on the chart. It's odd that the amount pledged for games is the highest of the categories and yet it has among the lowest rate of being fully funded. Is this a sign of inefficiency?

Gelman joins in the fun

The great Andrew Gelman did a Junk Charts style post today, and very well indeed.

The offending Economist plot is the donut chart, which is a favorite of that magazine.  I commented on this type of chart before.


Andrew created two alternatives, one is a line chart (profile chart) which is often a better option (despite the data being categorical), the other is more creative, and the better of the two.




Some of Gelman's readers complained that he arbitrarily "standardized" the data by indexing against the average of the countries depicted; one can further grumble that a 50% "excess" may sound impressive but it would be equivalent to less than an hour, perhaps not as startling. These types of complaints are fair but do realize that blog posts like these are primarily concerned with how data is best visualized. If one prefers a different indexing method, or a different set of countries, or a different color for the lines, etc., one can easily revise the chart to reflect those preferences.
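The indexing step that drew the complaints is simple to state: express each country's figure relative to the average across the countries shown. A sketch with invented minutes-per-day figures (the actual time-use values are not reproduced here) also shows why a percentage "excess" can correspond to a modest absolute difference:

```python
# Hypothetical daily minutes spent on some activity, by country.
minutes = {"France": 530, "Germany": 500, "USA": 470, "Japan": 440}

# Index each country against the average of the countries depicted.
avg = sum(minutes.values()) / len(minutes)
indexed = {c: v / avg - 1 for c, v in minutes.items()}  # +0.05 means 5% above average

for country, excess in indexed.items():
    # The same gap in both relative and absolute terms: a striking-looking
    # percentage may translate to only a handful of minutes.
    print(f"{country}: {excess:+.1%} vs average ({minutes[country] - avg:+.0f} min)")
```

Swapping in a different baseline (say, one reference country instead of the group average) is a one-line change, which is the point: the indexing choice is easily revised, the visual form is the harder design decision.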

The easiest way to see why the third chart is better than the first is this: the strongest message coming off the first chart is that there are no material differences between these six countries in terms of time usage, whereas in the third chart, the designer (here, it's Gelman) asserts that there are interesting differences.

Another view of the Indian states

The previous post has elicited protests of "it's not that bad" from some corners. Well, it's bad. Let's look at it from another angle.


We start with the Economist chart, and ask what is the message.

The chart is saying that in 2006-8, there were 10 Indian states with female-to-male baby ratios below the world average (the so-called "natural ratio"). For those who know their Indian geography, the chart gives the names of these anomalous states. The chart also tells us that among these 10 states, some have been gaining and others losing ground compared to the 2001-3 period. There is no obvious pattern as to which states are gaining, and which losing.

That's pretty much everything that one can discern from this chart.

The problem is, the average Economist reader already knows that in India, as in China and many other Asian countries, more male babies are born relative to female ones than in other parts of the world. If he or she doesn't know this fundamental statistic, the chart does not help, because it says nothing about the other 24 states that make up the Indian average.

Worse, the chart raises the suspicion of voodoo statistics. It suggests that the other 24 states have a gender ratio that is at least equal to, if not above, the natural ratio. One would then have to believe that either the overall Indian average is higher than the natural ratio, or the negative deviations from the world average (as shown on the chart) are quite a bit larger than the positive deviations (not shown), or that the states with positive deviations (not shown) are generally less populous than the ones shown.

Either of the last two conclusions, if true, would be interesting because it implies that the cultural norms, typically claimed to explain this anomaly, are entrenched only within certain geographies. Then, it is inappropriate to speak of India's sex ratio, given this variability between states.


As I pointed out in the prior post, with two data series (two observation dates of the same statistic) at their disposal, the Economist chart focuses on the more recent data. This self-imposed restriction obscures meaningful differences between states over time.

The junkcharts version shows that the current 11 "worst" states can be clustered into two groups: the first group (black lines) has gained ground over the last decade, while the second group (gray) has stagnated, and in some cases, lost ground.

What's more, we learn that every one of the states in the gray group is ranked higher than those in the black group at the start of the decade.

Further, while the distance between the black and gray groups has narrowed over the decade, the gray group, despite the slight decline, is still ranked above the black group, except for Kerala, which has seen dramatic improvement.

Those who know their Indian geography might have further insights as to why the states cluster in this way.

In my view, these findings are much more interesting than the things one can learn from the original chart.




A skewed view of ten Indian states

The Economist published this chart to illustrate the problem of the "missing girls" in Indian society.

The girls-to-boys ratio (ages 0-6) should be about 952 but in India, it is 914. That's an average number for 35 territories, and the most skewed ratio was 830 in Punjab.

Curiously, the Economist chose to focus on only 11 states instead of showing all 35. The first 10 of these had sex ratios below the natural number of 952 while the last one was above the average. Nowhere on the chart or in the article is it explained whether the unmentioned 24 states all had above-average sex ratios: unlikely, unless certain states have much larger youth populations than others.

In fact, the reference line of 952 is misplaced. Readers will find that there are two metrics depending on which survey one is looking at, either sex ratio at birth or sex ratio for children aged 0-6. The natural ratio of 952 is for the 0-6 measure but the data by territory are all for the at-birth measure. Instead, the dotted red line needs to be at 904, which is the national average sex ratio at birth for India for the 2006-8 period.


The lethal error in this chart is not starting the horizontal axis at zero. By cutting off the same amount from each bar, the original chart messes up the ratio of lengths, and presents a misleading picture of the relative sex ratios between territories. We may think Punjab's sex ratio is half that of Gujarat (in the original chart) but as the chart on the right shows, that is far from the truth!
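The distortion is easy to quantify. Punjab's 830 is from the article; Gujarat's value and the axis baseline below are assumed figures for illustration only, but the mechanism is the same for any truncated axis: the bars encode the values minus the baseline, not the values themselves.

```python
# How a truncated axis distorts the ratio of bar lengths.
punjab, gujarat = 830, 905  # gujarat's 905 is a hypothetical figure
baseline = 800              # hypothetical axis start in the original chart

true_ratio = punjab / gujarat                                # what the data say
apparent_ratio = (punjab - baseline) / (gujarat - baseline)  # what the bars show

print(f"true ratio:     {true_ratio:.2f}")   # close to 1: the ratios barely differ
print(f"apparent ratio: {apparent_ratio:.2f}")  # Punjab's bar looks far shorter
```

Under these assumed numbers, two values that differ by under ten percent produce bars whose lengths differ by a factor of three or more, which is exactly the Punjab-versus-Gujarat illusion.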


The other unfortunate practice, typical of the Economist, is to stick a second set of data on the right of the chart as an afterthought. In fact, that data representing the change in the sex ratio over time is more interesting than what the exact sex ratio was in each territory in 2006-8.

A much better way to present the data, without favoring one series over the other, is the Bumps chart, as shown below. We can clearly see that the improvement in sex ratio is concentrated in those states that started the decade in a worse shape.