Update on Dataviz Workshop 1

Happy to report on the dataviz workshop, a first-time offering at NYU. I previously posted the syllabus here.

I made minor changes to the syllabus, adding Alberto Cairo's book, The Functional Art (link), as optional reading, some articles from the recent debate in the book review circle about the utility of "negative reviews" (start here), and some blog posts by Stephen Few.

The Cairo and Few readings, together with Tufte, are closest to what I want to accomplish in the first two classes, before we start discussing individual projects: encouraging students to adopt the mentality of the course, that is to say, to think of dataviz as an artform. An artform implies many things, one of which is a seriousness about the output, and another is the recognition that the work has an audience. 

The field of data visualization is sorely lacking high-level theory, immersed as so many of us are in tools, data, and rules of thumb. It is my hope that these workshop discussions will lead to a crystallization of the core principles of the field.

We went on a tour of many dataviz blogs, and documented various styles of criticism. In the next class, we will discuss what style we'd adopt in the course.


The composition of the class brings me great excitement. There are 12 enrolled students, which is probably the maximum for a class of this type.  One student subsequently dropped out, after learning that the workshop is really not for true beginners.

The workshop participants come from all three schools of dataviz: computer science, statistics, and design. Amongst us are an academic economist trained in statistical methods, several IT professionals, and an art director. This should make for rewarding conversation, as inevitably there will be differences in perspective.


REQUEST FOR HELP: A variety of projects have been proposed; several are using this opportunity to explore data sets from their work. That said, some participants are hoping to find certain datasets. If you know of good sources for the following, please write a comment below and link to them:

  • Opening-day ratings from sites like Rotten Tomatoes
  • New York City water quality measures by county (or other geographical unit), probably from an environmental agency
  • Data about donors/donations to public media companies


Since this is a dataviz blog, I want to include a chart with this post. I did a poll of the enrolled students, and one of the questions was about what dataviz tools they use to generate charts. I present here two views of the same data.

The first is a standard column chart, plotting the number of students who include a particular tool in their toolset (each student may name more than one tool). This presents a simple piece of information simply: Excel is the most popular, although the long tail indicates the variety of tools people use in practice.


What the first option doesn't bring out is the correlation between tools, indicated by several tools used by the same participant. The second option makes this clear, with each column representing a student. This chart is richer as it also provides information on how many tools the average student uses, and the relationship between different tools.


The tradeoff is that the reader has to work a little more to understand the relative importance of the different tools, a message that is very clear in the first option. 

This second option is also not scalable. If there are thousands of students, the chart will lose its punch (although it will undoubtedly be called beautiful).
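To make the difference between the two views concrete, here is a minimal sketch of the tallies each chart relies on. The `responses` list is made up for illustration and is not the actual poll data; the variable names are likewise hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical poll responses: one set of tools per student (NOT the real class data).
responses = [
    {"Excel", "R"},
    {"Excel", "Tableau"},
    {"Excel"},
    {"R", "Python", "Excel"},
    {"Illustrator"},
]

# View 1 (column chart): number of students naming each tool.
tool_counts = Counter(tool for tools in responses for tool in tools)

# View 2 (one column per student) also exposes:
# how many tools each student uses...
tools_per_student = [len(tools) for tools in responses]
# ...and how often two tools are named by the same student.
pair_counts = Counter(
    pair for tools in responses for pair in combinations(sorted(tools), 2)
)

print(tool_counts.most_common())
print(sum(tools_per_student) / len(tools_per_student))
print(pair_counts.most_common(3))
```

The first tally is all the column chart can show. The per-student counts and the pair counts are the extra information carried by the student-by-student view.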

Which version do you like? Are there even better ways to present this information?


Hate the defaults

One piece of advice I give to those wanting to get into data visualization is to trash the defaults (see the last part of this interview with me). Jon Schwabish, an economist with the government, gives a detailed example of how this is done in a guest blog on the Why Axis.

Here are the highlights of his piece.


He starts with a basic chart, published by the Bureau of Labor Statistics. You can see the hallmarks of the Excel chart using the Excel defaults. The blue, red, green color scheme is most telling.



Just by making small changes, like using tints as opposed to different colors, using columns instead of bars, reordering the industry categories, and placing the legend text next to the columns, Schwabish made the chart more visually appealing and more effective.


The final version uses lines instead of columns, which will outrage some readers. It is usually true that a grouped bar chart should be replaced by overlaid line charts, and this should not be limited to so-called discrete data.


Schwabish included several bells and whistles. The three data points are not evenly spaced in time. The year-on-year difference is separately plotted as a bar chart on the same canvas. I'd consider using a line chart here as well... and lose the vertical axis since all the data are printed on the chart (or else, lose the data labels).

This version is considerably cleaner than the original.


I noticed that the first person to comment on the Why Axis post said that internal BLS readers resist more innovative charts, claiming "they don't understand it". This is always a consideration when departing from standard chart types.

Another reader likes the "alphabetical order" (so to speak) of the industries. He raises another key consideration: who is your audience? If the chart is only intended for specialist readers who expect to find certain things in certain places, then the designer's freedom is curtailed. If the chart is used as a data store, then the designer might as well recuse him/herself.


Experiments with multiple dimensions

Reader (and author) Bernard L. sends us to the Economist (link), where they walked through a few charts they sketched to show data relating to the types of projects that get funded on Kickstarter. The three metrics collected were total dollars raised, average dollars per project, and the success rate of different categories of projects.

Here's the published version, which is a set of bar charts, ranked by individual metrics, and linked by colors.


This bar chart does the job. The only challenge is the large number of colors. But otherwise, it's not hard to see that fashion projects have the worst success rate and raised relatively little money overall although the average pledge amount tended to be higher than average.

The following chart used more of a Bumps chart aesthetic. It dropped the average pledge per project metric, which I think is a reasonable design choice. The variance in pledge amount is probably pretty high and thus the average may not be a good metric anyway. The Bumps format though suffers because there are too many categories and the two metrics are rather uncorrelated, resulting in a spider web. Instead of using colors as a link, this format uses explicit lines as links between the metrics.


The following version combines features from both. It requires no colors. It drops the third metric, while adopting the bar chart format. The two charts retain the same order of categories so that one can read across to learn about both metrics.



PS. Readers want to see a scatter plot:


The overall pattern is clearer on a scatter plot. When there are so many categories, it's a pain to put the data labels on the chart. It's odd that the amount pledged for games is the highest of the categories and yet it has among the lowest rate of being fully funded. Is this a sign of inefficiency?

Gelman joins in the fun

The great Andrew Gelman did a Junk Charts style post today, and very well indeed.

The offending Economist plot is the donut chart, which is a favorite of that magazine.  I commented on this type of chart before.


Andrew created two alternatives. One is a line chart (profile chart), which is often a better option (despite the data being categorical). The other is more creative, and the better of the two.




Some of Gelman's readers complained that he arbitrarily "standardized" the data by indexing against the average of the countries depicted; one can further grumble that a 50% "excess" may sound impressive but it would be equivalent to less than an hour, perhaps not as startling. These types of complaints are fair but do realize that blog posts like these are primarily concerned with how data is best visualized. If one prefers a different indexing method, or a different set of countries, or a different color for the lines, etc., one can easily revise the chart to reflect those preferences.
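For readers who want to try a different standardization, the indexing step is only a few lines. This is a sketch under made-up numbers: the country labels and minutes below are hypothetical, not Gelman's data.

```python
# Hypothetical minutes per day spent on one activity, by country (NOT Gelman's data).
minutes = {"A": 60, "B": 90, "C": 120, "D": 150}

# Baseline: the simple average of the countries depicted.
mean_minutes = sum(minutes.values()) / len(minutes)  # 105.0

# Index each country against the group average; 100 means "equal to the average".
index = {c: 100 * m / mean_minutes for c, m in minutes.items()}

# Translate the relative excess back into absolute minutes.
excess_minutes = {c: m - mean_minutes for c, m in minutes.items()}

for country in minutes:
    print(country, round(index[country]), excess_minutes[country])
```

In this toy data, country D sits about 43% above the average, which is 45 minutes in absolute terms. Swapping in a different baseline (a median, or a single reference country) only changes the `mean_minutes` line.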

The easiest way to see why the third chart is better than the first is to compare their messages. The strongest message coming off the first chart is that there are no material differences between these six countries in terms of time usage, whereas in the third chart the designer (here, it's Gelman) is asserting that there are interesting differences.

Another view of the Indian states

The previous post has elicited protests of "it's not that bad" from some corners. Well, it's bad. Let's look at it from another angle.


We start with the Economist chart, and ask what is the message.

The chart is saying that in 2006-8, there are 10 Indian states that have female-to-male-babies ratios below the world average (the so-called "natural ratio"). For those who know their Indian geography, the chart gives the names of these anomalous states. The chart also tells us among these 10 states, some have been gaining and others have been losing ground when compared to the 2001-3 period. There is no obvious pattern as to which states are gaining, and which losing.

That's pretty much everything that one can discern from this chart.

The problem is, the average Economist reader already knows that in India, as in China and many Asian countries, the proportion of male babies relative to female ones is higher than in other parts of the world. If he or she doesn't know this fundamental statistic, the chart does not help because it says nothing about the other 24 states that make up the Indian average.

Worse, the chart raises the suspicion of voodoo statistics. It suggests that the other 24 states have a gender ratio that is at least equal to, if not above, the natural ratio. One would then have to believe that either the overall Indian average is higher than the natural ratio, or the negative deviations from the world average (as shown on the chart) are quite a bit larger than the positive deviations (not shown), or that the states with positive deviations (not shown) are generally less populous than the ones shown.
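The arithmetic behind this suspicion is a weighted average: the national girls-per-1,000-boys figure is the state ratios weighted by the number of boys in each state. Here is a rough sketch with made-up group averages (850 for the charted states, 960 for the rest); only the 904 national figure comes from the discussion of this chart.

```python
def national_ratio(groups):
    """Girls per 1,000 boys overall, from (boys, ratio) pairs per group of states."""
    total_boys = sum(boys for boys, _ in groups)
    return sum(boys * ratio for boys, ratio in groups) / total_boys

# Hypothetical group averages: charted states at 850, uncharted states at 960.
charted, uncharted = 850, 960

# With equal numbers of boys in both groups, the national figure is the midpoint.
equal_weight = national_ratio([(1.0, charted), (1.0, uncharted)])  # 905.0

# Share of boys that must be in the charted states for the national figure to be 904.
share_needed = (uncharted - 904) / (uncharted - charted)

print(equal_weight, round(share_needed, 3))
```

Under these assumed averages, the charted states would have to account for roughly half of all boys. That is the kind of imbalance in size or in deviation that the chart leaves the reader to guess at.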

Either of the last two conclusions, if true, would be interesting because it implies that the cultural norms, typically claimed to explain this anomaly, are entrenched only within certain geographies. Then, it is inappropriate to speak of India's sex ratio, given this variability between states.


As I pointed out in the prior post, with two data series (two observation dates of the same statistic) at their disposal, the Economist chart focuses on the more recent data. This self-imposed restriction obscures meaningful differences between states over time.

The junkcharts version shows that the current 11 "worst" states could be clustered into two groups: the first group (black lines) has gained ground over the last decade, while the second group (gray) has stagnated, and in some cases, lost ground.

What's more, we learn that every one of the states in the gray group is ranked higher than those in the black group at the start of the decade.

Further, while the distance between the black and gray groups has narrowed over the decade, the gray group, despite the slight decline, is still ranked above the black group except for Kerala, which has seen dramatic improvement.

Those who know their Indian geography might have further insights as to why the states cluster in this way.

In my view, these findings are much more interesting than the things one can learn from the original chart.




A skewed view of ten Indian states

The Economist published this chart to illustrate the problem of the "missing girls" in Indian society.

The girls-to-boys ratio (ages 0-6) should be about 952 but in India, it is 914. That's an average number for 35 territories, and the most skewed ratio was 830 in Punjab.

Curiously, the Economist chose to focus on only 11 states instead of showing all 35. The first 10 of these had sex ratios below the natural number of 952 while the last one was over the average. Nowhere on the chart or in the article is it explained whether the unmentioned 24 states all had above-average sex ratios: unlikely, unless certain states have much higher youth population than others.

In fact, the reference line of 952 is misplaced. Readers will find that there are two metrics depending on which survey one is looking at, either sex ratio at birth or sex ratio for children aged 0-6. The natural ratio of 952 is for the 0-6 measure but the data by territory are all for the at-birth measure. Instead, the dotted red line needs to be at 904, which is the national average sex ratio at birth for India for the 2006-8 period.


The lethal error in this chart is not starting the horizontal axis at zero. 
By cutting off the same amount from each bar, this chart messes up the ratio of lengths, and presents a misleading picture of the relative sex ratio between territories. We may think Punjab's sex ratio is half that of Gujarat (in the original chart) but as the chart on the right shows, that is far from the truth!
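A quick calculation shows how much a truncated axis distorts the comparison. The sketch below uses two figures quoted in this post (830 for Punjab, 914 for India overall); the axis start of 750 is an assumption for illustration, not a value read off the Economist chart.

```python
def apparent_ratio(a, b, baseline):
    """Ratio of bar lengths when the axis starts at `baseline` instead of zero."""
    return (a - baseline) / (b - baseline)

punjab, india = 830, 914  # figures quoted in the post
baseline = 750            # hypothetical axis start, NOT taken from the chart

true_ratio = apparent_ratio(punjab, india, 0)          # about 0.91
drawn_ratio = apparent_ratio(punjab, india, baseline)  # about 0.49

print(round(true_ratio, 2), round(drawn_ratio, 2))
```

With that assumed baseline, a bar that should be about 91% as long as its neighbor is drawn at less than half the length.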


The other unfortunate practice, typical of the Economist, is to stick a second set of data on the right of the chart as an afterthought. In fact, that data representing the change in the sex ratio over time is more interesting than what the exact sex ratio was in each territory in 2006-8.

A much better way to present the data, without favoring one series or another, is the Bumps chart, as shown below. We can clearly see that the improvement in sex ratio is concentrated on those states that started out the decade in a worse shape.



Perhaps the Economist doesn't take its own advice

Given the recent post questioning the value of the MBA degree, one would think the Economist powers-that-be would not be staffing up MBAs. But then, if not useless MBAs, how would the Economist explain this chart they printed next to the said article?

This chart appears to tell us that all the top MBA programs succeed in reducing their students' earning potential. In each case, the "pre-MBA salary" exceeds the "salary on graduation".

More likely, the red part is the incremental salary, possibly explained by the value of the degree while the gray part is the pre-MBA salary.

However, since the author has few nice words to say about business schools, one can never be 100% sure if he is presenting some counter-intuitive data.


In the Trifecta checkup, one would find nothing wrong with the chart type, nor is there anything wrong with asking the return on investment of an MBA degree.

The third component -- having the right data -- is what renders this effort a failure. It is too simplistic to measure return on investment on the salary upon graduation. Surely, one must also include future career paths, intangible benefits from network relationships, personal development, etc.


Loss aversion and faux accuracy

Reader Bernie M. is not a fan of this Economist chart.

The chart was prepared by Aurora Flight Sciences, an aircraft manufacturer, commissioned by a professor who supports the concept of maintaining a fleet to pump sulphuric acid into the stratosphere as a way to induce artificial cooling to counteract human-induced global warming.

The chart appears to compare many different ways of shooting the acid into the skies along two dimensions: cost and altitude.

Bernie wrote:

I find the choice of axes extremely counterintuitive. Altitude one would expect on the y-axis. And mixing up the scatter chart elements with the connected line chart doesn't really help either.

The convention regarding axes is to put the outcome variable on the vertical, and the explanatory variable on the horizontal. Thus, in this case, if the cost of a particular solution is primarily determined by the "altitude" (presumably of where the acid would be released), then the designer has followed convention. It is unfortunate that "altitude" is more intuitively put on the vertical axis, but I suspect that defying convention might cause more confusion.

On the other hand, if altitude and cost are not related to each other but two different metrics to evaluate geoengineering concepts, then Bernie's point is right on - swap the axes!


The use of connected lines for two of the solutions but not the rest is a symptom of what I have called "loss aversion": the horror of leaving some of the data on the cutting-room floor.

The only mention of altitude in the article refers to Aurora's assertion that it is sufficient to use newly designed aircraft flying at 20-25 kilometers. If that is Aurora's preferred solution, there is little reason to show all the other altitude configurations that are suboptimal.

Perhaps the designer wants to make the point that the Boeing 747 solution is inferior to the Aurora solution because Aurora could design aircraft to fly at 10-15 km at a lower cost?  If so, then the chart is very misleading in not providing a comparable cost for Boeing's solution if required to fly at 20-25 km.

When comparing different entities, it is always a bad idea to treat the entities differently. Comparison is only possible on equal footing.

In fact, I think the chart would be a lot clearer if they dropped the altitude dimension on the floor. For each solution, plot the yearly cost at the optimal altitude selected by the respective engineers. Use a bar chart. With a single dimension, it is much easier to accommodate the very long data labels.

(Now, I'd defer to the geoengineers as to whether the altitude dimension is dispensable. I don't have any expertise in this science. Judging from the Aurora red line, I'm assuming that there can be feasible solutions at all altitudes, which leads me to conclude that altitude isn't all that.)


So what is the biggest problem with this chart? It is the faux accuracy.

Given the tremendous amount of uncertainty surrounding these projected costs, one would expect very big error bars around the cost estimates. Using single dots with no error bars is hard to stomach.



Stone-age graphic

This Economist chart on the history of world GDP throws the art of graphics back several hundred years. (Thanks Tyler A. for the link.)


And I can't really re-make it since I can't make heads or tails of it.

  • How are the columns sorted? (on second thought, maybe the 70 should read 1870, 13 is 1913, and so on?)
  • Why are there differing gaps between columns?
  • Italy was not a country, and the US was definitely not in existence in AD1 so what does it mean to have values for those on the chart? If this is created by taking current-day boundaries and projecting back in time, why are today's boundaries treated as sacrosanct?
  • If the columns are sorted chronologically, a line chart would be much more readable. At the minimum, it will reduce the number of colors to 1. Note that multiple colors are necessitated by the choice of a stacked column chart.
  • A stacked column chart with percentages should always extend to 100%. The current chart is very misleading if we want to know the percentage of world GDP produced by "other countries".
  • How are the countries ordered within a column? It's neither alphabetical, nor by the starting or ending distributions.
  • Don't challenge readers by having vertically stacked categories and a horizontal legend.
  • It would also be much better if there are annotations to help the reader understand the chart, e.g. collapse of the Roman Empire, Renaissance, Great Depression, Big Fire, etc.

PS. [8/18/10] Dustin linked to a line-chart version of this chart, from the World Bank site, via Chartporn.


I think the evidence is right here as to why the Economist execution leaves a ton to be desired. The use of lines allows the reader to easily trace the rise and fall of different economies, which is the point of the data set. The stacked-column chart draws attention to a point-in-time distribution of GDP among different countries, which is of secondary importance.

There are other differences: this plots the share of "growth" as opposed to the share of total GDP. It also plots regions rather than countries (well, except for China and Japan). It does not presuppose that the US was in existence before its founding. It could have (should have) included a "rest of the world" line.

The spacing of the years is still problematic but it's an Excel inconvenience, really. It's ok to stretch the axis on a line chart, but it's a problem to do it with a column chart, as demonstrated above. The gaps between columns should be proportional to the years between the data but this is impossible to do in a column chart.
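To see how far equal spacing strays from a time-proportional axis, here is a small sketch. The list of years is an assumption for illustration: only AD 1, 1870 and 1913 are mentioned in this post, and the others are placeholders.

```python
# Hypothetical observation years; only 1, 1870 and 1913 appear in the post.
years = [1, 1000, 1500, 1820, 1870, 1913]

span = years[-1] - years[0]

# Position of each year on a 0-to-1 axis when gaps are proportional to elapsed time.
proportional = [(y - years[0]) / span for y in years]

# Position of each year when the points are spaced equally (category-style axis).
equal = [i / (len(years) - 1) for i in range(len(years))]

for y, p, e in zip(years, proportional, equal):
    print(f"{y:>5}  proportional={p:.3f}  equal={e:.3f}")
```

On the proportional axis, the last three years in this list crowd into roughly the final 5% of the width, whereas equal spacing hands them 40% of it.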