Selecting the right analysis plan is the first step to good dataviz

It's a new term, and my friend Ray Vella shared some student projects from his NYU class on infographics. There's always something to learn from these projects.

The starting point is a chart published in the Economist a few years ago.


This is a challenging chart to read. To save you the time, the following key points are pertinent:

a) income inequality is measured by the disparity between regional averages

b) the incomes are given in a double index, a relative measure. For each country and year combination, the average national GDP is set to 100. A value of 150 means the richest region of Spain has an average income that is 50% higher than Spain's national average in the year 2015.

The original chart - as well as most of the student work - is based on a specific analysis plan. The difference in the index values between the richest and poorest regions is used as a measure of the degree of income inequality, and the change in that difference over time as a measure of the change in income inequality over time. That is as much of a mouthful as it sounds.

This analysis plan can be summarized as:

1) all incomes -> relative indices, at each region-year combination
2) inequality = rich - poor region gap, at each country-year combination
3) inequality over time = inequality in 2015 - inequality in 2000, for each country
4) country difference = inequality in country A - inequality in country B, for each year
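
To make the four steps concrete, here is a minimal sketch in Python. The countries "A" and "B" and all income figures are made up for illustration; they are not the Economist's data, and the function names are hypothetical.

```python
# Hypothetical per-capita incomes in currency units (not the Economist's data).
incomes = {
    ("A", 2000): {"national": 20000, "richest": 28000, "poorest": 15000},
    ("A", 2015): {"national": 30000, "richest": 45000, "poorest": 21000},
    ("B", 2000): {"national": 25000, "richest": 30000, "poorest": 22500},
    ("B", 2015): {"national": 40000, "richest": 50000, "poorest": 34000},
}


def to_index(value, national):
    """Step 1: express an income relative to the national average (= 100)."""
    return 100 * value / national


def gap(country, year):
    """Step 2: richest-region index minus poorest-region index."""
    row = incomes[(country, year)]
    rich = to_index(row["richest"], row["national"])
    poor = to_index(row["poorest"], row["national"])
    return rich - poor


def gap_change(country, start=2000, end=2015):
    """Step 3: change in the gap over time, for one country."""
    return gap(country, end) - gap(country, start)


def country_difference(a, b, year):
    """Step 4: difference in the gap between two countries, for one year."""
    return gap(a, year) - gap(b, year)


for country in ("A", "B"):
    print(country, gap(country, 2000), gap(country, 2015), gap_change(country))
```

Every number the reader ends up comparing is a difference of differences of indices, which is part of why the chart is hard to read.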


One student, J. Harrington, looks at the data through an alternative lens that brings clarity to the underlying data. Harrington starts with change in income within the richest regions (then the poorest regions), so that a worsening income inequality should imply that the richest region is growing incomes at a faster clip than the poorest region.

This alternative analysis plan can be summarized as:
1) change in income over time for richest regions for each country
2) change in income over time for poorest regions for each country
3) inequality = change in income over time: rich - poor, for each country
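
The alternative plan can be sketched in the same way. The figures below are again invented, and reading "change in income" as a percentage change is an assumption of this sketch; the student's chart may well use a different definition.

```python
# Hypothetical per-capita incomes in currency units, by country and region.
regional_income = {
    "A": {"richest": {2000: 30000, 2015: 45000},
          "poorest": {2000: 20000, 2015: 24000}},
    "B": {"richest": {2000: 40000, 2015: 50000},
          "poorest": {2000: 20000, 2015: 25000}},
}


def income_change(series, start=2000, end=2015):
    """Steps 1 and 2: percentage change in income between two years.

    Using a percentage (rather than a currency amount) is an assumption.
    """
    return 100 * (series[end] - series[start]) / series[start]


def inequality_trend(country):
    """Step 3: growth in the richest region minus growth in the poorest."""
    rich = income_change(regional_income[country]["richest"])
    poor = income_change(regional_income[country]["poorest"])
    return rich, poor, rich - poor


for country in regional_income:
    print(country, inequality_trend(country))
```

A positive third value means the richest region is pulling away from the poorest; a value of zero means the two grew at the same pace.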

The restructuring of the analysis plan makes a big difference!

Here is one way to show this alternative analysis:


The underlying data have not changed but the reader's experience is transformed.

To explain or to eliminate, that is the question

Today, I take a look at another project from Ray Vella's class at NYU.

Rich Get Richer Assigment 2 top

(The above image is a honeypot for "smart" algorithms that don't know how to handle image dimensions which don't fit their shadow "requirement". Human beings should proceed to the full image below.)

As explained in this post, the students visualized data about regional average incomes in a selection of countries. It turns out that remarkable differences persist in regional income disparity between countries, almost all of which are more advanced economies.

Rich Get Richer Assigment 2 Danielle Curran_1

The graphic is by Danielle Curran.

I noticed two smart decisions.

First, she came up with a different main metric for gauging regional disparity, landing on a metric that is simple to grasp.

Based on hints given on the chart, I surmised that Danielle computed the change in per-capita income in the richest and poorest regions separately for each country between 2000 and 2015. These regional income growth values are expressed in currency, not indexed. Then, she computed the ratio of these growth values, for each country. The end result is a simple metric for each country that describes how fast income has been growing in the richest region relative to the poorest region.
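
As a rough sketch of the metric as surmised above (the function name and the numbers are hypothetical, and the guard against a non-positive denominator is my own addition, not something shown on the chart):

```python
def growth_ratio(rich_start, rich_end, poor_start, poor_end):
    """Ratio of income growth (in currency) in the richest vs. poorest region."""
    rich_growth = rich_end - rich_start
    poor_growth = poor_end - poor_start
    if poor_growth <= 0:
        # Assumption of this sketch: refuse to report a ratio when the
        # poorest region did not grow, since the ratio loses its meaning.
        raise ValueError("poorest region's income did not grow")
    return rich_growth / poor_growth


# Made-up incomes: richest region gains 15000, poorest gains 4000.
print(growth_ratio(30000, 45000, 20000, 24000))
```

A ratio above 1 says the richest region added more income than the poorest over the period.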

One of the challenges of this dataset is the complex indexing scheme (discussed here). Carlos' solution keeps the indices but uses design to facilitate comparisons. Danielle avoids the indices altogether.

The reader is relieved of the need to make comparisons, and so can focus on differences in magnitude. We see clearly that regional disparity is by far the highest in the U.K.


The second smart decision Danielle made is organizing the countries into clusters. She took advantage of the horizontal axis which does not encode any data. The branching structure places different clusters of countries along the axis, making it simple to navigate. The locations of these clusters are cleverly aligned to the map below.


Danielle's effort is stronger on communications while Carlos' effort provides more information. The key is to understand who your readers are. What proportion of your readers would want to know the values for each country, each region and each year?


A couple of suggestions

a) The reference line should be set at 1, not 0, for a ratio scale. The value of 1 happens when the richest region and the poorest region have identical growth in per-capita income.

b) The vertical scale should be fixed.

Gazing at petals

Reader Murphy pointed me to the following infographic developed by Altmetric to explain their analytics of citations of journal papers. These metrics are alternative in that they arise from non-academic media sources, such as news outlets, blogs, Twitter, and Reddit.

The key graphic is the petal diagram with a number in the middle.


I have a hard time thinking of this object as “data visualization”. Data visualization should visualize the data. Here, the connection between the data and the visual design is tenuous.

There are eight petals arranged around the circle. The legend below the diagram maps the color of each petal to a source of data. Red, for example, represents mentions in news outlets, and green represents mentions in videos.

Each petal is the same size, even though the counts given below differ. So, the petals are like a duplicative legend.

The order of the colors around the circle does not align with their order in the table below, for a mysterious reason.

Then comes another puzzle. The bluish-gray petal appears three times in the diagram. This color is mapped to tweets. Does the number of petals represent the much higher counts of tweets compared to other mentions?

To confirm, I pulled up the graphic for a different paper.


Here, each petal has a different color. Eight petals, eight colors. The count of tweets is still much larger than the frequencies of the other sources. So, the rule of construction appears to be one petal for each relevant data source, and if the total number of data sources falls below eight, then let Twitter claim all the unclaimed petals.
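
The rule as inferred here can be written down in a few lines. This is a sketch of my reading of the examples, not Altmetric's actual code; the function name, the source names and the counts are hypothetical, and the sketch makes no attempt to reproduce the order of petals around the circle.

```python
TOTAL_PETALS = 8


def assign_petals(source_counts, filler="twitter", total=TOTAL_PETALS):
    """Return one source name per petal.

    Each source with at least one mention claims a petal; any petals left
    over go to `filler` (Twitter, under the inferred rule).
    """
    active = [name for name, count in source_counts.items() if count > 0]
    if len(active) > total:
        raise ValueError("more sources than petals")
    return active + [filler] * (total - len(active))


# Made-up counts for a paper mentioned by six kinds of sources.
petals = assign_petals({
    "news": 5, "blogs": 2, "twitter": 100,
    "facebook": 1, "video": 1, "reddit": 3,
})
print(petals.count("twitter"))
```

With six active sources, Twitter ends up with three of the eight petals, matching the first example, and the counts themselves never enter the picture.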

A third sample paper confirms this rule:


None of the places we were hoping to find data – size of petals, color of petals, number of petals – actually contain any data. Anything the reader wants to learn can be directly read. The “score” that reflects the aggregate “importance” of the corresponding paper is found at the center of the circle. The legend provides the raw data.


Some years ago, one of my NYU students worked on a project relating to paper citations. He eventually presented the work at a conference. I featured it previously.


Notice how the visual design provides context for interpretation – by placing each paper/researcher among its peers, and by using a relative scale (percentiles).


I’m ignoring the D corner of the Trifecta Checkup in this post. For any visualization to be meaningful, the data must be meaningful. The type of counting used by Altmetric treats every tweet, every mention, etc. as a tally, making everything worth the same. A mention on CNN counts as much as a mention by a pseudonymous redditor. A pan is the same as a rave. Let’s not forget the fake data menace (link), which affects all performance metrics.

Summer dataviz workshop to start July 1

Registration is open for my dataviz workshop at NYU. (link)

This is a workshop in the sense of a creative writing workshop. Your "writing" consists of sketches of data visualization based on your selected datasets. In class, we critique all of the work and produce revisions. You will learn to appreciate good dataviz, to offer constructive and insightful commentary on visualization, and to be discriminating in receiving feedback.

Last term, half the class worked on datasets that are related to their jobs. The data sources were diverse, ranging from scholarly citation data, World Bank data, commercial sales and market share data, mountaineering accidents data, standardized testing item data, speeches by death row inmates, juvenile convicts, etc.

Students pick their own tools. They used Excel, Powerpoint, Tableau, d3, etc.

Here is a past syllabus.

The course runs from July 1 to Aug 5. Register here.

Course Announcement: Data Visualization Workshop

The next installment of my data visualization workshop runs from April 7 to May 12 in New York City.

My workshop is modeled after a creative writing workshop. The focus of the six weeks is on giving and receiving feedback on a datavis project of the student's choosing. There are selected readings and industry speakers who provide some perspective on this fast-changing field.

You can find an outline of the course here.

Here are some comments about the course from past students:

"A terrific class. Excellent readings and a workshop structure that allowed for a high level of creativity and honed our skills in constructing and de-constructing effective visualizations."

"This was a great course that opened my eyes!!!"

"Fung brought in some fantastic speakers that are well respected in the industry. The workshop nature of the class helped hone our eye not just for our own project, but to observe and comment on those around us."

Registration information here. Please send along to your friends and/or colleagues.

PS. The course is part of the Certificate in Data Visualization although you can register for my course without doing the Certificate. The full set of courses is found here.


In addition, I'm announcing a new course called "Careers in Data Science and Business Analytics". Please see the announcement on my sister blog here.

The class pondering Big Data

Note: I'm traveling a lot lately and it is affecting my ability to post on a regular basis.

It's three weeks into my chart-building workshop (link) at NYU and we are starting to discuss individual projects. One of the major discussion points this week is the quality of the underlying data being visualized.

One student is visualizing movie data from IMDB. He showed a chart comparing the year of a movie's release and the number of votes it has received. Do people talk more about new or old movies? Not surprisingly, the distribution is highly skewed with recent movies getting a lot more votes. The consensus in the room is that you never just want to see the pattern; the natural question to ask is why are we seeing such a pattern.

The easiest response is that people tend to vote on recent movies. This is the availability heuristic. You tend to talk about things that are top of mind. But there is a lot more to it. Perhaps movies of specific genres get discussed more often. Perhaps movies with larger marketing budgets get more buzz. And so on. If any of these factors are important, a good data visualization should bring them out.

Another factor that isn't obvious is that IMDB only started recently relative to the history of movies. The start date of data collection is highly informative here. Imagine a database that was created five years ago versus one that was created five decades ago. The former dataset is not a random sample of the latter, far from it. The availability heuristic matters here. Also, the movie industry has been growing in the meantime so the number of movies is changing. Internet access is also growing so the number of votes is changing. Finally, all students agree that anyone caring to comment on older movies is probably someone who likes those movies, and thus expect the average rating on older movies to be higher than on more recent ones... we'd have to verify this hypothesis using the data.

A lot of Big Data have these characteristics. The starting date of data collection matters a lot. Averaging data without accounting for these timing issues leads to wrong conclusions.
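
A toy model shows how the start date alone can produce a skew. Everything in this sketch is an assumption made for illustration: every movie is given the same underlying interest, with votes halving each year after release, and the database only records votes cast from its start year onward.

```python
def recorded_votes(release_year, db_start, now,
                   first_year_votes=1000, decay=0.5):
    """Votes captured for one movie by a database that began in `db_start`.

    Hypothetical model: a movie draws first_year_votes * decay**k votes in
    the k-th year after release, but only votes cast in or after db_start
    are recorded.
    """
    first_recorded_year = max(release_year, db_start)
    return sum(
        first_year_votes * decay ** (year - release_year)
        for year in range(first_recorded_year, now + 1)
    )


# Two identical movies, released ten years apart, seen through a young
# database (started 2010) and an old one (started 1960).
for db_start in (2010, 1960):
    old = recorded_votes(2000, db_start, now=2012)
    new = recorded_votes(2010, db_start, now=2012)
    print(db_start, round(old, 1), round(new, 1))
```

In the young database the older movie appears to have almost no votes, even though the two movies are identical by construction. In the old database the older movie has slightly more votes, simply because it has had more years to collect them.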


The dynamics of people rating/commenting on movies is a topic I'm interested in. If you go to Amazon and pull up Freakonomics, published 6 years ago, it has over 1800 reviews, of which over 800 are five stars, and 1300 are four or five stars, and yet the most recent reviews submitted are dated 3, 5, 6, etc. days ago. Why do people keep writing reviews? For example, two of the reviews written this week just said "great!" and "great book". Another said "Outstanding take on the odd correlations between things in our culture. Definitley makes you think outside the lines." That comment has probably been repeated hundreds of times already by the preceding reviewers. Has anyone studied this yet?

Try a new way of learning dataviz; course announcement




Fall 2014 (Oct 6 - Nov 24, Mondays 6:30-9:30)

New York University

Instructor: Kaiser Fung

Location: New York City


Learn how to make knock-out data visualization in an innovative, immersive and fun setting, with classmates who are similarly passionate about making the numbers speak visually.


The class is conducted in the style of creative-writing workshops. Each student will focus on one data visualization project during the term, and gain knowledge through drafting and revisions, offering and receiving critique, and above all, learning from others.


You will develop a discriminating eye for good visualizations. For students enrolled in the Certificate in Data Visualization, the course offers an ideal setting to demonstrate mastery of the integrated approach combining the perspectives of statistical graphics, graphical design, and information visualization.


Prerequisite: We welcome students from all backgrounds. A more diverse class makes a better experience for everyone. In order to be a full participant in the course, you should have prior experience making data graphics for an audience (broadly defined), and feel comfortable offering critique of others’ work.


Because of the workshop structure, enrollment is limited to 12 students. Enroll now to reserve your spot.



Second Dataviz Workshop Soon to Start, and Feedback from First Workshop

I'm excited to announce that there will be a summer session for my Dataviz Workshop at NYU (starting June 21). This is a chart-building workshop run like a creative writing workshop. You will work on a personal project throughout the term, receive feedback from classmates, and continually improve the product. I have previously written about the First Workshop here (with syllabus), here, here and here.

Here is the link to register for the course. (Note: the correct class time is 10a - 1p.)


The participants in the First Workshop were very happy with their experience. I can now report on the end-of-course survey. Ten people took the class, and seven responded to the survey. The satisfaction scores are as follows:


It's very gratifying to see that almost everyone thought the class time was well spent. During class, students gave each other feedback on projects. A key to making these sessions work is that students should be both givers and takers. It is really important that they become as comfortable giving critique as taking feedback. I asked the students to self-assess and this is what they said:


I'd also add that the few students who enrolled in the course with less background than the average ended up participating fully and actively in the discussion. As an instructor, I want to get out of the way while keeping the conversation on track. Based on the following rating, I think I did fine:


One piece of feedback I received during class--not reflected here--is that some students want to spend more time discussing the reading. I assign three books, which everyone loved, but I believe that it is hard for them to finish reading all three books in time for the second class. They would like to spread the discussion of the books over the course of the term. This arrangement would present a challenge. Due to the nature of a workshop, the first two sessions cannot involve project discussion, which is one of the reasons why I give introductory lectures and assign the books. In addition, students spend a lot of time during the term both working on their own projects and reviewing their classmates' projects; and I worry that assigning more reading distracts from the other activities.

Indeed, the course is not a gut course. Several students were surprised by how much work they put in. One or two learned that preparing the data took ten times as much time as they expected. (They selected particularly difficult datasets to work with.)


One specific piece of feedback is to add a session in the computer lab. This creates an opportunity for students to share their knowledge. Those who are good coders can help others who are not with pre-processing tasks. Those who are good with Illustrator can show others how to make the charts pretty. I am not ready for this change in the summer session but in the fall, I'll likely experiment with this.

Finally, the tools used by students are diverse: Excel (5), Illustrator (3), R (2), followed by Powerpoint, Pixelmator (draft stage), Tableau, Stata, Paint and SQL Server (1 each). Three of the students put their work on a Web page, which was the most popular format.


If you are serious about dataviz, please join me this summer for the Second Art of Data Visualization Workshop.

Click on this link to register for the course.



Update on Dataviz Workshop 2

The class practised doing critiques on the famous Wind Map by Fernanda Viegas and Martin Wattenberg.


Click here for a real-time version of the map.

I selected this particular project because it is a heartless person indeed who does not see the "beauty" in this thing.

Beauty is a word that is thrown around a lot in data visualization circles. What do we mean by beauty?


The discussion was very successful and the most interesting points of discussion were these:

  • Something that is beautiful should take us to some truth.
  • If we take this same map but corrupt all the data (e.g. reverse all wind directions), is the map still beautiful?
  • What is the "truth" in this map? What is its utility?
  • The emotional side of beauty is separate from the information side.
  • "Truth" comes before the emotional side of beauty.

Readers: would love to hear what you think.


PS. Click here for class syllabus. Click here for first update.

Update on Dataviz Workshop 1

Happy to report on the dataviz workshop, a first-time offering at NYU. I previously posted the syllabus here.

I made minor changes to the syllabus, adding Alberto Cairo's book, The Functional Art (link), as optional reading, some articles from the recent debate in the book review circle about the utility of "negative reviews" (start here), and some blog posts by Stephen Few.

The Cairo and Few readings, together with Tufte, are closest to what I want to accomplish in the first two classes, before we start discussing individual projects: encouraging students to adopt the mentality of the course, that is to say, to think of dataviz as an artform. An artform implies many things, one of which is a seriousness about the output, and another is the recognition that the work has an audience. 

The field of data visualization is sorely lacking high-level theory, immersed as so many of us are in tools, data, and rules of thumb. It is my hope that these workshop discussions will lead to a crystallization of the core principles of the field.

We went on a tour of many dataviz blogs, and documented various styles of criticism. In the next class, we will discuss what style we'd adopt in the course.


The composition of the class brings me great excitement. There are 12 enrolled students, which is probably the maximum for a class of this type.  One student subsequently dropped out, after learning that the workshop is really not for true beginners.

The workshop participants come from all three schools of dataviz: computer science, statistics, and design. Amongst us are an academic economist trained in statistical methods, several IT professionals, and an art director. This should make for rewarding conversation, as inevitably there will be differences in perspective.


REQUEST FOR HELP: A variety of projects have been proposed; several are using this opportunity to explore data sets from their work. That said, some participants are hoping to find certain datasets. If you know of good sources for the following, please write a comment below and link to them:

  • Opening-day ratings from sites like Rotten Tomatoes
  • New York City water quality measures by county (or other geographical unit), probably from an environmental agency
  • Data about donors/donations to public media companies


Since this is a dataviz blog, I want to include a chart with this post. I did a poll of the enrolled students, and one of the questions was about what dataviz tools they use to generate charts. I present here two views of the same data.

The first is a standard column chart, plotting the number of students who include a particular tool in his or her toolset (each student is allowed to name more than one tool). This presents a simple piece of information simply: Excel is the most popular although the long tail indicates the variety of tools people use in practice.


What the first option doesn't bring out is the correlation between tools, indicated by several tools used by the same participant. The second option makes this clear, with each column representing a student. This chart is richer as it also provides information on how many tools the average student uses, and the relationship between different tools.


The tradeoff is that the reader has to work a little more to understand the relative importance of the different tools, a message that is very clear in the first option. 

This second option is also not scalable. If there are thousands of students, the chart will lose its punch (although it will undoubtedly be called beautiful).
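
For readers who want to try both views on their own poll data, here is a minimal sketch of the two data structures behind them. The responses are invented (they are not the actual class poll), and the variable names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical poll responses: each student names one or more tools.
toolsets = {
    "student_1": {"Excel", "Tableau"},
    "student_2": {"Excel", "R"},
    "student_3": {"Illustrator", "Excel"},
    "student_4": {"R"},
}

# View 1: number of students naming each tool (the column chart).
tool_counts = Counter(tool for tools in toolsets.values() for tool in tools)

# View 2: one column per student, one row per tool (the student-level chart).
# Tools are ordered by popularity, with ties broken alphabetically.
all_tools = sorted(tool_counts, key=lambda t: (-tool_counts[t], t))
matrix = {
    student: [tool in tools for tool in all_tools]
    for student, tools in toolsets.items()
}

# Extra information only the second view carries.
tools_per_student = sum(len(t) for t in toolsets.values()) / len(toolsets)
pair_counts = Counter(
    pair
    for tools in toolsets.values()
    for pair in combinations(sorted(tools), 2)
)

print(tool_counts.most_common())
print(tools_per_student)
print(pair_counts.most_common())
```

The first structure throws away who said what, which is why it scales to thousands of students while the second does not.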

Which version do you like? Are there even better ways to present this information?