To explain or to eliminate, that is the question

Today, I take a look at another project from Ray Vella's class at NYU.

Rich Get Richer Assigment 2 top

(The above image is a honeypot for "smart" algorithms that don't know how to handle image dimensions which don't fit their shadow "requirement". Human beings should proceed to the full image below.)

As explained in this post, the students visualized data about regional average incomes in a selection of countries. It turns out that remarkable differences persist in regional income disparity between countries, almost all of which are more advanced economies.

Rich Get Richer Assigment 2 Danielle Curran_1

The graphic is by Danielle Curran.

I noticed two smart decisions.

First, she came up with a different main metric for gauging regional disparity, landing on a metric that is simple to grasp.

Based on hints given on the chart, I surmised that Danielle computed the change in per-capita income in the richest and poorest regions separately for each country between 2000 and 2015. These regional income growth values are expressed in currency, not indexed. Then, she computed the ratio of these growth values, for each country. The end result is a simple metric for each country that describes how fast income has been growing in the richest region relative to the poorest region.
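To make my reading of the metric concrete, here is a minimal sketch in Python. The country names and income figures are made up for illustration, and the calculation reflects what I surmised from the chart, not a confirmed description of Danielle's method.

```python
# Hypothetical per-capita incomes in currency units, as (year 2000, year 2015)
# pairs for the richest and poorest region of each country.
# These numbers are invented for illustration only.
incomes = {
    "Country A": {"richest": (40_000, 55_000), "poorest": (20_000, 25_000)},
    "Country B": {"richest": (30_000, 36_000), "poorest": (15_000, 21_000)},
}


def growth_ratio(richest, poorest):
    """Income change in the richest region divided by that in the poorest."""
    rich_change = richest[1] - richest[0]
    poor_change = poorest[1] - poorest[0]
    if poor_change == 0:
        raise ValueError("poorest region shows no change; ratio is undefined")
    return rich_change / poor_change


ratios = {
    country: growth_ratio(regions["richest"], regions["poorest"])
    for country, regions in incomes.items()
}

for country, ratio in ratios.items():
    print(f"{country}: {ratio:.1f}")
```

The result is one number per country, which is what makes the chart easy to read.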

One of the challenges of this dataset is the complex indexing scheme (discussed here). Carlos' solution keeps the indices but uses design to facilitate comparisons. Danielle avoids the indices altogether.

The reader is relieved of the need to make comparisons, and so can focus on differences in magnitude. We see clearly that regional disparity is by far the highest in the U.K.


The second smart decision Danielle made is organizing the countries into clusters. She took advantage of the horizontal axis which does not encode any data. The branching structure places different clusters of countries along the axis, making it simple to navigate. The locations of these clusters are cleverly aligned to the map below.


Danielle's effort is stronger on communications while Carlos' effort provides more information. The key is to understand who your readers are. What proportion of your readers would want to know the values for each country, each region and each year?


A couple of suggestions

a) The reference line should be set at 1, not 0, for a ratio scale. The value of 1 happens when the richest region and the poorest region have identical growth in per-capita income.

b) The vertical scale should be fixed.

Gazing at petals

Reader Murphy pointed me to the following infographic developed by Altmetric to explain their analytics of citations of journal papers. These metrics are alternative in that they arise from non-academic media sources, such as news outlets, blogs, Twitter, and Reddit.

The key graphic is the petal diagram with a number in the middle.


I have a hard time thinking of this object as “data visualization”. Data visualization should visualize the data. Here, the connection between the data and the visual design is tenuous.

There are eight petals arranged around the circle. The legend below the diagram maps the color of each petal to a source of data. Red, for example, represents mentions in news outlets, and green represents mentions in videos.

Each petal is the same size, even though the counts given below differ. So, the petals are like a duplicative legend.

The order of the colors around the circle does not align with their order in the table below, for a mysterious reason.

Then comes another puzzle. The bluish-gray petal appears three times in the diagram. This color is mapped to tweets. Does the number of petals represent the much higher counts of tweets compared to other mentions?

To confirm, I pulled up the graphic for a different paper.


Here, each petal has a different color. Eight petals, eight colors. The count of tweets is still much larger than the frequencies of the other sources. So, the rule of construction appears to be one petal for each relevant data source, and if the total number of data sources falls below eight, then let Twitter claim all the unclaimed petals.
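The rule as I inferred it can be written down in a few lines of Python. This is a sketch of my guess, not Altmetric's actual code; the function name `assign_petals`, the source labels and the counts are all hypothetical, and the sketch does not attempt to reproduce the order of petals around the circle.

```python
def assign_petals(counts, n_petals=8, filler="Twitter"):
    """Give one petal to each source with mentions; hand the rest to the filler.

    `counts` maps a source name to its number of mentions.
    Returns a list of n_petals source names (order is not meaningful here).
    """
    sources = [source for source, n in counts.items() if n > 0]
    if len(sources) > n_petals:
        raise ValueError("more data sources than petals")
    unclaimed = n_petals - len(sources)
    return sources + [filler] * unclaimed


# Hypothetical paper with six active sources: Twitter ends up with three petals.
example_counts = {
    "News": 12, "Blogs": 5, "Twitter": 340,
    "Facebook": 3, "Reddit": 1, "Video": 2,
}
petals = assign_petals(example_counts)
print(petals.count("Twitter"))
```

Note that the counts themselves never influence the output beyond being zero or non-zero.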

A third sample paper confirms this rule:


None of the places we were hoping to find data – size of petals, color of petals, number of petals – actually contain any data. Anything the reader wants to learn can be directly read. The “score” that reflects the aggregate “importance” of the corresponding paper is found at the center of the circle. The legend provides the raw data.


Some years ago, one of my NYU students worked on a project relating to paper citations. He eventually presented the work at a conference. I featured it previously.


Notice how the visual design provides context for interpretation – by placing each paper/researcher among its peers, and by using a relative scale (percentiles).


I’m ignoring the D corner of the Trifecta Checkup in this post. For any visualization to be meaningful, the data must be meaningful. The type of counting used by Altmetric treats every tweet, every mention, etc. as a tally, making everything worth the same. A mention on CNN counts as much as a mention by a pseudonymous redditor. A pan is the same as a rave. Let’s not forget the fake data menace (link), which affects all performance metrics.

Summer dataviz workshop to start July 1

Registration is open for my dataviz workshop at NYU. (link)

This is a workshop in the sense of a creative writing workshop. Your "writing" consists of sketches of data visualization based on your selected datasets. In class, we critique all of the work and produce revisions. You will learn to appreciate good dataviz, to offer constructive and insightful commentary on visualization, and to be discriminating in receiving feedback.

Last term, half the class worked on datasets that are related to their jobs. The data sources were diverse, including scholarly citation data, World Bank data, commercial sales and market share data, mountaineering accident data, standardized testing item data, speeches by death row inmates, juvenile convicts, etc.

Students pick their own tools. They used Excel, PowerPoint, Tableau, d3, etc.

Here is a past syllabus.

The course runs from July 1 to Aug 5. Register here.

Course Announcement: Data Visualization Workshop

The next installment of my data visualization workshop runs from April 7 to May 12 in New York City.

My workshop is modeled after a creative writing workshop. The focus of the six weeks is on giving and receiving feedback on a dataviz project of the student's choosing. There are selected readings and industry speakers who provide some perspective on this fast-changing field.

You can find an outline of the course here.

Here are some comments about the course from past students:

"A terrific class. Excellent readings and a workshop structure that allowed for a high level of creativity and honed our skills in constructing and de-constructing effective visualizations."

"This was a great course that opened my eyes!!!"

"Fung brought in some fantastic speakers that are well respected in the industry. The workshop nature of the class helped hone our eye not just for our own project, but to observe and comment on those around us."

Registration information here. Please send along to your friends and/or colleagues.

PS. The course is part of the Certificate in Data Visualization although you can register for my course without doing the Certificate. The full set of courses is found here.


In addition, I'm announcing a new course called "Careers in Data Science and Business Analytics". Please see the announcement on my sister blog here.

The class pondering Big Data

Note: I'm traveling a lot lately and it is affecting my ability to post on a regular basis.

It's three weeks into my chart-building workshop (link) at NYU and we are starting to discuss individual projects. One of the major discussion points this week is the quality of the underlying data being visualized.

One student is visualizing movie data from IMDB. He showed a chart comparing the year of a movie's release and the number of votes it has received. Do people talk more about new or old movies? Not surprisingly, the distribution is highly skewed with recent movies getting a lot more votes. The consensus in the room is that you never just want to see the pattern; the natural question to ask is why are we seeing such a pattern.

The easiest response is that people tend to vote on recent movies. This is the availability heuristic. You tend to talk about things that are top of mind. But there is a lot more to it. Perhaps movies of specific genres get discussed more often. Perhaps movies with larger marketing budgets get more buzz. And so on. If any of these factors are important, a good data visualization should bring them out.

Another factor that isn't obvious is that IMDB only started recently relative to the history of movies. The start date of data collection is highly informative here. Imagine a database that was created five years ago versus one that was created five decades ago. The former dataset is not a random sample of the latter, far from it. The availability heuristic matters here. Also, the movie industry has been growing in the meantime so the number of movies is changing. Internet access is also growing so the number of votes is changing. Finally, all students agree that anyone caring to comment on older movies is probably someone who likes those movies, and thus expect the average rating of older movies to be higher than that of more recent ones... we'd have to verify this hypothesis using the data.

A lot of Big Data have these characteristics. The starting date of data collection matters a lot. Averaging data without accounting for these timing issues leads to wrong conclusions.
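A toy simulation shows how much the start date can matter. Everything here is an assumption made for illustration: the attention curve (votes halving every year after release), the years, and the function names. None of it comes from the IMDB data.

```python
def votes_in_year(age):
    """Hypothetical attention curve: votes halve each year after release."""
    return 1000 // (2 ** age)


def recorded_votes(release_year, collection_start, now):
    """Votes a database captures if it only starts counting at collection_start."""
    first_year = max(release_year, collection_start)
    return sum(
        votes_in_year(year - release_year) for year in range(first_year, now + 1)
    )


# Two movies that, under the assumed curve, attract identical attention
# at the same age; only their release years differ.
old_true = recorded_votes(2005, collection_start=2005, now=2015)  # full history
old_seen = recorded_votes(2005, collection_start=2010, now=2015)  # late start
new_seen = recorded_votes(2012, collection_start=2010, now=2015)

print(old_true, old_seen, new_seen)
```

In this toy setup, the older movie looks far less popular in the late-starting database even though its full history is comparable to the newer movie's. Any average by release year inherits that distortion.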


The dynamics of people rating/commenting on movies is a topic I'm interested in. If you go to Amazon and pull up Freakonomics, published 6 years ago, it has over 1800 reviews, of which over 800 are five stars, and 1300 are four or five stars, and yet the most recent reviews submitted are dated 3, 5, 6, etc. days ago. Why do people keep writing reviews? For example, two of the reviews written this week just said "great!" and "great book". Another said "Outstanding take on the odd correlations between things in our culture. Definitley makes you think outside the lines." That comment has probably been repeated hundreds of times already by the preceding reviewers. Has anyone studied this yet?

Try a new way of learning dataviz; course announcement




Fall 2014 (Oct 6 - Nov 24, Mondays 6:30-9:30)

New York University

Instructor: Kaiser Fung

Location: New York City


Learn how to make knock-out data visualization in an innovative, immersive and fun setting, with classmates who are similarly passionate about making the numbers speak visually.


The class is conducted in the style of creative-writing workshops. Each student will focus on one data visualization project during the term, and gain knowledge through drafting and revisions, offering and receiving critique, and above all, learning from others.


You will develop a discriminating eye for good visualizations. For students enrolled in the Certificate in Data Visualization, the course offers an ideal setting to demonstrate mastery of the integrated approach combining the perspectives of statistical graphics, graphical design, and information visualization.


Prerequisite: We welcome students from all backgrounds. A more diverse class makes a better experience for everyone. In order to be a full participant in the course, you should have prior experience making data graphics for an audience (broadly defined), and feel comfortable offering critique of others’ work.


Because of the workshop structure, enrollment is limited to 12 students. Enroll now to reserve your spot.



Second Dataviz Workshop Soon to Start, and Feedback from First Workshop

I'm excited to announce that there will be a summer session for my Dataviz Workshop at NYU (starting June 21). This is a chart-building workshop run like a creative writing workshop. You will work on a personal project throughout the term, receive feedback from classmates, and continually improve the product. I have previously written about the First Workshop here (with syllabus), here, here and here.

Here is the link to register for the course. (Note: the correct class time is 10a - 1p.)


The participants in the First Workshop were very happy with their experience. I can now report on the end-of-course survey. Ten people took the class, and seven responded to the survey. The satisfaction scores are as follows:


It's very gratifying to see that almost everyone thought the class time was well spent. During class, students gave each other feedback on projects. A key to making these sessions work is that students should be both givers and takers. It is really important that they become as comfortable giving critique as taking feedback. I asked the students to self-assess and this is what they said:


I'd also add that the few students who enrolled in the course with less background than the average ended up participating fully and actively in the discussion. As an instructor, I want to get out of the way while keeping the conversation on track. Based on the following rating, I think I did fine:


One piece of feedback I received during class--not reflected here--is that some students want to spend more time discussing the reading. I assign three books, which everyone loved, but I believe that it is hard for them to finish reading all three books in time for the second class. They would like to spread the discussion of the books over the course of the term. This arrangement would present a challenge. Due to the nature of a workshop, the first two sessions cannot involve project discussion, which is one of the reasons why I give introductory lectures and assign the books. In addition, students spend a lot of time during the term both working on their own projects and reviewing their classmates' projects; and I worry that assigning more reading distracts from the other activities.

Indeed, the course is not a gut course. Several students were surprised by how much work they put in. One or two learned that preparing the data took ten times as much time as they expected. (They selected particularly difficult datasets to work with.)


One specific suggestion is to add a session in the computer lab. This creates an opportunity for students to share their knowledge. Those who are good coders can help others who are not with pre-processing tasks. Those who are good with Illustrator can show others how to make the charts pretty. I am not ready for this change in the summer session but in the fall, I'll likely experiment with this.

Finally, the tools used by students are diverse: Excel (5), Illustrator (3), R (2), followed by PowerPoint, Pixelmator (draft stage), Tableau, Stata, Paint and SQL Server (1 each). Three of the students put their work on a Web page, which was the most popular format.


If you are serious about dataviz, please join me this summer for the Second Art of Data Visualization Workshop.

Click on this link to register for the course.



Update on Dataviz Workshop 2

The class practised doing critiques on the famous Wind Map by Fernanda Viegas and Martin Wattenberg.


Click here for a real-time version of the map.

I selected this particular project because it is a heartless person indeed who does not see the "beauty" in this thing.

Beauty is a word that is thrown around a lot in data visualization circles. What do we mean by beauty?


The discussion was very successful and the most interesting points of discussion were these:

  • Something that is beautiful should take us to some truth.
  • If we take this same map but corrupt all the data (e.g. reverse all wind directions), is the map still beautiful?
  • What is the "truth" in this map? What is its utility?
  • The emotional side of beauty is separate from the information side.
  • "Truth" comes before the emotional side of beauty.

Readers: would love to hear what you think.


PS. Click here for class syllabus. Click here for first update.

Update on Dataviz Workshop 1

Happy to report on the dataviz workshop, a first-time offering at NYU. I previously posted the syllabus here.

I made minor changes to the syllabus, adding Alberto Cairo's book, The Functional Art (link), as optional reading, some articles from the recent debate in the book review circle about the utility of "negative reviews" (start here), and some blog posts by Stephen Few.

The Cairo and Few readings, together with Tufte, are closest to what I want to accomplish in the first two classes, before we start discussing individual projects: encouraging students to adopt the mentality of the course, that is to say, to think of dataviz as an artform. An artform implies many things, one of which is a seriousness about the output, and another is the recognition that the work has an audience. 

The field of data visualization is sorely lacking high-level theory, immersed as so many of us are in tools, data, and rules of thumb. It is my hope that these workshop discussions will lead to a crystallization of the core principles of the field.

We went on a tour of many dataviz blogs, and documented various styles of criticism. In the next class, we will discuss what style we'd adopt in the course.


The composition of the class brings me great excitement. There are 12 enrolled students, which is probably the maximum for a class of this type. One student subsequently dropped out, after learning that the workshop is really not for true beginners.

The workshop participants come from all three schools of dataviz: computer science, statistics, and design. Amongst us are an academic economist trained in statistical methods, several IT professionals, and an art director. This should make for rewarding conversation, as inevitably there will be differences in perspective.


REQUEST FOR HELP: A variety of projects have been proposed; several are using this opportunity to explore data sets from their work. That said, some participants are hoping to find certain datasets. If you know of good sources for the following, please write a comment below and link to them:

  • Opening-day ratings from sites like Rotten Tomatoes
  • New York City water quality measures by county (or other geographical unit), probably from an environmental agency
  • Data about donors/donations to public media companies


Since this is a dataviz blog, I want to include a chart with this post. I did a poll of the enrolled students, and one of the questions was about what dataviz tools they use to generate charts. I present here two views of the same data.

The first is a standard column chart, plotting the number of students who include a particular tool in their toolset (each student is allowed to name more than one tool). This presents a simple piece of information simply: Excel is the most popular although the long tail indicates the variety of tools people use in practice.


What the first option doesn't bring out is the correlation between tools, indicated by several tools used by the same participant. The second option makes this clear, with each column representing a student. This chart is richer as it also provides information on how many tools the average student uses, and the relationship between different tools.


The tradeoff is that the reader has to work a little more to understand the relative importance of the different tools, a message that is very clear in the first option. 

This second option is also not scalable. If there are thousands of students, the chart will lose its punch (although it will undoubtedly be called beautiful).
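For readers who want to see what each view computes, here is a small Python sketch. The students and their toolsets are invented for illustration and are not the actual poll responses.

```python
from collections import Counter
from itertools import combinations

# Hypothetical poll responses: each student names one or more tools.
toolsets = {
    "student_1": {"Excel", "R"},
    "student_2": {"Excel", "Illustrator"},
    "student_3": {"Excel", "R", "Tableau"},
    "student_4": {"Illustrator"},
}

# View 1 (column chart): how many students name each tool.
tool_counts = Counter(tool for tools in toolsets.values() for tool in tools)

# View 2 (one column per student): tools per student, plus which tools
# show up together in the same toolset.
tools_per_student = {student: len(tools) for student, tools in toolsets.items()}
average_tools = sum(tools_per_student.values()) / len(tools_per_student)
pair_counts = Counter(
    pair
    for tools in toolsets.values()
    for pair in combinations(sorted(tools), 2)
)

print(tool_counts.most_common())
print(average_tools)
print(pair_counts.most_common())
```

The first view collapses the data to one count per tool; the second retains the student-level records, which is where the information about co-occurring tools comes from.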

Which version do you like? Are there even better ways to present this information?


Announcement: Dataviz Workshop for Spring 2014

I'm very excited to preview the syllabus of a new dataviz course I've been developing to be launched in Spring 2014. This course is focused on the craft of graph building, and is modeled after the writing workshop. Students will work through multiple drafts of a project while giving and receiving criticism from other students. To my knowledge, this is a one-of-a-kind course so I'm putting up the syllabus and will report on how it goes over in a few months. I hope the format will prove successful and others will offer graph building workshops in the years to come. I'm open to suggestions about the syllabus.

The course is offered as part of the brand-new Certificate in Analytics and Data Visualization at New York University. The announcement of the Certificate is here.

You can sign up for the course here. Please spread the word!



COURSE TITLE: The Art of Data Visualization (DATA1-CE9002)

FEB/MAR 2014, Saturday mornings

Woolworth Bldg, NYC

Instructor: Kaiser Fung


Data visualization is storytelling in a graphical medium. The format of this course is inspired by the workshops used extensively to train budding writers, in which you gain knowledge by doing and redoing, by offering and receiving critique, and above all, by learning from one another. Present your project while other students offer critique and suggestions for improvement. The course offers immersion into the creative process, the discipline of sketching and revising, and the practical use of tools. You will develop a discriminating eye for good visualizations. Readings on aspects of the craft are assigned throughout the term. For students in the Certificate of Analytics and Data Visualization, the course offers a chance to demonstrate mastery of the integrated approach combining the perspectives of statistical graphics, graphical design, and information visualization.


  • Give constructive critique on other people’s data visualization
  • Listen and respond to critique from others on one’s own data visualization
  • Evaluate alternative visualization of the same data
  • Refine and improve drafts of data visualization projects
  • Interpret data visualization with an integrated lens combining the perspectives of statistical graphics, graphic design, and information visualization
  • Create at least one piece of work that can be included in one’s portfolio


This is not a beginner’s class. You should have prior experience making data graphics for an audience, and feel comfortable offering critique of others’ work. For students in the Certificate of Analytics and Data Visualization, appropriate preparation includes these courses: Introduction to Analytics and Data Visualization, Statistical Foundations of Analytics and Data Visualization, Applied Data Management for Analytics and Data Visualization, and Designing Data: Infographics. Because of these prerequisites, you may execute designs in your preferred set of tools, such as Excel, Adobe Illustrator, R, Processing, Tableau, and JMP.


Required Textbooks:

      Edward Tufte. The Visual Display of Quantitative Information (Graphics Press)

      Julie Steele and Noah Iliinsky (eds.). Beautiful Visualization: Looking at Data Through the Eyes of Experts (O’Reilly, 2010)

      Don Norman. The Design of Everyday Things: Revised and Expanded Edition. (Basic Books, 2013)

Required Course Readings:

Kosara, Robert. "Visualization criticism-the missing link between information visualization and art." In Information Visualization, 2007. IV'07. 11th International Conference, pp. 631-636.

Kosara, Robert, “What is Visualization? A Definition”, blog post, July 2008.

Kirk, Andy, “Walking the tightrope of visualization criticism: the balance, fairness and realism of our visualization criticism must improve”, blog post, July 2012.

Kosara, Robert, “A Criticism of Visualization Criticism Criticism”, blog post, July 2012. http://eagereyes.org/criticism/criticism-visualization-criticism-criticism. The above three references form a dialogue.

Gelman, Andrew, and Antony Unwin, “Infovis and Statistical Graphics: Different Goals, Different Looks”, Journal of Computational and Graphical Statistics 22(1), 2013: pp. 2-28.

Gelman, Andrew, and Antony Unwin, “Tradeoffs in Information Graphics”, Journal of Computational and Graphical Statistics 22(1), 2013: pp. 45-49. This is a rejoinder to the discussion of the previous article.

Mitchell, Ian. "Authority or Cliché? The Graphic Language of Information Design." Research, Education and Design Experiences (2012).

Rhyne, Theresa-Marie, “Does the Difference Between Information and Scientific Visualization Really Matter?” IEEE Computer Graphics and Applications 23(3): 6-8.

North, Chris, “Toward Measuring Visualization Insight”, IEEE Computer Graphics and Applications, May/June 2006, pp. 6-9.

Heer, Jeffrey, et al., “A Tour Through the Visualization Zoo”, Communications of the ACM 53(6): June 2010, pp. 59-67.

Optional but recommended:

Other Ed Tufte books

Any book by Howard Wainer (Visual Revelations, Graphic Discovery, etc.)

Van Wijk, Jarke J., “Views on Visualization”, IEEE Transactions on Visualization and Computer Graphics 12(4): July/August 2006, pp. 421-432.

Zangwill, Nick, "Aesthetic Judgment", The Stanford Encyclopedia of Philosophy (Summer 2013 Edition), Edward N. Zalta (ed.).

Websites: There are a lot of blogs showcasing visualization projects. (List of blogs to be added)


Class attendance: 30%

On-time submission of drafts: 20%

On-time submission of written critiques: 20%

Class Participation: 20%

Final Project Grade: 10%


First Two Classes

Course Philosophy

  • Graph building as an artform
  • Graph building as story-telling
  • Visualization criticism
  • The workshop method

           Student questionnaire


           Make assignments and schedules

           Guest speaker talks about real-world graphics design process

The State of Visualization Criticism: review several blogs

Criticism frameworks, e.g. Junk Charts Trifecta Checkup           

           Examples of Visualization Criticism


            In-class discussion: (based on required reading, may shift to future classes depending on time)

  • What is beauty?
  • Novelty, and standards
  • How should visualization be measured?
  • What are insights?
  • What works fall under the data visualization label?
  • What can graphics designers learn from Norman's approach to product design?

            Ground rules for workshop

Final Four Sessions

During the course, each student will hand in two drafts of a graphic, the second of which should take into account prior criticism. The class will be divided into two groups, and projects will be workshopped in alternate weeks. It is crucial that projects are submitted on time so that your classmates have time to prepare considered criticism.


Please leave comments below.

You can sign up for the course here.

Please spread the word!