## Visualizing uneven distributions

##### Jan 30, 2014

The highlight of the post is this chart, which shows an uneven distribution.

The message of the chart is that a large share of donations (about 25%) came from the top 3 percent of donors. This is a long-tailed distribution, quite typical of data that have to do with financial matters. Since many of us encounter this type of data, the problem of visualizing it is a general one.

One of the insights from Jeff's post is that with some tricks, one can generate a chart that looks like the above using Excel. This is pretty impressive, and he credits Peltier for the pointer.

***

Now, let's see if there are other ways to present this data. One issue I have with the chart is that the most important statistics are found in the text labels. These are of the form: "X% of customers contribute Y% of revenues". So, in effect, there are two relevant data series: one for the share of people and the other for the share of revenues.
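To make the two data series concrete, here is a minimal sketch of how the pairs behind those labels could be computed. The subscription values and cut points are made up for illustration (they are not the data behind Jeff's chart), and `segment_shares` is a hypothetical helper name.

```python
# Hypothetical subscription values, one per customer (not the chart's actual data).
values = [5] * 60 + [12] * 25 + [25] * 12 + [100] * 3

def segment_shares(values, cut_points):
    """Return (share of customers, share of revenue) for each value segment.

    cut_points are the inclusive upper bounds of the segments, in ascending order.
    """
    total_customers = len(values)
    total_revenue = sum(values)
    shares = []
    lower = float("-inf")
    for upper in cut_points:
        in_segment = [v for v in values if lower < v <= upper]
        shares.append((len(in_segment) / total_customers,
                       sum(in_segment) / total_revenue))
        lower = upper
    return shares

shares = segment_shares(values, [5, 12, 25, 100])
for people, revenue in shares:
    print(f"{people:.0%} of customers contribute {revenue:.0%} of revenues")
```

Each pair is one text label on the original chart; a stacked column or profile chart would plot the first element of every pair as one series and the second element as the other.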

The following is a stacked column chart:

Here, the information is primarily encoded in the dotted guide lines between the two columns. It has the advantage of showing both the absolute share of people as well as of revenues, plus showing the uneven distribution between the two data series.

But it is also less fun to look at. The advantage of the original chart is that one can imagine that all the donors are being lined up along the horizontal axis from those who gave the least to those who gave the most. That's a pretty powerful mental picture. The weakness of the original is that few of us can mentally tally up the strangely shaped areas to learn the share of revenues.

***

The next version is a kind of profile chart:

I like this one because it places the two data series on equal footing, and allows for efficient comparison of the two sets of proportions. It also has the feature of showing all the shares, just like the stacked columns.

PS. Jeff has taken some of his readers' comments into account, and has evolved his original design to this one:

I can see these changes:

• Customers are ordered with the most important on the left and the least on the right. To me, this is a neutral change.
• The vertical axis is labelled "subscription value" instead of "How much do we get for each subscription". This is a slight improvement, using fewer words to convey the same point.
• The breakpoints have been set differently to split the revenues into five segments, so that each segment now accounts for exactly 20% of the revenues. I actually prefer the original segmentation -- that one visually picks out the breakpoints in the data, thus it is empirical rather than canonical. Look at the split between the gray and the yellow segments in the new chart. Does it make sense to split customers with the same subscription value into two groups?
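The concern in the last point can be illustrated with a small sketch. It uses made-up whole-dollar subscription values and a hypothetical helper, `equal_revenue_groups`, which walks customers from the highest value down and cuts every time another fifth of total revenue has accumulated. It is one plausible way to form equal-revenue segments, not necessarily how the revised chart was built.

```python
from collections import defaultdict

# Hypothetical whole-dollar subscription values, one per customer.
values = [5] * 60 + [12] * 25 + [25] * 12 + [100] * 3

def equal_revenue_groups(values, parts=5):
    """Assign customers, highest value first, to groups of roughly equal revenue."""
    ordered = sorted(values, reverse=True)
    total = sum(ordered)
    groups = [[] for _ in range(parts)]
    running = 0  # revenue accumulated before the current customer
    for v in ordered:
        # Integer arithmetic avoids floating-point drift at the cut points.
        idx = min(running * parts // total, parts - 1)
        groups[idx].append(v)
        running += v
    return groups

groups = equal_revenue_groups(values)

# Record which groups each subscription value lands in.
seen_in = defaultdict(set)
for idx, group in enumerate(groups):
    for v in group:
        seen_in[v].add(idx)

# Values whose customers straddle a breakpoint.
split_values = sorted(v for v, idxs in seen_in.items() if len(idxs) > 1)
print(split_values)
```

Any value listed in `split_values` belongs to customers who pay the same amount yet sit on opposite sides of a breakpoint, which is the situation the question above is about.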

## Where are the millionaires? Where's the news?

##### Jan 21, 2014

The financial media, ranging from Wall Street Journal to Zero Hedge, blogged about the geographical distribution of U.S. millionaires. The stories came with a map, and in the case of the latter, two data tables ranked by ascending and descending prevalence of millionaires. The map looks like this:

The talking point lifted from the press release of Phoenix Marketing, which is the source of the data, focuses improbably on North Dakota. For example, the WSJ blog began with:

The state making the fastest climb up the millionaire rankings doesn’t have a single Tiffany or Saks Fifth Avenue store. The closest BMW dealership is a six-hour drive from the capital.

Welcome to North Dakota, which jumped 14 spots in the annual rankings of millionaire households per capita released by Phoenix Marketing International.

The trouble is, you can't pick North Dakota out of the map; it just doesn't stand out. The map uses a different methodology of ordering the states, by groupings of the prevalence of millionaires, that is, the proportion of households in each state who are labeled "millionaires" by Phoenix Marketing.

The text, by contrast, draws attention to the change in the rank of states using the proportion of households who are millionaires as the ranking criterion. This data is two steps removed from the data used for the map (start with the map data, compute the year-to-year change, then convert to ranks).

***

State-level averages pose a challenge: state population varies a lot, and this leads to greater variability in the estimates for smaller states. You are likely to find smaller states over-represented at the top and bottom of state ranking charts. I talked about a similar situation relating to interpreting high school test data (see this post, and the Prologue of Numbersense).

Instead of using proportion of households who are millionaires, I prefer to use the number of millionaires per 1,000 households. Mathematically, these two are equivalent. If we plot that metric versus the size of states (number of households), we see the familiar pattern:

I labeled the North Dakota data point to show how unremarkable it is. While it may have risen in "rank", it is still ranked below the median in terms of the number of millionaires per 1,000 households. Also notice that among states with a similar number of households, the millionaires metric ranges widely from 40 to 70 per 1,000 households.

An interpretation of these state average millionaire metrics has to account for state population size.
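A quick simulation shows why size matters. This sketch assumes, purely for illustration, that every state has the same true rate and that the estimates come from a survey reaching a fixed fraction of households; neither assumption is taken from Phoenix Marketing's methodology, and the household counts are invented.

```python
import random
import statistics

random.seed(2014)  # fixed seed so the sketch is reproducible

TRUE_RATE = 0.05         # assumed: 50 millionaires per 1,000 households everywhere
SAMPLE_FRACTION = 0.001  # assumed: the survey reaches 0.1% of households

def estimated_rate_per_1000(households):
    """Estimate millionaires per 1,000 households from one simulated survey."""
    sample_size = max(1, int(households * SAMPLE_FRACTION))
    hits = sum(random.random() < TRUE_RATE for _ in range(sample_size))
    return 1000 * hits / sample_size

def spread(households, trials=200):
    """Standard deviation of the estimate across repeated simulated surveys."""
    estimates = [estimated_rate_per_1000(households) for _ in range(trials)]
    return statistics.pstdev(estimates)

small_state = spread(300_000)      # hypothetical small state
large_state = spread(10_000_000)   # hypothetical large state
print(f"small state spread: {small_state:.1f} per 1,000")
print(f"large state spread: {large_state:.1f} per 1,000")
```

Even though the true rate is identical, the small state's estimates scatter much more widely, so small states are more likely to land at either extreme of a ranking by chance alone.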

***

The following map illustrates the ups and downs between 2007 and 2013 by state.  (I found 2007 data but not the 2012 data.)

Think of an accounting equation. In this view, the positive changes must balance out the negative changes since I am only concerned about any shift in mix. What this map shows is that Texas, California, New York, and Washington have the top net gains in the number of millionaires while Florida and Michigan have the biggest net losses. North Dakota is again in the middle of the bunch.

This view ignores the total net change in millionaires as it focuses on the mix by state. You'd need to figure out what the relevant question is before you can come up with a good visualization of this (or any) data.
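One way to read the accounting equation is sketched below: compare each state's later count with what it would have been had the state kept its earlier share of the national total. The state names and counts are invented, and this is only one possible formalization of "shift in mix", not necessarily the calculation behind the map.

```python
# Hypothetical counts of millionaire households (in thousands); not the real data.
counts_2007 = {"State A": 600, "State B": 300, "State C": 100}
counts_2013 = {"State A": 700, "State B": 280, "State C": 120}

def mix_shift(before, after):
    """Net gain or loss per state relative to holding its earlier share constant."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    shifts = {}
    for state in before:
        expected = before[state] / total_before * total_after
        shifts[state] = after[state] - expected
    return shifts

shifts = mix_shift(counts_2007, counts_2013)
for state, change in shifts.items():
    print(f"{state}: {change:+.1f}")
print(f"sum of shifts: {sum(shifts.values()):+.1f}")
```

By construction the shifts sum to zero, so the national growth in millionaires drops out and only the redistribution among states remains.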

## Announcement: Dataviz Workshop for Spring 2014

##### Jan 13, 2014

I'm very excited to preview the syllabus of a new dataviz course I've been developing to be launched in Spring 2014. This course is focused on the craft of graph building, and is modeled after the writing workshop. Students will work through multiple drafts of a project while giving and receiving criticism from other students. To my knowledge, this is a one-of-a-kind course so I'm putting up the syllabus and will report on how it goes over in a few months. I hope the format will prove successful and others will offer graph building workshops in the years to come. I'm open to suggestions about the syllabus.

The course is offered as part of the brand-new Certificate in Analytics and Data Visualization at New York University. The announcement of the Certificate is here.

***

NEW YORK UNIVERSITY
CERTIFICATE IN ANALYTICS AND DATA VISUALIZATION

COURSE TITLE: The Art of Data Visualization (DATA1-CE9002)

FEB/MAR 2014, Saturday mornings

Woolworth Bldg, NYC

Instructor: Kaiser Fung

COURSE DESCRIPTION

Data visualization is storytelling in a graphical medium. The format of this course is inspired by the workshops used extensively to train budding writers, in which you gain knowledge by doing and redoing, by offering and receiving critique, and above all, by learning from one another. Present your project while other students offer critique and suggestions for improvement. The course offers immersion into the creative process, the discipline of sketching and revising, and the practical use of tools. You will develop a discriminating eye for good visualizations. Readings on aspects of the craft are assigned throughout the term. For students in the Certificate of Analytics and Data Visualization, the course offers a chance to demonstrate mastery of the integrated approach combining the perspectives of statistical graphics, graphical design, and information visualization.

LEARNING OUTCOMES

• Give constructive critique on other people’s data visualization
• Listen and respond to critique from others on one’s own data visualization
• Evaluate alternative visualization of the same data
• Refine and improve drafts of data visualization projects
• Interpret data visualization with an integrated lens combining the perspectives of statistical graphics, graphic design, and information visualization
• Create at least one piece of work that can be included in one’s portfolio

PREREQUISITES

This is not a beginner’s class. You should have prior experience making data graphics for an audience, and feel comfortable offering critique of others’ work. For students in the Certificate of Analytics and Data Visualization, appropriate preparation includes these courses: Introduction to Analytics and Data Visualization, Statistical Foundations of Analytics and Data Visualization, Applied Data Management for Analytics and Data Visualization, and Designing Data: Infographics. Because of these prerequisites, you may execute designs in your preferred set of tools, such as Excel, Adobe Illustrator, R, Processing, Tableau, and JMP.

Required Textbooks:

Edward Tufte. The Visual Display of Quantitative Information (Graphics Press)

Julie Steele and Noah Iliinsky (eds.). Beautiful Visualization: Looking at Data Through the Eyes of Experts (O’Reilly, 2010)

Don Norman. The Design of Everyday Things: Revised and Expanded Edition. (Basic Books, 2013)

Kosara, Robert. "Visualization criticism-the missing link between information visualization and art." In Information Visualization, 2007. IV'07. 11th International Conference, pp. 631-636.

Kosara, Robert, “What is Visualization? A Definition”, blog post, July 2008. http://eagereyes.org/criticism/definition-of-visualization

Kirk, Andy, “Walking the tightrope of visualization criticism: the balance, fairness and realism of our visualization criticism must improve”, blog post, July 2012. http://strata.oreilly.com/2012/07/visualization-criticism.html

Kosara, Robert, “A Criticism of Visualization Criticism Criticism”, blog post, July 2012. http://eagereyes.org/criticism/criticism-visualization-criticism-criticism. The above three references form a dialogue.

Gelman, Andrew, and Antony Unwin, “Infovis and Statistical Graphics: Different Goals, Different Looks”, Journal of Computational and Graphical Statistics 22(1): pp.2-28.

Gelman, Andrew, and Antony Unwin, “Tradeoffs in Information Graphics”, Journal of Computational and Graphical Statistics 22(1), 2013: pp. 45-49. This is a rejoinder to the discussion of the previous article.

Mitchell, Ian. "Authority or Cliché? The Graphic Language of Information Design." Research, Education and Design Experiences (2012).

Rhyne, Theresa-Marie, “Does the Difference Between Information and Scientific Visualization Really Matter?” IEEE Computer Graphics and Applications 23(3): 6-8.

North, Chris, “Toward Measuring Visualization Insight”, IEEE Computer Graphics and Applications, May/June 2006, pp. 6-9.

Heer, Jeffrey, et al., “A Tour Through the Visualization Zoo”, Communications of the ACM 53(6): June 2010, pp. 59-67.

Optional but recommended:

Other Ed Tufte books

Any book by Howard Wainer (Visual Revelations, Graphic Discovery, etc.)

Van Wijk, Jarke J., “Views on Visualization”, IEEE Transactions on Visualization and Computer Graphics 12(4): July/August 2006, pp. 421-432.

Zangwill, Nick, "Aesthetic Judgment", The Stanford Encyclopedia of Philosophy (Summer 2013 Edition), Edward N. Zalta (ed.). http://plato.stanford.edu/entries/aesthetic-judgment/

Websites: There are a lot of blogs showcasing visualization projects. (List of blogs to be added)

EVALUATION

Class attendance: 30%

On-time submission of drafts: 20%

On-time submission of written critiques: 20%

Class Participation: 20%

SCHEDULE

First Two Classes

Course Philosophy

• Graph building as an artform
• Graph building as story-telling
• Visualization criticism
• The workshop method

Student questionnaire

Introductions

Make assignments and schedules

Guest speaker talks about real-world graphics design process

The State of Visualization Criticism: review several blogs

Criticism frameworks, e.g. Junk Charts Trifecta Checkup

Examples of Visualization Criticism

***

In-class discussion: (based on required reading, may shift to future classes depending on time)

• What is beauty?
• Novelty, and standards
• How should visualization be measured?
• What are insights?
• What works fall under the data visualization label?
• What can graphics designers learn from Norman's approach to product design?

Ground rules for workshop

Final Four Sessions

During the course, each student will hand in two drafts of a graphic, the second of which should take into account prior criticism. The class will be divided into two groups, and projects will be workshopped in alternate weeks. It is crucial that projects are submitted on time so that your classmates have time to prepare considered criticism.

***

## Losing the big picture

##### Jan 08, 2014

One of the dangers of "Big Data" is the temptation to get lost in the details. You become so absorbed in the peeling of the onion that you don't realize your tear glands have dried up.

Hans Rosling linked to a visualization of tobacco use around the world from Twitter (link to original). The setup is quite nice for exploration. I'd call this a "tool" rather than a visual.

***

Let's take a look at the concentric circles on the right.

I appreciate the designer's concept -- the typical visualization of this type of data is looking at relative rates, which obscures the fact that China and India have far and away the most smokers even if their rates are middling (24% and 13% respectively).

This circular chart is supposed to show the absolute distribution of smokers across so-called "super-regions" of the world.

Unfortunately, the designer decided to pile on additional details. The concentric circles present a geography lesson, in effect. For example, high-income super-region is composed of high-income North America, Western Europe, high-income Asia Pacific, etc. and then high-income North America is composed of USA, Canada, etc.

Notice something odd? The further out you go, the larger the circular segments but the smaller the number of people they represent! There are more people in the super-region of high-income worldwide than in high-income North America and in turn, there are more people in the high-income North American region than in USA. But the size of the graphical elements is reversed.

***

In principle, the "bumps"-like chart used to show the evolution of tobacco prevalence in individual countries makes for a nice visual. In fact, Rosling marvelled that the global rate of consumption has fallen in recent years.

However, I'm often irritated when the designer pays no attention to what not to show. There are probably well above 200 lines densely packed into this chart. It is almost certain that over-plotting will cause some of these lines to literally never see the light of day. Try hovering over these lines and see for yourself.

The same chart with say 10 judiciously chosen lines (countries or regions) provides the reader with a lot more profit.

***

The discerning reader figures out that the best visual actually does not even show up on the dashboard. Go ahead, and click on the tab called "Data" on top of the page. You now see a presentation of each country's "data" by age group and by gender. This is where you can really come up with stories for what is going on in different countries.

For example, the British have really done extremely well in reducing tobacco use. Look at how steep the declines are across the board for British men (in most parts of the world, the prevalence of smoking is much higher among men than women).

Bulgaria on the other hand shows a rather odd pattern. It is one of the few countries in the bumps chart that showed a climb in smoking rates, at least in the early 2000s. Here the data for men is broken down into age groups.

This chart exposes a weakness of the underlying data. The error bars indicate to us that what is being plotted is not actual data but modeled data. The error bars here are enormous. With the average at about 40% to 50% for many age groups, the confidence interval is also about 40 percentage points wide. Further, note that there were only three or four observations (purple dots) and curves are being fitted to these three or four dots, plus extrapolation outside the window of observation. The end result is that the apparent uplift in smoking in the early 2000s is probably a figment of the modeler's imagination. You'd want to understand whether there were changes in methodology around that time.

As a responsible designer of data graphics, you should focus less on comprehensiveness and focus more on highlighting the good data. I'm a firm believer in "no data is better than bad data".

## Visualizing movements of people

##### Jan 06, 2014

Long-time reader Daniel L. sends in this chart illustrating a large data set of interstate migration flows in the U.S. The original chart is at Vizynary by way of Daily Kos.

***

There is no denying that this chart is beautiful to look at. But what is its message? That there are people migrating from and to every state? (assuming all fifty states are present)

Daily Kos describes how one can hover over any state to see its individual patterns. Something like this:

This is a great way, perhaps the only way, to consume the chart. Essentially, the reader is asked to generate a small-multiples panel of charts. The chart does a better job at showing the pairs of states between which people migrate than at showing the relative size of the flows. The size of the flows is coded in the width of the arcs. The widths are too similar to tell apart; and it doesn't help that no legend is provided.

The choice of color is curious. Each region of the country is its own color, in a "nominal" way. It is a design decision to emphasize regions.

Another decision is to hide information on the distances of the migrations. Evidently, the designer sacrificed that information in order to create the neat circular arrangement of states.

A shortcoming of this representation is one missing dimension: the direction of the flow. I'm not sure given any pair of states A and B, whether the net migration is into A or into B.

***

I propose a solution using the map while preserving the interactive element of the original.

On this map, when you hover over a particular state, it highlights all other states for which there are migration flows into or out of that state. For color, use a blue-white-red scheme with blue indicating net inflow, red indicating net outflow, and white for near-zero flows. Include a legend.

Another important decision for the designer is absolute versus relative scales. In an absolute scheme, you rank the entire set of flows for all pairs of states; obviously, the resulting colors would be influenced by the state populations. Alternatively, you rank the flow sizes within each state; in this case, the smaller states will feel exaggerated.
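Here is a minimal sketch of the net-flow coloring under both scaling choices. The flow counts are invented, the helper names are hypothetical, and for simplicity the sketch scales by the largest net flow rather than ranking the flows. Each partner state gets a value between -1 (deepest red, net outflow from the hovered state) and +1 (deepest blue, net inflow), with 0 mapping to white.

```python
# Hypothetical migration counts: flows[(origin, destination)] = people moved.
flows = {
    ("CA", "TX"): 60_000, ("TX", "CA"): 40_000,
    ("CA", "VT"): 900,    ("VT", "CA"): 1_000,
    ("TX", "VT"): 300,    ("VT", "TX"): 800,
}

def net_inflow(flows, focus, other):
    """Net number of people moving into `focus` from `other` (negative = net outflow)."""
    return flows.get((other, focus), 0) - flows.get((focus, other), 0)

def partners_of(flows, focus):
    """All states exchanging migrants with `focus`."""
    return {a if b == focus else b for a, b in flows if focus in (a, b)}

def color_values(flows, focus, scale):
    """Map each partner to [-1, 1]: +1 deepest blue (inflow), -1 deepest red (outflow)."""
    return {p: max(-1.0, min(1.0, net_inflow(flows, focus, p) / scale))
            for p in partners_of(flows, focus)}

# Absolute scale: one denominator for the whole map (largest net flow anywhere).
absolute_scale = max(abs(net_inflow(flows, a, b)) for a, b in flows)
# Relative scale: the largest net flow touching the hovered state only.
relative_scale = max(abs(net_inflow(flows, "VT", p)) for p in partners_of(flows, "VT"))

absolute = color_values(flows, "VT", absolute_scale)
relative = color_values(flows, "VT", relative_scale)
print("absolute:", absolute)
print("relative:", relative)
```

With the absolute scale, the small state's partners are all close to white because the large states' flows set the scale; with the relative scale, the same flows use the full red range, which is the exaggeration described above.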

The map has the additional advantage of showing the approximate distance (and direction) moved, which, for me, is a useful piece of information.