Experiments with multiple dimensions

Reader (and author) Bernard L. sends us to the Economist (link), where they walked through a few charts they sketched to show data relating to the types of projects that get funded on Kickstarter. The three metrics collected were total dollars raised, average dollars per project, and the success rate of different categories of projects.

Here's the published version, which is a set of bar charts, ranked by individual metrics, and linked by colors.


This bar chart does the job. The only challenge is the large number of colors. But otherwise, it's not hard to see that fashion projects have the worst success rate and raised relatively little money overall although the average pledge amount tended to be higher than average.

The following chart used more of a Bumps chart aesthetic. It dropped the average pledge per project metric, which I think is a reasonable design choice. The variance in pledge amount is probably pretty high and thus the average may not be a good metric anyway. The Bumps format though suffers because there are too many categories and the two metrics are rather uncorrelated, resulting in a spider web. Instead of using colors as a link, this format uses explicit lines as links between the metrics.


The following version combines features from both. It requires no colors. It drops the third metric, while adopting the bar chart format. The two charts retain the same order of categories so that one can read across to learn about both metrics.



PS. Readers want to see a scatter plot:


The overall pattern is clearer on a scatter plot. When there are so many categories, it's a pain to put the data labels on the chart. It's odd that the amount pledged for games is the highest of the categories and yet it has among the lowest rate of being fully funded. Is this a sign of inefficiency?

A chart that stops the story-telling impetus

We all like to tell stories. One device that has produced a lot of stories, and provoked much imagination, is the dual-axis plot showing two time series. Is there a correlation or is there not? Unfortunately, most of these stories are false.

Looking at the following chart (link) showing the home sales and median home price in Claremont over the last six years, one gets the sense that the two variables move in tandem, kind of. Both time series appear to reach a peak in 2006 and a trough in 2011. In 2010, both series seem to be levelling off.

When the designer places two series on the same chart, he or she is implicitly saying: there is an interesting relationship between these two data sets.

But this is not always the case. Two data sets may have little to do with each other. This is especially true if each data set shows high variability over time as in here.


Below is another view of the same data. In order to visualize any year-to-year effect or quarterly effect, I split the data along those dimensions. The year-to-year effect is quite strong although there isn't any interesting pattern. The quarterly effect is not so strong, and as the directions of the paths indicate, this effect is not consistent from year to year.


The scales on each axis are "standardized" meaning 0 is the average value, 1 is one standard deviation above the average, etc. Movements of 1 to 2 standard deviations are not unusual so one can see that almost all values on the chart are within 2 SD.
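For readers who want to reproduce the standardized scales, here is a minimal sketch of the calculation. The numbers are made up for illustration (they are not the Claremont data), and the use of the population standard deviation is my assumption; the sample version is an equally plausible choice.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: 0 is the average, 1 is one SD above the average."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population SD; statistics.stdev would also work
    return [(v - mu) / sigma for v in values]

# Hypothetical quarterly home sales, NOT the actual Claremont figures
sales = [120, 95, 80, 60, 70, 55]
z_scores = standardize(sales)

for raw, z in zip(sales, z_scores):
    print(f"{raw:>4} -> {z:+.2f} SD")
```

After this transformation, both series sit on the same unitless scale, which is what allows home sales and median price to share one plot.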

There just doesn't seem to be a compelling story here. This chart taxes our imagination.

PS. In case you're wondering, this chart is made using Graph Builder in JMP (except for the arrows). I also wish JMP would allow me to use 1, 2, 3, 4 (column data) as my plot objects instead of the standard dots, crosses, etc.

[4/11/2012: Thanks to Ken L. for submitting this chart. Also, Rob Simmon on Twitter points out that the house price data should be inflation-adjusted.]

High-effort graphics

Jon Quinton made a chart for Cancer Research UK, which is quite an eyeful.


The full infographic is here.

Below is a close-up of the key of this chart:


Where would this chart fall in my "return on effort matrix"? It is an extremely high-effort chart; I got tired trying to figure out what all those dimensions mean.

Is it a high-reward or a low-reward chart? It depends on why you're reading the chart. For most readers, I suspect it's low-reward.


In my view, the best charts are high-reward, low-effort. I'd emphasize that by effort, I mean effort by the reader. In general, the effort by the chart designer is inversely proportional to that by the reader.

In some special cases, high-effort charts may have high reward justifying the destruction of some brain cells.

Low-effort, low-reward charts are harmless.

More on the return-on-effort matrix here.


One simple improvement to a chart like this one is to produce separate charts for men and women. Outside academia, it seems to me almost all use cases for this chart would involve only one gender.


Does this need fixing?

Reader John B. tried to improve Minard's famous chart of Napoleon's Russian campaign (link). Ed Tufte declared this chart the best ever.


Here is John's alternative. The biggest change he made was to discard the geography and emphasize the chronology.


John also points us to this page, which includes a number of different re-makes of the famous chart (link).


Drugged-up American graphic

Reader Chris P. found this chart on Visualizing.org, which is one of those sites that invite anyone to contribute graphics to it:


It looks like the designer has taken Tufte's advice of maximizing the data-ink ratio too literally. There are many, many things going on in a tight space, which leaves the reader feeling drugged-up and cloudy.

From a cosmetic standpoint, fixing the following would help a lot:

  • Make fonts 1-2 points larger in all cases, especially the text on the left hand side
  • Use colors judiciously to stress the key data. In this version, the trends, which are more interesting, are shown in pale gray while the raw data, which are not very exciting, are shown in loud red. Just flip the gray with the red. 
  • Rethink the American flag motif: is drug abuse a uniquely American phenomenon? Should data about the American people always be accompanied by the American flag?
  • Separately present in two charts the time-series data on total arrests, and the cross-sectional data (2008)

Also, realize that by forcing the data into the 50-star configuration, one arbitrarily decides that the data should be rounded to 2-percent buckets (see right).
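The arithmetic behind the 2-percent buckets is simple: 100 percent spread over 50 stars means each star stands for 2 percentage points. A rough sketch follows; the two input percentages are hypothetical, chosen only to show distinct values collapsing onto the same number of stars.

```python
def to_stars(percent, n_stars=50):
    """Round a percentage to whole stars; each star is worth 100 / n_stars percent."""
    per_star = 100 / n_stars           # 2.0 percentage points for a 50-star flag
    stars = round(percent / per_star)  # the chart can only show whole stars
    return stars, stars * per_star

# Hypothetical values, not taken from the original graphic
for p in (11.2, 12.9):
    stars, shown = to_stars(p)
    print(f"{p}% -> {stars} stars, displayed as {shown}%")
```

Both inputs end up as 6 stars, so a difference of almost two percentage points disappears from the picture.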

And always ask the fundamental question: what makes this data tick?


As I explored the data, I noticed various arithmetic problems. For example, the arrests by race analysis is itself split into two parts: White/black/Indian/Asian add up to 100 percent and then Hispanic Latino and Latino non Hispanic add up to 100 percent. In some surveys, Hispanics are counted within whites but that doesn't seem to be the case here. The numbers just don't add up.

Also, adding the types of drugs involved does not yield the total number of arrests. Perhaps the category of "others" has been omitted without comment. Now I closed my eyes and proceeded to make a chart out of this.


The new version focuses on one insight: that certain races seem to get arrested for certain drugs. The relative incidence of arrests is not similar among the races for any given drug. Asians and Native Americans appear to have higher proportions of people arrested for marijuana or meth while blacks are much more likely to be arrested for crack.


You're going to need to click on the chart for the large version to see the text.

Doing this chart gives me another chance to plug the Profile chart. We deliberately connect the categorical data with lines. The lines are not meant to mean anything; they are meant to guide our eyes towards the important features of the chart.

One can sometimes superimpose all the lines onto the same plot but the canvas clogs up quickly with more lines, and then a small-multiples presentation like this one is preferred.

We have a temptation to generalize arrest data to talk about drug habits by race but if you intend to do so, bear in mind that arrests need not correlate strongly with usage.

The best way to handle two dimensions may be to not use two dimensions

Guess what the designer at Nielsen wanted to tell you with this chart:

Reader Steven S. couldn't figure it out, and chances are neither can you.

What about...

  • The smartphone (OS) market is dominated by three top players (Android, Apple and Blackberry) each having roughly 30% share, while others split the remaining 10%.
  • The age-group mix for each competitor is similar (or are they?)

Maybe those are the messages; if so, there is no need to present a bivariate plot (the so-called "mosaic" plot, or in consulting circles, the Marimekko). Having two charts carrying one message each would accomplish the job cleanly.


Trying to do too much in one chart is a disease; witness the side effects.

The two columns, counting from the right, contain rectangles that appear to be of different sizes, and yet the data labels claim each piece represents 1%, and in some cases "< 1%".  The simultaneous manipulation of both the height and the width plays mind tricks.

Also, while one would ordinarily applaud the dropping of decimals from a chart like this, doing so actually creates the ugly problem that the five pieces of 1% (on the left column shown here) have the same width but clearly varying heights!

What about this section of the plot shown on the left? Does the smaller green box look like it's less than 1/3 the size of the longer green box? This chart is clearly not self-sufficient, and as such one might prefer a simple data table.

The downfall of the mosaic plot is that it gives the illusion of having two dimensions but only an illusion: in fact, the chart is dominated by one dimension, as all proportions are relative to the grand total.

For instance, the chart says that 6% of all smartphone users are between the ages of 18 and 24 AND use an Android phone. It also tells us that 2% of all smartphone users are between 35 and 44 AND use a Palm phone. Those are not two numbers anyone would desire to compare. There are hardly any practical questions that require comparing them.

Sometimes, the best way to handle two dimensions is not to use two dimensions.


The original article notes that "Of the three most popular smartphone operating systems, Android seems to attract more young consumers." In the chart shown below, we assume that the business question is the relative popularity of phone operating systems across age groups.

The right metric for comparison is the market share of each OS within an age group.

For example, tracing the black line labeled "Android", this chart tells us that Android has 37% of the 18-24 market while it has about 20% of the 65 and up market. 

Android has an overall market share of about 30%, and that average obscures a youth bias that is linear with age.

On the other hand, the iPhone (green line) also has an average market share of about 30% but its profile is pretty flat in all age groups except 65 and up, where it has considerable strength.

Further, the gap between Android and iPhone at the older age group actually opens up at 55 years and up. In the 55-64 age group, the iPhone holds a market share that is similar to its overall average while the Android performs quite a bit worse than its average. We note that Palm OS has some strength in the older age groups as well while the Blackberry also significantly underperforms in 65 and over.

Why aren't all these insights visible in the mosaic chart? It is all because the chosen denominator of the entire market (as opposed to each age group) makes a lot of segments very small, and then the differences between small segments become invisible when placed beside much larger segments.
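The effect of the denominator can be sketched with a toy cross-tabulation. All counts below are hypothetical (they are not Nielsen's figures) and the function names are mine; the point is only that the same cell yields very different percentages depending on what it is divided by.

```python
# Hypothetical user counts (in thousands) by age group and OS; NOT Nielsen's data
counts = {
    "18-24": {"Android": 60, "iPhone": 50, "Other": 50},
    "65+":   {"Android": 10, "iPhone": 20, "Other": 20},
}

grand_total = sum(sum(by_os.values()) for by_os in counts.values())

def share_of_total(age, os_name):
    """Mosaic-style percentage: the cell divided by the entire market."""
    return counts[age][os_name] / grand_total

def share_within_age(age, os_name):
    """Profile-style percentage: the cell divided by its own age group."""
    return counts[age][os_name] / sum(counts[age].values())

for age in counts:
    print(
        f"Android, {age}: "
        f"{share_of_total(age, 'Android'):.1%} of all users, "
        f"{share_within_age(age, 'Android'):.1%} of the age group"
    )
```

In this toy example the older group is small, so its Android cell is under 5% of the whole market even though Android holds a fifth of that age group. That is the kind of difference the mosaic plot squeezes out of view.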

Now, the reconstituted chart gives no information about the relative sizes of the age groups. The market size for the older groups is quite a bit smaller than the younger groups. This information should be provided in a separate chart, or as a little histogram tucked under the age-group axis.



Peek into beauty

This graphic feature is the best from the NYT team yet. I particularly love the two columns on the right, which allow us to see regional differences.  For example, this "New in Town" movie was much more popular in Minneapolis than in any of the other metropolitan areas, and was particularly unwatched in New York.  Also, note the choice of sorting allowed on the top right.

Click here and enjoy!


Reference: "A Peek into Netflix Queues", New York Times, Jan 10 2009.


In an article called "Off the Charts", Floyd Norris wanted to let readers know that unemployment does not hit citizens equally -- it affects some age groups and men/women to differing degrees.

As befits the article's title, he included several charts, from which I extracted the one shown on the right.  At first glance, this seems like a normal chart.

But when one pays attention, one notices that the chart is rather complicated.  This chart is like a piece of modern music, in which the composer allows two voices to jar and talk past one another.

Think of it as a data table vying for attention with a bar chart.  The data table is a cross-tabulation of the change in employment by age and gender.  In this view, the men sit on the left, and the women on the right.

Lurking around is a bar chart, for which the point of zero change sits in the middle.  Positive growth extends to the right, while negative growth points to the left.  The gender labels at the top are irrelevant  for this bar chart: the narrow black bars indicate women, the fat colored ones, men.  The data labels are also irrelevant: see, for example, the 45-54 age group, the label for females, at -2.3, should really be placed on the left side of the middle divide!

Here is how these two charts look, disentangled: (I have converted the bar chart to a dot chart.)




Reference: "Off the Charts: Job losses mount, enduring and deep", New York Times, Nov 14 2009.

Pie Cubed

Omegatron also did away with a set of cascading pie charts on Wikipedia, a particularly ineffective use of this chart type.  Whenever there are more than two or three categories, the necessary use of many colors can really make one's head spin.

Here, the cascade is being used like a log scale, to artificially elevate the small pieces, which unfortunately are also the least significant pieces of the energy pie.  There is no reason for nuclear, bio-mass, hydro and "others" to add to 100% except that the author decides to group them together.  The 41% nuclear or the 41% solar heating in the second and third charts, respectively, have no meaning in the larger context.

In deference to the original author, Omegatron's new version preserves the arbitrary three-level cascade.  He converts to stacked bar charts, which brings out the differences better.

He also sensibly exposes the original proportions rather than the arbitrary relative proportions.  For example, nuclear energy accounts for 6% of the total, not 41% of the arbitrary "others" bucket which in turn contains 14% of the total.
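The conversion from a cascaded percentage back to a share of the total is a single multiplication. Here is a small sketch using the two figures quoted above (41% and 14%); the function name is mine.

```python
def share_of_whole(share_within_bucket, bucket_share_of_whole):
    """Convert a share of a sub-pie into a share of the top-level pie."""
    return share_within_bucket * bucket_share_of_whole

nuclear_within_others = 0.41  # nuclear as a share of the "others" bucket
others_within_total = 0.14    # the "others" bucket as a share of all energy

nuclear_of_total = share_of_whole(nuclear_within_others, others_within_total)
print(f"{nuclear_of_total:.1%}")  # 5.7%, which rounds to the 6% shown in the bar chart
```

The same multiplication would have to be chained twice for the third pie, which is exactly the mental arithmetic the cascade forces on the reader.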

I'd prefer an even cleaner presentation with unstacked bar charts.  This can be done in either one chart with all eleven categories, or in two charts, as shown below.  The two-chart version assumes that the reader has two key questions: alternative energy sources as a proportion of the total, and the mix of different sources within the alternative category.


With the ordinary bar chart, many fewer colors are needed, and there is no need to print out each data point, nor a need to use guides to point to labels and data.  The trouble with the latter is their tendency to draw attention to the least important aspects of the data.

With this further example, I continue to find the Wikipedia rule to discourage text annotations on graphs bewildering.   Such a rule apparently does not apply to data labels, as can be seen here.  Of course, a graph without any labeling of categories is robbed of meaning but if labels can be saved, so should annotations!

Reading comprehension

Note: I am in the middle of a holiday and so posting will be limited.

Andrew posted a pretty chart that caught my attention.  This is the sort of sophisticated chart that rewards careful reading. 


Below is a guide to reading the chart:

  • It is a small multiples chart with the components arranged in two dimensions (income levels, and a race-religion hybrid category).  The top row is a summary of voters of all race-religion groups, broken down by income.  Note that there is no corresponding summary column for voters of all incomes grouped by race-religion.
  • Source of data: 2000 poll but applied to 2008 demographic patterns.  In other words, there is an underlying assumption that opinions have stayed stable within the demographic groups.
  • The chart is in fact three dimensional because each map gives us the geographical (state by state) breakdown.
  • It is useful to figure out the smallest unit of data: in this case, this is the percentage support of federal school vouchers by voters of a given race-religion-income-and-state category.
  • The color scheme is such that red represents highest support and blue lowest support, with pink and purple in the middle
  • It's almost always better to start from the aggregate (that is, the average) and then study variations along different dimensions, and this is how the chart is arranged from top to bottom
  • On the top row, the higher income groups tended to favor vouchers more than lower income groups, with a break point around $75k; even here, the regional differences are significant, with northeast and southwest hotter for vouchers at all income levels
  • As we move from row to row, we realize that the aggregate data hides many disparities.  For example, white Catholics (second row) are more likely to support vouchers regardless of income level while white non-evangelical Protestants (fourth row) are much less likely than average to support vouchers at all income levels.
  • Notice that the statistician (Andrew) has carefully defined the race-religion categories to balance between collapsing subgroups that are distinct and showing too many subgroups so as to cloud the patterns.  That is why there are many more race-religion subgroups that are not shown.  The ones shown are of special interest.  Consider the white Protestants, evangelical vs. non-evangelical (third and fourth rows).  If one were to fix the race, geography and income dimensions, and even fix half of the religion dimension, we still find the two subgroups to be on different ends of the spectrum relating to the voucher issue.  This is why the evangelical or not dimension has been included.
  • The white space is interesting.  Here, the issue faced by the statistician is sparse data when one gets down to multi-dimensional subgroups.  Andrew chose to ignore all such data, which is the wise thing to do.  With so few samples, it is particularly easy to draw bad conclusions.
  • Because of the white space, we get additional information on the spatial distribution of the demographic subgroups.  The black population (at least the voters) is predominantly found in the southeast while Hispanics are in the southwest.  The subgroup of income higher than $150k is essentially all white.  Admittedly, this is a very crude read because we only have two levels (below 2% of state population and above).  Of the colored states, we cannot differentiate between densely populated and not.


Such rich graphics deserve careful reading.  Enjoy!