
Flows and partitions

Andrew M., a new but loyal reader, didn't like the flow charts used by the EPA to illustrate cleantech.  We have had some lively discussions about flow charts before.  The bottom line seems to be that they are difficult beasts to tame, especially when the relationships are complex.  The example shown by Andrew (below) is not particularly horrid in the scheme of things.  It's the abundance of annotations and colors that causes dizziness.


Here's a view of the same data, using a partitioning approach.  The inputs are fixed at 100 units, which I find easier to comprehend, while the original fixed the output at 30 units of electricity and 45 units of heat.  And of course, it is a tremendous service to readers not to have to work out the efficiencies themselves.  Tacitness is a vice, not a virtue, in graph-making.


Reference: "Catalog of CHP Technologies", US EPA Combined Heat and Power Partnership.

Running in the rain

Reader Eduardo is unhappy about the embellishments in this Nikeplus chart of miles run by day; "pretty but misleading," he wrote us to say.  This is a clear case of more is less.


As a data graphic, it doesn't work.  The reflections don't work.  Perhaps Nike wants to remind all you super-dedicated Nano-wearing runners what it's like to run in mist or rain!  To quote Eduardo: "The bars start at -1! I guess it is motivation."  An extra mile for everyone.  The rounded corners make it harder to read the level.

Speaking of bar charts, I want to follow up on an exchange from March.  In that example, we claimed that not starting bars at zero misrepresents the relative lengths of those bars.  The chart showed counts of baseball players implicated in the Mitchell Report, by position.

This distortion arises from taking the same length off each bar regardless of the data.  As a result, the ratios of the lengths between the bars have been changed drastically.

For example, the ratio of P/3B in the top chart is 31/9 = 3.4 but in the bottom chart, it is 23/1 = 23!
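The arithmetic behind the distortion can be sketched in a couple of lines (the counts and the start-at-8 baseline are taken from the example above):

```python
# How starting the bars at 8 instead of 0 distorts the length ratios.
# Counts of implicated players by position: pitchers (P) and third basemen (3B).
p, third_base = 31, 9

# Bars starting at zero: lengths are proportional to the data.
ratio_full = p / third_base                                  # 31/9, about 3.4

# Bars starting at 8: the same length is lopped off every bar.
baseline = 8
ratio_truncated = (p - baseline) / (third_base - baseline)   # 23/1 = 23

print(round(ratio_full, 1), ratio_truncated)
```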


In celebrating the recent trend among "elite" colleges toward lowering the cost of education, the Times printed this chart, the top part of which is shown here.

The three colors represent different levels of aid.  Blue means "grants replace loans"; red means "free tuition"; yellow means "parents pay nothing".  The colleges are grouped by the minimum qualifying income for the blue category.

The whole effect is of a knit.  We shall call this the "knit chart".

I believe a simple data table will do the job nicely.  If any reader has other ideas, please show us your work!

A few points to note about the original:

  • Ordering by the minimum income to qualify for "grants replace loans" is arbitrary, as is alphabetizing colleges within each group
  • Qualifying "at any income level" should be shown to the left of "$40,000 or below" rather than to the right of $100,000.  The current order is such that the qualifying income level increases from left to right, except from $100,000 to "any income", where it falls off a cliff.
  • Qualifying at any income level is better shown as a separate column on the right disconnected from the income scale.  The current configuration devalues the effort spent in making a proper income scale.
  • Too many lines of equal length, and too few yellow and red lines to make the knit chart effective
  • Should the graph cater to parents interested in seeing what aid they qualify for given their income level?  Or should the graph highlight the breadth of aid available at individual colleges?

Reference: "The (Yes) Low Cost of Higher Ed", New York Times, April 20 2008.

PS. The original point about the "any income level" was incorrect as pointed out by Chris below.  I have replaced that with a different issue.

PPS. Matias' version (see comments) is a superb demonstration of the power of data tables, well-applied.   It is clean and simple, and addresses both the questions pointed out in the last bullet point.  The only thing sacrificed was the visual representation of the relative size of the income requirements, which I agree is the least valuable part of the original.  As usual, many thanks to our readers for coming up with great ideas!


Statistical science fiction

Warning: this post is statistics-heavy.

Science fiction is faction (i.e. fact + fiction) before faction exists.  It's taking pieces from science textbooks and mixing in figments of the imagination.  That is what I have in mind when reading a recent article in Target Marketing magazine.

They started with the business problem: if a customer goes directly to the retailer's website to place an order, the retailer cannot know whether that customer read its catalog.  A lot of money is spent creating and mailing glossy catalogs to households.  Marketers believe that catalogs drive such "unmatched" Web orders, but how does one prove such an assertion?

Then they offered a solution:

To see the effects of your catalog mailings on online ordering, run a correlation analysis using Microsoft Excel's Data Analysis Toolpak.

Okay, what variables are to be correlated?

You'll need two data sets: order counts by day for the catalog and unaccounted-for Web orders by day for the same period.

Now what?

What results is a modest table with a handful of numbers, the most important of which is the correlation coefficient, a number between zero and one that indicates the degree to which two variables are linearly related.

Just what the textbook ordered, plus bonus points for noting linear correlation.  The figments of the imagination started creeping in:

To get the real answer to the question: "How much does my catalog drive Web orders?" you must square the correlation coefficient to produce the coefficient of determination -- a measure of the proportion of each other's variability that two variables share.

If, for example, a correlation coefficient of 0.9 say there's a high level of linear relation, squaring the coefficient says that 81 percent of the variability is shared between phone and Web orders.  So, in this example, 81 percent of Web orders are directly related to phone orders.  And if phone orders are driven by the catalog, so must 81 percent of Web orders.

These two paragraphs are complete nonsense.  Allow us to briefly recap key ideas on simple linear regression while we separate fact from fiction.

Fact 1: squaring the correlation coefficient produces the coefficient of determination (more commonly called r-squared).

Fiction 1: squaring this particular correlation coefficient produces nothing of this sort.

Takeaway 1: R-squared measures how well the linear model fits the observed data.  A better-fitting model should produce predictions that are more correlated with the observed values.  In this case, we want the predicted catalog orders to be close to the actual catalog orders.  This correlation is what should be squared, not the correlation between catalog orders and unmatched Web orders.
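Takeaway 1 can be checked numerically.  The sketch below uses simulated (made-up) data, fits a least-squares line by hand, and confirms that r-squared equals the squared correlation between the observed values and the fitted values:

```python
import random

# Toy data, purely for illustration: x = unmatched Web orders per day,
# y = catalog (phone) orders per day, with a noisy linear relationship.
random.seed(1)
x = [random.gauss(50, 10) for _ in range(200)]
y = [2.0 * xi + random.gauss(0, 15) for xi in x]

def mean(v):
    return sum(v) / len(v)

def pearson(a, b):
    # Pearson correlation coefficient, computed from scratch.
    ma, mb = mean(a), mean(b)
    num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    den = (sum((ai - ma) ** 2 for ai in a) *
           sum((bi - mb) ** 2 for bi in b)) ** 0.5
    return num / den

# Fit the simple least-squares line y = b0 + b1 * x.
mx, my = mean(x), mean(y)
b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) /
      sum((xi - mx) ** 2 for xi in x))
b0 = my - b1 * mx
fitted = [b0 + b1 * xi for xi in x]

# R-squared from the variance decomposition: 1 - SSE/SST.
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
sst = sum((yi - my) ** 2 for yi in y)
r_squared = 1 - sse / sst

# It matches the squared correlation of observed vs fitted values.
assert abs(r_squared - pearson(y, fitted) ** 2) < 1e-9
```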

Fact 2: R-squared measures how much of the variability in catalog orders is explained by unmatched Web orders.

Fiction 2: R-squared measures the proportion of "each other's variability that two variables share".

Takeaway 2: In regression analysis, we distinguish between the response variable (catalog orders) and the predictor (unmatched Web orders).  The predictor is used to explain the variability in the response.  There is no such thing as "shared variability" between two variables.  In correlation analysis, the two variables are put on equal footing.   In other words, one cannot start with a correlation analysis and end with a regression output -- only in science fiction.

Fiction 3: R-squared allows us to split the sample into the proportion with a direct relationship and the proportion without.  In this example, it supposedly allows us to conclude that 81% of (unmatched) Web orders are related to phone orders while the remaining 19% are not.

Takeaway 3: As noted under Fact 2, R-squared splits the variance of the response variable into two parts.  It does not split the orders themselves.  R-squared measures the model, not the data.

Fact 4: It is important to specify the underlying logical relationships between the variables under study, and every effort must be made to ensure their validity.

Fiction 4: At the end, we learnt the following logic: a) phone orders are highly correlated with catalog orders (since "your phones ring because you mail catalogs"), so phone orders are the same as catalog orders; b) unmatched Web orders are highly correlated with phone orders, so unmatched Web orders are the same as phone orders; c) catalogs drive phone orders, and so catalogs drive unmatched Web orders.

This mind-bending logic we address in order:

Takeaway 4a: They use "phone orders" as a proxy for "catalog orders" since "phones ring because you mail catalogs".  If that were so, then there wouldn't be any Web orders, and what would be the point of looking for catalogs driving Web orders?  Even worse, an order that came online is an order that did not come through the call center.  So what exactly is Excel correlating?

Takeaway 4b: Completely unrelated things can have high correlation; a famous example is burglaries and full moons. High correlation certainly does not imply equivalence.

Takeaway 4c: Correlations are not usually transitive: I am like Alan because we are both impatient; I am like Alice because we are both talkative; now, Alan is like Alice?
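A quick simulation makes the non-transitivity concrete.  With made-up data where Y = X + Z and X, Z are independent, each of X and Z is strongly correlated with Y, yet X and Z are (nearly) uncorrelated with each other:

```python
import random

# Correlation is not transitive: two variables can each be strongly
# correlated with a third yet be uncorrelated with each other.
random.seed(7)
n = 10_000
x = [random.gauss(0, 1) for _ in range(n)]
z = [random.gauss(0, 1) for _ in range(n)]
y = [xi + zi for xi, zi in zip(x, z)]   # Y = X + Z

def pearson(a, b):
    # Pearson correlation coefficient, computed from scratch.
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    den = (sum((ai - ma) ** 2 for ai in a) *
           sum((bi - mb) ** 2 for bi in b)) ** 0.5
    return num / den

# corr(X, Y) and corr(Z, Y) are both around 0.7; corr(X, Z) is near zero.
print(round(pearson(x, y), 2), round(pearson(z, y), 2), round(pearson(x, z), 2))
```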

In short, this is a great example of "knowing just enough to be dangerous".

Reference: "Making a match", Target Marketing Magazine, March 2008.


Cram it like Koby

You have to gradually build up your gut by eating larger and larger amounts of food, and then be sure to work it all off so body fat doesn't put a squeeze on the expansion of your stomach in competition -- Takeru Kobayashi, six-time champion of the Coney Island hot dog eating contest

Kobayashi is a phenom.  He can stuff down 60 hot dogs or 100 burgers in ten or twelve minutes and show no ill effects.  Ordinary people can't hope to emulate these feats.

Junk Charts sees Kobayashi as a hero; an anti-hero really.  We are ordinary people; we can't hope to cram it like Koby.  A message we keep repeating here is: too much data sinks a chart.

Not long after this chart showed up in the Economist, several readers urged us to take a look.  It's a well-nourished chart indeed, one to challenge Kobayashi, but for all that it contains, the reader has to try very hard to find insights.  What with the multiple colors, iron-fisted gridlines, above-and-below boxes, dotted and solid lines, and a legend with nine pieces split across two spots?  Besides, the U.S. boxes grab all the attention by virtue of being wider (the country being more partisan).

The key to unraveling this chart is to identify the relevant comparisons:

  • UK average vs US average
  • UK left vs US left
  • UK right vs US right
  • UK independent vs US independent

And then for the gluttonous:

  • UK right vs US left
  • UK left vs independent vs right
  • US left vs independent vs right

In the junkchart version, we address these comparisons sequentially.

(Apologies for the tiny font.)

We are again using a small-multiples approach that places four comparisons next to each other: average, left, independent, right.  Consistently, the British are to the left of the Americans.  The only places where the two cultures meet are where liberals agree, on "ideology" and "military action".

Also note that we use a symmetric horizontal scale centered at 0.  There are too many charts out there where the center is not at the center!

A similar presentation addresses the other three comparisons.  Democrats in the U.S. are miles to the right of the Tories on "religion".  In the UK, Labour and the Tories are not much different except on "ideology".  In the US, Independents lean closer to Democrats.


Joining the lines (I hear the grumbles) helps bring out the gap between the groups being compared.  Without lines, the chart would look like this.


It is often hard to keep track of which dot is which as they trade order from issue to issue.

PS. Does anyone know what is being measured on the horizontal axis?  The original graph mysteriously stated "respondents' views".


Reference: Eric Talmadge, "Pigout champion Kobayashi limbers up for hot dog gold", June 25 2004.

Reference: "Anglo-Saxon Attitudes", Economist, Mar 27 2008.

Progress and retrogress

Joran E. pointed to this "icky" chart he found on Clive Crook's blog at the Atlantic.

He ordered a "junkchart treatment", so here it comes.

First, we wanted to process the triangles, dots and squares to make sense of this data.  We noted that the data came from a single year (2005), so the chart did not trace the development of the education sector over time.  But wait: it used a different route to get at the same idea.  The author compared different generations within each country to see whether more and more citizens took university degrees.  So each vertical "arrow" is a kind of historical record of the generations within a country.  By this measure, Korea and Japan have come a long way while the US and China stagnated.

The chart is quite impossible to read as designed.  There is little reason to sort by the 25-34-year-old proportion when the message concerns improvement over generations.  Besides, what about countries that apparently retrogressed, like Russia and Germany?

For this data, I returned to my favored bumps chart.  Here is version one.  There are two ways to read this chart.  Across countries, we note that most of the European states (blue) had similar profiles, showing a roughly constant rate of growth.  The Asian duo of Japan and Korea (brown) had the most marked growth.  In North America (black), Canada has diverged from the US since the 35-44 generation.

Alternatively, we can focus on the change generation over generation.  From 55-64 to 45-54, almost all countries in this sample (except Japan) grew at the same rate.  Then between 45-54 and 35-44, the two Asian countries clearly set the pace.  The generation between 35-44 and 25-34 is the most interesting: Korea has not slowed; Japan has slowed a little but still grew as fast as Canada.  A trio of European countries (Spain, Ireland, France) outpaced their neighbors.

Below I show version two.  This one combines bumps chart with small multiples.  North America, Europe and Asia/Australia are now in separate charts.  This removes clutter.



Hanging tough


Reader Nick B. sent in this example calling it "interesting".  The chart tells a compelling story once we figure out what it is.  Grasping the tree structure is key.

It illustrates the important idea that averaging sometimes masks variations in the data.  For example, while the province of Guerrero scored 78% on literacy, the municipalities within Guerrero had scores ranging from 28% to 90%.

It also shows that the gender gap was larger in the less literate Metlatonoc municipality than in the more literate Cuautitlan.

In addition, it tells us that while Mexico on average measured very well on literacy, subpopulations within Mexico spanned the world's best and worst (from about Mali's level to Italy's).

While I find this chart adequate, the pieces hanging off each other did not seem ideal, especially the two overlapping municipality pieces which were placed next to each other.  However, it is tough to come up with an alternative.  Here's one attempt; the changes are mild.

I prefer the horizontal orientation.

The branches are emphasized (as opposed to the "T" junction) because that's a key part of the story.

The national level, especially the span between Mali and Italy, is de-emphasized; I treat it as gridlines.

Instead of placing the overlapping pieces next to each other, I let the ranges literally overlap, which serves to stress this feature.



An embarrassment

I find it embarrassing for the Economist to print an article like this one.  (Do they have a statistics editor?)


The subtitle asserting "causality" is offensive.  It is alleged that smoking bans in bars have "caused" more road accidents because people are forced to drive longer distances to find those bars that still allow smoking.

To assert causality so starkly for an undesigned observational study is unprofessional.  I doubt that the authors of the study they cited even went so far.  At best, they probably found a correlation.

Another problem is the practical significance of the finding.  There is a 13% increase in fatal accident rate in a "typical county containing 680,000 people".  There are two problems with this statement:

  • When I checked the Census data, there are only about 85 counties in the entire U.S. with at least 680,000 people.  What do they mean by "typical"?
  • 13% is said to be an increment of 2.5 fatal accidents, presumably per year.  The crane accident in Manhattan a few weeks ago killed at least five people.  I just don't believe that one can prove definitively that such a tiny difference is not due to chance, so even the correlation, let alone the causality, is suspect.
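To see why an increment of 2.5 accidents is plausibly just noise, here is a back-of-the-envelope simulation.  It assumes (my assumption, not the study's) that yearly fatal-accident counts are roughly Poisson, with a baseline implied by the article's own figures:

```python
import math
import random

# The article's numbers: a 13% increase amounts to 2.5 fatal accidents,
# implying a baseline of about 2.5 / 0.13, roughly 19 per year.
random.seed(0)
baseline = 2.5 / 0.13          # about 19.2 fatal accidents per year
observed = baseline + 2.5      # about 21.7

def poisson_sample(lam):
    # Knuth's multiplication algorithm; fine for small lambda.
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

# How often does pure chance produce a year at or above the "elevated" level?
sims = [poisson_sample(baseline) for _ in range(20_000)]
p_value = sum(s >= observed for s in sims) / len(sims)
print(round(p_value, 2))   # well above 0.05: consistent with chance
```

Under this (admittedly crude) model, a single year's jump of 2.5 accidents falls comfortably within ordinary Poisson variation.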

It appears that the paper is locked up in pre-publication.  If you have seen it, let us know if the authors actually asserted causality.

Reference: "Unlucky Strikes", The Economist, April 3 2008.


Gelman pointed to this Brendan Nyhan post dissecting David Sirota's chart, which purportedly shows a "race chasm" in the Democratic primaries.  The left chart is Sirota's original and the right is Nyhan's revision.

Please see Nyhan for the political interpretation.  Here, I want to note a number of improvements Brendan made to the chart:

  • Sirota plotted the ranks of the percent of black population, which is misleading.  Nyhan plotted the actual percentages on his horizontal axis.
  • Sirota connected the dots which highlighted the noise (ups and downs) in the data.  Nyhan fitted a linear model (he also tried other non-linear versions).
  • Sirota plotted Obama's overall margin of win/loss.  Nyhan plotted his margin among white voters only, which more directly addressed the issue.
  • Nyhan exposed the excluded states in a footnote.  Sirota didn't.  For this chart, this piece of information is very important since so many states were excluded.

Nyhan walked us through the multiple charts he used to explore the data.  Much of the time was spent picking and choosing states to include or exclude.  We learnt that Sirota excluded states with large Hispanic populations, which Nyhan disagreed with; Nyhan wanted to exclude Florida, which Sirota decided against; Sirota excluded Michigan, to which Nyhan consented; Nyhan also wanted to exclude the caucus states; and so on...

Judging from the charts, this picking and choosing appears not to have changed the outcome in this case.  In general, one should exercise great care in such decisions because one might end up seeing what one wants to see.

The following chart is missing from the post; I think it points out something more telling than the negative correlation between Obama's margin among white voters and the proportion of black population.