Transformations and regressions

Mahalanobis has a fascinating post on the role of transformations (square roots, logarithms, inverses) in presenting data.  He cited Howard Wainer's book.  Howard taught me intro stats when he was visiting Princeton, and he was the one who first exposed me to data-ink ratios for which I am forever thankful.

I reproduce one of the graphs here.  It shows the relationship between the price of a convertible and the rank order of its price against 46 other convertibles.  So the Toyota Paseo is the 2nd cheapest car among the 47 and it was priced at about US$17,000.

1. How do transformations work?

Transformations are often used by statisticians to "linearize" relationships.  This log price chart from Howard's book is a piece-wise linear curve; it is not linear because of unexpectedly large price gaps between one convertible and the next higher-ranked one.  For example, no other convertible on the market has a sticker price higher than the Chevrolet Corvette's and lower than the Dodge Viper's.

On the chart, if we really want to have one line instead of four line segments, we need to compress those large price gaps.  That is exactly what logarithms and inverses do: they penalize large jumps.   Thus, when Howard applied inverses to the same data, he got a straight line because inverses impose more severe penalties than logarithms.
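
To see the compression at work, here is a minimal sketch.  The prices are made up for illustration (they are not the figures from Howard's chart); the last two are separated by a deliberately large gap.  The code compares the largest gap with the smallest one on the raw, log and inverse scales.

```python
import math

# Hypothetical sticker prices in US$, NOT the data behind the chart.
prices = [17_000, 20_000, 25_000, 45_000, 70_000]

def gaps(values):
    """Absolute differences between consecutive values."""
    return [abs(b - a) for a, b in zip(values, values[1:])]

def last_to_first(gap_list):
    """How big the top gap is relative to the bottom gap."""
    return gap_list[-1] / gap_list[0]

raw_ratio = last_to_first(gaps(prices))
log_ratio = last_to_first(gaps([math.log10(p) for p in prices]))
inv_ratio = last_to_first(gaps([1.0 / p for p in prices]))

for name, ratio in [("raw", raw_ratio), ("log10", log_ratio), ("inverse", inv_ratio)]:
    print(f"{name:8s} last gap / first gap = {ratio:.2f}")
```

With these made-up numbers the ratio shrinks from the raw scale to the log scale, and shrinks again on the inverse scale, which is the sense in which inverses penalize large jumps more severely than logarithms.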

2. Linear regression models

Linear regression is an extremely powerful invention but it would not have been so potent without the aid of transformations.  By using logarithms and inverses, the class of models has been expanded to non-linear ones.  For example, if the relationship between Y and Log(X) is linear, then that between Y and X is clearly non-linear.  In a sense, calling them "linear regression models" under-estimates their power.
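
A small sketch of the idea, using synthetic data generated from Y = 2 + 3 Log(X) (the numbers and function names are mine, purely for illustration).  An ordinary least-squares line fitted to Log(X) recovers the relationship exactly, while the same line fitted to X itself leaves large residuals.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def sum_sq_residuals(xs, ys, slope, intercept):
    """Sum of squared vertical distances from the fitted line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Synthetic data: y is exactly linear in log(x), hence curved in x.
xs = [1, 2, 5, 10, 50, 100]
ys = [2 + 3 * math.log(x) for x in xs]
log_xs = [math.log(x) for x in xs]

slope_log, intercept_log = fit_line(log_xs, ys)
slope_raw, intercept_raw = fit_line(xs, ys)

ssr_log = sum_sq_residuals(log_xs, ys, slope_log, intercept_log)
ssr_raw = sum_sq_residuals(xs, ys, slope_raw, intercept_raw)

print("fit on log(x):", slope_log, intercept_log, ssr_log)
print("fit on x     :", slope_raw, intercept_raw, ssr_raw)
```

The fitting machinery is the same in both cases; only the predictor was transformed.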

3. Abuse of regressions

Unfortunately, this price-rank relationship represents a poor use of regression.  Conventionally, the predictor variable is plotted on the x-axis.  If that is so, then the regression solves this type of problem: if I want my convertible to have the 5th highest price in the market, at what price should I sell the car?  Not very useful, eh? [With the inverse transformation, the fitted line is 1/price = slope × rank + intercept, so the answer is: 1 / (slope × 5 + intercept).]
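
For the record, the bracketed formula is just the fitted line run backwards.  A minimal sketch, assuming the line was fitted on the inverse scale; the coefficients below are hypothetical placeholders, not estimates from the actual chart.

```python
def price_for_rank(rank, slope, intercept):
    """Invert the fitted line 1/price = slope * rank + intercept."""
    return 1.0 / (slope * rank + intercept)

# Hypothetical coefficients, chosen only so the numbers look plausible.
slope, intercept = -1.0e-6, 6.0e-5

predicted = price_for_rank(5, slope, intercept)
print(f"predicted price at rank 5: {predicted:,.0f}")
```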

The temptation to draw a straight line through any set of points has led many astray.  Be vigilant.

4. An easy improvement

If we release ourselves from the regression paradigm, then this graph can be made more readable by flipping the axes.  In so doing, the names of the convertibles can be read horizontally, or more naturally.  The data remain as they are, even the line can stay where it is, if desired.

Reference: "The Power of Transformations", Mahalanobis, August 17, 2005.




In Praise of the Bumps Chart III

Many authors have exposed and harangued statistical liars (e.g. "How to Lie with Statistics").  Likewise, I rant here once in a while.  However, not every distortion of reality is unwarranted.  Sometimes, distorted data actually bring out key insights.  I go back to the Bumps chart to illustrate this point.

In a previous post, I remarked that the vertical axis can represent either ranking or boat locations along the river.  Reading the chart from left to right as if from start to finish of the race, I suggested that the right-side list displays the ending ranks or ending locations of the boats.

On second thought, the right-side list cannot give us the ending locations!  Physically, the boats would have moved downstream so the entire list needs to be shifted downwards to be precise.  But we feel comfortable with the current arrangement: this is a distortion of reality which does not affect our reading ability.  Indeed, it enhances our ability to see into the data because now a horizontal line means no change in ranks.

If one is very particular, then one should interpret the right side as next year's starting locations rather than the current year's ending locations.  Then all is well.

In many situations, reducing continuous data to ranks introduces significant distortion and is thus not advisable.  For the Bumps chart, because the Bumps rules require that all boats start next year the same distance apart, in essence wiping out the year-end separations, the form perfectly fits the function!  This distortion removed information not needed to grasp the key point of the chart, so no harm done!

As a side note -- Tim Granger has produced a side-by-side Bumps chart, even more marvellous than the single-period chart.  In my junkart version, I removed the horizontal line segments linking one year to the next.  These line segments contain no data; besides, based on the discussion above, each right vertical axis should be interpreted as next year's starting locations rather than this year's ending locations, so these line segments are unnecessary.

PS. In case you're wondering, Tim colored some lines red to indicate boats that managed to bump up each of the four days in a specific year.  These teams win an award called the "blades".  If the purpose of the chart is to identify the rise and fall of boat club dynasties then we would have colored the trajectory of Pembroke (6) and Queens (19), for example.


Industry sector innovation indices

Here is a chart from MIT's Technology Review and a junkart version:

[Figure: the Technology Review dot chart and the junkart line chart]

These are both great charts.  As always, it's important to marry form with function.  If one wants to read off the sector ranks, the dot chart works better; if one wants to focus on the change in ranks, the line chart works better.  If one wants to track sectors as they change over a longer period of time, the line chart works its magic: we can just stack a bunch of them next to each other.

The headline identifies 6 "improving" sectors.  This is difficult to see in the dot chart because the reader needs to associate orange with 2003 and yellow with 2004, regardless of which color appears on the left.  In the line chart, the improving sectors are the ones with lines going up; I colored them blue for clarity.

Moving from the graphical to the statistical, I have major problems with the creation of this "innovation index".

The Innovation Index is calculated by combining, with equal weights, 2004 R&D spending rank, percent change in R&D spending, absolute change in R&D spending, and R&D spending as a percentage of sales.

It is unclear why those variables (including ranks, percentages and dollars) were combined with equal weights.  And in fact, absolute change in R&D spending probably dominated the ranking since it has the largest and widest scale.
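
Here is a sketch of how the widest scale can take over an equal-weight sum.  The four companies and their figures are invented, I left out the rank component to keep it short, and I don't know exactly how Technology Review combined the variables, so treat this as an illustration of the concern rather than a reconstruction of their index.  Standardizing each variable to z-scores is shown only as one possible way to put the variables on a common footing.

```python
from statistics import mean, pstdev

# Hypothetical companies, NOT Technology Review's data.
# Columns: % change in R&D, absolute change (US$ millions), R&D as % of sales.
companies = {
    "A": (25.0, 40.0, 18.0),
    "B": (20.0, 60.0, 15.0),
    "C": (2.0, 900.0, 3.0),
    "D": (1.0, 700.0, 2.0),
}

def order_by(scores):
    """Company names sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

# Equal weights on the raw values: the dollar column swamps the others.
raw = {name: sum(vals) for name, vals in companies.items()}

# Equal weights after converting each column to z-scores.
columns = list(zip(*companies.values()))
col_stats = [(mean(col), pstdev(col)) for col in columns]
standardized = {
    name: sum((v - m) / s for v, (m, s) in zip(vals, col_stats))
    for name, vals in companies.items()
}

# Ranking by absolute change alone, for comparison.
abs_change_only = {name: vals[1] for name, vals in companies.items()}

print("raw sum order      :", order_by(raw))
print("abs change order   :", order_by(abs_change_only))
print("standardized order :", order_by(standardized))
```

In this toy example the raw equal-weight ordering is identical to ordering by absolute change alone, and it changes once the variables are standardized.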

And then, the sector ranking is the average of the ranks (1-150) of the companies in the specified sector.  This is a useless average because it implicitly assumes that the difference between company #2 and company #1 is the same as the difference between company #150 and company #149.  In other words, an ordinal ranking is treated as an evenly spaced scale without justification.  Besides, some sectors consist of only 3 companies while others contain 28.  If I have time, I will illustrate these points with charts.


Reference: "R&D 2005", Technology Review, August 2005.


Productivity growth in the U.S.

We are led to believe that this chart clinches the case for productivity gains in the US economy as a great, wonderful and continuing phenomenon.  Alas, the chart does not convince because the reference period is arbitrary.  Why use the 1973-1995 trend as the reference, especially since this was a period of abnormally low productivity growth, as the author indicated in the text?  All one can reasonably conclude from this chart is that growth was higher both before and after this period of low growth.

There may be other evidence to support this assertion; being no economist myself, I can't comment on it.

Link: Marginal Revolution: The U.S. economy in a sentence.


The simplest chart: one data series

While on holiday, I picked up this interesting chart in the Argentine paper Clarin, showing the results of a (daily) on-line poll.  This presentation of percentages as a dot matrix of 100 points is the same concept that I described before as a "decile plot".  Not sure why but at clarin.com, a different graphic (the bar chart) was used to present the same data.  The bar chart is perhaps more visually appealing but the dot matrix allows readers to read off actual percentages by counting the dots.

[Figure: the Clarin dot matrix and the clarin.com bar chart]

For comparison, look at the pie and donut charts that many publications would no doubt choke us with if they get their hands on this data.

[Figure: pie and donut chart versions of the same data]

I have already voiced my distaste for pies and donuts, here and here.  OK, so they tell us the percentages add up to 100 and the bar chart doesn't.  But how important is that factoid?  So I still say: never use pies or donuts.

The dot matrix/decile plot has some potential but I'm not sure if it is better than the bar chart.  Here is one possible rendition:

[Figure: a redrawn dot matrix]

  • I would rather not tip the square. (I also dislike the color scheme but have not altered it.)
  • Since I don't care to tell readers they add to 100%, I stacked the groups one over the other so that it is much easier to eyeball the exact percentages.  Because of this, I can omit printing the percentages.  Of course, now the onus is on the graphic designer to make sure there are 100 dots, no more, no less.
  • To reconcile form and function, I left off the decimal point.  When you plot 100 dots, you have made the decision that each 1% is important so why would you then print 47.6% rather than 48%?
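
Getting exactly 100 dots is less trivial than it sounds, because rounding each percentage separately can produce 99 or 101.  One way to guarantee the total is largest-remainder rounding, sketched below.  This is my suggestion, not something the Clarin designers are known to have used, and the poll shares are hypothetical apart from the 47.6% mentioned above.

```python
def dots_for_percentages(percentages, total_dots=100):
    """Allocate whole dots that sum to total_dots (largest-remainder rounding)."""
    scale = total_dots / sum(percentages)
    exact = [p * scale for p in percentages]
    dots = [int(e) for e in exact]  # round down first (values are non-negative)
    leftover = total_dots - sum(dots)
    # Hand the remaining dots to the categories with the largest fractional parts.
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - dots[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        dots[i] += 1
    return dots

# Hypothetical three-way poll result; rounding each share on its own
# would give 48 + 34 + 19 = 101 dots.
shares = [47.6, 33.5, 18.9]
dots = dots_for_percentages(shares)
print(dots, "total:", sum(dots))
```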

The irony is that for one data series, just printing the table is as good as anything.

[Figure: the same data as a table]


NYT's blind(ing) spot(s)

The New York Times is probably the most committed of all media outlets worldwide to using data graphics, and I love them for that.  Most of their charts are of high quality; much chartjunk is avoided.

The NYT does have a blind spot.

Or might we say blinding spots?  Its obsession with bubble charts, bubbles fitted in grids, bubbles overlapping, bubbles bursting out of grids is maddening.  Examples abound:

[Figure: examples of NYT bubble charts]

The bubble chart, a particular fancy of professional consultants, is just behind the pie chart as a useless form of data graphics.  Note I said "data graphics", as bubbles have value, albeit limited, in conceptual diagrams. If one only cares about bigger vs smaller, then bubbles are okay. 

If one is concerned about how much bigger, then bubbles are misleading.  Further, bubbles contain no scale, implicit or explicit; one cannot decide if any given bubble is large or small (relative to what?).  Witness the Costco-Walmart chart above: is bubble "17" big or small?  The chart gives no reference level, neither a range nor an average.

The human brain works linearly, and we tend to grossly misjudge differences in sizes of bubbles.  Blinding spots indeed!
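
The arithmetic behind the misjudgment is simple.  Assuming the bubble's area is what encodes the data (I can't tell from the charts which convention the NYT used), the radius grows only with the square root of the value, so a reader who compares widths will understate the difference.

```python
import math

def radius_for_value(value, reference_value, reference_radius=1.0):
    """Radius that makes the circle's AREA proportional to the value."""
    return reference_radius * math.sqrt(value / reference_value)

def area(radius):
    return math.pi * radius ** 2

# A value four times the reference value...
big_radius = radius_for_value(4.0, 1.0)
# ...gets a bubble only twice as wide, though its area is four times as large.
width_ratio = big_radius / 1.0
area_ratio = area(big_radius) / area(1.0)

print("width ratio:", width_ratio)
print("area ratio :", area_ratio)
```

If the radius were instead drawn proportional to the value, the area would grow with the square of the value and the chart would overstate the difference.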

If you don't believe me, try this test.  In the same section where I found the CEO options chart above, there was an article about the new "Mega" M&M.  See right.

What percent larger is the Mega?   (Click on the question under the image to reveal the answer.)

I wish the NYT editors would take note and put an end to these blinding spots!

Thanks Pius for helping with the mouseover image while I'm travelling.


Beauty versus utility

This graphic, from NYT, was produced for admiration.  Let's stare at it for a minute.

[Figure: NYT map, "Where the Human Footprint is Lightest"]

This is undoubtedly an uncommonly beautiful map, though I rate it below the previously praised Wi-fi Nation map.  What renders it so visually alluring is its high resolution, what Tufte calls a high data-ink ratio.  Indeed, there is almost no wasted ink, every pixel carrying some data.  If we reduce the resolution, the map will for sure look a lot less impressive.  So this is one of the few instances where I'd allow such high data density.

Why is this short of perfect?

By itself, the map does not convey much insight.  If we ignore the text and the legend, pretty much the only message we can read from this map is a tale of two regions: the eastern half is heavily human-influenced, and most of the less-influenced land is in the west.  We will be misled into thinking that the red and the green each take up roughly half the area.  We can see highways and cities but that is not telling us much.  On careful examination, we can make out the Central Idaho Wilderness.

In other words, all the important insights are conveyed via add-ons such as the call-outs to the 4 most pristine areas of the country.  Also the information-laden legend, magnified here:

Through this legend, we learn the amount of wilderness.  Observe that red constitutes less than 20% of the country, yet we thought it covered half of the map.

Make no mistake: this is an extraordinary legend that is a graphic in its own right, a bar chart showing a distribution.

What separates a good graphic from a great graphic is the union of form and function. Here our brain can pick up region-level cues (eastern half vs western half) but the map provides us highly detailed data (municipality-level?), which contribute to the beauty but not our understanding of the data.

Reference: "Where the Human Footprint is Lightest", New York Times, July 31 2005.


A bad graph obliterates your message

A bad graph can be lethal, burying a good message six feet under.  The following graphic appeared with a McKinsey article on outsourcing.  Its message is purportedly that multi-nationals can't find sufficient qualified local talent in low-wage countries -- a message of hope for U.S. job-seekers.

The graphic, however, only manages to express gloom, despite the promising title "Fewer than you'd think".

The angry red lines immediately dominate our attention.  They run off the page, giving an impression of great magnitude.

Meanwhile, the authors focus on the "weighted averages", which are politely labelled in black.


There is a lot of verbiage on this chart, much of it perplexing.  "Weighted average" based on what weights?  "Average for university-educated young professionals" sticks out like a sore thumb from the other categories of occupation.

Interesting phenomena are left unexplained.  Why are employers more reluctant to hire these people as "generalists" than for more demanding jobs?  Why is the bar for the "average" so much shorter than the other bars?

Here is my junkart version, which solves some of the problems:
[Figure: junkart version of the McKinsey chart]

  • I muted the high/low lines, shifting the focus onto the averages.
  • I expanded the scale to 100% so nothing flies off the page, and now the 10-19% averages truly express the headline: fewer than you'd think.
  • I banished "weighted average" to the footnote where it is clearly explained that the averaging is over 28 countries.
  • The legend is placed helpfully next to the data, so the reader does not need to search for it in the corner.

The idea behind this data is extremely simple and we have shown two charts here, one being more effective than the other.

Reference: "Sizing the emerging global labor market", McKinsey Quarterly, 2005 No. 3.
Thanks to Annette for tipping me off to this chart.


Economic Development

Mahalanobis created the following graph to illustrate how different regions of the world compare to OECD countries in terms of economic growth.  (Look here for entire post: Mahalanobis.)

Admittedly, he did this over "two bottles of wine", imposing the blue lines on top of the red "base OECD trajectory".

The key messages that can be discerned from this chart are few:

  • China, India and East Asia have caught up by hundreds of years during the 52 years between 1950 and 2001.
  • Despite fast growth, India and East Asia still lag significantly behind.

After some thought, I came up with this graphic:

Here I have lined up all the regions so that each spans the same horizontal distance of 52 years.

The speed of growth is coded in the gradients of the line segments.

The OECD baseline trajectory is colored in blue so that any region's speed of growth can be contrasted with the baseline.

What additional insights can be drawn from this graphic?

  • The regions are bunched together much more tightly in 2001 than in 1950.
  • East Asia, India and China achieved the most spectacular growth but as of 2001, these regions still occupy 3 out of the last 4 slots.
  • Latin America was the only region to have growth slower than the baseline (smaller gradient).
  • Surprisingly (to me), Africa's speed of development approximated that of the Asian Dragons and NIEs (similar gradients) in the last 52 years.

This set of data presents quite a challenge for visualization.  You're encouraged to submit alternative charts in the comments.