
Transformations and regressions

Mahalanobis has a fascinating post on the role of transformations (square roots, logarithms, inverses) in presenting data.  He cited Howard Wainer's book.  Howard taught me intro stats when he was visiting Princeton, and he was the one who first exposed me to data-ink ratios, for which I am forever thankful.

I reproduce one of the graphs here.  It shows the relationship between the price of a convertible and the rank order of its price against 46 other convertibles.  So the Toyota Paseo is the 2nd cheapest car among the 47, priced at about US$17,000.

1. How do transformations work?

Transformations are often used by statisticians to "linearize" relationships.  This log price chart from Howard's book is a piece-wise linear curve; it is not linear because of unexpectedly large price gaps between one convertible and the next higher-ranked one.  For example, there is no other convertible on the market with a sticker price higher than the Chevrolet Corvette's and lower than the Dodge Viper's.

On the chart, if we really want to have one line instead of four line segments, we need to compress those large price gaps.  That is exactly what logarithms and inverses do: they penalize large jumps.   Thus, when Howard applied inverses to the same data, he got a straight line because inverses impose more severe penalties than logarithms.
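
To see the compression at work, here is a minimal Python sketch.  The prices are made up for illustration (they are not the values in Howard's chart), and `gaps` and `last_to_first` are hypothetical helper names.  It compares the gap at the expensive end with the gap at the cheap end on the raw, log and inverse scales.

```python
import math

# Hypothetical sticker prices in US$, sorted from cheapest to priciest.
# These are NOT the data behind the chart; they only mimic its shape.
prices = [17_000, 20_000, 25_000, 40_000, 80_000]


def gaps(values):
    """Absolute differences between consecutive values."""
    return [abs(b - a) for a, b in zip(values, values[1:])]


def last_to_first(gap_list):
    """How many times larger the top-end gap is than the bottom-end gap."""
    return gap_list[-1] / gap_list[0]


raw_ratio = last_to_first(gaps(prices))
log_ratio = last_to_first(gaps([math.log10(p) for p in prices]))
inv_ratio = last_to_first(gaps([1 / p for p in prices]))

print(f"raw scale:     {raw_ratio:.2f}")
print(f"log scale:     {log_ratio:.2f}")
print(f"inverse scale: {inv_ratio:.2f}")
```

With these made-up numbers, the top-end gap shrinks relative to the bottom-end gap as we move from raw prices to logarithms to inverses, which is the sense in which the inverse is the more severe of the two.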

2. Linear regression models

Linear regression is an extremely powerful invention but it would not have been so potent without the aid of transformations.  By using logarithms and inverses, the class of models expands to include non-linear ones.  For example, if the relationship between Y and Log(X) is linear, then that between Y and X is clearly non-linear.  In a sense, calling them "linear regression models" understates their power.
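
As a rough illustration, the sketch below fits a straight line to Y against Log(X) by ordinary least squares.  The data are synthetic, generated from Y = 2 + 3 Log(X) using natural logs, and `fit_line` is a hypothetical helper written for this post, not a library function.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic data (an assumption for illustration): Y = 2 + 3 * ln(X).
xs = list(range(1, 21))
ys = [2 + 3 * math.log(x) for x in xs]

# The regression is linear in the transformed predictor ln(X) ...
slope, intercept = fit_line([math.log(x) for x in xs], ys)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")

# ... but not in X itself: equal steps in X give unequal steps in Y.
step_1_to_2 = ys[1] - ys[0]
step_2_to_3 = ys[2] - ys[1]
print(f"Y change from X=1 to 2: {step_1_to_2:.3f}")
print(f"Y change from X=2 to 3: {step_2_to_3:.3f}")
```

The fit recovers the slope and intercept used to generate the data, while the shrinking steps show that the same model is curved when viewed against X.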

3. Abuse of regressions

Unfortunately, this price-rank relationship represents a poor use of regression.  Conventionally, the predictor variable is plotted on the x-axis.  If that is so, then the regression solves this type of problem: if I want my convertible to have the 5th highest price in the market, what price should I set for the car?  Not very useful, eh? [The answer is: 1 / (slope x 5 + intercept).]
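
For the curious, the bracketed formula can be sketched in Python as follows.  Everything here is an assumption for illustration: the ten prices are invented so that 1/price falls exactly on a line in rank, rank 1 is taken to be the cheapest car, and `fit_line` and `price_at_rank` are hypothetical helpers.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented data: 1/price = 6e-5 - 1e-6 * rank, with rank 1 the cheapest.
ranks = list(range(1, 11))
prices = [1 / (6e-5 - 1e-6 * r) for r in ranks]

# Regress the inverse price on rank (rank is the predictor on the x-axis).
slope, intercept = fit_line(ranks, [1 / p for p in prices])


def price_at_rank(rank):
    """Undo the inverse transformation: price = 1 / (slope * rank + intercept)."""
    return 1 / (slope * rank + intercept)


print(f"Price needed for rank 5: about US${price_at_rank(5):,.0f}")
```

The model is fitted on the inverse scale, so the last step has to invert the fitted value to get back to dollars.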

The temptation to draw a straight line through any set of points has led many astray.  Be vigilant.

4. An easy improvement

If we release ourselves from the regression paradigm, then this graph can be made more readable by flipping the axes.  In so doing, the names of the convertibles can be read horizontally, or more naturally.  The data remain as they are, even the line can stay where it is, if desired.

Reference: "The Power of Transformations", Mahalanobis, August 17, 2005.

