Standardization by default
Nov 16, 2005
Of the myriad abuses of statistics, standardization of variables is one that is not discussed enough, so I address it here.
This cautionary tale is inspired by a misleading comment in my first post on the suicide data. In presenting the nine scatter plots, I had said:
"The x-axis represent locations; the y-axis represent the number of suicides at that location -- but on a standardized scale. The standardized scale allows us to compare across graphs."
Later, I realized I had applied standardization by default. When faced with several variables on different scales, we often standardize, i.e., put them onto the same scale, before making comparisons. Sometimes this strategy backfires...
A typical standardization procedure involves two steps: 1) centering, i.e., shifting the average to 0; 2) scaling, i.e., expressing the data in units of standard deviation. The following table shows summary statistics for Dataset 9 of the suicide data in raw form, after centering, and after standardizing.
Statistic | Raw | Centered | Standardized |
Average | 21.6 | 0.0 | 0.0 |
Std Dev | 11.2 | 11.2 | 1.0 |
Max | 61.0 | 39.4 | 3.5 |
Min | 4.0 | -17.6 | -1.6 |
Range | 57.0 | 57.0 | 5.1 |
The effect of centering (second column) is to move the average from 21.6 to 0.0 while keeping the spread of the data (see standard deviation or range) the same. Scaling squeezes the spread, shortening the range from 57.0 to 5.1 (third column). In effect, standardization puts the data onto a new scale: 0 in the new scale is 21.6 (the average) in the old scale; 1 in the new scale is the old average + 1 unit of standard deviation, or 21.6 + 1x11.2 = 32.8; 2 indicates 21.6 + 2x11.2 = 44.0, etc.
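The arithmetic above can be checked with a minimal sketch. It uses only the average and standard deviation reported in the table; the function names `standardize` and `to_raw` are my own, chosen for illustration.

```python
# Summary statistics for Dataset 9, taken from the table above.
MEAN = 21.6
STD_DEV = 11.2


def standardize(x, mean=MEAN, sd=STD_DEV):
    """Center (subtract the average), then scale (divide by the std dev)."""
    return (x - mean) / sd


def to_raw(z, mean=MEAN, sd=STD_DEV):
    """Map a value on the standardized scale back to the raw scale."""
    return mean + z * sd


# Reproduce the Max and Min rows of the table.
for label, raw in [("Max", 61.0), ("Min", 4.0)]:
    centered = raw - MEAN
    print(f"{label}: raw={raw:.1f} centered={centered:.1f} "
          f"standardized={standardize(raw):.1f}")

# Translate points on the new scale back to the old scale.
for z in (0, 1, 2):
    print(f"{z} on the new scale = {to_raw(z):.1f} on the old scale")
```

The two functions are inverses of each other, which is another way of saying that standardization only relabels the axis for a single dataset; the trouble starts when several datasets are each relabeled differently.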
Observe that standardizing (or more precisely, scaling) compresses the data spread: every standardized dataset ends up with a standard deviation of 1, so any difference in spread between datasets is artificially erased. It turns out that to show that the suicide locations are not random, we need to discover that Dataset 9 has significantly larger spread than the other datasets (see here and here). By standardizing the variables, I had inadvertently thrown away this key feature!
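To see how this happens, here is a small sketch using made-up numbers. These are NOT the suicide data: `narrow` and `wide` are hypothetical samples I generate with the same average but deliberately different spreads, and each is standardized separately, as I had done with the nine datasets.

```python
import random
import statistics


def standardize(values):
    """Center on the mean, then divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical data, NOT the suicide data: two samples with the same
# average but deliberately different spreads.
rng = random.Random(0)
narrow = [rng.gauss(20, 4) for _ in range(30)]
wide = [rng.gauss(20, 11) for _ in range(30)]

# Spread of each sample before and after standardizing it on its own.
raw_sds = [statistics.stdev(d) for d in (narrow, wide)]
std_sds = [statistics.stdev(standardize(d)) for d in (narrow, wide)]

print("raw std devs:         ", [round(s, 1) for s in raw_sds])
print("standardized std devs:", [round(s, 1) for s in std_sds])
```

On the raw scale the second sample is visibly more spread out; after standardization both report a standard deviation of 1.0, so a plot of the standardized values can no longer show which sample was the wide one.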
Proof 1: (cumulative distribution) Recall the red staircase line is Dataset 9 (real data); all other lines are random data.
Proof 2: (boxplot) The rightmost boxplot summarizes Dataset 9.
In both cases, Dataset 9 does not stand out in the plot of the standardized data.
So my advice is: never standardize by default; always understand how standardizing changes the distribution.
A recap of the series exploring the nature of randomness:
The Problem
The Data
First Analysis
The Boxplot
Standardization
You can uncover the trail of my mischief by looking at the axes of the various graphs in this series. When the scale is from -3 to 3, I was using standardized data; when it is from 0 to 60, I was using raw data.