There are lots of myths in data science which are repeated endlessly on social media and the internet. One popular myth says that standardizing variables makes them normal. This is not true!
This myth is found explicitly in the Python documentation for the StandardScaler transformer in the popular scikit-learn package (link):
Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
Let me first show you a quick counter-example, then provide some intuition around the standardization procedure. (Warning: this is a rather long read that is more technical than usual.)
***
(a) I generate 10,000 random numbers between -5 and 7 from a uniform distribution. A uniform distribution means that each value between -5 and 7 has an equal chance of appearing. This is clearly NOT a normal distribution (also known as the Bell curve), in which values cluster around the average - in a uniform distribution, the average value appears just as frequently as any other value in the range.
The following histogram shows the count of values for our 10,000 uniform random numbers, and as expected, we see an almost equal representation of every value in the range.
(b) The average of the 10,000 numbers is 1.0 (rounded to 1 decimal place). This falls unsurprisingly on the midpoint of the range -5 to 7. The standard deviation, which measures the spread of the data, is 3.5 (rounded to 1 decimal place).
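If you want to follow along, here is a minimal sketch of steps (a) and (b) in Python, using numpy and matplotlib; the seed and the number of histogram bins are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)              # arbitrary seed, for reproducibility
x = rng.uniform(low=-5, high=7, size=10_000)      # step (a): 10,000 uniform random numbers

plt.hist(x, bins=50)                              # roughly flat bars, not a Bell curve
plt.title("10,000 uniform random numbers on [-5, 7]")
plt.show()

print(round(x.mean(), 1))                         # step (b): about 1.0, the midpoint of -5 and 7
print(round(x.std(), 1))                          # about 3.5, i.e. (7 - (-5)) / sqrt(12)
```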
(c) Now, I standardize the data, which means subtracting the average value and dividing by the standard deviation. This is what the histogram looks like after standardizing the data. You can immediately see that the standardized data do not look anything like a Bell curve.
(d) The average of the 10,000 standardized numbers is 0, and the standard deviation is 1. So the standardization formula changed the values of the average and standard deviation, but it did not change the shape of the distribution. To see this more clearly, I made the horizontal axis range the same on both charts.
You can see that the middle of the distribution has shifted from 1 to 0. Also, the standardized numbers are squeezed into a smaller range than the original numbers. This reflects dividing each number by the standard deviation of 3.5.
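In code, the standardization of steps (c) and (d) is one line. This sketch re-creates the uniform sample so it runs on its own; scikit-learn's StandardScaler performs the same subtract-and-divide operation.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)
x = rng.uniform(low=-5, high=7, size=10_000)      # the same uniform sample as before

z = (x - x.mean()) / x.std()                      # step (c): subtract the average, divide by the sd

plt.hist(z, bins=50)                              # still flat: shifted and squeezed, not reshaped
plt.title("The uniform numbers after standardizing")
plt.show()

print(round(z.mean(), 1))                         # step (d): 0.0
print(round(z.std(), 1))                          # 1.0
```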
***
Now, let's develop some intuition about this business of standardization, normal distributions, normal probability plots, and so on.
Here is a generic normal distribution, or a Bell curve:
One beautiful thing about normal distributions is that they are completely determined by their average value and their standard deviation. The average pins down the midpoint of the distribution, where the familiar bulge of the distribution sits. The standard deviation controls how widely the data are spread out along the range of values shown on the horizontal axis.
The above normal distribution has average 5.1 and standard deviation 3.0.
We can standardize this distribution, which means subtracting the average and dividing by the standard deviation. As we learned earlier, the effect is to recenter the midpoint to zero, and to fix the standard deviation to 1. Here are the histograms of the original normal variables, and the standardized normal variables.
In this case, the standardized variables do look like a Bell curve. Well, that is because the original data have the shape of a Bell curve to begin with.
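For contrast, here is the same exercise sketched with a normal sample (average 5.1, standard deviation 3.0, as above); the standardized histogram keeps its Bell shape because the original data were Bell-shaped to begin with.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)
y = rng.normal(loc=5.1, scale=3.0, size=10_000)   # normal sample: average 5.1, sd 3.0
zy = (y - y.mean()) / y.std()                     # standardize

fig, axes = plt.subplots(1, 2)
axes[0].hist(y, bins=50)                          # Bell curve centered near 5.1
axes[0].set_title("Original normal data")
axes[1].hist(zy, bins=50)                         # still a Bell curve, now centered at 0
axes[1].set_title("Standardized")
plt.show()
```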
***
Let's introduce a tool that tests whether a given set of data has the shape of the Bell curve. This is called the normal probability plot. (Technically, it's a type of Q-Q plot in which the reference quantiles come from the normal distribution.)
This plot exhibits a straight line if the test distribution is a normal distribution. Deviation from the normal distribution is revealed by bending on the left and right edges of the sequence of dots. In the following chart, I test the set of random uniform numbers from before. You can see the bending which tells us that those numbers are not shaped like a normal distribution.
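In Python, scipy's probplot function draws this plot. Here it is applied to the uniform sample from before, re-created so the snippet runs on its own.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(seed=42)
x = rng.uniform(low=-5, high=7, size=10_000)      # the uniform sample from before

# Normal probability plot: normal data hug the straight reference line,
# while the uniform sample bends away from it at both ends.
stats.probplot(x, dist="norm", plot=plt)
plt.show()
```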
The next normal probability plot asks whether the standardized uniform numbers we generated above follow a normal distribution, and the answer is a resounding no! You can see the bending on both ends of the curve.
Can we salvage this situation by generating more data? No! Here is what the plot looks like when I generate 1 million numbers instead of 10,000. The bending does not go away.
As before, the average value of the 1 million standardized numbers is 0 and the standard deviation is 1 but the shape after standardization is not normal. Standardizing variables does not make them normal.
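A sketch of the larger experiment: with 1 million standardized uniform numbers, the mean is 0 and the sd is 1, yet the bend in the probability plot persists.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(seed=42)
big = rng.uniform(low=-5, high=7, size=1_000_000)
big_z = (big - big.mean()) / big.std()            # standardized sample

print(round(big_z.mean(), 1), round(big_z.std(), 1))   # 0.0 and 1.0

stats.probplot(big_z, dist="norm", plot=plt)      # the bending at both ends remains
plt.show()
```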
***
The standard normal distribution has mean 0 and standard deviation 1. A normal distribution is completely determined by those two statistics. However, the reverse is not generally true! If a distribution has mean 0 and standard deviation 1, it does not mean we have a standard normal distribution, or any normal distribution.
The following uniform distribution in the range [-sqrt(3), sqrt(3)] has mean 0 and standard deviation 1 by construction: the standard deviation of a uniform distribution on [a, b] is (b - a)/sqrt(12), which works out to exactly 1 here. Here is what the histogram looks like:
This does not look anything like "standard normally distributed data", even though it has mean 0 and sd 1. Every value, including the average value, has similar representation.
(Standardizing this distribution does nothing at all since we'd be subtracting the mean of zero, and dividing by 1.)
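A sketch of this construction: a uniform sample on [-sqrt(3), sqrt(3)] has mean 0 and standard deviation 1 without any standardizing.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)
half_width = np.sqrt(3)
u = rng.uniform(low=-half_width, high=half_width, size=10_000)

print(round(u.mean(), 2), round(u.std(), 2))      # approximately 0 and 1
plt.hist(u, bins=50)                              # flat histogram, not a Bell curve
plt.show()
```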
***
Why then do we standardize data? First, note that the average of the standardized data is fixed at zero. Next, the standard deviation is fixed at 1 - the standard deviation is the square root of the average squared distance of the data from the average: thus, by fixing the sd at 1 (relative to the average of zero), we constrain the dispersion of the standardized data. If a particular value strays far from zero, its squared distance inflates the average of squared distances, but that average must come out to exactly 1, and so the range of possible values in the standardized data is restricted.
Standardization places different data sets on the same scale so that they can be compared systematically. It does not turn non-normal data into normal data.
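To illustrate the point about a common scale, here is a small sketch using scikit-learn's StandardScaler on two made-up columns measured in very different units (the column names are purely for illustration). After scaling, both columns have mean 0 and sd 1, but the second column keeps its non-normal shape.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(seed=42)
heights_cm = rng.normal(loc=170, scale=10, size=1_000)            # roughly normal
incomes_usd = rng.uniform(low=20_000, high=120_000, size=1_000)   # uniform, much larger scale
X = np.column_stack([heights_cm, incomes_usd])

X_scaled = StandardScaler().fit_transform(X)       # (value - column mean) / column sd

print(X_scaled.mean(axis=0).round(2))              # both columns: approximately 0
print(X_scaled.std(axis=0).round(2))               # both columns: approximately 1
# The two columns are now on the same scale, but the income column is still
# uniform in shape - standardization changed its location and spread, not its shape.
```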
Interesting - I hadn't realised that.
I guess there must exist another process that would create a normal distribution. Obviously it wouldn't be a simple linear transformation.
Does it have a name?
Posted by: Robert Creamer | 11/11/2019 at 09:15 AM
RC: the other process is called "normalization". One popular transform is to take the square root. You can infer from the normal probability plots that you need a transform that is non-linear, that pulls in the extreme values more severely than the middle values.
Posted by: Kaiser | 11/11/2019 at 01:28 PM