On my profile, I list "practical statistics" as an interest. This chart on U.S. Grade 8 test scores gives me an opportunity to explain what that means:
The reader is fed (force-fed?) two messages: that there has been a small but detectable improvement in math scores since 1990, and that most of the increases were "statistically significant" (behold the asterisks!).
- By applying the start-at-zero rule (and capping the axis at 500, the maximum possible score), we see immediately that the small changes are irrelevant: the line is almost flat. I haven't checked how the "scale score" is constructed, but surely sub-300 scores out of 500 hardly constitute a record to be proud of.
- Because "accommodations" (providing assistance to certain groups of students who need it) clearly had a positive impact on the scores, and because this effect was not accounted for, a side-by-side comparison of the two periods is misleading and useless. When the dashed line (1990–1996) is removed, the trend is flattened further.
Most destructive to the enterprise known as "statistical testing" is the asterisk next to 278: it asserts that the 1-point increase from 2003 to 2005 is "statistically significant" (at the 95% level). This result makes a mockery of statistics. Clearly, no one cares about a 1-point difference; everyone can agree that it is not practically meaningful*.
If you worked in college admissions and had two candidates, one scoring 278 and the other 279, would you accept the latter and reject the former based on that 1-point difference? And if you further realized that the top score is 500, how would you rate these two candidates?
Practical statistics means not accepting statistical results without first asking whether they are practically meaningful.
Reference: the NAEP site contains a wealth of data and some interesting graphical presentations; worth a look!
* For those interested in the theory behind statistical significance: statisticians distinguish between the true (population) average and the sample (observed) average. In 2003, the average math score was observed to be 278, but the true average is likely to be 278 +/- 0.3, where 0.3 is known as the sampling error. This sampling error reflects our uncertainty about the true average due to random noise (such as measurement errors). Practically, this means that while we observe 278, the true average can be as low as 277.7 or as high as 278.3, or anything in between (most of the time).
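The interval in the footnote can be sketched in a couple of lines of Python. The mean (278, from the chart) and the 0.3 sampling error are the post's illustrative numbers; the 95% coverage is assumed:

```python
# Illustrative numbers from the post: the observed 2003 mean and its
# sampling error (the "+/- 0.3" in the footnote).
observed_mean = 278.0
sampling_error = 0.3

# The approximate 95% interval for the true (population) average:
low = observed_mean - sampling_error
high = observed_mean + sampling_error
print(f"true average likely between {low:.1f} and {high:.1f}")
```

Any observed mean inside that band is consistent with the same underlying true average, which is the whole point of reporting the sampling error alongside the estimate.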
Now, instead of estimating the 2003 score, estimate the difference between the 2005 and 2003 scores. We observe a difference of 1. But the true difference will lie in the interval 1 +/- X, where X is again the sampling error. If X > 1, the interval contains 0, which means that some of the time the true difference can be zero or even negative, so we conclude that the difference is not statistically significant. If X < 1, the difference is statistically significant.
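The rule just described (significant exactly when the interval 1 +/- X excludes 0) reduces to one comparison. A minimal sketch, with the sampling-error values chosen purely for illustration:

```python
def significant(diff: float, sampling_error: float) -> bool:
    """Statistically significant iff the interval diff +/- error excludes 0,
    i.e. the observed difference is larger than the sampling error."""
    return abs(diff) > sampling_error

# The post's example: an observed difference of 1 point.
print(significant(1.0, 1.3))  # X > 1: interval contains 0 -> not significant
print(significant(1.0, 0.6))  # X < 1: interval excludes 0 -> significant
```

Note that the verdict depends entirely on how X compares to the observed difference, not on whether the difference matters in the real world.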
So far, so good... until you realize what factors affect the size of X. One factor is the sample size: the larger the sample, the smaller X. (This is related to the Law of Large Numbers; the estimate of the true mean gets better and better as we collect more data.) So just by increasing the sample size, and thus shrinking X, even very small differences (like 1 point) can become "statistically significant". But as this example shows, even a statistically significant difference can be practically meaningless.
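The sample-size effect can be made concrete with the standard formula X ≈ 1.96 · √2 · σ/√n for the 95% sampling error of a difference between two independent sample means of size n each. The standard deviation of 40 points is a hypothetical figure, not taken from NAEP:

```python
import math

SIGMA = 40.0  # hypothetical standard deviation of individual scores
DIFF = 1.0    # the tiny observed difference from the post

def margin(n: int, sigma: float = SIGMA) -> float:
    """Approximate 95% sampling error for the difference between two
    independent sample means, each based on n observations."""
    return 1.96 * math.sqrt(2) * sigma / math.sqrt(n)

for n in (1_000, 10_000, 100_000):
    x = margin(n)
    verdict = "significant" if DIFF > x else "not significant"
    print(f"n = {n:>7}: X = {x:.2f} -> {verdict}")
```

With these assumed numbers, the same 1-point difference flips from "not significant" to "significant" somewhere between n = 10,000 and n = 100,000, even though nothing about its practical importance has changed.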