Michael Droy

Interesting topic and looking for the next episode.

A little nitpicking:
"For later reference, just remember that 0.5 inch (12 mm) is a big error on this scale. Half an inch is the difference between the median person and the 97.5th percentile person. So our tolerance for inaccuracy is described in small fractions of an inch."

Is this right? The standard deviation for persons (as opposed to samples of 900) is 7.8". So the median to 97.5th percentile should be 15.6".
The 97.5th percentile is where this sample sits relative to all other samples of 900 persons. A person in the sample 0.5 inches above the median would be at about the 52.5th percentile.

And on 7.8/sqrt(900):
From memory, shouldn't this be 7.8/sqrt(899)? (Serious nitpicking!!)


MD: You raised a common point of confusion. To clarify the situation, think about the overall objective of a statistical study. We have a sample of data, and we want to extrapolate from that sample to the unknown population. So we don't know what the SD of the population is. All we have is data on 900 people, with a sample SD of 7.7. The sample SD is not a good estimate of the population SD, which is not surprising since we have 900 people versus 25,000. One of the most magical formulas in all of statistics is the standard error, which measures the variability of the sample average from sample to sample.

Given our objective, the error is defined as how far our sample average is from the population average, therefore, we care about the variability of the sample average, hence the relevant quantity is the standard error.
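
As a rough illustration of the standard error mentioned above, here is a minimal Python sketch. It assumes the usual formula SE = SD / sqrt(n) and plugs in the sample SD (7.7) and sample size (900) quoted in this reply; the function name `standard_error` is made up for the example.

```python
import math

def standard_error(sample_sd: float, n: int) -> float:
    """Standard error of the sample average: SD / sqrt(n)."""
    return sample_sd / math.sqrt(n)

# Values quoted in the reply above: sample SD of 7.7 from 900 people.
se = standard_error(7.7, 900)

# Margin of error as used in this thread: 2 * SE on each side of the average.
margin_of_error = 2 * se

print(f"SE = {se:.3f}, margin of error = {margin_of_error:.3f}")
```

The SE comes out at roughly a quarter of a unit and the margin of error at roughly half a unit, which is where the "0.5 is a big error" framing comes from.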

Jason Kerwin

Does the population the sample is drawn from have an SD of 0.25 inches? That's what the gap between the mean and the 97.5th percentile being 0.5 inches implies.

A quick Google search shows that the SD of human height is 3 inches, so the difference between the mean and the 97.5th percentile is 6 inches. That is consistent with my experience, too - I am around 70 inches tall, and an SD of 0.25 would imply that almost everyone is nearly exactly my height.


JK: The 0.25 inch gap is on the sampling distribution, not the population distribution. So, reversing the SE formula, 0.25 inch * sqrt(900) = 7.5 inches is an estimate of the population SD of heights. I took the population values from this CDC report (Table 8), and assumed a normal distribution on heights.

Instead of "almost everyone is nearly exactly my height", think almost every sample average height (from repeated drawing of 900 people) is nearly exactly the same value as the sample we're looking at.
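
The "repeated drawing of 900 people" idea can be sketched with a small simulation. This is only an illustration under stated assumptions: heights are drawn from a normal distribution with the SD of 7.5 quoted above, they are expressed as deviations from the population mean (so the mean is set to 0), and the number of repeated samples (1000) and the seed are arbitrary choices.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

POP_SD = 7.5        # population SD estimate quoted in the reply above
POP_MEAN = 0.0      # assumption: heights measured as deviations from the population mean
N = 900             # people per sample, as in the thread
NUM_SAMPLES = 1000  # assumption: how many times we repeat the draw

# Draw many samples of 900 people and record each sample's average.
sample_means = [
    statistics.fmean([random.gauss(POP_MEAN, POP_SD) for _ in range(N)])
    for _ in range(NUM_SAMPLES)
]

# Spread of the sample averages versus the SE formula SD / sqrt(n).
spread_of_means = statistics.stdev(sample_means)
theoretical_se = POP_SD / math.sqrt(N)

# Share of sample averages within 0.5 (2 * SE) of the population mean.
share_within_half = sum(abs(m - POP_MEAN) <= 0.5 for m in sample_means) / NUM_SAMPLES

print(f"spread of sample averages: {spread_of_means:.3f}")
print(f"SD / sqrt(n):              {theoretical_se:.3f}")
print(f"share within 0.5:          {share_within_half:.3f}")
```

Individual heights vary with an SD of 7.5, but the sample averages cluster far more tightly, and the large majority of them land within 0.5 of the population mean.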


It has always been difficult for me to understand the limits of regression analysis, so I am very happy that you are writing this series of posts!
Thanks a lot for taking the time to write it and even more to share it with us here!
I am looking forward to reading it!

Michael Droy

I agree that 0.5 is a big error.
It was specifically the following statement:
"Half an inch is the difference between the median person and the 97.5th percentile person."
That seems odd to me.

Enjoying this series.


MD: Let me think this through. SE = 0.25 inch. Margin of error is 2*SE on each side of the mean. 2*SE = 0.5 inch. For a normal distribution, median = mean, and the margin of error covers the middle 95%, spanning the 2.5th percentile to the 97.5th percentile. So from the 50th to the 97.5th percentile is half of that span, which is 2*SE. Did I screw something up?

Jason Kerwin

Kaiser: I think your calculation is right for the percentiles of the distribution of sample-average heights. I am fairly certain it's wrong for the percentiles of the distribution of people's actual heights. 50th to 97.5th percentile for the latter distribution should be 2 SDs, not 2 SEs.
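
The distinction between the two distributions can be checked numerically. The sketch below is only illustrative: it assumes both distributions are normal, uses the SD of 7.5 and SE of 0.25 quoted earlier in this thread, and centers both at 0 because only the gap between percentiles matters.

```python
from statistics import NormalDist

# Assumption: both distributions are normal and centered at 0.
individuals = NormalDist(mu=0.0, sigma=7.5)    # individual heights, SD from the thread
sample_means = NormalDist(mu=0.0, sigma=0.25)  # averages of 900 people, SE from the thread

# Gap between the 50th and 97.5th percentiles in each distribution.
gap_individuals = individuals.inv_cdf(0.975) - individuals.inv_cdf(0.5)
gap_sample_means = sample_means.inv_cdf(0.975) - sample_means.inv_cdf(0.5)

print(f"individuals:     {gap_individuals:.2f} (about 2 SDs)")
print(f"sample averages: {gap_sample_means:.2f} (about 2 SEs)")
```

The half-inch gap only appears in the second line, the distribution of sample averages; for individual people the same percentile gap is about thirty times larger.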


JK: Thank you for persisting. I see what you and MD are complaining about. It's the word "person". I've changed it to sample.
