

You made one very common error in your discussion: the precision of a survey-based estimate has almost _nothing_ to do with what proportion of the population is being sampled (as long as you are not sampling almost the entire population). I am sure you know the soup-tasting analogy. So the wide margin of error of the estimates is not because 140,000 is a small proportion of all the businesses, but because the business-to-business variability of the change in the number of employees is large.


In discussing the confidence interval you state:

"What this means is that when they report -54,000, what they actually mean is that any number between +46,000 and -154,000 is consistent with the data that was observed. So in fact, the statisticians have no idea whether employment grew or shrank in August."

I think you are stretching it a bit to say that statisticians have "no idea" whether employment grew or shrank. They do have an idea: their best guess is that employment shrank by 54,000 jobs. Yes, their sample was noisy, and even if the true value were 0, they would get sample estimates like this one more than 10 percent of the time, but I think it is a statistically valid claim to state "our data suggest it is more likely than not that employment shrank."
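Aaron's "more than 10 percent of the time" figure can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes the sampling distribution is approximately normal, and it derives the margin of error from the interval quoted above (-154,000 to +46,000, i.e. +/-100,000 around the -54,000 estimate) rather than from any official documentation.

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

# The quoted 90% interval runs from -154,000 to +46,000, i.e. a
# margin of error of +/-100,000 around the -54,000 point estimate.
moe = 100_000
z_90 = 1.645                 # two-sided 90% critical value
se = moe / z_90              # implied standard error, about 60,800

# If the true change were 0, how often would the survey produce an
# estimate at least as negative as the observed -54,000?
p = normal_cdf(-54_000 / se)
print(round(p, 3))           # roughly 0.19, i.e. more than 10% of the time
```

So under these assumptions, a true change of zero would yield an estimate this negative nearly one time in five, consistent with Aaron's figure.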


Aniko: You read a lot more into that sentence than I intended - and I realize that what I wrote could be misleading, so thanks for bringing this up. If everyone in the population were surveyed, then we would have complete information and there would be no sampling error. If we can only collect partial information, then the larger the sample, the smaller the error. But past a certain point, increasing the sample doesn't reduce the error enough to matter, which is why we say the proportion sampled doesn't matter. Hope that clarifies it.

I also want to qualify your statement that the large error is due to variability rather than sample size. The sample size is chosen to filter out a certain level of noise (conversely, to read a certain level of signal); if the survey had been designed to detect changes in employment as small as 10,000, the statisticians would have called for far more than 140,000 businesses to be surveyed.
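To put a rough number on that, here is a sketch under the textbook simple-random-sampling assumption that the standard error shrinks with the square root of the sample size. The actual survey design is more complicated, so treat the result as an order-of-magnitude figure; the 140,000 and 10,000 inputs come from the comment above, and the 100,000 margin of error from the quoted interval.

```python
# Under simple random sampling, the standard error scales as
# 1/sqrt(n), so halving the margin of error takes 4x the sample.
current_n = 140_000      # businesses currently surveyed
current_moe = 100_000    # margin of error implied by the quoted interval
target_moe = 10_000      # precision needed to "read" a 10,000-job change

required_n = current_n * (current_moe / target_moe) ** 2
print(f"{required_n:,.0f} businesses")   # 14,000,000 businesses
```

A tenfold gain in precision costs a hundredfold increase in sample size, which is why no one designs the survey that way.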

Aaron: A margin of error comes at a certain level of confidence - in this case, 90%. A statement like "our data suggest it is more likely than not that employment shrank" is valid ONLY if we accept a lower confidence level. 90% is already a lower threshold than is typically used, so one must be careful when issuing such statements.

I have a fundamental objection to this line of thinking: it is equivalent to saying that when the sampling error is large, we should ignore the variability and use the point estimate (the average value) as the most likely value. It is precisely when the sampling error is large that we must pay attention to it. Otherwise, we might as well declare all the research on confidence levels and margins of error useless!
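The confidence trade-off Kaiser describes can be made concrete. The sketch below again assumes a normal sampling distribution with the standard error implied by the quoted 90% interval, and recomputes the interval around -54,000 at lower confidence levels to see when it first excludes zero.

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def z_from_conf(conf, tol=1e-8):
    """Two-sided critical value, found by bisection on the CDF."""
    lo, hi = 0.0, 10.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if 2 * normal_cdf(mid) - 1 < conf:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

estimate = -54_000
se = 100_000 / 1.645      # standard error implied by the 90% interval

for conf in (0.90, 0.80, 0.62):
    z = z_from_conf(conf)
    low, high = estimate - z * se, estimate + z * se
    verdict = "excludes 0" if high < 0 else "includes 0"
    print(f"{conf:.0%} interval: [{low:+,.0f}, {high:+,.0f}] -> {verdict}")
```

At 90% and 80% the interval straddles zero; only around 62% confidence (equivalently, a two-sided p-value of about 0.37) does it exclude zero - which is why "more likely than not" is a far weaker claim than the usual reporting standard.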


In Australia they include a trend line, but most commentators ignore it. I'm going to use the accuracy of unemployment figures as an example in a basic stats course this semester.

Kaiser Fung. Business analytics and data visualization expert. Author and Speaker.