Overall, everything you say makes perfect sense!

One small follow-up question: setting aside whether the base rates are correct, and the lack of attention to the variation of those base rates, how do you know that a 1% or 2% difference in click rates is "tiny"? I don't know anything about e-mail marketing, but in direct (snail) mail marketing, a 1% difference in response rate is considered quite meaningful.


Stephanie: it depends on your reference point. A 1% difference on top of 85% is likely to be noise, but 1% on top of 5% is probably meaningful. One good way to gauge the underlying variability is to look at how much the rate moves around weekly, monthly, and so on.
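A minimal sketch of that reference-point arithmetic, plus one way to eyeball the historical variability (the weekly click rates below are invented for illustration):

```python
import statistics

# The reference-point arithmetic: a 1-point gap is a tiny relative change
# at an 85% base rate, but a large one at a 5% base rate.
for base in (0.85, 0.05):
    print(f"base = {base:.0%}: a 1-point gap is a {0.01 / base:.1%} relative change")

# One way to gauge underlying variability: the spread of historical rates.
weekly_rates = [0.052, 0.048, 0.055, 0.047, 0.050, 0.044, 0.058]
print(f"weekly sd = {statistics.stdev(weekly_rates):.3f}")
# If a 1-point gap is within roughly two sd of week-to-week movement,
# it is hard to distinguish from noise.
```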


I still don't see how you could know what is "tiny" unless you have actual information on the variance of these particular data. Some data are quite stable, in which case a 1% change may well not be noise. It seems like you are guessing (perhaps in a reasonable, educated way, if you have experience with this type of data), but who knows? I would not assume one way or the other myself. For example, suppose most people have very predictable behavior and are either routine clickers or routine non-clickers. If there is something particular about a certain e-mail provider's setup that nudges a subset of people to switch behavior, then there's a story in which 1 or 2% isn't noise.


How would I know? It's what I call numbersense. I wrote an entire book about it.
I'll just mention one aspect of it here. Go read the book for the rest.
There is no such thing as certainty in statistics. Every analysis, including the one derived from "actual" data, is part opinion.

Adam Schwartz

A philosophical question on approach here. Part of the false sense of confidence in the data is supported by the way confidence intervals are created, no? Assuming we're using sqrt(p(1-p)/N), as N grows the confidence interval must shrink, lending an appearance of statistical significance even if the practical significance of a 1% difference in click rate is small. This assumes any test for statistical significance was done, of course.

Would you advocate a sampling approach, rather than N=All, to counteract this? Setting aside the "design" of the study (using someone else's data, lack of control), when is a lot of data a good thing, and when is it simply misleading? Or is a lot of data only misleading because of the "design"?
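To see the shrinking-interval effect numerically, here is a minimal sketch (the click rates and group sizes are assumed for illustration):

```python
from math import sqrt

# A fixed 1-point gap between two hypothetical click rates becomes
# "statistically significant" once N is large enough, whatever its
# practical importance.
p1, p2 = 0.05, 0.06  # assumed click rates for two providers
for n in (500, 5_000, 500_000):  # recipients per group
    p_pool = (p1 + p2) / 2
    se = sqrt(p_pool * (1 - p_pool) * (2 / n))  # pooled two-sample standard error
    print(f"N = {n:>7,} per group: z = {(p2 - p1) / se:.1f}")
# Prints z = 0.7, 2.2, 21.9 -- the same gap crosses the conventional
# 1.96 threshold purely because N grew.
```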



Subsampling, as you said, is something I do routinely. There is a better method, though it is not always feasible. What those formulas do is use a stochastic model to estimate the systematic variance. No one really checks whether that estimate is accurate. When N is very large, as you noticed, the estimate is almost surely wrong. The solution is to compute empirical estimates of your own systematic variance; that's why I talked about looking at your historical fluctuations. Box, Hunter & Hunter (1st edition) covers this even before discussing the usual formulas.
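The contrast in miniature, as a sketch (the historical rates and the N are invented for illustration):

```python
import statistics
from math import sqrt

# Estimate the noise level from the metric's own history instead of
# from the binomial formula.
monthly_rates = [0.051, 0.062, 0.048, 0.057, 0.044, 0.059, 0.053, 0.047]
empirical_se = statistics.stdev(monthly_rates)

p, n = statistics.mean(monthly_rates), 1_000_000
formula_se = sqrt(p * (1 - p) / n)  # the textbook stochastic-model estimate

print(f"empirical se = {empirical_se:.4f}")  # ~0.006: the rate really moves this much
print(f"formula se   = {formula_se:.6f}")    # ~0.0002: implausibly tiny at huge N
```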

Jake S

All great points to consider when looking at any email analysis.
One way to combat cross-time changes for provider comparisons (or any other "segmentation") would be to have a random control group; see the sketch below. The lift metric would then be insulated from whatever Gmail or other providers do, since both groups are affected alike. MailChimp would probably never recommend this approach because their revenue is tied to volume. Lifts in product sales from email will be small (though likely still positive ROI), which would push their clients toward optimization and better targeting tactics... and that realization among clients is not good for MailChimp's business.
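A minimal sketch of the holdout calculation, with hypothetical counts:

```python
# Randomly withhold the email from a control group and measure lift as the
# difference in purchase rates. A provider-side change (say, Gmail's
# filtering) hits both groups alike, so it cancels out of the lift.
treated_buyers, treated_n = 1_200, 100_000  # received the email
control_buyers, control_n = 1_050, 100_000  # randomly held out

lift = treated_buyers / treated_n - control_buyers / control_n
print(f"incremental purchase rate = {lift:.3%}")  # 0.150%: small, but may still be positive ROI
```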
