Comments on MailChimp Gmail study as an example of Big Data studies 2/2
TypePad
2014-01-22T04:22:47Z
junkcharts
http://junkcharts.typepad.com/numbersruleyourworld/
tag:typepad.com,2003:http://junkcharts.typepad.com/numbersruleyourworld/2014/02/mailchimp-gmail-study-as-an-example-of-big-data-studies-22/comments/atom.xml/

Jake S commented on 'MailChimp Gmail study as an example of Big Data studies 2/2'
tag:typepad.com,2003:6a00d8341e992c53ef01a73d7229a9970d
2014-02-07T17:28:08Z
2014-02-09T00:56:32Z
Jake S
http://www.one2onek.com
<p>All great points to consider when looking at any email analysis.<br />
One way to combat cross-time changes for provider comparison purposes (or any other "segmentation") would be to have a random control group. The lift metric would be impacted by Gmail's or other providers' actions independently. MailChimp would probably never recommend this approach because their revenue is tied to volume. Lifts in product sales from email will be small (but likely still positive ROI), which would push their client to consider optimization and better targeting tactics... but that realization in the client is very good for MailChimp's business. </p>

Kaiser commented on 'MailChimp Gmail study as an example of Big Data studies 2/2'
tag:typepad.com,2003:6a00d8341e992c53ef01a51165c63c970c
2014-02-06T15:43:56Z
2014-02-07T05:42:23Z
Kaiser
http://junkcharts.typepad.com/junk_charts
<p>Subsampling like you said is something I do routinely. There is a better method, not always feasible. What those formulas do is to use a stochastic model to estimate the systematic variance. No one really checks whether the estimate is accurate or not. When N is very large, as you noticed, that estimate is almost surely wrong. The solution is to compute empirical estimates of your own systematic variance. That's why I talked about looking at your historical fluctuations. Box, Hunter, Hunter (1st edition) covers this even before they talk about the usual formulas. </p>

Adam Schwartz commented on 'MailChimp Gmail study as an example of Big Data studies 2/2'
tag:typepad.com,2003:6a00d8341e992c53ef01a73d710c45970d
2014-02-06T13:41:32Z
2014-02-07T05:42:23Z
Adam Schwartz
<p>A philosophical question on approach here. Part of the false sense of confidence in the data is supported by the way confidence intervals are created, no?
Assuming we're using sqrt(p(1-p) / N) for the standard error, the confidence interval must shrink as N grows, lending an appearance of statistical significance even if the practical significance of a 1% difference in click rate is small. This assumes any test for statistical significance was done, of course.</p>
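<p>To make the point about sqrt(p(1-p) / N) concrete, here is a minimal sketch of how the interval narrows as N grows. The click rate of 5%, the sample sizes, and the function names are illustrative assumptions, not figures from the MailChimp study, and the 1.96 multiplier assumes the usual normal-approximation 95% interval.</p>

```python
import math


def proportion_standard_error(p: float, n: int) -> float:
    """Binomial-model standard error of a sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1.0 - p) / n)


def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation interval (z = 1.96 for ~95%)."""
    return z * proportion_standard_error(p, n)


if __name__ == "__main__":
    p = 0.05  # hypothetical click rate, for illustration only
    for n in (1_000, 100_000, 10_000_000):  # hypothetical sample sizes
        print(f"N={n:>10,}  half-width={ci_half_width(p, n):.5f}")
```

<p>Each 100-fold increase in N divides the half-width by 10, so with millions of emails almost any difference clears the formula-based threshold, whether or not it matters in practice.</p>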
<p>Would you advocate for a sampling rather than N=All approach to counteract this? Setting aside the "design" of the study (using someone else's data, lack of control), when is a lot of data a good thing and when is it simply misleading? Or is a lot of data only misleading due to the "design"? </p>
<p>Thanks!</p>

Kaiser commented on 'MailChimp Gmail study as an example of Big Data studies 2/2'
tag:typepad.com,2003:6a00d8341e992c53ef01a73d70abfc970d
2014-02-06T01:08:38Z
2014-02-06T03:24:05Z
Kaiser
http://junkcharts.typepad.com/numbersruleyourworld
<p>How would I know? It's what I call numbersense. I wrote an entire book about it.<br />
I'll just mention one aspect of it here. Go read the book for the rest.<br />
There is no such thing as certainty in statistics. Every analysis, including the one derived from "actual" data, is part opinion.</p>

Stephanie commented on 'MailChimp Gmail study as an example of Big Data studies 2/2'
tag:typepad.com,2003:6a00d8341e992c53ef01a5116538de970c
2014-02-05T23:35:27Z
2014-02-06T01:02:22Z
Stephanie
<p>I still don't see how you could know what is "tiny" unless you have actual information on the variance of these particular data. Some data are quite stable, in which case a 1% change may well not be noise. Seems like you are guessing (perhaps in a reasonable, educated way if you have experience with this type of data), but who knows? I would not assume one way or the other myself. For example, assume most people have very predictable behavior and are either routine clickers or routine non-clickers. If there's something particular about a certain e-mail provider's setup that nudges a subset of people to switch behavior, well, there's a story of how 1 or 2% isn't noise. </p>

Kaiser commented on 'MailChimp Gmail study as an example of Big Data studies 2/2'
tag:typepad.com,2003:6a00d8341e992c53ef01a3fcb595b9970b
2014-02-05T23:13:56Z
2014-02-06T01:02:22Z
Kaiser
http://junkcharts.typepad.com/junk_charts
<p>Stephanie: it depends on your reference point. 1% over 85% is likely to be noise but 1% over 5% is probably meaningful. One good way to gauge the underlying variability is to look at how much the rate moves around weekly, monthly, etc.</p>

Stephanie commented on 'MailChimp Gmail study as an example of Big Data studies 2/2'
tag:typepad.com,2003:6a00d8341e992c53ef01a51164f5d7970c
2014-02-05T18:38:30Z
2014-02-06T01:02:22Z
Stephanie
<p>Overall, everything you say makes perfect sense!</p>
<p>One small follow-up question: Separate and apart from whether or not the base rates are correct, and the lack of attention to variation of those base rates, how do you know that a 1% or 2% difference in click rates is "tiny"? I don't know anything about e-mail marketing, but in direct (snail) mail marketing, a difference of 1% response rate is considered quite meaningful.</p>
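<p>One way to approach the question of whether a one-point difference is "tiny" is to compare it with the base rate and with how much the rate has moved historically, as suggested elsewhere in this thread. The sketch below does both. The weekly click rates are made-up numbers for illustration only, and the function names are hypothetical.</p>

```python
import statistics


def relative_change(base_rate: float, diff: float) -> float:
    """Size of an absolute difference relative to the base rate."""
    return diff / base_rate


def diff_in_historical_sds(diff: float, historical_rates: list[float]) -> float:
    """Express a difference in units of the empirical week-to-week spread."""
    return diff / statistics.stdev(historical_rates)


if __name__ == "__main__":
    diff = 0.01  # a one-percentage-point difference

    # The same one-point difference against two base rates.
    for base in (0.85, 0.05):
        print(f"base={base:.2f}  relative change={relative_change(base, diff):.1%}")

    # Hypothetical weekly click rates; real historical data would go here.
    weekly_rates = [0.050, 0.056, 0.047, 0.053, 0.049, 0.058]
    ratio = diff_in_historical_sds(diff, weekly_rates)
    print(f"difference = {ratio:.2f} historical standard deviations")
```

<p>If the difference is small compared with the ordinary week-to-week spread, it is hard to distinguish from noise, however narrow the formula-based interval looks.</p>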