Seamless, the online restaurant delivery service, has been running a series of fun ads on the New York subway that have a statistics theme. Here is a snapshot of one of them:
The text on the ad says:
The Most Potassium-Rich Neighborhood
MURRAY HILL
Based on the Number of Banana Orders
No One’s Cramping Here
***
This ad is tongue-in-cheek. But it's making a data-driven argument. So I started unpacking it.
The conclusion is “No one’s cramping here (in Murray Hill).” It’s an exaggeration so I’m going to read this as “Most people don’t cramp here in Murray Hill.”
The data behind this conclusion is much harder to nail down. One would think it should be the proportion of orders containing bananas in Murray Hill relative to the same in other neighborhoods. The ad uses the phrase “number of banana orders.” What does that mean? Is it “orders with at least one banana”? Or “orders of bananas only”? Or “total number of bananas ordered (across all orders)”?
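To see why the definition matters, here is a minimal sketch in Python. The orders are made up (this is not Seamless data), and the function names are my own. It shows that the three readings of "number of banana orders" can give three different numbers for the same set of orders.

```python
# Hypothetical orders: each order maps an item name to a quantity.
# These figures are invented for illustration; they are not Seamless data.
orders = [
    {"banana": 1, "coffee": 1},
    {"banana": 6},
    {"bagel": 2},
    {"banana": 2, "yogurt": 1},
]

def orders_with_banana(orders):
    """Count orders containing at least one banana."""
    return sum(1 for o in orders if o.get("banana", 0) >= 1)

def banana_only_orders(orders):
    """Count orders whose only item is bananas."""
    return sum(1 for o in orders if set(o) == {"banana"})

def total_bananas(orders):
    """Count bananas across all orders."""
    return sum(o.get("banana", 0) for o in orders)

print(orders_with_banana(orders), banana_only_orders(orders), total_bananas(orders))
```

A neighborhood that ranks first under one of these definitions need not rank first under the others.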
Between the data and the conclusion is a long, winding path. Let me draw this out:
Assumption 1
All the neighborhoods have similar total populations so that by proportion of banana orders, Murray Hill also ranks #1.
Assumption 2
“Banana orders” is defined meaningfully. For the sake of argument, we’ll assume a banana order is an order that contains at least one banana.
Assumption 3
The data analyst used the appropriate address data. For the sake of argument, we'll assume that the delivery address is the source of the neighborhood data.
Assumption 4
Everyone who has a “banana order” through Seamless lives in the neighborhood to which the banana(s) were delivered. This further requires
Assumption 5
Everyone who has a “banana order” through Seamless works in the same neighborhood as they live. This distinction is important for daytime orders.
Assumption 6
Murray Hill residents who have a “banana order” through Seamless are just like other Murray Hill residents.
Assumption 7
The name on each “banana order” is the one person who consumes the banana(s). No dogs ate the bananas, nor did a co-worker, family member, or anyone else not known to Seamless.
Assumption 8
Eating bananas prevents or massively reduces cramping. The health effect of bananas requires
Assumption 9
Published scientific reports reach a strong consensus on the effect of bananas on cramping (highly unlikely); or, Seamless data show that those with a “banana order” report the absence of cramps (which requires primary research). The causal interpretation further requires
Assumption 10
Knowing that the people who made “banana orders” through Seamless would have suffered cramps had they not ordered and consumed those bananas. This counterfactual scenario is never observed, so instead, we accept
Assumption 10b
Knowing that the people who did not make a “banana order” through Seamless did suffer cramps. This requires
Assumption 11
The people who live in Murray Hill and did not make a “banana order” through Seamless also did not order bananas from a different shop, or otherwise consume bananas. In addition, we require
Assumption 12
No one who is part of this analysis benefited from any other anti-cramping remedy; or at the minimum,
Assumption 13
That people who have “banana orders” through Seamless, and those who don’t, are equally likely to have used other anti-cramping remedies.
Assumption 14
One banana is effective at stopping cramps, meaning there is no dose-response effect, the presence of which would require us to define “banana order” differently under Assumption 2.
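Assumption 1 is the easiest one to illustrate with numbers. The sketch below uses invented figures and a hypothetical second neighborhood, and it takes total orders as the denominator, which is my choice and not something the ad tells us. It shows how the #1 neighborhood by count can lose the top spot once we rank by proportion.

```python
# Hypothetical figures, invented for illustration only.
neighborhoods = {
    "Murray Hill": {"banana_orders": 500, "total_orders": 10000},
    "Neighborhood B": {"banana_orders": 300, "total_orders": 3000},
}

def rank_by_count(data):
    """Rank neighborhoods by raw number of banana orders, highest first."""
    return sorted(data, key=lambda n: data[n]["banana_orders"], reverse=True)

def rank_by_share(data):
    """Rank neighborhoods by banana orders as a share of all orders, highest first."""
    return sorted(
        data,
        key=lambda n: data[n]["banana_orders"] / data[n]["total_orders"],
        reverse=True,
    )

print(rank_by_count(neighborhoods))
print(rank_by_share(neighborhoods))
```

In this made-up example, Murray Hill leads on count (500 versus 300) but trails on share (5% versus 10%).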
The above assumptions fall into three groups: obviously false (e.g. Assumption 1); possibly true; and most likely true. The probability of the conclusion depends on the probabilities of these individual assumptions.
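As a rough sketch of how this compounds, suppose the conclusion needs every assumption to hold, and suppose (a simplification I am adding, not something established above) that the assumptions are independent. Then the probability that the whole chain holds is the product of the individual probabilities. The probabilities below are invented for illustration.

```python
import math

def chain_probability(probabilities):
    """Probability that all assumptions hold, assuming they are independent."""
    return math.prod(probabilities)

# Hypothetical case: 14 assumptions, each judged 90% likely to be true.
fairly_safe = [0.9] * 14
print(round(chain_probability(fairly_safe), 3))

# If even one assumption is certainly false, the whole chain fails.
one_false = [0.0] + [0.9] * 13
print(chain_probability(one_false))
```

Even when each link looks fairly safe, a chain of 14 such links holds less than a quarter of the time under these made-up numbers, and a single obviously false assumption brings the product to zero.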
***
tl;dr
Most data-driven arguments consist of one part data, and many parts assumptions. An analyst should not fear making assumptions. Assumptions should be supported as much as possible.