First, you have to read till the end for the 20 paper ideas.

And if you're wondering about the acronym, it's Driving Under the Influence of Weed on 420 Day, which I learned from Andrew Gelman's blog is a day of celebration of cannabis.

Andrew's blog post is about the exemplary work done by Sam Harper and Adam Palayew, debunking a highly-publicized JAMA study that claimed that 420 Day is responsible for a 12 percent increase in fatal car crashes.

The discussion provides great fodder for examining how to investigate observational data, which is what most of Big Data is about. It is a cautionary tale of what not to do.

***

The blog begins with Harper/Palayew channeling Staples/Redelmeier, the authors of the study: "fatal motor vehicle crashes increase by 12% after 4:20 pm on April 20th (an annual cannabis celebration)."

This short sentence captures the gist of the original study but it omits an important detail: **relative to what** is the increase?

If we ran an **experiment**, we would recruit a group of drivers, and select half of them at random to smoke weed on April 20. Then, we would count what proportion of drivers in each group suffered fatal car crashes after 4:20 pm. The analysis would be straightforward: what's the difference in proportions between the two groups? With such an experiment, it is possible to draw a causal conclusion.
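To make the "difference in proportions" concrete, here is a minimal sketch in Python. The counts are made up purely for illustration, and the function name `compare_proportions` is my own. It computes the difference between two groups and a standard pooled two-proportion z-statistic, which is one common way to judge whether such a gap is larger than chance would produce.

```python
import math

def compare_proportions(events_a, n_a, events_b, n_b):
    """Return (difference in proportions, pooled z-statistic) for two groups."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    diff = p_a - p_b
    # Pooled proportion under the hypothesis that both groups share one rate
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return diff, diff / se

# Hypothetical counts, not from any study: 30 events among 1,000 treated
# drivers versus 20 events among 1,000 untreated drivers.
diff, z = compare_proportions(30, 1000, 20, 1000)
print(f"difference = {diff:.3f}, z = {z:.2f}")
```

Because the groups are formed at random, a difference that clears the noise can be attributed to the treatment. That is exactly what the later designs cannot promise.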

Alternatively, we could conduct a **case-control study**. The cases are the drivers who suffered fatal car crashes on April 20. We collect demographic data on these drivers. Then, we define a set of "controls": drivers who did not suffer car crashes on April 20 but who, on average, have the same demographic characteristics as the cases. Next, we need data on cannabis consumption, preferably on April 20. We want to show that the level of cannabis consumption is significantly higher for cases than for controls.

(For further discussion of these analysis designs, see Chapter 2 of **Numbers Rule Your World (link)**.)

The actual study was neither experiment nor case-control. It was a piece of pure data analysis, based on "found data". I like to call this "adapted data," the "A" in my OCCAM framework for Big Data - data collected for other purposes that the researcher has adapted for his/her own objectives. In this study, the adapted data come from a database of fatal car crashes.

So how was the adapted data analyzed? Harper/Palayew answer this question in their second description of the research:

> Over 25 years from 1992-2016, excess cannabis consumption after 4:20 pm on 4/20 increased fatal traffic crashes by 12% relative to fatal crashes that occurred one week before and one week after.

The cases are the fatal car crashes that occurred after 4:20 pm on 420 Day. The comparison isn't to the drivers who did not suffer crashes on the same day. The reference group consisted of fatal car crashes that occurred after 4:20 pm on 4/13 and 4/27. The difference in the average number of crashes is taken to result from "excess cannabis consumption".

Notice that such a conclusion **requires a strong assumption**. We must believe that, absent the 420 celebration, 4/13, 4/20 and 4/27 would have the same fatal crash frequencies.
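The comparison just described boils down to a simple ratio. The sketch below is a simplification under my own assumptions: the function name `crash_ratio` is hypothetical, the counts are invented so that the ratio comes out to 1.12, and the published study may well have used a more elaborate estimator.

```python
def crash_ratio(target, week_before, week_after):
    """Crashes on the target day relative to the average of the two reference days."""
    reference = (week_before + week_after) / 2
    return target / reference

# Invented counts for illustration only (not the study's data)
ratio = crash_ratio(target=112, week_before=100, week_after=100)
print(f"ratio = {ratio:.2f}, i.e. a {100 * (ratio - 1):.0f}% increase")
```

The ratio carries no information about why the target day differs from its neighbors. Reading it as the effect of cannabis is where the strong assumption comes in.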

***

You hopefully recognize that the analysis design for adapted data is on much shakier ground than either an experiment or a case-control study.

Harper/Palayew's initial debunking focused on one issue: what's so special about April 20? To answer that, they repeated the same analysis on every day of the year. The following pretty chart summarizes their finding:
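Repeating the analysis for every day amounts to looping the same ratio over the calendar. Here is a rough sketch, assuming a hypothetical dictionary `counts` that maps each date to its number of fatal crashes. It ignores the pooling across 25 years and the 4:20 pm cutoff, so it is not Harper/Palayew's actual code.

```python
from datetime import date, timedelta

def daily_ratios(counts):
    """For each day, crashes relative to the average of the days 7 before and 7 after."""
    week = timedelta(days=7)
    ratios = {}
    for day, n in counts.items():
        before = counts.get(day - week)
        after = counts.get(day + week)
        if before is None or after is None:
            continue  # skip days that lack one of the two reference days
        reference = (before + after) / 2
        if reference > 0:
            ratios[day] = n / reference
    return ratios

# Synthetic data: a flat 100 crashes per day in April 2016, with a spike on 4/20
start = date(2016, 4, 1)
counts = {start + timedelta(days=i): 100 for i in range(30)}
counts[date(2016, 4, 20)] = 120

ratios = daily_ratios(counts)
for day in (date(2016, 4, 13), date(2016, 4, 20), date(2016, 4, 27)):
    if day in ratios:
        print(day, round(ratios[day], 3))
```

One side effect shows up even in this toy data: a spike on one day drags down the ratio of the day a week earlier, because the spike day serves as that day's reference.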

The red line is the line of no difference (between the analyzed day and the two reference days from the week before/after). Each vertical line shows the estimated range of the difference for a specific day of the year. The range for 4/20 is highlighted, and several other days with elevated fatal crash counts are labeled.

The chart was originally published here, with the following commentary: "There is quite a lot of noise in these daily crash rate ratios, and few that appear reliably above or below the rates +/- one week." Andrew adds: "Nothing so exciting is happening on 20 Apr, which makes sense given that total accident rates are affected by so many things, with cannabis consumption being a very small part."

While the chart looks cool and sophisticated, the following histogram of the same data helps the reader digest the information.

I took the daily estimates of the fatal crash ratios from Harper/Palayew's published data. Each ratio expresses the crashes on the analysis day relative to the crashes on the two reference days. The histogram shows the day-to-day variability of the crash ratios, which is what we need to answer the question: how special is 4/20?

The histogram is roughly centered at 1.0, meaning no observed difference. The black vertical line shows the ratio for 4/20. It leans right; in fact, it sits at the 94th percentile. In classical terms, this is a one-sided p-value of about 0.06, borderline at best by the conventional 0.05 standard.

The following 21 days have more extreme ratios than 4/20:

Jul 4 Dec 23 Dec 21 Nov 21 Sep 1 Dec 20 Sep 2 Jul 3 Dec 31 Oct 31 Nov 23 Dec 18 Dec 6 Jul 14 Sep 4 Dec 22 Mar 17 May 25 Apr 1 Mar 7 Dec 19
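The count of 21 days squares with the 94th percentile quoted above, as a quick back-of-envelope check shows. The sketch assumes 365 daily ratios (my assumption; the published data may also include Feb 29), and the helper `share_above` is a hypothetical name for how the same share could be computed from a full list of ratios.

```python
def share_above(ratios, target):
    """Fraction of ratios strictly greater than the target ratio."""
    return sum(1 for r in ratios if r > target) / len(ratios)

days_above = 21     # days listed above with more extreme ratios than 4/20
days_in_year = 365  # assumption: one ratio per calendar day, no Feb 29

share = days_above / days_in_year   # empirical one-sided "p-value"
percentile = 100 * (1 - share)      # where 4/20 sits in the distribution
print(f"share above = {share:.3f}, percentile = {percentile:.1f}")
```

With the actual list of daily ratios in hand, `share_above(all_ratios, ratio_for_420)` would return the same share directly.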

Will JAMA editors accept one research paper for each of these days? The work is already done - the rest is story time.

P.S. [4/27/2019] Replaced the first chart with a newer version from Harper's site. This version contains the point estimates that the other version did not. Those point estimates are used to generate the histogram.
