Big Data, Plainly Spoken (aka Numbers Rule Your World)
2020-12-04T09:09:00-05:00
Kaiser Fung, author of "Numbers Rule Your World: the hidden influence of probability and statistics in everything you do" and "Numbersense: How to use Big Data to your advantage". Comments on how data science, algorithms, software shape current events.

The importance of matching data
2020-12-04T09:09:00-05:00 2020-12-04T00:18:33-05:00
Kaiser shows an example of matching dates in data processing, using covid19 data from Illinois and Iowa.
junkcharts

We keep hearing that Covid-19 cases may be rising but the death rate is dropping. This is usually supported by appealing to the trends of cases and deaths. Here is an example for the state of Illinois: Both cases and deaths are expressed as index values relative to April 1st so they can be directly compared. If you pick a date and compute the death rate as deaths divided by cases, the graph above seemingly shows a declining death rate.

While we all wish this to be true, it is too good to be true. The simple analysis described above ignores the timing of deaths. Typically, someone who dies from Covid-19 does not die on the same day s/he gets diagnosed. Death may come up to a month or longer after the positive test. Take the deaths from December 1st. These patients likely tested positive between mid-October and mid-November. So to compute the death rate, we should divide by cases from about a month earlier, not the cases from December 1st.

Notice that cases surged in November, and most of those infected won't show up in mortality counts until December. If we divide deaths on Dec 1st by cases on Dec 1st, the surge of cases is what drives the death rate down. The declining "death rate" is due less to falling deaths than to fast-growing cases.

A clearer analysis requires dividing the deaths of each day by the cases from about 20 days earlier. I described how this is done back in April in this post about Lombardia in Italy. Today, I apply this methodology to current data for the states of Illinois and Iowa.
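Here is a minimal sketch of the date matching in pandas. The series are synthetic (not the Illinois or Iowa data), the 2 percent rate and the 20-day lag are made up for illustration, and `lagged_death_rate` is a name I picked for this example.

```python
import numpy as np
import pandas as pd

def lagged_death_rate(cases, deaths, lag_days):
    """Deaths on each date divided by the cases reported lag_days earlier."""
    # shift(lag_days) pushes each case count forward in time, so the value
    # lined up with date t is the count originally reported on t - lag_days
    return deaths / cases.shift(lag_days)

# Synthetic daily cases (NOT real data): flat for 30 days, then a 5%-a-day surge
dates = pd.date_range("2020-10-01", periods=60, freq="D")
counts = np.concatenate([np.full(30, 1000.0), 1000.0 * 1.05 ** np.arange(1, 31)])
cases = pd.Series(counts, index=dates)

# Toy assumption: deaths are exactly 2% of the cases reported 20 days earlier
deaths = 0.02 * cases.shift(20)

naive = (deaths / cases).dropna()                          # same-day ratio
matched = lagged_death_rate(cases, deaths, 20).dropna()    # date-matched ratio

print(naive.iloc[[0, -1]].round(4))     # falls once the surge starts
print(matched.iloc[[0, -1]].round(4))   # stays at the assumed 2%
```

In this toy setup, the same-day ratio drops during the surge even though the chance of dying after a positive test never changes.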

***

In Illinois, the lag between cases and deaths is around 25 days. The thin gray line is the same line as in the first graph. Here, I lag it by 25 days, shifting the curve to the right. Doing so matches the deaths from one week with the cases from 25 days earlier. You can see the matched dates on the bottom axis.

Both lines start at 100 on July 1st. You can see that deaths pretty much follow the shape of the trend of cases. This indicates that the chance of dying from Covid-19 given that one has tested positive has been relatively stable during these many months.
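For those who want to build this kind of chart, here is a rough sketch of the two steps: shift the cases curve to the right by the lag, then rescale both curves to 100 on the base date. The data are synthetic, and the helper names `shift_dates` and `index_to_base` are mine, not from any library.

```python
import numpy as np
import pandas as pd

def shift_dates(series, lag_days):
    """Return a copy of the series moved lag_days to the right on the date axis."""
    shifted = series.copy()
    shifted.index = shifted.index + pd.Timedelta(days=lag_days)
    return shifted

def index_to_base(series, base_date):
    """Rescale a series so that its value on base_date equals 100."""
    return 100 * series / series.loc[pd.Timestamp(base_date)]

# Synthetic daily series (NOT real Illinois data)
dates = pd.date_range("2020-06-01", periods=120, freq="D")
cases = pd.Series(1000.0 + 10.0 * np.arange(120), index=dates)
# Toy assumption: deaths are 2% of the cases reported 25 days earlier
deaths = 0.02 * shift_dates(cases, 25)

# Lag the cases by 25 days, then index both curves to 100 on July 1st
cases_index = index_to_base(shift_dates(cases, 25), "2020-07-01")
deaths_index = index_to_base(deaths, "2020-07-01")

print(cases_index.loc["2020-07-01":"2020-07-03"].round(1))
print(deaths_index.loc["2020-07-01":"2020-07-03"].round(1))
```

Because the toy death rate is constant, the two indexed curves coincide; in real data, a gap opening up between them would signal a changing death rate.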

***

In Iowa, the lag between cases and deaths is around 17 days. A similar picture emerges.

It's difficult to compare state-level statistics because each state has its own set of rules for reporting. Also bear in mind that all dates are reporting dates, and don't necessarily reflect the day of infection or the day of death. So the error bar around the exact time of lag is pretty wide.

It's time to visit Florida again - graphically
2020-12-02T08:44:00-05:00 2020-11-30T12:53:39-05:00
Kaiser decides to visit Florida again - in visualizing excess deaths due to coronavirus.
junkcharts

In May, I showed the following chart that presents a way to understand excess deaths in Florida (link to post): It was only two months into the pandemic: because actual deaths from death certificates take time to count, it was just a sign of things to come. We start with a projection of expected deaths by seasonal flu ("pneumonia & influenza"), based on prior flu seasons and expressed as a percentage of total deaths by any cause. In the first two months of the pandemic, the reported deaths by seasonal flu were as much as double the typical percentage of all deaths. On top of that, there were deaths related to Covid-19, which were roughly equal to deaths by seasonal flu during April.

There was no particular reason why seasonal flu deaths should be significantly above normal during the tail end of the 2019-20 flu season so it was suspected that Florida might have been undercounting Covid-19 deaths.

Months later, most of the dust has settled on these earlier months, and many of those deaths appeared to have been reclassified as Covid-19 related. I revisited this analysis below.

***

Let's start with seasonal flu statistics for the past five seasons (from 2015-16 to 2019-20). Flu-counting starts in late September/early October each season. According to the CDC, flu season typically reaches a peak in the winter, and lingers on till as late as May. Pneumonia & influenza accounts for 1 to 2 percent of total deaths in any given week of the year, on average in the four seasons prior to 2019-20. The 2019-20 flu season was a bit late arriving so when Covid-19 cases started getting detected in March, flu deaths were just easing from the seasonal peak.

The above chart - specifically, the thick gray line - establishes what we expect the flu fatalities to be during the pandemic months from March to Oct 2020. The orange lines show the actual deaths attributed to seasonal flu. These counts are definitely smaller than the numbers I found back in May. The revised counts make more sense since the flu is seasonal. (I left out the November data because they are incomplete at the time of writing.)
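As a rough sketch of how such a baseline can be computed: average the weekly share of deaths over the prior seasons, then compare the current season against it. The counts below are hypothetical (not CDC or Florida numbers), and the column names are my own.

```python
import pandas as pd

# Hypothetical weekly counts (NOT real data): two prior seasons plus the current one
records = [
    ("2017-18", 1, 30, 2000), ("2017-18", 2, 40, 2000),
    ("2018-19", 1, 20, 2000), ("2018-19", 2, 30, 2000),
    ("2019-20", 1, 50, 2500), ("2019-20", 2, 105, 3000),
]
df = pd.DataFrame(records, columns=["season", "week", "pi_deaths", "all_deaths"])

# Pneumonia & influenza deaths as a share of deaths by any cause
df["pi_share"] = df["pi_deaths"] / df["all_deaths"]

current = "2019-20"
# Expected share for each week: the average over the prior seasons
baseline = df[df["season"] != current].groupby("week")["pi_share"].mean()
# Actual share for each week of the current season
actual = df[df["season"] == current].set_index("week")["pi_share"]

# How many times the expected share was actually observed
ratio = actual / baseline
print(pd.DataFrame({"expected": baseline, "actual": actual, "ratio": ratio}).round(4))
```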

What happens when we layer on the Covid-19 deaths? The spike of data ran off the page.

I kept the scale of the lower part the same as before. In March and April, while pneumonia and influenza accounted for 1 to 2 percent of all deaths in Florida, when Covid-19 deaths were lumped with the other two causes, the combined share reached about 8 percent of all deaths, roughly four times as many. In the summer months, over a quarter of the deaths in Florida were linked to Covid-19, pneumonia or influenza.

In a normal flu season, we might see 50 to 100 deaths per week in Florida due to pneumonia or influenza. Since April, Florida has suffered Covid-19 deaths in the hundreds, and during the peak in the summer, over a thousand per week, roughly 10 times above expectation.

Here's a gif that shows the top of the chart:

The press-release derby has set an unrealistic bar for the coronavirus vaccine
2020-12-01T08:48:00-05:00 2020-11-30T23:47:00-05:00
Kaiser explores how sensitive Pfizer's claimed 95 percent efficacy is to the trial outcomes.
junkcharts

Regarding the Pfizer vaccine trial results, statisticians are tempted to say that the signal (the vaccine's share of cases) was so strong that we can let down our myriad defenses. As I explain today, this is not quite true. It depends on which signal we're talking about.

If the question is whether the Pfizer vaccine is at least 50 percent effective, then we don't have much to worry about. Nevertheless, the press release derby has created extremely high expectations - if the question is whether the Pfizer vaccine is at least 90 percent effective, then the finding is highly sensitive to shifts of just a few cases. I'll show below that we can be secure in making a statement such as that the Pfizer vaccine is at least 80 percent effective.

***

The basic strategy of a Bayesian analysis yields a probability estimate of the efficacy of the vaccine being tested (VE), given the results from the vaccine trial. The outcome is a probability estimate for any value of VE. According to the Pfizer protocol, the U.S. regulators are interested in a specific estimate: the chance that VE is higher than 30 percent.

The result from the vaccine trial is expressed as the vaccine's share of cases (VSC), which is the proportion of detected cases that came from the vaccine arm of the trial (the other arm being the placebo). Intuitively, the better the vaccine, the lower its share of cases.

Enrollment in the Pfizer trial can be stopped when the trial records 170 cases. The most recent press release from Pfizer reported that out of those 170 cases, only 8 came from the vaccine arm, thus the observed VSC was 8/170, or about 5 percent. This translates to 95% VE: in other words, the case rate in the vaccine arm was merely 5% of that in the placebo arm.
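The arithmetic can be checked with a few lines of Python. This sketch assumes the two arms are of equal size, in which case VSC = (1 - VE) / (2 - VE) and, inverting, VE = (1 - 2 VSC) / (1 - VSC). These are the same formulas used in the Python code I published; the function names here are just for illustration.

```python
def ve_to_vsc(ve):
    """Vaccine efficacy -> vaccine's share of cases (assumes equal-sized arms)."""
    return (1 - ve) / (2 - ve)

def vsc_to_ve(vsc):
    """Vaccine's share of cases -> vaccine efficacy (assumes equal-sized arms)."""
    return (1 - 2 * vsc) / (1 - vsc)

vaccine_cases, total_cases = 8, 170
vsc = vaccine_cases / total_cases   # share of cases from the vaccine arm
ve = vsc_to_ve(vsc)                 # implied vaccine efficacy

print(f"VSC = {vsc:.3f}, VE = {ve:.3f}")
# VSC = 0.047, VE = 0.951
```

Equivalently, VE is one minus the ratio of vaccine-arm cases to placebo-arm cases: 1 - 8/162.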

(Notice that the average case rate is very low: 170/43000 = 0.4% about 2 months since enrollment, therefore it is wrong to say that 95 percent of vaccinated people will be protected - most people in the trial have not yet been exposed to the virus!)

If any of the above is unclear, please review my prior posts about the Pfizer analysis (here and here).

***

I was wondering how sensitive the result is to the observed VSC. If the trial had found 9 or 10 cases, instead of 8, among the vaccine arm, how much would our conclusion change?

I use the chart below to answer this question: First, look at the red line. This shows the probability that vaccine efficacy (VE) is over 90 percent, given the trial results. The actual trial result is indicated by the down arrow - eight cases in the vaccine arm out of 170 total cases. The corresponding dot above says there is a 98% chance that VE > 90%. That's what the Pfizer press release wants to tell us.

Now, notice how steeply the line drops as we move from 8 to 10 to 12 cases: 98% -> 93% -> 81%. To achieve the typical standard of 95% confidence, this probability has to clear 97.5%. This means that the 8 reported cases were sitting on the edge. One more case, and the claim of over 90% VE is shaken.
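For readers who want to try this themselves, here is a sketch of the calculation in Python. It assumes a Beta prior on VSC with the parameters labeled "From protocol" in my Python code (0.700102 and 1) and equal-sized arms. The function name `prob_ve_above` is made up for this example, and the output may differ a little from the chart because of rounding.

```python
from scipy.stats import beta

PRIOR_A, PRIOR_B = 0.700102, 1.0   # Beta prior on VSC (assumed, from the protocol)

def prob_ve_above(threshold, vaccine_cases, total_cases):
    """Posterior probability that VE exceeds threshold, given the split of cases."""
    # VE > threshold is the same event as VSC < (1 - threshold) / (2 - threshold)
    vsc_cutoff = (1 - threshold) / (2 - threshold)
    # Beta-binomial update: add vaccine-arm cases to a, placebo-arm cases to b
    post_a = PRIOR_A + vaccine_cases
    post_b = PRIOR_B + total_cases - vaccine_cases
    return float(beta.cdf(vsc_cutoff, post_a, post_b))

# How the probability of VE > 90% moves with the vaccine-arm case count
for x in (8, 10, 12):
    print(x, round(prob_ve_above(0.9, x, 170), 3))
```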

This might sound like bad news for Pfizer. If it does, it's Pfizer's own doing. Because of the press release derby, it feels like VE needs to be above 90%. Remember those days when we hoped VE would be at least 50 percent?

So I did the same analysis for the probability that VE is over 50 percent, as the number of cases in the vaccine arm increases. This is shown as blue dots. Notice that these dots hug the 100% line for dear life. Even if the vaccine arm found 25 cases (out of 170), we can still say with complete confidence that the VE of Pfizer's vaccine is over 50 percent.

Let's double the number of cases coming from vaccinated participants to 16. If you draw a vertical line at 16, it will hit the light pink dots at 99 percent. Those dots represent VE greater than 80 percent. We are highly confident that VE is over 80 percent, even if the vaccinated participants suffered twice as many cases as observed in the vaccine trial.
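Another way to look at the same question: for each efficacy claim, what is the largest number of vaccine-arm cases (out of 170) that still leaves the posterior probability at or above 97.5 percent? The sketch below makes the same assumptions as my Python code (Beta prior with parameters 0.700102 and 1, equal-sized arms). The helper `max_cases_supporting` is a hypothetical name.

```python
from scipy.stats import beta

def max_cases_supporting(threshold, total_cases=170, required_prob=0.975,
                         prior_a=0.700102, prior_b=1.0):
    """Largest vaccine-arm case count with P(VE > threshold) >= required_prob."""
    # VE > threshold is the same event as VSC < vsc_cutoff
    vsc_cutoff = (1 - threshold) / (2 - threshold)
    best = None
    for x in range(total_cases + 1):
        prob = beta.cdf(vsc_cutoff, prior_a + x, prior_b + total_cases - x)
        if prob < required_prob:
            break          # the probability only falls as x grows, so stop here
        best = x
    return best

for threshold in (0.5, 0.8, 0.9):
    print(f"VE > {threshold:.0%}: up to {max_cases_supporting(threshold)} vaccine-arm cases")
```

The lower the efficacy being claimed, the more room there is for a few cases to shift without upsetting the conclusion.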

***

When discussing these vaccine trial data, please bear in mind the following:

1) Recall that the 40,000 or so participants are not monitored every day. The cases are self-reported, and the participants who report symptoms are then tested. At most one mandatory follow-up of the entire set of participants has occurred so far.

2) The claim of 90% or above efficacy is delicate. Shifts of one or a few cases matter. Bear in mind that a small number of participants may be excluded from the analysis. These exclusions can have completely legitimate reasons; notwithstanding, they can swing the outcome. That said, claiming about 80% efficacy should be very solid.

Pfizer modeling in Python
2020-11-28T13:44:00-05:00 2020-11-28T15:49:09-05:00
Kaiser publishes the Python code that can be used to do Bayesian analysis of the vaccine trials.
junkcharts

Here is code for those who want to play around with the Bayesian analysis used in the Pfizer vaccine trial study (Related blog post here). I got a request for Python code so here it is. The charts I published the other day were made in R.

``````
# Python code for Pfizer vaccine trial analysis

import pandas as pd
import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt

# functions

def ve2vsc (ve):
    # vaccine efficacy -> vaccine's share of cases
    if isinstance(ve, pd.Series) is False:
        ve = pd.Series(ve)
    return ((1-ve)/(2-ve)).round(3)

def vsc2ve (vsc):
    # vaccine's share of cases -> vaccine efficacy
    if isinstance(vsc, pd.Series) is False:
        vsc = pd.Series(vsc)
    return ((1-2*vsc)/(1-vsc)).round(3)

ve = pd.Series(np.linspace(0,1,11))
vsc = ve2vsc(ve)

# line chart showing VSC and VE relationship

plt.plot(ve, vsc, color="darkgray", lw=3)
plt.title("Relationship between Vaccine Efficacy and Vaccine's Share of Cases")
plt.xlabel("ve")
plt.ylabel("vsc")

ax = plt.gca()
ax.spines['right'].set_color("none")
ax.spines['top'].set_color("none")
ax.spines['bottom'].set_position(('data',0))
ax.spines['left'].set_position(('data',0))

xlabels = (ve*100).astype('int').astype('str')+"%"
xlabels[list(range(1,11,2))] = ""
plt.xticks(ve, xlabels)
plt.yticks(pd.Series(np.linspace(0,0.5,6)), (pd.Series(np.linspace(0,50,6))).astype('int').astype("str")+"%")

ax.hlines(ve2vsc(0.5), 0, 0.5, color="brown", ls="dotted")
ax.vlines(0.5, 0, ve2vsc(0.5), color="brown", ls="dotted")
ax.hlines(ve2vsc(0.9), 0, 0.9, color="brown", ls="dotted")
ax.vlines(0.9, 0, ve2vsc(0.9), color="brown", ls="dotted")

# ve2vsc returns a one-element Series; pull out the scalar before labeling
vsc50 = ve2vsc(0.5).iloc[0]
vsc90 = ve2vsc(0.9).iloc[0]
plt.text(0.02, vsc50+0.02, str(int(vsc50*100))+"%", color="brown", fontsize=8)
plt.text(0.02, vsc90+0.02, str(int(vsc90*100))+"%", color="brown", fontsize=8)

plt.show()

# plotting the posterior

a1 = 0.700102	        # From protocol
b1 = 1			# From protocol
data_x = 8		# Number of cases in vaccine arm
data_n = 94		# Total number of cases

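def a2 (a1, x):
    # posterior alpha: prior alpha plus cases in the vaccine arm
    return a1 + x

def b2 (b1, n, x):
    # posterior beta: prior beta plus cases in the placebo arm
    return b1 + n - x

def vsc_posterior(samp, data_x):
    # posterior density of VSC at the points in samp (uses globals a1, b1, data_n)
    return pd.Series(beta.pdf(samp, a2(a1,data_x), b2(b1,data_n, data_x)))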

vsc_sample = pd.Series(np.linspace(0,0.5,72))

# single posterior curve vs vsc

plt.plot(vsc_sample, vsc_posterior(vsc_sample, 8), lw=3)
plt.title("Predicted Prob. of Vaccine's Share of Cases")
plt.xlabel("VSC")
plt.ylabel("Probability")

ax = plt.gca()
ax.spines['right'].set_color("none")
ax.spines['top'].set_color("none")
ax.spines['bottom'].set_position(('data',0))
ax.spines['left'].set_position(('data',0))

plt.xticks(pd.Series(np.linspace(0,0.5,6)), (pd.Series(np.linspace(0,50,6))).astype('int').astype('str')+"%")
plt.yticks(pd.Series(range(0,21,4)), [""]*6)

plt.show()

# posterior vs ve

plt.plot(vsc2ve(vsc_sample), vsc_posterior(vsc_sample, 8), lw=3)

xlabels = (pd.Series(range(0,110,10))).astype("str")+"%"
xlabels[list(range(1,11,2))] = ""
xlabels

plt.xticks(pd.Series(np.linspace(0,1.0,11)), xlabels)
plt.yticks(pd.Series(range(0,21,4)), [""]*6)
plt.title("Predicted Prob. of Vaccine Efficacy")
plt.xlabel("ve")
plt.ylabel("probability")

plt.show()

# three posterior curves vs ve

plt.plot(vsc2ve(vsc_sample), vsc_posterior(vsc_sample, 3), color="blue", lw=1.5, linestyle="dashed", alpha=0.4)
plt.plot(vsc2ve(vsc_sample), vsc_posterior(vsc_sample, 47), color="purple", lw=1.5, linestyle="dashed", alpha=0.4)
plt.plot(vsc2ve(vsc_sample), vsc_posterior(vsc_sample, 8),
color="darkgray", lw=3)

ax = plt.gca()
ax.spines['right'].set_color("none")
ax.spines['top'].set_color("none")
ax.spines['bottom'].set_position(('data',0))
ax.spines['left'].set_position(('data',0))

xlabels = (pd.Series(np.linspace(0,100,11))).astype("int").astype("str")+"%"
xlabels[list(range(1,10,2))] = ""
xlabels

plt.xticks(pd.Series(np.linspace(0,1.0,11)), xlabels)
plt.yticks(pd.Series(range(0,21,4)), [""]*6)
plt.title("Predicted Prob. of Vaccine Efficacy")
plt.xlabel("ve")
plt.ylabel("probability")

plt.text(0.02,vsc_posterior(vsc_sample, 47).max().round(2)+0.2,"47-47", color="purple", fontsize=8)
plt.text(0.02,vsc_posterior(vsc_sample, 47).max().round(2)+1,"No Effect", color="purple", fontsize=8)
plt.text(0.88,vsc_posterior(vsc_sample, 3).max().round(2)-0.2,"3-91", color="blue", fontsize=8)
plt.text(0.88,vsc_posterior(vsc_sample, 3).max().round(2)+0.5,"BEST", color="blue", fontsize=8)
plt.text(0.82,vsc_posterior(vsc_sample, 8).max().round(2)+0.2,"8-86", color="darkgray", fontsize=8)
plt.text(0.82,vsc_posterior(vsc_sample, 8).max().round(2)+1,"DATA", color="darkgray", fontsize=8, weight="bold")

plt.show()

# prior and posterior vs ve

vsc_prior = pd.Series(beta.pdf(vsc_sample, a1, b1))

plt.plot(vsc2ve(vsc_sample), vsc_prior, color="magenta", lw=1.5, ls="dashed", alpha=0.5)
plt.plot(vsc2ve(vsc_sample), vsc_posterior(vsc_sample, 8),
color="darkgray", lw=3)

ax = plt.gca()
ax.spines['right'].set_color("none")
ax.spines['top'].set_color("none")
ax.spines['bottom'].set_position(('data',0))
ax.spines['left'].set_position(('data',0))

xlabels = (pd.Series(np.linspace(0,100,11))).astype("int").astype("str")+"%"
xlabels[list(range(1,10,2))] = ""
xlabels

plt.xticks(pd.Series(np.linspace(0,1.0,11)), xlabels)
plt.yticks(pd.Series(range(0,21,4)), [""]*6)
plt.title("Prior and Posterior Prob. of Vaccine Efficacy")
plt.xlabel("ve")
plt.ylabel("probability")

plt.text(0.15, 2,"prior", color="magenta", fontsize=8)
plt.text(0.76,12,"posterior", color="darkgray", fontsize=8)

plt.show()

# compute probability VE > 30%, 70% and 90% given 8-86 split of cases

# wrapped in print() so the values show up when run as a script
print(beta.cdf(ve2vsc(0.3), a2(a1,8), b2(b1,data_n, 8)).round(3))
print(beta.cdf(ve2vsc(0.7), a2(a1,8), b2(b1,data_n, 8)).round(3))
print(beta.cdf(ve2vsc(0.9), a2(a1,8), b2(b1,data_n, 8)).round(3))

``````
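If you'd like an interval rather than a probability for a single cutoff, the same posterior can be used to sketch an equal-tailed credible interval for VE. This is an add-on, not part of the code above: it assumes the same Beta prior (0.700102 and 1) and the 8-86 split out of 94 cases, and `ve_credible_interval` is a hypothetical helper.

```python
from scipy.stats import beta

def ve_credible_interval(vaccine_cases, total_cases, level=0.95,
                         prior_a=0.700102, prior_b=1.0):
    """Equal-tailed credible interval for VE from the Beta posterior on VSC."""
    post_a = prior_a + vaccine_cases
    post_b = prior_b + total_cases - vaccine_cases
    tail = (1 - level) / 2
    # Posterior quantiles of the vaccine's share of cases
    vsc_lo, vsc_hi = beta.ppf([tail, 1 - tail], post_a, post_b)
    # VE falls as VSC rises, so the upper VSC quantile gives the lower VE bound
    to_ve = lambda vsc: (1 - 2 * vsc) / (1 - vsc)
    return float(to_ve(vsc_hi)), float(to_ve(vsc_lo))

ve_low, ve_high = ve_credible_interval(8, 94)
print(round(ve_low, 3), round(ve_high, 3))
```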