In the previous post of this series, I described the standard process of using regression to "adjust" or "correct" for biases in a dataset. If we use the unadjusted sample average height from Sample A2 to estimate the average height in the population, we will be off by almost 1 inch, which is a large error of two standard errors. The under-estimation is caused by over-sampling of women, who are shorter on average than men. The gender-adjusted sample average height almost exactly replicates the population average height, as if the analyst had waved a magic wand!

Using regressions to correct for biases is an extremely popular method for dealing with the selection biases endemic in observational datasets. Most real-world studies of vaccine effectiveness deploy regression adjustments. Nevertheless, no serious scholar of observational data analysis would think regression is sufficient to solve the key problems in the field.

In this post, I reveal why these regression adjustments are not the magic wand they are made out to be.

The previous post ended with this question: does the software executing regression know that Sample A2 contains a gender bias, and how?

I hope you realize that the software does not know, and cannot know. Knowing there is gender bias requires knowing what the gender ratio is in the entire population of 25,000 people. There is no way anyone can infer that from inputting 900 records from Sample A2 into the regression software.

Since we ran the regression, and it gave us an equation, and the adjusted sample average is tantalizingly close to the population average, something good must have happened, right?

Assuming you have no idea what the gender ratio ought to be, and you are forced to make an assumption, what might you do?

You might assume that the population is evenly split by gender. This is exactly what the software did. Let's verify.

Take the average of the male average (69.5) and the female average (63.7) from the above table, and you get 66.6, the exact value of the gender-adjusted sample average height.

This is a hidden assumption of all such regression models. Recall the regression equation:

66.6 + 2.9 if Male - 2.9 if Female

For men, the equation adds 2.9 inches and for women, it subtracts 2.9 inches. In order for the sample average to be equal to 66.6, you need males to balance out females, which implies a 50-50 split. (Your specific software may choose a different way of coding the gender variable but since these are equivalent formulations, the math works out the same.)

When adjusting for gender, the software is computing what the sample average height would have been if the sample were evenly split between the two genders. The software does not have any foresight into whether the sample has a gender bias, or how much that bias is -- it just assumes that the population gender ratio is 50-50, and then waves the magic wand.
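To see the hidden 50-50 assumption in action, here is a minimal Python sketch. It is not the JMP analysis behind this post: the sample sizes, the standard deviation of 3 inches, and the random seed are all assumptions made up for illustration, and only the two group means (69.5 and 63.7) echo the numbers above. It fits a regression with gender coded as +1 for male and -1 for female, then checks that the intercept equals the simple average of the two group averages, no matter how lopsided the sample is.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample of 900 that over-samples women (sizes are assumed).
n_male, n_female = 300, 600
male = rng.normal(69.5, 3.0, n_male)      # assumed spread of 3 inches
female = rng.normal(63.7, 3.0, n_female)  # assumed spread of 3 inches

height = np.concatenate([male, female])
# Effect coding: +1 if Male, -1 if Female.
gender = np.concatenate([np.ones(n_male), -np.ones(n_female)])

# Ordinary least squares fit of: height = intercept + slope * gender
X = np.column_stack([np.ones_like(gender), gender])
intercept, slope = np.linalg.lstsq(X, height, rcond=None)[0]

unadjusted = height.mean()                          # weights groups 300:600
equal_weight = (male.mean() + female.mean()) / 2    # weights groups 50:50

print(f"unadjusted sample average:  {unadjusted:.2f}")
print(f"regression intercept:       {intercept:.2f}")
print(f"average of group averages:  {equal_weight:.2f}")
```

The last two printed numbers agree, while the first is pulled toward the female average. Nothing in the fit ever sees a population gender ratio; the 50-50 weighting comes purely from how the intercept is defined under this coding.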

***

How then can we explain the "magic" regression did to Sample A2? It's simple. I have designed Population A to be evenly split by gender, and so the naive assumption proves correct in this case.

Let's see what happens in Population B, which is 60% males, and 40% females.

Since this population of 25,000 skews male, the population average height is 67.1, slightly higher than for Population A.

As before, we give the analyst a sample of 900 records, Sample B. As she quickly discovers, Sample B also has a majority of females. The sample average height is 66.1.

Once again, if the unadjusted sample average were used to estimate the population average, we would be off by 1 inch, which is a large error. The analyst waves her magic wand, and dumps the data into regression software. Out pops the following equation:

66.6 + 2.6 if Male - 2.6 if Female

The gender-adjusted sample average height is 66.6, which is closer to the population average but still not close enough.

We can verify that this adjustment embeds the assumption of a 50-50 gender split. The average of 69.3 and 64.0 is 66.6.

If we knew Population B should be 60% males, we could enforce this. Take the sample averages for males and females from the regression, and then apply this gender ratio.

60% x 69.3 + 40% x 64.0 = 67.2

Magic! Our new sample average is now an accurate estimate for the population average of 67.1. In other words, regression works really well if you know the true gender split, which is to say, you know the exact amount of gender bias. Otherwise, it makes the naive assumption that the population is balanced by gender.
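The reweighting step above can be sketched in a few lines of Python. The function name `reweight` and the dictionary layout are hypothetical choices for illustration; the group averages (69.3 and 64.0) and the 60-40 split are the ones quoted in this post.

```python
def reweight(group_means, population_shares):
    """Weighted average of group means, using assumed population shares."""
    total = sum(population_shares.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("population shares must sum to 1")
    return sum(group_means[g] * population_shares[g] for g in group_means)


# Group averages recovered from the Sample B regression.
sample_b_means = {"male": 69.3, "female": 64.0}

# What the regression intercept implicitly assumes: a 50-50 population.
naive = reweight(sample_b_means, {"male": 0.5, "female": 0.5})

# What we get if we know Population B is 60% male.
informed = reweight(sample_b_means, {"male": 0.6, "female": 0.4})

print(f"50-50 weights: {naive:.2f}")
print(f"60-40 weights: {informed:.2f}")
```

The only thing that changes between the two estimates is the set of weights, and those weights have to come from outside the sample.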

Almost nothing in the real world is balanced equally. Imagine doing age-group adjustments. Is there any population that is evenly split by age groups?

***

The analyst receives yet another sample, Sample B2, of 900 people from Population B. She finds that males account for 60% of the new sample, and the unadjusted sample average height is 66.6.

Since she doesn't know what the underlying gender ratio should be, she dumps the data into regression software to pre-emptively cure any gender bias. Out pops the equation:

66.1 + 2.4 if Male - 2.4 if Female

Using the standard process that her teachers have taught her, she now reports that the gender-adjusted sample average height is 66.1. This is an estimate cured of any potential selection bias by gender. She calls this a debiased estimate of the population average height, after removing any effect of gender.

Since we in fact created Population B, we know the following statistics:

We know that the population average height is 67.1, and therefore, the gender-adjusted sample average is off by 1 inch, another large error. What's worse, the gender adjustment shifted the sample average in the wrong direction, from 66.6 to 66.1!

Waving the magic wand appears to have produced undesirable effects. What happened?

It turns out that Sample B2 is a random sample from Population B. This explains why the proportion of men in Sample B2 is about 60%, similar to that in the population. There is in fact no gender bias at all in Sample B2.

Nevertheless, the regression software defines unbiased as a 50-50 gender split, and the adjusted sample average uses those weights. Since in Population B, the correct gender split is 60-40, the adjustment moves the number in the wrong direction!
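A small simulation makes the same point. This is a sketch under stated assumptions, not a reproduction of Population B: the group means (69.4 and 63.7), the 3-inch spread, and the seed are invented so that the simulated population lands near an average of 67.1. It draws a simple random sample of 900 from a population that is 60% male, then compares the raw sample average with the 50-50 "adjusted" average.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for Population B: 25,000 people, 60% male.
N, n_male = 25_000, 15_000
is_male = np.arange(N) < n_male
heights = np.where(
    is_male,
    rng.normal(69.4, 3.0, N),  # assumed male mean and spread
    rng.normal(63.7, 3.0, N),  # assumed female mean and spread
)
population_mean = heights.mean()

# Simple random sample of 900: no gender bias by design.
idx = rng.choice(N, size=900, replace=False)
s_height, s_male = heights[idx], is_male[idx]

raw_mean = s_height.mean()
# The 50-50 "adjusted" figure: simple average of the two group averages.
adjusted = (s_height[s_male].mean() + s_height[~s_male].mean()) / 2

print(f"population average:     {population_mean:.2f}")
print(f"share of men in sample: {s_male.mean():.0%}")
print(f"raw sample average:     {raw_mean:.2f}")
print(f"50-50 adjusted average: {adjusted:.2f}")
```

With an unbiased sample, the raw average already carries the right gender weights, so forcing 50-50 weights onto it pulls the estimate down and away from the population average.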

***

Now that you understand how regression adjustments work, do you think this type of regression adjustment has successfully corrected for "a huge list of confounders" in the real-world studies of vaccine effectiveness?

There is a reason why, when I covered real-world studies last year, I focused on several studies that deployed more sophisticated methods, such as matching and propensity scores. Studies that just dump data into regression software will not generate good estimates unless the analysts have a good sense of the population distributions of all the relevant confounding variables. Or, unless you believe in the fairy tale of equal distributions everywhere.

***

In the next post, I will wade deeper into this morass, and study other scenarios.

I did the simulation and analyses using JMP software and all the figures come from it. I may put up some R and Python code if this series generates enough interest.

I leave you with another question, perhaps rhetorical. If it is true that regression is the magic wand that cures all biases, why is there a need to run a randomized clinical trial? We should just collect observational data, dump them into a regression that corrects for all confounders, and the resulting adjusted vaccine effectiveness will accurately replicate the true value.

***

Click here to go back to the first post of the series that sets up this post.

[9-1-21] Post #3 is now available.

These past two posts focus a lot on the "regression doesn't know". I think what you needed to spell out more clearly is that the naive process is to estimate the regression and then use the estimate of the intercept to infer the population average. You're obviously right that this is mistaken when the groups in the population are unbalanced. It would make for a good introduction to posterior predictive checks or Gelman's Mr. P. (multi-level regression with post-stratification)

Posted by: John Hall | 08/27/2021 at 11:56 AM

JH: Yes, one of the next posts will explain what regression is good for and why we do what we do.

"You're obviously right that this is mistaken when the groups in the population are unbalanced." Couldn't have said it better - and when in the real world do we have populations that are balanced?

And yes, what is in this post directly relates to Mr. P.

Posted by: Kaiser | 08/27/2021 at 02:48 PM

Thanks again for this well-explained post. It would be nice if you could make a post in this series on multi-level regression with post-stratification. So far I have not been able to develop a good intuitive understanding of it. I am asking you that because I really like how you explain difficult ideas, by using examples that directly show the mistakes we make if we don't have a good understanding of how it works. Your examples are simple and yet very powerful. You are really gifted for explaining all this!

Posted by: Clur | 08/28/2021 at 08:36 AM