This is the third post in a series on using regression to correct for biases in observational data.
Let's review what we've learned so far. In the first post, I gave an example showing that regression can correct bias in a biased dataset. We looked at a sample of height data in which women are over-represented relative to the underlying population. Without adjustment, the sample average height under-estimates the population average, since women on average are shorter than men. After the regression adjustment, the gender-adjusted sample average height is almost identical to the population average.
In the second post, I exposed a hidden assumption behind so-called regression-adjusted estimates: that everything is evenly distributed. In a gender adjustment (with only males and females in the population), the assumption is 50% males, 50% females. In an age-group adjustment (with 5 levels), the assumption is 20% in each level. If age is entered as a continuous variable, the age-adjusted estimate concerns persons of average age in the sample.
Needless to say, most real-world populations are not evenly balanced, nor do they contain only average persons. Therefore, if the goal is to estimate the population value, the regression adjustment may correct the bias only partially, and may even backfire, moving the sample average further from the population average. In other words, don't assume that dumping data into regression software is a magic wand that cures all biases. And don't trust "experts" who wave this magic wand.
***
Chapter 3 of Numbers Rule Your World (link) is titled "the dilemma of being together." The chapter explores a fundamental problem in statistical analysis: the tension between capturing complex relationships in the data and finding generalizable features. The analyst resolves this question by deciding how much to aggregate the data. In regression models, this amounts to deciding how many averages we include in the equation.
In the most naive model, the analyst summarizes the entire sample in one number, the sample average, and asserts that it is the best estimate of the population average. Thus:
[Naive] Population Average Height = Sample Average Height
Recall the statistics of our sample B of 900 people: the overall average height is 66.1 inches, males average 69.2 inches, females average 64.0 inches, and women are over-represented relative to Population B.
The naive model simply states:
[Naive] Population Average Height = 66.1
The average height of Population B is 67.1 inches, so the error of 1 inch is massive, equal to the width of a 95% confidence interval, as explained in post #1. This model also makes systematic mistakes: the errors (estimated minus actual height) for males tend to be negative while the errors for females tend to be positive. A good model is one in which the errors behave as if random.
The underlying problem is that this naive model is too simple. On average, males are taller than females, so we err by summarizing everyone with a single average.
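To make this concrete, here is a minimal Python sketch on simulated data. Only the stratum averages (69.2 and 64.0) come from the posts; the 360/540 gender split and the 3-inch spread are my assumptions for illustration, not the actual sample from post #1.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stand-in for sample B: women over-represented.
# The 360/540 split and 3-inch spread are assumed, not from the actual data.
heights_male = rng.normal(69.2, 3.0, 360)
heights_female = rng.normal(64.0, 3.0, 540)
sample = np.concatenate([heights_male, heights_female])

naive_estimate = sample.mean()                  # lands near 66.1

# Errors (estimate minus actual) are systematically signed by gender:
print(naive_estimate - heights_male.mean())     # negative: under-estimates males
print(naive_estimate - heights_female.mean())   # positive: over-estimates females
```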
To deal with this issue, we add complexity to the regression model. The so-called gender adjustment adds a gender term:
[Gender-stratified] Population Average Height = 66.6 + 2.6 if Male - 2.6 if Female
This is the regression output if I throw the data into software. It's the same equation I showed in post #2. People then take the constant term (66.6) and call that the "gender-adjusted" sample average height.
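For readers who want to see the mechanics, here is a sketch that reproduces this output on the simulated data from above, using effect coding (+1 for male, -1 for female), one scheme a software package might apply behind the scenes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Same simulated stand-in for sample B (assumed 360/540 gender split).
height = np.concatenate([rng.normal(69.2, 3.0, 360),
                         rng.normal(64.0, 3.0, 540)])

# Effect coding: +1 for male, -1 for female, plus a constant column.
gender = np.concatenate([np.ones(360), -np.ones(540)])
X = np.column_stack([np.ones(900), gender])

# Ordinary least squares: height = constant + coefficient * gender
(constant, coefficient), *_ = np.linalg.lstsq(X, height, rcond=None)
print(constant, coefficient)    # roughly 66.6 and 2.6, up to simulation noise
```

Because the model fits each gender's average exactly, the constant lands halfway between the two stratum averages, no matter how many men and women are in the sample.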
Note that the above equation is equivalent to saying:
[Gender-stratified 2] Population Average Height for Males = 69.2 while Population Average Height for Females = 64.0.
The reason for calling this "gender-stratified" should now be clear. In fact, this is a better representation of the model because the value of the constant term is not unique.
A mathematically equivalent formulation of the above model is:
[Gender-stratified 2b] Population Average Height = 69.2 - 5.2 if Female
This little wrinkle catches many software users by surprise. Depending on how the software developer codes the gender variable, the regression outputs look different, even though they represent the same model. If you don't believe it, check that the new formula is also identical to:
[Gender-stratified 2] Population Average Height for Males = 69.2 while Population Average Height for Females = 64.0.
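A few lines of arithmetic, using only the stratum averages from the post, show why the two codings print different numbers while describing the same model:

```python
# Stratum averages from the post.
male_avg, female_avg = 69.2, 64.0

# Effect coding (+1 if male, -1 if female):
constant_effect = (male_avg + female_avg) / 2      # 66.6
coef_effect = (male_avg - female_avg) / 2          # 2.6

# Dummy coding (1 if female, 0 if male):
constant_dummy = male_avg                          # 69.2
coef_dummy = female_avg - male_avg                 # -5.2

# Both codings imply the same stratum estimates:
print(constant_effect + coef_effect, constant_dummy)               # males: 69.2
print(constant_effect - coef_effect, constant_dummy + coef_dummy)  # females: 64.0
```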
***
The standard way of thinking about gender adjustment is that it makes gender irrelevant. A gender-adjusted estimate removes the effect of gender, allowing us to focus on the outcome of interest.
A different - in my view, better - way of thinking about gender adjustment is that it acknowledges gender's effect on our outcome of interest. The regression adjustment involves building gender into the structure of the model.
The gender-stratified model finds that males and females have different heights: for males, we estimate an average of 69.2 inches, while for females, we estimate an average of 64.0 inches. By contrast, the naive model treats everyone as average, and we estimate the average height to be 66.1 inches.
Gender adjustment does not prove that gender is unimportant. It's the opposite: building gender into the model means gender cannot be ignored. If gender did not matter, we would expect the gender-stratified model to reduce to the naive model. In other words, the coefficient of the gender term would be zero (statistically speaking).
[Gender-stratified 3: Gender not important] Population Average Height for Males = 66.1 while Population Average Height for Females = 66.1.
And, formulated with a gender term:
[Gender-stratified 3: Gender not important] Population Average Height = 66.1 + 0 if Male - 0 if Female
So in this special place, the average woman is as tall as the average man.
What about the gender-adjusted sample average height of 66.6? This is the number that researchers in certain fields extract from such stratified models; recently, these gender-adjusted estimates have featured prominently in studies of vaccine effectiveness, including those by CDC scientists.
Mechanistically, these researchers build gender-stratified regression models, ignore the gender coefficient, and report the constant term as the gender-adjusted estimate. As explained in post #2, this procedure is equivalent to averaging the gender-level height estimates, placing equal weight on males and females.
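To make the equal weighting concrete:

Gender-adjusted estimate = (69.2 + 64.0) / 2 = 66.6

Compare this with the sample average of 66.1, which weights the female stratum more heavily because women are over-represented in the sample.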
In statistics textbooks, we routinely warn readers that the coefficients of a regression model ought to be interpreted holistically. Should this rule apply to vaccine studies? If not, why not?
I'll pick this question up in the next part of this series.
***
If you missed the previous posts, you can still catch up on post #1 and post #2.