I bet you can't believe I'm going to say I don't need/want data on a blog about intelligent thinking about data.
I'm going to argue that intelligent thinking about data includes recognizing when you don't need data - and by extension, when you don't need more data.
This post is motivated by circulating opinions that certain public health measures against infectious diseases, such as masks and lockdowns, are 100% useless - although if you start thinking about other domains, you can find pertinent examples there too.
Let's start with lockdowns as that's more black and white. I don't need any data to believe that lockdowns reduce infections. I'd need data if I wanted to establish the magnitude of protection but for the direction, I don't need data. Why? Because I have theory.
What theory? Infectious diseases spread through infections, which require contact between people. Turning this around, one can say no infection can occur unless there is contact. During the early part of the pandemic, and especially before vaccines were available, in key hotspots, schools and businesses were closed; people were asked to stay home; employees were allowed to work from home. As a result, the frequency of contacts each of us had with others was drastically reduced, and for some, reduced to near zero. By the theory, the dramatic reduction of contacts resulted in a reduction of infections.
Do I need data to prove that statement? No. [The word "theory" is weird. By theory, I mean an immutable phenomenon, such as incorporating the law of gravity in a physical model. There's probably a better word for this. Structural model?]
***
Would I like to have data? Sure. But it's not a must-have. If I had data, I would hope to answer more questions, such as: how large is the reduction? Does it affect all demographic segments equally?
Would I not want to have data? There is even an argument for this (although this sentiment won't be universally shared). I suspect strongly that any data that could be made available to me would be almost impossible to interpret - because we cannot run randomized experiments on lockdowns. Thus, it would take a lot of effort to clean the data, and to adjust the data, and none of this can be accomplished without making lots of subjective assumptions, sure to enrage some and delight others.
For example, all such data confound the effects of lockdowns and vaccinations because most places that had lockdowns also pushed vaccines simultaneously. What's more, masking, social distancing, and other measures were also simultaneously put to work. So, if someone had the data, they would be likely to confuse rather than illuminate. Let's not forget about enforcement of lockdowns, and compliance with lockdowns. Contacts might not have been reduced if the lockdown was in name only!
So if I had data, it could show infections going up, down or sideways, and the effect of lockdowns would be masked by all the other different factors. More data could be useless.
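To make the masking concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the region counts, the `severity` confounder, the rule that harder-hit regions are more likely to lock down, the compliance range, and the constants are hypothetical, not estimates from any real data. Infections are built to scale with contacts, so within any region a lockdown can only lower them, yet a naive comparison of locked-down and other regions can still point the wrong way.

```python
import random
from statistics import mean

BASELINE_CONTACTS = 10.0   # hypothetical contacts per person per day
MAX_CONTACT_CUT = 0.5      # hypothetical: full compliance halves contacts


def simulate(n_regions=2000, seed=0):
    """Generate one row per region: (lockdown, severity, contacts, infections)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_regions):
        # Confounder (hypothetical): how intense transmission is in the region.
        severity = rng.uniform(0.5, 2.0)
        # Assumption: harder-hit regions are more likely to lock down.
        lockdown = rng.random() < (severity - 0.5) / 1.5
        # Assumption: compliance ranges from none ("in name only") to full.
        compliance = rng.uniform(0.0, 1.0) if lockdown else 0.0
        contacts = BASELINE_CONTACTS * (1.0 - MAX_CONTACT_CUT * compliance)
        # Structural model: infections scale with contacts, so fewer
        # contacts can never mean more infections within a region.
        infections = severity * contacts
        rows.append((lockdown, severity, contacts, infections))
    return rows


def naive_difference(rows):
    """Mean infections in locked-down regions minus the mean in the rest."""
    locked = [r[3] for r in rows if r[0]]
    unlocked = [r[3] for r in rows if not r[0]]
    return mean(locked) - mean(unlocked)


def structural_effects(rows):
    """Per-region change in infections versus the same region with no lockdown."""
    return [r[3] - r[1] * BASELINE_CONTACTS for r in rows if r[0]]


rows = simulate()
print("naive difference:", round(naive_difference(rows), 2))
print("largest per-region effect:", round(max(structural_effects(rows)), 2))
```

Under these made-up settings the naive difference comes out positive (locked-down regions look worse) even though every per-region effect is zero or negative. The direction was fixed by the structure of the simulation; the data alone would not have revealed it.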
***
More data, however, could not overturn my theory... otherwise, my theory is just conjecture. If the data happened to show that, after adjusting for all kinds of other factors, lockdowns have zero or negative impact on infections, I'd be much more likely to reject some of the subjective assumptions or the indirect data proxies than to question the theory. Questioning the theory, in this case, would mean I have discovered evidence that infections occur without contacts.
Even if surveillance data show that compliance with lockdowns was nonexistent, the proper way to interpret the data is to say that lockdowns by themselves should reduce infections but their effect is masked by the lack of compliance, so that data that confound both factors show that infections did not drop. Embedded in there is still the theory that reducing contacts reduces infections.
The theory - the structural model - is an immutable part of the larger statistical model. We expect data to conform to this structure. We don't use data to modify the theory. (This structural model is not the same thing as a prior in Bayesian models. Bayesian priors are subject to updating on observing data so they do not represent anything immutable.)
***
There are other examples in other domains as well. Here is Andrew's model of golf putting (link). The section called "modeling from first principles" is an example of a statistical model that embeds a (geometric) theory. He also later uses the term "geometry-based" model.
The simple geometry is not enough, and further iterations of the model add other factors, but the original geometry is still in there.
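For readers who want to see what a geometry-based model looks like, here is a minimal sketch of the first-principles idea as I understand it from that case study: a putt goes in if the angular error of the shot is small enough for the ball to reach the hole, and the angular error is assumed to be normally distributed. The function names and the value of `sigma_rad` below are hypothetical choices for illustration, not the fitted values from the original analysis.

```python
import math

HOLE_RADIUS_FT = (4.25 / 2) / 12   # standard hole diameter is 4.25 inches
BALL_RADIUS_FT = (1.68 / 2) / 12   # standard ball diameter is 1.68 inches


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def putt_success_probability(distance_ft, sigma_rad):
    """Probability that the angular error stays within the geometric threshold."""
    margin = HOLE_RADIUS_FT - BALL_RADIUS_FT
    if distance_ft <= margin:
        return 1.0  # the ball is already overlapping the hole
    # Geometry: the largest angular error that still sends the ball in.
    threshold = math.asin(margin / distance_ft)
    # Angular error ~ Normal(0, sigma_rad); success means |error| < threshold.
    return 2.0 * normal_cdf(threshold / sigma_rad) - 1.0


# sigma_rad = 0.03 is an illustrative guess, not an estimate from data.
for distance in (2, 5, 10, 20):
    print(distance, round(putt_success_probability(distance, sigma_rad=0.03), 3))
```

The only thing data would estimate here is `sigma_rad`. The arcsine term comes from the geometry and stays fixed no matter what the data say, which is the sense in which the structure is immutable.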
***
This post records my first thoughts on this topic. I'll get to masks next.