A couple of researchers recently claimed to have found evidence of “a battle of the thermostat”: specifically, they argued that women perform significantly worse in a math test when it is administered at lower room temperatures while men’s performance does not decline. They did not find a similar gender gap in a word test.

The analysis strategy is similar to one that is employed by business analysts so readers here might be interested in how it went off track.

First, the researchers looked at the effect of temperature on test scores in aggregate. They found none. See the chart below, which replicates similar charts in the research paper.

Note that in my chart, the scores have been standardized (to be explained below). For both math and word scores, the difference in scores between the extreme temperatures of 15 to 32 Celsius (60 to 90 Fahrenheit) is smaller than 0.25 standard units, clearly not statistically significant by any measure.

The next step in the analysis strategy is to rotate through other factors that might influence test scores, such as gender, language spoken, and college major. In the business world, this is often called a **deep-dive analysis**. The question being asked is whether the effect of temperature on test scores differed by [factor X]. The researchers struck gold when splitting the data by gender. Here are the relevant charts:

In the top chart displaying math scores, it appears that the trend line for men is almost flat while women’s scores progressively increase as temperature increases. According to the usual convention of statistical significance, the effect is strong enough to be published. It is even possible to explain this observation with a "battle of the thermostat" story.
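To make the deep-dive step concrete, here is a minimal sketch that fits a separate trend line (score against temperature) within each level of a chosen factor. The rows, column names, and the helper `slopes_by_factor` are all made up for illustration; they are not the researchers' data or code.

```python
from collections import defaultdict

def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def slopes_by_factor(rows, factor):
    """Split rows by one factor and fit a temperature trend in each subgroup."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[factor]].append(row)
    return {
        level: slope([r["temp"] for r in g], [r["score"] for r in g])
        for level, g in groups.items()
    }

# Hypothetical toy data: one row per test-taker.
rows = [
    {"temp": 16, "gender": "F", "major": "arts", "score": 6},
    {"temp": 20, "gender": "F", "major": "science", "score": 7},
    {"temp": 24, "gender": "F", "major": "arts", "score": 8},
    {"temp": 28, "gender": "F", "major": "science", "score": 9},
    {"temp": 16, "gender": "M", "major": "science", "score": 9},
    {"temp": 20, "gender": "M", "major": "arts", "score": 9},
    {"temp": 24, "gender": "M", "major": "science", "score": 9},
    {"temp": 28, "gender": "M", "major": "arts", "score": 9},
]

# Rotate through the candidate factors, one split at a time.
for factor in ["gender", "major"]:
    print(factor, slopes_by_factor(rows, factor))
```

Each extra factor in the loop is another chance for some subgroup to show a slope by luck alone, which is worth keeping in mind when one split "strikes gold".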

***

Many other researchers have expressed skepticism that the reported effect can be replicated in other settings. See, for example, here.

I explored the data a bit more. Here is the raw data shown in side-by-side boxplots. I have arranged the sessions by the average temperatures, ordered from lowest to highest.

The following features caught my attention:

- The math scores of women (top chart, red boxplots) show unusually low variability. The dispersion of these scores is much below that of men’s math scores. The dispersion of these women’s math scores is also much below that of women’s word scores (bottom chart, red boxplots).
- Focusing only on the math scores (top chart), and inspecting the data session by session, from the right to the left, I found that the gender difference is hard to see (often obscured by the high variability of the male scores), except in a few sessions with the lowest temperatures (boxed area).

In terms of math scores, the shape of the distribution for men is clearly different from that for women. The scores for men were generally higher, and contain more extreme values on the high side. If a male and a female student score the same in absolute terms, the scores do not mean the same thing in relative terms – each relative to other students of the same gender.

***

One way to deal with this difference in variability is to standardize the scores by gender. This leads to the following chart, in which the scores are expressed relative to each gender’s distribution.

Notice how the math and word charts look almost identical. Also compare to the chart above using the raw data. This shows the fragility of statistical significance: the top chart shows a marginally significant effect while the bottom chart shows an effect that is not significant.
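For readers who want to try the by-gender standardization themselves, here is a minimal sketch. It assumes a plain z-score (subtract the gender's mean, divide by the gender's sample standard deviation); the numbers are invented, and the function name `standardize_by_group` is my own label.

```python
from statistics import mean, stdev

def standardize_by_group(scores, groups):
    """Convert each score to a z-score relative to its own group's distribution."""
    by_group = {}
    for s, g in zip(scores, groups):
        by_group.setdefault(g, []).append(s)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    stats = {g: (mean(v), stdev(v)) for g, v in by_group.items()}
    return [(s - stats[g][0]) / stats[g][1] for s, g in zip(scores, groups)]

# Hypothetical raw math scores: women tightly clustered, men widely spread
# with one high extreme.
raw = [6, 7, 8, 9, 10, 4, 8, 10, 14, 24]
gender = ["F"] * 5 + ["M"] * 5

z = standardize_by_group(raw, gender)

# The same raw score of 10 appears once for each gender (positions 4 and 7).
print("woman scoring 10:", round(z[4], 2))
print("man scoring 10:  ", round(z[7], 2))
```

In this toy example, a raw score of 10 sits above the women's average but below the men's, so the two standardized values have opposite signs even though the raw scores are identical.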

***

[Technical note: Here is a scatter plot of the standardized math scores against the raw scores, grouped by gender. You can see that the raw women's scores are in a narrower range, and that men's scores are more dispersed and contain some high extremes.]

***

Deep-dive analysis sometimes leads to unexpected discoveries. Aggregate data sometimes mask subgroup differences. But at other times, the subgroup difference is a mirage. At the disaggregated level, the sample size is smaller, and there may also be a difference in variability.

For more discussion of subgroup analysis, see **Chapter 3** of **Numbers Rule Your World**, in which I discuss how the insurance industry and the educational testing community tackle this problem.

You always have to wonder how much data dredging they did to find this result, although they don't seem to have had many choices for their subgroup analysis. Although you've mentioned some, they don't look like the type of hypotheses that would be sensible. "Arts majors are more sensitive to temperature changes" is not likely to get an acceptance, even in PLOS ONE.

There does seem to be some justification for the hypothesis. Women are different physically and physiologically which may make them more sensitive to temperature, and a quick Google does show up some evidence of a preference for higher temperatures.

For the difference in error variance I would try a nonparametric bootstrap. There are also a couple of programs that can fit models allowing for unequal variances. It wouldn't surprise me if that produced an even more convincing result.

Posted by: Ken | 06/16/2019 at 07:54 AM
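The nonparametric bootstrap suggested in the comment above could be sketched roughly as follows. This is one possible reading, not the commenter's procedure: it resamples test-takers with replacement within each gender (so each group keeps its own variability) and builds a percentile interval for the difference in temperature slopes. The data are simulated, and every name and number here is an assumption for illustration.

```python
import random

def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def resample_slope(pairs, rng):
    """Slope from one bootstrap resample of (temp, score) pairs."""
    while True:
        sample = [rng.choice(pairs) for _ in pairs]
        xs, ys = zip(*sample)
        if len(set(xs)) > 1:  # redraw if every temperature is identical
            return slope(xs, ys)

def bootstrap_slope_diff(women, men, n_boot=2000, seed=0):
    """95% percentile interval for (women's slope - men's slope)."""
    rng = random.Random(seed)
    diffs = sorted(
        resample_slope(women, rng) - resample_slope(men, rng)
        for _ in range(n_boot)
    )
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]

# Simulated data: women's scores rise with temperature and vary little;
# men's scores are flat but much noisier.
data_rng = random.Random(1)
temps = [16, 18, 20, 22, 24, 26, 28, 30]
women = [(t, 5 + 0.2 * t + data_rng.gauss(0, 1)) for t in temps for _ in range(5)]
men = [(t, 10 + data_rng.gauss(0, 3)) for t in temps for _ in range(5)]

lo, hi = bootstrap_slope_diff(women, men)
print(f"95% bootstrap interval for slope difference: [{lo:.3f}, {hi:.3f}]")
```

If the interval comfortably excludes zero, the gender difference in slopes survives the unequal variances; if it straddles zero, the evidence is weaker than the raw trend lines suggest.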