Andrew Gelman nails it again with this post titled "combining apparently contradictory evidence." He uses the example of repeated tests given to the same student, such as the scores from multiple assignments within the same course. One student might get 80,80,80 for three equally-weighted assignments while another student might get 80,100,60. The issue is that the sample size of three is too small to judge not only the average score but also the variability in score.
I made a comment about exactly the same problem I encountered when reading applications for the MS program at Columbia. Most of the applicants have good STEM undergrad degrees and no meaningful work experience. At first, I thought the three reference letters would be useful to differentiate the applicants.
It turns out that most applicants get 3 good references, almost always from professors who taught them. Occasionally, an applicant would get a poor reference, i.e. the professor is not recommending the student. However, in all such cases, the one poor reference is contradicted by two good references. So who do I believe? I typically don't know the authors of these references, and therefore have no external information about their reliability.
I am very aware that a sample of three is too small. It is tempting to treat the fact that this applicant got inconsistent references, while most other applicants did not, as a "signal" that this applicant must be worse than average. But drawing that conclusion ignores the small sample size - and the small-sample problem is even worse when drawing conclusions from the observed inconsistency of grades!
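To see how little three observations say about variability, here is a rough simulation sketch. It is my own illustration, not something from Gelman's post. I assume every student is statistically identical, with scores drawn from a normal distribution with mean 80 and standard deviation 10 (both numbers are made up, and scores are not capped at 100). The function name `sample_sd_spread` and the cutoffs for "looks consistent" and "looks erratic" are also just assumptions for the sake of the example.

```python
import random
import statistics


def sample_sd_spread(true_mean=80.0, true_sd=10.0, n_scores=3,
                     n_sims=20000, seed=0):
    """Simulate identical students who each receive n_scores scores.

    Returns the fraction of students whose observed standard deviation
    is under half the true value ("looks consistent") and the fraction
    whose observed standard deviation is over 1.5 times the true value
    ("looks erratic"). All parameters are hypothetical.
    """
    rng = random.Random(seed)
    observed_sds = []
    for _ in range(n_sims):
        # Every simulated student has the same true mean and spread.
        scores = [rng.gauss(true_mean, true_sd) for _ in range(n_scores)]
        observed_sds.append(statistics.stdev(scores))

    looks_consistent = sum(sd < 0.5 * true_sd for sd in observed_sds) / n_sims
    looks_erratic = sum(sd > 1.5 * true_sd for sd in observed_sds) / n_sims
    return looks_consistent, looks_erratic


consistent, erratic = sample_sd_spread()
print(f"looks consistent: {consistent:.1%}, looks erratic: {erratic:.1%}")
```

Under these assumptions, roughly a fifth of the identical students look unusually steady and roughly a tenth look unusually erratic, purely by chance. An 80,80,80 student and an 80,100,60 student could easily be the same kind of student.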
***
Thinking back to the grad school admissions process also makes me more sympathetic about the practical rationale for Princeton's decision to walk back its grade-deflation policy (see critical post here).
You might think undergrad GPAs are useful for making admissions decisions. The decades of grade inflation have vanquished this metric, as almost every recent graduate has a GPA in, say, the 3.5 to 3.8 range.
In fact, when a metric is gamed, it is usually not just uninformative - it can be anti-informative. Such metrics can lead to very bad conclusions.
Differences in GPAs no longer reflect differences in ability between students. They are more likely to be influenced by other factors such as (a) when the student graduated and (b) whether the school or department uses grade deflation (or a grading curve).
Take date of graduation. If someone has a GPA lower than 3.0, it is almost always the case that this student graduated in the 1990s or before. But the GPA numbers are typically not presented together with year of graduation - so if the analyst does not recognize and adjust for this long-term grade-inflation trend, then the older candidates face systematic discrimination.
This line of thinking takes me back to Princeton's decision to end grade deflation. Same problem here - when the admissions officer reads the GPA, it's not typically presented next to the college that grants it. There are hundreds of colleges the officer might come across during the admissions process, so it's impossible to hold in one's head the grading policies of so many colleges. In fact, even though I know a lot about Princeton's grading policies over the years, it still requires unreasonable effort to bring this contextual information into the decision-making. For one thing, I'd have to be aware of the different periods of grade inflation, then deflation, then inflation, etc. Therefore, I believe that the grade-deflation policy put Princeton graduates at a disadvantage when competing for scholarships, grad-school spots, etc.
***
If schools were required to release data on grades, it would be possible to overcome the interpretation problems with GPAs. Knowing the grade distributions by major, by school, and by year would be a good start.
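As a sketch of what one could do with such data, the snippet below converts a raw GPA into a percentile rank within the applicant's own cohort (same school, major, and year). The two cohorts and the helper `percentile_rank` are entirely made up for illustration; no real school's grades are represented here.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(gpa, cohort_gpas):
    """Percentile rank (0 to 100) of gpa within its cohort.

    Ties are handled with the mid-rank convention: the average of the
    share strictly below and the share at or below the given GPA.
    """
    ordered = sorted(cohort_gpas)
    below = bisect_left(ordered, gpa)
    at_or_below = bisect_right(ordered, gpa)
    return 100.0 * (below + at_or_below) / (2 * len(ordered))


# Hypothetical cohorts: one graded on a deflated scale, one inflated.
deflated_cohort = [2.7, 2.9, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5]
inflated_cohort = [3.4, 3.5, 3.6, 3.7, 3.7, 3.8, 3.8, 3.9]

print(percentile_rank(3.4, deflated_cohort))  # 81.25
print(percentile_rank(3.6, inflated_cohort))  # 31.25
```

In this toy example, the applicant with the lower raw GPA (3.4) sits much higher within their cohort than the applicant with the 3.6. That is the kind of adjustment a reader of applications cannot do in their head.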