Recently, there has been a lot of criticism of the College Board initiative officially known as the “Environmental Context Dashboard” and dubbed the “adversity score” by its opponents. I wrote about a similar issue in Numbers Rule Your World (link), in a chapter subtitled "The Dilemma of Being Together", in which I explain how the College Board tries to eliminate test questions that may be biased against certain demographic groups.
The underlying problem is the interpretation of test scores: if Bob scores 1200 and Cindy scores 1280, what does the score difference of 80 points mean?
One can simply say Cindy is superior to Bob since Cindy has the higher score.
The score difference contains information on not just the direction but also the magnitude of the comparison. Cindy is 80 points better than Bob.
But the number 80 carries no meaning unless one knows the scale. It’s not enough to know that the valid scores range from 400 to 1600. Scores are not evenly spread out inside that interval; by design, few test-takers get the extreme scores.
One way the College Board helps us interpret the score differential is by converting the scores into a “percentile” scale. Cindy’s score of 1280 is the 89th percentile: she did better than 89% of the test-takers. Bob’s is at the 81st percentile. How much better is Cindy? One answer is: there are 8 percent of test-takers who score higher than Bob but lower than Cindy. (This PDF from the College Board has a table that converts between test scores and percentiles.)
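The percentile arithmetic above can be sketched in a few lines of Python. Note that the score distribution below is purely synthetic; its shape and parameters are invented for illustration and do not match the College Board's published conversion table.

```python
import random

random.seed(0)

# Synthetic score distribution for illustration only -- the numbers
# are invented and do NOT match the College Board's published table.
scores = [min(max(round(random.gauss(1050, 200), -1), 400), 1600)
          for _ in range(100_000)]

def percentile_rank(score, population):
    """Percent of test-takers in `population` scoring below `score`."""
    below = sum(1 for s in population if s < score)
    return 100 * below / len(population)

bob, cindy = 1200, 1280
p_bob = percentile_rank(bob, scores)
p_cindy = percentile_rank(cindy, scores)

# The gap between the two percentile ranks is exactly the share of
# test-takers who score at or above Bob's mark but below Cindy's.
share_between = p_cindy - p_bob
print(f"Bob: {p_bob:.0f}th percentile, Cindy: {p_cindy:.0f}th, "
      f"gap: {share_between:.0f} percent of test-takers")
```

The key property is that the difference between two percentile ranks is itself interpretable: it is the fraction of the population sitting between the two scores.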
***
A third student, Angela, scores 1000. How much better is Cindy compared to Angela? Since a score of 1000 is 48th percentile, we know that 41 percent of test-takers “sit” between Cindy and Angela. Cindy appears to be much better than Angela.
The percentile scale contains an implicit comparison of each student against all test-takers. Our next concern is which group a student should be compared against. This is what statisticians call the “control group”.
Let’s dig a bit deeper. If we know further that Angela went to a public school in a low-income neighborhood while Cindy went to an elite private school in New York, one may interpret their respective test scores differently. Angela’s score of 1000 puts her in the top 10% of her school and other similar schools, while Cindy’s score of 1280 puts her in the bottom 25% of her school and other similar schools. To measure this type of comparison, one can compute a different set of percentiles – instead of computing percentiles against all test-takers, one uses test-takers from similar schools and backgrounds.
In this new percentile scale, Angela’s score of 1000 might translate into 90th percentile while Cindy’s might become 28th percentile. The difference in interpretation is due to different definitions of “all else being equal”.
It’s important to realize that when we interpret differences, we implicitly make an assumption of “all else being equal”. How “all else being equal” is defined matters a lot.
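The two definitions of “all else being equal” can be sketched in code. Everything below is hypothetical: the two school populations, their means, and their spreads are invented solely to show how the same raw scores can land at very different percentiles depending on the control group.

```python
import random

random.seed(1)

# Hypothetical synthetic populations; the group means and spreads
# are invented for illustration, not drawn from any real data.
low_income_school = [round(random.gauss(950, 150)) for _ in range(50_000)]
elite_private_school = [round(random.gauss(1350, 120)) for _ in range(50_000)]
all_test_takers = low_income_school + elite_private_school

def percentile_rank(score, population):
    """Percent of test-takers in `population` scoring below `score`."""
    below = sum(1 for s in population if s < score)
    return 100 * below / len(population)

angela, cindy = 1000, 1280

# Comparison 1: against all test-takers (the absolute comparison).
angela_overall = percentile_rank(angela, all_test_takers)
cindy_overall = percentile_rank(cindy, all_test_takers)

# Comparison 2: against each student's own peer group (the relative
# comparison) -- a different control group for each student.
angela_peers = percentile_rank(angela, low_income_school)
cindy_peers = percentile_rank(cindy, elite_private_school)
```

With these invented populations, Angela's peer-group percentile comes out well above her overall percentile, while Cindy's comes out below hers: the same two raw scores, ranked against different control groups.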
***
The initiative by the College Board is an attempt to define “all else being equal” in a more rigorous manner. The new “scores” allow admissions officers to establish control groups and look at relative rather than absolute comparisons of scores. It’s not that one comparison or the other is correct.
Angela is, on an absolute basis, worse than Cindy, but she is relatively better than Cindy when each is compared to her peers. Both statements are true at the same time and do not contradict each other.
***
The Environmental Context Dashboard or adversity scoring is an attempt to look at relative comparisons of subgroups of test-takers. Vox has a nice round-up of recent coverage. For a more positive story, see this US News report. Slate's take is a criticism of "black-box" models, in which users are not told how scores are constructed.
The contents of Chapter 3 of Numbers Rule Your World (link) play this out at the level of individual test questions, instead of aggregate test scores. What does differential performance on a specific test question say about the different abilities of test-takers? What if one finds that certain groups systematically score lower on a particular test question? Read the chapter to learn more.