In the prior post, I featured the following chart:

This is a simulation of the distribution of GPAs of Yale graduates based on the just-published 2023 distribution of course grades.

What Yale released is a distribution of grades across all courses and departments during the last decade. The press picked up the 2023 grade distribution in particular, and I also used that as a basis for the simulation.

The published information is a distribution of grades, which is not the same as a distribution of GPAs. The GPA is an average over 36 courses, which is the requirement for graduation at Yale. In order to bridge the gap between course grades and student GPAs, we need to know how grades are distributed within students.

Since this information is missing, a statistician will leverage a "model" to fill in the blank.

The most naive model is that all students are alike: everyone is the average student, with the same distribution of grades. Over the 36 courses taken, this means roughly 22 As, 7 A-s, 4 B+s and 3 grades of B or lower. This model replicates the average GPA of about 3.70. In fact, every student has the same GPA.
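As a quick check on that arithmetic, here is a minimal sketch in Python. It assumes the usual 4.0 grade-point scale and borrows the 2.30 average for the "B or lower" bucket that is quoted further down in this post; the variable names are mine, purely for illustration.

```python
# Grade points on the usual 4.0 scale. The 2.3 for the catch-all bucket is
# the average quoted later in the post (an assumption of this sketch).
GRADE_POINTS = {"A": 4.0, "A-": 3.7, "B+": 3.3, "B or lower": 2.3}

# Rounded course counts for the "average student" over 36 courses.
average_student = {"A": 22, "A-": 7, "B+": 4, "B or lower": 3}

n_courses = sum(average_student.values())
total_points = sum(GRADE_POINTS[g] * n for g, n in average_student.items())
naive_gpa = total_points / n_courses

print(n_courses)            # 36
print(round(naive_gpa, 2))  # 3.72
```

Because the course counts are rounded to whole numbers, the result lands at 3.72 rather than exactly 3.70.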

An obvious improvement is to insert more variability so that different students have different GPAs while keeping the average GPA at 3.70.

***

The model that underlies the chart above is called "multinomial". It assumes all courses are like the average course. Any student enrolling in any course faces the following probabilities: 60% chance of getting an A, 20% chance of getting an A-, 10% B+, and 10% C+. (The last category C+ is a catch-all category: in real life, it stands for a grade B or below; based on the released information, the average of all grades in this category is 2.30, which equates to a C+.)
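A small sketch can confirm that these four probabilities are consistent with the 3.70 average. The dictionary layout and names are illustrative assumptions, and the 2.3 entry is the catch-all average described above.

```python
from math import isclose

# (grade points, probability) for each outcome in the multinomial model.
# "C+" is the catch-all bucket whose average grade is 2.30.
grade_model = {
    "A": (4.0, 0.60),
    "A-": (3.7, 0.20),
    "B+": (3.3, 0.10),
    "C+": (2.3, 0.10),
}

total_prob = sum(prob for _, prob in grade_model.values())
expected_grade = sum(points * prob for points, prob in grade_model.values())

print(isclose(total_prob, 1.0))   # True
print(round(expected_grade, 2))   # 3.7
```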

Each student takes 36 courses, and for each course, the student can obtain one of those four grades, with those stated probabilities. It turns out there are about 9,100 different grade combinations, starting with 36 As (perfect) and ending with 36 C+s. Each combination leads to a GPA value that ranges between 2.30 and 4.00.
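The count of grade combinations can be reproduced by brute force. The sketch below treats a combination as a tuple of counts (As, A-s, B+s, C+s) summing to 36, and cross-checks the total against the standard "stars and bars" formula; the names are my own.

```python
from math import comb

N_COURSES = 36
POINTS = (4.0, 3.7, 3.3, 2.3)  # A, A-, B+, catch-all C+

# Every way to split 36 courses into counts of the four grades.
combos = []
for a in range(N_COURSES + 1):
    for a_minus in range(N_COURSES + 1 - a):
        for b_plus in range(N_COURSES + 1 - a - a_minus):
            c_plus = N_COURSES - a - a_minus - b_plus
            combos.append((a, a_minus, b_plus, c_plus))

def gpa(counts):
    """Average grade points over the 36 courses."""
    return sum(p * n for p, n in zip(POINTS, counts)) / N_COURSES

gpas = [gpa(c) for c in combos]

print(len(combos))                               # 9139
print(comb(N_COURSES + 3, 3))                    # 9139
print(round(min(gpas), 2), round(max(gpas), 2))  # 2.3 4.0
```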

Those 9,100 combinations reduce to under 200 distinct GPA values, if we round all GPAs to two decimal places. This is because there are many ways to obtain the same GPA. For example, anyone with 35 As and 1 A- has a 3.99 GPA, but the one A- can come from the 1st, 2nd, ... or 36th course taken.

For each value of GPA (e.g. 4.00, 3.99), I aggregate all the different grade combinations, and the probabilities of a student getting those combinations. Those probabilities are what you see in the chart above. The probabilities are cumulative, meaning that they represent the probability of getting a GPA at or below some value.
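The aggregation described above can be sketched as follows. This is one possible way to do it, not necessarily how the chart was produced: it weights each combination by its multinomial probability, rounds GPAs half-up to two decimals using integer arithmetic (a rounding convention I am assuming), and accumulates the result.

```python
from collections import defaultdict
from math import comb

N = 36
PROBS = (0.60, 0.20, 0.10, 0.10)  # A, A-, B+, catch-all C+
TENTHS = (40, 37, 33, 23)         # grade points times 10, kept as integers

pmf = defaultdict(float)          # key: GPA in hundredths, e.g. 370 for 3.70
for a in range(N + 1):
    for am in range(N + 1 - a):
        for bp in range(N + 1 - a - am):
            counts = (a, am, bp, N - a - am - bp)
            # Number of course orderings that yield these counts.
            ways = comb(N, a) * comb(N - a, am) * comb(N - a - am, bp)
            prob = float(ways)
            for p, n in zip(PROBS, counts):
                prob *= p ** n
            total_tenths = sum(t * n for t, n in zip(TENTHS, counts))
            # GPA * 100 = total_tenths * 5 / 18, rounded half-up.
            pmf[(5 * total_tenths + 9) // 18] += prob

# Cumulative probabilities: P(GPA <= value).
cdf, running = {}, 0.0
for g in sorted(pmf):
    running += pmf[g]
    cdf[g] = running

def prob_at_or_below(gpa):
    """P(rounded GPA <= gpa) under the multinomial model."""
    eligible = [g for g in cdf if g <= round(gpa * 100)]
    return cdf[max(eligible)] if eligible else 0.0

print(len(pmf))                # number of distinct rounded GPA values
print(prob_at_or_below(3.70))  # share of students at or below 3.70
```

Calling `prob_at_or_below` with other thresholds gives the points along the cumulative curve.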

For example, the probability of getting a GPA of 3.70 or below is 50%. This is often interpreted as half the students get 3.70 or below. The probability of getting a GPA at 3.87 or above (i.e. between 3.87 and 4.00) is 1 percent so the top 1 percent of graduates earn GPAs at 3.87 or above.

Notice how tightly the GPAs cluster. Half the student body is bunched between 3.64 and 3.75. Ninety percent of the GPAs are between 3.48 and 3.87.

Since each GPA is an average of 36 grades, classic statistical theory predicts that the distribution of GPAs is substantially less variable than the distribution of grades. While 10 percent of grades are B or below, it is almost impossible to get a GPA of B+ or below.
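The "classic statistical theory" here is that the standard deviation of an average of n independent draws is the standard deviation of a single draw divided by the square root of n. A short sketch under the model's independence assumption, with illustrative names:

```python
from math import sqrt

# (grade points, probability) under the multinomial model.
MODEL = [(4.0, 0.60), (3.7, 0.20), (3.3, 0.10), (2.3, 0.10)]
N_COURSES = 36

mean_grade = sum(x * p for x, p in MODEL)
var_grade = sum(p * (x - mean_grade) ** 2 for x, p in MODEL)

grade_sd = sqrt(var_grade)            # spread of a single course grade
gpa_sd = grade_sd / sqrt(N_COURSES)   # spread of the 36-course average

print(round(grade_sd, 3))  # 0.516
print(round(gpa_sd, 3))    # 0.086
```

The GPA spread is one-sixth of the grade spread, so a GPA of 3.30 sits more than four standard deviations below the 3.70 average.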

***

The multinomial model itself uses many assumptions.

We assume every student faces the same odds. In fact, the grading distribution depends on which department, which professor, which course, etc. We also make many assumptions of independence. We assume that all students are independent of each other - this would be violated if some students collaborate on homework, or form study groups, for example. We assume that all courses are independent of each other. An obvious violation of this assumption is that some departments have stricter grading policies than others. As pointed out in my first post, it is also likely that a student who gets an A in one class tends to get an A in another class.
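To see why the "same odds" assumption matters, here is a purely hypothetical sketch. It splits students into two equal-sized groups with made-up grade probabilities, chosen only so that the pooled course-level distribution is still 60/20/10/10. None of these group numbers come from Yale's data.

```python
from math import sqrt

POINTS = (4.0, 3.7, 3.3, 2.3)  # A, A-, B+, catch-all C+
N_COURSES = 36

# Hypothetical groups (invented for illustration), each half of the class.
GROUPS = {
    "stronger": (0.70, 0.18, 0.07, 0.05),
    "weaker": (0.50, 0.22, 0.13, 0.15),
}

def mean_and_var(probs):
    mean = sum(x * p for x, p in zip(POINTS, probs))
    var = sum(p * (x - mean) ** 2 for x, p in zip(POINTS, probs))
    return mean, var

# Pooled course-level distribution: should match the published-style odds.
pooled = tuple(sum(g[i] for g in GROUPS.values()) / 2 for i in range(4))
pooled_mean, pooled_var = mean_and_var(pooled)

# Same odds for everyone: GPA variance is the grade variance over 36.
sd_same_odds = sqrt(pooled_var / N_COURSES)

# Two groups: within-group variance of the average, plus between-group variance.
stats = [mean_and_var(p) for p in GROUPS.values()]
within = sum(v for _, v in stats) / len(stats) / N_COURSES
between = sum((m - pooled_mean) ** 2 for m, _ in stats) / len(stats)
sd_two_groups = sqrt(within + between)

print(round(sd_same_odds, 3))   # 0.086
print(round(sd_two_groups, 3))  # 0.14
```

Both scenarios produce the same course-level grade distribution, yet the GPA spread is noticeably wider once students differ from one another.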

I also assume that the grade distribution at the course level is constant through time. Yale's report itself indicates that this assumption is not realistic: it describes a Covid-19 effect that caused another inter-year jump in grades. Note that the first 9 courses taken by any graduating senior date from four years earlier, so I should have used the course grade distribution from that year rather than the one from 2023. This is an assumption of convenience, which is different from the structural assumptions described in the prior paragraph. I could easily relax it by making the model more complex and using the additional data released by Yale. I just chose not to bother with the extra complexity for a blog post.

The independence assumptions, however, cannot be simply removed. I'd have to gather details like course departments and professors, and study groups, and so on. None of those data are simple to obtain.

At a high level, every assumption papers over some missing information. Many modern statisticians, especially those of the Bayesian variety, are highly comfortable making assumptions. Doing so is akin to clearing hurdles: once past them, the terrain is much smoother. However, it is very easy to forget about these assumptions, some of which may represent reality poorly.
