US News school rankings are rigged. That much you should know after reading Chapter 1 of **Numbersense (link)**.

Is it the school administrators' fault that they one-up each other gaming the ranking to the nth degree? Or is it the people's fault for selecting one school over another because it's ranked higher on US News? Should US News have fact-checked submitted data? Can we blame US News for providing a product that apparently is deemed highly valuable?

This investigative report by Columbia professor Michael Thaddeus (tip from Andrew Gelman's blog) serves two purposes:

a) it exposes the tricks of the trade that have pushed Columbia up the ranking table from #10-ish to #2 over the last two decades

b) it confirms that prior exposés of bad behavior have not deterred gamers (and Columbia is certainly not the only school doing this)

***

Prof. Thaddeus started out by repeating US News's selling point that 80% of the ranking is "based entirely on numerical data collected by the institution" - data which are presumed to be "objective." (N.B. self-reported, unaudited)

He proceeded to follow the numbers trail, and in each case, discovered that such supposedly "objective" "numerical data" are not objective at all - which is a key lesson of Chapter 1 of **Numbersense (link)**. In the professor's words, "several of the key figures supporting Columbia’s high ranking are inaccurate, dubious, or highly misleading." Most data are subjective, period.

Here as in Chapter 1, it is reasonable for a casual consumer of data insights to wonder how data such as "undergraduate class size", "percent of faculty who are full-time", and "student-faculty ratio" can be subjectively interpreted.

Aren't there clear, simple formulas for each item?

**Undergraduate class size** is the total number of class enrollments by all undergraduates divided by the total number of classes.

**Percent of faculty who are full-time** is the number of full-time faculty divided by the total number of faculty.

**Student-faculty ratio** is the number of enrolled students divided by the total number of faculty.
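Written as code, the three naive formulas are one-liners. This is only a sketch of the definitions as stated above; the function names and the input numbers are made up for illustration and are not Columbia's or US News's figures.

```python
def average_class_size(undergrad_enrollments: int, n_classes: int) -> float:
    """Total class enrollments by undergraduates divided by number of classes."""
    return undergrad_enrollments / n_classes


def percent_full_time(full_time_faculty: int, total_faculty: int) -> float:
    """Full-time faculty as a percentage of all faculty."""
    return 100 * full_time_faculty / total_faculty


def student_faculty_ratio(enrolled_students: int, total_faculty: int) -> float:
    """Enrolled students per faculty member."""
    return enrolled_students / total_faculty


# Hypothetical inputs, chosen only to exercise the formulas.
print(average_class_size(400, 20))
print(percent_full_time(90, 100))
print(student_faculty_ratio(600, 100))
```

Each function takes two counts. The arithmetic is never the problem; everything depends on who decides what gets counted as a "class", a "faculty member", or a "student".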

Is there any room for interpretation for such simple statistics?

***

Let Prof. Thaddeus tell you. It's more than your eyes can see!

According to US News, Columbia said 83% of its undergraduate classes have <20 students while 9% have over 50 students, so about 8% have between 20 and 50 students.

Prof. Thaddeus pointed out that the 83% number is well above the figure for most colleges in the U.S. The median value for all National Universities was 48%. The second and third highest numbers behind Columbia in the top 100 were both at 78%.

This statistic seems to contradict hearsay. I have never heard anyone describe Columbia as a place with small class sizes; if anything, I hear the opposite (admittedly, most of my sources are in graduate schools).

Prof. Thaddeus learned that US News applies the same set of criteria to determine class size as used in the "Common Data Set", which publishes standardized data on U.S. colleges. (Columbia is not part of this group, and is one of the eight schools in the Top 100 that do not belong.)

This is where you learn that "undergraduate class size" is not what you think it is. It's not what the simple formula above computes.

First, an undergraduate class is "a class that enrolled at least one undergraduate". This means that many graduate school classes are counted as undergraduate just because one out of 20 students is an undergraduate. Further, if a student is doing a joint bachelor's/master's program, then that student by him/herself will cause all his/her graduate-level classes to be classified as undergraduate!

Second, a long list of class types is excluded no matter how many undergraduates enroll in them. An internship class that grants course credit, for example, is not considered an undergraduate class.
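To see how these two rules interact, here is a minimal sketch of the classification logic as described in this post. The `Course` record, the course names and the enrollment counts are all hypothetical, and the exclusion list contains only the one class type named above; the actual list is longer and is not reproduced here.

```python
from dataclasses import dataclass

# Only "internship" is named in the text; the real exclusion list is longer.
EXCLUDED_TYPES = {"internship"}


@dataclass
class Course:
    name: str
    kind: str
    undergrads: int
    grads: int


def counts_as_undergraduate(course: Course) -> bool:
    """Apply the two rules: excluded types never count; otherwise one undergrad is enough."""
    if course.kind in EXCLUDED_TYPES:
        return False
    return course.undergrads >= 1


# Hypothetical courses for illustration.
courses = [
    Course("Intro lecture", "lecture", undergrads=150, grads=0),
    Course("Graduate seminar", "seminar", undergrads=1, grads=19),
    Course("Credit internship", "internship", undergrads=30, grads=0),
    Course("Doctoral seminar", "seminar", undergrads=0, grads=8),
]

counted = [c.name for c in courses if counts_as_undergraduate(c)]
print(counted)
```

In this toy list, the graduate seminar with a single undergraduate is counted as an undergraduate class, while the internship with 30 undergraduates is not.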

Back to Columbia. Thaddeus now made a major discovery: Columbia prepares two separate (non-public) datasets, one for "Columbia College and Columbia Engineering" and one for "School of General Studies". You may be reminded of "In Labor Force" and "Not in Labor Force" when governments compute unemployment rates (Chapter 6 of **Numbersense (link)**). As I read this part, I'm anticipating that as this story unfolds, the students in the SGS will be forgotten and abandoned.

***

Let's leave Columbia aside for a second. The class size metric is misleading even if schools provide honest data. The problem is the units. When US News says 83% of Columbia's classes have fewer than 20 students, the base unit is a class. At most schools, class size has a skewed distribution: a small number of courses (such as ECON 101) have enormous enrollments while a long tail of courses have relatively few students.

Let's say a school with 100 students offers 20 undergraduate classes. Each student takes 4 classes, resulting in 400 total enrollments. Two of these classes are compulsory so the enrollment is 100 each. The other 200 enrollments are split between 18 classes, so each of the smaller classes enrolls about 11 students. Each student therefore takes 2 large classes (n=100) and 2 small classes (n=11), and the average class size experienced by each student is (100+11)/2, or about 56.

What if we apply the industry formula for computing class size? It's based on classes not enrollments. So 18 of the classes have size 11 while 2 of them have size 100, thus the average class size is 20.

Ninety percent of the classes have enrollment <= 20 students and yet each student experienced an average class size of 56. Ouch!

[P.S. If enrollments rather than classes were counted, then 50% of enrollments would be in classes of size 100 while 50% would be in classes below 15. This gives a more accurate description of reality.]
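The toy school above is easy to check in a few lines. This is a sketch using the made-up numbers from the example (2 classes of 100, and 200 enrollments spread evenly over 18 classes); the variable names are mine. The "experienced" average weights each class by its enrollment, which is the same as averaging over enrollments rather than over classes.

```python
from fractions import Fraction

# Toy school from the example: 2 compulsory classes of 100 students each,
# plus 200 enrollments split evenly across 18 small classes (about 11 each).
sizes = [Fraction(100)] * 2 + [Fraction(200, 18)] * 18

total_enrollments = sum(sizes)

# Class-based average: every class counts once, however large it is.
class_average = total_enrollments / len(sizes)

# Enrollment-weighted average: every seat counts once, so big classes dominate.
experienced_average = sum(s * s for s in sizes) / total_enrollments

# Share of classes with fewer than 20 students (the US News style of statistic).
share_small_classes = Fraction(sum(1 for s in sizes if s < 20), len(sizes))

# Share of enrollments that sit in the 100-student classes.
share_enrollments_large = sum(s for s in sizes if s >= 100) / total_enrollments

print(float(class_average))            # 20.0
print(round(float(experienced_average), 1))  # 55.6
print(float(share_small_classes))      # 0.9
print(float(share_enrollments_large))  # 0.5
```

The same 20 classes produce a class-based average of 20 and an experienced average of about 56, depending only on whether the unit of analysis is the class or the enrollment.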

***

Undeterred by the lack of disclosure, Prof. Thaddeus did the prototypical data science thing: he scraped the web, in particular Columbia's Directory of Classes, for information on every class and its maximum enrollment size.

While analyzing scraped data provides a good approximation, this method (like most web scraping exercises) cannot claim to be exact. One problem is that capacity is not the same as actual enrollment. In fact, some classes may be offered in the catalog and later cancelled. But this information is sufficient to establish an upper bound for the class size, as well as a reasonable guess at the true average. Using this, the professor concluded that roughly 63 to 67% of classes have fewer than 20 students, "nowhere near the figure of 82.5% claimed by Columbia."

For the other metrics, I'll refer you to Prof. Thaddeus's report.
