In class last week, I discussed this New York Times article with the students. One of the claims in the article is that the U.S. News ranking of colleges is under threat from newcomers whose rankings are more relevant because they more directly measure outcomes such as earnings of graduates.
This specific claim in the article makes my head hurt: "If nothing else, earnings are objective and, as the database grows into the millions, reliable."
The entire Chapter 1 of Numbersense (link) is devoted to blowing apart the myth that school rankings are "objective." In fact, I go on to assert that even objective-sounding metrics like company revenues are not objective at all... if you know GAAP and the games accountants play with those numbers. If someone buys a car on eBay from another person for $20,000, does eBay book $20,000 in revenues or just the x% of the sale price that the seller pays to eBay?
Objective implies there is a ground truth that can be verified. There is no true school ranking, nor is there true revenue.
Where does this "objective" earnings data come from? Apparently a company called Payscale, whose methodology is explained here. They say their data come from "individuals who fill out the PayScale Salary Survey." How did these people discover the survey? We don't really know. Are these people representative of the universe of employed people? Most likely not but we again do not know.
Do people give out their real salaries voluntarily? Not in my experience. In fact, none of my co-workers in my 15+ years in the corporate world has ever told me how much money they make. PayScale claims that the salary number "combines base annual salary or hourly wage, bonuses, profit sharing, tips, commissions, overtime, and other forms of cash earnings, as applicable." Do people have their aggregate salaries at their fingertips? I highly doubt it (unless they make just a base salary).
In return for filling out the surveys, the individuals receive a free salary report. Does this encourage people to make up fake data just to obtain the salary report? Take a guess.
PayScale claims that it "rigorously tests and verifies" the data. Given that most employers wouldn't even do salary verification for other employers, I don't believe the salary data can be verified, and absolutely not "every data point" as claimed in their marketing materials.
The other ludicrous claim is that the "reliability" of the data improves with scale. This is only true if the data is a proper random sample of the population of all salaries, which it clearly isn't. Dumping more garbage on top of garbage is still garbage.
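To make the point concrete, here is a minimal simulation sketch. Everything in it is invented for illustration: the log-normal salary population, its parameters, and the rule that higher earners are more likely to respond. None of it comes from PayScale's data or methodology. It compares the average salary from a proper random sample with the average from a self-selected sample as the number of responses grows.

```python
import random
import statistics
from math import exp

MU, SIGMA = 11.0, 0.5               # hypothetical log-normal salary population
TRUE_MEAN = exp(MU + SIGMA**2 / 2)  # exact mean of that population

def survey_mean(n_responses, rng, self_selected):
    """Average salary among n_responses collected survey responses."""
    responses = []
    while len(responses) < n_responses:
        salary = rng.lognormvariate(MU, SIGMA)
        # Assumption: under self-selection, the chance of responding
        # grows with salary; a random sample accepts everyone drawn.
        p_respond = min(1.0, salary / 200_000) if self_selected else 1.0
        if rng.random() < p_respond:
            responses.append(salary)
    return statistics.fmean(responses)

rng = random.Random(2016)  # fixed seed so the run is repeatable
results = {}
for n in (100, 1_000, 10_000, 100_000):
    results[n] = (survey_mean(n, rng, False), survey_mean(n, rng, True))
    print(f"n={n:>7,}  random: {results[n][0]:>9,.0f}  "
          f"self-selected: {results[n][1]:>9,.0f}  truth: {TRUE_MEAN:,.0f}")
```

Under these assumptions, the random-sample average settles near the true mean as n grows, while the self-selected average settles somewhere else. More rows make the wrong answer more precise, not more right.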
Consider this sequence of scenarios: if
(a) you are the Devious Dean of Admissions at a college, and
(b) improving your college's ranking is on your annual performance management plan, and
(c) you know that the Economist ranking is largely based on the PayScale Salary Report, and
(d) PayScale's data come from the Salary Surveys which do not require explicit identity verification, and
(e) you have access to various devices that can access these Salary Surveys
why are you not sending in a bunch of fake reports of outsized compensation?
Oh, my Mom reads this blog so let me not promote unethical behavior. You don't need to fake data. You can just target a bunch of alumni who have had successful careers, and encourage them to send in their true reports.
***
The other key sentence in the article is: "[The Economist] took the College Scorecard earnings data and performed a multiple regression analysis to assess how much a school’s graduates earn compared with how much they might have made had they attended another school."
I asked the students what are the explanatory (X) variables that might be found in such a regression. Some of the answers were: job title, gender, location of job, GPA, college major, number of years of work experience, family background. Essentially the "all else equal" requires a lot of covariates.
It turns out PayScale has a product called MarketMatch (link) which gives us some hints. Here is a description of this product:
The MarketMatch algorithm uses a two-step process for producing compensation data in a PayScale report. The first step is to understand which of our more than 250 compensable factors are important when it comes to pricing a job and how that job's pay is affected by these compensable factors. This is done in order to define a pay distribution for this job. The mix of compensable factors and their effect on pay is highly dependent upon the job. For example, coding languages and locations are important compensable factors for a Software Developer, while average sales prices and annual sales are important for an Account Executive.
This description leads to a very complex multiple regression model, with "more than 250" covariates, and a host of interaction effects (e.g. allowing the effect of location to depend on the job title). This model has at least 250 main effects. If it has all pairs of 2-way interactions, the regression equation has over 31,000 more terms.
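The term count is quick to check. This sketch assumes exactly 250 factors, each entering as a single main effect, plus one interaction term for every pair of factors; the real model's structure is not public, so treat the variable names and the setup as illustrative only.

```python
from math import comb

n_factors = 250                             # "more than 250" in the description
main_effects = n_factors                    # one term per factor
two_way_interactions = comb(n_factors, 2)   # one term per unordered pair
total_terms = main_effects + two_way_interactions

print(two_way_interactions)  # 31125
print(total_terms)           # 31375
```

If any factor is categorical with more than two levels (location, say), each count above goes up further.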
PayScale did earlier impress us with their 1.4 million salary profiles (which, for the following discussion, we assume to be objective and reliable). This, they say, translates to anywhere from 50 to 4000 profiles per school. While the lower limit is 50 students, PayScale actually does not publish results for schools with fewer than 325 profiles.
Even with 4000 profiles, you can't estimate tens of thousands of regression coefficients with any semblance of accuracy. If each of the 250 factors were binary ("Yes"/"No"), you would have created 2^250 unique types of individuals. That number is roughly 1.8 x 10^75, a 76-digit number, and you only have 4,000 observations. For the overwhelming majority of these types of individuals for which you are issuing predicted salaries, you have zero data.
Are the resulting salary predictions "reliable"? You decide.
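For a sense of scale, here is a small sketch of the arithmetic, assuming 250 binary factors and the 4,000-profile best case. The variable names are mine, and the "share observed" is an upper bound that assumes every profile lands in a different cell.

```python
n_factors = 250
n_profiles = 4000

n_types = 2 ** n_factors        # distinct Yes/No combinations
digits = len(str(n_types))      # how many digits that count has

# Upper bound on the share of types with at least one observation,
# reached only if no two profiles share the same combination.
max_share_observed = n_profiles / n_types

print(digits)  # 76
print(f"{max_share_observed:.1e}")
```

The share printed is on the order of 10^-72, so for practical purposes every predicted salary is an extrapolation from the model, not an average of observed people.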
"The first step is to understand which of our more than 250 compensable factors are important when it comes to pricing a job and how that job's pay is affected by these compensable factors." says to me that they do not use the same set of 250 for each occupation. For example statistician/data scientist or similar the computer languages known are important but are not at all for a human resources graduate, so wouldn't be asked.
I expect that the way of obtaining data is to offer people who search the web for salaries the possibility to compare theirs to others by filling out a survey. So it is biased towards people who are either looking for an increase in salary or a new job. Also, people have a tendency to ask hypothetical questions, for example: if I got a Master's, would an extra $20,000 per year be typical?
Posted by: Ken | 10/28/2016 at 09:01 PM
Maybe they use regularization to not add even all the potential main effects. Maybe they are smarter than your naive strawman?
Posted by: Chris | 11/02/2016 at 07:42 PM
Chris: There are no technical solutions to the first problem, which is unreliable input. As for the second problem, they explicitly use an interaction effect as an example of value add so they clearly have *significant* interaction effects in their model. You can use regularization but if you don't have enough data, either your effects are artifacts of your model, or your effects are shrunk to zero. How do you think regularization helps?
Posted by: Kaiser | 11/03/2016 at 10:47 AM