Right now, the home page of the New York Times online shows the following headline for me:
[Link to article]
The chart is an example of great data visualization. It is concise and not over-elaborate. For example, the axes are not labeled in detail but it doesn't take much effort to understand the message - that Whites and Asians are "over-represented" in "top" colleges while Hispanics and Blacks are "under-represented." The representation gap is "worsening" over time.
I have placed some words in quotes because those words are subjective even though they look like they are supported by data analysis. This is a good instance demonstrating that data analysis cannot be separated from the assumptions underlying it.
A key issue is the definition of this representation gap. The gap is defined using an underlying reference level, which is the ethnic composition of the aggregate U.S. population. Under this perspective, a college is at fault if the proportion of Whites or Asians in its student body is higher than the respective proportions in the population at large.
But the U.S. population is not an appropriate reference level. A more relevant reference level might be all college applicants or all college-age Americans.
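To make the dependence on the reference level concrete, here is a minimal sketch. Every number in it is made up for illustration and none comes from the article; the function name `representation_gap` is also just a label chosen here. It takes one group's share of a college's student body and subtracts that group's share of three candidate reference populations.

```python
# All numbers below are made up for illustration; none come from the article.

def representation_gap(enrolled_share, reference_share):
    """Gap in percentage points: enrolled share minus reference share."""
    return enrolled_share - reference_share

# Hypothetical: one group's share of a college's student body, in percent.
enrolled_share = 15.0

# Hypothetical: the same group's share of three candidate reference
# populations, in percent.
reference_shares = {
    "total U.S. population": 18.0,
    "college-age population": 22.0,
    "college applicants": 16.0,
}

gaps = {
    name: representation_gap(enrolled_share, share)
    for name, share in reference_shares.items()
}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f} percentage points")
```

The enrollment figure never changes, yet the size of the "gap" depends entirely on which population it is measured against.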
The authors actually switch the reference level in the first few paragraphs. Right after the chart showing proportions of the aggregate population, they show a different set of charts showing proportions of the college-age population. For example:
[The terms "elite" schools and "top" schools are not explained in the article.]
Which schools are included in the list of "top" colleges affects the conclusion. The detailed charts shown in the article itself show a lot of diversity amongst the schools. For example:
Stanford and Berkeley are both located in the Bay Area so they are comparable from a geographical perspective. They are not comparable in other ways. One is a private university; the other is public. Berkeley has for a number of years based admissions on more objective metrics like test scores (after a ban on affirmative action was imposed) while Stanford uses more "holistic" criteria.
There are clear differences in ethnic composition by geography. For example, this analysis of 45 major cities by Priceonomics shows the proportion of Whites ranging from less than 10% to over 50%, the proportion of Asians ranging from 0% to 30%, and the proportion of Hispanics from under 10% to over 60%. These numbers are all based on aggregate population, rather than college-age population, or college applicants.
Aggregate statistics are useless when investigating this issue. This sort of naive analysis is frequently used in political discourse. But naive analysis is naive. In fact, naive analysis can be quite dangerous as it promotes an over-simplified view of the world.
In the case of college admissions, there is another side to it. For a number of years now, some Asian parents have been suing top colleges for "under"-representing Asians. From this perspective, Asian applicants are unfairly treated: the admission rate of students with comparable credentials should be roughly equal across ethnicities, and evidence seems to show that Asians of comparable credentials have a lower admission rate than other ethnicities.
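The difference between comparing aggregate rates and comparing applicants of similar credentials can be sketched with a toy calculation. The counts below are entirely made up (they are not data from any lawsuit, college, or study), and the group and band labels are placeholders. They are chosen so that one group has the higher overall admission rate but the lower rate within every credential band.

```python
# Made-up applicant counts, for illustration only.
# Keys: (group, credential band) -> (applicants, admitted)
counts = {
    ("Group A", "high"): (800, 160),
    ("Group A", "mid"): (200, 10),
    ("Group B", "high"): (200, 60),
    ("Group B", "mid"): (800, 80),
}

# Admission rate within each (group, credential band) cell.
by_band = {
    key: admitted / applicants
    for key, (applicants, admitted) in counts.items()
}

# Aggregate admission rate per group, ignoring credentials.
overall = {}
for group in ("Group A", "Group B"):
    applicants = sum(n for (g, band), (n, adm) in counts.items() if g == group)
    admitted = sum(adm for (g, band), (n, adm) in counts.items() if g == group)
    overall[group] = admitted / applicants

for group, value in overall.items():
    print(f"{group} overall: {value:.0%}")
for (group, band), value in by_band.items():
    print(f"{group}, {band} credentials: {value:.0%}")
```

In this toy data, Group A's overall rate is higher than Group B's only because more of its applicants fall in the high-credential band. Within each band, Group A's rate is lower.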
The crux of the analysis is how one defines which groups to compare.
Chapter 3 of Numbers Rule Your World covers statistical concepts useful in thinking about this debate.