Some behind-the-scenes comments on my recent article about New York's restaurant inspection grades, which appeared on FiveThirtyEight this Tuesday.
The Nature of Ratings
This article is about the ratings of things. I devoted a considerable number of pages to this topic in Numbersense (link); Chapter 1 is all about the US News ranking of schools. A few key points are:
- All rating schemes are completely subjective.
- There is no "correct" rating scheme, so no one can prove that their scheme is better than anyone else's.
- A good rating scheme is one that has popular acceptance. If people don't trust a rating scheme, it won't be used. (This is a variant of George Box's quote: "all models are wrong, but some are useful".)
- Think of a rating scheme as a way to impose a structure on unwieldy data. It represents a point of view.
- All rating schemes will be gamed to death, assuming the formulae are made public.
Based on that, you can expect that my goal in writing the 538 article is not to praise or damn the city's health rating scheme. My intention is to describe how the rating scheme works based on the observed outcomes, and to give readers the information to judge whether they like the scheme or not.
The restaurant grade dataset is an example of OCCAM data. It is Observational, it has no Controls, it has seemingly all the data (i.e. Complete), it will be Adapted for other uses and will be Merged with other data sets to generate "insights". In my article, I did not do A or M.
Hidden Biases in Observational Data
Each month (or week; I have not verified the cadence), the department puts up a dataset on the Open Data website. Only one dataset is available at a time, and the most recent copy replaces the previous one. The size of the dataset therefore expands over time.
Anyone who analyzes grade data up to the most recent few months is in for a nasty surprise. As the chart on the right shows, the proportion of grades that are not A, B, or C (labeled O, in gray) spikes to roughly 10 times the normal level during the last two months. This chart is for an August dataset, but it is not an anomaly: it is an accurate description of the ongoing reality.
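You can see this spike for yourself by bucketing everything outside A/B/C into an "other" category and computing its monthly share. This is a toy sketch with invented column names and data, not the actual dataset or my processing script:

```python
import pandas as pd

# Hypothetical mini-version of the inspection file; the real dataset's
# column names and grade codes may differ.
df = pd.DataFrame({
    "inspection_month": ["2014-05", "2014-05", "2014-06", "2014-07",
                         "2014-07", "2014-08", "2014-08", "2014-08"],
    "grade": ["A", "B", "A", "A", "Z", "P", "Z", "A"],
})

# Collapse anything that is not A, B, or C into an "O" (other) bucket,
# then compute the monthly share of O grades.
df["bucket"] = df["grade"].where(df["grade"].isin(["A", "B", "C"]), "O")
o_share = (df.groupby("inspection_month")["bucket"]
             .apply(lambda g: (g == "O").mean()))
print(o_share)
```

On the real file, the last two months of this series are where the O share jumps.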
If a restaurant is given a B or C on its initial inspection, it has the right to go through a reinspection and arbitration process. During this time, the restaurant is allowed to display the "Grade Pending" sign. It appears that it can take up to four months for most of the B- or C-graded restaurants to finish this process. Over this period, many of the pending grades will flip to an A, B, or C. The chance that they flip to B or C is much higher than for the average restaurant (i.e., one not known to have a grade pending).
Indeed, the proportion of As in the most recent two months is vastly biased upwards as a result of the lengthy reinspection process.
For this reason, I removed the last two months from my analysis.
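In pandas terms, that cutoff is a small filter. This is a minimal sketch under assumed column names, not my actual processing code:

```python
import pandas as pd

# Toy stand-in for the inspection file; column names are made up.
df = pd.DataFrame({
    "inspection_date": pd.to_datetime(
        ["2014-03-15", "2014-05-02", "2014-07-20", "2014-08-01"]),
    "grade": ["A", "B", "P", "P"],
})

# Treat the newest inspection date as "now", and discard everything
# from the last two calendar months, where Grade Pending dominates.
cutoff = df["inspection_date"].max() - pd.DateOffset(months=2)
stable = df[df["inspection_date"] <= cutoff]
```

The `stable` frame keeps only inspections old enough for the reinspection process to have settled.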
How might this bias affect your analysis?
If you drop all Pending grades from your analysis (while retaining the A, B, and C grades), you have created an artificial upward trend in A grades in the last two months.
If you keep the last available grade for each restaurant, you have not escaped the problem at all. In fact, you introduce yet another complication: the B- and C-graded restaurants carry older inspection dates than the A-graded restaurants. Meanwhile, those Pending grades are still dropped.
If you automatically port this data to a mapping tool, or similar, you are displaying the biased data, and unknowing users are misled. In fact, the visualization can no longer be interpreted reliably.
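The first of these pitfalls is easy to demonstrate on toy data: when the pending cases (which skew toward eventual Bs and Cs) are silently dropped, the A share in recent months jumps for purely mechanical reasons. All names and numbers below are invented for illustration:

```python
import pandas as pd

# Toy data: in the recent month, the would-be Bs and Cs are still
# "Grade Pending" (P); in the settled month, all grades are final.
df = pd.DataFrame({
    "month": ["2014-06"] * 4 + ["2014-08"] * 4,
    "grade": ["A", "A", "B", "C",     # settled month
              "A", "A", "P", "P"],    # recent month: Bs/Cs still pending
})

# Naive approach: drop Pending, keep only A/B/C, then compute A share.
kept = df[df["grade"].isin(["A", "B", "C"])]
a_share = (kept.groupby("month")["grade"]
               .apply(lambda g: (g == "A").mean()))
```

Here the A share "improves" from 50% to 100% between the two months, even though nothing about restaurant hygiene changed.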
IMPORTANT NOTE: The data is NOT WRONG. Data cleaning/pre-processing does not just mean finding bad data. Much of what statisticians do when they explore the data is to identify biases or other tricky features.
The Nature of Statistical Analysis
[Captain Hindsight here.] Of course, I didn't know or guess that the Grade Pending bias would be a problem. I did the first analysis of the data using a July dataset, and by the time I was drafting the article for FiveThirtyEight, it was already August, so I "refreshed" the analysis with the latest dataset. That's when I noticed some discrepancies, which led me to the Grade Pending issue.
This is the norm in statistical analysis. Every time you sit down to write something up, you notice additional nuances or nits. Sometimes the problem is severe enough that you have to re-run everything. Other times, you just decide to gloss over it and move on.