Dan at Eye Heart New York has a fantastic post relating to the recent release of restaurant health inspection data by New York City. This has caused a furor among the restaurant owners because they are now required to wear their A/B/C badges front and center. Dan collected some data (which he also posted), made some charts, and reported some interesting insights.
Here is an overview chart that shows the distribution of scores (the higher the score, the lower the grade). He called it a "scatter plot" but it is really a histogram where the bucket size is 1 except for the rightmost bucket.
I like the use of green, yellow and red colors to indicate (without words) the conversion scale from scores (violation points) to grades (A/B/C). The legend "Count" is an Excel monstrosity. I'd have used a bucket size of at least 5, which would smooth out the gyrations in the green zone.
A more typical way to summarize numeric data in groups is Tukey's boxplot, as shown below.
I use Dan's raw data on this chart. 1 = A, 2 = B, 3 = C. What is group 4?
It turns out Dan has removed this group from all of his analysis. A little research shows that group 4 are restaurants that have been closed by the Dept of Health. Interestingly, the scores of these restaurants are spread widely so the DOH appears to be closing restaurants not just for health violations. (In the rest of this post, I have removed group 4.)
For those not familiar with box plots, the box contains the middle 50% of the data (in this case, the scores of the middle half of the restaurants in the respective group); the line inside the box is the median score; the dots above (or below, though nonexistent here) the vertical lines are outliers. As Dan pointed out, group C has lots of outliers on the high end of the score.
Just for fun, I pulled the violations of the highest scoring restaurant (111 violation points). What I find intriguing is the huge fluctuation in scores over the last 5 inspections. Does this happen to other restaurants too? What does that say about the grading system?
Next, Dan then attempted to address the questions: did scores vary across the 5 boroughs? and did scores vary across cuisine groups? This is the concept covered in Chapter 1 of my book: always look at the variation around averages, that's where the most interesting stuff is.
He calculated the means and standard deviations of different subgroups. It is simpler to visualize the data, again using boxplots.
Here's one dealing with boroughs, and it is clear that there is not much to pick between them. You could possibly say Staten Island is better than the other 4 boroughs.
Here's one dealing with cuisine groups, using Dan's definitions.
The order of the cuisine groups is by median score from lowest on the left to highest on the right. Again, there is no drastic difference. It is certainly not the case that Asian/Latin American restaurants are worse than say European or American ones.
About half of the restaurants under desserts, drinks, misc., african, and others received As while a bit less than half of the other cuisine groups got As. Some of the cuisine groups had few egregious violators (African, Middle East) - but this data is perhaps skewed by the removal of the "closed" restaurants.
One shortcoming of the traditional boxplot is the omission of how large each group is. For groups that are too small, it is difficult to draw any statistical conclusions. We know from Dan's table, for instance, that there were only 17 restaurants classified as "African".
(Unfortunately, Excel does not have built-in capability for generating boxplots.)