Small data sets present graphing challenges

Jun 15, 2011

Felix Salmon, a blogger and foodie, investigated whether a restaurant changes its pricing based on the number of stars it gets from Sam Sifton, the New York Times' food critic. His conclusion is that "price hikes happen all over the place, from the worst-reviewed restaurants to the best." This plot was used in the post.

His message doesn't jump out of his chart. To see it, we have to recognize that the dark green pieces are the ones to focus on, and that what matters is their relative heights within each stacked column. I was also misdirected by the two axis labels: number of stars and number of reviews aren't the primary dimensions. So, I thought one could find a better alternative.

***

These data turn out to be harder to plot than expected. The problem is that the sample size is small, and because of this, the data have ragged edges. We are better at reading patterns from smooth objects.

Here is what I ended up with, a small-multiples chart with grouped columns. I adopted Felix's color scheme although no differentiation of color is really necessary in this version. Relative percentages are plotted instead of the raw number of reviews. Each set of four columns can be viewed as a histogram or probability distribution. (Again, with more samples, the histograms would look smoother, revealing the pattern more clearly.)
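To make the switch from raw counts to relative percentages concrete, here is a minimal Python sketch. The counts and the outcome labels (`raised`, `unchanged`, `lowered`, `closed`) are invented placeholders for illustration, not the data from Felix's post, and `to_percentages` is a hypothetical helper name.

```python
def to_percentages(counts):
    """Convert raw counts per outcome into percentages within one star group."""
    total = sum(counts.values())
    if total == 0:
        # An empty group has no distribution; report zeros rather than divide by zero.
        return {outcome: 0.0 for outcome in counts}
    return {outcome: 100.0 * n / total for outcome, n in counts.items()}


# Hypothetical review counts by star rating and pricing outcome.
# These numbers are made up; they are NOT taken from the original chart.
reviews = {
    "0 stars": {"raised": 1, "unchanged": 2, "lowered": 0, "closed": 1},
    "1 star": {"raised": 3, "unchanged": 4, "lowered": 1, "closed": 0},
    "2 stars": {"raised": 4, "unchanged": 5, "lowered": 1, "closed": 0},
    "3 stars": {"raised": 2, "unchanged": 2, "lowered": 1, "closed": 0},
    "4 stars": {"raised": 2, "unchanged": 0, "lowered": 0, "closed": 0},
}

# Each star group is normalized on its own, so every group sums to 100%.
shares = {stars: to_percentages(counts) for stars, counts in reviews.items()}

for stars, pct in shares.items():
    print(stars, {outcome: round(value, 1) for outcome, value in pct.items()})
```

Because each group is normalized separately, groups with very different numbers of reviews become directly comparable. The flip side is that in a group with only two reviews, a single review moves a column by 50 percentage points, which is where the ragged edges come from.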

I agree with Felix that there is not much correlation between star rating and pricing. However, this truly applies only to the middle three categories. At the edges, there are a couple of observations: all of the 4-star restaurants hiked their prices, while the only restaurant that closed after being reviewed received zero stars.

I'm a fan of annotating charts, so I'd recommend sticking a note on the 4-star column, another note on the single gray column, and a third note bracketing the middle three categories, telling readers that there is nothing to see here.