When should we use bar charts?
Jul 03, 2024
Two innocent-looking column charts.
These came from an article in Significance magazine (link to paywall) that applies the "difference-in-difference" technique to analyze whether the superstitious act of skipping the number 13 when numbering floors in tall buildings causes an inflation of condo pricing.
The study authors are quite careful in their analysis, recognizing that building managers who decide to relabel the 13th floor as 14th may differ in other systematic ways from those who don't relabel. They use a matching technique to construct comparison groups. The left-side chart shows one effect of matching buildings, which narrowed the gap in average square footage between the relabeled and non-relabeled groups. (Any such gap suggests potential confounding; in a hypothetical, randomized experiment, the average square footage of both groups should be statistically identical.)
The left-side chart features columns that don't start at zero, so the visualization exaggerates the differences. The degree of exaggeration here is tame: about 150 got chopped off at the bottom, which is about 10% of the total height. But why truncate the axis at all?
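To put a rough number on that exaggeration, here is a minimal sketch. The two averages (1500 and 1400) are hypothetical values chosen only to be consistent with "150 is about 10% of the total height"; they are not read off the chart, and `drawn_ratio` is just an illustrative helper name.

```python
def drawn_ratio(a, b, baseline=0.0):
    """Ratio of the drawn column heights when the axis starts at `baseline`."""
    if min(a, b) <= baseline:
        raise ValueError("baseline must lie below both values")
    return (a - baseline) / (b - baseline)


# Hypothetical group averages (square feet), NOT taken from the study.
group_a, group_b = 1500.0, 1400.0

true_ratio = drawn_ratio(group_a, group_b)              # axis starts at zero
truncated_ratio = drawn_ratio(group_a, group_b, 150.0)  # 150 chopped off

print(f"ratio with zero baseline:      {true_ratio:.3f}")
print(f"ratio with truncated baseline: {truncated_ratio:.3f}")
```

Under these assumed numbers the drawn ratio moves only from roughly 1.07 to 1.08, which is why the distortion reads as tame. The same function shows how quickly it grows as the baseline creeps toward the smaller value.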
***
The right-side chart is even more problematic.
This chart shows the effect of matching buildings on the average age of the buildings (measured using the average construction year). Again, the columns don't start at zero. But for this dataset, zero is a meaningless value. Never make a column chart when the zero level has no meaning!
The story is simple: by matching, the average construction year in the relabeled group was brought closer to that in the non-relabeled group. The construction year is an ordinal categorical variable, with integer values. I think a comparison of two histograms would convey the message more clearly, and also provide more information than just the two average values.
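As a rough illustration of the histogram idea, the sketch below bins construction years by decade and prints the two groups side by side as text. The years are made up for the example (the article's underlying data is not available here), and the function names are my own. A real chart would use a plotting library instead of `#` marks.

```python
from collections import Counter


def decade_counts(years):
    """Count buildings per construction decade (e.g. 1994 -> 1990)."""
    return Counter((year // 10) * 10 for year in years)


def print_side_by_side(label_a, years_a, label_b, years_b):
    """Print one row per decade, with one '#' per building in each group."""
    counts_a, counts_b = decade_counts(years_a), decade_counts(years_b)
    for decade in sorted(set(counts_a) | set(counts_b)):
        bar_a = "#" * counts_a[decade]  # Counter returns 0 for missing decades
        bar_b = "#" * counts_b[decade]
        print(f"{decade}s  {label_a}: {bar_a:<8} {label_b}: {bar_b}")


# Hypothetical construction years, NOT from the study.
relabeled = [1988, 1994, 1996, 2001, 2003, 2005, 2007, 2011]
not_relabeled = [1972, 1979, 1985, 1991, 1998, 2002, 2006, 2012]

print_side_by_side("relabeled", relabeled, "not relabeled", not_relabeled)
```

Unlike two averages, the decade-by-decade view shows whether the groups differ in spread or shape, not just in their centers.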