Losing the plot while stacking up the bars
Aug 25, 2023
I came across this chart from an infographic that claims to show which zip codes in the U.S. are the "dirtiest" (link). I won't go into the data analysis in this post - it's the usual "open data" style analysis that takes whatever data the analysts could find (in this case, 311 calls) and makes some hay out of it.
It's amazing how often such analyses land on the Top N, Bottom N table. Top/Bottom N is euphemistically called "insights". But "insights" should answer at least one of the following questions: Where are these zip codes? Why does 11216 have the highest rate of complaints while 11040 has the lowest? What measures can be taken to make the city cleaner?
***
The basic form chosen for this graphic is the bar chart. The data concerns the number of complaints per 100,000 people (about sanitation - they didn't disclose how they classified a complaint as about sanitation).
To mitigate the "boredom" of bar charts, the designer made the edges of the bars squiggly, and added icons of items found in trash inside the bars. These are thankfully not too intrusive.
Why are all the data printed on the chart? Try mentally wiping the data labels, and you'll understand why the designer did it.
If readers look at data labels rather than the bars, then the data visualization surely has failed. I'd prefer to use an axis.
If you spend a few more minutes on the chart, you may notice the gray parts. This is not a simple bar chart but a stacked bar chart. In effect, every bar is referenced to the first bar, which shows the maximum number of complaints per 100K people. For example, zip code 10474 has about 90% of the complaints experienced in zip code 11216, the "dirtiest" place in New York.
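The referencing described above is just a division by the largest value. Here is a minimal Python sketch of it; the function name and the example rates are made up for illustration and are not taken from the infographic.

```python
def fill_fractions(rates):
    """Return each bar's colored share when the full bar length equals the largest rate."""
    reference = max(rates.values())
    return {label: value / reference for label, value in rates.items()}


# Illustrative numbers only (hypothetical labels and rates, not from the infographic).
example_rates = {"zip_A": 2000.0, "zip_B": 1800.0, "zip_C": 500.0}
fractions = fill_fractions(example_rates)

for label, share in fractions.items():
    # The gray part of each bar is whatever is left after the colored part.
    print(f"{label}: colored {share:.0%}, gray {1 - share:.0%}")
```

Under this scheme the top bar is always fully colored, and every other bar's colored share reads directly as a percentage of the top value.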
***
The infographic then moves on to Los Angeles, and repeats the Top N/Bottom N presentation:
With this, the plot is lost.
For an inexplicable reason, the dirtiest zip code in LA does not occupy the entire length of the bar. The worst zip code here fills out 87% of the bar length, implying that the entire bar represents the value of 34,978 complaints per 100K people. How did the designer decide on this number?
As a result, every other value is referenced to 34,978 and not to the rate of complaints in the dirtiest zip code!
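To see what this does to the comparison, here is a rough Python sketch. The 87% fill and the 34,978 full-bar value come from the discussion above; the top rate is back-calculated from those two numbers rather than read off the chart, and the 60% fill for a second zip code is purely hypothetical.

```python
FULL_BAR = 34_978  # implied value of a completely filled bar (from the text above)
TOP_FILL = 0.87    # share of the bar filled by the dirtiest LA zip code

# Back-calculated rate of the dirtiest zip code, roughly 30,431 per 100K.
top_rate = FULL_BAR * TOP_FILL


def share_of_top(fill, top_fill=TOP_FILL):
    """Convert a fill share of the full bar into a share of the top zip code's rate."""
    return fill / top_fill


# Hypothetical second zip code whose bar is 60% filled.
other_fill = 0.60
print(f"top rate is about {top_rate:,.0f} per 100K")
print(f"a {other_fill:.0%} fill is really {share_of_top(other_fill):.0%} of the top rate")
```

Every fill share has to be divided by 0.87 before it can be read as a percentage of the dirtiest zip code, which is a step no reader will perform in their head.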
***
The infographic eventually covers Houston. Here are the dirtiest two zip codes in Houston:
How does one interpret the orange section of the second bar? The original intention is for us to see that this zip code is about 80% as dirty as the dirtiest zip code. However, the full length of the bar here does not represent the dirtiest zip code.
***
We also get a hint as to why this entire analysis is problematic. The values in LA are far larger than those in NY, about 4 times higher at the top of the table. Is LA really that much dirtier than NY? Or perhaps the data have not been properly aligned between cities?
P.S. [8-26-2023] Added link to the infographic.