Getting simple charts right
Feb 06, 2023
Ian K. submitted this chart on Twitter:
The chart comes from a video embedded in this report (link) about Chicago cops leaving their jobs.
Let's start with the basics. This is an example of a simple line chart illustrating a time series of five observations. The vertical axis starts at 10,000 instead of 0. With this choice, the designer focuses attention on the point-to-point changes in values, rather than on each value's distance from zero.
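A quick calculation shows how much a truncated axis can exaggerate a decline. The staffing values below are hypothetical stand-ins, not the actual CPD counts; only the 10,000 axis floor comes from the chart:

```python
# Hypothetical staffing values (not the actual CPD data), with the
# axis floor taken from the chart's truncated vertical axis.
first, last = 13200, 11600
axis_floor = 10000

# True relative change, measured from zero.
true_change = (first - last) / first

# Apparent change when the eye measures plotted height above the
# truncated axis floor rather than above zero.
apparent_change = ((first - axis_floor) - (last - axis_floor)) / (first - axis_floor)

print(f"true decline:     {true_change:.1%}")      # about 12%
print(f"apparent decline: {apparent_change:.1%}")  # about 50%
```

With these assumed numbers, a roughly 12% decline looks like a halving on the page, because the eye compares heights above 10,000, not above zero.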
Every graph has add-ons that assist cognition. On this chart, we have axis labels, gridlines and data labels. Every add-on increases reading time so we should be sparing.
First consider the gridlines. In the following version, I conduct a self-sufficiency test by removing the data labels:
You can see that the last three values present no problems. The first two, especially the first value, are hard to read because the top gridline is missing! The next chart restores the bounding gridline, so you can see the difference that one small detail makes:
Next, let's compare the following versions of the chart. The left one contains data labels without gridlines and axis labels. The right one has the gridlines and axis labels but no data labels.
The left chart prints the entire dataset onto the chart. The reader is, in essence, reading the raw data. That appears to be the intention of the chart designer, as the data labels are set in a large size and placed inside shiny white boxes. The vertical position of these boxes drives the reader's perception, since they catch more of our attention than the dots that actually represent the data.
The right chart highlights the dots and the lines between them. The gridlines are so thick and heavy that they distract rather than assist. This chart presumes that the reader isn't as interested in the precise numbers as in the trend.
As Ian pointed out, one of the biggest problems with this chart is the appearance of even time intervals when all except one of the date values are January. This seemingly innocent detail destroys the chart. Each line segment encodes the change in staffing numbers between consecutive observations. For most of the segments, the metric is year-on-year change, but the last two segments on the right show something else: a 19-month change, followed by a 5-month change.
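One way to make unevenly spaced segments comparable is to normalize each change by its elapsed time. The sketch below uses hypothetical staffing values (only the 11,611 figure is quoted later in the post); the dates match the chart's spacing of three January readings followed by a 19-month and a 5-month gap:

```python
from datetime import date

# Hypothetical observations matching the chart's date spacing.
# Staffing values are illustrative, not the actual CPD numbers,
# except Aug 2022, which the post quotes as 11,611.
observations = [
    (date(2019, 1, 1), 13600),
    (date(2020, 1, 1), 13150),
    (date(2021, 1, 1), 12675),
    (date(2022, 8, 1), 11611),
    (date(2023, 1, 1), 11700),
]

def months_between(d0, d1):
    """Whole-month gap between two dates (day-of-month ignored)."""
    return (d1.year - d0.year) * 12 + (d1.month - d0.month)

# Express every segment as a comparable per-month rate of change.
for (d0, v0), (d1, v1) in zip(observations, observations[1:]):
    gap = months_between(d0, d1)
    print(f"{d0} -> {d1}: {gap:2d} months, {(v1 - v0) / gap:+.0f} per month")
```

Plotting segments on a true time axis, or reporting per-month rates like these, avoids implying that a 19-month change is comparable to a 12-month one.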
I did the following analysis to understand how big a staffing problem CPD faces.
First I restored the January 2022 time value, while shifting the Aug 2022 value to its rightful place on the time axis. Next, I added the dashed brown line, which represents a linear extension of the trend seen between January 2020 and January 2021, before the sudden dip. We don't know what the true January 2022 value is, but the projected value based on the past trend is around 12,200. By August 2022, the projected value is around 11,923, about 300 above the actual value of 11,611. By January 2023, the projected value is almost exactly the same as the actual value.
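The linear extension above can be sketched as a small computation. The January 2020 and January 2021 anchor values here are hypothetical, backed out so the projections land near the figures quoted in the post (~12,200 and ~11,923); they are not the actual CPD counts:

```python
from datetime import date

# Hypothetical anchor points for the pre-dip trend (not actual CPD data),
# chosen to reproduce the projections quoted in the post.
t0, v0 = date(2020, 1, 1), 13150
t1, v1 = date(2021, 1, 1), 12675

def months_between(d0, d1):
    """Whole-month gap between two dates (day-of-month ignored)."""
    return (d1.year - d0.year) * 12 + (d1.month - d0.month)

slope = (v1 - v0) / months_between(t0, t1)  # staffing change per month

def project(d):
    """Linear extension of the Jan 2020 - Jan 2021 trend to date d."""
    return v1 + slope * months_between(t1, d)

print(round(project(date(2022, 1, 1))))  # ~12,200
print(round(project(date(2022, 8, 1))))  # ~11,923; actual was 11,611
```

Under these assumptions, the August 2022 actual sits about 300 below the trend line, consistent with the gap described above.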
This linear trending analysis is likely too simplistic but it offers a baseline to start thinking about what the story is. The long-term trend is still down but the apparent dip in 2022 may not be meaningful.
How about the "break" in the y-axis? Without having values 0-10,000 on the y-axis, changes in the y-value (in this case total CPD Staff) are accentuated, sometimes misleadingly so.
In this case, if you cover up the y-axis and just look at the graph, it appears that CPD staffing is half of what it was at the beginning of the graph, when actually the decrease is around 12%.
I don't think this is technically "wrong," but it is a strategy that chart-makers use to overemphasize differences.
Posted by: James McKee | Feb 07, 2023 at 02:07 PM
Here's a dumb one: https://en.wikipedia.org/wiki/United_States_congressional_apportionment#/media/File:House_Seats_by_State_1789-2020_Census.png
Posted by: Guest | Feb 11, 2023 at 03:26 PM