Graph workflow and defaults wreak havoc
May 12, 2023
For the past week or 10 days, every time I visited one news site, it insisted on showing me an article about precipitation in North Platte. It's baiting me to write a post about this lamentable bar chart (link):
This chart got problems, and the problems start with the tooling, which dictates a workflow.
I imagine what the chart designer had to deal with.
For a bar chart, the tool requires one data series to be numeric, and the other to be categorical. A four-digit year is a number, which can be treated either as numeric or categorical. In most cases, and by default, numbers are considered numeric. To make this chart, the user asked the tool to treat years as categorical.
Many tools treat categories as distinct entities ("nominal"), mapping each category to a distinct color. So they have 11 colors for 11 years, which is surely excessive.
This happens because the year data is not truly categorical. These eleven years were picked based on the amount of rainfall. There isn't a single year with two values, it's not even possible. The years are just irregularly spaced indices. Nevertheless, the tool misbehaves if the year data are regarded as numeric. (It automatically selects a time-series line chart, because someone's data visualization flowchart says so.) Mis-specification in order to trick the tool has consequences.
The designer's intention is to compare the current year 2023 to the driest years in history. This is obvious from the subtitle in which 2023 is isolated and its purple color is foregrounded.
How unfortunate then that among the 11 colors, this tool grabbed 4 variations of purple! I like to think that the designer wanted to keep 2023 purple, and turn the other bars gray -- but the tool thwarted this effort.
The tool does other offensive things. By default, it makes a legend for categorical data. I like the placement of the legend right beneath the title, a recognition that on most charts, the reader must look at the legend first to comprehend what's on the chart.
Not so in this case. The legend is entirely redundant. Removing the legend does not affect our cognition one bit. That's because the colors encode nothing.
Worse, the legend sows confusion because it presents the same set of years in chronological order while the bars below are sorted by amount of precipitation: thus, the order of colors in the legend differs from that in the bar chart.
I can imagine the frustration of the designer who finds out that the tool offers no option to delete the legend. (I don't know this particular tool but I have encountered tools that are rigid in this manner.)
Something else went wrong. What's the variable being plotted on the numeric (horizontal) axis?
The answer is inches of rainfall but the answer is actually not found anywhere on the chart. How is it possible that a graphing tool does not indicate the variables being plotted?
I imagine the workflow like this: the tool by default puts an axis label which uses the name of the column that holds the data. That column may have a name that is not reader-friendly, e.g. PRECIP. The designer edits the name to "Rainfall in inches". Being a fan of the Economist graphics style, they move the axis label to the chart title area.
The designer now works the chart title. The title is made to spell out the story, which is that North Platte is experiencing a historically dry year. Instead of mentioning rainfall, the new title emphasizes the lack thereof.
The individual steps of this workflow make a lot of sense. It's great that the title is informative, and tells the story. It's great that the axis label was fixed to describe rainfall in words not database-speak. But the end result is a confusing mess.
The reader must now infer that the values being plotted are inches of rainfall.
Further, the tool also imposes a default sorting of the bars. The bars run from longest to shortest, in this case, the longest bar has the most rainfall. After reading the title, our expectation is to find data on the Top 11 driest years, from the driest of the driest to the least dry of the driest. But what we encounter is the opposite order.
Most graphics software behaves like this as they are plotting the ranks of the categories with the driest being rank 1, counting up. Because the vertical axis moves upwards from zero, the top-ranked item ends up at the bottom of the chart.
Moving now from the V corner to the D corner of the Trifecta checkup (link), I can't end this post without pointing out that the comparisons shown on the chart don't work. It's the first few months of 2023 versus the full years of the others.
The fix is to plot the same number of months for all years. This can be done in two ways: find the partial year data for the historical years, or project the 2023 data for the full year.
(If the rainy season is already over, then the chart will look exactly the same at the end of 2023 as it is now. Then, I'd just add a note to explain this.)
Here is a version of the chart after doing away with unhelpful default settings: