An interactive map that mostly works, except for the color scale

A reader pointed me to this piece of data journalism by the folks at FiveThirtyEight (link). The project examines the impact of potential bans on abortion clinics in various U.S. states, and how those bans affect women who want abortions.

The key data visualization is an interactive map. The default map shows the current state of affairs before certain states pass abortion bans.

[Image: FiveThirtyEight abortion access map, default view]

I have highlighted Coconino county in Arizona. The nearest clinic accessible to Coconino residents is in nearby Maricopa county, as seen on the map. The travel distance is about 162 miles. This county is given a purplish-green color, which means the 162-mile distance is considered a long distance in the context of the whole country, and the clinic in Maricopa has middling capacity.

Hovering over each county presents the same information about where women go to get abortions today.

Next, the designer presents a series of simulations.

[Image: FiveThirtyEight abortion map, state selector buttons]

By pressing one of the state buttons, the reader can explore what happens if that state decided to ban abortion clinics. Naturally, we'd expect the counties of that state to be most impacted by the ban.

Here is a close-up of Coconino county after I pressed the Arizona button:

[Image: Coconino county close-up, Arizona ban scenario]


Instead of going to Maricopa county, the women are expected to cross the state line and use the clinic in Clark county, Nevada.

In general, the colors within Arizona are darker, which means that the women either have to travel farther or have to patronize more crowded clinics. Darker is worse.

This is what the map looks like if I light up all the boxes, i.e. the states deemed by FiveThirtyEight as having a chance of enacting abortion bans.

[Image: FiveThirtyEight abortion map, all potential bans enacted]

All in all, I think this dataviz project has many virtues. It addresses a pressing and important issue relevant to many people. The interactive components are well designed and actually useful. Legends and annotations pop up as readers hover over the map. Lots of calculations have been performed to help answer the questions of how much farther someone has to travel, and how much more congested the facility would be.

***

Nevertheless, the blog reader who told me about this project dislikes the section called "How to Read This Map".

[Image: "How to Read This Map" section with three legend grids]

I agree that this color legend is difficult.

I find the three grids confusing. The first grid tells me that the first column is green, and that darker green represents longer travel to the clinic. The second grid tells me that the first row is pink, and that darker pink indicates more congested clinics. Those two are not hard.

The third grid is hard to reconcile with the rest. It appears to tell me that the cells along the diagonal are purple, and that darker purple indicates high values on both metrics.

I'm juggling those three thoughts in my head, trying to reconcile them. When I read the map below and see a county's specific color, say medium purple, I want to know what it means without having to refer back to the color legend. It's a fail if I keep having to look up the legend.

And then the real problem with the "How to Read This Map" section rears its head. Up above, I clicked on Arizona for no reason other than that it's the first button on the list. I hovered over Coconino because it's one of the largest counties in Arizona. Here is a close-up of what I see:

[Image: Coconino county close-up, no ban scenario]

Are you noticing what the problem is? The color of this county is a purplish-green mixture. It's not any of the greens, reds, or purples shown in the "How to Read This Map" section! So, I had to find the real color legend, which is elsewhere on the map. This one:

[Image: FiveThirtyEight abortion map, full 3x3 color legend]

The color for Coconino is the top-middle cell, which happens to be one of the two cells missing from the "How to Read This Map" section. The designer correctly sensed the difficulty of this complicated, two-dimensional legend and offered help, but I feel the effort hasn't paid off.
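
For readers wondering where those in-between mixtures come from, here's a rough sketch of how a 3x3 bivariate legend can be built by blending two color ramps. The specific colors and the multiplicative blend below are my own illustrative choices, not FiveThirtyEight's actual palette.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical endpoint colors -- not FiveThirtyEight's actual palette.
white = np.array([1.0, 1.0, 1.0])
green = np.array([0.0, 0.6, 0.3])   # deeper green = longer travel distance
pink  = np.array([0.9, 0.2, 0.5])   # deeper pink = more congested clinic

levels = np.linspace(0, 1, 3)         # low / medium / high on each metric
grid = np.zeros((3, 3, 3))
for i, d in enumerate(levels):        # rows: travel distance
    for j, c in enumerate(levels):    # columns: congestion
        ramp_d = white + d * (green - white)
        ramp_c = white + c * (pink - white)
        grid[i, j] = ramp_d * ramp_c  # multiplicative blend of the two ramps

plt.imshow(grid, origin="lower")
plt.xlabel("clinic congestion →")
plt.ylabel("travel distance →")
plt.title("3x3 bivariate legend (illustrative blend)")
plt.show()
```

The off-diagonal cells of such a grid are exactly the "purplish-green mixtures" that the simplified "How to Read This Map" section leaves out.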

***

For this graphic, I think they can simplify the legend and make it about congestion only, since the distance dimension is already captured by the lines that show up on hovering. This alternative design does make the color scale one-dimensional, but it is less baffling.

Another idea is to convert distance into travel time and congestion into service time, so that the two can be summed to yield a unidimensional color scale. Quite a bit more analytical work must be done to turn congestion into service time.
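
To make the idea concrete, here's a minimal sketch of such a single "total hours" number. The driving speed, base service time, and the congestion-to-wait conversion are all my own assumptions, and the county values are placeholders rather than FiveThirtyEight's data.

```python
import pandas as pd

# Placeholder inputs -- not FiveThirtyEight's actual data.
counties = pd.DataFrame({
    "county": ["Coconino", "Other county"],
    "distance_miles": [162, 15],        # distance to the nearest open clinic
    "congestion": [0.6, 0.9],           # demand / capacity at that clinic
})

AVG_SPEED_MPH = 55       # assumed driving speed
BASE_SERVICE_HRS = 2.0   # assumed time at an uncongested clinic

counties["travel_hrs"] = counties["distance_miles"] / AVG_SPEED_MPH
# crude assumption: waiting time blows up as the clinic approaches capacity
counties["service_hrs"] = BASE_SERVICE_HRS / (1 - counties["congestion"])

# one number per county can then drive a one-dimensional color scale
counties["total_hours"] = counties["travel_hrs"] + counties["service_hrs"]
print(counties[["county", "total_hours"]])
```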

***

I have a few comments on the analytics behind the dataviz; I'll put them on the book blog in the next day or two.

Some chart designs bring out more information than others

I forgot where I found this chart, but here it is:

[Image: NBWA Beer Purchasers Index column chart]

The designer realizes the flaw of the design, which is why the number 50 is placed in a red box, and another big red box is placed right in our faces, telling us that any number above 50 represents growth while any number below 50 represents contraction.

The real culprit is the column chart design, which treats zero, not 50, as the baseline. Thus, the solution is to move away from the column chart form.

There are many possibilities. Here's one using the Bumps chart form:

[Image: Junk Charts redo of the NBWA Beer Purchasers Index as a Bumps chart]
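
For anyone who wants to try this at home, here's a bare-bones sketch of a Bumps (slope) chart with the meaningful 50 reference line drawn in. The segment values below are illustrative placeholders, not the actual NBWA figures.

```python
import matplotlib.pyplot as plt

# Illustrative placeholder values, not the actual NBWA index data.
index_values = {
    "Premium Regular": (45, 44),
    "Craft":           (47, 43),
    "Below Premium":   (46, 48),
}
years = [2022, 2023]

fig, ax = plt.subplots()
for segment, (v1, v2) in index_values.items():
    ax.plot(years, [v1, v2], marker="o")
    ax.annotate(segment, (years[1], v2), xytext=(5, 0),
                textcoords="offset points", va="center")

# the meaningful reference line is 50 (growth above, contraction below), not zero
ax.axhline(50, linestyle="--", color="gray")
ax.set_xticks(years)
ax.set_ylabel("Beer Purchasers Index")
ax.set_title("Bumps chart sketch with a 50 baseline")
plt.show()
```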

There are several interesting insights buried in that column chart!

First, we learn that almost all segments were contracting in both years.

Next, there is some clustering of segments. The Premium Regular and Cider segments were moving in sync. Craft, FMB/Seltzer and Below Premium were similar in 2022; intriguingly, Below Premium diverged from the other two segments.

In fact, Below Premium has distinguished itself as the only segment that experienced an improved index relative to 2022!


Finding the story in complex datasets

In CT Mirror's feature about Connecticut, which I wrote about in the previous post, there is one graphic that did not rise to the same level as the others.

[Image: CT Mirror chart of high school district graduation rates]

This section deals with graduation rates in the state's high school districts. The above chart focuses on exactly five districts. The line charts are organized in a stack. No year labels are provided. The time window is the 11 years from 2010 to 2021. The column of numbers shows the difference in graduation rates over the entire time window.

The five lines look basically the same, if we ignore what look to be noisy year-to-year fluctuations. This is due to the weird aspect ratio imposed by stacking.

Why are those five districts chosen? Upon investigation, we learn that these are the five districts with the biggest improvement in graduation rates during the 11-year time window.

The same five schools also had some of the lowest graduation rates at the start of the analysis window (2010). This must be so: if a school graduated 90% of its class in 2010, it would be mathematically impossible for it to attain a 35-percentage-point improvement! This is an unsatisfying feature of the dataviz.

***

In preparing an alternative version, I start by imagining how readers might want to use a visualization of this dataset. I assume that readers may have certain schools they are particularly invested in, and want to see their graduation performance over these 11 years.

How does having the entire dataset help? For one thing, it provides context. What kind of context is relevant? As discussed above, it's futile to compare a school at the top of the ranking to one that is near the bottom. So I created groups of schools. Each school is compared to other schools that had comparable graduation rates at the start of the analysis period.
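
Here's a sketch of that grouping step. The band edges and the toy data below are my own assumptions (only Amistad's 58% start and roughly 35-point gain come from the article), not CT Mirror's actual dataset.

```python
import pandas as pd

# Placeholder districts and rates; only Amistad's figures echo the article.
rates = pd.DataFrame(
    {"rate_2010": [58, 71, 83, 87, 94],
     "rate_2021": [93, 85, 90, 96, 92]},
    index=["Amistad", "District B", "District C", "District D", "District E"],
)

# band each district by its starting (2010) graduation rate, so that it is
# only compared against peers that started from a similar level
bands = pd.cut(rates["rate_2010"],
               bins=[0, 50, 75, 85, 90, 100],
               labels=["<50%", "50-74%", "75-84%", "85-89%", "90%+"])

for band, group in rates.groupby(bands, observed=True):
    print(band, "->", list(group.index))
    # each band becomes one small-multiple panel in the redone charts
```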

Amistad School District, which takes pole position in the original dataviz, graduated only 58% of its pupils in 2010 but vastly improved its graduation rate, by 35 percentage points, over the period. In the chart below (left panel), I plotted all of the schools that had graduation rates between 50 and 74% in 2010. The chart shows that while Amistad is a standout, almost all schools in this group experienced steady improvements. (Whether this phenomenon represents true improvement, or just grade inflation, we can't tell from this dataset alone.)

[Image: Junk Charts redo of graduation rates, districts grouped by 2010 rate (lower bands)]

The right panel shows the group of schools with the next higher level of graduation rates in 2010. Almost all of the schools in this group also increased their graduation rates, though the rate of improvement is lower than in the previous group.

The next set of charts shows school districts that had already achieved excellent graduation rates (over 85%) by 2010. The most interesting group consists of those with 85-89% rates in 2010. Their performance in 2021 is the most unpredictable of all the school groups: the majority of districts did even better, while others regressed.

[Image: Junk Charts redo of graduation rates, districts grouped by 2010 rate (higher bands)]

Overall, there is less variability than I'd expect in the top two school groups. They generally appear to have been able to raise or maintain their already-high graduation rates. (Note that the scale of each chart is different, and many of the lines in the second set of charts move within a few percentage points.)

One more note about the charts: the trend lines are "smoothed" to focus on the trends rather than the year-to-year variability. Because of smoothing, there is some awkward-looking imprecision, e.g. the end-to-end differences read from the curves don't quite match the observed differences in the data. These discrepancies could easily be fixed if these charts were to be published.
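
Incidentally, here's one way to produce such smoothed trend lines. The original charts don't say which smoother was used, so LOWESS here is just a reasonable stand-in, and the district's rates are made up for illustration.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# One district's rates, made up for illustration -- not real data.
years = np.arange(2010, 2022)
rates = np.array([58, 60, 59, 63, 66, 65, 70, 72, 71, 76, 80, 79], dtype=float)

smoothed = lowess(rates, years, frac=0.6, return_sorted=False)
# plotting `smoothed` against `years` shows the trend; note the smoothed
# endpoints need not match the raw 2010 and 2021 values exactly, which is
# the "awkward-looking imprecision" mentioned above
print(np.round(smoothed, 1))
```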


Thoughts on Daniel's fix for dual-axes charts

I've taken a little time to ponder Daniel Z's proposed "fix" for dual-axes charts (link). The example he used is this:

[Image: Daniel Zvinca's example, dual-axes chart mixing a line with columns]

In that long post, Daniel explained why he preferred to mix a line with columns, rather than using the more common dual lines construction: to prevent readers from falsely attributing meaning to crisscrossing lines. There are many issues with dual-axes charts, which I won't repeat in this post; one of their most dissatisfying features is the lack of connection between the two vertical scales, and thus, it's pretty easy to manufacture an image of correlation when it doesn't exist. As shown in this old post, one can expand or restrict one of the vertical axes and shift the line up and down to "match" the other vertical axis.

Daniel's proposed fix retains the dual axes, and he even restores the dual lines construction.

[Image: Daniel Zvinca's fix, dual lines with estimated revenues on a transformed axis]

How is this chart different from the typical dual-axes chart, like the first graph in this post?

Recall that the problem with using two axes is that the designer can squeeze, expand or shift one of the axes in any number of ways to manufacture many realities. What Daniel effectively did here is to select one specific way to transform the "New Customers" axis (shown in gray).

His idea is to run a simple linear regression between the two time series. Think of fitting a "trendline" in Excel between Revenues and New Customers. Then, use the resulting regression equation to compute "estimated" revenues based on the New Customers series. The coefficients of this regression equation then determine the degree of squeezing/expansion and shifting applied to the New Customers axis.
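
In code, the transformation is a one-liner. The two series below are placeholders standing in for Daniel's Revenues and New Customers data.

```python
import numpy as np

# Placeholder series, not the data behind Daniel's chart.
new_customers = np.array([120, 135, 150, 160, 155, 170, 180], dtype=float)
revenues      = np.array([10.2, 11.0, 12.5, 12.9, 12.4, 13.8, 14.6])  # e.g. $M

# fit Revenues = a * NewCustomers + b (the Excel "trendline")
a, b = np.polyfit(new_customers, revenues, deg=1)

# the gray line plots these values, so it can be read off the Revenue axis;
# equivalently, a and b define how much the New Customers axis is
# squeezed/expanded (a) and shifted (b)
estimated_revenues = a * new_customers + b
print(f"slope = {a:.4f}, intercept = {b:.2f}")
```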

The main advantage of this "fix" is to eliminate the freedom to manufacture multiple realities. There is exactly one way to transform the New Customers axis.

The chart itself takes a bit of time to get used to. The actual values plotted in the gray line are "estimated revenues" from the regression model; thus, the blue axis values on the left apply to the gray line as well. The gray axis shows the corresponding customer values. Because we performed a linear fit, each value of estimated revenues corresponds to a particular customer value. The gray line is thus a squeezed/expanded/shifted replica of the New Customers line (shown in orange in the first graph). The gray line can then be interpreted on two connected scales, and both the blue and gray labels are relevant.

***

What are we staring at?

The blue line shows the observed revenues while the gray line displays the estimated revenues (predicted by the regression line). Thus, the vertical gaps between the two lines are the "residuals" of the regression model, i.e. the estimation errors. If you have studied Statistics 101, you may remember that the residuals are the components that make up R-squared, which measures the quality of fit of the regression model. R-squared is the square of r, which stands for the correlation between Customers and the observed revenues. Thus, the higher the (linear) correlation between the two time series, the higher the R-squared, the better the regression fit, and the smaller the gaps between the two lines.
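
A quick numerical check, continuing the placeholder series from the sketch above, confirms that the R-squared computed from those vertical gaps equals the squared correlation between the two series.

```python
import numpy as np

# Continuing the placeholder series from the earlier sketch.
new_customers = np.array([120, 135, 150, 160, 155, 170, 180], dtype=float)
revenues      = np.array([10.2, 11.0, 12.5, 12.9, 12.4, 13.8, 14.6])

a, b = np.polyfit(new_customers, revenues, deg=1)
estimated = a * new_customers + b

residuals = revenues - estimated                 # the vertical gaps on the chart
r_squared = 1 - np.sum(residuals**2) / np.sum((revenues - revenues.mean())**2)
r = np.corrcoef(new_customers, revenues)[0, 1]   # correlation between the series

print(round(r_squared, 4), round(r**2, 4))       # the two values agree
```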

***

There is some value to this chart, although it'd be challenging to explain to someone who has not taken Statistics 101.

While I like that this linear regression approach is "principled", I wonder why this transformation should be preferred to all others. I don't have an answer to this question yet.

***

Daniel's fix reminds me of a different, but very common, chart.

[Image: Forecast vs actual inflation rates chart]

This chart shows actual versus forecasted inflation rates. It has two lines but needs only one axis, since both lines represent inflation rates in the same range.

We can think of the "estimated revenues" line above as forecasted or expected revenues, based on the actual number of new customers. In particular, this forecast is based on a specific model: one that assumes that revenues are linearly related to the number of new customers. The "residuals" are forecasting errors.

In this sense, I think Daniel's solution amounts to rephrasing the question of the chart from "how closely are revenues and new customers correlated?" to "given the trend in new customers, are we over- or under-performing on revenues?"

Instead of using the dual-axes chart with two different scales, I'd prefer to answer the question by showing this expected vs actual revenues chart with one scale.

This does not eliminate the question about the "principle" behind the estimated revenues, but it makes clear that the challenge is to justify why revenues should be a linear function of new customers and no other variables.

Unlike the dual-axes chart, the actual vs forecasted chart is independent of the forecasting method. One can produce forecasted revenues based on a complicated function of new customers, existing customers, and any other factors. A different model just changes the shape of the forecasted revenues line. We still have two comparable lines on one scale.