## Lines, gridlines, reference lines, regression lines, the works

##### Apr 16, 2018

This post is part 2 of an appreciation of the chart project by Google Newslab, advised by Alberto Cairo, on the gender and racial diversity of the newsroom. Part 1 can be read here.

In the previous discussion, I left out the following scatter bubble plot.

This plot is available in two versions, one for gender and one for race. The key question being asked is whether the leadership in the newsroom is more or less diverse than the rest of the staff.

The story appears to be a happy one: in many newsrooms, the leadership roughly reflects the staff in terms of gender distribution (even though both parts of the whole compare disfavorably to the gender ratio in the neighborhoods, as we saw in the previous post.)

***

Unfortunately, there are a few execution problems with this scatter plot.

First, take a look at the vertical axis labels on the right side. The labels inform the leadership axis. The mid-point showing 50-50 (parity) is emphasized with the gray band. Around the mid-point, the labels seem out of place. Typically, when the chart contains gridlines, we expect the labels to sit right around each gridline, either on top or just below the line. Here the labels occupy the middle of the space between successive gridlines. On closer inspection, the labels are correctly affixed, and the gridlines drawn where they are supposed to be. The designer chose to show irregularly spaced labels: from the midpoint, it's a 15% jump on either side, then a 10% jump.

I find this decision confounding. It also seems as if two people have worked on these labels, as there exists two patterns: the first is "X% Leaders are Women", and second is "Y% Female." (Actually, the top and bottom labels are also inconsistent, one using "women" and the other "female".)

The horizontal axis? They left out the labels. Without labels, it is not possible to interpret the chart. Inspecting several conveniently placed data points, I figured that the labels on the six vertical gridlines should be 25%, 35%, ..., 65%, 75%, in essence the same scale as the vertical axis.

Here is the same chart with improved axis labels:

Re-labeling serves up a new issue. The key reference line on this chart isn't the horizontal parity line: it is the 45-degree line, showing that the leadership has the same proprotion of females as the rest of the staff. In the following plot (right side), I added in the 45-degree line. Note that it is positioned awkwardly on top of the grid system. The culprit is the incompatible gridlines.

The solution, as shown below, is to shift the vertical gridlines by 5% so that the 45-degree line bisects every grid cell it touches.

***

Now that we dealt with the purely visual issues, let me get to a statistical issue that's been troubling me. It's about that yellow line. It's supposed to be a regression line that runs through the points.

Does it appear biased downwards to you? It just seems that there are too many dots above and not enough below. The distance of the furthest points above also appears to be larger than that of the distant points below.

How do we know the line is not correct? Notice that the green 45-degree line goes through the point labeled "AVERAGE." That is the "average" newsroom with the average proportion of female staff and the average proportion of leadership staff. Interestingly, the average falls right on the 45-degree line.

In general, the average does not need to hit the 45-degree line. The average, however, does need to hit the regression line! (For a mathematical explanation, see here.)

Note the corresponding chart for racial diversity has it right. The yellow line does pass through the average point here:

***

In practice, how do problems seep into dataviz projects? It's the fact that you don't get to the last chart via a clean, streamlined process but that you pass through a cycle of explore-retrench-synthesize, frequently bouncing ideas between several people, and it's challenging to keep consistency!

And let me repeat my original comment about this project - the key learning here is how they took a complex dataset with many variables, broke it down into multiple parts addressing specific problems, and applied the layering principle to make each part of the project digestible.