Lines, gridlines, reference lines, regression lines, the works

This post is part 2 of an appreciation of the chart project by Google Newslab, advised by Alberto Cairo, on the gender and racial diversity of the newsroom. Part 1 can be read here.

In the previous discussion, I left out the following scatter bubble plot.

Goog_newsrooms_gender_2

This plot is available in two versions, one for gender and one for race. The key question being asked is whether the leadership in the newsroom is more or less diverse than the rest of the staff.

The story appears to be a happy one: in many newsrooms, the leadership roughly reflects the staff in terms of gender distribution (even though both parts of the whole compare unfavorably to the gender ratio in the surrounding neighborhoods, as we saw in the previous post.)

***

Unfortunately, there are a few execution problems with this scatter plot.

First, take a look at the vertical axis labels on the right side. These labels belong to the leadership axis. The mid-point showing 50-50 (parity) is emphasized with the gray band. Around the mid-point, the labels seem out of place. Typically, when a chart contains gridlines, we expect the labels to sit right at each gridline, either on top of or just below the line. Here the labels occupy the middle of the space between successive gridlines. On closer inspection, the labels are correctly affixed, and the gridlines are drawn where they are supposed to be. The designer chose to show irregularly spaced labels: from the midpoint, it's a 15% jump on either side, then a 10% jump.

I find this decision confounding. It also seems as if two people worked on these labels, as there exist two patterns: the first is "X% Leaders are Women," and the second is "Y% Female." (The top and bottom labels are also inconsistent, one using "women" and the other "female.")

The horizontal axis? They left out the labels. Without labels, it is not possible to interpret the chart. Inspecting several conveniently placed data points, I figured that the labels on the six vertical gridlines should read 25%, 35%, ..., 65%, 75%, in essence the same scale as the vertical axis.

Here is the same chart with improved axis labels:

Jc_newsroomgender_1

Re-labeling serves up a new issue. The key reference line on this chart isn't the horizontal parity line: it is the 45-degree line, showing that the leadership has the same proportion of females as the rest of the staff. In the following plot (right side), I added the 45-degree line. Note that it sits awkwardly on top of the grid system. The culprit is the incompatible gridlines.

  Jc_newsroomgender_1

The solution, as shown below, is to shift the vertical gridlines by 5% so that the 45-degree line bisects every grid cell it touches.

Jc_newsroomgender_3

***

Now that we've dealt with the purely visual issues, let me get to a statistical issue that's been troubling me. It's about that yellow line, which is supposed to be a regression line running through the points.

Does it appear biased downwards to you? There seem to be too many dots above the line and not enough below, and the furthest points above sit farther from the line than the furthest points below.

How do we know the line is not correct? Notice that the green 45-degree line goes through the point labeled "AVERAGE." That is the "average" newsroom with the average proportion of female staff and the average proportion of leadership staff. Interestingly, the average falls right on the 45-degree line.

In general, the average does not need to hit the 45-degree line. The average, however, does need to hit the regression line! (For a mathematical explanation, see here.)
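This property is easy to verify numerically. Below is a minimal sketch with made-up staff/leadership proportions (not the Newslab data); the ordinary-least-squares line, computed by hand, passes through the point of averages exactly:

```python
import random

# Made-up data standing in for the scatter: staff share (x) and
# leadership share (y) of women in 20 hypothetical newsrooms.
random.seed(0)
x = [random.uniform(0.25, 0.75) for _ in range(20)]
y = [xi + random.uniform(-0.10, 0.10) for xi in x]

# Ordinary least squares computed by hand.
n = len(x)
mx, my = sum(x) / n, sum(y) / n
slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
intercept = my - slope * mx

# Evaluating the fitted line at the average x recovers the average y:
# the regression line must pass through the point of averages.
print(abs((slope * mx + intercept) - my) < 1e-9)  # True
```

This holds no matter what the data look like, which is why a "regression" line missing the AVERAGE point is a red flag.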

Note that the corresponding chart for racial diversity gets this right. The yellow line does pass through the average point here:

Goog_newsrooms_race_2

 ***

In practice, how do problems seep into dataviz projects? You don't get to the final chart via a clean, streamlined process; you pass through cycles of exploring, retrenching, and synthesizing, frequently bouncing ideas among several people, and it's challenging to keep everything consistent!

And let me repeat my original comment about this project - the key learning here is how they took a complex dataset with many variables, broke it down into multiple parts addressing specific problems, and applied the layering principle to make each part of the project digestible.


A look at how the New York Times readers look at the others

Nyt_taxcutmiddleclass

The above chart, when it was unveiled at the end of November last year, got some mileage on my Twitter feed. A reader, Eric N., didn't like it at all, and I think he has a point.

Here are several debatable design decisions.

The chart uses an inverted axis: a tax cut (a negative change) is shown on the right while a tax increase is shown on the left. This type of inversion has gotten others in trouble before, notably in the controversy over the gun deaths chart (link). The green/red color coding signals the polarity, although some will argue this is bad for color-blind readers. The annotation below the axis is probably why I wasn't confused in the first place, but the other charts further down the page do not repeat the annotation, and that's where interpreting -$2,000 as a tax increase becomes unnatural!

The chart does not aggregate the data. It plots 25,000 households with 25,000 points. Because of the variance of the data, it's hard to judge trends. It's easy enough to see that there are more green dots than red but how many more? 10 percent, 20 percent, 40 percent? It's also hard to answer any specific questions, say, about households with a certain range of incomes. There are various ways to aggregate the data, such as heatmaps, histograms, and so on.
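For instance, one could bucket households by income and report the share receiving a cut in each bucket, which answers the "how many more" question directly. Here is a sketch with randomly generated stand-ins for the 25,000 households (the numbers are invented, not the Times's data):

```python
import random

random.seed(1)
# Made-up stand-ins for the 25,000 households: (income, tax change in $);
# a negative change is a tax cut.
households = [(random.uniform(40_000, 180_000), random.gauss(-900, 800))
              for _ in range(25_000)]

# Bucket incomes into $20K bins; count households and cuts per bin.
bins = {}
for income, change in households:
    lo = int(income // 20_000) * 20_000
    total, cuts = bins.get(lo, (0, 0))
    bins[lo] = (total + 1, cuts + (change < 0))

for lo in sorted(bins):
    total, cuts = bins[lo]
    print(f"${lo // 1000}K-{lo // 1000 + 20}K: "
          f"{cuts / total:.0%} get a cut (n={total:,})")
```

A table or bar chart of these shares by income bracket conveys the trend far more precisely than 25,000 overlapping dots.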

For those used to looking at scientific charts, the x- and y-axes are reversed. By convention, we'd have put the income ranges on the horizontal axis and the tax changes (the "outcome" variable) on the vertical axis.

***

The text labels do not describe the data patterns on the chart so much as they offer additional information. To see this, remove the labels as I have done below. Try adding the labels based on what is shown on the chart.

Nyt_taxcutmiddleclass_2

Perhaps it's possible to illustrate those insights with a set of charts.

***

While reading this chart, I kept wondering how those 25,000 households were chosen. This is a sample of households. The methodology is explained in a footnote, which describes the definition of "middle class," but unfortunately they forgot to tell us how the 25,000 households were chosen from all such middle-class households.

Nyt_taxcutmiddleclass_footnote

The decision to omit households with incomes below $40,000 needs more explanation, as it undercuts the household-size adjustment. Also, it's not clear that the impact of the tax bill on households with incomes between $20-40K can be assumed to be the same as for those above $40K.

Are the 25,000 households a simple random sample of all "middle class" households, or were they chosen in some way to represent the relative counts? It would also be useful to know whether the $40K cutoff was applied before or after selecting the 25,000 households.

Ironically, the Times's media kit discloses an affluent readership with a median household income of almost $190K, so it appears that the majority of its readers are not represented in the graphic at all!

 


Excellent visualization of gun violence in American cities

I like the Guardian's feature (undated) on gun violence in American cities a lot.

The following graphic illustrates the situation in Baltimore.

Guardian_gunviolence_baltimore

The designer starts by plotting where the gun homicides occurred in 2015. Then readers are led through an exploration of the key factors that might be associated with the spatial distribution of those homicides.

The blue color measures poverty levels. There is a moderate correlation between high densities of dots (homicides) and deeper blue (poorer areas). The magenta color measures educational attainment and the orange color measures the proportion of black residents. In Baltimore, it appears that race is substantially better at explaining the prevalence of homicides.

This work is exemplary because it transcends description (first map) and explores explanations for the spatial pattern. Because three factors are explored together in a small-multiples layout, readers learn that no single factor can explain everything. In addition, we learn that different factors have different degrees of explanatory power.

Attentive readers will also find that the three factors of poverty, educational attainment and proportion black are mutually correlated. Areas with large black populations also tend to be poorer and less educated.

***

I also like the introductory section in which a little dose of interactivity is used to sequentially present the four maps, now superimposed. It then becomes possible to comprehend the rest quickly.

Guardian_guncrimemaps_stlouis_2

 ***

The top section is less successful as proportions are not easily conveyed via dot density maps.

Guardian_guncrime_map_prop

Dropping the map form helps. Here is a draft of what I have in mind. I pulled some data from online sources at the metropolitan area (MSA) level; the comparison doesn't appear as striking as with the city-level data.

Redo_guardiangundeathsprop

 

 PS. On Twitter, Aliza tells me the article was dated January 9, 2017.


Getting into the head of the chart designer

When I look at this chart (from Business Insider), I try to understand the decisions made by its designer - which things are important to her/him, and which things are less important.

Incomegendergapbystate-both-top-2-map-v2

The chart shows average salaries in the top 2 percent of income earners. The data are split by gender and by state.

First, I notice that the designer chooses to use the map form. This decision suggests that the spatial pattern of top incomes is of top interest to the designer because she/he is willing to accept the map's constraints - namely, the designer loses control of the x and y dimensions, as well as the area and shape of the data containers. For the U.S. state map, there is no elegant solution to the large number of small states problem in the Northeast.

Second, I notice the color choice. The designer provides actual values on the visualization but also groups all state-average incomes into five categories. It's not clear how she/he determines the boundaries of these income brackets. There are many more dark blue states than light blue states in the map for men. Because women's incomes are everywhere lower than men's, the map at the bottom fits all states into two large buckets, plus Connecticut. That women's incomes are lower than men's is a message that does not require breaking the data down by state.

Third, the use of two maps indicates that the designer does not care much about gender comparisons within each state. These comparisons are difficult to accomplish on the chart - one must bob one's head up and down repeatedly to make them. The head bobbing isn't even enough: then you must pull out your calculator and compute the ratio of the women's average to the men's. If the designer wants to highlight state-level comparisons, she/he could have plotted the gender ratio on a single map, like this:

Screen Shot 2017-09-18 at 11.47.23 PM

***

So far, I infer that the key questions are (a) the gender gap in aggregate (b) the variability of incomes within each gender, or the spatial clustering (c) the gender gap within each state.

(a) is better conveyed in more aggregate form; (b) is defeated by the lack of clear clustering; (c) is not helped by the top-bottom split.

In making the above chart, I discovered a pattern - women fare better in smaller states like Montana, Iowa, and North and South Dakota. Meanwhile, the disparity in New York is of the same degree as in Oklahoma and Wyoming.

  Jc_redo_top2pcincomes2b

 This chart tells readers a bit more about the underlying data, without having to print the entire dataset on the page.



A pretty good chart ruined by some naive analysis

The following chart showing wage gaps by gender among U.S. physicians was sent to me via Twitter:

Statnews_physicianwages

The original chart was published by the Stat News website (link).

I am most curious about the source of the data. It apparently came from a website called Doximity, which collects data from physicians. Here is a link to the PR release related to this compensation dataset. However, the data are not freely available. The data are claimed to come from self-reports by 36,000 physicians.

I am not sure whether I trust this data. For example:

Stat_wagegapdoctor_1

Do I believe that physicians in North Dakota earn the highest salaries on average in the nation? And not only that: they earn almost 30% more than the average physician in New York. Does the average physician in ND really earn over $400K a year? If you are wondering, the second-highest salary number comes from South Dakota, and then Idaho. Also, these high-salary states are the ones showing the lowest gender wage gaps.

I suspect that sample size is an issue. They do not report sample sizes at the level of their analyses. They apparently published statistics at the level of MSAs. There are roughly 400 MSAs in the U.S., so at that level they have on average only 90 respondents per MSA. Split by gender, the average sample size is under 50. They are then comparing differences between these small groups, so we should be shown standard errors. And finally, they are making hundreds of such comparisons, for which some kind of multiple-comparisons correction is needed.
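To see how noisy such comparisons would be, here is a back-of-envelope sketch. All the numbers are my assumptions (in particular the $60K within-MSA salary spread), not Doximity's:

```python
import math

# Back-of-envelope: ~36,000 reports over ~400 MSAs gives ~90 physicians
# per MSA, so roughly 45 per gender (assumed even split).
n_per_group = 45
sd_salary = 60_000  # assumed spread of physician salaries within an MSA

# Standard error of the difference between the male and female means.
se_diff = math.sqrt(sd_salary ** 2 / n_per_group
                    + sd_salary ** 2 / n_per_group)
print(f"standard error of the estimated gap: ${se_diff:,.0f}")
```

Under these assumptions, a gap estimate carries an uncertainty of roughly two standard errors, i.e. on the order of $25K either way - enough to swamp modest gaps.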

I am pretty sure some of you are doctors, or work in health care. Do those salary numbers make sense? Are you moving to North/South Dakota?

***

Turning to the Visual corner of the Trifecta Checkup (link), I have a mixed verdict. The hover-over effect showing the precise values on both axes is a nice idea, well executed.

I don't see the point of drawing a circle inside a circle. The wage gap is already on the vertical axis, and the redundant representation in dual circles adds nothing. Worse, because of this construct, the size of the bubbles now encodes the male average salary, drawing attention away from the gender gap, which is the point of the chart.

I also don't think the regional analysis (conveyed by the colors of the bubbles) is producing a story line.

***

This is another instance of dubious analysis in this "big data" era. The analyst makes no attempt to correct for self-reporting bias, and proceeds as if the dataset were complete. There is no indication of concern about sample sizes as the analyst drills down to finer slices of the dataset. While other variables are available, such as specialty, and others could be merged in, such as local income levels - any of which may explain at least a portion of the gender wage gap - no attempt has been made to incorporate them. We are stuck with a bivariate analysis that does not control for any other factors.

Last but not least, the analyst draws a bold conclusion from the overly simplistic analysis. Here, we are told: "If you want that big money, you can't be a woman." (link)

 

P.S. The Stat News article reports that the researchers at Doximity claimed that they controlled for "hours worked and other factors that might explain the wage gap." However, in Doximity's own report, there is no language confirming how they included the controls.

 


It's your fault when you use defaults

The following chart showed up on my Twitter feed last week. It's a cautionary tale for using software defaults.

Booksaleschart_sourceBISG_fromtwitter

At first glance, the stacking of years in a bar chart makes little sense. This is particularly so when there doesn't appear to be any interesting annual trend: the four segments seem to have roughly equal lengths almost everywhere.

This designer might be suffering from what I have called "loss aversion" (link). Loss aversion in data visualization is the fear of losing your data, which causes people to cling to every last bit of data they have.

Several problems with the chart come from software defaults. The bars are ordered alphabetically, making it difficult to discern any pattern. The horizontal axis labels are given in single dollars and units, even though the designer's intention, as indicated in the chart titles, is to use millions.

The one horrifying feature of this chart is the 3D effect. The third dimension contains no information at all. In fact, it destroys information, as readers who use the vertical gridlines to estimate the lengths of the bars will be sadly misled. As shown below, readers must draw imaginary lines to figure out the horizontal values.

Twitter_booksalescategories_0

The Question of this chart concerns the distribution of book sales (revenues and units) across genres. When the designer chose to stack the bars (i.e., sum the yearly data), he or she decided that the details of specific years are not as important as the totals - the right conclusion, since the bar segments have similar lengths within each genre.

So let's pursue the idea of averaging the data, plotting average yearly sales.

Redo_twitter_bookssalescategories

This chart shows that there are two major types of genres. In the education world, the unit prices of (text)books are very high; unit sales are relatively small, but the aggregate dollar revenues are high. In the "adult" world, whether fiction or non-fiction, the unit price is low while the number of units is high, resulting in total dollar revenues similar to the education genres.
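The two-cluster structure can be seen in a tiny sketch of the averaging step. The sales figures below are invented for illustration, not taken from the BISG data:

```python
# Invented yearly unit sales (millions) and revenues ($ millions) by genre,
# four years each, standing in for the stacked segments.
data = {
    "Adult Fiction":    {"units": [560, 540, 530, 520],
                         "dollars": [4400, 4300, 4200, 4100]},
    "Higher Education": {"units": [80, 78, 75, 72],
                         "dollars": [4900, 4800, 4700, 4600]},
}

for genre, d in data.items():
    # Average yearly sales, replacing the four stacked segments.
    avg_units = sum(d["units"]) / len(d["units"])
    avg_dollars = sum(d["dollars"]) / len(d["dollars"])
    price = avg_dollars / avg_units  # implied average price per unit
    print(f"{genre}: {avg_units:.0f}M units/yr, "
          f"${avg_dollars:.0f}M/yr, ~${price:.0f}/unit")
```

Similar dollar totals, wildly different unit prices and unit volumes: that is the contrast the averaged chart surfaces.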

***

Simple lesson here: learn to hate software defaults.


Sorting out what's meaningful and what's not

A few weeks ago, the New York Times Upshot team published a set of charts exploring the relationship between school quality, home prices and commute times in different regions of the country. The following is the chart for the New York/New Jersey region. (The article and complete data visualization is here.)

Nyt_goodschoolsaffordablehomes_nyc

This chart is primarily a scatter plot of home prices against school quality, which is represented by average test scores. The designer wants to explore the decision to live in the so-called central city versus the suburbs, hence the centering of the chart on New York City. Further, the colors of the dots represent average commute times, divided into two broad categories (under/over 30 minutes). The dots also have different sizes, which I presume measure the populations of the districts (but there is no legend for this).

This data visualization has generated some negative reviews, and so has the underlying analysis. In a related post on the sister blog, I discuss the underlying statistical issues. For this post, I focus on the data visualization.

***

One positive about this chart is the designer has a very focused question in mind - the choice between living in the central city or living in the suburbs. The line scatter has the effect of highlighting this particular question.

Boy, those lines are puzzling.

Each line connects New York City to a specific school district. The slope of the line is, nominally, the trade-off between home price and school quality: the change in home price for each unit shift in school quality. But these lines don't really measure that trade-off, because the slopes span too wide a range.

The average person should have a relatively fixed home-price-to-school-quality trade-off. If we could estimate this average trade-off, it should be represented by a single slope (with a small cone of error around it). The wide range of slopes actually undermines this chart, as it demonstrates that there are many other variables that factor into the decision. Other factors are causing the average trade-off coefficient to vary so widely.

***

The line scatter is confusing for a different reason. It reminds readers of a flight route map. For example:

BA_NYC_Flight_Map

The first instinct may be to interpret the locations on the home-price-school-quality plot as geographical. Such misinterpretation is reinforced by the third factor being commute time.

Additionally, on an interactive chart, it is typical to hide the data labels behind mouseovers or clicks. I like the fact that the designer identifies some interesting locales by name without requiring a click. However, one slight oversight is the absence of data labels for NYC. There is nothing to click on to reveal the commute/population/etc. data for central cities.

***

In the sister blog post, I mentioned another difficulty - most of the neighborhoods are situated to the right and below New York City, challenging the notion of a "trade-off" between home price and school quality. It appears as if most people can spend less on housing and also send kids to better schools by moving out of NYC.

In the New York region, commute times may be the stronger factor relative to school quality. Perhaps families chose NYC because they value shorter commute times more than better school quality. Or, perhaps the improvement in school quality is not sufficient to overcome the negative of a much longer commute. The effect of commute times is hard to discern on the scatter plot as it is coded into the colors.

***

A more subtle issue can be seen when comparing San Francisco and Boston regions:

Nyt_goodschoolsaffordablehomes_sfobos

One key insight is that San Francisco homes are on average twice as expensive as Boston homes. Also, the variability of home prices is much higher in San Francisco. By using the same vertical scale on both charts, the designer makes this insight clear.

But what about the horizontal scale? There isn't any explanation of this grade-level scale. The central city appears close to the average grade level in each chart, so it seems that each region is individually centered. Otherwise, I'd expect to see more variability in the horizontal positions of the dots across regions.

If one scale is fixed across regions, and the other scale is adapted to each region, then we shouldn't compare the slopes across regions. The fact that the lines are generally steeper in the San Francisco chart may be an artifact of the way the scales are treated.

***

Finally, I'd recommend aggregating the data, and not plot individual school districts. The obsession with magnifying little details is a Big Data disease. On a chart like this, users are encouraged to click on individual districts and make inferences. However, as I discussed in the sister blog (link), most of the differences in school quality shown on these charts are not statistically meaningful (whereas the differences on the home-price scale are definitely notable). 

***

If you haven't already, see this related post on my sister blog for a discussion of the data analysis.



February talks, and exploratory data analysis using visuals

News:

In February, I am bringing my dataviz lecture to various cities: Atlanta (Feb 7), Austin (Feb 15), and Copenhagen (Feb 28). Click on the links for free registration.

I hope to meet some of you there.

***

On the sister blog about predictive models and Big Data, I have been discussing aspects of a dataset containing IMDB movie data. Here are previous posts (1, 2, 3).

The latest instalment contains the following chart:

Redo_scorebytitleyear_ans

The general idea is that the average rating of the average film on IMDB has declined from about 7.5 to 6.5... but this does not mean that IMDB users like oldies more than recent movies. The problem is a bias in the IMDB user base. Since IMDB launched only in 1990, users are much more likely to be reviewing movies released after 1990 than before. Further, when users do review oldies, they are likely reviewing oldies they like and return to, rather than the horrible movie they watched 15 years ago.

Modelers should be exploring and investigating their datasets before building their models. Same thing for anyone doing data visualization! You need to understand the origin of the data, and its biases in order to tell the proper story.

Click here to read the full post.



Here are the cool graphics from the election

There was some very nice graphics work published during the last few days of the U.S. presidential election. Let me tell you why I like the following four charts.

FiveThirtyEight's snake chart

Snake-1106pm

This chart definitely hits the Trifecta. It is narrowly focused on the pivotal questions of election night: Which candidate is leading? If current projections hold, which candidate would win? What is the margin of victory?

The chart is symmetric so that the two sides have equal length. One can therefore immediately tell which side is in the lead by looking at the middle. With a little more effort, one can also read which side has more electoral votes based only on the called states, by comparing the white parts of the two snakes. (This is made difficult by the top-bottom mirroring - an unfortunate design decision; I would have preferred not to have the reversal.)

The length of each segment maps to the number of electoral votes for the particular state, and the shade of color reflects the size of the advantage.

In a great illustration of less is more, by aggregating all called states into a single white segment rather than presenting the individual results, the 538 team delivered a phenomenal chart that is refreshing, informative, and functional.

 Compare with a more typical map:

Electoral-map

 New York Times's snake chart

Snakes must be the season's gourmet meat because the New York Times also got inspired by those reptiles by delivering a set of snake charts (link). Here's one illustrating how different demographic segments picked winners in the last four elections.

 

Nytimes_partysupport_by_income

They also made a judicious decision by highlighting the key facts and hiding the secondary ones. Each line connects four points of data but only the beginning and end of each line are labeled, inviting readers to first and foremost compare what happened in 2004 with what happened in 2016. The middle two elections were Obama wins.

This particular chart may prove significant for decades to come. It illustrates that the two parties may be arriving at a cross-over point. The Democrats are driving the lower income classes out of their party while the upper income classes are jumping over to blue.

While the chart's main purpose is to display the changes within each income segment, it also allows readers to address a secondary question. Focusing only on the 2004 endpoints, one sees an almost linear relationship between support and income level. Focusing on the 2016 endpoints, one sees another almost linear relationship, but a much steeper one, meaning the spread is much narrower than in 2004. I don't think this means income matters a lot less - I just think this may be the first step in an ongoing demographic shift.

This chart is both fun and easy to read, packing quite a bit of information into a small space.

 

Washington Post's Nation of Peaks

The Post prints a map that shows, by county, where the votes were and how the two Parties built their support. (Link to original)

Wpost_map_peaks

The height represents the number of voters and the width represents the margin of victory. Landslide victories are shown with bolded triangles. In the online version, they chose to turn the map sideways.

I particularly like the narratives about specific places.

This is an entertaining visual that draws you in to explore.

 

Andrew Gelman's Insight

If you want quantitative insights, it's a good idea to check out Andrew Gelman's blog.

This example is a plain statistical graphic but it says something important:

Gelman_twopercent

There is a lot of noise about how the polls were all wrong, the entire polling industry will die, etc.

This chart shows that the polls were reasonably accurate about Trump's vote share in most Democratic states. In the Republican states, these polls consistently under-estimated Trump's advantage. You see the line of red states starting to bend away from the diagonal.

If the total error is about 2%, as stated in the caption of the chart, then the average error in the red states must have been about 4%.

This basic chart advances our understanding of what happened on election night, and why the result was considered a "shock."



What if the RNC assigned seating randomly

The punditry has spoken: the most important data question at the Republican Convention is where different states are located. Here is the FiveThirtyEight take on the matter:

Seatinchart.jpg_large

They crunched some numbers and argue that Trump's margin of victory in the state primaries is the best indicator of how close to the front that state's delegation is situated.

Others have put this type of information on a map:

Rnc-seating1

The scatter plot with the added "trendline" is often misleading. Your eyes are drawn to the line, and distracted from the points that sit far from it. In fact, the R-squared of the regression line is only about 20%. This is quite obvious from the distribution of green shades on the map above.

***

So, I wanted to investigate the question of how robust this regression line is. The way statisticians address this question is as follows: imagine that the seating has been assigned completely at random - how likely would the actual seating plan have arisen from random assignment?

Take the seating assignments from the scatter plot, then randomly shuffle them to create simulated random seating plans. We keep the same slots: for example, four states were given #1 positions in the actual arrangement, so in every simulation four states get #1 positions - it's just that which four is decided by the flip of a coin.

I generated simulated seating plans one hundred at a time. For each plan, I created the scatter plot of seating position versus Trump margin (the mirror image of the FiveThirtyEight chart) and fitted a regression line. The following shows the slopes of the first 200 simulations:
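Since I don't have the underlying data, the sketch below generates stand-in margins and an assignment that loosely tracks them; the shuffle-and-refit logic is the point:

```python
import random

random.seed(42)

# Hypothetical stand-ins: Trump primary margin per state, and a seating
# row that loosely tracks margin (bigger margin -> closer to the front).
margins = [random.uniform(-20, 50) for _ in range(56)]
order = sorted(range(56), key=lambda i: -margins[i] + random.gauss(0, 15))
actual_rows = [0] * 56
for rank, i in enumerate(order):
    actual_rows[i] = rank + 1

def slope(x, y):
    # OLS slope of y on x, computed by hand.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

actual_slope = slope(margins, actual_rows)

# Shuffle the same seat slots among states to simulate random assignment,
# refitting the regression each time.
sim_slopes = []
for _ in range(500):
    shuffled = actual_rows[:]
    random.shuffle(shuffled)
    sim_slopes.append(slope(margins, shuffled))

mean_s = sum(sim_slopes) / len(sim_slopes)
sd_s = (sum((s - mean_s) ** 2 for s in sim_slopes) / len(sim_slopes)) ** 0.5
print(f"actual slope {actual_slope:.2f} sits "
      f"{(actual_slope - mean_s) / sd_s:.1f} SDs from the random mean")
```

The histogram of `sim_slopes` is the reference distribution against which the actual plan's slope is judged.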

Redo_rncseating_slopes_2

The more negative the slope, the more power Trump margin has in explaining the seating arrangement.

Notice that even though all these plans were created at random, the magnitude of the slopes ranges widely. In fact, one randomly created plan sits right below the actual RNC plan, shown in red. So it is possible--but very unlikely--that the RNC plan was drawn up at random.

Another view of this phenomenon is the histogram of the slopes:

Redo_rncseating_hist_2

This again shows that the actual seating plan is very unlikely to be produced by a random number generator. (I plotted 500 simulations here.)

In statistics, we measure rarity in "standard errors." The actual plan is almost, but not quite, three standard errors away from the average random plan. A rule of thumb is that 3 standard errors or more is rare. (This corresponds to over 99% confidence.)
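To connect standard errors to confidence levels: under a normal approximation, the one-sided tail probability can be computed from the complementary error function, so 3 standard errors indeed corresponds to well over 99% confidence:

```python
import math

def normal_tail(z):
    # One-sided tail probability of a standard normal, via the
    # complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2))

for z in (2, 3):
    print(f"{z} standard errors out: tail probability {normal_tail(z):.4f}")
```

At 2 SEs the tail is about 2%; at 3 SEs it shrinks to roughly a tenth of a percent, which is why the rule of thumb treats 3 SEs as rare.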

 

PS. Does anyone have the data corresponding to the original scatter plot? There are other things I want to do with the data but I'd need to find (a) the seating position by state and (b) the primary results nicely set in a spreadsheet.