The tech world in which everyone is below average

Laura pointed me to an infographic about tech worker salaries in major tech hubs (link).

What's wrong with this map?

[Figure: Entrepreneur_techsalaries_map]

The box "Global average" is doubly false. It is not global, and it is not the average!

The only non-American cities included in this survey are Toronto, Paris and London.

The only city with an average salary above the "Global average" is the San Francisco Bay Area. Since the Bay Area does not outweigh all the other cities combined in the number of tech workers, it is impossible for the average to reach $135,000.
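A back-of-the-envelope check makes the impossibility plain. The city salaries below are hypothetical (the infographic's exact figures aren't reproduced here); only the $135,000 "Global average" comes from the chart:

```python
# Hypothetical averages; only SF Bay Area exceeds the claimed $135K "average".
salaries = {
    "SF Bay Area": 142_000,
    "Seattle": 126_000,
    "New York": 120_000,
    "Toronto": 74_000,
    "London": 75_000,
    "Paris": 61_000,
}
others = [v for k, v in salaries.items() if k != "SF Bay Area"]
avg_others = sum(others) / len(others)  # ~ $91,200

# For the weighted average to reach $135K, solve
#   w * 142000 + (1 - w) * 91200 = 135000  for the Bay Area's worker share w:
w = (135_000 - avg_others) / (salaries["SF Bay Area"] - avg_others)
print(f"Bay Area would need {w:.0%} of all tech workers")  # ~ 86%
```

Under these (or any similar) assumptions, the Bay Area would need the overwhelming majority of all tech workers for the arithmetic to work out.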

***

Here is the second chart.

What's wrong with these lines?

[Figure: Entrepreneur_techsalaries_lines]

This chart frustrates the reader's expectations. The reader interprets it as a simple line chart, based on three strong hints:

  • time along the horizontal axis
  • data labels show dollar units
  • lines linking the values across time

Each line seems to show the trend of average tech worker salary, in dollar units.

However, that isn't the designer's intention. Let's zoom in on Chicago and Denver:

[Figure: Entrepreneur_techsalaries_lines2]

The number $112,000 (Denver) sits below the number $107,000 (Chicago). It appears that each chart has its own scale. But that's not the case either.

In a small-multiples setup, we expect all charts to use the same scale. Even though the data labels are absolute dollar amounts, the vertical axis is on a relative scale (percent change). To make things even more complicated, the percent change is computed relative to the minimum of the three annual values, no matter which year it occurs in.

[Figure: Redo_entrepreneurtechsalarieslines2]

That's why $106,000 (Chicago) is at the same level as $112,000 (Denver). Those are the minimum values in the respective time series. As shown above, these line charts are easier to understand if the axis is displayed in its true units of percent change.

The choice of the minimum value as the reference level interferes with comparing one city to the next. For Chicago, the line chart tells us 2015 is about 2 percent above 2016, while 2017 is 6 percent above. For Denver, the line chart tells us that 2016 is about 2 percent above the 2015 and 2017 values. Now what's the message again?
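To see how the two indexing schemes differ, here is a small sketch; the salary numbers are made up to roughly match the charts (Chicago's minimum falls in 2016, Denver's in 2015 and 2017):

```python
# Made-up salaries roughly matching the charts.
chicago = {2015: 108_000, 2016: 106_000, 2017: 112_000}
denver = {2015: 112_000, 2016: 114_000, 2017: 112_000}

def pct_change_vs_min(series):
    """What the original chart does: percent above the series minimum,
    whichever year the minimum falls in."""
    base = min(series.values())
    return {yr: round((v / base - 1) * 100, 1) for yr, v in series.items()}

def pct_change_vs_first(series):
    """The alternative: percent change vs. the earliest year, so every
    line starts at zero and cities are directly comparable."""
    base = series[min(series)]
    return {yr: round((v / base - 1) * 100, 1) for yr, v in series.items()}

print(pct_change_vs_min(chicago))    # {2015: 1.9, 2016: 0.0, 2017: 5.7}
print(pct_change_vs_min(denver))     # {2015: 0.0, 2016: 1.8, 2017: 0.0}
print(pct_change_vs_first(chicago))  # {2015: 0.0, 2016: -1.9, 2017: 3.7}
```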

Here I index all lines to the earliest year.

[Figure: Redo_junkcharts_entrepreneurtechsalaries_lines]

In a Trifecta Checkup analysis (link), I'd be suspicious of the data. Did tech salaries in London really drop by 15-20 percent in the last three years?

When your main attraction is noise

Peter K. asked me about this 538 chart, a stacked column chart in which the percentages appear not to add up to 100%. Link to the article here.

[Figure: 538-cox-evangelicals-1]

Here's my reply:

They made the columns so tall that the "rounding errors" (noise) disclosed in the footnotes became the main attraction.

***

The gap between the highest and lowest peaks looks large, but that is mostly due to the aspect ratio. The gap is only about 2 percentage points at its widest (101% versus 99%), which is precisely the rounding error disclosed below the chart.
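A two-line demonstration of how rounding alone produces those 99% and 101% totals (the category shares are invented, but each set sums to exactly 100 before rounding):

```python
def rounded_total(shares):
    # Round each category to the nearest whole percent, then sum.
    return sum(round(s) for s in shares)

print(rounded_total([30.6, 30.6, 38.8]))  # 101, though the true total is 100
print(rounded_total([30.4, 30.4, 39.2]))  # 99, though the true total is 100
```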

The lesson here is to make sure you suppress the noise and accentuate your data!

A chart Hans Rosling would have loved

I came across this chart from the OurWorldinData website, and this one would have made the late Hans Rosling very happy.

[Figure: MaxRoser_Two-centuries-World-as-100-people]

If you ever attended one of Professor Rosling's talks, you'd know he was bitter that the amazing gains in public health worldwide (particularly in less developed nations) during the last few decades have gone largely unnoticed. This chart makes the case clearly: note especially the dramatic plunge in extreme poverty, the rise in vaccinations, the drop in child mortality, and the improvements in education and literacy, mostly achieved in the last few decades.

This set of charts has a simple but powerful message. It's the simplicity of execution that really helps readers get that powerful message.

The text labels on the left and right side of the charts are just perfect.

***

Little things that irk me:

I am not convinced by the liberal use of colors. I would make the "other" category of each chart a consistent gray, for six colors in total. Admittedly, the different colors do make the chart more interesting to look at.

Even though the gridlines are muted, I still find them excessive.

There is a coding bug in the Vaccination chart right around 1960.

Getting into the head of the chart designer

When I look at this chart (from Business Insider), I try to understand the decisions made by its designer - which things are important to her/him, and which things are less important.

[Figure: Incomegendergapbystate-both-top-2-map-v2]

The chart shows the average salaries of the top 2 percent of income earners. The data are split by gender and by state.

First, I notice that the designer chose the map form. This decision suggests that the spatial pattern of top incomes is of primary interest, because the designer is willing to accept the map's constraints - namely, losing control of the x and y dimensions, as well as the area and shape of the data containers. For the U.S. state map, there is no elegant solution to the problem of the many small states in the Northeast.

Second, I notice the color choice. The designer prints the actual values on the visualization but also groups the state-average incomes into five categories. It's not clear how she/he determined the boundaries of these income brackets. There are many more dark blue states than light blue states in the men's map. Because women's incomes are everywhere lower than men's, the map at the bottom fits all states into two large buckets, plus Connecticut. Women's incomes are lower than men's, but there is no need to break the data down by gender to convey this message.
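Here is one way the bracketing could have been made explicit, sketched with hypothetical state averages (the article's data aren't reproduced here); the last lines compute the ratio behind the single-map alternative shown below:

```python
import pandas as pd

# Hypothetical top-2% average salaries by state.
df = pd.DataFrame({
    "state": ["CT", "NY", "MT", "IA", "OK", "WY"],
    "men": [890_000, 860_000, 540_000, 560_000, 600_000, 620_000],
    "women": [480_000, 430_000, 370_000, 360_000, 350_000, 380_000],
})

# Explicit bracket boundaries: quintiles of the pooled men's and women's
# averages, so both maps would share one legend.
pooled = pd.concat([df["men"], df["women"]])
_, bins = pd.qcut(pooled, q=5, retbins=True)
df["men_bracket"] = pd.cut(df["men"], bins=bins, include_lowest=True)
df["women_bracket"] = pd.cut(df["women"], bins=bins, include_lowest=True)

# The single-map alternative: one number per state.
df["women_to_men"] = (df["women"] / df["men"]).round(2)
print(df[["state", "women_to_men"]])
```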

Third, the use of two maps indicates that the designer does not care much about gender comparisons within each state. These comparisons are difficult to accomplish on the chart - one must bob one's head up and down to make them. The head bobbing isn't even enough: one must then pull out a calculator and compute the ratio of the women's average to the men's. If the designer wanted to highlight state-level comparisons, she/he could have plotted the gender ratio on a single map, like this:

[Figure: Screen Shot 2017-09-18 at 11.47.23 PM]

***

So far, I infer that the key questions are (a) the gender gap in aggregate, (b) the variability of incomes within each gender, or the spatial clustering, and (c) the gender gap within each state.

Goal (a) is better conveyed in more aggregate form; goal (b) is defeated by the lack of clear spatial clustering; and goal (c) is not helped by the top-bottom split.

In making the above chart, I discovered a pattern: women fare better in smaller states like Montana, Iowa, and North and South Dakota. Meanwhile, the disparity in New York is of the same degree as in Oklahoma and Wyoming.

[Figure: Jc_redo_top2pcincomes2b]

This chart tells readers a bit more about the underlying data, without having to print the entire dataset on the page.

A long view of hurricanes

This chart by Axios is well made. The full version is here.

[Figure: Axios_hurricanes]

It's easy to identify all the Cat 5 hurricanes. Only the important ones are labeled; the other labels are hidden behind the hover effect. The chart provides a good answer to the question: at what time of the year do the worst hurricanes strike? It's harder to compare the hurricanes' maximum speeds.

I wish there were a way to incorporate geography. I'd be willing to trade away the wind-speed trajectories, since the maximum speed is what matters most.


Shocker: ease of use requires expanding, not restricting, choices

Recently, I noted how we have to learn to hate defaults in data visualization software. I was reminded again of this point when reviewing this submission from long-time reader & contributor Chris P.

[Figure: Medium_retailstocks]

The chart is included in this Medium article, which credits Mott Capital Management as the source.

[Figure: Jc_medium_retailers]

Look at the axis labels on the right side. They have the hallmarks of software defaults. The software designer decided that the axis labels will be formatted in exactly the same way as the data in that column: this means $XXX.XXB, with two decimal places. The same formatting rule is in place for the data labels, shown in boxes.

Why put tick marks at odd intervals like 37.50, 62.50, 87.50, ...? What's wrong with 40, 60, 80, 100, ...? It comes down to machine thinking versus human thinking.
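For what it's worth, overriding this kind of default takes two lines in most charting libraries. A matplotlib sketch, with made-up revenue numbers:

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter, MultipleLocator

fig, ax = plt.subplots()
ax.plot([2013, 2014, 2015, 2016, 2017], [74.5, 89.0, 107.0, 136.0, 142.6])

# Human thinking: round tick intervals (60, 80, 100, ...) and labels without
# spurious precision ($80B, not $87.50B).
ax.yaxis.set_major_locator(MultipleLocator(20))
ax.yaxis.set_major_formatter(FuncFormatter(lambda v, _: f"${v:.0f}B"))
plt.show()
```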

This software places the most recent values into data labels, formatted as boxes that point to the positions of those values on the axis. Evidently, it doesn't have a plan for overcrowding. At the bottom of the axis, we see four labels for six lines. The blue, pink and orange labels point to the wrong places on the axis.

Worse, it's unclear what those "most recent" values represent. I have added gridlines for each year on the excerpt shown at right. The lines extend to 2017, a year that isn't even half over.

Now, consider the legend. Which version do you prefer?

[Figure: Jc_medium_retailers_legend]

Most likely, the original dataset has columns named "Amazon.com Revenue (TTM)", "Dillard's Revenue (TTM)", etc. so the software just picks those up and prints them in the legend text.

***

The chart is an output of YCharts, which I learned is a Bloomberg terminal competitor. It probably uses one of the available Web graphing packages. These packages typically emphasize ease of use, achieved by automating the data visualization process. Ease of use is defined as rigid defaults that someone has determined to be the optimal settings. Users then discover that there is no getting around those settings; in some cases, a coding interface is available, which defeats the goal of user-friendliness.

The problem lies in defining what ease of use means. Ease of use should mean expanding, not restricting, choices. Setting rigid defaults restricts choices. In addition to providing good defaults, the software designer should make it simple for users to make their own choices. Ideally, each of the elements (data labels, gridlines, tick marks, etc.) can be independently removed, shifted, expanded, reduced, re-colored, edited, etc. from their original settings.
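A sketch of what this looks like in code (a hypothetical wrapper, not any particular package's API): every element gets a sensible default, and every default remains a keyword the user can override:

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter, MultipleLocator

def line_chart(series, *, tick_interval=None, label_fmt="${:.0f}B",
               gridlines=False, legend_names=None):
    """Good defaults, independently overridable choices."""
    fig, ax = plt.subplots()
    for name, (x, y) in series.items():
        ax.plot(x, y, label=(legend_names or {}).get(name, name))
    if tick_interval is not None:  # override only when the user asks
        ax.yaxis.set_major_locator(MultipleLocator(tick_interval))
    ax.yaxis.set_major_formatter(FuncFormatter(lambda v, _: label_fmt.format(v)))
    ax.grid(gridlines)
    ax.legend()
    return ax

# The caller can shorten "Amazon.com Revenue (TTM)" to "Amazon" in the legend
# without touching the underlying data - the opposite of a rigid default.
# (Revenue numbers below are made up.)
line_chart(
    {"Amazon.com Revenue (TTM)": ([2015, 2016, 2017], [107.0, 136.0, 142.6])},
    tick_interval=20,
    legend_names={"Amazon.com Revenue (TTM)": "Amazon"},
)
plt.show()
```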


A pretty good chart ruined by some naive analysis

The following chart showing wage gaps by gender among U.S. physicians was sent to me via Twitter:

[Figure: Statnews_physicianwages]

The original chart was published by the Stat News website (link).

I am most curious about the source of the data. It apparently came from a website called Doximity, which collects data from physicians. Here is a link to the PR release related to this compensation dataset. However, the data are not freely available. The data are claimed to come from self-reports by 36,000 physicians.

I am not sure whether I trust this data. For example:

[Figure: Stat_wagegapdoctor_1]

Do I believe that physicians in North Dakota earn the highest salaries, on average, in the nation? And not only that: they earn almost 30% more than the average physician in New York. Does the average physician in ND really earn over $400K a year? If you are wondering, the second-highest salary number comes from South Dakota, and then Idaho. Also, these high-salary states coincide with the lowest gender wage gaps.

I suspect that sample size is an issue. They do not report sample sizes at the level of their analyses. They apparently published statistics at the level of MSAs. There are roughly 400 MSAs in the U.S., so at that level they average only 90 respondents per MSA. Split by gender, the average sample size is under 50. Further, they are comparing differences, so we should be shown the standard errors. And finally, they are making hundreds of such comparisons, which calls for some kind of multiple-comparisons correction.
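A back-of-the-envelope calculation of what those sample sizes imply (the $100K salary standard deviation and the group sizes are my assumptions, not reported figures):

```python
import math

# Standard error of a difference in means with ~45 men and ~45 women per MSA,
# assuming a within-group salary SD of $100K.
sd, n_men, n_women = 100_000, 45, 45
se_gap = math.sqrt(sd**2 / n_men + sd**2 / n_women)
print(f"SE of the gender gap: ${se_gap:,.0f}")     # ~ $21,000
print(f"95% CI width: +/- ${1.96 * se_gap:,.0f}")  # ~ +/- $41,000

# And with ~400 MSA-level comparisons, a Bonferroni correction shrinks the
# per-comparison significance threshold from 0.05 to:
print(0.05 / 400)  # 0.000125
```

Under these assumptions, an MSA-level gender gap would need to exceed roughly $40K before it clears even an uncorrected significance bar.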

I am pretty sure some of you are doctors, or work in health care. Do those salary numbers make sense? Are you moving to North/South Dakota?

***

Turning to the Visual corner of the Trifecta Checkup (link), I have a mixed verdict. The hover-over effect showing the precise values on either axis is a nice idea, well executed.

I don't see the point of drawing a circle inside a circle. The wage gap is already on the vertical axis, and the redundant representation in dual circles adds nothing. Because of this construct, the size of the bubbles now encodes the male average salary, drawing attention away from the gender gap, which is the point of the chart.

I also don't think the regional analysis (conveyed by the colors of the bubbles) is producing a story line.

***

This is another instance of dubious analysis in this "big data" era. The analyst makes no attempt to correct for self-reporting bias, and proceeds as if the dataset were complete. There is no indication of any concern about sample sizes after the analyst drills down to finer segments of the dataset. Other variables are available, such as specialty, and still others can be merged in, such as income levels; any of these may explain at least a portion of the gender wage gap, yet no attempt was made to incorporate them. We are stuck with a bivariate analysis that does not control for any other factors.
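For contrast, here is the kind of analysis I'd want to see, sketched on a made-up micro-dataset (none of this is Doximity's data): the same gender coefficient, before and after adding controls.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up physician records: salary in $K, gender, specialty, weekly hours.
df = pd.DataFrame({
    "salary": [310, 290, 420, 395, 260, 255, 380, 360],
    "female": [1, 0, 1, 0, 1, 0, 1, 0],
    "specialty": ["im", "im", "cards", "cards", "peds", "peds", "rad", "rad"],
    "hours": [48, 52, 55, 60, 45, 47, 50, 54],
})

bivariate = smf.ols("salary ~ female", data=df).fit()
adjusted = smf.ols("salary ~ female + C(specialty) + hours", data=df).fit()

# The raw gap vs. the gap after controlling for specialty and hours worked.
print(bivariate.params["female"], adjusted.params["female"])
```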

Last but not least, the analyst draws a bold conclusion from the overly simplistic analysis. Here, we are told: "If you want that big money, you can't be a woman." (link)


P.S. The Stat News article reports that the researchers at Doximity claimed to have controlled for "hours worked and other factors that might explain the wage gap." However, Doximity's own report contains no language explaining how these controls were implemented.

An enjoyable romp through the movies

Chris P. tipped me about this wonderful webpage containing an analysis of high-grossing movies. The direct link is here.

First, a Trifecta checkup: This thoughtful web project integrates beautifully rendered, clearly articulated graphics with the commendable objective of bringing data to the conversation about gender and race issues in Hollywood, an ambitious goal that it falls short of achieving because the data only marginally address the question at hand.

There is some intriguing just-beneath-the-surface interplay between the Q (question) and D (data) corners of the Trifecta, which I will get to in the lower half of this post. But first, let me talk about the Visual aspect of the project, which for the most part, I thought, was well executed.

The leading chart is simple and clear, setting the tone for the piece:

[Figure: Polygraphfilm_bars]

I like the use of color here. The colored chart titles are inspired. I also like the double color coding - notice that the proportion data are encoded not just in the lengths of the bar segments but also in the opacity. There is some messiness in the right-hand-side labeling of the first chart, but that's probably just a bug.

This next chart also contains a minor delight: upon scrolling to the following dot plot, the reader finds that one of the dots has been labeled; this is a signal to readers that they can click on the dots to reveal the "tooltips". It's a little thing but it makes a world of difference.

[Figure: Polygraphfilm_dotplotwithlabel]

I also enjoy the following re-imagination of those proportional bar charts from above:

[Figure: Polygraphfilm_tinmen_bars]

This form fits well with the underlying data structure (a good example of setting the V and the D in harmony). The chart shows the proportion of words spoken by male versus female actors over the course of a single movie (Tin Men from 1987 is the example shown here). The chart is centered in an unusual way, making it easy to read exactly when the females are allowed to have their say.

There is again a possible labeling hiccup. The middle label says "40th minute," which would imply the entire movie is only 80 minutes long. (A quick check shows Tin Men runs 110 minutes.) It seems that they are only concerned with dialog, ignoring all moments of soundtrack or silence. The visualization would be even more interesting if those non-dialog moments were presented.

***

The reason the music and silence are missing has more to do with practicality than will. The raw material (Data) is movie scripts. The authors, much to their credit, acknowledge many of the problems that come with this data, starting with the fact that directors make edits to the scripts. It is also not clear how to locate each line along the duration of the movie; an assumption about the speed of dialog seems to be required.
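A sketch of what that assumption looks like in practice; the 150-words-per-minute speaking rate is my guess, not the project's:

```python
WORDS_PER_MINUTE = 150  # assumed constant speaking rate

def estimate_timestamps(script_lines):
    """script_lines: list of (character, line_text) in script order.
    Returns (character, estimated_minute) pairs. Note what's lost:
    soundtrack and silence take up no time at all in this model."""
    results, words_so_far = [], 0
    for character, text in script_lines:
        results.append((character, round(words_so_far / WORDS_PER_MINUTE, 2)))
        words_so_far += len(text.split())
    return results

print(estimate_timestamps([
    ("A", "Listen, I've been thinking about the aluminum siding business."),
    ("B", "You and me both."),
]))
```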

I have now moved to the Q corner of the Trifecta Checkup. The article is motivated by the #OscarSoWhite controversy from a year or two ago, although by the second paragraph the race angle has already been dropped in favor of gender, and by the end of the project readers will also have learned about ageism, but the issue of race never returns. Race didn't come back because race is not easily discerned from a movie script, nor is it clearly labeled in a resource such as IMDB. So the designers delivered a better solution to a lesser problem, instead of a lesser solution to a better problem.

In the last part of the project, the authors tackle ageism. Here we find another pretty picture:

[Figure: Polygraphfilm_ageanalysis]

At a high level, the histograms tell us that movie producers prefer younger actresses (in their 20s) and middle-aged actors (in their forties and fifties). It is certainly not my experience that movies have a surplus of older male characters. But one must be very careful interpreting this analysis.

The importance of actors and actresses is measured by the number of words in the scripts, while the ages analyzed are the real ages of the actors and actresses, not the ages of the characters they play.

Tom Cruise is still making action movies, and he's playing characters much younger than he is. A more direct question to ask here is: does Hollywood prefer to put younger rather than older characters on screen?

Since the raw data are movie scripts, the authors took the character names, matched them to real actors and actresses via IMDB, and then obtained the ages listed on IMDB. This is the standard "scrape-and-merge" method executed by newsrooms everywhere in the name of data journalism. It often creates data that are only marginally relevant to the problem.
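The pipeline in miniature (the word count below is invented; the actor and dates are real): join the script's word counts to an IMDB-style cast table, then derive the actor's, not the character's, age.

```python
import pandas as pd

words = pd.DataFrame({
    "movie": ["Tin Men"], "character": ["BILL"], "words": [4200]  # invented
})
cast = pd.DataFrame({
    "movie": ["Tin Men"], "character": ["BILL"],
    "actor": ["Richard Dreyfuss"], "birth_year": [1947],
})

merged = words.merge(cast, on=["movie", "character"], how="left")
# The actor's age at the film's 1987 release - not the character's age,
# which is exactly the mismatch discussed above.
merged["age_at_release"] = 1987 - merged["birth_year"]
print(merged[["actor", "words", "age_at_release"]])
```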


Chopped legs, and abridged analyses

Reader Glenn T. was not impressed by the graphical talent on display in the following column chart (and others) in a Monkey Cage post in the Washington Post:

[Figure: Wp_trumpsupporters1]

Not starting column charts at zero is like having one's legs chopped off. Here's an animated gif to show what's taking place: (you may need to click on it to see the animation)

[Figure: Wp_trumpassistance]

Since all four numbers show up on the chart itself, there is no need to consult the vertical axis.

I wish they had used a structured color coding to aid fast comprehension of the key points.

***

These authors focus their attention on the effect of the "black or white cue," but the other effect - Trump supporters versus non-supporters - is many times as big.

Notice that, on average, 56% of Trump supporters in this study oppose mortgage assistance while 25% of non-Trump supporters oppose it - a gap of about 30 percentage points.

If we are to interpret the roughly +/- 5% swing attributed to black/white cues as "racist" behavior on the part of Trump supporters, then the +/- 3% swing on the part of non-Trump supporters in the other direction should be regarded as a kind of "reverse racist" behavior. No?

So from this experiment, one should not conclude that Trump voters are racist, which is what the authors imply. Trump voters have many reasons to oppose mortgage assistance, and a racist reaction to pictures of black and white people plays only a small part in it.

***

The reporting of the experimental results irks me in other ways.

The headline claimed that "we showed Trump voters photos of black and white Americans." That is a less-than-accurate description of the experiment and the subsequent analysis. The authors removed all non-white Trump voters from the analysis, so they are only talking about white Trump voters.

Also, I really, really dislike the following line:

"When we control for age, income, sex, education, party identification, ideology, whether the respondent was unemployed, and perceptions of the national economy — other factors that might shape attitudes about mortgage relief — our results were the same."

Those are eight variables they looked into, for which they provided zero details. If they investigated "interaction" effects, even just between pairs of variables, that would add another 28 dimensions (8 choose 2) for which they provided zero information.

The claim that "our results were the same" tells me nothing! It is hard for me to imagine that the set of 8+28 variables described above yielded exactly zero insights.

Even if there were no additional insights, I would still like to see the more sophisticated analysis that controls for all those variables that, as they admitted, shape attitudes about mortgage relief. After all, the results are "the same" so the researcher should be indifferent between the simple and the sophisticated analyses.

In the old days of printed paper, I could understand why journal editors were reluctant to print all those analyses. In the Internet age, we should put those analyses online, with a link to supplementary materials for those who want to dig deeper.

***

On average, 56 percent of white Trump voters oppose mortgage relief. Add another 3-5 percentage points (the range reflects rounding) if they were cued with an image of a black person. The trouble here is that 90% of the white Trump-voting respondents could have been unaffected by the racial cue and the result would still hold.
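The arithmetic, with illustrative numbers: a 4-point average shift is consistent with 90% of respondents not reacting at all.

```python
# If 10% of respondents swing 40 points and the other 90% don't move,
# the average still shifts by 4 points. Illustrative numbers only.
share_affected, swing = 0.10, 40
average_shift = share_affected * swing + (1 - share_affected) * 0
print(average_shift)  # 4.0 points, consistent with the reported effect
```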

While the effect may be "statistically significant" (implied but not stated by the authors), it represents a small shift in the average attitude. The fact that the "average person" responded to the racial cue does not imply that most people responded to it.

The last two issues I raised here are not specific to this particular study. They are prevalent in the reporting of psychological experiments.

Is this chart rotten?

Some students pointed me to a FiveThirtyEight article about Rotten Tomatoes scores that contains the following chart: (link to original)

[Figure: Hickey-rtcurve-3]

This is a chart that makes my head spin. Too much is going on, and all the variables in the plot are tangled with each other. Even after looking at it for a while, I still don't understand how the author looked at the above and drew this conclusion:

"Movies that end up in the top tier miss a step ahead of their release, mediocre movies stumble, and the bottom tiers fall down an elevator shaft."

(Here is the article. It's a great concept but a somewhat disappointing analysis coming from Nate Silver's site. I have written features for them before, so I know they ask good questions. Maybe they should apply the same level of rigor they use in editing feature writers to editing staff writers.)