A pretty good chart ruined by some naive analysis

The following chart showing wage gaps by gender among U.S. physicians was sent to me via Twitter:


The original chart was published by the Stat News website (link).

I am most curious about the source of the data. It apparently came from a website called Doximity, which collects data from physicians. Here is a link to the PR release related to this compensation dataset. However, the data are not freely available. The claim is that the data come from self-reports by 36,000 physicians.

I am not sure whether I trust this data. For example:


Do I believe that physicians in North Dakota earn the highest salaries on average in the nation? And not only that, they earn almost 30% more than the average physician in New York. Does the average physician in ND really earn over $400K a year? If you are wondering, the second-highest salary number comes from South Dakota, and then Idaho. Also, these high-salary states show some of the smallest gender wage gaps.

I suspect that sample size is an issue. They do not report sample sizes at the level of their analyses. They apparently published statistics at the level of MSAs. There are roughly 400 MSAs in the U.S., so at that level, they have on average only 90 respondents per MSA. When split by gender, the average sample size is under 50. Because they are comparing differences between two small samples, we should be shown standard errors. And since they are making hundreds of such comparisons, some kind of multiple-comparisons correction is needed.
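To see why these sample sizes matter, here is a rough back-of-the-envelope sketch in Python (the salary standard deviation is my assumption, not anything from Doximity):

```python
import math

# Back-of-the-envelope numbers; the standard deviation is an assumption
respondents = 36_000
msas = 400
per_msa = respondents / msas          # ~90 respondents per MSA
per_gender = per_msa / 2              # ~45 per gender per MSA

sd = 100_000  # assumed spread of physician salaries, in dollars

# Standard error of the difference between two group means
se_diff = math.sqrt(sd**2 / per_gender + sd**2 / per_gender)
print(round(se_diff))  # ~21,000, i.e. a 95% interval of roughly +/- $41K
```

With an uncertainty band that wide on each MSA-level gap, many of the reported differences could easily be noise.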

I am pretty sure some of you are doctors, or work in health care. Do those salary numbers make sense? Are you moving to North/South Dakota?


Turning to the Visual corner of the Trifecta Checkup (link), I have a mixed verdict. The hover-over effect showing the precise values on both axes is a nice idea, well executed.

I don't see the point of drawing a circle inside a circle. The wage gap is already on the vertical axis, and the redundant representation as dual circles adds nothing. Because of this construct, the size of the bubbles now encodes the male average salary, taking attention away from the gender gap, which is the point of the chart.

I also don't think the regional analysis (conveyed by the colors of the bubbles) is producing a story line.


This is another instance of a dubious analysis in this "big data" era. The analyst makes no attempt to correct for self-reporting bias, and proceeds as if the dataset were complete. There is no indication of any concern about sample sizes, even as the analyst drills down to finer slices of the dataset. There are other variables available, such as specialty, and other variables that can be merged in, such as local income levels, any of which may explain a portion of the gender wage gap; yet no attempt has been made to incorporate them. We are stuck with a bivariate analysis that does not control for any other factors.

Last but not least, the analyst draws a bold conclusion from the overly simplistic analysis. Here, we are told: "If you want that big money, you can't be a woman." (link)


P.S. The Stat News article reports that the researchers at Doximity claimed to have controlled for "hours worked and other factors that might explain the wage gap." However, Doximity's own report says nothing about how these controls were implemented.


Attractive, interactive graphic challenges lazy readers

The New York Times spent a lot of effort making a nice interactive graphical feature to accompany their story about Uber's attempt to manipulate its drivers. The article is here. Below is a static screenshot of one of the graphics.


The illustrative map at the bottom is exquisite. It has Uber cars driving around, it has passengers waiting at street corners, the cars pick up passengers, new passengers appear, etc. There are also certain oddities: all the cars go at the same speed, some strange things happen when cars visually run into each other, etc.

This interactive feature is mostly concerned with entertainment. I don't think it is possible to infer either of the two metrics listed above the chart by staring at the moving Uber cars. The metrics are the percentage of Uber drivers who are idle and the average number of minutes that a passenger waits. Those two metrics are crucial to understanding the operational problem facing Uber planners. You can increase the number of Uber cars on the road to reduce average waiting time but the trade-off is a higher idle rate among drivers.
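The trade-off described above can be sketched with the classic Erlang C queueing formula; the arrival and service rates below are made-up numbers, purely for illustration:

```python
import math

def erlang_c(servers: int, load: float) -> float:
    """Probability that an arriving passenger must wait (Erlang C formula)."""
    num = load**servers / math.factorial(servers) * servers / (servers - load)
    den = sum(load**k / math.factorial(k) for k in range(servers)) + num
    return num / den

# Made-up numbers: 60 ride requests per hour; each trip ties up a driver
# for half an hour, so one driver can serve 2 trips per hour
arrival_rate = 60.0
trips_per_driver_hour = 2.0
load = arrival_rate / trips_per_driver_hour   # 30 "busy drivers" worth of demand

for drivers in (32, 36, 40):
    wait_prob = erlang_c(drivers, load)
    avg_wait_min = wait_prob / (drivers * trips_per_driver_hour - arrival_rate) * 60
    idle_pct = (1 - load / drivers) * 100
    print(f"{drivers} drivers: avg wait {avg_wait_min:.1f} min, idle {idle_pct:.0f}%")
```

Adding drivers shrinks the average wait but inflates the idle fraction, which is exactly the tension the two metrics above the Times graphic capture.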


One of the key trends in interactive graphics at the Times is simplification. While a lot of things are happening behind the scenes, there is only one interactive control: the number of drivers in the grid.

The Times is one of the greatest producers of interactive graphics, so I trust that they know what they are doing. In fact, this article describes some comments made by Gregor Aisch, who works at the Times. The gist is: very few readers play with their interactive graphics. Someone else said, "If you make a tooltip or rollover, assume no one will ever see it." I have also heard someone say (I hope this is not merely a voice in my own head): "For every extra button or knob you place on the graphic, you lose another batch of readers." This might be called the law of the interactive knob, analogous to the law of the printed equation in popular book publishing, which stipulates that for every additional equation you print in a book, you lose another batch of readers.

(Note, however, that we are talking about graphics for communications here, not exploratory graphics.)


Several years ago, I introduced the concept of "return on effort" in this blog post. Most interactive graphics are high effort to produce. The question is whether there is enough reward for the readers. 


An enjoyable romp through the movies

Chris P. tipped me about this wonderful webpage containing an analysis of high-grossing movies. The direct link is here.

First, a Trifecta checkup: this thoughtful web project integrates beautifully rendered, clearly articulated graphics with a commendable objective, bringing data to the conversation about gender and race issues in Hollywood. It is an ambitious goal, and the project falls short of achieving it because the data only marginally address the question at hand.

There is some intriguing just-beneath-the-surface interplay between the Q (question) and D (data) corners of the Trifecta, which I will get to in the lower half of this post. But first, let me talk about the Visual aspect of the project, which for the most part, I thought, was well executed.

The leading chart is simple and clear, setting the tone for the piece:


I like the use of color here. The colored chart titles are inspired. I also like the double color coding: notice that the proportion data are coded not just in the lengths of the bar segments but also in their opacity. There is some messiness in the right-hand-side labeling of the first chart, but that is probably just a bug.

This next chart also contains a minor delight: upon scrolling to the following dot plot, the reader finds that one of the dots has been labeled; this is a signal to readers that they can click on the dots to reveal the "tooltips". It's a little thing but it makes a world of difference.


I also enjoy the following re-imagination of those proportional bar charts from above:


This form fits well with the underlying data structure (a good example of setting the V and the D in harmony). The chart shows the proportion of words spoken by male versus female actors over the course of a single movie (Tin Men from 1987 is the example shown here). The chart is centered in an unusual way, making it easy to read exactly when the female characters are allowed to have their say.

There is again a possible labeling hiccup. The middle label says 40th minute, which would imply the entire movie is only 80 minutes long. (A quick check shows Tin Men runs 110 minutes.) It seems that they are only concerned with dialog, ignoring all moments of soundtrack or silence. The visualization would be even more interesting if those non-dialog moments were presented.


The reason why the music and silence are missing has more to do with practicality than will. The raw materials (Data) are movie scripts. The authors, much to their merit, acknowledge many of the problems that come with this data, starting with the fact that directors make edits to the scripts. It is also not clear how to locate each line along the duration of the movie. An assumption about the speed of dialog seems to be required.
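A minimal sketch of what that timing assumption might look like (the speaking rate and the script lines are invented, not from the project):

```python
# Assign each script line an estimated timestamp, assuming a constant
# speaking rate. The rate and the toy script below are pure assumptions.
WORDS_PER_MINUTE = 150

script = [
    ("SPEAKER 1", "Did you see that guy back there?"),
    ("SPEAKER 2", "I saw him."),
    ("SPEAKER 1", "He backed right into me. Right into me!"),
]

elapsed_words = 0
for speaker, line in script:
    minute = elapsed_words / WORDS_PER_MINUTE
    print(f"{minute:5.2f} min  {speaker}: {line}")
    elapsed_words += len(line.split())
```

Under this scheme, any stretch of music or silence simply vanishes, which is presumably why the chart cannot show non-dialog moments.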

I have now moved to the Q corner of the Trifecta checkup. The article is motivated by the #OscarsSoWhite controversy from a year or two ago, although by the second paragraph, the race angle has already been dropped in favor of gender, and by the end of the project, readers will have learned about ageism as well, but the issue of race never returns. That is because race is not easily discerned from a movie script, nor is it clearly labeled in a resource such as IMDB. So the designers provided a better solution to a lesser problem, instead of a lesser solution to a better problem.

In the last part of the project, the authors tackle ageism. Here we find another pretty picture:


At the high level, the histograms tell us that movie producers prefer younger actresses (in their 20s) and middle-aged actors (forties and fifties). It is certainly not my experience that movies have a surplus of older male characters. But one must be very careful interpreting this analysis.

The importance of actors and actresses is being measured by the number of words in the scripts while the ages being analyzed are the real ages of the actors and actresses, not the ages of the characters they are playing.

Tom Cruise is still making action movies, and he's playing characters much younger than he is. A more direct question to ask here is: does Hollywood prefer to put younger rather than older characters on screen?

Since the raw data are movie scripts, the authors took the character names, and translated those to real actors and actresses via IMDB, and then obtained their ages as listed on IMDB. This is the standard "scrape-and-merge" method executed by newsrooms everywhere in the name of data journalism. It often creates data that are only marginally relevant to the problem.




The state of the art of interactive graphics

Scott Klein's team at ProPublica published a worthy news application, called "Hell and High Water" (link). I took some time taking in the experience. It's a project that needs room to breathe.

The setting is Houston, Texas, and the subject is what happens when the next big hurricane hits the region. The reference point is Hurricane Ike, which hit Galveston in 2008.

This image shows the depth of flooding at the height of the disaster in 2008.


The app takes readers through multiple scenarios. This next image depicts what would happen (according to simulations) if a storm similar to Ike, but with 15 percent stronger winds, were to hit Galveston.


One can also speculate about what might happen if the so-called "Mid Bay" solution is implemented:


This solution is estimated to cost about $3 billion.


I am drawn to this project because the designers liberally use some things I praised in my summer talk at the Data Meets Viz conference in Germany.

Here is an example of hover-overs used to annotate text. (My mouse is on the words "Nassau Bay" at the bottom of the paragraph. Much of the Bay would be submerged at the height of this scenario.)


The design has a keen awareness of foreground/background issues. The map uses sparse static labels, indicating the most important landmarks. All other labels are hidden unless the reader hovers over specific words in the text.

I think plotting population density would have been more impactful. With the current set of labels, the perspective is focused on business and institutional impact. I think there is a missed opportunity to highlight the human impact. This can be achieved by coding population density into the map colors. I believe the colors on the map currently represent terrain.


This is a successful interactive project. The technical feats are impressive (read more about them here). A lot of research went into the articles; huge amounts of details are included in the maps. A narrative flow was carefully constructed, and the linkage between the text and the graphics is among the best I've seen.

Rethinking the index data, with modesty and clarity in mind

I discussed the rose chart used in the Environmental Performance Index (EPI) report last week. This type of data is always challenging to visualize.

One should start with an objective. If the goal is a data dump, that is to say, all you want is to deliver the raw data in its full glory to the user, then you should just print a set of data tables. This has traditionally been the delivery mechanism of choice.

If, on the other hand, your interest is communicating insights, then you need to ask some interesting questions. One such question is how do different regions and/or countries compare with each other, not just in the overall index but also in the major sub-indices?

Learning to ask such a question requires first understanding the structure of the data. As described in the previous post, the EPI is a weighted average of a bunch of sub-indices. Each sub-index measures "distance to a target," which is then converted into a scale from 0 to 100. This formula guarantees that at the aggregate level, the EPI is not going to be 0 or 100: a country would have to score 100 on all sub-indices to attain EPI perfection!
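The index structure just described can be sketched as follows (the weights, indicator names, and scores are invented for illustration; the real EPI has its own weighting scheme):

```python
# Sketch of a weighted-average index like the EPI.
# Weights and scores below are invented, purely for illustration.

def to_score(value: float, worst: float, target: float) -> float:
    """Convert a raw indicator to a 0-100 'distance to target' score."""
    score = (value - worst) / (target - worst) * 100
    return max(0.0, min(100.0, score))

weights = {"air_quality": 0.4, "water": 0.3, "biodiversity": 0.3}
scores = {
    "air_quality": to_score(62, worst=0, target=100),   # 62.0
    "water": to_score(80, worst=0, target=100),         # 80.0
    "biodiversity": to_score(45, worst=0, target=100),  # 45.0
}

epi = sum(weights[k] * scores[k] for k in weights)
print(round(epi, 1))  # 62.3
```

Because the aggregate is a weighted average of bounded scores, a country can only reach 100 overall by scoring 100 on every sub-index, just as the post notes.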

Here is a design sketch to address the question posed above:


For a print version, I chose several reference countries listed at the bottom that span the range of common values. In the final product, hovering over a stripe should disclose a country and its EPI. Then the reader can construct comparisons of the type: "Thailand has a value of 53, which places it between Brazil and China."

The chart reveals a number of insights. Each region stakes out its territory within the EPI scale. There are no European countries with EPI lower than 45 while there are no South Asian countries with EPI higher than 50 or so. Within each region, the distribution is very wide, and particularly so in the East Asia and Pacific region. Europe is clearly the leading region, followed by North America.

The same format can be replicated for every sub-index.

This type of graph addresses a subset of all possible questions, and it does so in a clear way. Modesty in your goals often helps.


I try hard to not hate all hover-overs. Here is one I love

One of the smart things Noah (at WNYC) showed to my class was his NFL fan map, based on Facebook data.

This is the "home" of the visualization:


The fun starts by clicking around. Here are the Green Bay fans on Facebook:


Also, you can see these fans relative to other teams in the same division:


A team like Jacksonville has a tiny footprint:



What makes this visualization work?

Notice the "home" image and those straight black lines. They are the "natural" regions of influence, if you assume that all fans root for the team that they are physically closest to.
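Those straight black lines trace Voronoi cells: every location is assigned to its nearest team. A minimal sketch of that assignment rule (the coordinates are rough, made-up stand-ins for team locations):

```python
# Assign each fan location to its nearest team -- the rule that generates
# the Voronoi regions on the map. Coordinates are rough, made-up
# (latitude, longitude) stand-ins for three NFC North teams.
teams = {
    "Green Bay": (44.5, -88.1),
    "Chicago": (41.9, -87.6),
    "Minnesota": (44.9, -93.3),
}

def nearest_team(lat: float, lon: float) -> str:
    """Return the team with the smallest squared distance to the point."""
    return min(teams, key=lambda t: (teams[t][0] - lat) ** 2
                                    + (teams[t][1] - lon) ** 2)

print(nearest_team(43.0, -89.4))  # a Madison-ish point -> Green Bay
```

(A true geographic version would project the coordinates first; squared lat/lon distance is only a crude stand-in.)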

To appreciate this, you have to look at a more generic NFL fan map (this is one from Deadspin):


This map is informative, but not as informative as it ought to be. The reference points provided here are the state boundaries, but we don't have one NFL team per state. Those "Voronoi" boundaries Noah added are more reasonable reference points to compare against the Facebook fan data.

When looking at the fan map, the most important question is: what is each team's region of influence? This work reminds me of what I wrote before about the Beer Map (link). Putting all beer labels (or NFL teams) onto the same map makes it hard to get quick answers to that question. A small-multiples presentation is more direct, as the reader can see the brands/teams one at a time.

Here, Noah makes use of interactivity to present these small multiples on the same surface. It's harder to compare multiple teams but that is a secondary question. He does have two additions in case readers want to compare multiple teams. If you click instead of mousing over a team, the team's area of influence sticks around. Also, he created tabs so you can compare teams within each division.

I usually hate hover-over effects. They often hide things that readers want (creating what Noah calls "scavenger hunts"). The hover-over effect is used masterfully here to organize the reader's consumption of the data.


Moving to the D corner of the Trifecta checkup. Here is Noah's comment on the data:

Facebook likes are far from a perfect method for measuring NFL fandom. In sparsely-populated areas of the country, counties are likely to have a very small sample size. People who like things on Facebook are also not a perfect cross-section of football fans (they probably skew younger, for example). Other data sources that could be used as proxies for fan interest (but are subject to their own biases) are things like: home game attendance, merchandise sales, TV ratings, or volume of tweets about a team.


Visualizing survey results excellently

Surveys generate a lot of data. And, if you have used a survey vendor, you know they generate a ton of charts.

I was in Germany to attend the Data Meets Viz workshop organized by Antony Unwin. Paul and Sascha from Zeit Online presented some of their work at the German publication, and I was highly impressed by this effort to visualize survey results. (I hope the link works for you. I found that the "scroll" fails on some platforms.)

The survey questions attempted to assess the gap between West and East Germans 25 years after reunification.

The best feature of this presentation is the maintenance of one chart form throughout. This is the general format:



The survey asks whether working mothers are a good thing. They choose to plot how the percent agreeing that working mothers are good changes over time. The blue line represents the East German average and the yellow line the West German average. There is a big gap in attitude between the two sides on this issue, although both regions have seen acceptance of working mothers increase over time.

All the other lines in the background indicate different subgroups of interest. These subgroups are accessible via the tabs on top. They include gender, education level, and age.

The little red "i" conceals some text explaining the insight from this chart.

Hovering over the "Men" tab leads to the following visual:


Both lines for men sit below the respective averages, but the shape is roughly the same. (Clicking on the tab highlights the two lines for men while moving the aggregate lines to the background.)

The Zeit team really does an amazing job keeping this chart clean while still answering a variety of questions.

They did make an important choice: not to put every number on this chart. We don't see the percent disagreeing or those who are ambivalent or chose not to answer the question.


Like I said before, what makes this set of charts work is the seamless transition between one question and the next. Every question is given the same graphical treatment. This eliminates learning time going from one chart to the next.

Here is one using a Likert scale, and accordingly, the vertical axis goes from 1 to 7. They plotted the average score within each subgroup and the overall average:


Here is one where they combined the bottom categories into a "Bottom 2 Box" type metric:



Finally, I appreciate the nice touch of adding tooltips to the series of dots used to aid navigation.


The theme of the workshop was interactive graphics. This effort by the Zeit team is one of the best I have seen. Market researchers take note!


Graphical forms impose assumptions on the data

In a comment to my previous post, reader Chris P. pointed me to the following set of maps, also from the New York Times crew, on the legalization of gay marriage in the U.S. (link)



(For those who did not click through, the orange colors represent two types of bans while the dark gray color indicates legalization.)

These maps are pleasing to the eye, for sure. By portraying every state as a same-sized square, the presentation avoids the usual areal distortion introduced by geographic maps.

But not so fast. Note that each presentation makes its own assumption about the relative importance of states. The typical map weights states according to geographical area, while this presentation assumes that every state has equal weight. Another common cartographic display uses squares of different sizes, scaled to the population of each state.

The locations of the states are necessarily distorted. One way to remedy this is to have hover-over state labels. On a browser, such interactivity works better than having to scroll to the top, where a larger map doubles as the legend.

It would also be interesting to learn about the future. Is there any legislation in the pipeline, either to legalize gay marriage in the remaining orange states or to overturn the legalization laws in the gray states?


PS. [5/6/2015] Here is an alternative presentation of this data by David Mendoza.

Observing Rosling’s Current Visual Style

On the sister blog, I wrote about Hans Rosling’s recent presentation in New York (link). I noted that Rosling has apparently simplified his visual palette.

Rosling is best known as the developer of the Gapminder tool, used to visualize global social statistics collected by national statistical agencies. I wrote favorably about this tool in a series of posts (link). Gapminder popularized the moving bubble chart, although that is not the only graphical form in the tool.


These animated bubble charts also made Rosling a YouTube star (See here.)


In last week’s presentation, Rosling showed only one moving bubble chart. The rest of his graphics were noticeably simpler, the kind of thing anyone can produce in Excel or PowerPoint. Here is one example:


I’m particularly impressed by a simple sequence of charts in which Rosling explains the demographic changes the world is expecting to see in the next 50 to 100 years.


This is an enhanced area chart. Each slice of area is subdivided into stick figures so that an axis for population counts becomes unnecessary.

Instead, the reader sees two useful dimensions: region of the world, and age group.

How the population ages as it grows is the feature story and the effect of aging is ingeniously portrayed as layers. This becomes apparent as Rosling lets time roll forward, and the layers literally walk off the page. (Unfortunately, I couldn't capture each step fast enough.)


 (This photo courtesy of Daniel Vadnais.)

When Rosling showed the 2085 projection, we find that the entire rectangle has filled up, so the world population has definitely grown, roughly by 30 percent. The growth comes from the filling up of the adult layers; the total number of children has not changed. This is one of the key insights from recent demographic data. The first photo above shows something remarkable: the fertility rate in Asian countries has already plunged to about the same level as in developed countries.


This set of charts is unusually effective. It represents another level of simplification in visual means. At the same time, the message is sharpened.

As I reported the other day (link), Rosling does not believe modern tools have improved data analysis. This talk which utilized simple tools is a good demonstration of his point.

Tricky boy William

Last week, I was quite bothered by this chart I produced using the Baby Name Voyager tool.


According to this chart, William has drastically declined in popularity over time. The name was 7 times more popular back in the 1880s compared to the 2010s. And yet, when I hovered over the chart, the rank of William in 2013 was 3. Apparently, William was the 3rd most popular boy name in 2013.

I wrote the nice people at the website and asked if there might be a data quality issue, and their response was:

The data in our Name Voyager tool is correct. While it may be puzzling, there are definitely less Williams in the recent years than there were in the past (1880s). Although the name is still widely popular, there are plenty of other baby names that parents are using. In the past, there were a limited amount of names that parents would choose, therefore more children had the same name.

What bothered me was that the rate had declined drastically while the number of births was increasing, so I was expecting William to drop in rank as well. But their explanation makes a lot of sense: if there is a much wider spread of names in recent times, the rank could indeed remain near the top. It was very nice of them to respond.


There are three ways to present this data series, as shown below. One can show the raw counts of William babies (orange line). One can show the popularity against total births (what Baby Name Wizard shows, blue line). One can show the rank of William relative to all other male baby names (green line). Consider how different these three lines look!


The rate metric (per million births) adjusts for growth in total births. But the blue line is difficult to interpret alongside the orange line: in the period 1900 to 1950, the actual number of William babies went up while the blue line came down. The rank is also tough to interpret, especially in the 1970-2000 period, when it took a dive, a trend not visible in either the raw counts or the adjusted counts.

Adding to the difficulty is the use of the per-million metric. In the following chart, I show three different scales for popularity: per million, per 100,000, and per 100 (i.e. proportion). The raw count is shown up top.


All three blue lines are essentially the same but how readers interpret the scales is quite another matter. The per-million births metric is the worst of the lot. The chart shows values in the 20,000-25,000 range in the 1910s but the actual number of William babies was below 20,000 for a number of years. Switching to per-100K helps but in this case, using the standard proportion (the bottom chart) is more natural.
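The three metrics, and the scale conversions, can be sketched like this (the counts are invented, not actual Social Security data):

```python
# Three ways to express a name's popularity, from invented counts.
# These numbers are made up for illustration, not real SSA data.
william_count = 18_000
total_male_births = 900_000
counts_all_names = {"William": william_count, "James": 21_000, "John": 20_000}

per_million = william_count / total_male_births * 1_000_000   # 20000.0
per_100k = william_count / total_male_births * 100_000        # 2000.0
proportion = william_count / total_male_births                # 0.02

# Rank: position among all male names, highest count first
rank = sorted(counts_all_names.values(), reverse=True).index(william_count) + 1

print(per_million, per_100k, proportion, rank)
```

Note how the per-million scale produces numbers (20,000 and up) that can dwarf the raw count itself, which is exactly the confusion described above; the plain proportion avoids that trap.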


The following scatter plot shows the strange relationship between the rate of births and the rank over time for William babies.


Up to the 1990s, there is an intuitive relationship: as the proportion of Williams among male babies declined, so did the rank of William. Then, in the 1990s and beyond, the relationship flipped. The proportion of Williams among male babies continued to drop, but the rank of William actually recovered!