A gem among the snowpack of Olympics data journalism

It's not often that I come across a piece of data journalism that pleases me this much. The "Happy 700" article by the Washington Post is amazing. Here it is:

Wpost_happy700_map2

 

When data journalism and dataviz are done right, the designers have made good decisions. Here are some of the key elements that make this article work:

(1) Unique

The topic is timely but timeliness heightens both the demand and supply of articles, which means only the unique and relevant pieces get the readers' attention.

(2) Fun

The tone is light-hearted. It's a fun read. A little bit informative - when they describe the towns that few have heard of. The notion is slightly silly but the reader won't care.

(3) Data

It's always a challenge to make data come alive, and these authors succeeded. Most of the data work involves finding, collecting and processing the data. There isn't any sophisticated analysis, but the piece is a powerful demonstration that complex analysis is not always necessary.

(4) Organization

The structure of the data is three criteria (elevation, population, and terrain) by cities. A typical way of showing such data might be an annotated table, or a Bumps-type chart, grouped columns, and so on. All these formats try to stuff the entire dataset onto one chart. The designers chose to highlight one variable at a time, cumulatively, on three separate maps. This presentation fits perfectly with the flow of the writing. 

(5) Details

The execution involves some smart choices. I am a big fan of legend/axis labels that are informative. For example, note that the legend doesn't say "Elevation in Meters":

Wpost_happy700_legend

The color scheme across all three maps shows a keen awareness of background/foreground concerns. 


A pretty good chart ruined by some naive analysis

The following chart showing wage gaps by gender among U.S. physicians was sent to me via Twitter:

Statnews_physicianwages

The original chart was published by the Stat News website (link).

I am most curious about the source of the data. It apparently came from a website called Doximity, which collects data from physicians. Here is a link to the press release related to this compensation dataset. However, the data are not freely available. There is a claim that the data come from self-reports by 36,000 physicians.

I am not sure whether I trust this data. For example:

Stat_wagegapdoctor_1

Do I believe that physicians in North Dakota earn the highest salaries on average in the nation? And not only that, they earn almost 30% more than the average physician in New York. Does the average physician in ND really earn over $400K a year? If you are wondering, the second highest salary number comes from South Dakota. And then Idaho. Also, these high-salary states tend to show the lowest gender wage gaps.

I suspect that sample size is an issue. They do not report sample size at the level of their analyses. They apparently published statistics at the level of MSAs. There are roughly 400 MSAs in the U.S., so on average they have only 90 respondents per MSA (36,000 divided by 400). When split by gender, the average sample size is less than 50. Then, they are comparing differences, so we should see the standard errors. And finally, they are making hundreds of such comparisons, for which some kind of multiple-comparisons correction is needed.
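To make the arithmetic concrete, here is a minimal sketch in Python. The 36,000 responses and roughly 400 MSAs come from the figures above. The even split by gender and the $100K within-group standard deviation are assumptions made purely for illustration (Doximity does not publish them), and the Bonferroni adjustment is just one common multiple-comparisons correction, not something the original report uses.

```python
import math

# Figures taken from the post: 36,000 self-reported responses, roughly 400 MSAs.
total_respondents = 36_000
num_msas = 400

per_msa = total_respondents / num_msas   # average responses per MSA
per_gender = per_msa / 2                 # ASSUMPTION: an even split by gender

# ASSUMPTION: within-group salary standard deviation of $100K (illustrative only).
assumed_sd = 100_000


def se_of_difference(sd, n_a, n_b):
    """Standard error of the difference between two independent group means."""
    return math.sqrt(sd**2 / n_a + sd**2 / n_b)


se_gap = se_of_difference(assumed_sd, per_gender, per_gender)

# Bonferroni: divide the overall significance level by the number of comparisons.
alpha = 0.05
bonferroni_alpha = alpha / num_msas

print(f"average responses per MSA: {per_msa:.0f}")
print(f"average responses per gender per MSA: {per_gender:.0f}")
print(f"standard error of the gap under the assumed SD: ${se_gap:,.0f}")
print(f"Bonferroni-adjusted significance level: {bonferroni_alpha:.6f}")
```

Under these assumed inputs, the standard error of a single MSA's gender gap is roughly $21K, so a reported gap of a few tens of thousands of dollars in one MSA would be hard to distinguish from noise, even before adjusting for the hundreds of comparisons.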

I am pretty sure some of you are doctors, or work in health care. Do those salary numbers make sense? Are you moving to North/South Dakota?

***

Turning to the Visual corner of the Trifecta Checkup (link), I have a mixed verdict. The hover-over effect showing the precise values on either axis is a nice idea, well executed.

I don't see the point of drawing the circle inside a circle. The wage gap is already on the vertical axis, and the redundant representation in dual circles adds nothing to it. Because of this construct, the size of the bubbles is now encoding the male average salary, taking attention away from the gender gap which is the point of the chart.

I also don't think the regional analysis (conveyed by the colors of the bubbles) is producing a story line.

***

This is another instance of a dubious analysis in this "big data" era. The analyst makes no attempt to correct for self-reporting bias, and works as if the dataset is complete. There is no indication of any concern about sample sizes after the analyst drills down to finer slices of the dataset. While there are other variables available, such as specialty, and other variables that can be merged in, such as income levels, all of which may explain at least a portion of the gender wage gap, no attempt has been made to incorporate other factors. We are stuck with a bivariate analysis that does not control for any other factors.

Last but not least, the analyst draws a bold conclusion from the overly simplistic analysis. Here, we are told: "If you want that big money, you can't be a woman." (link)

 

P.S. The Stat News article reports that the researchers at Doximity claimed that they controlled for "hours worked and other factors that might explain the wage gap." However, in Doximity's own report, there is no language confirming how they included the controls.

 


Lines that delight, lines that blight

This WSJ graphic caught my eye. The accompanying article is here.

Wsj_ipo_dealdrought_full

The article (judging from the sub-header) makes two separate points, one about the total amount of money raised in IPOs in a year, and the other about the change in market value of those newly-public companies one year from the IPO date.

The first metric is shown by the size of the bubbles while the second metric is displayed as distances from the horizontal axis. (The second metric is further embedded, in a simplified, binary manner, in the colors of the bubbles.)

The designer has decided that the second metric - performance after IPO - is more important. Therefore, it is much easier for readers to see how each annual cohort of IPOs has performed. The use of color to map to the second metric (and not the first) also helps to emphasize the second metric.

There are details on this chart that I admire. The general tidiness of it. The restraint on the gridlines, especially the horizontal ones. The spatial balance. The annotation.

And ah, turning those bubbles into lollipops. Yummy! Those dotted lines allow readers to find the center of each bubble, which is where the values of the second metric lie. Frequently, these bubble charts are presented without those guiding lines, and it is often hard to find the circles' anchors.

That leaves one inexplicable decision - why did they place two vertical gridlines in the middle of two arbitrary years?


Batmen not as interesting as it seems

When this post appears, I will be on my way to Seattle. Maybe I will meet some of you there. You can still register here.

I held onto this tip from a reader for a while. I think it came from Twitter:

20160326_woc432_1 batman

The Economist found a fun topic but what's up with the axis not starting at zero?

The height x weight gimmick seems cool but on second thought, weight is not the same as girth so it doesn't make much sense!

In the re-design, I use bubbles to indicate weight and vertical location to indicate height. The data aren't as interesting as one might think. All the actors pretty much stayed true to the comic-book ideal, with Adam West being the closest. I also changed the order of the actors.

Redo_batman

I left out the Lego, as it creates a design challenge that does not justify the effort.

 

 


More chart drama, and data aggregation

Robert Kosara posted a response to my previous post.

He raises an important issue in data visualization - the need to aggregate data, and not plot raw data. I have no objection to that point.

What was shown in my original post are two extremes. The bubble chart is high drama at the expense of data integrity. Readers cannot learn any of the following from that chart:

  • the shape of the growth and subsequent decline of the flu epidemic
  • the beginning and ending dates of the epidemic
  • the peak of the epidemic*

* The peak can be inferred from the data label, although there appears to be at least one other circle of approximately equal size, which isn't labeled.

The column chart is low drama but high data integrity. To retain some dramatic element, I encoded the data redundantly in the color scale. I also emulated the original chart in labeling specific spikes.

The designer then simply has to choose a position between these two extremes. This will involve some smoothing or aggregation of the data. Robert showed a column chart that has weekly aggregates, and in his view, his version is closer to the bubble chart.
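For readers who want to try the weekly aggregation themselves, here is a minimal sketch in Python using only the standard library. The daily counts are hypothetical values invented for illustration, not the avian flu data, and grouping by ISO week is one reasonable choice of bucket rather than necessarily the one Robert used.

```python
from collections import defaultdict
from datetime import date

# HYPOTHETICAL daily counts, invented for illustration; not the avian flu data.
daily_counts = {
    date(2015, 5, 4): 3,
    date(2015, 5, 5): 0,
    date(2015, 5, 7): 5,
    date(2015, 5, 11): 2,
    date(2015, 5, 14): 7,
    date(2015, 5, 20): 1,
}


def weekly_totals(counts):
    """Sum daily values into ISO weeks, keyed by (ISO year, ISO week number)."""
    totals = defaultdict(int)
    for day, value in counts.items():
        iso_year, iso_week, _ = day.isocalendar()
        totals[(iso_year, iso_week)] += value
    return dict(sorted(totals.items()))


weekly = weekly_totals(daily_counts)
for (year, week), total in weekly.items():
    print(f"{year}-W{week:02d}: {total}")
```

Each weekly total then becomes one column, so the chart has far fewer marks while the shape of the rise and fall is preserved.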

Robert's version indeed strikes a balance between drama and data integrity, and I am in favor of it. Here is the idea (I am responsible for the added color).

Kosara_avianflu2

***

Where I depart from Robert is how one reads a column chart such as the one I posted:

Redo_avianflu2

Robert thinks that readers will perceive each individual line separately, and in so doing, "details hide the story". When I look at a chart like this, I am drawn to the envelope of the columns. The lighter colors are chosen for the smaller spikes to push them into the background. What might be the problem are those data labels identifying specific spikes; they are a holdover from the original chart--I actually don't know why those specific dates are labeled.

***

In summary, the key takeaway is, as Robert puts it:

the point of this [dataset] is really not about individual days, it’s about the grand totals and the speed with which the outbreak happened.

We both agree that the weekly version is the best among these. I don't see how the reader can figure out grand totals and speed with which the outbreak happened by staring at those dramatic but overlapping bubbles.


Is it worth the drama?

Quite the eye-catching chart this:

Wsj_avianflu

The original accompanied this article in the Wall Street Journal about avian flu outbreaks in the U.S.

The point of the chart appears to be the peak in the flu season around May. The overlapping bubbles were probably used for drama.

A column chart, with appropriate colors, attains much of the drama but retains the ability to read the data.

Redo_avianflu2

 


How to tell if your graphic is underpowered?

Some time ago, this chart showed up in a NYT Magazine (it's about sex):

Nytm_circles

In this composition, the visual element (the circles) has no utility. A self-sufficiency test makes this point clear.

All the data (four numbers) are printed on the original graphic. When they are removed, the reader loses all ability to understand the data.

Nytm_circles_cropped

***

Redo_nytm_circles_1

Even when the first number is revealed, it is impossible to know the values of the others.

If one knows the second (and largest) pink circle represents 58 percent, it is still impossible to guess that the adjacent circle is 40 percent.

Even if both those numbers are provided, it is still impossible to infer the rest without a calculation.

In order to understand this graphic, readers must look at the data labels.
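The calculation in question might look like the sketch below, written in Python. It assumes the circles encode values by area, which is a common convention but is not confirmed for this graphic. The function names and the 100-unit reference diameter are made up for illustration; only the 58 and 40 percent figures come from the graphic.

```python
import math


def implied_value(known_value, known_diameter, other_diameter):
    """Value implied for another circle, ASSUMING value is proportional to area."""
    return known_value * (other_diameter / known_diameter) ** 2


def diameter_for(value, reference_value, reference_diameter):
    """Diameter a circle needs to show `value` under the same area encoding."""
    return reference_diameter * math.sqrt(value / reference_value)


# 58 and 40 percent are from the graphic; the 100-unit diameter is arbitrary.
d58 = 100.0
d40 = diameter_for(40, 58, d58)

print(f"diameter ratio: {d40 / d58:.2f}")
print(f"value recovered from diameters: {implied_value(58, d58, d40):.1f}")
```

Under the area assumption, the 40 percent circle is about 83 percent as wide as the 58 percent circle. A reader would have to measure both diameters and square their ratio to recover the number, which is not something anyone does by eye.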

 

 

***

I made a couple of other versions for comparison.

The first uses the pie chart, which is almost readable without the data labels. 

Redo_nytm_circles_2

The second uses the bar chart, which requires only an axis.

Redo_nytm_circles_3


Observing Rosling’s Current Visual Style

On the sister blog, I wrote about Hans Rosling’s recent presentation in New York (link). I noted that Rosling has apparently simplified his visual palette.

Rosling is best known as the developer of the Gapminder tool, used to visualize global social statistics data collected by national statistical agencies. I wrote favorably about this tool in a series of posts (link). Gapminder made popular the moving bubble chart, although that is not the only graphical form the tool offers.

Gapminder_screengrab

These animated bubble charts also made Rosling a YouTube star (See here.)

***

In last week’s presentation, Rosling showed only one moving bubble chart. The rest of his graphics were noticeably simpler, something that anyone can produce in Excel or PowerPoint. Here is one example:

Image1
 

I’m particularly impressed by a simple sequence of charts in which Rosling explains the demographic changes the world is expecting to see in the next 50 to 100 years.

  Image2

This is an enhanced area chart. Each slice of area is subdivided into stick figures so that an axis for population counts becomes unnecessary.

Instead, the reader sees two useful dimensions: region of the world, and age group.

How the population ages as it grows is the feature story and the effect of aging is ingeniously portrayed as layers. This becomes apparent as Rosling lets time roll forward, and the layers literally walk off the page. (Unfortunately, I couldn't capture each step fast enough.)

Image3

 (This photo courtesy of Daniel Vadnais.)

When Rosling showed the 2085 projection, we found that the entire rectangle had filled up, so the world population has definitely grown, roughly by 30 percent. The growth comes from the adult layers filling up; the total number of children has not changed. This is one of the key insights from recent demographic data. The first photo above shows something remarkable: the fertility rate in Asian countries has already plunged to about the same level as that of developed countries.

***

This set of charts is unusually effective. It represents another level of simplification in visual means. At the same time, the message is sharpened.

As I reported the other day (link), Rosling does not believe modern tools have improved data analysis. This talk, which used simple tools, is a good demonstration of his point.