
Making people jump over hoops

Take a look at the following chart, and guess what message the designer wants to convey:


This chart accompanied an article in the Wall Street Journal about Wells Fargo losing brokers due to the fake account scandal, and using bonuses to lure them back. Like you, my first response to the chart was that little has changed from 2015 to 2017.

The intention behind the whitespace that splits the four columns into two pairs is a bit mysterious. It's not obvious how UBS and Merrill differ from Wells Fargo and Morgan Stanley. The device might have been used to overcome the difficulty of reading four columns side by side.

An additional challenge of this dataset is the outlier values for UBS, which stretch the range of the vertical axis and squeeze together the values of the other three banks.

In this first alternative version, I play around with irregular gridlines.


Grouped column charts are not great at conveying changes over time, as they make our eyes jump over hoops. In the second version, I use a bumps chart to compactly highlight the trends. I also zoom in on the quarterly growth rates.


The rounded interpolation removes the sharp angles of the typical bumps chart (also known as a slopegraph), but it may add patterns that are not in the data. This type of interpolation, however, respects the values at the "knots" (here, the quarterly values), whereas a smoother may move those points. On balance, I like this treatment.


PS. [6/2/2017] Given the commentary below, I am including the straight version of the chart, so you can compare. The straight-line version is more precise. One aspect of this chart form I dislike is the sharp angles. When there are more lines, it gets very entangled.


Shocker: ease of use requires expanding, not restricting, choices

Recently, I noted how we have to learn to hate defaults in data visualization software. I was reminded again of this point when reviewing this submission from long-time reader & contributor Chris P.


The chart is included in this Medium article, which credits Mott Capital Management as the source.

Look at the axis labels on the right side. They have the hallmarks of software defaults. The software designer decided that the axis labels will be formatted in exactly the same way as the data in that column: this means $XXX.XXB, with two decimal places. The same formatting rule is in place for the data labels, shown in boxes.

Why put tick marks at the odd intervals, 37.50, 62.50, 87.50, ... ? What's wrong with 40, 60, 80, 100, ...? It comes down to machine thinking versus human thinking.
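Chart libraries that do this well compute a "nice" tick step instead of mechanically dividing the range into equal parts. A minimal sketch of the idea in Python (this is a generic algorithm, not the code of any particular package):

```python
import math

def nice_ticks(lo, hi, max_ticks=6):
    """Choose 'human' tick positions: the step is 1, 2, or 5 times a
    power of ten, so labels land on round numbers like 40, 60, 80
    rather than machine values like 37.50, 62.50."""
    raw_step = (hi - lo) / (max_ticks - 1)
    magnitude = 10.0 ** math.floor(math.log10(raw_step))
    for factor in (1, 2, 5, 10):
        step = factor * magnitude
        if step >= raw_step:
            break
    start = math.floor(lo / step) * step   # snap the first tick to the grid
    ticks = []
    t = start
    while t <= hi + 1e-9:
        ticks.append(t)
        t += step
    return ticks

nice_ticks(30, 110)  # [20.0, 40.0, 60.0, 80.0, 100.0]
```

A machine dividing the same range into equal quarters would happily produce 30, 50, 70, 90, 110 or worse; snapping the step to 1-2-5 multiples of a power of ten is what turns the labels into numbers a human can do arithmetic with.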

This software places the most recent values into data labels, formatted as boxes that point to the positions of those values on the axis. Evidently, it doesn't have a plan for overcrowding. At the bottom of the axis, we see four labels for six lines. The blue, pink and orange labels point to the wrong places on the axis.

Worse, it's unclear what those "most recent" values represent. I have added gridlines for each year on the excerpt shown right. The lines extend to 2017, which isn't even half over.

Now, consider the legend. Which version do you prefer?


Most likely, the original dataset has columns named "Amazon.com Revenue (TTM)", "Dillard's Revenue (TTM)", etc. so the software just picks those up and prints them in the legend text.
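A user-friendly tool would make it a one-step edit to trim that repetitive suffix before printing the legend. A hypothetical sketch, assuming column headers of the form quoted above:

```python
import re

def legend_label(column_name):
    """Drop the shared ' Revenue (TTM)' suffix; the company name alone
    identifies each line, and the suffix belongs in the chart title."""
    return re.sub(r"\s*Revenue \(TTM\)\s*$", "", column_name)

columns = ["Amazon.com Revenue (TTM)", "Dillard's Revenue (TTM)"]
[legend_label(c) for c in columns]  # ['Amazon.com', "Dillard's"]
```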


The chart is an output from YCharts, which I learned is a Bloomberg terminal competitor. It probably uses one of the available Web graphing packages out there. These packages typically emphasize ease of use through automating the process of data visualization. Ease of use is defined as rigid defaults that someone determines are the optimal settings. Users then discover that there is no getting around those settings; in some cases, a coding interface is available, which defeats the goal of user-friendliness.

The problem lies in defining what ease of use means. Ease of use should require expanding, not restricting, choices. Setting rigid defaults restricts choices. In addition to providing good defaults, the software designer should make it simple for users to make their own choices. Ideally, each of the elements (data labels, gridlines, tick marks, etc.) can be independently removed, shifted, expanded, reduced, re-colored, edited, etc. from their original settings.

Canadian winters in cold gray

I was looking at some Canadian data graphics while planning my talk in Vancouver this Thursday (you can register for the free talk here). I love the concept behind the following chart:


Based on the forecasted temperature for 2015 (specifically, the temperature on Christmas Eve), a reporter for the National Post asked whether the winter of 2015 would be colder or warmer than the winters on record since 1990. The accompanying article is here.

The presentation of small multiples encourages readers to examine that question city by city. It is more challenging to discover larger patterns.

Here is a sketch of a different take that attempts to shed light on regional and temporal patterns:


You can see that the western and central cities were warmer in the past while the eastern cities were colder in the past.

Also, there were some particularly cold years (1996, 1998, 2008, and 2012) when most of the featured cities experienced a freeze.

I am not sure why certain cities had no record of their temperature in certain years (machine malfunction?). In fact, one flaw in the original chart is the confusing legend that maps the grey color to "Data Unavailable" when most of the columns shown are grey. 


A pretty good chart ruined by some naive analysis

The following chart showing wage gaps by gender among U.S. physicians was sent to me via Twitter:


The original chart was published by the Stat News website (link).

I am most curious about the source of the data. It apparently came from a website called Doximity, which collects data from physicians. Here is a link to the PR release related to this compensation dataset. However, the data is not freely available. The claim is that the data come from self-reports by 36,000 physicians.

I am not sure whether I trust this data. For example:


Do I believe that physicians in North Dakota earn the highest salaries on average in the nation? And not only that, they earn almost 30% more than the average physician in New York. Does the average physician in ND really earn over $400K a year? If you are wondering, the second highest salary number comes from South Dakota. And then Idaho.  Also, these high-salary states are correlated with the lowest gender wage gaps.

I suspect that sample size is an issue. They do not report sample size at the level of their analyses. They apparently published statistics at the level of MSAs. There are roughly 400 MSAs in the U.S. so at that level, on average, they have only 90 samples per MSA. When split by gender, the average sample size is less than 50. Then, they are comparing differences, so we should see the standard errors. And finally, they are making hundreds of such comparisons, for which some kind of multiple-comparisons correction is needed.
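The back-of-envelope arithmetic above can be made explicit. A sketch, assuming (as the paragraph does) that the 36,000 reports are spread evenly over roughly 400 MSAs, and assuming a purely hypothetical salary standard deviation of $100K:

```python
import math

n_total = 36_000               # self-reported physicians (claimed)
n_msas = 400                   # rough count of U.S. MSAs
per_msa = n_total / n_msas     # about 90 reports per MSA
per_gender = per_msa / 2       # under 50 once split by gender

# Standard error of the male-female difference in mean salary,
# with a hypothetical SD of $100K in each group of ~45:
sd = 100_000
se_diff = math.sqrt(sd**2 / per_gender + sd**2 / per_gender)  # ~ $21K

# With ~400 MSA-level comparisons, a Bonferroni correction shrinks
# the per-test significance level from 0.05 to 0.05 / 400:
alpha_adj = 0.05 / n_msas
```

Under these assumptions, a standard error above $20K means many of the reported city-level gaps could be pure noise, which is exactly why the missing sample sizes matter.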

I am pretty sure some of you are doctors, or work in health care. Do those salary numbers make sense? Are you moving to North/South Dakota?


Turning to the Visual corner of the Trifecta Checkup (link), I have a mixed verdict. The hover-over effect showing the precise values on both axes is a nice idea, well executed.

I don't see the point of drawing a circle inside a circle. The wage gap is already on the vertical axis, and the redundant representation in dual circles adds nothing. Because of this construct, the size of the bubbles now encodes the male average salary, taking attention away from the gender gap, which is the point of the chart.

I also don't think the regional analysis (conveyed by the colors of the bubbles) is producing a story line.


This is another instance of a dubious analysis in this "big data" era. The analyst makes no attempt to correct for self-reporting bias, and works as if the dataset were complete. There is no indication of any concern about sample sizes as the analyst drills down to finer slices of the dataset. Other available variables, such as specialty, and other variables that could be merged in, such as local income levels, may explain at least a portion of the gender wage gap, yet no attempt has been made to incorporate them. We are stuck with a bivariate analysis that controls for nothing.

Last but not least, the analyst draws a bold conclusion from the overly simplistic analysis. Here, we are told: "If you want that big money, you can't be a woman." (link)


P.S. The Stat News article reports that the researchers at Doximity claimed to have controlled for "hours worked and other factors that might explain the wage gap." However, Doximity's own report contains no description of how those controls were included.


It's your fault when you use defaults

The following chart showed up on my Twitter feed last week. It's a cautionary tale for using software defaults.


At first glance, the stacking of years in a bar chart makes little sense. This is particularly so when there appears to be no interesting annual trend: the four segments seem to have roughly equal length almost everywhere.

This designer might be suffering from what I have called "loss aversion" (link). Loss aversion in data visualization is the fear of losing your data, which causes people to cling to every little bit of data they have.

Several challenges of the chart come from the software defaults. The bars are ordered alphabetically, making it difficult to discern a trend. The horizontal axis labels are given in single dollars and units, and yet the intention of the designer is to use millions, as indicated in the chart titles.
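Neither default is hard to override once the tool exposes the setting. A sketch with made-up genre totals (the article's actual numbers are not reproduced here):

```python
def fmt_millions(x):
    """Label 9_800_000 as '$9.8M' instead of the default '9800000'."""
    return f"${x / 1_000_000:.1f}M"

# Hypothetical revenue totals by genre, in dollars
sales = {
    "Adult Fiction": 4_200_000,
    "Education": 9_800_000,
    "Adult Non-Fiction": 5_100_000,
}

# Order the bars by value rather than the default alphabetical order,
# so the ranking is visible at a glance
ordered = sorted(sales.items(), key=lambda kv: kv[1], reverse=True)
for genre, total in ordered:
    print(f"{genre:18s}{fmt_millions(total)}")
```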

The one horrifying feature of this chart is the 3D effect. The third dimension contains no information at all. In fact, it destroys information, as readers who use the vertical gridlines to estimate the lengths of the bars will be sadly misled. As shown below, readers must draw imaginary lines to figure out the horizontal values.


The Question of this chart is the distribution of book sales (revenues and units) across different genres. When the designer chose to stack the bars (i.e. sum the yearly data), he or she decided that the details of specific years are not as important as the total - this is the right conclusion since the bar segments have similar lengths within each genre.

So let's pursue the solution of averaging the data, plotting average yearly sales.
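With hypothetical numbers, the transformation is a one-liner:

```python
# Hypothetical unit sales by genre over four years, in millions of units
yearly_units = {
    "Education":     [1.1, 1.2, 1.0, 1.1],
    "Adult Fiction": [5.3, 5.1, 5.4, 5.2],
}

# Replace the four stacked segments per bar with one average value
avg_units = {genre: sum(v) / len(v) for genre, v in yearly_units.items()}
```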


This chart shows that there are two major types of genres. In the education world, the unit prices of (text)books are very high; unit sales are relatively small, but the aggregate dollar revenues are high. In the "adult" world, whether fiction or non-fiction, the unit price is low and the number of units is high, resulting in total dollar revenues similar to the education genres.


Simple lesson here: learn to hate software defaults.