Two good charts can use better titles

NPR has this chart, which I like:

Npr_votersgunpolicy

It's a set of small-multiple bump charts. Nice, clear labels. No unnecessary elements like axis labels. Intuitive organization into Major Factor, Minor Factor, and Not a Factor.

Above all, the data convey a strong, surprising message - despite many high-profile gun violence incidents this year, some Democratic voters are actually much less likely to see guns as a "major factor" in deciding their vote!

Of course, the overall importance of gun policy is down, but the story of the chart is really the collapse on the Democratic side, in a matter of two months.

The one thing missing from this chart is a nice, informative title: In two months, gun policy went from a major to a minor issue for some Democratic voters.

***

I am impressed by this Financial Times effort:

Ft_millennialunemploy

The key here is the analysis. Most lazy analyses compare millennials to other generations at their current ages, but this analyst looked at each generation at the same age range of 18 to 33 (i.e., controlling for age).

Again, the data convey a strong message - millennials have significantly higher un(der)employment than previous generations had at the same age range. Similar to the NPR chart above, the overall story is not nearly as interesting as the specific story - it is the pink area ("not in labour force") that is driving this trend.

Specifically, the millennial unemployment rate is high because the proportion of people classified as "not in labour force" doubled in 2014, compared to all previous generations depicted here. I really like this chart because it lays waste to a prevailing theory spread by reputable economists - that somehow, after the Great Recession, demographic trends are causing the explosion in people classified as "not in labour force". These people are nobodies when it comes to computing the unemployment rate. They literally do not count! There is simply no reason why someone who just graduated from college should not be in the labour force by choice. (Dean Baker has a discussion of the theory that people not wanting to work is a long-term trend.)

The legend would be better placed to the right of the columns, rather than the top.

Again, this chart benefits from a stronger headline: BLS finds millennials are twice as likely as previous generations to have dropped out of the labour force.


Fantastic visual, but the Google data need some pre-processing

Another entry in the Google Newslab data visualization project that caught my eye is the "How to Fix It" project, illustrating search queries across the world that ask "how." The project web page is here.

The centerpiece of the project is an interactive graphic showing queries related to how to fix home appliances. Here is what it looks like for France. (It's always instructive to think about how they count "France" queries. Is it queries from google.fr? Queries written in French? Queries from an IP address in France? Some combination of the above?)

Howtofixit_france_appliances

I particularly appreciate the lack of labels. When we see the pictures, we don't need to be told this is a window and that is a door. The search data are encoded in the relative sizes of the appliances. The red dotted lines show the relative popularity of searches for the respective appliances in aggregate.

By comparison, the Russian picture looks very different:

Howtofixit_russia_appliances

Are the Russians more sensible? Their searches are far and away about the washing machine, which is the most complicated piece of equipment on the graphic.

At the bottom of the page, the project looks at other queries, such as those related to cooking. I find it fascinating to learn what people need help making:

Howtofixit_world_cooking

I have to confess that I searched for "how to make soft boiled eggs". That led me to a lot of different webpages, mostly created for people who search for how to make a soft boiled egg. All of them contain lots of advertising, and the answer boils down to cook it for 6 minutes.

***

The Russia versus France comparison brings out a perplexing problem with the "Data" in this visualization. For competitive reasons, Google does not provide data on search volume. The so-called Search Index is what is being depicted. The Search Index uses the top-ranked item as the reference point (100). In the Russian diagram, the washing machine has Search Index of 100 and everything else pales in comparison.

In the France example, the window is the search item with the greatest number of searches, so it has Search Index of 100; the door has Index 96, which means it has 96% of the search volume of the window; the washing machine with Index 49 has about half the searches of the window.

The numbers cannot be interpreted as proportions. The Index of 49 does not mean that washing machines account for 49% of all French queries about fixing home appliances. That is the meaning of popularity we really want, but don't have. We can obtain true popularity measures by "normalizing" the Search Index: sum the Index values of all the appliances, then divide each Search Index by that sum. After normalizing, the numbers can be interpreted as proportions, and they add up to 100% for each country. When not normalized, the indices do not add up to 100%.

Take the case in which we have five appliances, and let's say all five appliances are equally popular, comprising 20% of searches each. The five Search Indices will all be 100 because the top-ranked item is given the value of 100. Those indices add to 500!

By contrast, in the case of Russia (or a more extreme case), the top-ranked query is almost 100% of all the searches, so the sum of the indices will be only slightly larger than 100.
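The normalization can be sketched in a few lines of Python. The France index values (100, 96, 49) come from the example above; the Russian values below the top-ranked 100 are made up for illustration:

```python
# Convert Google Search Index values (top-ranked item = 100) into
# proportions that sum to 100% within each country.
def normalize(indices):
    total = sum(indices.values())
    return {item: idx / total for item, idx in indices.items()}

# France values are from the example above; Russian values below the
# top-ranked 100 are illustrative.
france = {"window": 100, "door": 96, "washing machine": 49}
russia = {"washing machine": 100, "window": 8, "door": 5}

for country in (france, russia):
    shares = normalize(country)
    print({item: f"{share:.0%}" for item, share in shares.items()})

# The same Index of 100 translates into very different popularity:
# the window is ~41% of French searches, while the washing machine
# is ~88% of Russian searches.
```

This also makes the five-equal-appliances case transparent: five Indices of 100 normalize to 20% each.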

Once you realize this, you'd understand that it is risky to compare Search Indices across countries. The interpretation is clouded by how much of the total query volume is accounted for by the top query.

In our Trifecta Checkup, this is a chart that does well in the Question and Visual corners, but there is a problem with the Data.


Lines, gridlines, reference lines, regression lines, the works

This post is part 2 of an appreciation of the chart project by Google Newslab, advised by Alberto Cairo, on the gender and racial diversity of the newsroom. Part 1 can be read here.

In the previous discussion, I left out the following scatter bubble plot.

Goog_newsrooms_gender_2

This plot is available in two versions, one for gender and one for race. The key question being asked is whether the leadership in the newsroom is more or less diverse than the rest of the staff.

The story appears to be a happy one: in many newsrooms, the leadership roughly reflects the staff in terms of gender distribution (even though both parts of the whole compare unfavorably to the gender ratio in the neighborhoods, as we saw in the previous post).

***

Unfortunately, there are a few execution problems with this scatter plot.

First, take a look at the vertical axis labels on the right side. The labels inform the leadership axis. The mid-point showing 50-50 (parity) is emphasized with the gray band. Around the mid-point, the labels seem out of place. Typically, when a chart contains gridlines, we expect the labels to sit right at each gridline, either on top of or just below the line. Here the labels occupy the middle of the space between successive gridlines. On closer inspection, the labels are correctly affixed, and the gridlines are drawn where they are supposed to be. The designer chose to show irregularly spaced labels: from the midpoint, it's a 15% jump on either side, then a 10% jump.

I find this decision confounding. It also seems as if two people worked on these labels, as there exist two patterns: the first is "X% Leaders are Women," and the second is "Y% Female." (Actually, the top and bottom labels are also inconsistent, one using "women" and the other "female.")

The horizontal axis? They left out the labels. Without labels, it is not possible to interpret the chart. Inspecting several conveniently placed data points, I figured that the labels on the six vertical gridlines should be 25%, 35%, ..., 65%, 75% - in essence, the same scale as the vertical axis.

Here is the same chart with improved axis labels:

Jc_newsroomgender_1

Re-labeling serves up a new issue. The key reference line on this chart isn't the horizontal parity line: it is the 45-degree line, showing that the leadership has the same proportion of females as the rest of the staff. In the following plot (right side), I added in the 45-degree line. Note that it is positioned awkwardly on top of the grid system. The culprit is the incompatible gridlines.

Jc_newsroomgender_1

The solution, as shown below, is to shift the vertical gridlines by 5% so that the 45-degree line bisects every grid cell it touches.

Jc_newsroomgender_3

***

Now that we've dealt with the purely visual issues, let me get to a statistical issue that's been troubling me. It's about that yellow line. It's supposed to be a regression line that runs through the points.

Does it appear biased downwards to you? There just seem to be too many dots above the line and not enough below. The furthest points above also appear to sit farther from the line than the distant points below.

How do we know the line is not correct? Notice that the green 45-degree line goes through the point labeled "AVERAGE." That is the "average" newsroom with the average proportion of female staff and the average proportion of leadership staff. Interestingly, the average falls right on the 45-degree line.

In general, the average does not need to hit the 45-degree line. The average, however, does need to hit the regression line! (For a mathematical explanation, see here.)
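This property of least-squares regression is easy to verify numerically; a quick sketch with randomly generated data:

```python
# An ordinary least-squares fit always passes through (mean(x), mean(y)).
# The data below are randomly generated for illustration.
import random

random.seed(1)
x = [random.uniform(25, 75) for _ in range(50)]   # staff % female
y = [xi + random.gauss(0, 5) for xi in x]         # leadership % female

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
intercept = ybar - slope * xbar

# The fitted value at the average x equals the average y exactly,
# because the intercept is defined as ybar - slope * xbar.
print(abs(slope * xbar + intercept - ybar) < 1e-9)  # True
```

No matter what data you feed in, the fitted line recovers the average point; a plotted regression line that misses it cannot be a least-squares fit.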

Note the corresponding chart for racial diversity has it right. The yellow line does pass through the average point here:

Goog_newsrooms_race_2

 ***

In practice, how do problems seep into dataviz projects? You don't get to the final chart via a clean, streamlined process; you pass through a cycle of explore-retrench-synthesize, frequently bouncing ideas among several people, and it's challenging to maintain consistency!

And let me repeat my original comment about this project - the key learning here is how they took a complex dataset with many variables, broke it down into multiple parts addressing specific problems, and applied the layering principle to make each part of the project digestible.


The tech world in which everyone is below average

Laura pointed me to an infographic about tech worker salaries in major tech hubs (link).

What's wrong with this map?

Entrepreneur_techsalaries_map

The box "Global average" is doubly false. It is not global, and it is not the average!

The only non-American cities included in this survey are Toronto, Paris and London.

The only city with an average salary above the "Global average" is the San Francisco Bay Area. Since the Bay Area does not outweigh all the other cities combined in the number of tech workers, it is impossible for the average to be $135,000.
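The impossibility follows from the nature of weighted averages: the overall average must sit between the city averages, weighted by each city's number of tech workers. A quick check with hypothetical figures (the worker counts are made up; the point is that one above-average city cannot pull the average up to $135,000 unless it dominates the weights):

```python
# A weighted average cannot exceed nearly all of its components unless
# the one large component dominates the weights. Figures are made up.
cities = {
    "SF Bay Area": (142_000, 300_000),  # (average salary, tech workers)
    "Seattle":     (118_000, 200_000),
    "New York":    (110_000, 250_000),
    "London":      ( 90_000, 150_000),
}

total_workers = sum(n for _, n in cities.values())
overall = sum(salary * n for salary, n in cities.values()) / total_workers
print(round(overall))  # far below 135,000
```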

***

Here is the second chart.

What's wrong with these lines?

Entrepreneur_techsalaries_lines

This chart frustrates the reader's expectations. The reader interprets it as a simple line chart, based on three strong hints:

  • time along the horizontal axis
  • data labels in dollar units
  • lines linking the time points

Each line seems to show the trend of average tech worker salary, in dollar units.

However, that isn't the designer's intention. Let's zoom in on Chicago and Denver:

Entrepreneur_techsalaries_lines2

The number $112,000 (Denver) sits below the number $107,000 (Chicago). It appears that each chart has its own scale. But that's not the case either.

For a small-multiples setup, we expect all charts to use the same scale. Even though the data labels are absolute dollar amounts, the vertical axis is on a relative scale (percent change). To make things even more complicated, the percent change is computed relative to the minimum of the three annual values, whichever year it occurs in.

Redo_entrepreneurtechsalarieslines2

That's why $106,000 (Chicago) is at the same level as $112,000 (Denver): those are the minimum values of the respective time series. As shown above, these line charts are easier to understand when the axis displays its true units of percent change.

The choice of using the minimum value as the reference level interferes with comparing one city to the next. For Chicago, the line chart tells us 2015 is about 2 percent above 2016 while 2017 is 6 percent above. For Denver, the line chart tells us that 2016 is about 2 percent above the 2015 and 2017 values. Now what's the message again?
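The two reference levels can be compared with a small sketch (the salary figures are approximate, read off the chart):

```python
# Index a short salary time series two ways: relative to its minimum
# (as the original chart does) and relative to the first year.
# Salary figures are approximate.
def index_to_min(values):
    base = min(values)
    return [round(100 * (v - base) / base, 1) for v in values]

def index_to_first(values):
    base = values[0]
    return [round(100 * (v - base) / base, 1) for v in values]

chicago = [108_000, 106_000, 112_000]  # 2015, 2016, 2017 (approximate)

print(index_to_min(chicago))    # [1.9, 0.0, 5.7] -- reference is 2016
print(index_to_first(chicago))  # [0.0, -1.9, 3.7] -- reference is 2015
```

Indexing to the earliest year gives every city the same, immediately interpretable reference point, which is what the redone chart below does.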

Here I index all lines to the earliest year.

Redo_junkcharts_entrepreneurtechsalaries_lines

In a Trifecta Checkup analysis (link), I'd be suspicious of the data. Did tech salaries in London really drop by 15-20 percent in the last three years?


Governor of Maine wants a raise

In a Trifecta Checkup, this map scores low on the Q corner: what is its purpose? What have readers learned about the salaries of state governors after looking at the map? (Link to original)

How-much_governors-salary-every-state-9e71

The most obvious "insights" include:

  • There are more Republican governors than Democratic governors
  • Most Democratic governors are from the coastal states
  • There is exactly one Independent governor
  • Small states on the Eastern seaboard are messing up the design

Notice I haven't said anything about salaries. That's because the reader has to read the data labels to learn the governor's salary in each state. It takes work to find the average or median salary, or even the maximum and minimum, without spending quality time with the labels.

This is also an example of a chart that is invariant to the data. The chart would look exactly the same if I substituted the real salaries with 50 fake numbers.

***

The following design attempts to say something about the data. The dataset is actually not that interesting because the salaries are relatively closely clustered.

Redo_governorsalary
You get to see the full range of salaries, with the median and the 25th and 75th percentiles marked off. The states are divided into top and bottom halves, with the median as the dividing line. A simple clustering algorithm groups the salaries into similar categories, which are then color-coded.
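For readers who want to try something similar, a one-dimensional grouping can be as simple as splitting a sorted list at its largest gaps. This is a sketch of that idea with made-up salaries, not necessarily the algorithm behind my chart:

```python
# Group sorted salaries into clusters by cutting at the largest gaps --
# a simple 1-D alternative to k-means. The salary data are made up.
def gap_cluster(salaries, n_clusters):
    s = sorted(salaries)
    gaps = [(s[i + 1] - s[i], i) for i in range(len(s) - 1)]
    # Cut after the (n_clusters - 1) largest gaps.
    cuts = sorted(i for _, i in sorted(gaps, reverse=True)[:n_clusters - 1])
    clusters, start = [], 0
    for c in cuts:
        clusters.append(s[start:c + 1])
        start = c + 1
    clusters.append(s[start:])
    return clusters

salaries = [70, 85, 90, 95, 120, 125, 130, 145, 150, 190]  # in $000s
print(gap_cluster(salaries, 3))
# [[70, 85, 90, 95], [120, 125, 130, 145, 150], [190]]
```

Because the governors' salaries are closely clustered, any reasonable grouping produces similar categories.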

The Maine governor is the least compensated.

If you have other ideas for this dataset, feel free to submit them to me.


A long view of hurricanes

This chart by Axios is well made. The full version is here.

Axios_hurricanes

It's easy to identify all the Cat 5 hurricanes. Only the important ones are labeled; the other labels are hidden behind the hover. The chart provides a good answer to the question: what time of year do the worst hurricanes strike? It's harder to compare the maximum speeds of the hurricanes.

I wish there were a way to incorporate geography. I'd be willing to trade away the trajectory of wind speeds, as the maximum speed is of most use.


The salaries are attractive but the chart isn't

Ieee_engineersalaries
The only reason the IEEE Spectrum magazine editors chose this chart form is that they think they need to deliver precise salary figures to readers.

This chart is just so... sad.

The color scheme is all wrong, the black suggesting a funeral. The printed data, occupying at least half of the width of each bar, frustrate any attempt to compare lengths. We enter an unusual place where higher numbers appear under smaller numbers. The job titles are regrettably dressed in the same cloth as the median salary bars. It's not clear how the regions are ordered, and in any case, it's hard to figure out regional disparities. In reality, no one is getting precisely the listed salaries - rounding those numbers would make them easier to grasp.

This is a chart that repels rather than attracts readers.

***

A test of sufficiency immediately nails the problem. When the printed data are removed, there is almost nothing left to see:

Redo_ieeesalaries_sufficiencytest

***

Mid-Atlantic managers are the winners.

Redo_jc_ieeesalaries


Details, details, details: giving Zillow a pie treatment

Delinquent_homes_chart
This chart (shown right), published by Zillow in a report on housing in 2012, looks quite standard, apparently avoiding the worst of Excel defaults.

In real estate, it’s all about location. In dataviz, it’s all about details.

What are some details that caught my eye on this chart?

Readers have to get over the hurdle that “negative equity” is the same as “underwater homes.” This is not readily understood unless one reads the surrounding text. For example, the first row for the U.S. average proclaims that 31% of U.S. homes are “underwater” and among these underwater homes, 10% of the mortgages are delinquent. The former is concerned with the valuation of the property while the latter deals with payments or lack thereof.

According to the legend, the blue segments stand for the proportions of underwater homes in different metro areas, but it’s not quite true – the blue part represents underwater but non-delinquent mortgages, while the red and blue combined represent all underwater mortgages. This is a common problem in stacked bar charts.

The metro areas are in alphabetical order by city name, which means an opportunity is missed to help readers discern patterns. Patterns related to the alphabetical order of city names are of no interest to most readers (except certain econometrics journal editors). Try arranging by region, by decreasing level of negative equity, or by some other meaningful variable.

The designer tried to do something clever with the horizontal axis labels and I don't think it succeeds. To see what is going on, read the note below the chart. The trick is to let readers look at the number of underwater and delinquent mortgages in two ways, as a proportion of underwater mortgages (through the white data labels) and as a proportion of all mortgages (through the axis labels). That's a mess, sorry to say.

Finally, I'd like the horizontal axis to extend to 100%, because underlying the proportions shown in blue and on the horizontal axis is the population of all mortgages.

***

Perhaps to the shock of many readers, the task of showing underwater delinquent mortgages simultaneously as a proportion of underwater mortgages and as a proportion of all mortgages is solved using ... pie charts.

I just created a couple of examples here:

Redo_zillowunderwater

The deep orange sector can be compared to the entire circle, or to the larger orange sector. Readers usually don't have a problem with pies with only three slices.


This one takes time to make, takes even more time to read

Reader Matt F. contributed this confusing chart from Wired, accompanying an article about Netflix viewing behavior. 

Wired_netflix_chart-1

Matt doesn't like this chart. He thinks the main insight - most viewers drop out after the first episode - is too obvious. And there are more reasons why the chart doesn't work.

This is an example of a high-effort, low-reward chart. See my return-on-effort matrix for more on this subject.

The high effort is due to several design choices.

The most attention-grabbing part of the chart is the blue, yellow and green bars. The blue and yellow together form one scheme, while the green color refers to something else entirely. The shows in blue are classified as "savored," meaning that "viewers" on average took in less than two hours per day "to complete the season." The shows in yellow are just the opposite, labeled "devoured." The distinction between savored and devoured shows appears to be a central thesis of the article.

The green cell measures something unrelated to the average viewer's speed of consumption. It denotes a single episode, the "watershed" after which "at least 70 percent of viewers will finish the season." The watershed episode exists for every show; the only variability is which episode it is. The variability is small because all shows experience a big drop-off in audience after episode 1, the slope of the audience curve decreases with further episodes, and these shows have a small number of episodes (6 to 13). Among the shows depicted, with the single exception of BoJack Horseman, the watershed occurs in episode 2, 3, or 4.

Wired_netflix_inset1

Beyond the colors, readers will consider the lengths of the bars. Axis labels are typically found on the horizontal axis, but here they sit, facing the wrong way, on pink columns at the right edge of the chart. These labels are oriented in a way that makes readers think they represent column heights.

The columns look like they are all roughly the same height, but on close inspection, they are not! Their heights are not given on top of the columns but on the side of the vertical axis.

The bar lengths show the total number of minutes of season 1 of each of these shows. This measure is a peripheral piece of information that adds little to the chart.

The vertical axis indicates the proportion of viewers who watched all episodes within one week of viewing. This segmentation of viewers is related to the segmentation of the shows (blue/yellow), as both are driven by the speed of consumption.

Not surprisingly, the higher the elevation of the bar, the more likely it is yellow. A higher bar means more people are binge-watching, which should imply the show is more likely to be classified as "devoured." Despite the correlation, these two ways of measuring the speed of consumption are not consistent. The average show on the chart has about 7 hours of content. If consumed within one week, it requires only one hour of viewing per day... so the average show would be classified as "savored" even though the average viewer could be labeled a binge-watcher who finishes within one week.
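The inconsistency is easy to check with arithmetic (the 7-hour season length is the approximate average mentioned above):

```python
# Apply the chart's two definitions of fast consumption to one viewer
# who finishes a 7-hour season in exactly 7 days.
season_hours = 7
days_to_finish = 7

hours_per_day = season_hours / days_to_finish  # 1.0 hour per day

# Show classification: "devoured" requires more than 2 hours per day.
devoured = hours_per_day > 2
# Viewer segmentation on the vertical axis: finished within one week.
finished_within_week = days_to_finish <= 7

print(devoured, finished_within_week)  # False True -- they disagree
```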

***

[After taking a breath of air] We may have found the interesting part of this chart - the show Orange is the New Black is considered a "devoured" show, and yet only half the viewers finish all episodes within one week, a much lower proportion than for most of the other shows. Given the total viewing time of about 12 hours, a viewer watching two hours per day would take 6 days to finish the series, within the one-week cutoff. So these viewers may be watching more than one episode at a time, but taking breaks between viewing sessions.

The following chart brings out the exceptional status of this show:

Redo_wirednetflixchill_v2

PS. Above image was replaced on 7/19/2017 based on feedback from the commenters. Labels and legend added.


Shocker: ease of use requires expanding, not restricting, choices

Recently, I noted how we have to learn to hate defaults in data visualization software. I was reminded again of this point when reviewing this submission from long-time reader & contributor Chris P.

Medium_retailstocks

The chart is included in this Medium article, which credits Mott Capital Management as the source.

Jc_medium_retailers

Look at the axis labels on the right side. They have the hallmarks of software defaults. The software designer decided that the axis labels will be formatted exactly the same way as the data in that column: this means $XXX.XXB, with two decimal places. The same formatting rule is in place for the data labels, shown in boxes.

Why put tick marks at the odd intervals 37.50, 62.50, 87.50, ...? What's wrong with 40, 60, 80, 100, ...? It comes down to machine thinking versus human thinking.
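The odd intervals are what you get by mechanically dividing the axis range into equal parts. A human-friendly routine rounds the step to a "nice" number first; here is a minimal sketch of the idea (not the algorithm any particular charting package uses):

```python
import math

# Machine thinking: divide the range into n equal parts, whatever they are.
def machine_ticks(lo, hi, n):
    step = (hi - lo) / n
    return [lo + i * step for i in range(n + 1)]

# Human thinking: snap the step to 1, 2, or 5 times a power of ten.
def nice_ticks(lo, hi, n):
    raw = (hi - lo) / n
    mag = 10 ** math.floor(math.log10(raw))
    step = min((m * mag for m in (1, 2, 5, 10)),
               key=lambda s: abs(s - raw))
    tick = math.floor(lo / step) * step
    ticks = []
    while tick <= hi + 1e-9:
        ticks.append(tick)
        tick += step
    return ticks

print(machine_ticks(12.5, 112.5, 4))  # [12.5, 37.5, 62.5, 87.5, 112.5]
print(nice_ticks(12.5, 112.5, 4))     # [0, 20, 40, 60, 80, 100]
```

Same data range, but the second set of labels is the one a human would have chosen.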

This software places the most recent values into data labels, formatted as boxes that point to the positions of those values on the axis. Evidently, it doesn't have a plan for overcrowding. At the bottom of the axis, we see four labels for six lines. The blue, pink and orange labels point to the wrong places on the axis.

Worse, it's unclear what those "most recent" values represent. I have added gridlines for each year on the excerpt shown right. The lines extend to 2017, which isn't even half over.

Now, consider the legend. Which version do you prefer?

Jc_medium_retailers_legend

Most likely, the original dataset has columns named "Amazon.com Revenue (TTM)", "Dillard's Revenue (TTM)", etc. so the software just picks those up and prints them in the legend text.

***

The chart is an output of YCharts, which I learned is a Bloomberg terminal competitor. It probably uses one of the available Web graphing packages. These packages typically emphasize ease of use, by automating the data visualization process. Ease of use is defined as rigid defaults that someone has determined to be the optimal settings. Users then discover that there is no getting around those settings; in some cases, a coding interface is available, which defeats the goal of user-friendliness.

The problem lies in defining what ease of use means. Ease of use should require expanding, not restricting, choices. Setting rigid defaults restricts choices. In addition to providing good defaults, the software designer should make it simple for users to make their own choices. Ideally, each of the elements (data labels, gridlines, tick marks, etc.) can be independently removed, shifted, expanded, reduced, re-colored, edited, etc. from their original settings.