Is the visual serving the question?

The following chart concerns California's bullet train project.

California_bullettrain

Now, look at the bubble chart at the bottom. Here it is - with all the data except the first number removed:

Highspeedtrains_sufficiency

It is impossible to know how fast the four other train systems run after I removed the numbers. The only way a reader can comprehend this chart is to read the data inside the bubbles. This chart fails the "self-sufficiency test". The self-sufficiency test asks how much work the visual elements on the chart are doing to communicate the data; in this case, the bubbles do nothing at all.

Another problem: this chart buries its lede. The message is in the caption: how California's bullet train rates against other fast train systems. Yet California's train speed of 220 mph is mentioned only in the text and does not appear in the visual.

Here is a chart that draws attention to the key message:

Redo_highspeedtrains

In a Trifecta checkup, we improved this chart by bringing the visual in sync with the central question of the chart.


Two views of earthquake occurrence in the Bay Area

This article has a nice description of earthquake occurrence in the San Francisco Bay Area. A few quantities are of interest: when the next quake occurs, the size of the quake, the epicenter of the quake, etc. The data graphic included in the article fails the self-sufficiency test: the only way to read this chart is to read out the entire data set - in other words, the graphical details have no utility.

Earthquake-probability-chart

The article points out the clustering of earthquakes. In particular, there is a 68-year "quiet period" between 1911 and 1979, during which no quakes over 6.0 in size occurred. The author appears to have classified quakes into three groups: "Largest" which are those at 6.5 or over; "Smaller but damaging" which are those between 6.0 and 6.5; and those below 6.0 (not shown).

For a more standard and more effective visualization of this dataset, see this post on a related chart (about avian flu outbreaks). The post discusses a bubble chart versus a column chart. I prefer the column chart.

image from junkcharts.typepad.com

This chart focuses on the timing of rare events. The time between events is not as easy to see. 

What if we want to focus on the "quiet years" between earthquakes? Here is a visualization that addresses the question: when will the next one hit us?

Redo_jc_earthquakeprobability
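
Shifting the focus from the timing of events to the quiet years between them is a small data transformation: take the difference between consecutive event years. Here is a minimal sketch, assuming the data arrive as a plain list of years. The function name `quiet_periods` is my own, and the only years used are the two endpoints of the quiet period mentioned above, not the full dataset.

```python
def quiet_periods(event_years):
    """Return the number of years between each pair of consecutive events."""
    years = sorted(event_years)
    # Pair each event with the next one and take the difference.
    return [later - earlier for earlier, later in zip(years, years[1:])]


# The two endpoints of the "quiet period" cited in the article.
print(quiet_periods([1911, 1979]))  # [68]
```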

 

 


Big Macs in Switzerland are amazing, according to my friend

Bigmac_ch

Note for those in or near Zurich: I'm giving a Keynote Speech tomorrow morning at the Swiss Statistics Meeting (link). Here is the abstract:

The best and the worst of data visualization share something in common: these graphics provoke emotions. In this talk, I connect the emotional response of readers of data graphics to the design choices made by their creators. Using a plethora of examples, collected over a dozen years of writing online dataviz criticism, I discuss how some design choices generate negative emotions such as confusion and disbelief while other choices elicit positive feelings including pleasure and eureka. Important design choices include how much data to show; which data to highlight, hide or smudge; what research question to address; whether to introduce imagery, or playfulness; and so on. Examples extend from graphics in print, to online interactive graphics, to visual experiences in society.

***

The Big Mac index seems to never want to go away. Here is the latest graphic from the Economist, saying what it says:

Econ_bigmacindex

The index never made much sense to me. I'm in Switzerland, and everything here is expensive. My friend, who is a U.S. transplant, seems to have adopted McDonald's as his main eating-out venue. Online reviews indicate that the quality of the burger served in Switzerland is much better than the same thing in the States. So, part of the price differential can be explained by quality. The index also confounds several other issues, such as local inflation and exchange rates.

Now, on to the data visualization, which is primarily an exercise in rolling one's eyeballs. In order to understand the red and blue line segments, our eyes have to hop over the price bubbles to the top of the page. Then, in order to understand the vertical axis labels, unconventionally placed on the right side, our eyes have to zoom over to the left of the page, and search for the line below the header of the graph. Next, if we want to know about a particular country, our eyes must turn sideways and scan from bottom up.

Here is a different take on the same data:

Redo_jc_econbigmac2018

I transformed the data as I don't find it compelling to learn that Russian Big Macs are 60% cheaper than American Big Macs. Instead, on my chart, the reader learns that the price paid for a U.S. Big Mac will buy almost two and a half Big Macs in Russia.

The arrows pointing left indicate that in most countries, the values of their currencies are declining relative to the dollar from 2017 to 2018 (at least from the Big Mac Index point of view). The only exception is Turkey, where in 2018 one can buy more Big Macs for the price of one U.S. Big Mac than in 2017.

The decimal differences are immaterial so I have grouped the countries by half Big Macs.
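
For those who want to reproduce the transformation, here is a minimal sketch. It assumes the input is the "percent cheaper than the U.S." figure, and it assumes that "grouping by half Big Macs" means rounding to the nearest 0.5. The function names are mine, and the 60% figure is the rounded number quoted above.

```python
def big_macs_per_us_price(pct_cheaper_than_us):
    """Big Macs one can buy locally with the price of one U.S. Big Mac."""
    # Local price expressed as a fraction of the U.S. price.
    relative_price = 1 - pct_cheaper_than_us / 100
    return 1 / relative_price


def to_nearest_half(value):
    """Group a value into half-unit steps (assumed rounding rule)."""
    return round(value * 2) / 2


russia = big_macs_per_us_price(60)  # 60% cheaper than the U.S.
print(russia, to_nearest_half(russia))
```

A burger that is 60% cheaper costs 0.4 of the U.S. price, and 1 / 0.4 is 2.5, which is where the "two and a half Big Macs" comes from.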

This example demonstrates yet again that, to make good data visualization, one has to describe an interesting question, make appropriate transformations of the data, and then choose the right visual form. I describe this framework as the Trifecta - a guide to it is here.

(P.S. I noticed that Bitly just decided unilaterally to deactivate my customized Bitly link that was configured years and years ago, when it switched design (?). So I had to re-create the custom link. I have never grasped why "unreliability" is a feature of the offering by most Tech companies.)


Checking the scale on a chart

Dot maps, and by extension, bubble maps are popular options for spatial data; but the scale of these maps can be deceiving. Here is an example of a poorly-scaled dot map:

Farm-Dot Density

The U.S. was primarily an agrarian economy in 1997, if you believe your eyes.

Here is a poorly-scaled bubble map:

image from junkcharts.typepad.com

New Yorkers have all become Citibikers, if you believe what you see.

Last week, I saw a nice dot map embedded inside this New York Times Graphics feature on the destruction of the Syrian city of Raqqa.

Nyt_raqqa_dotmap

Before I conclude that the destruction was broadly felt, I'd like to check the scale on the map to make sure it doesn't have the problem seen above. What is helpful here is the scale provided on the map itself.

Nty_raqqa_scale

That line segment representing a quarter mile fits about 15 dots side by side. Then, I found out that a Manhattan avenue (longer) block is roughly a quarter mile. That means the map places about 15 buildings to an avenue block. In my experience, that sounds about right: I'd imagine 15-20 buildings per block.
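
The same sanity check can be written down as arithmetic. This is a minimal sketch using the two numbers above (a quarter-mile scale bar and about 15 dots across it). The function name is mine, and the dot count is an eyeball estimate, not a measurement.

```python
FEET_PER_MILE = 5280


def feet_per_dot(scale_miles, dots_across):
    """Ground distance covered by one dot, given a scale bar and a dot count."""
    return scale_miles * FEET_PER_MILE / dots_across


spacing = feet_per_dot(0.25, 15)
print(f"{spacing:.0f} feet per dot")
```

That comes to roughly 88 feet per dot, which seems a plausible footprint for a single building.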

So I'm convinced that the designer chose an appropriate scale to display the data. It is actually true that the entire city of Raqqa was pretty much annihilated by U.S. bombs.


The downside of discouraging pie charts

It's no secret most dataviz experts do not like pie charts.

Our disdain for pie charts causes people to look for alternatives.

Sometimes, the alternative is worse. Witness:

Schwab_bloombergaggregatebondindex

This chart comes from the Spring 2018 issue of On Investing, the magazine for Charles Schwab customers.

It's not a pie chart.

Redo_jc_bondindex

I'm forced to say the pie chart is preferred.

The original chart fails the self-sufficiency test. Here is the 2007 chart with the data removed.

Bloombergbondindex_sufficiency

It's very hard to figure out how large those pieces are, so any reader trying to understand this chart will resort to reading the data, which means the visual representation does no work!

Or, you can use a dot plot.

Redo_jc_bondindex2

This version emphasizes the change over time.

 


Is the chart answering your question? Excavating the excremental growth map

Economist_excrement_growth

San Franciscans are fed up with excremental growth. Understandably.

Here is how the Economist sees it - geographically speaking.

***

In the Trifecta Checkup analysis, one of the questions to ask is "What does the visual say?", specifically with respect to the question being asked.

The question is how much the problem of human waste in SF grew from 2011 to 2017.

What does the visual say?

The number of complaints about human waste has increased from 2011 to 2014 to 2017.

The areas where there are complaints about human waste expanded.

The worst areas are around downtown, and that has not changed during this period of time.

***

Now, what does the visual not say?

Let's make a list:

  • How many complaints are there in total in any year?
  • How many complaints are there in each neighborhood in any year?
  • What's the growth rate in number of complaints, absolute or relative?
  • What proportion of complaints are found in the worst neighborhoods?
  • What proportion of the area is covered by the green dots on each map?
  • What's the growth in terms of proportion of areas covered by the green dots?
  • Does the density of green dots reflect density of human waste or density of human beings?
  • Does no green dot indicate no complaints or below the threshold of the color scale?

There's more:

  • Is the growth in complaints a result of more reporting or more human waste?
  • Is each complainant unique? Or do some people complain multiple times?
  • Does each piece of human waste lead to one and only one complaint? In other words, what is the relationship between the count of complaints and the count of human waste?
  • Is it easy to distinguish between human waste and animal waste?

And more:

  • Are all complaints about human waste valid? Does anyone verify complaints?
  • Are the plotted locations describing where the human waste is or where the complaint was made?
  • Can all complaints be treated identically as a count of one?
  • What is the per-capita rate of complaints?

In other words, the set of maps provides almost no information about the excrement problem in San Francisco.

After you finish working, go back and ask what the visual is saying about the question you're trying to address!

 

As a reference, I found this map of the population density in San Francisco (link):

SFO_Population_Density

 


Common charting issues related to connecting lines, labels, sequencing

The following chart about "ranges and trends for digital marketing salaries" has some problems that appear in a great number of charts.

Marketingsherpa-chartofweek-062915-salaries

The head tilt required to read the job titles.

The order of the job titles is baffling. It's neither alphabetical nor by salary.

The visual form suggests that we could see trends in salaries reading left-right, but the only information about trends is the year on year salary change, printed on top of the chart.

Some readers will violently object to the connecting lines between job titles, which are discrete categories. In this case, I also agree. I am a fan of so-called profile charts in which we do connect discrete categories with connecting lines - but those charts work because we are comparing the "profiles" of one group versus another group. Here, there is only one group.

The N=3,567 is weird. It doesn't say anything about the reliability of the estimate for, say, Chief Marketing Officer.

***

A dot plot can be used for this dataset. Like this:

Redo_jc_digitalsalaries

The range of salaries is not a great metric as the endpoints could be outliers.

Also, the variability of salaries is affected by two factors: the variability between companies, and sampling variability (which depends on the sample size for each job title). A wide range here could mean that different companies pay different salaries for the same job title, or that very few survey respondents held that job title.

 

 


Hog wild about dot maps

Reader Chris P. sent me this chart.

This was meant to be "light entertainment." See the Twitter discussion below.

9gag_hogsmap

***

Let's think a bit about the dot map as a data graphic.

Dot maps are one dimensional. The dot's location is used to indicate the latitude and longitude and therefore the x,y coordinates cannot encode any other data. If we have basically a black/white chart, as in this hog map, the dot can only encode binary data (yes/no).

The legend says "each dot represents 5,000 hogs." Think about how that statement applies to these scenarios:

  • Do you expect to see something different between the dot representing 4,200 and the one showing 4,900?
  • Do you expect to see something different between the dot representing 400 and 4,000?
  • Do you expect to see something different between the location with 4,800 hogs and 9,600 hogs?
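
To make these scenarios concrete, here is a minimal sketch of three rounding rules a designer might plausibly use to turn a hog count into a number of dots. The legend does not say which rule applies, so all three are assumptions, and the function names are mine.

```python
HOGS_PER_DOT = 5000  # from the map legend


def dots_floor(hogs):
    """One dot for every complete 5,000 hogs."""
    return hogs // HOGS_PER_DOT


def dots_nearest(hogs):
    """Round to the nearest whole dot (halves round up)."""
    return (hogs + HOGS_PER_DOT // 2) // HOGS_PER_DOT


def dots_ceil(hogs):
    """One dot for any location with at least one hog, rounding up."""
    return -(-hogs // HOGS_PER_DOT)


# The hog counts from the scenarios above.
for hogs in (400, 4000, 4200, 4800, 4900, 9600):
    print(hogs, dots_floor(hogs), dots_nearest(hogs), dots_ceil(hogs))
```

Under each rule, 4,200 and 4,900 hogs produce the same number of dots. Whether 400 hogs get a dot at all, and whether 9,600 hogs look any different from 4,800, depends on a rule the reader cannot see.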


Based on the legend, the designer would need two dots to represent 10,000 hogs. But those two dots pertain to the same location. Sometimes, "jitter" is added, and the two dots are placed side by side. However, with the scale of the map of the U.S., and the dots representing seemingly small neighborhoods, jitter creates more confusion than anything. Also, what about 3, 4, 5, .. dots in the same location?

 9gag_hogmap_inset

Looking at the details above, are the dots jittered or do they represent neighboring locations?

Sometimes, colors are used to encode data on a dot map. But each dot can only contain one color, so it typically shows only the top category in each location.

Dot maps are very limited. Think before you use them.

 


Well-structured, interactive graphic about newsrooms

Today, I take a detailed look at one of the pieces that came out of an amazing collaboration between Alberto Cairo and Google's News Lab. The work on diversity in U.S. newsrooms is published here. Alberto's introduction to this piece is here.

The project addresses two questions: (a) gender diversity (representation of women) in U.S. newsrooms and (b) racial diversity (representation of white vs. non-white) in U.S. newsrooms.

One of the key strengths of the project is how the complex structure of the underlying data is displayed. The design incorporates the layering principle everywhere to clarify that structure.

At the top level, the gender and race data are presented separately through the two tabs on the top left corner. Additionally, newsrooms are classified into three tiers: brand-names (illustrated with logos), "top" newsrooms, and the rest.

Goog_newsrooms_gender_1

The brand-name newsrooms are shown with logos while the reader has to click on individual bubbles to see the other newsrooms. (Presumably, the size of the bubble is the size of each newsroom.)

The horizontal scale is the proportion of males (or females), with equality positioned in the middle. The higher the proportion of male staff, the deeper is the blue. The higher the proportion of female staff, the deeper is the red. The colors are coordinated between the bubbles and the horizontal axis, which is a nice touch.

I am not feeling this color choice. The key reference level on this chart is the 50/50 split (parity), which is given the pale gray. So the attention is drawn to the edges of the chart, to those newsrooms that are the most gender-biased. I'd rather highlight the middle, celebrating those organizations with the best gender balance.

***

The red-blue color scheme unfortunately re-appeared in a subsequent chart, with a different encoding.

Goog_newsrooms_gender_4

Now, blue means a move towards parity while red indicates a move away from parity between 2001 and 2017. Gray now denotes lack of change. The horizontal scale remains the same, which is why this can cause some confusion.

Despite the colors, I like the above chart. The arrows symbolize trends. The chart delivers an insight. On average, these newsrooms are roughly 60% male with negligible improvement over 16 years.

***

Back to layering. The following chart shows that "top" newsrooms include more than just the brand-name ones.

Goog_newsrooms_gender_3

The dot plot is undervalued for showing simple trends like this. This is a good example of this use case.

While I typically recommend showing a balanced axis for a bipolar scale, this chart may be an exception. Moving to the right side is progress but the target sits in the middle; the goal isn't to get the dots to the far right, so much of the right panel is wasted space.