The following chart concerns California's bullet train project.
Now, look at the bubble chart at the bottom. Here it is - with all the data except the first number removed:
It is impossible to know how fast the four other train systems run after I removed the numbers. The only way a reader can comprehend this chart is to read the data inside the bubbles. This chart fails the "self-sufficiency test". The self-sufficiency test asks how much work the visual elements on the chart are doing to communicate the data; in this case, the bubbles do nothing at all.
Another problem: this chart buries its lede. The message is in the caption: how California's bullet train rates against other fast train systems. Yet California's train speed of 220 mph is mentioned only in the text and does not appear in the visual.
Here is a chart that draws attention to the key message:
In a Trifecta checkup, we improved this chart by bringing the visual in sync with the central question of the chart.
This article has a nice description of earthquake occurrence in the San Francisco Bay Area. A few quantities are of interest: when the next quake occurs, the size of the quake, the epicenter of the quake, etc. The data graphic included in the article fails the self-sufficiency test: the only way to read this chart is to read out the entire data set - in other words, the graphical details have no utility.
The article points out the clustering of earthquakes. In particular, there is a 68-year "quiet period" between 1911 and 1979, during which no quakes over 6.0 in size occurred. The author appears to have classified quakes into three groups: "Largest" which are those at 6.5 or over; "Smaller but damaging" which are those between 6.0 and 6.5; and those below 6.0 (not shown).
For a more standard and more effective visualization of this dataset, see this post on a related chart (about avian flu outbreaks). The post discusses a bubble chart versus a column chart. I prefer the column chart.
This chart focuses on the timing of rare events. The time between events is not as easy to see.
What if we want to focus on the "quiet years" between earthquakes? Here is a visualization that addresses the question: when will the next one hit us?
Note for those in or near Zurich: I'm giving a Keynote Speech tomorrow morning at the Swiss Statistics Meeting (link). Here is the abstract:
The best and the worst of data visualization share something in common: these graphics provoke emotions. In this talk, I connect the emotional response of readers of data graphics to the design choices made by their creators. Using a plethora of examples, collected over a dozen years of writing online dataviz criticism, I discuss how some design choices generate negative emotions such as confusion and disbelief while other choices elicit positive feelings including pleasure and eureka. Important design choices include how much data to show; which data to highlight, hide or smudge; what research question to address; whether to introduce imagery, or playfulness; and so on. Examples extend from graphics in print, to online interactive graphics, to visual experiences in society.
The Big Mac index seems to never want to go away. Here is the latest graphic from the Economist, saying what it says:
The index never made much sense to me. I'm in Switzerland, and everything here is expensive. My friend, who is a U.S. transplant, seems to have adopted McDonald's as his main eating-out venue. Online reviews indicate that the quality of the burger served in Switzerland is much better than the same thing in the States. So, part of the price differential can be explained by quality. The index also confounds several other issues, such as local inflation and exchange rates.
Now, on to the data visualization, which is primarily an exercise in rolling one's eyeballs. In order to understand the red and blue line segments, our eyes have to hop over the price bubbles to the top of the page. Then, in order to understand the vertical axis labels, unconventionally placed on the right side, our eyes have to zoom over to the left of the page, and search for the line below the header of the graph. Next, if we want to know about a particular country, our eyes must turn sideways and scan from bottom up.
Here is a different take on the same data:
I transformed the data as I don't find it compelling to learn that Russian Big Macs are 60% less than American Big Macs. Instead, on my chart, the reader learns that the price paid for a U.S. Big Mac will buy him/her almost 2 and a half Big Macs in Russia.
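The transformation is simple arithmetic: if the local Big Mac is some fraction cheaper than the U.S. one, the U.S. price buys 1 / (1 - fraction) burgers there. Below is a minimal Python sketch of that conversion. The function name is my own, and the 60% figure is the rounded Russia number quoted above, which is why the result lands slightly above the "almost 2 and a half" shown on the chart.

```python
def big_macs_per_us_price(pct_cheaper):
    """Number of local Big Macs that the price of one U.S. Big Mac buys.

    pct_cheaper: how much cheaper the local burger is than the U.S. one,
    in percent (negative if the local burger costs more).
    """
    local_price_ratio = 1 - pct_cheaper / 100  # local price / U.S. price
    if local_price_ratio <= 0:
        raise ValueError("pct_cheaper must be below 100")
    return 1 / local_price_ratio


# Russia: Big Macs roughly 60% cheaper than in the U.S.
print(round(big_macs_per_us_price(60), 2))  # 2.5
```

The reciprocal turns a "percent less" figure, which is hard to feel, into a count of burgers.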
The arrows pointing left indicate that in most countries, the values of their currencies declined relative to the dollar from 2017 to 2018 (at least from the Big Mac Index point of view). The only exception is Turkey, where the price paid for one U.S. Big Mac buys more Big Macs in 2018 than it did in 2017.
The decimal differences are immaterial so I have grouped the countries by half Big Macs.
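For readers who want to reproduce the grouping, here is one way to snap values to half Big Macs in Python. This is a sketch under an assumption: I am rounding to the nearest half with ties going up, and the input values are made up for illustration, not taken from the Economist's data.

```python
import math


def nearest_half(big_macs):
    """Round a Big Mac count to the nearest half (ties round up)."""
    return math.floor(big_macs * 2 + 0.5) / 2


# Hypothetical values, for illustration only.
for value in (2.43, 1.8, 1.26):
    print(value, "->", nearest_half(value))
```

Rounding down to the half below would be an equally defensible grouping rule; the point is that the decimals are dropped before plotting.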
This example demonstrates yet again that, to make a good data visualization, one has to describe an interesting question, make appropriate transformations of the data, and then choose the right visual form. I describe this framework as the Trifecta - a guide to it is here.
(P.S. I noticed that Bitly just decided unilaterally to deactivate my customized Bitly link that was configured years and years ago, when it switched design (?). So I had to re-create the custom link. I have never grasped why "unreliability" is a feature of the offering by most Tech companies.)
Dotmaps, and by extension, bubble maps are popular options for spatial data; but the scale of these maps can be deceiving. Here is an example of a poorly-scaled dot map:
The U.S. was primarily an agrarian economy in 1997, if you believe your eyes.
Here is a poorly-scaled bubble map:
New Yorkers have all become Citibikers, if you believe what you see.
Last week, I saw a nice dot map embedded inside this New York Times Graphics feature on the destruction of the Syrian city of Raqqa.
Before I conclude that the destruction was broadly felt, I'd like to check the scale on the map to make sure it doesn't have the problem seen above. What is helpful here is the scale provided on the map itself.
That line segment representing a quarter mile fits about 15 dots side by side. Then, I found out that a Manhattan avenue (longer) block is roughly a quarter mile. That means the map places about 15 buildings to an avenue block. In my experience, that sounds about right: I'd imagine 15-20 buildings per block.
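The scale check above is a back-of-envelope calculation, and it can be written out in a few lines of Python. The 15 dots and the quarter-mile segment are my readings off the map; the function name is just for illustration.

```python
FEET_PER_MILE = 5280


def feet_per_dot(segment_miles, dots_across):
    """Width of ground covered by one dot, given how many dots fit along the scale bar."""
    return segment_miles * FEET_PER_MILE / dots_across


# About 15 dots fit along the quarter-mile scale bar.
print(feet_per_dot(0.25, 15))  # 88.0
```

So each dot occupies roughly 88 feet of frontage, which is a believable footprint for a building.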
So I'm convinced that the designer chose an appropriate scale to display the data. It is actually true that the entire city of Raqqa was pretty much annihilated by U.S. bombs.
The following chart about "ranges and trends for digital marketing salaries" has some problems that appear in a great number of charts.
The head tilt required to read the job titles.
The order of the job titles is baffling. It's neither alphabetical nor by salary.
The visual form suggests that we could see trends in salaries reading left-right, but the only information about trends is the year on year salary change, printed on top of the chart.
Some readers will violently object to the connecting lines between job titles, which are discrete categories. In this case, I also agree. I am a fan of so-called profile charts in which we do connect discrete categories with connecting lines - but those charts work because we are comparing the "profiles" of one group versus another group. Here, there is only one group.
The N=3,567 is weird. It is the total sample size, so it says nothing about the reliability of the estimate for any single job title, say, Chief Marketing Officer.
A dot plot can be used for this dataset. Like this:
The range of salaries is not a great metric as the endpoints could be outliers.
Also, the variability of salaries is affected by two factors: the variability between companies, and sampling variability (which depends on the sample size for each job title). A wide range here could mean that different companies pay different salaries for the same job title, or that very few survey responders held that job title.
This was meant to be "light entertainment." See the Twitter discussion below.
Let's think a bit about the dot map as a data graphic.
Dot maps are one dimensional. The dot's location is used to indicate the latitude and longitude and therefore the x,y coordinates cannot encode any other data. If we have basically a black/white chart, as in this hog map, the dot can only encode binary data (yes/no).
The legend says "each dot represents 5,000 hogs." Think about how that statement applies to these scenarios:
Do you expect to see something different between the dot representing 4,200 and the one showing 4,900?
Do you expect to see something different between the dot representing 400 and 4,000?
Do you expect to see something different between the location with 4,800 hogs and 9,600 hogs?
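One way to make these questions concrete is to work out how many dots each location would get. The legend does not say how partial dots are handled, so the Python sketch below tries two assumed rules: rounding to the nearest dot, and dropping anything short of a full 5,000. Both rules and the function names are my guesses, not the map designer's documented method.

```python
HOGS_PER_DOT = 5000  # from the legend: "each dot represents 5,000 hogs"


def dots_rounded(hogs):
    """Assumed rule 1: round to the nearest whole dot (halves round up)."""
    return int(hogs / HOGS_PER_DOT + 0.5)


def dots_floored(hogs):
    """Assumed rule 2: draw a dot only for each full 5,000 hogs."""
    return hogs // HOGS_PER_DOT


for hogs in (400, 4000, 4200, 4900, 4800, 9600):
    print(hogs, dots_rounded(hogs), dots_floored(hogs))
```

Under the rounding rule, 4,000, 4,200, 4,800 and 4,900 hogs all look identical (one dot) and 400 hogs vanish; under the other rule, all of those locations vanish. Either way, the reader cannot recover the differences the questions ask about.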
Based on the legend, the designer would need two dots to represent 10,000 hogs. But those two dots pertain to the same location. Sometimes, "jitter" is added, and the two dots are placed side by side. However, with the scale of the map of the U.S., and the dots representing seemingly small neighborhoods, jitter creates more confusion than anything. Also, what about 3, 4, 5, ... dots in the same location?
Looking at the details above, are the dots jittered or do they represent neighboring locations?
Sometimes, colors are used to encode data on a dot map. But each dot can only contain one color, so it typically shows only the top category in each location.
Dot maps are very limited. Think before you use them.
Today, I take a detailed look at one of the pieces that came out of an amazing collaboration between Alberto Cairo and Google's News Lab. The work on diversity in U.S. newsrooms is published here. Alberto's introduction to this piece is here.
The project addresses two questions: (a) gender diversity (representation of women) in U.S. newsrooms and (b) racial diversity (representation of white vs. non-white) in U.S. newsrooms.
One of the key strengths of the project is how the complex structure of the underlying data is displayed. The design incorporates the layering principle everywhere to clarify that structure.
At the top level, the gender and race data are presented separately through the two tabs on the top left corner. Additionally, newsrooms are classified into three tiers: brand-names (illustrated with logos), "top" newsrooms, and the rest.
The brand-name newsrooms are shown with logos while the reader has to click on individual bubbles to see the other newsrooms. (Presumably, the size of the bubble is the size of each newsroom.)
The horizontal scale is the proportion of males (or females), with equality positioned in the middle. The higher the proportion of male staff, the deeper is the blue. The higher the proportion of female staff, the deeper is the red. The colors are coordinated between the bubbles and the horizontal axis, which is a nice touch.
I am not feeling this color choice. The key reference level on this chart is the 50/50 split (parity), which is given the pale gray. So the attention is drawn to the edges of the chart, to those newsrooms that are the most gender-biased. I'd rather highlight the middle, celebrating those organizations with the best gender balance.
The red-blue color scheme unfortunately re-appeared in a subsequent chart, with a different encoding.
Now, blue means a move towards parity while red indicates a move away from parity between 2001 and 2017. Gray now denotes lack of change. The horizontal scale remains the same, which is why this can cause some confusion.
Despite the colors, I like the above chart. The arrows symbolize trends. The chart delivers an insight. On average, these newsrooms are roughly 60% male with negligible improvement over 16 years.
Back to layering. The following chart shows that "top" newsrooms include more than just the brand-name ones.
The dot plot is undervalued for showing simple trends like this. This is a good example of this use case.
While I typically recommend showing a balanced axis for a bipolar scale, this chart may be an exception. Moving to the right side is progress, but the target sits in the middle; the goal isn't to get the dots to the far right, so much of the right panel is wasted space.