This chart shows why the PR agency for the UK government deserves a Covid-19 bonus

The Economist illustrated some interesting consumer research with this chart (link):

Economist_covidpoll

The survey by Dalia Research asked people about their satisfaction with their country's response to the coronavirus crisis. The results are reduced to the "Top 2 Boxes", the proportion of people who rated their government's response as "very well" or "somewhat well".

This dimension is laid out along the horizontal axis. The chart is a combo dot and bubble chart, arranged in rows by region of the world. Now what does the bubble size indicate?

It took me a while to find the legend as I was expecting it either in the header or the footer of the graphic. A larger bubble depicts a higher cumulative number of deaths up to June 15, 2020.

The key issue is the correlation between a country's death count and the people's evaluation of the government response.

Bivariate correlation is typically shown on a scatter plot. The following chart sets out the scatter plots in a small multiples format with each panel displaying a region of the world.

Redo_economistcovidpolling_scatter

The death tolls in the Asian countries are low relative to the other regions, and yet the people's ratings vary widely. In particular, the Japanese people are pretty hard on their government.

In Europe, the people of Greece, the Netherlands and Germany think highly of their government responses, which have suppressed deaths. The French, Spaniards and Italians are understandably unhappy. The British appear to be the most forgiving of their government, despite suffering a higher death toll than France, Spain or Italy. This speaks well of their PR operation.

Cumulative deaths should be adjusted by population size for a proper comparison across nations. When the same graphic is produced using deaths per million (shown on the right below), the general story is preserved while the pattern is clarified:

Redo_economistcovidpolling_deathspermillion_2

The right chart shows deaths per million while the left chart shows total deaths.
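For readers who want to reproduce the adjustment, here is a minimal Python sketch of the per-million calculation. The two countries and their figures are made up for illustration; they are not the numbers behind the charts above.

```python
def deaths_per_million(deaths: int, population: int) -> float:
    """Cumulative deaths scaled to a population of one million."""
    return deaths * 1_000_000 / population


# Made-up figures for two hypothetical countries (not real data).
countries = {
    "Country A": {"deaths": 40_000, "population": 80_000_000},
    "Country B": {"deaths": 10_000, "population": 10_000_000},
}

rates = {
    name: deaths_per_million(c["deaths"], c["population"])
    for name, c in countries.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.0f} deaths per million")
```

In this toy example, Country A has four times the total deaths of Country B but half the deaths per million, which is why the adjustment matters when comparing nations of different sizes.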

***

In the original Economist chart, what catches our attention first is the bubble size. Eventually, we notice the horizontal positioning of these bubbles. But the star of this chart ought to be the new survey data. I swapped those variables and obtained the following graphic:

Redo_economistcovidpolling_swappedvar

Instead of using bubble size, I switched to using color to illustrate the deaths-per-million metric. If ratings of the pandemic response correlate tightly with deaths per million, then we expect the color of these dots to evolve from blue on the left side to red on the right side.

The peculiar loss of correlation in the U.K. stands out. Their PR firm deserves a bonus!


Bubble charts, ratios and proportionality

A recent article in the Wall Street Journal about a challenger to the dominant weedkiller, Roundup, contains a nice selection of graphics. (Dicamba is the up-and-comer.)

Wsj_roundup_img1


The change in usage of three brands of weedkillers is rendered as small multiples of choropleth maps. This graphic displays geographical and time changes simultaneously.

The staircase chart shows weeds have become resistant to Roundup over time. This is considered a weakness in the Roundup business.

***

In this post, my focus is on the chart at the bottom, which shows complaints about Dicamba by state in 2019. This is a bubble chart, with the bubbles sorted along the horizontal axis by the acreage of farmland by state.

Wsj_roundup_img2

Below left is a more standard version of such a chart, in which the bubbles are allowed to overlap. (I only included the bubbles that were labeled in the original chart.)

Redo_roundupwsj0

The WSJ’s twist is to use the vertical spacing to avoid overlapping bubbles. The vertical axis serves a design prerogative and does not encode data.

I’m going to stick with the more traditional overlapping bubbles here – I’m getting to a different matter.

***

The question being addressed by this chart is: which states have the most serious Dicamba problem, as revealed by the frequency of complaints? The designer recognizes that the amount of farmland matters. One should expect that the more acres, the more complaints.

Let's consider computing directly the number of complaints per million acres.

The resulting chart (shown below right) – while retaining the design – gives a wholly different feeling. Arkansas now owns the largest bubble even though it has the least acreage among the included states. The huge Illinois bubble is still large but is no longer a loner.

Redo_dicambacomplaints1

Now return to the original design for a moment (the chart on the left). In theory, this should work in the following manner: if complaints grow purely as a function of acreage, then the bubbles should grow proportionally from left to right. The trouble is that proportional areas are not as easily detected as proportional lengths.

The pair of charts below depict made-up data in which all states have 30 complaints for each million acres of farmland. It’s not intuitive that the bubbles on the left chart are growing proportionally.

Redo_dicambacomplaints2

Now if you look at the right chart, which shows the relative metric of complaints per million acres, it’s impossible not to notice that all bubbles are the same size.
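The geometry explains why the left chart is hard to read. Here is a small Python sketch, assuming the bubbles are drawn with area proportional to the number of complaints (the usual convention, though I have not verified how the WSJ sized theirs). The acreage figures are hypothetical; only the rate of 30 complaints per million acres comes from the made-up example above.

```python
import math

COMPLAINTS_PER_MILLION_ACRES = 30  # the made-up rate used in the text


def bubble_radius(value: float) -> float:
    """Radius of a circle whose area equals `value` (area-proportional sizing)."""
    return math.sqrt(value / math.pi)


# Hypothetical farmland acreage, in millions of acres.
acres_millions = [5, 10, 20, 40]

complaints = [COMPLAINTS_PER_MILLION_ACRES * a for a in acres_millions]
radii = [bubble_radius(c) for c in complaints]
rates = [c / a for c, a in zip(complaints, acres_millions)]

for a, c, r, rate in zip(acres_millions, complaints, radii, rates):
    print(f"{a:>3}M acres: {c:>5} complaints, radius {r:5.2f}, rate {rate:.0f}")
```

Quadrupling the acreage quadruples the complaints but only doubles the radius, so the growth on the left chart looks muted. The rate column is constant, which is what the right chart makes obvious.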


Water stress served two ways

Via Alberto Cairo (whose new book How Charts Lie can be pre-ordered!), I found the Water Stress data visualization by the Washington Post. (link)

The main interest here is how they visualized the different levels of water stress across the U.S. Water stress is some metric defined by the World Resources Institute that, to my mind, measures the demand versus supply of water. The higher the water stress, the higher the risk of experiencing droughts.

There are two ways in which the water stress data are shown: the first is a map, and the second is a bubble plot.

Wp_waterstress

This project provides a great setting to compare and contrast these chart forms.

How Data are Coded

In a map, the data are usually coded as colors. Sometimes, additional details can be coded as shades, or moire patterns within the colors. But the map form locks down a number of useful dimensions - including x and y location, size and shape. The outline map reserves all these dimensions, rendering them unavailable to encode data.

By contrast, the bubble plot admits a good number of dimensions. The key ones are the x- and y- location. Then, you can also encode data in the size of the dots, the shape, and the color of the dots.

In our map example, the colors encode the water stress level, and a moire pattern encodes "arid areas". For the scatter plot, x = daily water use, y = water stress level, grouped by magnitude, color = water stress level, size = population. (Shape is constant.)

Spatial Correlation

The map is far superior in displaying spatial correlation. It's visually obvious that the southwestern states experience higher stress levels.

This spatial knowledge is relinquished when using a bubble plot. The designer relies on the knowledge of the U.S. map in the head of the readers. It is possible to code this into one of the available dimensions, e.g. one could make x = U.S. regions, but another variable is sacrificed.

Non-contiguous Spatial Patterns

When spatial patterns are contiguous, the map functions well. Sometimes, spatial patterns are disjoint. In that case, the bubble plot, which de-emphasizes the physical locations, can be superior. In our example, the vertical axis divides the states into five groups based on their water stress levels. Try figuring out which states are "medium to high" water stress from the map, and you'll see the difference.

Finer Geographies

The map handles finer geographical units like counties and precincts better. It's completely natural.

In the bubble plot, shifting to finer units causes the number of dots to explode. This clutters up the chart. Besides, while most (we hope) Americans know the 50 states, most of us can't recite counties or precincts. Thus, the designer can't rely on knowledge in our heads. It would be impossible to learn spatial patterns from such a chart.

***

The key, as always, is to nail down your message, then select the right chart form.


How to describe really small chances

Reader Aleksander B. sent me to the following chart in the Daily Mail, with the note that "the usage of area/bubble chart in combination with bar alignment is not very useful." (link)

Dailymail-image-a-35_1431545452562

One can't argue with that statement. This chart fails the self-sufficiency test: anyone reading the chart is reading the data printed on the right column, and does not gain anything from the visual elements (thus, the visual representation is not self-sufficient). As a quick check, the size of the risk for "motorcycle" should be about 30 times larger than that of "car"; the size of the risk for "car" should be 100 times larger than that of "airplane". The risk of riding motorcycles then is roughly 3,000 times that of flying in an airplane. 

The chart does not appear to be sized properly as a bubble chart:

Dailymail_travelrisk_bubble

You'll notice that the visible proportion of the "car" bubble is much larger than that of the "motorcycle" bubble, which is one part of the problem.

Nor is it sized as a bar chart:

Dailymail_travelrisk_bar

As a bar chart, both the widths and the heights of the bars vary; and the last row presents a further challenge as the bubble for the airplane does not touch the baseline.

***

Besides the Visual, the Data issues are also quite hard. This is how Aleksander describes it: "as a reader I don't want to calculate all my travel distances and then do more math to compare different ways of traveling."

The reader wants to make smarter decisions about travel based on the data provided here. Aleksander proposes one such problem:

In terms of probability it is also easier to understand: "I am sitting in my car in strong traffic. At the end in 1 hour I will make only 10 miles so what's the probability that I will die? Is it higher or lower than 1 hour in Amtrak train?"

The underlying choice is between driving and taking Amtrak for a particular trip. This comparison is relevant because those two modes of transport are substitutes for this trip. 

One Data issue with the chart is that riding a motorcycle and flying in a plane are rarely substitutes. 

***

A way out is to do the math on behalf of your reader. The metric of deaths per 1 billion passenger-miles is not intuitive for a casual reader. A more relevant question is: what is my chance of dying given the amount of driving (or flying) I do each year? Because the chance will be very tiny, it is easier to express the risk as the number of years of travel before I expect to see one death.

Let's assume someone drives 300 days per year, and 100 miles per day so that each year, this driver contributes 30,000 passenger-miles to the U.S. total (which is 3.2 trillion). We convert 7.3 deaths per 1 billion passenger-miles to 1 death per 137 million passenger-miles. Since this driver does 30K per year, it will take (137 million / 30K) = about 4,500 years to see one death on average. This calculation assumes that the driver drives alone. It's straightforward to adjust the estimate if the average occupancy is higher than 1. 

Now, let's consider someone who flies once a month (one outbound trip plus one return trip). We assume that each plane takes on average 100 passengers (including our protagonist), and each trip covers on average 1,000 miles. Then each of these flights contributes 100,000 passenger-miles. In a year, the 24 trips contribute 2.4 million passenger-miles. The risk of flying is listed at 0.07 deaths per 1 billion, which we convert to 1 death per 14 billion passenger-miles. On this flight schedule, it will take (14 billion / 2.4 million) = almost 6,000 years to see one death on average.

For the average person on those travel schedules, there is nothing to worry about. 
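The arithmetic above can be sketched in a few lines of Python. The death rates (7.3 and 0.07 per billion passenger-miles) and the travel schedules are the ones stated in the text; the function name is my own. The results differ slightly from the text because the text rounds the intermediate figures to 137 million and 14 billion.

```python
def years_per_expected_death(
    deaths_per_billion_pm: float, passenger_miles_per_year: float
) -> float:
    """Years of travel at a given annual mileage before one death is expected."""
    miles_per_death = 1e9 / deaths_per_billion_pm
    return miles_per_death / passenger_miles_per_year


# Driver: 300 days a year, 100 miles a day, driving alone.
driving_pm_per_year = 300 * 100

# Flyer: 24 flights a year, 100 passengers per plane, 1,000 miles per flight.
flying_pm_per_year = 24 * 100 * 1_000

driving_years = years_per_expected_death(7.3, driving_pm_per_year)
flying_years = years_per_expected_death(0.07, flying_pm_per_year)

print(f"Driving: about {driving_years:,.0f} years")
print(f"Flying:  about {flying_years:,.0f} years")
```

Changing the assumed mileage or occupancy is a one-line edit, which makes it easy to see how sensitive the answer is to the travel schedule.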

***

Comparing driving and flying is only valid for those trips in which you have a choice. So a proper comparison requires breaking down the average risks into components (e.g. focusing on shorter trips). 

The above calculation also suggests that the risk is not evenly spread out throughout the population, despite the use of an overall average. A trucker who is on the road every work day is clearly subject to higher risk than an occasional driver who makes a few trips on rental cars each year.

There is a further important point to note about flight risk, due to MIT professor Arnold Barnett. He has long criticized the use of deaths per billion passenger-miles as a risk metric for flights. (In Chapter 5 of Numbers Rule Your World (link), I explain some of Arnie's research on flight risk.) The problem is that almost all fatal crashes involving planes happen soon after take-off or not long before landing. 


Is the visual serving the question?

The following chart concerns California's bullet train project.

California_bullettrain

Now, look at the bubble chart at the bottom. Here it is - with all the data except the first number removed:

Highspeedtrains_sufficiency

It is impossible to know how fast the four other train systems run after I removed the numbers. The only way a reader can comprehend this chart is to read the data inside the bubbles. This chart fails the "self-sufficiency test". The self-sufficiency test asks how much work the visual elements on the chart are doing to communicate the data; in this case, the bubbles do nothing at all.

Another problem: this chart buries its lede. The message is in the caption: how California's bullet train rates against other fast train systems. California's train speed of 220 mph is only mentioned in the text but not found in the visual.

Here is a chart that draws attention to the key message:

Redo_highspeedtrains

In a Trifecta checkup, we improved this chart by bringing the visual in sync with the central question of the chart.


Check out the Lifespan of News project

Alberto Cairo introduces another one of his collaborations with Google, visualizing Google search data. We previously looked at other projects here.

The latest project, designed by Schema, Axios, and Google News Initiative, tracks the trending of popular news stories over time and space, and it's a great example of making sense of a huge pile of data.

The design team produced a sequence of graphics to illustrate the data. The top news stories are grouped by category, such as Politics & Elections, Violence & War, and Environment & Science, each given a distinct color maintained throughout the project.

The first chart is an area chart that looks at individual stories, and tracks the volume over time.

Lifespannews_areachart

To read this chart, you have to notice that the vertical axis measuring volume is a log scale, meaning that each tick mark up represents a 10-fold increase. Log scale is frequently used to draw far-away data closer to the middle, making it possible to see both ends of a wide distribution on the same chart. The log transformation introduces distortion deliberately. The smaller data look disproportionately large because of it.

The time scrolls automatically so that you feel a rise and fall of various news stories. It's a great way to experience the news cycle in the past year. The overlapping areas show competing news stories that shared the limelight at that point in time.

Just bear in mind that you have to mentally reverse the distortion introduced by the log scale.
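To make the distortion concrete, here is a minimal Python sketch of a base-10 log scale. The search volumes are invented for illustration; the project's actual values are not reproduced here.

```python
import math

# Hypothetical search volumes spanning four orders of magnitude.
volumes = [100, 1_000, 10_000, 1_000_000]

# Position on a base-10 log axis: each unit step is a 10-fold increase.
positions = [math.log10(v) for v in volumes]

for v, p in zip(volumes, positions):
    print(f"volume {v:>9,} -> axis position {p:.1f}")

# Reversing the distortion: turn an axis position back into a volume.
recovered = [10 ** p for p in positions]
```

The largest volume is 10,000 times the smallest, yet it sits only three times as high on the axis (position 6 versus 2). That is the sense in which the smaller data look disproportionately large.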

***

In the second part of the project, they tackle regional patterns. Now you see a map with proportional symbols. The top story in each locality is highlighted with the color of the topic. As time flows by, the sizes of the bubbles expand and contract.

Lifespannews_bubblemap

Sometimes, the entire nation was consumed by the same story, e.g. certain obituaries. At other times, people in different regions focused on different topics.

***

In the last part of the project, they describe general shapes of the popularity curves. Most stories have one peak although certain stories like the U.S. government shutdown will have multiple peaks. There is also variation in terms of how fast a story rises to the peak and how quickly it fades away.

The most interesting aspect of the project can be learned from the footnote. The data are not direct hits to the Google News stories but searches on Google. For each story, one (or more) unique search terms are matched, and only those stories are counted. A "control" is established, which is an excellent idea. The control gives meaning to those counts. The control used here is the number of searches for the generic term "Google News." Presumably this is a relatively stable number that is a proxy for general search activity. Thus, the "volume" metric is really a relative measure against this control.
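Here is a rough Python sketch of what a relative measure against a control might look like. I am assuming the metric is a simple ratio of story searches to control searches; the footnote does not spell out the exact formula, and all the counts below are invented.

```python
def relative_volume(story_searches: float, control_searches: float) -> float:
    """Story searches expressed as a multiple of the control term's searches.

    Assumption: a plain ratio. The project's actual formula may differ.
    """
    return story_searches / control_searches


# Invented weekly counts: (searches for a story, searches for the control term).
weeks = {
    "week 1": (50_000, 10_000),
    "week 2": (60_000, 20_000),
}

relative = {week: relative_volume(s, c) for week, (s, c) in weeks.items()}

for week, value in relative.items():
    print(f"{week}: {value:.1f}x the control")
```

In this toy example the raw count rises from week 1 to week 2, but the relative volume falls because general search activity (as proxied by the control) doubled.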


Crazy rich Asians inspire some rich graphics

On the occasion of the hit movie Crazy Rich Asians, the New York Times did a very nice report on Asian immigration in the U.S.

The first two graphics will be of great interest to those who have attended my free dataviz seminar (coming to Lyon, France in October, by the way. Register here.), as it deals with a related issue.

The first chart shows an income gap widening between 1970 and 2016.

Nyt_crazyrichasians_incomegap1

This uses a two-lines design in a small-multiples setting. The distance between the two lines is labeled the "income gap". The clear story here is that the income gap is widening over time across the board, but especially rapidly among Asians, and then followed by whites.

The second graphic is a bumps chart (slopegraph) that compares the endpoints of 1970 and 2016, but using an "income ratio" metric, that is to say, the ratio of the 90th-percentile income to the 10th-percentile income.

Nyt_crazyrichasians_incomeratio2

Asians are still a key story on this chart, as income inequality has ballooned from 6.1 to 10.7. That is where the similarity ends.

Notice how whites now appear at the bottom of the list while blacks show up as the second "worst" in terms of income inequality. Even though the underlying data are the same, what can be seen in the Bumps chart is hidden in the two-lines design!

In short, the reason is that the scale of the two-lines design is such that the small numbers are squashed. The bottom 10 percent did see an increase in income over time but because those increases pale in comparison to the large incomes, they do not show up.

What else does not show up in the two-lines design? Notice that in 1970, the income ratio for blacks was 9.1, way above other racial groups.

Kudos to the NYT team for realizing that the two-lines design provides an incomplete, potentially misleading picture.
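A toy Python example shows how the gap and the ratio can rank groups in opposite orders. The incomes below are hypothetical, chosen only to illustrate the effect; they are not the NYT figures.

```python
def income_gap(p90: float, p10: float) -> float:
    """Difference between the 90th- and 10th-percentile incomes."""
    return p90 - p10


def income_ratio(p90: float, p10: float) -> float:
    """Ratio of the 90th-percentile income to the 10th-percentile income."""
    return p90 / p10


# Hypothetical (p90, p10) incomes for two made-up groups.
groups = {
    "Group A": (120_000, 12_000),
    "Group B": (60_000, 5_000),
}

gaps = {name: income_gap(p90, p10) for name, (p90, p10) in groups.items()}
ratios = {name: income_ratio(p90, p10) for name, (p90, p10) in groups.items()}

for name in groups:
    print(f"{name}: gap ${gaps[name]:,.0f}, ratio {ratios[name]:.1f}")
```

Group A has the larger gap but Group B has the larger ratio. A chart scaled to dollar amounts favors the first reading, while the ratio exposes the second.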

***

The third chart in the series is a marvellous scatter plot (with one small snafu, which I'll get to).

Nyt_crazyrichasians_byethnicity

What are all the things one can learn from this chart?

  • There is, as expected, a strong correlation between having college degrees and earning higher salaries.
  • The Asian immigrant population is diverse, from the perspectives of both education attainment and median household income.
  • The largest source countries are China, India and the Philippines, followed by Korea and Vietnam.
  • The Indian immigrants are on average professionals with college degrees and high salaries, and form an outlier group among the subgroups.

Through careful design decisions, those points are clearly conveyed.

Here's the snafu. The designer forgot to say which year is being depicted. I suspect it is 2016.

Dating the data is very important here because of the following excerpt from the article:

Asian immigrants make up a less monolithic group than they once did. In 1970, Asian immigrants came mostly from East Asia, but South Asian immigrants are fueling the growth that makes Asian-Americans the fastest-expanding group in the country.

This means that a key driver of the rapid increase in income inequality among Asian-Americans is the shift in composition of the ethnicities. More and more South Asian arrivals (most of whom are Indians) push up the education attainment and household income of the average Asian-American. Not only are Indians becoming more numerous, but they are also richer.

An alternative design is to show two bubbles per ethnicity (one for 1970, one for 2016). To reduce clutter, the smaller ethnicities can be aggregated into Other or South Asian Other. This chart may help explain the driver behind the jump in income inequality.


Two views of earthquake occurrence in the Bay Area

This article has a nice description of earthquake occurrence in the San Francisco Bay Area. A few quantities are of interest: when the next quake occurs, the size of the quake, the epicenter of the quake, etc. The data graphic included in the article fails the self-sufficiency test: the only way to read this chart is to read out the entire data set - in other words, the graphical details have no utility.

Earthquake-probability-chart

The article points out the clustering of earthquakes. In particular, there is a 68-year "quiet period" between 1911 and 1979, during which no quakes over 6.0 in size occurred. The author appears to have classified quakes into three groups: "Largest" which are those at 6.5 or over; "Smaller but damaging" which are those between 6.0 and 6.5; and those below 6.0 (not shown).

For a more standard and more effective visualization of this dataset, see this post on a related chart (about avian flu outbreaks). The post discusses a bubble chart versus a column chart. I prefer the column chart.

image from junkcharts.typepad.com

This chart focuses on the timing of rare events. The time between events is not as easy to see. 

What if we want to focus on the "quiet years" between earthquakes? Here is a visualization that addresses the question: when will the next one hit us?

Redo_jc_earthquakeprobability


Big Macs in Switzerland are amazing, according to my friend

Bigmac_ch

Note for those in or near Zurich: I'm giving a Keynote Speech tomorrow morning at the Swiss Statistics Meeting (link). Here is the abstract:

The best and the worst of data visualization share something in common: these graphics provoke emotions. In this talk, I connect the emotional response of readers of data graphics to the design choices made by their creators. Using a plethora of examples, collected over a dozen years of writing online dataviz criticism, I discuss how some design choices generate negative emotions such as confusion and disbelief while other choices elicit positive feelings including pleasure and eureka. Important design choices include how much data to show; which data to highlight, hide or smudge; what research question to address; whether to introduce imagery, or playfulness; and so on. Examples extend from graphics in print, to online interactive graphics, to visual experiences in society.

***

The Big Mac index seems to never want to go away. Here is the latest graphic from the Economist, saying what it says:

Econ_bigmacindex

The index never made much sense to me. I'm in Switzerland, and everything here is expensive. My friend, who is a U.S. transplant, seems to have adopted McDonald's as his main eating-out venue. Online reviews indicate that the quality of the burger served in Switzerland is much better than the same thing in the States. So, part of the price differential can be explained by quality. The index also confounds several other issues, such as local inflation and exchange rates.

Now, on to the data visualization, which is primarily an exercise in rolling one's eyeballs. In order to understand the red and blue line segments, our eyes have to hop over the price bubbles to the top of the page. Then, in order to understand the vertical axis labels, unconventionally placed on the right side, our eyes have to zoom over to the left of the page, and search for the line below the header of the graph. Next, if we want to know about a particular country, our eyes must turn sideways and scan from bottom up.

Here is a different take on the same data:

Redo_jc_econbigmac2018

I transformed the data as I don't find it compelling to learn that Russian Big Macs are 60% less than American Big Macs. Instead, on my chart, the reader learns that the price paid for a U.S. Big Mac will buy him/her almost 2 and a half Big Macs in Russia.

The arrows pointing left indicate that in most countries, the values of their currencies are declining relative to the dollar from 2017 to 2018 (at least from the Big Mac Index point of view). The only exception is Turkey, where in 2018, the price paid for one U.S. Big Mac buys more Big Macs than it did in 2017.

The decimal differences are immaterial so I have grouped the countries by half Big Macs.
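For those curious about the transformation, here is a minimal Python sketch. It converts "X% cheaper than the U.S." into "number of local Big Macs bought with the price of one U.S. Big Mac", then rounds to the nearest half. The rounding rule is my assumption about how one might do the grouping, and the function names are made up for this illustration.

```python
def big_macs_per_us_price(pct_cheaper: float) -> float:
    """Local Big Macs bought with the price of one U.S. Big Mac.

    `pct_cheaper` is how much cheaper the local Big Mac is, in percent.
    """
    return 1 / (1 - pct_cheaper / 100)


def nearest_half(x: float) -> float:
    """Round to the nearest half Big Mac (assumed grouping rule)."""
    return round(x * 2) / 2


# A Big Mac that is 60% cheaper than the U.S. one.
russia_like = big_macs_per_us_price(60)
print(f"{russia_like:.2f} Big Macs, grouped as {nearest_half(russia_like)}")
```

The reciprocal is what turns a discount percentage into a count of burgers, which I find a more tangible quantity for the reader.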

This example demonstrates yet again that, to make good data visualization, one has to describe an interesting question, make appropriate transformations of the data, and then choose the right visual form. I describe this framework as the Trifecta - a guide to it is here.

(P.S. I noticed that Bitly just decided unilaterally to deactivate my customized Bitly link that was configured years and years ago, when it switched design (?). So I had to re-create the custom link. I have never grasped why "unreliability" is a feature of the offering by most Tech companies.)


Two thousand five hundred ways to say the same thing

Wallethub published a credit card debt study, which includes the following map:

Wallethub_creditcardpaydownbyCity

Let's describe what's going on here.

The map plots cities (N = 2,562) in the U.S. Each city is represented by a bubble. The color of the bubble ranges from purple to green, encoding the percentile ranking based on the amount of credit card debt that was paid down by consumers. Purple represents 1st percentile, the lowest amount of paydown while green represents 99th percentile, the highest amount of paydown.

The bubble size is encoding exactly the same data, apparently in a coarser gradation. The more purple the color, the smaller the bubble. The more green the color, the larger the bubble.

***

The design decisions are baffling.

Purple is more noticeable than the green, but signifies the less important cities, with the lesser paydowns.

With over 2,500 bubbles crowding onto the map, over-plotting is inevitable. The purple bubbles are printed last, dominating the attention but those are the least important cities (1st percentile). The green bubbles, despite being larger, lie underneath the smaller, purple bubbles.

What might be the message of this chart? Our best guess is: the map explores the regional variation in the paydown rate of credit card debt.

The analyst provides all the data beneath the map. 

Wallethub_paydownbyCity_data

From this table, we learn that the ranking is not based on total amount of debt paydown, but the amount of paydown per household in each city (last column). That makes sense.

Shouldn't it be ranked by the paydown rate instead of the per-household number? Dividing the "Total Credit Card Paydown by City" by "Total Credit Card Debt Q1 2018" should yield the paydown rate. Surprise! This formula yields a column entirely consisting of 4.16%.

What does this mean? They applied the national paydown rate of 4.16% to every one of 2,562 cities in the country. If they had plotted the paydown rate, every city would attain the same color. To create "variability," they plotted the per-household debt paydown amount. Said differently, the color scale encodes not credit card paydown as asserted but amount of credit card debt per household by city.

Here is a scatter plot of the credit card amount against the paydown amount.

Redo_creditcardpaydown_scatter

A perfect alignment!
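The check is easy to replicate. Below is a minimal Python sketch using invented city debt totals (not the Wallethub table), with paydowns generated by applying the national 4.16% rate to each city, which is what the published numbers imply.

```python
NATIONAL_PAYDOWN_RATE = 0.0416  # the 4.16% found in the data

# Invented total credit card debt by city, in dollars (not Wallethub's data).
debts = {
    "City A": 1_000_000,
    "City B": 250_000_000,
    "City C": 40_000_000,
}

# Paydown generated the way the table appears to be built: rate times debt.
paydowns = {city: debt * NATIONAL_PAYDOWN_RATE for city, debt in debts.items()}

# Recover the paydown rate for each city, in percent.
rates = {city: round(paydowns[city] / debts[city] * 100, 2) for city in debts}

for city, rate in rates.items():
    print(f"{city}: {rate}%")
```

Every city lands on the same rate, so the paydown amount is just a rescaled copy of the debt amount. That is why the scatter plot falls on a straight line.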

This credit card debt paydown map is an example of a QDV chart, in which there isn't a clear question, there is almost no data, and the visual contains several flaws. (See our Trifecta checkup guide.) We are presented 2,562 ways of saying the same thing: 4.16%.

P.S. [6/22/2018] Added scatter plot, and cleaned up some language.