The rule governing which variable to put on which axis, served a la mode

When making a scatter plot, the two variables should not be placed arbitrarily. There is a rule governing this: the outcome variable should be shown on the vertical axis (also called y-axis), and the explanatory variable on the horizontal (or x-) axis.

This chart from the archives of the Economist has this reversed:

20160402_WOC883_icecream_PISA

The title of the accompanying article is "Ice Cream and IQ"...

In a Trifecta Checkup (link), it's a Type DV chart. It's preposterous to claim eating ice cream makes one smarter without more careful studies. The chart also carries the xyopia fallacy: by showing just two variables, readers are unwittingly led to explain differences in "IQ" using differences in per-capita ice-cream consumption when lots of other stronger variables will explain any gaps in IQ.

In this post, I put aside my objections to the analysis, and focus on the issue of assigning variables to axes. Notice that this chart reverses the convention: the outcome variable (IQ) is shown on the horizontal, and the explanatory variable (ice cream) is shown on the vertical.

Here is a reconstruction of the above chart, showing only the dots that were labeled with country names. I fitted a straight regression line instead of a curve. (I don't understand why the red line in the original chart bends upwards when the data for Japan, South Korea, Singapore and Hong Kong should be dragging it down.)

Redo_econ_icecreamIQ_1A

Note that the interpretation of the regression line raises eyebrows because the presumed causality is reversed. For each 50 points increase in PISA score (IQ), this line says to expect ice cream consumption to raise by about 1-2 liters per person per year. So higher IQ makes people eat more ice cream.

***

If the convention is respected, then the following scatter plot results:

Redo_econ_icecreamIQ_2

The first thing to note is that the regression analysis is different here from that shown in the previous chart. The blue regression line is not equivalent to the black regression line from the previous chart. You cannot reverse the roles of the x and y variables in a regression analysis, and so neither should you reverse the roles of the x and y variables in a scatter plot.

The blue regression line can be interpreted as having two sections, roughly, for countries consuming more than or less than 6 liters of ice cream per person per year. In the less-ice-cream countries, the correlation between ice cream and IQ is stronger (I don't endorse the causal interpretation of this statement).

***

When you make a scatter plot, you have two variables for which you want to analyze their correlation. In most cases, you are exploring a cause-effect relationship.

Higher income households cares more on politics.
Less educated citizens are more likely to not register to vote.
Companies with more diverse workforce has better business performance.

Frequently, the reverse correlation does not admit a causal interpretation:

Caring more about politics does not make one richer.
Not registering to vote does not make one less educated.
Making more profits does not lead to more diversity in hiring.

In each of these examples, it's clear that one variable is the outcome, the other variable is the explanatory factor. Always put the outcome in the vertical axis, and the explanation in the horizontal axis.

The justification is scientific. If you are going to add a regression line (what Excel calls a "trendline"), you must follow this convention, otherwise, your regression analysis will yield the wrong result, with an absurd interpretation!

 

[PS. 11/3/2019: The comments below contain different theories that link the two variables, including theories that treat PISA score ("IQ") as the explanatory variable and ice cream consumption as the outcome. Also, I elaborated that the rule does not dictate which variable is the outcome - the designer effectively signals to the reader which variable is regarded as the outcome by placing it in the vertical axis.]


Pulling the multi-national story out, step by step

Reader Aleksander B. found this Economist chart difficult to understand.

Redo_multinat_1

Given the chart title, the reader is looking for a story about multinationals producing lower return on equity than local firms. The first item displayed indicates that multinationals out-performed local firms in the technology sector.

The pie charts on the right column provide additional information about the share of each sector by the type of firms. Is there a correlation between the share of multinationals, and their performance differential relative to local firms?

***

We can clean up the presentation. The first changes include using dots in place of pipes, removing the vertical gridlines, and pushing the zero line to the background:

Redo_multinat_2

The horizontal gridlines attached to the zero line can also be removed:

Redo_multinat_3

Now, we re-order the rows. Start with the aggregate "All sectors". Then, order sectors from the largest under-performance by multinationals to the smallest.

Redo_multinat_4

The pie charts focus only on the share of multinationals. Taking away the remainders speeds up our perception:

Redo_multinat_5

Help the reader understand the data by dividing the sectors into groups, organized by the performance differential:

Redo_multinat_6

For what it's worth, re-sort the sectors from largest to smallest share of multinationals:

Redo_multinat_7

Having created groups of sectors by share of multinationals, I simplify further by showing the average pie chart within each group:

Redo_multinat_8

***

To recap all the edits, here is an animated gif: (if it doesn't play automatically, click on it)

Redo_junkcharts_econmultinat

***

Judging from the last graphic, I am not sure there is much correlation between share of multinationals and the performance differentials. It's interesting that in aggregate, local firms and multinationals performed the same. The average hides the variability by sector: in some sectors, local firms out-performed multinationals, as the original chart title asserted.


Five steps to let the young ones shine

Knife stabbings are in the news in the U.K. and the Economist has a quartet of charts to illustrate what's going on.

Economist_20190309_WOC479

I'm going to focus on the chart on the bottom right. This shows the trend in hospital admissions due to stabbings in England from 2000 to 2018. The three lines show all ages, and two specific age groups: under 16 and 16-18.

The first edit I made was to spell out all years in four digits. For this chart, numbers like 15 and 18 can be confused with ages.

Redo_econ_ukknives_1

The next edit corrects an error in the subtitle. The reference year is not 2010 as those three lines don't cross 100. It appears that the reference year is 2000. Another reason to use four-digit years on the horizontal axis is to be consistent with the subtitle.

Redo_econ_ukknives_2

The next edit removes the black dot which draws attention to itself. The chart though is not about the year 2000, which has the least information since all data have been forced to 100.

Redo_econ_ukknives_3

The next edit makes the vertical axis easier to interpret. The indices 150, 200, are much better stated as + 50%, + 100%. The red line can be labeled "at 2000 level". One can even remove the subtitle 2000=100 if desired.

Redo_econ_ukknives_4

Finally, I surmise the message the designer wants to get across is the above-average jump in hospital admissions among children under 16 and 16 to 18. Therefore, the "All" line exists to provide context. Thus, I made it a dashed line pushing it to the background.

Redo_econ_ukknives_5

 

 

 


Big Macs in Switzerland are amazing, according to my friend

Bigmac_chNote for those in or near Zurich: I'm giving a Keynote Speech tomorrow morning at the Swiss Statistics Meeting (link). Here is the abstract:

The best and the worst of data visualization share something in common: these graphics provoke emotions. In this talk, I connect the emotional response of readers of data graphics to the design choices made by their creators. Using a plethora of examples, collected over a dozen years of writing online dataviz criticism, I discuss how some design choices generate negative emotions such as confusion and disbelief while other choices elicit positive feelings including pleasure and eureka. Important design choices include how much data to show; which data to highlight, hide or smudge; what research question to address; whether to introduce imagery, or playfulness; and so on. Examples extend from graphics in print, to online interactive graphics, to visual experiences in society.

***

The Big Mac index seems to never want to go away. Here is the latest graphic from the Economist, saying what it says:

Econ_bigmacindex

The index never made much sense to me. I'm in Switzerland, and everything here is expensive. My friend, who is a U.S. transplant, seems to have adopted McDonald's as his main eating-out venue. Online reviews indicate that the quality of the burger served in Switzerland is much better than the same thing in the States. So, part of the price differential can be explained by quality. The index also confounds several other issues, such as local inflation and exchange rate

Now, on to the data visualization, which is primarily an exercise in rolling one's eyeballs. In order to understand the red and blue line segments, our eyes have to hop over the price bubbles to the top of the page. Then, in order to understand the vertical axis labels, unconventionally placed on the right side, our eyes have to zoom over to the left of the page, and search for the line below the header of the graph. Next, if we want to know about a particular country, our eyes must turn sideways and scan from bottom up.

Here is a different take on the same data:

Redo_jc_econbigmac2018

I transformed the data as I don't find it compelling to learn that Russian Big Macs are 60% less than American Big Macs. Instead, on my chart, the reader learns that the price paid for a U.S. Big Mac will buy him/her almost 2 and a half Big Macs in Russia.

The arrows pointing left indicate that in most countries, the values of their currencies are declining relative to the dollar from 2017 to 2018 (at least by the Big Mac Index point of view). The only exception is Turkey, where in 2018, one can buy more Big Macs equivalent to the price paid for one U.S. Big Mac. compared to 2017.

The decimal differences are immaterial so I have grouped the countries by half Big Macs.

This example demonstrates yet again, to make good data visualization, one has to describe an interesting question, make appropriate transformations of the data, and then choose the right visual form. I describe this framework as the Trifecta - a guide to it is here.

(P.S. I noticed that Bitly just decided unilaterally to deactivate my customized Bitly link that was configured years and years ago, when it switched design (?). So I had to re-create the custom link. I have never grasped  why "unreliability" is a feature of the offering by most Tech companies.)


Is the chart answering your question? Excavating the excremental growth map

Economist_excrement_growthSan Franciscans are fed up with excremental growth. Understandably.

Here is how the Economist sees it - geographically speaking.

***

In the Trifecta Checkup analysis, one of the questions to ask is "What does the visual say?" and with respect to the question being asked.

The question is how much has the problem of human waste in SF grew from 2011 to 2017.

What does the visual say?

The number of complaints about human waste has increased from 2011 to 2014 to 2017.

The areas where there are complaints about human waste expanded.

The worst areas are around downtown, and that has not changed during this period of time.

***

Now, what does the visual not say?

Let's make a list:

  • How many complaints are there in total in any year?
  • How many complaints are there in each neighborhood in any year?
  • What's the growth rate in number of complaints, absolute or relative?
  • What proportion of complaints are found in the worst neighborhoods?
  • What proportion of the area is covered by the green dots on each map?
  • What's the growth in terms of proportion of areas covered by the green dots?
  • Does the density of green dots reflect density of human waste or density of human beings?
  • Does no green dot indicate no complaints or below the threshold of the color scale?

There's more:

  • Is the growth in complaints a result of more reporting or more human waste?
  • Is each complainant unique? Or do some people complain multiple times?
  • Does each piece of human waste lead to one and only one complaint? In other words, what is the relationship between the count of complaints and the count of human waste?
  • Is it easy to distinguish between human waste and animal waste?

And more:

  • Are all complaints about human waste valid? Does anyone verify complaints?
  • Are the plotted locations describing where the human waste is or where the complaint was made?
  • Can all complaints be treated identically as a count of one?
  • What is the per-capita rate of complaints?

In other words, the set of maps provides almost all no information about the excrement problem in San Francisco.

After you finish working, go back and ask what the visual is saying about the question you're trying to address!

 

As a reference, I found this map of the population density in San Francisco (link):

SFO_Population_Density

 


Digital revolution in China: two visual takes

The following map accompanied an article in the Economist about China's drive to create a "digital silkroad," roughly defined as making a Silicon Valley. 

Economist_digitalsilkroad

The two variables plotted are the wealth of each province (measured by GDP per capita) and the level of Internet penetration. The designer made the following choices:

  • GDP per capita is presented with less precision than Internet penetration. The former is grouped into five large categories while the latter is given as a percentage to one decimal place.
  • The visual design favors GDP per capita which is encoded as the shade of color of each province. The Internet penetration data appeared added on as an afterthought.

If we apply the self-sufficiency test (i.e. by removing the printed data from the chart), it's immediately clear that the visual elements convey zero information about Internet penetration. This is a serious problem for a chart about the "digital silkroad"!

***

If those two variables are chosen, it would seem appropriate to convey to readers the correlation between the two variables. The following sketch is focused on surfacing the correlation.

Redo_jc_china_digitalsilkroad2

(Click on the image to see it in full.) Here is the top of the graphic:

Redo_jc_china_digitalskilkroad_detail

The individual maps are not strictly necessary. Just placing provincial names onto the grid is enough, because regional pattern isn't salient here.

The Internet penetration data were grouped into five categories as well, putting it on equal footing as GDP per capita.

 


Diverging paths for rich and poor, infographically

Ray Vella (link) asked me to comment on a chart about regional wealth distribution, which I wrote about here. He also asked students in his NYU infographics class to create their own versions.

This effort caught my eye:

Nyu_redo_richpoor

This work is creative, and I like the concept of using two staircases to illustrate the diverging fortunes of the two groups. This is worlds away from the original Economist chart.

The infographic does have a serious problem. In one of my dataviz talks, I talk about three qualifications of work called "data visualization." The first qualification is that the data visualization has to display the data. This is an example of an infographic that is invariant to the data.

Is it possible to salvage the concept? I tried. Here is an idea:

Redo_econ_richpoor_infog2

I abandoned the time axis so the data plotted are only for 2015, and the countries are shown horizontally from most to least equal. I'm sure there are ways to do it even better.

Infographics can be done while respecting the data. Ray is one of the designers who appreciate this. And thanks Ray for letting me blog about this.

 

 

 


Fifty-nine intersections supporting forty dots of data

My friend Ray V. asked how this chart can be improved:

Econ_rv_therichgetsricher

Let's try to read this chart. The Economist is always the best at writing headlines, and this one is simple and to the point: the rich get richer. This is about inequality but not just inequality - the growth in inequality over time.

Each country has four dots, divided into two pairs. From the legend, we learn that the line represents the gap between the rich and the poor. But what is rich and what is poor? Looking at the sub-header, we learn that the population is divided by domicile, and the per-capita GDP of the poorest and richest regions are drawn. This is a indirect metric, and may or may not be good, depending on how many regions a country is divided into, the dispersion of incomes within each region, the distribution of population between regions, and so on.

Now, looking at the axis labels, it's pretty clear that the data depicted are not in dollars (or currency), despite the reference to GDP in the sub-header. The numbers represent indices, relative to the national average GDP per head. For many of the countries, the poorest region produces about half of the per-capita GDP as the richest region.

Back to the orginal question. A growing inequality would be represented by a longer line below a shorter line within each country. That is true in some of these countries. The exceptions are Sweden, Japan, South Korea.

***
It doesn't jump out that the key task requires comparing the lengths of the two lines. Another issue is the outdated convention of breaking up a line (Britian) when the line is of extreme length - particularly unwise given that the length of the line encodes the key metric in the chart.

Further, it has low data-ink ratio a la Tufte. The gridlines, reference lines, and data lines weave together in a complex pattern creating 59 intersections in a chart that contains only 40  36 numbers.

***

 I decided to compute a simpler metric - the ratio of rich to poor.  For example, in the UK, the richest area produces about 20 times as much GDP per capita as the poorest one in 2015.  That is easier to understand than an index to the average region.

I had fun making the following chart, although many standard forms like the Bumps chart (i.e. slopegraph) or paired columns and so on also work.

Redo_econ_jc_richgetricher

This chart is influenced by Ed Tufte, who spent a good number of pages in his first book advocating stripping even the standard column chart to its bare essence. The chart also acknowledges the power of design to draw attention.

 

 

PS. Sorry I counted incorrectly. The chart has 36 dots not 40. 


Confuse, confuses, confused, confusing

Via Twitter, @Stoltzmaniac sent me this chart, from the Economist (link to article):

Econ_vehicles

There is simply too much going on on the right side of the chart. The designer seems not to be able to decide which metric is more important, the cumulative growth rate of vehicles in use from 2005 to 2014, or the vehicles per 1,000 people in 2014. So both set of numbers are placed on the chart, regrettably in close proximity.

In the meantime, the other components of the chart, such as the gridlines and the red line indicating 2005 = 100 are only relevant to the cumulative vehicle growth metric. Perhaps noticing the imbalance, the designer then paints the other data series in rainbow-colored boxes, and prints the label for this data series in a big white box. This decision tilts the chart towards the vehicle per capita metric, as our eyes now cannot help but stare at the white box.

***

There are really three trends: the growth in population, the growth in vehicles, and the resultant growth in vehicle per capita. They are all be accommodated in a small-multiples setting, as follows:

Jc_econ_vehicles2

There are some curious angular trends revealed here. The German population somehow dipped into negative territory around 2007-8 but since then has turned around. Nigeria's vehicle growth declined sharply after 2006 so that the density of vehicles has stabilized.