Long before I came up with "numbersense," I wrote about "true lies" in data analysis. (link)
The nature of data, especially Big (as in multidimensional) Data, is that one can come up with an infinite number of statistical computations, all of which are "true" in the sense that one would obtain such statistics were one to plug the data into textbook formulas. Inevitably, some of these statistics lead to contradictions.
An example I give in the Prologue of Numbersense (link) is a case of Simpson's Paradox. There are two ways to compare two airlines' rate of delays during a given window of time at a common set of arriving airports. One can aggregate the number of flights across all airports, then compare the average rate of delay. Alternatively, one can compute a pair of delay rates for each airport, then compare the rates by airport. In the example given in the book, airline A came out ahead in the aggregate measure but was the more delayed at each of the individual airports. This is an instance of "true lies". Airline A is either better or worse, not both. Given that answer, one of the two methods leads to the wrong interpretation. But one cannot complain that there is anything wrong with the data or either formula used to compute the average.
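To see how both computations can be "true" at once, here is a minimal sketch in Python. The flight counts are made up for illustration and are not the figures from the book; the airport names X and Y and the helper functions are likewise hypothetical. The counts are chosen so that airline A flies mostly into the low-delay airport while airline B flies mostly into the high-delay one.

```python
# Hypothetical counts (NOT the book's data): airline -> airport -> (flights, delayed).
flights = {
    "A": {"X": (900, 100), "Y": (100, 30)},
    "B": {"X": (100, 10), "Y": (900, 250)},
}


def aggregate_rate(airline):
    """Delay rate after pooling all airports together."""
    total = sum(n for n, _ in flights[airline].values())
    delayed = sum(d for _, d in flights[airline].values())
    return delayed / total


def airport_rate(airline, airport):
    """Delay rate at a single airport."""
    n, d = flights[airline][airport]
    return d / n


for airline in flights:
    print(airline, "aggregate:", round(aggregate_rate(airline), 3))
    for airport in flights[airline]:
        print("  ", airport, round(airport_rate(airline, airport), 3))
```

With these counts, A has the lower delay rate in the aggregate but the higher delay rate at each airport. Neither formula is wrong; the reversal comes from the two airlines having very different mixes of airports.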
***
I was thinking about "true lies" while reading the exchange between Alberto Cairo and Andy Kirk about the following chart, which presents a "truth": half of US economic activity occurs in major urban areas that make up a tiny proportion of US territory.
Cairo complains that this chart is silly because about half the US population lives in those orange urban areas. In reality, anyone who accepts the meme that this map offers "incredible" insight is just surprised that half the US population lives in major urban areas.
Kirk, who said he retweeted this map, wants us to stop whining:
I get that GDP is essentially a proxy indicator for where people are living yet I still have a novel interest in learning about the dynamics of the US. I *know* that there is not a uniform distribution of where people live (nowhere on earth has this) but it is still revealing for me to see anything that represents a proxy of this skewed population. I don’t think the map claims to be doing anything different to this so, in that sense, it doesn’t mislead or make false claims.
I will be writing on my other blog about the educational aspect of a chart like this, which is the other prong of Kirk's argument. That last sentence, which I bolded, strikes me as the argument that the true lie is true and therefore is beyond reproach. This is a crucial difference between doing statistics and doing pure math. In statistics, you can't win arguments by invoking the truth... if the truth were knowable, statisticians would all be unemployed.
The map does not make false claims but it leads readers to the conclusion that the orange areas are much more important than the blue region (equal economic activity but much smaller area). The first problem is that the types of economic activities are vastly different between those regions, and this significant factor is ignored.
The second problem is that the designer over-aggregated the data. All counties (or zip codes) are classified into two groups ("split in half") when in fact, the level of economic activity at the level of counties (or zip codes) is a gradient. Imagine plotting the economic activity index by county, ordered from the highest to the lowest. Do we see a dramatic drop-off after counting out half the counties (i.e., the pattern shown on the left chart below)? Or are we more likely to see the pattern shown on the right? If you see a distribution like the one shown on the right, would you summarize that with just two segments?
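The thought experiment above can be sketched in Python. This is only an illustration under stated assumptions: the two lists are synthetic stand-ins for a county-level activity index (not real GDP data), all values are assumed positive, and the function names are made up. The idea is to rank the units from highest to lowest, find how many of the top units account for half the total, and then check how sharp the drop is at that cutoff.

```python
def cutoff_for_half(values):
    """Number of top-ranked units needed to reach half of the total."""
    ordered = sorted(values, reverse=True)
    total = sum(ordered)
    running = 0
    for count, value in enumerate(ordered, start=1):
        running += value
        if 2 * running >= total:
            return count
    return len(ordered)


def drop_at_cutoff(values):
    """Ratio of the last unit inside the top half to the first unit outside it.

    Assumes at least two units and strictly positive values.
    """
    ordered = sorted(values, reverse=True)
    k = cutoff_for_half(values)
    return ordered[k - 1] / ordered[k]


# Synthetic pattern 1: a genuine two-group split (sharp drop-off).
step = [100] * 5 + [10] * 50
# Synthetic pattern 2: a smooth gradient (each unit 10% below the previous one).
gradient = [100 * 0.9 ** i for i in range(55)]

for name, values in [("step", step), ("gradient", gradient)]:
    print(name, cutoff_for_half(values), round(drop_at_cutoff(values), 2))
```

In both synthetic cases a small number of units accounts for half the total, so a two-color map would look equally striking for either. Only the ratio at the cutoff tells the patterns apart: it is large for the step pattern and close to 1 for the gradient, where the "split in half" boundary is arbitrary.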
***
Cairo's general point is that good data visualizations require good data analyses. In turn, good data analysis requires numbersense.
***
Chapter 3 of Numbers Rule Your World (link) explores the question of aggregating data, which is central to statistical thinking. Aggregation features throughout Numbersense (link), particularly in Chapter 1 (school rankings), the chapters on economic statistics, and the chapter on fantasy football.
Also, you can learn statistical concepts from me at NYU. New course starting first week of March. More information here.
The post on Junk Charts about this map is here.
Kaiser, just a quick comment because I don't feel you've necessarily captured the essence of my post "Defending the ‘Incredible GDP Map'" which is slowly evolving into "Erecting a lightning rod for disagreements about the 'Incredible GDP Map'" :)
Firstly, I've never asked anyone to stop whining! Not sure where you get that from in this piece. Far from it. Indeed I don't think anyone has yet 'whined' about this graphic to be necessarily asked to stop whining - though I'm starting to feel the sentiment growing...
A correction of the actual main argument I put forward in that piece (signposted by the phrase in the post of 'back to my main argument'): one person’s ‘interesting’ is another person’s ‘knew it’. I wanted to urge recognition that just because something is not surprising to one person doesn't mean it isn't to another.
Inevitably there is this unbreakable connection between the claim on the original tweet of 'incredible' and an implied endorsement. I wanted to make the point that something doesn't have to be 'incredible' or 'surprising' to make it resonate/connect/impact on some level. I found the graphic interesting. Nothing more. I looked at it for about 8 seconds and found the contrast between the two areas interesting given my reasonably uninformed non-US perspective on the geographical, population and economic dynamics of the US. So I retweeted it.
I didn't feel it misled me because I didn't feel engaged with it for long enough or deeply enough to necessarily be 'led' anywhere other than basically seeing a quickly thrown together view that (in my mind at the time) probably just acted as a proxy for where people probably lived. I had a very casual acquaintance and expectation of its underlying statistical rigour because, for me, it just about ticked the box 'do I get a gist of the situation?'. That was all I sought and all I came away with.
This is something that I will be covering in a follow up post if I get chance this week.
Posted by: Andy Kirk | 02/24/2014 at 10:16 AM
[Submitted and then decided I need to add...]
I do completely recognise the statistical perspectives you raise through this post and do like the analogy of the true lie.
Any apparent weariness I might now exhibit about discussing the issue is more to do with my frustration at not getting all my points out in the original post, which was rushed by a train arriving at its destination :)
Posted by: Andy Kirk | 02/24/2014 at 10:31 AM