A response to a tweet forwarded to me. The person tweeting complained that FiveThirtyEight uses charts that don’t start the vertical axis at zero. The example given was this:
In this post, I want to clear some confusion around the "start-at-zero" rule.
I highlighted the columns for 1993 and 1996. Visually, the height of one column is twice that of the other. And yet the axis labels tell us that the difference is 65% versus 62.5%. (The axis starts at 60%, so the 65% column is five units tall while the 62.5% column is only 2.5 units tall, which produces the two-to-one impression.)
The reason for the start-at-zero rule is to avoid exaggerating meaningless differences.
To judge whether a change is meaningful in time-series data like this, we have to use history to understand the general variability in college enrollment rates. Based on the roughly 20 years of data we can see, the college enrollment rate hovers between 60 and 70 percent. There is no data between 0 and 60 percent; those are irrelevant values for this series. This is why starting at zero is counterproductive.
Here is the line chart starting at zero:
This display has the unintended effect of squashing meaningful changes over time by inserting a lot of empty space below the line.
A column chart starting at zero looks like this:
This fixes the truncation problem of the column chart above, but it also squashes meaningful changes over time. A column chart is simply a poor choice for this dataset.
For those who don't like the line chart, consider using a dot plot:
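For concreteness, here is a minimal matplotlib sketch of such a dot plot. The enrollment rates below are made up (only eyeballed to match the 60-to-70-percent range discussed above); the point is the axis choice, not the numbers:

```python
import matplotlib.pyplot as plt

# Made-up enrollment rates (%), eyeballed for illustration only
years = list(range(1994, 2013))
rates = [61.9, 61.9, 62.5, 67.0, 65.6, 62.9, 63.3, 61.7, 65.2, 63.9,
         66.7, 68.6, 66.0, 67.2, 68.6, 70.1, 68.1, 68.2, 66.2]

fig, ax = plt.subplots(figsize=(8, 4))
ax.scatter(years, rates)
# Start the axis near the data, not at zero: the 0-60% region contains
# no data, and including it would squash the meaningful variation.
ax.set_ylim(55, 75)
ax.set_ylabel("College enrollment rate (%)")
plt.show()
```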
The associated article is here.
The question on the table is motivated by the extraordinary performance of the young baseball player Mike Trout. Early success can be interpreted either as evidence of future potential or as evidence of a future drought. As an analogy, suppose someone wins the lottery. You can argue that the odds are so low that winning again is impossible. Or you can argue that winning once indicates that this person is "lucky", and lucky people might win again.
The chart shows the proportion of players who performed even better after the initial success, given the age at which they first broke out. One way to read this chart is to mentally replace the bubbles with dots (or columns), and then interpret the size of each bubble as the statistical significance of the corresponding probability estimate. The legend says "number of players"; this is the sample size, which governs the error bar associated with that estimate.
This bubble chart is no different from others: it is impossible to judge the relative sizes of bubbles. Even though the legend provides us with two reference points (a nice enough idea on its own), it is still impossible to know, for example, what proportion of players did better later in life when they first peaked at age 24. The bubble for age 23 looks like exactly five players, but I still cannot figure out how many players the adjacent bubble represents.
The designer should have just replaced each bubble with an error bar, and the chart would instantly be more readable. (I have another version of this at the end of the post.)
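Here is a sketch of that fix, with made-up counts (the real counts are hard to read off the bubbles, which is rather the point). Each proportion gets a standard error bar of sqrt(p(1-p)/n), so the sample size shows up as uncertainty rather than as bubble area:

```python
import math
import matplotlib.pyplot as plt

# Made-up (age, improved, total) triples -- for illustration only
data = [(20, 2, 2), (21, 3, 5), (22, 0, 4), (23, 2, 5),
        (24, 1, 3), (25, 1, 4), (26, 1, 3), (27, 0, 2)]

ages = [a for a, _, _ in data]
props = [k / n for _, k, n in data]
# Standard error of a proportion: sqrt(p(1-p)/n). Note it collapses
# to zero at 0% and 100%; a Wilson interval would behave better there.
errs = [math.sqrt(p * (1 - p) / n) for p, (_, _, n) in zip(props, data)]

fig, ax = plt.subplots()
ax.errorbar(ages, props, yerr=errs, fmt="o", capsize=4)
ax.set_xlabel("Age at first peak")
ax.set_ylabel("Proportion doing better later")
ax.set_ylim(-0.05, 1.05)
plt.show()
```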
The rest of the design elements are clean and well done, particularly the use of notes to point out interesting aspects of the data.
From a Trifecta checkup perspective, I am uncertain about the nature of the data used to investigate the interesting question posed above.
Readers should note that the concepts of "early success" and "later success" are not universally defined. The author here selects two proxies. Reaching an early peak is equated with "batters first posting 15+ WAR over two seasons". Next, reversion to the mean is defined as not having a better two-year span subsequent to that early peak.
Why two seasons? Why WAR and not a different metric? Why 15 as the cutoff? These are all design decisions made while working with the data.
One can make reasonable arguments to justify those design decisions. A bigger head-scratcher relates to the horizontal axis, which identifies the first time a player reaches his "early peak," as defined above. The way the chart is set up, it is almost preordained to exhibit a negative slope. The older a player is when he reaches his first peak, the fewer years are left in his playing career to emulate or surpass that feat.
This last point is nicely illustrated in the next chart of the article:
This chart is excellent on many levels. It's not clear, though, whether it tells us anything beyond the effect of aging.
Near the end of the post, the author rightfully pointed out that "there’s not really enough data to demonstrate this effect". Going back to the first chart, it appears that no single bubble contains a double-digit count of players. So every sample size is between one and, say, seven. We should be wary of conclusions based on so little data.
It's always fun to find examples of the Law of Small Numbers, courtesy of Kahneman & Tversky.
Here is a sketch of how I might re-make the first chart (I made up data; see the note below).
While making this chart, I realized another issue with the original bubble chart. When the proportion of players improving on their early peak is zero percent, the number of players who failed to improve is hidden. In the revised chart, this data is clearly seen (look at age 22).
Note: I wonder if I totally missed the point of the original chart.... I actually had trouble eyeballing the data so I ended up making up numbers. The bubble at age 22 looks like it should stand for 5 players and yet it sits at precisely 50%, which would map to 2.5 players. If I assume the 22 bubble to be 4 players, then I don't know what the 26 bubble is. If it is 4 players also, then the minimum non-zero proportion should have been 1/4, but the bubble clearly lies below 25%. If it is 3 players, the minimum non-zero proportion is 1/3, which should be at 33%.
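This kind of reverse-engineering can be mechanized. A small sketch, assuming (as the article suggests) that every sample size is seven or fewer, enumerates which (improved, total) pairs are consistent with a proportion read off the chart:

```python
# For a proportion eyeballed from the chart (within some tolerance),
# list the small-sample (k, n) pairs that could have produced it.
def feasible_counts(prop, max_n=7, tol=0.02):
    return [(k, n)
            for n in range(1, max_n + 1)
            for k in range(n + 1)
            if abs(k / n - prop) <= tol]

# The age-22 bubble sits at exactly 50%, so n must be even:
print(feasible_counts(0.50))  # [(1, 2), (2, 4), (3, 6)]
```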
Today's post examines an example of Big Data analyses, submitted by a reader, Daniel T. The link to the analysis is here. (On the sister blog, I discussed the nature of this type of analysis. This post concerns the graphical element.)
The analyst looked at "the influence of operating systems and device types on hourly usage behavior". This dataset satisfies four of the five characteristics in the OCCAM framework (link).
Observational: the data are ad impressions from the Chitika Ad Network, observed between February 26 and March 11, 2014. This means users are (unwittingly) being tracked by cookies, pixels, or some other form of tracking. The analyst did not plan this study and then collect the data.
Lacking Controls: There will be a time trend but what should we compare against? How do we know if something is out of the ordinary or not?
Seemingly Complete: Right up top, we are impressed with the use of "a sample of tens of millions of device-specific online ad impressions". At least they understand this is a sample, not everything.
Adapted: All weblog data are adapted, in the sense that web logs originally serve web developers interested in debugging their code. Operating systems and device types are tracked because each variant of OS and device requires customization, and that data is needed to understand how webpages render differently. I wrote about the adaptedness of this data in a separate blog post. (link)
The analysis did not require merging data, the fifth element of the framework.
Here is the chart type used to present the analysis. There are many problems.
The conclusion the analyst drew from the above chart is: "North American Android users are more active than their iOS counterparts late at night and during the majority of the workday." In other words, the analyst points out that the blue line sits on top of the orange line during certain times of the day.
Daniel is very annoyed with the way the data is processed, and rightfully so. The chart actually does not say what it appears to say. This is because of the use of indexing.
This simple chart is not so simple to interpret!
This is because each line is "indexed to self". For example, at 12 pm EST, Android users are at 75% of their peak-hour usage while iOS users are at 2/3 of their peak-hour usage. The trouble is the peak-hour usage by iOS users is more than 2.5 times as high as the peak-hour usage of Android users, so 100% blue is less than half of 100% orange by count.
Later in the same post, the analyst re-indexed both series to the iOS peak. This chart tells us that iOS users are more active no matter what time of the day.
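A small sketch with made-up hourly counts shows how much the indexing choice matters. Everything below is hypothetical:

```python
# Made-up hourly ad impressions (thousands) for two platforms
android = [40, 30, 20, 35, 60, 75, 80, 70]
ios     = [90, 70, 55, 95, 130, 190, 200, 180]

# Index-to-self: each series divided by its own peak
android_self = [x / max(android) for x in android]
ios_self     = [x / max(ios) for x in ios]

# Common index: both series divided by the same (iOS) peak
android_common = [x / max(ios) for x in android]
ios_common     = [x / max(ios) for x in ios]

hour = 4
print(android_self[hour], ios_self[hour])      # 0.75 vs 0.65: Android "leads"
print(android_common[hour], ios_common[hour])  # 0.30 vs 0.65: iOS dominates
```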
The Chitika analyst is not doing anything unusual. This type of indexing is a pandemic in Web analytics. The worst thing about it is that a lot of Web data is long-tailed, and the maximum value is an outlier. Indexing data to an outlier isn't wise. (Usually, the index is used to hide the actual values of the data, often to keep company secrets. But there are better ways to accomplish this.)
Digging a little deeper, we've got to note other key assumptions that the analyst must have made in producing this analysis -- and about which we are in the dark.
Are users with both Apple and Microsoft devices counted on both blue and orange lines?
How is "volume" of Web usage determined? Is it strictly number of ad impressions?
Why is total volume displayed? If Microsoft PCs dominate Macs, and the chart shows the PC line well above the Mac line, is it speaking to market share or is it speaking to usage patterns of the average user?
How representative is the traffic in the Chitika network?
How did the analyst deal with bot traffic?
Finally, using EST (Eastern Standard Time) rather than local time is silly. Think of it this way: if you extract only New York and California users, and compare their curves, without even looking at the data, you can surmise that you will see a similar shape but time-shifted by approximately three hours. Ignoring time difference leads to silly statements like this: "Both sets of users are most active during the workday, with usage volume dropping off in the late evening/early morning."
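If per-impression timestamps and user locations are available, the fix is mechanical. A pandas sketch, with hypothetical column names of my own invention:

```python
import pandas as pd

# Hypothetical impressions table: one row per ad impression
df = pd.DataFrame({
    "ts_utc": pd.to_datetime(["2014-03-01 17:00", "2014-03-01 17:00"], utc=True),
    "tz": ["America/New_York", "America/Los_Angeles"],
})

# Convert each impression to the user's local hour before aggregating,
# so that "9 am" means 9 am for everyone
df["local_hour"] = [ts.tz_convert(tz).hour
                    for ts, tz in zip(df["ts_utc"], df["tz"])]
print(df["local_hour"].tolist())  # [12, 9]: same instant, different local hours
```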
On the sister blog, I wrote about a new report on the music industry lamenting that the hype over "Long Tail" retail has not really helped small artists (as a group). This was a tip sent by reader Patrick S. He was rightfully unhappy about the chart that was included in this summary of the report.
This classic Excel chart has some basic construction issues:
In terms of the Trifecta checkup, the biggest problem is the misalignment between the intended message and the chart's message. If you read the report, you'd learn that one of its key findings is that the top 1% (superstar) artists continue to earn ~75 percent of total income, and this distribution has not changed noticeably despite the Long Tail phenomenon.
But what is the chart's message? The first and most easily read trend is the fall in total income in the last 12-13 years. And it's a drastic drop of about $1 billion, almost 25 percent. Everything else is hard to compute on this stacked column chart. For example, the decline in the gray parts is even more drastic than the decline in the blue.
It is also challenging to estimate the proportions from these absolute amounts. Recognizing this, the designer added the proportions as text, but only for the most recent year.
So we have identified two interesting stories, one about the decline in total income and the other about the unending dominance of the 1 percent. This is where the designer has to set priorities. Given that the latter message is the headline of the report, it is better to plot the proportions directly, while hiding the story about total income. The published chart has the priorities reversed. Even though you can find both messages on the same chart, it is still not a good idea to highlight your lesser message.
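Plotting the proportions directly is a one-line transformation: divide each segment by the column total. A sketch with made-up income figures (the real ones would come from the report):

```python
import pandas as pd

# Made-up income ($bn) by artist tier -- for illustration only
df = pd.DataFrame({
    "year":    [2000, 2005, 2010, 2013],
    "top1pct": [3.0, 2.6, 2.4, 2.3],
    "rest":    [1.1, 1.0, 0.9, 0.8],
})

# Divide each tier by the yearly total to get shares
df["top1pct_share"] = df["top1pct"] / (df["top1pct"] + df["rest"])

# The headline message -- the 1% share barely moves -- now reads off directly
print(df[["year", "top1pct_share"]].round(2))
```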
A Twitter follower, @mdjoner, felt that something was amiss with the squares in this chart comparing real estate prices in major cities around the world. I'm not sure where the chart originally came from, but there is a CNBC icon.
There is one thing I really like about the chart, which is the metric that has been selected. The original data is likely to be price per square metre for luxury property in various places. The designer turned this around and computed the size of what you can buy assuming you spend $1 million. I think we have a better ability to judge areas than dollars.
The notion of floor area meshes well with the area on a chart, so there is an intuitive appeal as well.
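The inversion is just a division: area = budget / price. A sketch with made-up prices (the actual figures would come from the original data):

```python
# Made-up luxury prices (USD per square metre) -- illustration only
price_per_sqm = {"Monaco": 60_000, "Hong Kong": 50_000,
                 "London": 35_000, "New York": 25_000}

budget = 1_000_000
# Invert dollars-per-area into area-per-budget
for city, price in price_per_sqm.items():
    print(f"{city}: {budget / price:.0f} sq m for $1 million")
```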
So in the Trifecta checkup, they did well posing an interesting question, and picking some data. But like Mike, I'm not excited about the graphical construct.
There are a few problems with this chart:
Here is an alternative display of the data:
Notice that I designed this for an American audience; I'd change certain decisions for a non-American reader. I chose New York as the focal point and split the cities into two parts: on the left are the cities less expensive than New York, and on the right are those more expensive.
Also, along the bottom, I provide some clues to help people bridge the gap between the areas shown on the graphic, and real-life areas. For example, the orange square represents 400 square feet but without the annotation telling you it's about the size of a typical Manhattan studio, you may not know how to map the size of the orange square to your perception of real spaces. I also included images (although if I'm publishing this, I'd want better ones).
Finally, note that the actual data are not printed on my version of the chart.
My twitter followers have been sending in several howlers.
Twitter (link) made a bunch of bold claims about its own influence, using the number of tweets about the Oscars as fodder. It also adopts a euphemism common to the digital marketing universe, the so-called "view", which, to its credit, it defines as "how many times tweets are displayed to users". Yes, you read that right: in this world, displaying is the same as viewing, and Twitter is a follower here, not a trendsetter.
Both designers basically appropriated a graphical form and deprived it of data. In one, the designer threw the concept of scale to the wind. In the other, the designer dumped the law of total probability. In either case, the fundamental rationale for the particular graphical form is sacrificed.
Both are examples that fail our self-sufficiency test. This test says if a visual display cannot be understood unless the entire data set is printed on the chart, then why create a visual display? In both charts, if you block out the numbers, you are left with nothing!
The PWC chart was submitted by @graphomate, who also submitted the following KPMG chart:
The complaint was that the total adds up to 101%. I'm not really bothered by this, as it is a rounding issue. That said, I like to "hide" such rounding issues; I have never understood why it is necessary to display the imperfection. Flip a coin and drop the extra percentage point from one of the categories!
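A more systematic version of the coin flip is largest-remainder rounding: round every category down, then hand the leftover percentage points to the categories with the largest fractional parts, so the display always sums to exactly 100. A sketch:

```python
import math

def round_to_100(values):
    """Largest-remainder rounding: integer percentages summing to 100."""
    floors = [math.floor(v) for v in values]
    # Indices sorted by fractional remainder, largest first
    order = sorted(range(len(values)),
                   key=lambda i: values[i] - floors[i], reverse=True)
    for i in order[:100 - sum(floors)]:
        floors[i] += 1
    return floors

# Naive rounding of these shares gives 41 + 31 + 29 = 101
print(round_to_100([40.6, 30.6, 28.8]))  # [41, 30, 29], which sums to 100
```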
Josh tweeted quite a shocking attack ad to me last week. He told me it came from the DC Metro. The ad is taken out by a group called HumaneWatch.Org, which apparently is a watchdog checking up on charity organizations. The ad attacks a specific group called the Humane Society of the United States. Here is the map that is the centerpiece of the copy:
I like to use the Trifecta checkup to evaluate graphics; it's a nice way to organize a visualization critique. You progress through three corners: figuring out what practical question is being addressed by the graphic, then evaluating what data is being deployed, and finally judging whether the graphical elements (the chart itself) are well executed in relation to the question and the data.
Based on the map, it appears that HumaneWatch is interested in the spending on pet shelters. Every number shown is tiny: on a quick scan, the range may be from 0% to 0.35%. The all-caps title "A Whole Lotta Nothing" confirms that this is the intended message.
Knowing nothing about either of these organizations leaves me confused. Should the "Humane Society" be spending the bulk of its budget on pet shelters? If it doesn't, is it because the staff is pilfering money, or because it has wasteful spending, or because pets are not its major cause, or because pet shelters are not the key way this organization helps pets?
I did look up Humane Society to learn that it is an animal rights group. The four bullet points at the bottom of the ad provide a clue as to what the designer wanted to convey: namely, that this charity is a scam, with too much overhead spending, and spending on pensions.
So I think the question being asked is sufficiently clarified, and it's a pretty important one. How is this organization spending its donations? Is it irresponsible compared to other similar organizations?
The data should be in sync with the question being addressed; that's why there is a link between those two corners of the Trifecta. Given the trouble I endured understanding the question being addressed, it should come as no surprise that this chart scores poorly on the DATA corner.
I don't understand why budget spent on pet shelters is the key bone of contention. Based on the perceived objectives, it seems that they should display directly what proportion of the budget went to overhead, and what proportion went to pensions, with suitable comparisons.
The analysis by state is a disease of having too much data. Let's imagine that the proportions averaged across all states come to 0.1%. If we replaced those 50 numbers with one number printed across all states: "The Humane Society spends less than 0.1% of its budget on pet shelters.", the message would have been identical, while being less confusing.
And it's not just confusion. Cutting the data by state introduces complications. The analyst would need to make sure that any differences between states are not due to factors such as the number of pets, the proportion of households owning pets, the average spending per pet, the supply and demand for pet shelters, the existence of alternatives to pet shelters, etc. None of these issues need to worry the designer who does not slice the data down.
The same reasoning explains why the absolute amount of spending (encoded in the colors of individual states) is not worth the ink it's printed on. The range between 0% and 0.35% has been chopped into seven pieces, which creates artificial gaps between the states. This design muddles the graphic's key message, "A Whole Lotta Nothing".
THE CHART ITSELF:
As we land on the final corner of the Trifecta, let's ignore our previous complaint, accept that the proportion of budget is an interesting data series to visualize, and turn our attention to the graphical elements. This chart scores poorly on execution as well!
Notice that the designer simultaneously plots two data series on the same map: the dollar value of pet-shelter spending, and that value as a proportion of budget. The former is encoded in the colors of the states while the latter is printed directly as data labels. This is the map equivalent of dual-axis line charts, and it is equally unreadable.
Based on the color legend, our brains tell us the yellow states are better than the blue states, but the huge numbers printed on the map convey the opposite message. The progression of colors makes little sense: the red and yellow stand out, yet those states are in the middle of the range.
It's a little blurry, but I think there are a number of New England states in the high-spending category (black and dark gray), and the map just happens to obscure this key feature.
PRACTICAL QUESTION: Fair
DATA: Very Poor
When you see two time series, resist the temptation to plot them as lines on the same chart. According to the Atlantic, the following dual-axis chart has been making the rounds in the investment community: (thanks to Alberto Cairo for the tip)
There may be correlation or there may not be. When we look at a chart like this and see "correlation" -- actually a high degree of correlation -- what we are really talking about is the long-run trends being correlated. For example, the underlying data for this chart is most likely at the daily level. If you train your eye on a small part of the chart, you will notice that at the daily level, there is a lot of noise and a lot less correlation than you think.
Long term trends being correlated does not imply short-term trends are also correlated!
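A quick simulation makes this concrete. Two completely independent return series that share only a common drift produce levels that look highly correlated, while their daily changes are not (made-up parameters; results vary with the seed):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Two independent daily return series sharing only an upward drift
ret_a = 0.002 + 0.01 * rng.standard_normal(n)
ret_b = 0.002 + 0.01 * rng.standard_normal(n)
level_a, level_b = np.cumsum(ret_a), np.cumsum(ret_b)

# Levels "correlate" because both trend upward...
print(np.corrcoef(level_a, level_b)[0, 1])  # high, driven by the shared trend
# ...but the daily changes, the actual co-movement, do not
print(np.corrcoef(ret_a, ret_b)[0, 1])      # near zero
```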
Furthermore, the long-run correlation is not enough to jump to the conclusion that the new trend will follow the old trend. When you make this conclusion, you are implicitly assuming that the mechanism causing the trend in the 1928-9 period is identical to that causing the current-period trend. This is when you realize that such an assumption is hard to support.
The Atlantic piece debunks this chart by re-expressing the data as indices, which means switching from absolute changes in the Dow Jones average to relative changes. This actually has its own problem, because the general level of the Dow Jones is so different between the two periods.
Here are some posts I have written on dual-axis charts. I have been complaining about them since almost the beginning of this blog. Back in 2006, I wrote this piece which takes a different path to debunking a similar chart -- by compressing or expanding one of the axes.
In a more recent post, I showed an example of when it is natural to use two axes on the same chart.
In the prior post, I linked to Eric P.'s (link) vetting of the Bloomberg chart on the drop in median male income in the U.S. in the last few decades. Just as a reminder, here is the key chart:
In the 25-34 age group (blue line), the median income has suffered two waves of drastic declines, about 25% from 1972 to 1992 and then about 18% from 1999 to 2011.
There is a different way to digest the chart above, which is what I want to talk about in this post. Notice that people age over time so if you trace the blue line from left to right, at every point in time we are comparing different people.
Instead, let's trace the same people across time -- this is known as a cohort analysis. I traced a black line through the above chart:
This cohort consists of male workers who were 25 to 34 years old between 1972 and 1982. By 1982, they would have aged to between 35 and 44 years old and so they would belong to the green line. Then they shifted up to the yellow line. So over the lifespan, the median worker increased their income.
You might notice that this analysis is very rough because the data is not granular enough. For example, if you were 34 years old in 1972, then by 1973 you would already have moved from the blue line to the green line. With more granular data, this analysis can be made precise. The weird jump (indicated by the dashed lines) is most likely a consequence of these imperfections in the cohorting.
If we have birth-year data, we can trace the people born in each year forward, and then stack all of these traces on the same chart to figure out true generational changes. The resulting chart would have age on the horizontal axis.
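A sketch of the mechanics in pandas, using hypothetical person-level records with a birth year: compute age, then pivot so that each birth cohort becomes its own trace against age:

```python
import pandas as pd

# Hypothetical person-year records: birth year, survey year, income
df = pd.DataFrame({
    "birth_year": [1948, 1948, 1948, 1958, 1958, 1958],
    "year":       [1978, 1988, 1998, 1988, 1998, 2008],
    "income":     [30000, 42000, 48000, 28000, 38000, 41000],
})
df["age"] = df["year"] - df["birth_year"]

# One column per birth cohort, age on the rows: each column is a
# generational trace that can be plotted against age
cohorts = df.pivot_table(index="age", columns="birth_year",
                         values="income", aggfunc="median")
print(cohorts)
```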
One of the key elements of numbersense is realizing that there is no single way to analyze any given dataset. When the data is rich, it holds many different insights.