I highlighted the columns for 1993 and 1996. Visually, the height of one column is twice that of the other column. And yet the axis labels tell us that the difference is 65% versus 62.5%.
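To put a number on that exaggeration: a column's drawn height is proportional to its value minus the axis baseline. Here is a minimal sketch, assuming the axis starts at 60 percent. That baseline is inferred from the doubled height, not read off the chart, and `visual_ratio` is just an illustrative helper.

```python
def visual_ratio(a, b, baseline=0.0):
    """Ratio of drawn column heights for values a and b
    on an axis that starts at `baseline`."""
    return (a - baseline) / (b - baseline)

# Assumed baseline of 60: the taller column looks twice as tall.
print(visual_ratio(65.0, 62.5, baseline=60.0))  # 2.0

# Baseline of zero: the heights differ by about 4 percent.
print(round(visual_ratio(65.0, 62.5), 2))  # 1.04
```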
The reason for the start-at-zero rule is to avoid exaggerating meaningless differences.
To judge whether a change is meaningful or not, in time-series data like this, we have to use history to understand the general variability in college enrollment rates. Based on what we can see in this data (about 20 years), the college enrollment rate hovers between 60 and 70 percent. There is no data between 0 and 60 percent. Those are irrelevant values for this data series. This is why starting at zero is counterproductive.
Here is the line chart starting at zero:
This display has the unintended effect of squashing meaningful changes over time by inserting a lot of empty space below the line.
This chart from Reuters is making the rounds on Twitter today.
Quickly, tell me whether the Gun Law in Florida did well or poorly.
That of course is the entire purpose of the chart.
If you are like me, that is, you carry knowledge of time-series line charts in your head, you probably experienced that moment where the bottom fell out and you didn't know which way was up.
This is the double edge of novelty in charts. There should be a very high bar against running counter to convention. Readers do bring their "baggage" to the chart, and the designer should take that into consideration.
Some commentators are complaining about trickery. That may be true. But it's also possible the designer actually thought reversing the direction of the vertical axis made the chart better.
Don't forget that we have another convention: up is good and down is bad. Fewer murders is good and more murders is bad. So why not make it such that a rising line indicates goodness (fewer murders)?
Going back to the Trifecta Checkup. This chart has dual problems. We just talked about the syncing between the data and the graphical element.
The other issue is that the data is insufficient to draw conclusions about the underlying question: what explains the shift in number of murders since the late 2000s? This is a complex problem--the chapter in Freakonomics about abortion and crime rate is still instructive, not for the disputed conclusion but for the process of testing various hypotheses. The reduction of the complex causal structure to a single factor is dissatisfying.
A twitter follower submitted this chart showing the shift in ethnicity in Texas:
If you blinked, you probably took away the wrong message. Our "prior" tells us that the proportion of Hispanics has been rising quite rapidly in Texas. So, like me, you might home in on the blue columns, which have increased drastically from 32% to 68%.
Things start to fall apart.
First, you might notice the blue label says "Non-Hispanic Whites," which is exactly the opposite of our hypothesis. For a moment, we are confused. Could it be that the Hispanic population in Texas has been shrinking?
Then, you might notice that the "information in our head" made us assume that the horizontal axis represents time. On a closer look, we discover that it's not time; what's being plotted from left to right are age groups. In fact, it's kind of a reversed time. The generations on the right side were born earlier and represent the ethnic distribution today of people born over 60 years ago while the columns on the left represent younger generations.
Finally, the gray columns are redundant and distracting.
On the other hand, the designer is admirably restrained with data labels, and included the baby and crooked man with a stick icons to provide some guidance, both of which are good ideas.
If I apply the Trifecta checkup to this chart, the biggest issue is misalignment between the interesting question of ethnic changes in Texans and the data used to explore this question. The current ethnic mix is not only impacted by the ethnic composition at birth but also by net migrations of different races and by their longevity. As pointed out above, the split by age groups forced us into a kind of reversed time thinking.
A simple fix involves expressing ages as birth years, and using a single line instead of columns:
This version addresses neither the tendency to interpret the left-right axis as time nor the excessive number of age groups.
An even better chart would put time on the horizontal axis, then have multiple lines each representing the proportion of non-Hispanic whites of a specific age group. It may be a political choice--I'm not sure why they chose to plot the declining proportion of non-Hispanic whites and lump Hispanics into "all others" as opposed to plotting the increasing mix of Hispanics.
On the sister blog, I wrote about a new report on the music industry lamenting that the hype over "Long Tail" retail has not really helped small artists (as a group). This was a tip sent by reader Patrick S. He was rightfully unhappy about the chart that was included in this summary of the report.
This classic Excel chart has some basic construction issues:
The data labels are excessive
The number of ticks on the vertical axis should be halved, given the choice to not show decimal places
With only two colors, it is a big ask for readers to shift their sight to the legend on top to understand what the blue and gray signify. Just fold the legend text into the existing text annotation!
In terms of the Trifecta checkup, the biggest problem is the misalignment between the intended message and the chart's message. If you read the report, you'd learn that one of their key findings is that the top 1% (superstar) artists continue to earn ~75 percent of total income and this distribution has not changed noticeably despite the Long Tail phenomenon.
But what is the chart's message? The first and most easily read trend is the fall in total income in the last 12-13 years. And it's a drastic drop of about $1 billion, almost 25 percent. Everything else is hard to compute on this stacked column chart. For example, the decline in the gray parts is even more drastic than the decline in the blue.
It also is challenging to estimate the proportions from these absolute amounts. Recognizing this, the designer added the proportions as text. But only for the most recent year.
So we have identified two interesting stories, one about the decline in total income and the other about the unending dominance of the 1 percent. This is where the designer has to set priorities. Given that the latter message is the headline of the report, it is better to plot the proportions directly, while hiding the story about total income. The published chart has the priority reversed. Even though you can find both messages on the same chart, it is still not a good idea to highlight your lesser message.
A twitter follower @mdjoner felt that something is amiss with the squares in this chart comparing real estate prices in major cities around the world. I'm not sure where the chart originally came from but there is a CNBC icon.
There is one thing I really like about the chart, which is the metric that has been selected. The original data is likely to be price per square metre for luxury property in various places. The designer turned this around and computed the size of what you can buy assuming you spend $1 million. I think we have a better ability to judge areas than dollars.
The notion of floor area meshes well with the area on a chart, so there is an intuitive appeal as well.
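The conversion is simple enough to sketch. The prices below are made up for illustration (they are not the chart's data), and `area_for_budget` and `square_side` are hypothetical helper names. Note that if each city is drawn as a square, the side should scale with the square root of the area, so that it is the area, not the side, that carries the data.

```python
import math

BUDGET_USD = 1_000_000

def area_for_budget(price_per_sqm, budget=BUDGET_USD):
    """Square metres that `budget` buys at a given price per square metre."""
    return budget / price_per_sqm

def square_side(area):
    """Side length of a square whose area encodes `area`."""
    return math.sqrt(area)

# Hypothetical prices in USD per square metre, for illustration only.
prices = {"City A": 40_000, "City B": 10_000}

for city, price in prices.items():
    area = area_for_budget(price)
    print(f"{city}: {area:.0f} sq m, drawn as a square of side {square_side(area):.1f}")
```

A city that is four times as expensive gets a quarter of the area, but its square is only half as wide.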
So in the Trifecta checkup, they did well posing an interesting question, and picking some data. But like Mike, I'm not excited about the graphical construct.
There are a few problems with this chart:
It requires using colors when the colors do nothing other than delineating one city from the next.
There's overcrowding at the bottom of the chart because the designer maintained a fixed spacing throughout the chart.
The city label is always positioned above the middle of the diamond. I find it very confusing in the bottom half of the chart when the diamonds started overlapping.
The shadows plus the overlapping make it almost impossible to make out the actual areas of the pieces.
Here is an alternative display of the data:
Notice that I designed this for an American audience. I'd change certain decisions if using this for the non-American reader. I chose New York as the focal point, and split the cities into two parts. On the left are the cities less expensive than New York and on the right are those cities more expensive than New York.
Also, along the bottom, I provide some clues to help people bridge the gap between the areas shown on the graphic, and real-life areas. For example, the orange square represents 400 square feet but without the annotation telling you it's about the size of a typical Manhattan studio, you may not know how to map the size of the orange square to your perception of real spaces. I also included images (although if I'm publishing this, I'd want better ones).
Finally, note that the data set did not show up on my version of the chart.
My twitter followers have been sending in several howlers.
Twitter (link) made a bunch of bold claims about its own influence by using the number of tweets about the Oscars as fodder. They also adopt the euphemism common to the digital marketing universe, the so-called "view", which, to their credit, they define as "how many times tweets are displayed to users". Yes, you read that right, displaying is the same as viewing in this world - and Twitter is just a follower, not a trend setter, here.
In the meantime, @wilte found this unfortunate donut chart, created by PWC in the Netherlands.
Both designers basically appropriated a graphical form and deprived it of data. In one, the designer threw the concept of scale to the wind. In the other, the designer dumped the law of total probability. In either case, the fundamental rationale for the particular graphical form is sacrificed.
Both are examples that fail our self-sufficiency test. This test says if a visual display cannot be understood unless the entire data set is printed on the chart, then why create a visual display? In both charts, if you block out the numbers, you are left with nothing!
The PWC chart was submitted by @graphomate, who also submitted the following KPMG chart:
The complaint was the total adding up to 101%. I'm not really bothered by this as it is a rounding issue. That said, I like to "hide" such rounding issues. I have never understood why it is necessary to display the imperfection. Flip a coin and remove the decimals from one of the categories!
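If you'd rather not flip a coin, one systematic option is the largest-remainder method. This is my substitution, not anything the chart's designer did: drop the decimals from every category, then hand the leftover points to the categories with the largest fractional parts. The shares below are made up, and `round_to_total` is an illustrative name.

```python
import math

def round_to_total(shares, total=100):
    """Round shares to integers that add up to `total`
    using the largest-remainder method."""
    floors = [math.floor(s) for s in shares]
    leftover = total - sum(floors)
    # Category indices ordered by fractional part, largest first.
    by_remainder = sorted(
        range(len(shares)),
        key=lambda i: shares[i] - floors[i],
        reverse=True,
    )
    for i in by_remainder[:leftover]:
        floors[i] += 1
    return floors

shares = [40.6, 30.7, 28.7]  # hypothetical shares summing to 100

naive = [round(s) for s in shares]
print(naive, sum(naive))  # [41, 31, 29] 101

fixed = round_to_total(shares)
print(fixed, sum(fixed))  # [40, 31, 29] 100
```

The category that loses its decimals is the one that least deserved to be rounded up.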
Some graphics are made to inform, some to amuse, some to delight. But the following scatter plot makes one wonder why why why...
What does the designer want to say?
I saw this chart inside an infographic titled "Where in the World are the Best Schools and the Happiest Kids?", via the Cool Infographics blog. The horizontal axis is happiness and the vertical axis is average test score.
So it appears that happy kids can get the best and the worst test scores, and kids with the best test scores can be both happy and sad.
That means the happiness of kids does not depend on their test scores.
When you see two time series, resist the temptation to plot them as lines on the same chart. According to the Atlantic, the following dual-axis chart has been making the rounds in the investment community: (thanks to Alberto Cairo for the tip)
There may be correlation or there may not be. When we look at a chart like this and see "correlation" -- actually a high degree of correlation -- what we are really talking about are the long-run trends being correlated. For example, the underlying data for this chart is most likely on a daily level. If you train your eye to a small part of the chart, you will notice that at the daily level, there is a lot of noise and a lot less correlation than you think.
Long-term trends being correlated does not imply short-term trends are also correlated!
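A small simulation makes the point. This is a sketch on synthetic data, not the Dow Jones series: two series share nothing but an upward drift, and their day-to-day noise is independent. The slope, noise level, seed and function names are all assumptions chosen for illustration.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def trending_series(rng, n, slope=0.5, noise_sd=5.0):
    """Linear upward trend plus independent Gaussian noise."""
    return [slope * t + rng.gauss(0, noise_sd) for t in range(n)]

def daily_changes(series):
    """Day-over-day differences of a series."""
    return [b - a for a, b in zip(series, series[1:])]

rng = random.Random(0)  # fixed seed so the run is repeatable
a = trending_series(rng, 500)
b = trending_series(rng, 500)

level_corr = pearson(a, b)
change_corr = pearson(daily_changes(a), daily_changes(b))
print(f"levels: {level_corr:.2f}, daily changes: {change_corr:.2f}")
```

The levels come out almost perfectly correlated because both series drift upward, while the daily changes show little correlation.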
Furthermore, the long-run correlation is not enough to jump to the conclusion that the new trend will follow the old trend. When you make this conclusion, you are implicitly assuming that the mechanism causing the trend in the 1928-9 period is identical to that causing the current-period trend. This is when you realize that such an assumption is hard to support.
The Atlantic piece debunks this chart by re-expressing the data as indices. This means we switched from absolute changes in the Dow Jones average to relative changes. This has its own problem actually because the general level of the Dow Jones is so different between those two periods.
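For readers who have not built an index before, here is a minimal sketch. The levels are invented to mimic two periods with very different starting points, and `to_index` is an illustrative name. The same 30-point move reads as a large change in one period and a blip in the other.

```python
def to_index(series, base=100.0):
    """Re-express a series relative to its first value (first value = base)."""
    first = series[0]
    return [base * v / first for v in series]

# Made-up levels: both periods gain 30 points per step.
then = [300.0, 330.0, 360.0]
now = [15000.0, 15030.0, 15060.0]

print(to_index(then))                        # [100.0, 110.0, 120.0]
print([round(v, 1) for v in to_index(now)])  # [100.0, 100.2, 100.4]
```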
Here are some posts I have written on dual-axis charts. I have been complaining about them since almost the beginning of this blog. Back in 2006, I wrote this piece which takes a different path to debunking a similar chart -- by compressing or expanding one of the axes.
In a more recent post, I showed an example of when it is natural to use two axes on the same chart.
In the prior post, I linked to Eric P.'s (link) vetting of the Bloomberg chart on the drop in median male income in the U.S. in the last few decades. Just as a reminder, here is the key chart:
In the 25-34 age group (blue line), the median income has suffered two waves of drastic declines, about 25% from 1972 to 1992 and then about 18% from 1999 to 2011.
There is a different way to digest the chart above, which is what I want to talk about in this post. Notice that people age over time so if you trace the blue line from left to right, at every point in time we are comparing different people.
Instead, let's trace the same people across time -- this is known as a cohort analysis. I traced a black line through the above chart:
This cohort consists of male workers who were 25 to 34 years old between 1972 and 1982. By 1982, they would have aged to between 35 and 44 years old and so they would belong to the green line. Then they shifted up to the yellow line. So over the lifespan, the median worker increased their income.
You might notice that this analysis is very rough because the data is not granular enough. For example, if you are 34 years old in 1972, then by 1973 you have already moved from the blue to the green line. With the proper data, this analysis can be made precise. The weird jump (indicated by the dashed lines) is most likely a consequence of the imperfections in the cohorting.
If we have the birth year data, then we can trace people who are born in each year forward, and then stack all these traces on the same chart to figure out true generational changes. Imagine that the chart would have age on the horizontal axis.
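Here is a rough sketch of the bookkeeping, assuming ten-year age bands. The 25-34 and 35-44 bands come from the chart; the 45-54 band for the yellow line is my guess, and `age_band` is an illustrative helper. Given a birth year, it reports which line a person sits on in any calendar year.

```python
# Age bands as (low, high), inclusive. The third band is an assumption.
AGE_BANDS = [(25, 34), (35, 44), (45, 54)]

def age_band(birth_year, year):
    """Return the age band a person born in `birth_year` falls in
    during `year`, or None if outside all bands."""
    age = year - birth_year
    for low, high in AGE_BANDS:
        if low <= age <= high:
            return (low, high)
    return None

# Someone aged 34 in 1972 was born in 1938.
print(age_band(1938, 1972))  # (25, 34)
print(age_band(1938, 1973))  # (35, 44)
```

Running this for every birth year, and looking up the income for each (year, band) pair, would give one trace per birth cohort.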
One of the key elements of numbersense is realizing that there is no single way to analyze any given dataset. When the data is rich, it holds many different insights.