From a purely graphical perspective, the following NYT chart (link) is well executed:
Labeling is always a challenge with scatter plots. Here, they have 54 points, and the chart still doesn't look too crammed. I like the axis labels, and the clear labeling of the four quadrants.
I also like the vertical scale that goes from 4 to 8, despite the scoring range going from 1 to 10. This trims unneeded whitespace and magnifies the differences between nations.
In the Trifecta checkup, we also care about the key question the chart is designed to answer, and how it relates to the graphical element. According to the subtitle, this chart showed that "the nations with more progressive tax rates had happier citizens."
This conclusion certainly does not jump off the page. Reader Christopher L. who submitted this chart found "no obvious trend." (Given the source, I suspect it's the researchers who drew that conclusion.)
There are lots of unanswered questions in an international comparison of subjective results of this type:
How were the 54 nations chosen?
Is the year 2007 representative of the recent situation in every one of these countries? Were there any tax reforms in 2007 in any of these countries?
How reliable is the Gallup poll in each of these countries? How large are the sample sizes? Is it the same survey?
Why is the difference between the highest and lowest tax burdens the right measure of progressiveness? And are they using the marginal tax rates or the average tax rates? (Judging from the Wikipedia page, there is a lot of arbitrariness in determining a country's tax rate.)
Are the two data sources comparable? Happiness is a personal question while the range of tax rate is an aggregate metric, with each individual only experiencing one tax rate.
These are not trivial questions. If the data is bad, no amount of graphical magic can save it.
Most vexing for a display like this is that it forces the reader to look for the impact of tax burden on happiness. That's how the question is framed. There is nothing in this chart, though, that suggests that tax rates can explain happiness, and certainly nothing to suggest that low tax rates cause more happiness.
I call this story time. Put up some data, then spin stories and spin away.
Andrew Wheeler took the time to write code (in SPSS) to create the "Scariest Chart ever" (link). I previously wrote about my own attempt to remake the famous chart in grayscale. I complained that this is a chart that is easier to make in the much-maligned Excel paradigm than in a statistical package: "I find it surprising how much work it would be to use standard tools like R to do this."
Andrew disagreed, saying "anyone savvy with a statistical package would call bs". He goes on to do the "Junk Charts challenge," which has two parts: remake the original Calculated Risk chart, and then, make the Junk Charts version of the chart.
I highly recommend reading the post. You'll learn a bit of SPSS and R (ggplot2) syntax, and the philosophy behind these languages. You can compare and contrast different ways of creating the charts, and compare the output of the various programs.
I'll leave you to decide whether the programs he created are easier than Excel.
Unfortunately, Andrew skipped over one of the key challenges that I envision for anyone trying to tackle this problem. The data set he started with, which he found from the Minneapolis Fed, is post-processed data. (It's a credit to him that he found a more direct source of data.) The Fed data is essentially the spreadsheet that sits behind the Calculated Risk chart. One can just highlight the data, and create a plot directly in Excel without any further work.
What I started with was the employment level data from BLS. What such data lacks is the definition of a recession, that is, the starting year and ending year of each recession. The data also comes in calendar months and years, and transforming that to "months from start of recession" is not straightforward. If we don't want to "hard code" the details, that is, if we allow the definition of a recession to be flexible and make this a more general application, the challenge is even greater.
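To make the re-indexing step concrete, here is a minimal Python sketch of turning a calendar-month employment series into one series per recession, keyed by "months from the start of the recession". All the figures and the recession start date are invented for illustration; in practice the start dates have to be supplied by hand, since the BLS file does not contain them.

```python
# Toy monthly series: {(year, month): employment level} -- made-up numbers
employment = {(2007, m): 138000 - 100 * m for m in range(1, 13)}
employment.update({(2008, m): 136800 - 300 * m for m in range(1, 13)})

# Recession definitions must be hard-coded or imported separately
recession_starts = [(2007, 12)]  # (year, month) of each employment peak

def months_since(start, ym):
    """Number of months from `start` to `ym` (both (year, month) tuples)."""
    return (ym[0] - start[0]) * 12 + (ym[1] - start[1])

# Build one series per recession, keyed by offset from the start month
series = {}
for start in recession_starts:
    series[start] = {
        months_since(start, ym): level
        for ym, level in sorted(employment.items())
        if months_since(start, ym) >= 0
    }

print(series[(2007, 12)][0])  # employment level at the recession's start month
```

The point of the exercise: once every series is keyed by offset rather than calendar date, the time-shifted overlay falls out naturally, but none of this indexing comes for free from the raw data.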
Another detail that Andrew skimmed over is the uneven length of the data series. One of the nice things about the Calculated Risk chart is that each line terminates upon reaching the horizontal axis. Even though more data is available for out years, that part of the time series is deemed extraneous to the story. This creates an awkward dataset where some series have say 25 values and others have only 10 values. While most software packages will handle this, more code needs to be written either during the data processing phase or during the plotting.
By contrast, in Excel, you just leave the cells blank where you want the lines to terminate.
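The truncation logic can be sketched in a few lines of Python. This is a hypothetical helper, not code from either post: it cuts each series of percentage job losses at the first return to zero, which is what produces the uneven lengths.

```python
def truncate_at_recovery(pct_losses):
    """Keep values up to and including the first return to >= 0 percent,
    skipping the starting 0 that every recession series begins with."""
    out = []
    for v in pct_losses:
        out.append(v)
        if v >= 0 and len(out) > 1:  # recovered: drop the out years
            break
    return out

# Made-up series of % job losses for one recession
s1981 = [0.0, -1.2, -2.8, -3.1, -2.0, -0.5, 0.1, 0.8, 1.5]
print(truncate_at_recovery(s1981))  # -> [0.0, -1.2, -2.8, -3.1, -2.0, -0.5, 0.1]
```

Applied across recessions, this yields lists of different lengths, and the plotting code then has to cope with ragged series rather than a clean rectangular table.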
In the last section, Andrew did a check on how well the straight lines approximate the real data. You can see that the approximation is extremely good. (The two panels where there seems to be a difference are due to the data sources disagreeing on when the recession started. If you look at 1974 instead of 1973, and also follow Calculated Risk's convention of having a really short recession in 1980, separate from that of 1981, then the straight lines match superbly.)
I'm the last person to say Excel is the best graphing package out there. That's not the point of my original post. If you're a regular reader, you will notice I make my graphs using various software, including R. I came across a case where I think current software packages are inferior, and would like the community to take notice.
Sometimes, a chart just strains your mind. Such is the case with the following, a tip from Augustine F. (@acfou)
There are just so many percentages on the chart it's really hard to figure out which is which.
Under the title, it hints that they are showing results from a poll. The legend implies that the poll asks for estimates of budget and revenue allocations: one imagines the questions were what proportion of your marketing budget is allocated to digital? and what proportion of your revenues is attributed to digital? On top of the bars are some percentages, presumably percentages of respondents. Perhaps, or perhaps not. The column labels clearly add up to over 100% since there are two columns in the 30-35% range.
Under the axis, we have buckets of percentages. Are they percentages of people, of budgets or of revenues? Why and how are they bucketed?
My best guess is that the survey is a multiple-choice with 11 choices corresponding to the groups of columns. The axis labels refer to both percentage of budget and percentage of revenues, depending on which column you're looking at.
What is maximally confusing is the last set of columns, labeled "Average", with values in the 35% range. It is most likely not a choice in the survey. They somehow came up with an average based on the responses. So maybe I was wrong about the multiple-choice format: if the raw data comes in buckets like 61 to 70%, there is no easy way to average these responses. Maybe they asked for two exact percentages, and then grouped them afterwards.
To sum all that up, the percentages on top of the columns are percentages of respondents, except in the last set of columns, where they are percentages of budget (or revenues). The percentages of budget (or revenues) are sitting on the horizontal axis, except in the last label, called "Average", where it means the average respondent.
There is a problem with my interpretation. It makes the chart completely worthless!
What use is it to learn that "16% of the respondents say they allocate 11-20% of their budget to digital while 12% of the respondents say they derive 11-20% of their revenues from digital"?
You might be interested in whether there is a return on investment to the money spent on digital marketing. You'd then need to know for a given company, what proportion of budget was spent on marketing versus what proportion of revenues was attributed to that marketing. In this chart, there is no linkage -- the companies who say they spend 11-20% on digital may or may not be the same set of companies who say they derive 11-20% from digital spend.
If the survey asked for exact percentages, then I'd prefer to see a scatter plot, showing proportion of budget on one axis, and proportion of revenues on the other axis, each dot representing a respondent.
A final note: it is worth asking what types of people answer this survey. Pretty much the only people in a company who can answer this question accurately are the heads of marketing. If you are working for the head of marketing, you likely know the details of a particular segment of marketing but not the aggregate numbers. If you work in a different department, there is little to no chance that you have any useful knowledge about marketing budgets and revenue allocations.
One would also appreciate it if all such pictures include the sample size.
Look at the axis. Usually a break in the axis is reserved for outliers. If there is one bar in a bar chart that extends way beyond the rest of the data, then you would sever that bar to let readers know that the scale is broken. Here, the designer broke every bar in the entire chart. It's as if the designer knows we'll complain about not starting the chart at zero -- so the bars all start at zero except they jump from zero to 70 right away.
The biggest issue with this chart is not its graphical element. It's the other two corners of the Trifecta checkup: what is the question being asked? And what data should be used to address that question?
The accompanying article complains about the dearth of H-1B visas for technical talent at businesses. But it never references the data being plotted.
It's hard for me to even understand what the chart is saying. I think it is saying that in Bloomington-Normal, IL, 94.8 percent of its H-1B visa requests are science related. There is no way to interpret this number without knowing the percentage for the entire country. It is most likely true that H-1B visas are primarily used to recruit technical talent from overseas, and the proportion of such requests that are STEM related is high everywhere. In this sense, it's not clear that the proportion of H-1B requests is a useful indicator of the dearth of technical talent.
Secondly, it is highly unlikely that the decimal point is meaningful. Given the highly variable total number of requests across different locations, the decimal point would represent widely varying numbers of requests.
I'd prefer to look at the absolute number of requests for this type of analysis, given that Silicon Valley has orders of magnitude more technical jobs than most of the other listed locations. Requests aren't even a good indicator of labor shortage. Typically H-1B visas run up against the quota sometime during the year, and companies will stop requesting new visas since there is no chance of getting approved. This is a form of survivorship bias. Wouldn't it be easier to collect data on the number of vacant technical jobs in each location?
Ken B., another Australian reader, wasn't too proud of this effort, apparently excerpted from an HSBC report by the Sydney Morning Herald (link):
Ken: If you plot ranking by ranking it magically turns into a straight line.
There are a few other annoyances: gridlines, data labels, a double-edged arrow, and bars that are all based on the same data, which could easily be conveyed with a ranked table. In fact, just turn the chart 90 degrees clockwise, get rid of everything else except the names of countries, and you have a much more readable figure.
The completely unnecessary legend is an Excel special. If only one data series is plotted, it should be automatic to suppress the legend.
The three-letter acronyms for the different currencies are a futile educational lesson, kind of like plotting geographical data on maps (in many cases). For most readers, the message of the chart does not require knowing the names of the currencies, nor their acronyms. For those who care about acronyms, say currency traders, they most likely already know those letters.
Just like I don't understand how we can define "over-rated" or "under-rated" restaurants (see this post and this), I also don't understand how we can define "over-valued" or "under-valued" currencies given the impossibility of knowing the "true value" of any currency.
I just had to point your attention to the fact that 123 people tweeted this article, and 221 liked this item on Facebook. And these actions form part of the so-called Big Data revolution.
One of the best charts depicting our jobs crisis is the one popularized by the Calculated Risk blog (link). This one:
I think a lot of readers have seen this one. It's a very effective chart.
The designer had to massage the data in order to get this look. The data published by the government typically gives an estimated employment level for each month of each year. The designer needs to find the beginning and ending months of each previous recession. Then the data needs to be broken up into unequal-length segments. A month counter now needs to be set up for each segment, re-setting to zero, for each new recession. All this creates the effect of time-shifting.
And we're not done yet. The vertical axis shows the percentage job losses relative to the peak of the prior cycle! This means that for each recession, he has to look at the prior recession and extract out the peak employment level, which is then used as the base to compute the percentage that is being plotted.
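This peak-relative computation is simple once stated, but it has to be wired up for every recession. A hedged Python sketch, with an invented employment series, might look like this:

```python
def pct_loss_from_peak(levels, start_idx):
    """Percent change of each value from the recession's start onward,
    measured against the peak employment of the prior cycle."""
    peak = max(levels[: start_idx + 1])  # peak employment before the slump
    return [100.0 * (v - peak) / peak for v in levels[start_idx:]]

# Made-up monthly employment levels; the recession starts at index 2
levels = [100, 102, 104, 103, 100, 97, 95, 96, 99]
print(pct_loss_from_peak(levels, 2))
```

Each recession's series starts at 0% by construction, which is exactly what lets all the lines share one origin on the chart.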
One thing you'll learn quickly from doing this exercise is that this is a task ill-suited for a computer (so-called artificial intelligence)! The human brain together with Excel can do this much faster. I'm not saying you can't create a custom-made application just for the purpose of creating this chart. That can be done and it would run quickly once it's done. But I find it surprising how much work it would be to use standard tools like R to do this.
Let me get to my point. While this chart works wonders on a blog, it doesn't work on the printed page. There are too many colors, and it's hard to see which line refers to which recession, especially if the printed page is grayscale. So I asked CR for his data, and re-made the chart like this:
You'd immediately notice that I have liberally applied smoothing. I modeled every curve as a V-shaped curve with two linear segments, the left arm showing the average rate of decline leading to the bottom of the recession, while the right arm shows the average rate of growth taking us out of the doldrums. If you look at the original chart carefully, you'd notice that these two arms suffice to represent pretty much every jobs trend... all the other jitter is just noise.
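The V-shaped smoothing amounts to replacing each series with two straight lines anchored at the trough. Here is a minimal sketch of that idea in Python (my own illustrative reconstruction, not the code actually used), assuming the series starts at 0% and the trough is not the first point:

```python
def v_approx(losses):
    """Replace a series of % job losses with a two-segment V shape:
    a straight decline to the trough, then a straight climb back out."""
    trough = min(range(len(losses)), key=lambda i: losses[i])
    bottom = losses[trough]
    left = [bottom * i / trough for i in range(trough + 1)]    # decline arm
    rise = len(losses) - 1 - trough
    right = [bottom * (1 - j / rise) for j in range(1, rise + 1)]  # recovery arm
    return left + right

# Made-up noisy series of % job losses
noisy = [0.0, -1.1, -2.3, -2.9, -4.0, -2.8, -1.6, -1.0, 0.0]
print(v_approx(noisy))  # -> [0.0, -1.0, -2.0, -3.0, -4.0, -3.0, -2.0, -1.0, 0.0]
```

The two slopes are precisely the average rate of decline and the average rate of growth, which is why the approximation tracks the originals so closely.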
I also chose a small-multiples format to separate the curves into groups by decade. When you only have one color, you can't plot ten lines on top of one another.
One can extend the 2007 recession line to where it hits the 0% axis, which would really make the point that the jobs crisis is unprecedented and inexplicably not getting any kind of crisis management.
(Meanwhile, New York City calls a crisis with every winter storm... It's baffling.)
Augustine F. (@acfou) was not amused by a set of charts made by the Bureau of Labor Statistics, via Business Insider (link). Here's one of them:
The article's message is that the book, periodical and music stores industry has shrunk drastically (over 50%) in the last 10 years but unless you spend time studying the chart, you're not likely to get this picture.
The bubbles are going right and up, which usually is indicative of an increasing trend. What is tripping us up is the employment level occupying the horizontal axis rather than the expected time dimension. The only real way to see the plunge in employment is to focus on the horizontal axis, and to notice the deepening color of the bubbles.
The chart is actually a scatter plot of number of firms versus number of employees. The slope of the line gives us the number of firms per employee, which is also unexpected since the usual metric is its reciprocal, the number of employees per company. However, since the slope is essentially constant, highlighting this number is pointless. While the industry is collapsing, the average workforce of the surviving firms has remained more or less the same.
I added a cone to the chart to visualize the narrow range in which the employees per firm varied during the past decade.
As if it's not confusing enough, the reciprocal of the slope is coded to the size of the bubbles on the chart. This requires a legend to explain. All of this means that readers' attention is directed to the average work force metric, instead of the drop in employment.
The following indexed chart shows that the number of employees and the number of firms dropped in step during the ten years. Both dropped about 55% during the decade. This just confirms that the average employee per firm metric is not meaningful.
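Rebasing both series to 100 at the first year is what makes the comparison direct. A quick Python sketch, with invented figures rather than the actual BLS values:

```python
def rebase(series, base=100.0):
    """Index a series to `base` at its first value."""
    return [base * v / series[0] for v in series]

# Illustrative (not actual) counts over the decade
employees = [120000, 110000, 95000, 80000, 66000, 54000]
firms     = [9000,   8300,   7100,  6000,  5000,  4050]

print(rebase(employees)[-1])  # -> 45.0, i.e. a 55% drop
print(rebase(firms)[-1])      # -> 45.0, the same 55% drop
```

When both indexed lines end at essentially the same level, the ratio between them (employees per firm) has barely moved, which is why that metric carries no story.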
If you follow the link to the BLS analysis, you'll find some other interesting data, namely the "internet publishing" industry. Does it make sense to talk about the drastic decline in traditional publishing without talking about the rise of the "substitute" industry? The chart below shows that the new jobs created in Internet publishing filled almost all of the hole left in the traditional publishing industry. The decline from 2009 on may not be specific to the industry; it could just be the Great Recession. (As defined, I don't think the two industry sectors are exactly what I'm looking for, but it's close enough.)
James C @annelidworm sent me to this BBC chart, which he thinks is "hard on the eyes":
I find a few things I like, and also a few I don't.
Unlike James, I actually find the chart quite pretty. The use of small multiples to compare season tickets with single tickets is also nice. For someone like me, who isn't well versed in British geography, the geography lesson is appreciated - although for a local reader, this may be superfluous. The thickness of lines used to encode the data works alright.
There are a few problems with the chart:
There is a self-sufficiency problem. This is a chart in which every data element is printed on the chart, which means the graphical pieces are merely cosmetic. If the data labels were removed, the reader would be entirely lost. However, this problem can be solved by judicious use of colors.
Consider how color is used here. Blue and yellow distinguishes between season and single tickets but the small-multiples setup already does the job well enough. The tint is used in some arbitrary manner unrelated to the data, as far as I can tell.
Instead, price increases above the rate of inflation should be differentiated from price changes below the rate of inflation by using two colors. The special case of Birmingham's season ticket which increased exactly at the rate of inflation deserves its own color.
Speaking of increases relative to inflation: the analyst helpfully explains via the legend that any number above 66 percent is ahead of inflation, and any number below is behind inflation, meaning prices have actually come down. The entire dataset can be simplified by subtracting 66% from each number to show the "real" price changes.
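The simplification is a one-liner. A sketch with invented fare increases (the 66% baseline is the only figure taken from the chart's legend):

```python
INFLATION = 66.0  # cumulative inflation over the period, per the legend

# Hypothetical fare increases in percent, for illustration only
raw_increases = {"City A": 92.0, "City B": 66.0, "City C": 48.0}

# Real change: positive means ahead of inflation, negative means behind
real_changes = {k: v - INFLATION for k, v in raw_increases.items()}
print(real_changes)
```

With the data expressed this way, zero becomes the natural reference line, and the two-color scheme suggested above falls out of the sign of each value.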
Take a step back. What is the story in this dataset? The numbers on the right side are all much higher than those on the left, with Shoeburyness being a bit of the exception. It appears that the rail company is trying to push sales of season tickets. Too bad this chart doesn't bring the story to the front.