
Experiments with multiple dimensions

Reader (and author) Bernard L. sends us to the Economist (link), where they walked through a few charts they sketched to show data relating to the types of projects that get funded on Kickstarter. The three metrics collected were total dollars raised, average dollars per project, and the success rate of different categories of projects.

Here's the published version, which is a set of bar charts, ranked by individual metrics, and linked by colors.


This bar chart does the job. The only challenge is the large number of colors. But otherwise, it's not hard to see that fashion projects have the worst success rate and raised relatively little money overall although the average pledge amount tended to be higher than average.

The following chart used more of a Bumps chart aesthetic. It dropped the average pledge per project metric, which I think is a reasonable design choice. The variance in pledge amount is probably pretty high and thus the average may not be a good metric anyway. The Bumps format though suffers because there are too many categories and the two metrics are rather uncorrelated, resulting in a spider web. Instead of using colors as a link, this format uses explicit lines as links between the metrics.


The following version combines features from both. It requires no colors. It drops the third metric, while adopting the bar chart format. The two charts retain the same order of categories so that one can read across to learn about both metrics.



PS. Readers want to see a scatter plot:


The overall pattern is clearer on a scatter plot. When there are so many categories, it's a pain to put the data labels on the chart. It's odd that the amount pledged for games is the highest of the categories and yet it has among the lowest rate of being fully funded. Is this a sign of inefficiency?

Breaking every limb is very painful

This Financial Times chart is a big failure:


Look at the axis. Usually a break in the axis is reserved for outliers. If there is one bar in a bar chart that extends way beyond the rest of the data, then you would sever that bar to let readers know that the scale is broken. Here, the designer broke every bar in the entire chart. It's as if the designer knows we'll complain about not starting the chart at zero -- so the bars all start at zero except they jump from zero to 70 right away.


The biggest issue with this chart is not its graphical element. It's the other two corners of the Trifecta checkup: what is the question being asked? And what data should be used to address that question?

The accompanying article complains about the dearth of H-1B visas for technical talent at businesses. But it never references the data being plotted.

It's hard for me to even understand what the chart is saying. I think it is saying that in Bloomington-Normal, IL, 94.8 percent of its H-1B visa requests are science related. There is no way to interpret this number without knowing the percentage for the entire country. It is most likely true that H-1B visas are primarily used to recruit technical talent from overseas, and the proportion of such requests that are STEM related is high everywhere. In this sense, it's not clear that the proportion of H-1B requests is a useful indicator of the dearth of technical talent.

Secondly, it is highly unlikely that the decimal point is meaningful. Given the highly variable total number of requests across different locations, the decimal point would represent widely varying numbers of requests.
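To put rough numbers on the decimal-point complaint, here is a minimal sketch in Python. The request totals are made up for illustration (the chart does not report them), and the helper name is hypothetical. It only shows how many requests one tenth of a percentage point stands for at different totals.

```python
def requests_per_tenth_of_point(total_requests):
    """Number of requests represented by 0.1 percentage point of the total."""
    return total_requests / 1000


# Hypothetical totals for a small metro area and a large one (not real data).
for total in (500, 50_000):
    print(total, "requests in total ->", requests_per_tenth_of_point(total), "requests per 0.1 point")
```

With a total of 500, the last digit of "94.8" corresponds to half a request; with 50,000 it corresponds to 50 requests. The same decimal place carries very different weight from one location to the next.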

I'd prefer to look at absolute number of requests for this type of analysis, given that Silicon Valley has orders of magnitude more technical jobs than most of the other listed locations. Requests aren't even a good indicator of labor shortage. Typically H-1B visas run up against the quota sometime during the year, and companies will stop requesting new visas since there is no chance of getting approved. This is a form of survivorship bias. Wouldn't it be easier to collect data on the number of vacant technical jobs in each location?



A straight line going nowhere fast, despite tweets and likes

Ken B., another Australian reader, wasn't too proud of this effort, apparently excerpted from an HSBC report by the Sydney Morning Herald (link):


Ken: If you plot ranking by ranking it magically turns into a straight line.


There are a few other annoyances. The gridlines, data labels, double-edged arrow, and bars all encode the same data, which could easily be conveyed with a ranked table. In fact, just turn the chart 90 degrees clockwise, get rid of everything else except the names of countries, and you have a much more readable figure.

The completely unnecessary legend is an Excel special. If only one data series is plotted, it should be automatic to suppress the legend.

The three-letter acronyms for different currencies are a futile educational lesson, kind of like plotting geographical data on maps (in many cases). For most readers, the message of the chart does not require knowing the names of the currencies, nor their acronyms. Those who do care about acronyms, say currency traders, most likely already know those letters.


Just like I don't understand how we can define "over-rated" or "under-rated" restaurants (see this post and this), I also don't understand how we can define "over-valued" or "under-valued" currencies given the impossibility of knowing the "true value" of any currency.


I just had to point your attention to the fact that 123 people tweeted this article, and 221 liked this item on Facebook. And these actions form part of the so-called Big Data revolution.

Remaking a great chart

One of the best charts depicting our jobs crisis is the one popularized by the Calculated Risk blog (link). This one:


I think a lot of readers have seen this one. It's a very effective chart.

The designer had to massage the data in order to get this look. The data published by the government typically gives an estimated employment level for each month of each year. The designer needs to find the beginning and ending months of each previous recession. Then the data needs to be broken up into unequal-length segments. A month counter now needs to be set up for each segment, re-setting to zero, for each new recession. All this creates the effect of time-shifting.

And we're not done yet. The vertical axis shows the percentage job losses relative to the peak of the prior cycle! This means that for each recession, he has to look at the prior recession and extract out the peak employment level, which is then used as the base to compute the percentage that is being plotted.

One thing you'll learn quickly from doing this exercise is that this is a task ill-suited for a computer (so-called artificial intelligence)! The human brain together with Excel can do this much faster. I'm not saying you can't create a custom-made application just for the purpose of creating this chart. That can be done and it would run quickly once it's done. But I find it surprising how much work it would be to use standard tools like R to do this.
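To make the steps concrete, here is a minimal sketch in Python of the time-shifting and the percent-from-peak calculation. It is not how the original chart was built. It assumes the analyst has already picked the peak and ending positions of each recession by hand, which is arguably the laborious part, and the employment levels are made-up numbers. The function name and its inputs are hypothetical.

```python
def time_shift(levels, recessions):
    """Re-base each recession to month 0 and express jobs as % change from the peak.

    levels     -- monthly employment levels in chronological order
    recessions -- (peak_index, end_index) pairs, inclusive, chosen by the analyst
    """
    shifted = []
    for peak_index, end_index in recessions:
        peak = levels[peak_index]  # employment at the pre-recession peak
        segment = []
        for i in range(peak_index, end_index + 1):
            months_since_peak = i - peak_index  # counter resets for each recession
            pct_from_peak = 100.0 * (levels[i] - peak) / peak
            segment.append((months_since_peak, pct_from_peak))
        shifted.append(segment)
    return shifted


# Made-up employment levels, not real data.
toy_levels = [100, 98, 96, 97, 99, 101, 104, 102, 99, 100, 103]
toy_recessions = [(0, 5), (6, 10)]  # hypothetical peak and end positions
curves = time_shift(toy_levels, toy_recessions)
print(curves[0])
```

Each inner list is one line on the chart: the first value is the horizontal position (months since the peak) and the second is the vertical position (percent change in jobs from that peak). The segments are free to have unequal lengths.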


Let me get to my point. While this chart works wonders on a blog, it doesn't work on the printed page. There are too many colors, and it's hard to see which line refers to which recession, especially if the printed page is grayscale. So I asked CR for his data, and re-made the chart like this:


You'd immediately notice that I have liberally applied smoothing. I modeled every curve as a V-shaped curve with two linear segments, the left arm showing the average rate of decline leading to the bottom of the recession, while the right arm shows the average rate of growth taking us out of the doldrums. If you look at the original chart carefully, you'd notice that these two arms suffice to represent pretty much every jobs trend... all the other jitter is just noise.
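One simple way to get such a V shape is sketched below in Python. This is an assumption about the smoothing, not a record of the exact procedure: it takes the lowest point as the trough, draws the left arm from the peak to the trough and the right arm from the trough to the last observed month, and reports the two average monthly rates. It also extrapolates the right arm to the 0% line. The function name and the toy curve are hypothetical.

```python
def fit_v(curve):
    """Summarize a recession curve as two straight arms.

    curve -- list of (months_since_peak, pct_change_from_peak), starting at (0, 0.0)
    """
    trough_month, trough_pct = min(curve, key=lambda point: point[1])
    last_month, last_pct = curve[-1]
    if trough_month == 0 or last_month == trough_month:
        raise ValueError("need observations on both sides of the trough")

    decline_rate = trough_pct / trough_month  # % points per month, negative
    recovery_rate = (last_pct - trough_pct) / (last_month - trough_month)

    # Month at which the right arm, extended, would get back to 0%.
    months_to_zero = None
    if recovery_rate > 0:
        months_to_zero = trough_month - trough_pct / recovery_rate

    return {
        "trough": (trough_month, trough_pct),
        "decline_rate": decline_rate,
        "recovery_rate": recovery_rate,
        "months_to_zero": months_to_zero,
    }


# Made-up curve, not real data.
toy_curve = [(0, 0.0), (1, -2.0), (2, -4.0), (3, -3.0), (4, -2.0)]
v = fit_v(toy_curve)
print(v)
```

For the toy curve, jobs fall 2 points a month for two months and then recover 1 point a month, so the extended right arm reaches 0% at month 6.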

I also chose a small-multiples layout to separate the curves into groups by decade. When you only have one color, you can't have ten lines plotted on top of one another.

One can extend the 2007 recession line to where it hits the 0% axis, which would really make the point that the jobs crisis is unprecedented and inexplicably not getting any kind of crisis management.

(Meanwhile, New York City calls a crisis with every winter storm... It's baffling.)

When an industry is imploding, let's focus on a metric that remains constant

Augustine F. (@acfou) was not amused by a set of charts made by the Bureau of Labor Statistics, via Business Insider (link). Here's one of them:


The article's message is that the book, periodical and music stores industry has shrunk drastically (over 50%) in the last 10 years, but unless you spend time studying the chart, you're not likely to get this picture.

The bubbles are going right and up, which usually is indicative of an increasing trend. What is tripping us up is the employment level occupying the horizontal axis rather than the expected time dimension. The only real way to see the plunge in employment is to focus on the horizontal axis, and to notice the deepening color of the bubbles.

The chart is actually a scatter plot of number of firms versus number of employees. The slope of the line gives us the number of firms per employee, which is also unexpected since the usual metric is its reciprocal, the number of employees per company. However, since the slope is essentially constant, highlighting this number is pointless. While the industry is collapsing, the average workforce of the surviving firms has remained more or less the same.

I added a cone to the chart to visualize the narrow range in which the employees per firm varied during the past decade.

As if that's not confusing enough, the reciprocal of the slope is coded to the size of the bubbles on the chart. This requires a legend to explain. All of this means that readers' attention is directed to the average workforce metric, instead of the drop in employment.


The following indexed chart shows that the number of employees and the number of firms dropped in step during the ten years. Both dropped about 55% during the decade. This just confirms that the average employees-per-firm metric is not meaningful.
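The indexing itself is simple arithmetic, sketched here in Python with made-up numbers (not the BLS figures): each series is divided by its first value and multiplied by 100. When two series fall in step, their indexed values coincide and their ratio, employees per firm, does not move. The function name is hypothetical.

```python
def index_to_first(series):
    """Rescale a series so that its first value equals 100."""
    base = series[0]
    return [100.0 * value / base for value in series]


# Made-up counts for three points in time, not real data.
toy_employees = [200, 150, 90]
toy_firms = [20, 15, 9]

employees_index = index_to_first(toy_employees)
firms_index = index_to_first(toy_firms)
employees_per_firm = [e / f for e, f in zip(toy_employees, toy_firms)]

print(employees_index)
print(firms_index)
print(employees_per_firm)
```

In this toy case both indexes drop from 100 to 45, a 55% decline, while employees per firm stays at 10 throughout. A chart built around that ratio would have nothing to show.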


If you follow the link to the BLS analysis, you'll find some other interesting data, namely the "internet publishing" industry. Does it make sense to talk about the drastic decline in traditional publishing without talking about the rise of the "substitute" industry? The chart below shows that the new jobs created in Internet publishing filled almost all of the hole left in the traditional publishing industry. The decline from 2009 on may not be specific to the industry; it could just be the Great Recession. (As defined, I don't think the two industry sectors are exactly what I'm looking for, but it's close enough.)