
Leaving ink traces

Stefan S. at the UNEP GEO Data Portal sent me some intriguing charts, made from data about the environment.  The following shows the amount of CO2 emissions by country, both in aggregate and per capita.  We looked at some of their other charts before.

Co2emission

These "inkblot" charts are visually appealing, and have some similarities with word clouds.  It's pretty easy to find the important pieces of data; and while in general we should not sort things alphabetically, here, as in word clouds, the alphabetical order is actually superior as it spaces out the important bits.  If these were sorted by size, we would end up with all the big blots on top, and a bunch of narrow lines at the bottom - and it would look very ugly.

The chart also breaks another rule. Each inkblot is a mirror image about a horizontal line. This arrangement is akin to arranging a bar chart with the bars centered (this has been done before, here).  It works here because there is no meaningful zero point (put differently, many zero points) on the vertical scale, and the data is encoded in the height of each inkblot at any given time.

Breaking such a rule has an unintended negative.  The change over time within each country is obscured: the slope of the upper envelope now only contains half of the change, the other half exists in the lower envelope's slope.  Given that the more important goal is cross-country comparison, I think the tradeoff is reasonable.
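
To make the "half of the change" point concrete, here is a small sketch in R.  The emission values are made up for illustration; the only assumption is that each inkblot is drawn symmetrically about its center line.

```r
# Hypothetical emissions for one country at four time points
value <- c(10, 14, 20, 26)

# A mirrored inkblot splits each value evenly above and below the center line
upper <- value / 2
lower <- -value / 2

diff(value)   # true change between time points
diff(upper)   # change visible in the upper envelope: exactly half
```

The total height (upper minus lower) still encodes the data correctly; it is only the slope of either edge taken alone that understates the change.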

Co2emission2

Colors are chosen to help readers shift left and right between the per capita data and the aggregate data.  Gridlines and labels are judicious.

As with other infographics, this chart does well to organize and expose interesting bits of data but doesn't address the next level of questions, such as why some countries contribute more pollution than others.

One suggestion: restrict the countries depicted to satisfy both rules (per capita emissions > 1000 kg AND total emissions > 10 million tonnes).  In this version, a country like Albania is found only on one chart but not the other.  This disrupts the shifting back and forth between the two charts.




From light to heavy

Yesterday's post led to a number of dissenting comments. Some readers think the charts should be treated as serious works, and they also feel that there is nothing wrong with using the average as the reference level. When I first saw these, I appreciated the effort of the site to mine and analyze the data and wanted to enjoy them as amusing diversions. Well, since you demanded it, this post contains some heavy stuff.


On these charts, I side with Aleks who submitted them.  First, there is a mismatch of the axis with what is actually being plotted.  On the left side, I printed the data according to the vertical scale, and on the right, I printed the data according to the area/height of the columns (the differentials relative to the average). If the reader reads from the vertical axis, she will be reading out data that do not map to the height of the columns.

Okcupid2


Also notice that as the gray bars increase in height, the axis on the left chart tells us the percentages decrease.  The largest number (26%) corresponds to the shortest column. This sort of mismatch makes people dizzy!

As a rule, the scale should agree with what is in the plot.  It would not be a problem if the new scale were to be a mere shift of the old one.  For instance, I like to label my log-scales with the original data as opposed to 1, 10, 100, etc.  However, the following chart shows that the two scales in question (absolute, relative) are not a mere shift in location... for categories above average, they plotted X - average, while for categories below average, they plotted average - X.  This is the source of confusion for Aleks and myself.
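
A quick sketch of why the relative scale is not a mere shift.  The percentages below are invented for illustration (they are not the site's numbers); what matters is the comparison between subtracting the average and taking the distance from the average.

```r
# Hypothetical percentages for five categories (not the site's actual figures)
x <- c(26, 22, 18, 14, 10)
avg <- mean(x)

shifted <- x - avg        # a pure shift: order and spacing are preserved
plotted <- abs(x - avg)   # X - average above the mean, average - X below it

# Under a shift, every axis label still maps to exactly one bar height.
# Under the absolute deviation, two different values of x can share a height.
data.frame(x, shifted, plotted)
```

In this made-up example the categories at 26 and 10 end up with the same column height, which is the kind of mismatch between axis and plot described above.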

Okcupid3

Secondly, there is no reason why the average level need be the dividing line between the blues and grays.  In fact, looking at the bar chart above, one might ask whether the top 2 categories belong together, rather than the top 3.  One way to determine the right clustering of categories is to look at the increase in "value" from one bar to the next higher bar.  The chart below shows this data, and it's clear that there is a large increase between the second and third categories.  Thus, it would make more sense to put the dividing line between 2 and 3.

Okcupid4
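
The step-to-step comparison can be sketched in a few lines of R.  The category names and values here are hypothetical stand-ins, chosen so that the biggest drop falls between the second and third bars.

```r
# Hypothetical bar values, sorted from the highest category to the lowest
value <- c(A = 30, B = 27, C = 15, D = 13, E = 12)

# Drop in value from each bar to the next one down
gaps <- -diff(value)
names(gaps) <- paste(names(value)[-length(value)], names(value)[-1], sep = " -> ")

gaps
which.max(gaps)   # position of the largest gap, a candidate dividing line
```

This is only one simple way to pick a break; with real data one would also want to look at whether the largest gap is clearly larger than the rest.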


And finally, I agree with Andrew that the chart is much better just by turning it sideways so that there is room for the text labels.


Light entertainment

Reader Aleks is none too happy with the sets of charts here (what looks to be an online dating site).  It's hard to take these seriously but the people who made them sounded pretty serious.

Okcupid1 

Pay special attention to the whimsical placement of the horizontal axis.  Are those negative numbers?  Positive columns floating mid-air, and up-ended?

More at the site ("The 4 Big Myths of Profile Pictures").


[Update: 1/26/2010.  See the next post for my response to the comments below.]


Cashing in good

A superb effort by the Wall Street Journal.

This chart tells us corporations are hoarding cash, and the level of stashing varies by industry.

Wsj_companycash

The article is here, and the interactive version of the chart here.


Wsj_cashhoardexpanded

Clicking on each small multiple reveals the detailed trajectory over the 20-year period.

In the print version, the trajectory is printed in faded white within the larger plot. 

Both online and in print, the designers were thinking about foreground / background.




Most impressive is the highly successful attempt to simplify the data, or equivalently, to elevate the trend: each trajectory is represented by three points, with straight lines drawn between those points.  The points chosen were a decade apart so the lines represented straight-line growth/decline within a decade.
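
For readers who want to try this kind of simplification, here is a rough sketch in R.  The series is simulated and the years are my assumption; this is not the Journal's data or necessarily their method.

```r
# Hypothetical annual series standing in for one industry's cash level
set.seed(1)
year <- 1989:2009
cash <- 10 + 0.8 * (year - 1989) + rnorm(length(year), sd = 1.5)

# Keep only three points a decade apart, then join them with straight lines
keep <- year %in% c(1989, 1999, 2009)
summary3 <- approx(year[keep], cash[keep], xout = year)$y

# Largest gap between the full series and its three-point summary
max(abs(cash - summary3))
```

A large gap flags a series where the three points hide something, such as a peak in the middle of a decade.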

Can three points truly represent the 20-year trajectory?  You bet.  Clicking on each of the industry charts, I found that the only industry for which the three points did not adequately capture what happened was the Energy industry.


Wsj_cashenergy

The three-point summary obscured the increase in the cash stockpile for Energy companies in the early 2000s which peaked in 2004. Simplifying anything runs the risk of misrepresenting specific elements; however, a simplified message is much, much more likely to affect readers than inundating readers with too much data.




I do have reservations about the use of color and the legend.  The industry names could be printed at the bottom of each chart and it would be clearer.  For such a well-designed chart, color is not necessary either.


Reference: "Jittery companies stash cash", Wall Street Journal, Nov 3 2009.

 


Peek into beauty 2

Jeff W made some astute comments on the New York Times Netflix visualization, which I praised in the last post.  He pointed out that there is so much more to the underlying data than what can be shown within the confines of these maps.  For example, he wanted to know the relationship between Metacritic scores and Netflix ranks (or rentals), explore the heavy-tailed distribution of titles, expose regional differences, etc.

What he is hitting on is the shortcoming of the current approach to infographics... an approach which is about putting order to messy data, rather than summarizing, extracting and generalizing.  And it is also the difference between "data graphics" and "statistical graphics".

This is related to the modelers versus non-modelers dichotomy Andrew Gelman just discussed in this blog post.  (He cites Hal Stern as the source of the quote.)

Basically, non-modelers have the same philosophy as infographics designers - they want to make as few assumptions as possible, to rely exclusively on the data set.  By contrast, modelers want to reduce the data; their instinct is to generalize.  The stuff that Jeff wanted all requires statistical modeling.  As I mentioned before (say, here), I believe infographics has to eventually move in this direction to be successful.


Take the correlation between Metacritic score and Netflix ranking... the designers actually thought about this and they tried to surface the correlation, in a way that is strait-jacketed by the infographics aesthetics.  What they did was to allow the movies to be sorted by Netflix ranking, or by Metacritic score, using the controls on the top right.  And when the Netflix ranking is chosen for sorting, the Metacritic score is printed next to the map, so as the reader scrolls along, he or she can mentally evaluate the correlation.  Of course, this is very inefficient and error-prone but we should give the designers props for trying.

Building a model for this data is no simple matter either because multiple factors are at play to determine the Netflix ranking.  A good model is one that can somewhat accurately predict the Netflix ranking (color) based on various factors included in the model, such as the type of movie, the cost of the movie, the number of screens it played on, any affinity of a movie to a locale (witness "New in Town"), regions (at different levels of specificity), recency of the movie, whether it's been released in multiple formats, etc. etc.


Jeff's other point about ranking vs number of rentals raises another interesting statistical issue.  I suspect that it is precisely because the number of rentals is highly skewed with a long tail that the analyst chose to use rank orders.  If an untransformed number of rentals is used, the top few blockbuster films will dominate pretty much every map.
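
Here is a small illustration of the effect in R.  The rental counts are invented; I have no access to the underlying Netflix data, so treat this as a sketch of the general idea only.

```r
# Hypothetical rental counts: a few blockbusters and a long tail
rentals <- c(500000, 120000, 9000, 4000, 2500, 1200, 800, 300)

# Share of the color range each title would use on an untransformed scale
raw_scaled <- rentals / max(rentals)

# Rank order (1 = most rented): the spacing between titles becomes uniform
rnk <- rank(-rentals)

round(raw_scaled, 3)
rnk
```

On the raw scale most titles are squeezed into the bottom few percent of the range, while the ranks spread them out evenly.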


Keep the comments coming!


Definitely not working

Catherine Rampell pontificated on the nature of working families in the NYT Economix blog recently.  Are both members of the family working?  Or not working?

Nyt_workstatusfamilies

One thing is for sure.  Her chart just doesn't work.  Too many colors, unnecessary gridlines, should label the series directly in lieu of the legend.  Fails our self-sufficiency criterion: once every piece of data is printed on the chart via data labels, everything else is rendered redundant!

And the chart conveys the wrong message!  Looking at this chart, the reader would think the message is: the distribution of work status among married couple families has not changed materially between 2001 and 2009.  That is pretty much Rampell's conclusion.  She said that the gray bar moving from 17.5% to 18.8% in 2009 was "not a huge jump from year to year", and the pink bar going from 8.2% to 9.9% in 2009 was a "slightly bigger increase".

This unfortunately does not paint the whole picture.  For a start, most stacked column charts can be replaced with line charts like this:

Redo1_married

Because the vertical scale is no longer constrained at the upper end, the changes from year to year will appear more magnified in this version.  We may think some of the 2009 data to be different from the other 8 years, and we would be right.  Sometimes a small change is a big change - because the thing being measured just doesn't vary much.  The next chart highlights this:


Redo2_married

I went back to the raw data, expressed in the number of married couple families, and then created an index, with the 2003 count as 100 for each category.  The black line shows the underlying demographic growth in the total number of such families over the 9 years.  It's now obvious that there has been a pretty remarkable jump in families with only the wives working.
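
The indexing step can be sketched as follows.  The counts are hypothetical placeholders (not the published figures), and the column names are mine.

```r
# Hypothetical counts (thousands of families) by year; not the published data
year <- 2001:2009
counts <- cbind(
  total     = c(56000, 56400, 56800, 57100, 57500, 57900, 58300, 58600, 58900),
  wife_only = c(3400, 3500, 3600, 3650, 3700, 3750, 3800, 3900, 4500)
)
rownames(counts) <- year

# Divide each column by its own base-year count, so every series is 100 in 2003
base <- counts["2003", ]
index <- sweep(counts, 2, base, "/") * 100

round(index, 1)
```

Plotting each indexed column against the indexed total then shows which categories are growing faster or slower than the overall trend.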

The difference between this chart and the other two charts is the choice of comparisons.  The first two purport to look for year-on-year changes; such comparisons are hard because how big a shift is to be considered noticeable?  In the last version, we regard as noticeable a growth curve over time that differs significantly from the underlying growth trend (black line).

More sophisticated statistical tests can also be used to establish whether the shift in distribution is "statistically significant" but I think the visualization is sufficient.


Reference: "Working Families, Not Working", New York Times, Jan 15 2009.

 
 


Peek into beauty

This graphic feature is the best from the NYT team yet. I particularly love the two columns on the right which allow us to see regional differences.  For example, this "New in Town" movie was much more popular in Minneapolis than in any of the other metropolitan areas, and was particularly unwatched in New York.  Also, note the choice of sorting allowed on the top right.

Click here and enjoy!


Nyt_netflix

Reference: "A Peek into Netflix Queues", New York Times, Jan 10 2009.


Playthings in the unreal world 3

Some readers may be interested in the R code used to generate the small multiples charts.  The code also highlights one of the virtues of R, which is "elegance": because it natively handles vector and matrix computations, the programmer can (if he or she chooses to) reduce the use of (inelegant) looping.  (Yes, coding elegance is a kind of romantic ideal, and inelegant code has many practical advantages -- easier to debug, easier to manage, easier to collaborate on with others, etc.)


# reading in data

bigmac = read.csv("bigmac.csv", header=T)

# initializing empty matrix

bigmac2 = matrix(0,nrow(bigmac),nrow(bigmac))
colnames(bigmac2) = bigmac$Country
rownames(bigmac2) = bigmac$Country


# main computation

for(i in colnames(bigmac2)) bigmac2[,i] = bigmac$Price/bigmac$Price[bigmac$Country==i]
bigmac3=round(bigmac2-1,3)


# this matrix holds the colors of each bar to be plotted

bigmaccol= (bigmac3>0)


# graphical parameters

par(mar=c(3,7,3,1), mfrow=c(2,2),cex.axis=0.8, cex=1)


# plotting

for (i in c("US","EuroArea","Japan","China")) {
    barplot(rev(bigmac3[,i]), horiz=T, xlim=c(-1,3),las=2, col=rev(bigmaccol[,i]), main=paste("Relative to ",i))
}


In the main computation step, the one formula takes the original vector of prices (the left column in the Economist's chart), computes relative prices 23 times using successively each country's price as the standard, and deposits all 23 vectors into a matrix.  Then, the next step takes the entire matrix, subtracts 1 from each entry, and rounds each entry to 3 decimal places.
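
For those who want to drop the loop altogether, one possible alternative is outer(), which builds the whole matrix of relative prices in a single call.  The four prices below are made-up stand-ins for the contents of bigmac.csv, and the name rel is mine.

```r
# Hypothetical prices in US$ for four countries (not the Economist's data)
bigmac <- data.frame(
  Country = c("US", "EuroArea", "Japan", "China"),
  Price   = c(3.58, 4.84, 3.50, 1.83)
)

# outer() divides every price by every other price in one call:
# entry [r, c] is the price in country r relative to country c
rel <- outer(bigmac$Price, bigmac$Price, "/")
dimnames(rel) <- list(bigmac$Country, bigmac$Country)

# Same final step as before: subtract 1 and round to 3 decimal places
round(rel - 1, 3)
```

Each column of the result plays the role of one reference country, and the diagonal is always zero since every currency is priced against itself.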

In plotting, the default options are not satisfactory: the changes I made included switching from columns to bars, reversing the order of plotting, setting the left and right edges of the value axis, turning the country labels on their side (this also turns the value labels -- there is a way to set this for one axis only but not the other but I did not bother), making the positive bars black and the negatives white, and supplying titles that are dynamically assigned according to reference country.

Also, it is almost always true that the global graphical parameters need to be adjusted.  Here, I controlled the amount of white space on each side of the plotting window, set up a 2x2 grid of charts, and changed the font size on the axes.


Playthings in the unreal world 2

Reader Bill anticipated my next post, which is to use small multiples to explain the challenges of using relative scales.  Zbicyclist's point 1 is absolutely correct in the sense that the "model" uses the US$ as a standard, and only establishes "relative values".  That is true but does not address the question I posed, which is: under this "model", a researcher cannot conclude that the US$ is over- or under-valued; in other words, it always must have the correct valuation.  Now, just like in other economic models, the theorist does not state explicitly the assumption that US$ cannot have over- or under-valuation.  It is just that the assumption of the US$ as a standard necessarily leads to the result that it is always correctly valued.

If we choose another currency, say the Euro, as the standard, then under that theory, the US$ can have over- or under-valuation.  However, now the Euro has been assumed to have the correct valuation.

The following charts show over (black) and under (white) valuation with different currencies as the standard:

Redo_bigmac

This is a pretty complicated issue.  The problem is that there is no external metric to measure value.

Note that I have revised my recommendation on what scale to use for the value axis based on zbicyclist's comments.  Since this chosen scale cannot attain values below -1, we should treat -1 as minimum value and use that as the left edge. (By the way, my scale multiplied by 100 is the Economist's scale.)

Also food for thought: should such a strange scale, allowing values between -1 and +infinity, be used?  Percentage scales often have the characteristic that a 20% increase and a 20% decrease are not merely a difference in direction but also in magnitude.  However, it is natural to assume that a 20% increase/decrease is different only in direction so such scales are misleading.  What's better?
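
A bit of arithmetic in R shows the asymmetry.  The log ratio at the end is just one commonly used candidate for a symmetric scale, included here as an illustration rather than as the answer to the question.

```r
price <- 100
up   <- price * 1.20   # a 20% increase
back <- up * 0.80      # followed by a 20% decrease
back                   # does not return to the starting price

# Percent change is bounded below by -1 but unbounded above
pct <- function(new, old) new / old - 1
pct(2, 1)   # doubling
pct(1, 2)   # halving

# Log ratios: doubling and halving differ only in sign
log(2 / 1)
log(1 / 2)
```

The trade-off is that log ratios are less familiar to general readers than percentages.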



Reference: "Playthings in the unreal world", Junk Charts.


Playthings in an unreal world

It's been said that the economic models used by many mainstream economists this decade suffered from a fatal flaw: that of many unrealistic assumptions (such as no speculative bubbles) that are often needed to "make the math work"; sometimes, these are not direct assumptions but consequences of other assumptions.  See for example Willem Buiter, Paul Krugman, Scientific American.


Another favorite plaything of economists is the so-called "Big Mac index".  The Economist magazine, which seems to own this toy, proclaims it "the most accurate financial indicator to be based on a fast-food item", and the sub-title of the page is "Exchange-rate theory".  It is the cost in US$ of a Big Mac overseas divided by the cost of a Big Mac in the U.S., and is an indicator of whether the current foreign exchange rate (vs. US$) is over-valued or under-valued.

Econ_bigmac

I saw this chart on the Business Insider site, where a claim is made about the Chinese currency being undervalued by 50%.

Sadly, the Economist has gone the way of USA Today in embellishing its graphics with distracting, loud, uninformative images.  Besides the chartjunk, we should always place the zero line in the middle of the chart for this sort of scale where the data could theoretically lean in either direction.  This allows readers to mentally judge the magnitude of the differentials. As pointed out by zbicyclist, this statement isn't appropriate to this particular scale.  The scale chosen has the peculiar range of between -100% and +infinity: in order to help readers appreciate this, I would set the left edge of the scale to -100%, and let the right edge expand to cover the actual data.

Food for thought: could it be that under this economic "theory", the US$ can never be over- or under-valued, that it is always correctly valued?