Apr 25, 2008

Knit-picking

Nyt_tuitionfree2

In celebrating the recent trend among "elite" colleges of lowering the cost of education, the Times printed this chart, the top part of which is shown here.

The three colors represent different levels of aid.  Blue means "grants replace loans"; red means "free tuition"; yellow means "parents pay nothing".  The colleges are grouped by the minimum qualifying income for the blue category.

The whole effect is of a knit.  We shall call this the "knit chart".

I believe a simple data table will do the job nicely.  If any reader has other ideas, please show us your work!

A few points to note about the original:

  • Ordering by the minimum income to qualify for "grants replace loans" is arbitrary, as is alphabetizing colleges within each group
  • Qualifying "at any income level" should be shown on the left of "$40,000 or below" rather than to the right of $100,000.  The current order is such that qualifying level increases with income from left to right, except from $100,000 to "any income", where it falls off a cliff.
  • Qualifying at any income level is better shown as a separate column on the right disconnected from the income scale.  The current configuration devalues the effort spent in making a proper income scale.
  • Too many lines of equal length, and too few yellow and red lines to make the knit chart effective
  • Should the graph cater to parents interested in seeing what aid they qualify for given their income level?  Or should the graph highlight the breadth of aid available at individual colleges?

Reference: "The (Yes) Low Cost of Higher Ed", New York Times, April 20 2008.

PS. The original point about "any income level" was incorrect, as Chris pointed out below.  I have replaced it with a different issue.

PPS. Matias' version (see comments) is a superb demonstration of the power of data tables, well-applied.   It is clean and simple, and addresses both the questions pointed out in the last bullet point.  The only thing sacrificed was the visual representation of the relative size of the income requirements, which I agree is the least valuable part of the original.  As usual, many thanks to our readers for coming up with great ideas!

Redo_tuitionfree2

Apr 12, 2008

Hanging tough

Orig_literacy

Reader Nick B. sent in this example calling it "interesting".  The chart tells a compelling story once we figure out what it is.  Grasping the tree structure is key.

It illustrates the important idea that averaging sometimes masks variations in the data.  For example, while the state of Guerrero scored 78% on literacy, the municipalities within Guerrero had scores ranging from 28% to 90%.

It also shows that the gender gap was larger in the less literate Metlatonoc municipality than in the more literate Cuautitian.

In addition, it tells us that while Mexico on average measured very well on literacy, subpopulations within Mexico spanned the world's best and worst (from about Mali's level to Italy's).

While I find this chart adequate, the pieces hanging off each other did not seem ideal, especially the two overlapping municipality pieces which were placed next to each other.  However, it is tough to come up with an alternative.  Here's one attempt; the changes are mild.

Redo_literacy_2

I prefer the horizontal orientation.

The branches are emphasized (as opposed to the "T" junction) because that's a key part of the story.

The national level, especially the span between Mali and Italy, is de-emphasized; I treat it as gridlines.

Instead of placing the overlapping pieces next to each other, I let the ranges literally overlap, which serves to stress this feature.


 

 

Mar 01, 2008

Don't believe what you see

Cbo_taxrate

Mankiw's blog linked to a press release by Congressman Jim Saxton, using CBO data to show "middle income tax burden at lowest level in decades".  The attached graph, as Junk Charts readers will immediately recognize, is classic chartjunk.  Every time the vertical axis does not start at zero, one suspects something is amiss.  And what's with the gridlines and data labels?

"Don't believe it? Check out the data source yourself."  I followed Mankiw's suggestion and was indeed surprised... but not by the great fortune of the "middle class".  The surprise was how the chart painted a dishonest picture of the CBO data.

Redo_taxrate1

The original chart plotted only the tax rate experienced by the middle 20% of the population.  The CBO provided data for all five quintiles; why not plot them all?  In this new chart, the "surprise" windfall to the middle 20% proved not to be anything special at all!  All five quintiles, especially the middle three, followed pretty much the same trend over time.  The effect of singling out the middle 20% is to deprive readers of the context in which the data should be interpreted.

Redo_taxrate3

Further, what might be the result of the declining middle income tax burden?  The CBO data painted an unexpected picture.  Paradoxically, as the middle 20% saw their tax rate decrease, they also earned a smaller share of the nation's after-tax income (black line).  At the same time, the top 1% saw their share of after-tax income double from about 8% to almost 16% (blue line).  The top 20% line is also upward-sloping, although less pronounced.  So, the implication that the middle class have had it good is plainly wrong.

What is going on?  Two factors were at play and the Congressman presented only one side of the story (the tax rate).  What he omitted was that during this period, the nation's wealthy took home larger and larger shares of the pre-tax income.  This shift in pre-tax income more than offset any relative reduction in tax rate for the middle 20%.

This distortion can be traced back to the use of quintiles (or more generally, ranks).  We use them to cope with data having extreme distributions, but a by-product is losing information about how extreme the extreme values are.  As demonstrated here, the quintiles of old are really different from the quintiles of today because the underlying distribution has become much more extreme.
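To see how ranks can hide a change in the tail, here is a minimal sketch in Python.  The incomes are made-up numbers chosen purely for illustration (they are not CBO data), and the helper name quintile_shares is my own.

```python
def quintile_shares(incomes):
    """Return each quintile's share of total income, lowest quintile first."""
    ordered = sorted(incomes)
    n = len(ordered)
    if n % 5 != 0:
        raise ValueError("this sketch assumes the population splits evenly into 5 groups")
    size = n // 5
    total = sum(ordered)
    return [sum(ordered[i * size:(i + 1) * size]) / total for i in range(5)]


# Hypothetical incomes (in $000s) for ten households; NOT real data.
earlier = [10, 15, 25, 30, 45, 50, 70, 80, 120, 155]
# Same households, same ranks; only the top two incomes have grown.
later = [10, 15, 25, 30, 45, 50, 70, 80, 200, 475]

for label, incomes in (("earlier", earlier), ("later", later)):
    shares = quintile_shares(incomes)
    print(label, [f"{s:.1%}" for s in shares])
```

In this toy example every household keeps its rank and the middle quintile's incomes do not change at all, yet its share of the total falls from about 16% to 9.5%.  The quintile labels stay the same while the distribution underneath them becomes more extreme.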

Finally, another bit of mystery (to me) is how the middle 20% came to be considered "middle class".  Is there a widely accepted definition?

Reference: "CBO Data Show Middle Income Tax Burden At Lowest Level in Decades", Feb 21 2008.

Dec 09, 2007

Lacking buzz

Nielsen, they of the ratings, is roughing it in the information age.  When they announced on-line tracking tools, Wired quipped: "It's looking like online video policing companies will have to make room for another deputy."  Last year, cable companies revolted over a service measuring the effectiveness of commercials.

Via the Data Mining blog, I learnt about yet another new on-line offering, called "Hey! Nielsen" for obscure reasons.  (Perhaps Hey! Nielsen is the new Yahoo! !)

The site is an enigma wrapped in a mystery.  The official description says:

Hey! Nielsen is the place to make a name for yourself while trading opinions on TV, movies, music, personalities, web sites and more.

How does one "trade" opinions?

According to the FAQ, the "Hey! Nielsen" score, the cornerstone of the site, is:

a real-time indicator of a topic's impact and value and you play a major role. As the site evolves and users submit their opinions and commentary, the score will rise or fall based on a number of factors including, but not limited to, user opinions, news coverage, and raw data from our sister sites Billboard.com, HollywoodReporter.com, and BlogPulse.com.

Sounds like a product aimed at marketers to help them track public opinion but offering little control over sampling. 

The "Hey! Nielsen" buzz chart (below) captures the change in "Hey! Nielsen" score over time.

Heynielsen

This chart is an unfortunate case of flipping background into foreground.  What grabs our attention are those hideous white circles with numbers in them.  The legend explains that these are the daily numbers of opinions on the subject, in other words, the daily sample sizes.  As they stand now (with the site still in beta), they serve to expose the low level of participation, leading to small sample sizes, and irrelevance.  But what if the site became super-popular: would the circles say 56234, 19245, 90257, etc.?  Why would visitors care about daily sample sizes anyway?  Mousing over these circles reveals text, but in most cases it is blocked by neighboring white circles.

In the meantime, the circles obscure the line which shows the trend in the "Hey! Nielsen" score over time.  This chart reminds me of that Google toy known as Google Trends.  The Googlers provide no vertical scale so the graphs are unreadable.  "Hey! Nielsen"ers provide a vertical scale -- kind of -- but the graphs are still meaningless: what does a score of 881 mean?  how about 724?  what is the maximum score?  what is the minimum?  Beware numbers without context.

The vertical axis does start from zero but has an odd spacing of tick labels. The gridlines are distracting and serve no purpose.  The orange area under the curve also makes little sense.

We look forward to seeing version 2.0.

 

Jul 21, 2007

Exception to the rule

It's pretty hard to decree hard-and-fast rules for graphical design; every rule seems to admit its exception.  This reinforces Tufte's contribution as he has successfully organized the rules in his collection of books.

Dustin J sent in this chart from the Economist.  Its first impression is ugly and overly complex.

Econ_petrol

Dustin commented:

Steven Few says not to use stacked bar charts because you cannot compare individual values very easily and as a rule I avoid stacked bars with more than six or seven divisions. What do you think of this stacked bar--I think it is quite effective in telling the story.

On this blog, I have also re-done some stacked bar charts but this one is truly an exception to the rule.  This one works because it's not about the individual components; it shows that the US consumes more than all those countries combined.

If only it had a proper caption!  The Economist is uncharacteristically detached here: "Petrol consumption per day", "Litres bn, 2003".  How about "Goliath v. Davids"?  "US v. the World"?  "Dream Team USA"?

It'd help if they toned down the colors; also, by simply annotating the total litres for the US and the total for the other countries, they would have made a clearer point without using gridlines.  But these are minor glitches in an otherwise effective chart.

Source: Economist, July 2007.

Jul 16, 2007

Gauging the water level

Nyt_water

This set of charts covered the back page of one of the New York Times' sections this weekend.

Regular readers will share my enthusiasm for the top chart.  It makes a clear, cogent case to support the article's thesis concerning the rise of bottled water.  Various renditions of this type of chart have appeared here, for example.

Specifically, the smart use of color to cluster the line objects helps interpret the trends.  Blue sets out the two primary interests.  (It's a mystery to me why the gray lines were separated into darker and lighter hues.)

The twenty-year horizon used is another nice touch. I'd remove the gridlines although they aren't too distracting here.

Sadly, the second graphic does not meet the high standard of the first.  The biggest problem concerns the red rectangle, purportedly showing how much of the bottled water was imported.  The choice of differently-sized bottles as objects makes it impossible to gauge what proportion of the total was imported.  If the rectangle were placed over 1-litre bottles instead, it would look smaller.

Source: "A Battle Between the Bottle and the Faucet", New York Times, July 15, 2007.

Jun 26, 2007

Baby names and success

Wsj_babynames

While we speak of baby names, David F. nominates this set of 6 charts from WSJ.  Compare this with Wattenberg's names voyager, and the benefit of interactive graphics is immediately evident.

In David's words:

They show graphs of six different names, but the two on the bottom use a dramatically different scale (from 1st to ~20th, instead of from 1st to 1000th). The introductory text notes the difference, but it is still a shock.

We like the use of "small multiples" but their impact is compromised if we don't keep the background material constant so that readers can compare between charts.  Because the scales differ, the message is distorted: Mary has had a much larger drop than David, and it's easily missed in these charts.

Lines should take the place of areas which carry scant meaning in this context.

The use of blue and red is a nice touch but dovetailing the male and female charts strikes us as excessive fun.  It would have been clearer to give the sons and the daughters their own columns.

The article itself relates the anguish of modern parents in naming their babies.  Much of this angst can be traced to serious econometric studies that claim to have found cause-and-effect relationships between someone's name and their eventual success in life.  Some of this research was highlighted in Freakonomics, for example.  My stance is that all such studies are dubious, there being innumerable confounding factors (socio-economic, genetic, cultural, luck, etc.).  In addition, the measured response can range from "happiness" to income to many other metrics.  The danger of finding something because one looks hard enough is very real.  We don't currently have tools powerful enough to substantiate this sort of study.

Source: "The Baby-Name Business", Wall Street Journal, June 22, 2007

Jun 17, 2007

Foreground, background

Derek C. points us to this effort by a science journalist to use graphs to help "clarify the concept of climate change".  The graph on the left shows that actual greenhouse gas emissions have exceeded the level predicted by the most pessimistic climate models.  The 3D bar chart on the right examines which countries have increased emissions the most since 1990.

Warming

While the bar chart contains many of Tufte's "ducks" (not sorted by percent change, 3D, color, gridlines, sufficiency, etc.), it's the left chart that can be made more powerful.

Redo_warming2

The casual observer does not need to know which model led to which trajectory of predictions; the graph is vastly simplified, and the message much clearer in the junkart version.  (I only included the CDIAC data because I didn't locate the EIA numbers.)

The general point here is recognizing what is foreground, and what is background.  Aside from gridlines, data labels, axis labels and so on, some of the data usually constitute background material, often as in this case being used to establish comparability.

One message I got out of this chart is that these climate models have done a good job!  (Now, I have no idea if part of the curve included the training period.  It is curious that the predictions were very narrowly contained in the early 1990s.)

Source: The Island of Doubt Blog, June 6, 2007.

Mar 01, 2007

Information gain and loss

The previous two posts indicated that CNN, TWC and Intellicast had the best on-line weather forecasting accuracy by looking at the median and mean error in predicting daily low and high temperatures over 41 days.  Is it possible to differentiate between those three?

For that, we need more data, so I switched from summary statistics back to the raw data.  In this new chart, the day-by-day errors are plotted.  The gridlines mark errors within 5 degrees, which is an arbitrary guideline for acceptable / unacceptable.  The three scatters look remarkably similar, although CNN appears to hit the bull's eye (the middle square) with less bias (errors more evenly distributed) but not much better accuracy overall (similar number of unacceptable errors).

Redoonlineweather3
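The distinction between bias and accuracy can be made concrete with a short sketch.  The error values below are invented for illustration (they are not the 41 days of forecast data), and the function name summarize_errors and its threshold parameter are my own labels, with the 5-degree cutoff taken from the arbitrary guideline above.

```python
from statistics import mean, median


def summarize_errors(errors, threshold=5):
    """Summarize forecast errors (forecast minus actual, in degrees)."""
    return {
        # Bias: signed errors cancel out when they are evenly spread around zero.
        "mean_error": mean(errors),
        "median_error": median(errors),
        # Accuracy: the typical size of a miss, ignoring its direction.
        "mean_abs_error": mean(abs(e) for e in errors),
        # Count of days falling outside the acceptable band.
        "unacceptable": sum(1 for e in errors if abs(e) > threshold),
    }


# Hypothetical daily errors for two forecasters; NOT the real data.
evenly_spread = [-6, -3, -1, 0, 1, 3, 6]
runs_warm = [0, 1, 1, 2, 3, 6, 8]

print(summarize_errors(evenly_spread))
print(summarize_errors(runs_warm))
```

Both made-up forecasters have two unacceptable days, but the first has a mean error of zero while the second runs three degrees warm on average.  A forecaster can therefore show less bias without being much more accurate overall.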

Feb 12, 2007

Horrid stuff

Ec_smoke

Small multiples can work wonders when data are replicated, as in this case.  The chart accompanied an Economist article on pollution levels in several European cities, as indicated by the concentration of nitrogen dioxide and particulates.

In the junkart version, I plotted the data series side by side, rather than one over the other.  Further, the order of cities was according to decreasing levels of NO2, which seemed to be the worse pollutant.  All gridlines are removed except the 30 line which worked pretty well to separate out the highly polluted cities.

Redopollutant

An odd pattern has now surfaced.  Namely, there is some degree of negative correlation between the concentration of the two pollutants.  Environmental scientists may be able to tell us why.


Reference: "The Big Smoke", Economist, Feb 3 2007.
