Apr 25, 2008

Knit-picking

Nyt_tuitionfree2

In celebrating the recent trend among "elite" colleges toward lowering the cost of education, the Times printed this chart, the top part of which is shown here.

The three colors represent different levels of aid.  Blue means "grants replace loans"; red means "free tuition"; yellow means "parents pay nothing".  The colleges are grouped by the minimum qualifying income for the blue category.

The whole effect is of a knit.  We shall call this the "knit chart".

I believe a simple data table will do the job nicely.  If any reader has other ideas, please show us your work!

A few points to note about the original:

  • Ordering by the minimum income to qualify for "grants replace loans" is arbitrary, as is alphabetizing colleges within each group
  • [Withdrawn, see PS] Qualifying "at any income level" should be shown to the left of "$40,000 or below" rather than to the right of $100,000.  The current order is such that qualifying level increases with income from left to right, except from $100,000 to "any income", where it falls off a cliff.
  • Qualifying at any income level is better shown as a separate column on the right disconnected from the income scale.  The current configuration devalues the effort spent in making a proper income scale.
  • Too many lines of equal length, and too few yellow and red lines to make the knit chart effective
  • Should the graph cater to parents interested in seeing what aid they qualify for given their income level?  Or should the graph highlight the breadth of aid available at individual colleges?

Reference: "The (Yes) Low Cost of Higher Ed", New York Times, April 20 2008.

PS. The original point about "any income level" was incorrect, as pointed out by Chris below.  I have replaced that with a different issue.

PPS. Matias' version (see comments) is a superb demonstration of the power of data tables, well-applied.  It is clean and simple, and addresses both of the questions raised in the last bullet point.  The only thing sacrificed was the visual representation of the relative size of the income requirements, which I agree is the least valuable part of the original.  As usual, many thanks to our readers for coming up with great ideas!

Redo_tuitionfree2

Mar 30, 2008

Small multiples re-imagineered

Nyt_disney

This chart gave me trouble.  I kept staring at it, staring.  Searching for the legend.  What could the several lines, in different colors, represent?  Take a look yourself.




Well, it turns out all three graphs plot the same data.  In each one, a different line was given dark blue to highlight a particular amusement park.

I have not seen this tactic used before.  This is like the small-multiples concept except that every chart contains the same data.  Is it better than having just one chart?

Reference: "Will Disney Keep Us Amused?", New York Times, Feb 10 2008.




PS. [4/6/2008]  Here are two alternative charts contributed by our readers.  See comments below.

Derek suggested using sparklines:

Redo_parks1

Zuil reverted to basics:

Redo_parks2

Feb 25, 2008

Playful and exploratory

I share reader Bernard L.'s enthusiasm for this very imaginative chart, courtesy of the graphics people at NYT.  The chart captures the ebb and flow of weekly movie receipts over the last two decades.
Nyt_films
The details that particularly interest me include:

  • The addition of area colors (on top of lines) serves to highlight box office successes; this really helps readers sort out the massive amount of data
  • Nicely spaced text (and dots) does not interfere with our reading of the chart
  • The hiding of text for less important films, plus taking advantage of interactivity to show their titles if the reader mouses over the respective areas

All of the above indicate a keen sense of foreground versus background.  Besides, the authors had the good sense to speak of inflation-adjusted box office sales; I'm tired of the movie industry proclaiming higher sales each year when ticket prices are rising, and the population is growing.

This is another chart where more data do not easily translate into better communication (see my guest post at Flowing Data).  While I like the playful nature of the interactive chart, it is left to the reader to discover the information buried in the data, such as the assertion in the header that Oscar-winning films typically take time to attain box-office success while many blockbusters do not Oscars make.

In this presentation, it is challenging to compare the total receipts of one film versus another (this requires comparing oddly shaped, partially obscured areas).  It is also hard to compare across years since the data is spread out over a lot of space.

There may really be two types of graphics: the one like the example here which is a dictionary and designed for exploration; and the other kind where the designer has selected a subset of the data to make a specific point.

Reference: "The ebb and flow of movies", New York Times, Feb 23 2008.

Feb 10, 2008

Ordering and grouping

The Times reported that January retail sales generally disappointed, and consumers showed a preference for discount retailers over department stores.

Nyt_retailjan


Redo_retailjan

Taking the bar chart on the right, re-ordering by change in same-store sales, and grouping companies by type of retailer, we can present the data to match the text more closely.  The divergent performance between discount retailers and department stores is readily visible.
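For readers who want to try the re-ordering and grouping themselves, here is a minimal sketch.  The retailer names, types and percentages are invented for illustration and are not the Times' figures; the function name `group_and_order` is likewise just a label I made up.

```python
# Hypothetical same-store sales changes (percent); names and figures are
# invented for illustration and are not the Times' data.
retailers = [
    ("Retailer A", "department store", -7.1),
    ("Retailer B", "discounter", 0.5),
    ("Retailer C", "department store", -1.1),
    ("Retailer D", "discounter", 4.5),
    ("Retailer E", "department store", -6.6),
]


def group_and_order(rows):
    """Return [(retailer type, [(name, change), ...]), ...].

    Within each type, retailers are sorted by sales change, best first.
    The types themselves are sorted by their average change, best first.
    """
    by_type = {}
    for name, kind, change in rows:
        by_type.setdefault(kind, []).append((name, change))

    def average_change(kind):
        changes = [change for _, change in by_type[kind]]
        return sum(changes) / len(changes)

    return [
        (kind, sorted(by_type[kind], key=lambda row: row[1], reverse=True))
        for kind in sorted(by_type, key=average_change, reverse=True)
    ]


# Print the rows in the order they would appear in the bar chart.
for kind, members in group_and_order(retailers):
    print(kind)
    for name, change in members:
        print(f"  {name:<12}{change:+5.1f}%")
```

Ordering the groups by their average change is one choice among several; ordering by the median, or simply fixing the group order by hand, would serve the same purpose.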












Reference: "Weak January dashed retailers' gift-card hopes", New York Times, Feb 8 2008.

 

Jan 31, 2008

Jittering lines

A reader alerted me to this NYT chart a few weeks back.  The chart plots daily changes in stock index prices (gray lines) and yearly changes (color blocks). 

Nyt_volatility

The blue blocks represent bad down years, but notice that the daily changes during many of those periods give no such impression.

Nyt_volatility2_2

In fact, the gray lines are quite equally balanced on both sides of 0, and yet the annual tallies swing from positive to negative quite frequently.  It is by no means true that one exceptional down day predicts a down year.

The problem arises from cramming too much data into too small a space.  We can't judge the density of the lines on paper and so can't judge whether there were more up lines than down lines.

This issue is not dissimilar to the question of jittering when it is used with large data sets.

Source: "The Pulse of Uncertainty", New York Times, Jan 4 2008.

 

Jan 17, 2008

Football rankings 2

Nyt_nfloffense

The above chart is another one in the NYT series on the NFL playoffs.  It evaluates the mix of passing and rushing attempts by offense.  The convoluted way by which the caption strains to tell a story indicates trouble ahead:

Of the three playoff teams that threw the ball the most, two of them come from cities known for cold weather.  Conversely, of the three teams that ran the most, two of them play their home games in milder weather.

The implication is that teams from cold-weather cities are supposed to want to rush more, and vice versa.  And the data (total of six samples) pointed to the opposite.

This presentation suffers from low data-to-ink ratio:  too much ink is spilled over not much data.  The designer arbitrarily picks one of the two variables (passing attempts, rushing attempts) as the primary, sorting variable -- trace the orderly green diamonds on the right chart.  This makes it hard to see a pattern in the brown diamonds.  As usual, a scatter plot works much better with two data series.

Redo_nfloffense

In the junkart version, the raw numbers of attempts are converted into the proportion of attempts that were passing versus rushing.  This easy move immediately collapses the two dimensions into one.  Now, we have room to include an extra variable which matters: the average amount of snowfall in these cities.
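The conversion is simple enough to sketch.  The team labels, attempt counts and snowfall figures below are placeholders I made up, not the numbers behind the chart, and `passing_share` is a hypothetical helper name.

```python
def passing_share(pass_attempts, rush_attempts):
    """Fraction of a team's offensive attempts that were passes."""
    total = pass_attempts + rush_attempts
    if total <= 0:
        raise ValueError("need at least one attempt")
    return pass_attempts / total


# Hypothetical rows: (team, passing attempts, rushing attempts,
# average snowfall in inches). None of these are real statistics.
teams = [
    ("Team X", 580, 420, 45.0),
    ("Team Y", 500, 500, 0.0),
    ("Team Z", 450, 550, 20.0),
]

# Two variables collapse into one share, leaving an axis free for snowfall:
# each team becomes a single (snowfall, passing share) point.
points = [(snow, passing_share(p, r)) for _, p, r, snow in teams]

for (team, _, _, _), (snow, share) in zip(teams, points):
    print(f"{team}: snowfall {snow:.0f} in, passing share {share:.0%}")
```

Plotting `points` as a scatter gives the kind of chart described here, with snowfall on one axis and passing share on the other.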

So what does the data say about the relationship between propensity to pass and cold weather?  There appears to be very little relationship as the dots are all over the chart.  In particular, the teams playing in cities with the highest snowfall span the range of passing percents; similarly, those playing in lowest-snowfall cities also span the range of passing percents. 

The caption ignores all the blue dots, focusing only on the gray ones.  A more direct examination of the relationship reveals the folly of the so-called "not so conventional wisdom".

References: "NFL Offences Undergo a Thaw in Thinking", New York Times, Jan 5 2008; government snowfall statistics.

Jan 10, 2008

Football rankings 1

The Times' sports pages made wise use of graphics in a series of NFL articles recently.  Here is a rank plot (below left) comparing Jaguars quarterback David Garrard to seven other quarterbacks who started the weekend of January 5.

Nyt_garrard

Simple and effective, this chart does not fuss around in showing us where Garrard ranks relative to the others. 

Redo_garrard

The junkart revision (below right) plays with a different scale: the spacing between the tick marks represents proportional differences in the underlying metric.  This gives us a little more: for example, Garrard's second rank in completion percentage is less remarkable than first thought, as he essentially tied with the 3rd and 4th best while the top six were bunched between 60 and 65 percent.

But Garrard's touchdown-to-interception ratio stands out, as the next best quarterback attained only about half his ratio.  (Todd Collins, who had not thrown an interception until that time, was omitted; he also had started only four games.)


References: "Two Dreams (One Big, One Tiny) Come True", New York Times, Jan 4 2008; ESPN statistics.

Dec 25, 2007

Doctoring charts

Reader Chris P. alerted us to a fascinating post from Errol Morris' blog, which presents results in graphical form from a readers' poll related to this other post.  This other post deals with a pair of photographs taken during wartime, previously discussed by Susan Sontag and others.  Sontag believed the pair documented a before-and-after setting: it was alleged that the photojournalist shifted some cannon balls from their natural position between takes. 

Morris polled his readers asking them in which order they thought the photos were taken ("on before off", "off before on", "undecided"), and which factors were used to make the decision.  He presented results in two formats, first plotting frequencies in bar charts and then plotting proportions in pie charts.  He preferred the pie chart construct.

Nyt_sontag

Most here would share Chris' reaction: "Oh my.  What people do with Excel."

The biggest problem with these pie charts is the unreasonable baseline.  This is one of those polls that allow respondents to pick any number of factors and clearly, the pie chart creator used the 1,151 responses as the baseline, as opposed to the 910 people who voted.  Consider these two statements:

  • 52% of respondents who decided "on before off" listed "sun shadow" as a decision factor
  • 30% of the decision factors submitted by respondents who decided "on before off" were "sun shadow"

It is tough to figure out what the second statement means.  It is as if a respondent who selects more than one factor gets more than one vote in the final tally.  To put it differently, the 30% is meaningless unless one also knows how many decision factors were selected by each respondent, on average and in distribution.  The 52% is independent of such consideration.

Combining the data given in the bar charts and pie charts, one discovers that 469 out of 910 respondents could not decide which photo was taken before the other; besides, these respondents on average expressed 0.9 opinions on the decision factors whereas the respondents who made a decision expressed 1.6 opinions.
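The difference between the two baselines is easy to see with a toy example.  The ballots below are invented, not Morris's poll data, and the two function names are my own.

```python
# Hypothetical ballots: each set holds the factors one respondent ticked.
ballots = [
    {"sun shadow", "cannon balls"},
    {"sun shadow"},
    {"cannon balls", "rocks"},
    {"cannon balls"},
]


def share_of_respondents(ballots, factor):
    """Fraction of people who ticked the factor (baseline: respondents)."""
    return sum(factor in ballot for ballot in ballots) / len(ballots)


def share_of_responses(ballots, factor):
    """Fraction of all ticks that went to the factor (baseline: responses)."""
    total_ticks = sum(len(ballot) for ballot in ballots)
    return sum(factor in ballot for ballot in ballots) / total_ticks


# Same people, same opinion on sun shadow, but everyone ticks one more
# unrelated factor.
chattier = [ballot | {"other factor"} for ballot in ballots]

for label, data in [("original", ballots), ("chattier", chattier)]:
    print(
        label,
        f"respondents: {share_of_respondents(data, 'sun shadow'):.0%}",
        f"responses: {share_of_responses(data, 'sun shadow'):.0%}",
    )
```

In this sketch the respondent-based share stays put when people tick more boxes, while the response-based share shrinks, which is why the second statement cannot be read without knowing how many factors each respondent selected.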


A simple illustration of the key decision variables by type of respondent is shown below.

Redo_sontag_2

From this chart, one sees that the number and position of the cannon balls were crucial to at least 50% of those who came to a conclusion.  Sun shadow was much more important to those who decided "on before off" while those who decided "off before on" noticed character artistic, shelling and rocks.  Most other factors did not differentiate the three groups.

Source: "Not Your Mum's Apple Pie Chart", Errol Morris, Dec 18, 2007.


 

Dec 16, 2007

Hits and misses

In this NYT article, we are told that "the most likely result when a policeman discharges a gun is that he or she will miss the target completely."  That's a shocker for those of us conditioned by Hollywood movies to think anyone who picks up a gun for the first time hits the villain right on the temple.  The following graphic attempts to tell the story.

Nyt_bullets

The one hit here is how the distances are visually presented.  The elliptical lines remind us of the neglected variable of direction; they also mean the scale is correct only along one direction.

The dot matrix construct highlights the absolute numbers of shots, hits and misses but barely addresses the key issue of hit rates (accuracy).

Nyt_bullets3

Specifically, this data set was presumably collected to explore the relationship between hit rates and distances from the target.  The use of different widths clouds our judgement of proportions.  To wit, it is not obvious that the 10-wide block and the 40-wide block shown at left depict roughly equal hit rates (23%, 29%).

Redo_bullets

The junkart version adopts a different approach.  This is the Lorenz curve, often used to show income inequality (see also here and here).  Here, the shots were ordered from closest to farthest from the target, then summed up by distance segment.  For example, shots from 0 to 6 feet accounted for 60% of all shots but 72% of all hits.

If distance does not affect hit rates, we'd expect 60% of all shots to result in 60% of all hits.  This data point would show up on the 45-degree diagonal on the chart, labelled "totally unpredictable".  Any data appearing above the diagonal indicates that closer shots are more accurate, accounting for more than their fair share of hits.
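Here is a rough sketch of how the points on such a curve can be computed.  The counts are made up; I chose them only so that the first band reproduces the 60% of shots and 72% of hits quoted above, and the other bands are not taken from the Firearms Discharge Report.

```python
# Hypothetical (distance band, shots, hits), ordered nearest to farthest.
bands = [
    ("0-6 ft", 60, 18),
    ("7-20 ft", 30, 6),
    ("21+ ft", 10, 1),
]


def lorenz_points(bands):
    """Cumulative (share of shots, share of hits), starting at (0, 0)."""
    total_shots = sum(shots for _, shots, _ in bands)
    total_hits = sum(hits for _, _, hits in bands)
    points = [(0.0, 0.0)]
    cum_shots = cum_hits = 0
    for _, shots, hits in bands:
        cum_shots += shots
        cum_hits += hits
        points.append((cum_shots / total_shots, cum_hits / total_hits))
    return points


for share_of_shots, share_of_hits in lorenz_points(bands):
    # A point on the diagonal would have equal shares; the gap measures
    # how much more than their "fair share" of hits the closer shots take.
    gap = share_of_hits - share_of_shots
    print(f"shots {share_of_shots:.0%}  hits {share_of_hits:.0%}  gap {gap:+.0%}")
```

If distance had no effect, every gap would be zero and the points would sit on the diagonal; the further the curve bows above it, the more distance matters.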

Comparing the fitted blue line and the diagonal, one sees that distance is a weak predictor of hit rate.  The police commissioner explains this in the article; many other variables also affect accuracy, including "the adrenaline flow, the movement of the target, the movement of the shooter, the officer, the lighting conditions, the weather..."

Note that the shots with "unknown" distances were removed from the analysis.  Also, the categories of 21-45 and 45-above were combined: the rates were similar and with only three hits, it does not make sense to treat these as separate categories.

Of course, this version would not work well in the mass media.  For that, one can just plot hit rates against the distance categories.

Source: "A Hail of Bullets, a Heap of Uncertainty", New York Times, Dec 9 2007; New York Firearms Discharge Report 2006.

Nov 12, 2007

New York Times: a tribute

As many of you realize, this blog owes a lot to the New York Times.  The Times is unique in its willingness to print interesting, sophisticated graphics.  Via the Social Science Statistics blog, I found out that Matthew Ericson is a deputy graphics editor, and he recently gave a gigantic presentation at the IEEE InfoVis conference. (You can download the entire document from his website.)

Nyt_houseshift

As the SSS blog pointed out, the section on how they decided to visualize the shift in party margins by House district, and specifically their decision to reject scatter plots as too "difficult for the masses", is fascinating.  It illustrates the idea of sketching that I have advocated here in the past. (The PDF of the complete graphic can be downloaded from here.)

From my point of view, the issue is less the type of chart than the level of aggregation.  The chart has a very appealing data-to-ink ratio (a la Tufte) but could less be more?  One of the secrets of making a good chart, and any data analysis for that matter, is to reduce complexity.  For example, is it crucial for every single district to receive equal treatment?  (Similarly, if scatter plots were chosen, is it crucial to include every district?)

*********************************************************

Nyt_bondsetal_2

Several examples of great charts can be found in Matt's presentation.  On slide 83, I admire the Bonds/Aaron/Ruth chart.  The inset showing the acceleration of Bonds from age 35 to 39, as compared to the decline of Aaron and Ruth during the same age span, is powerful.  Similarly, the effective use of foreground (blue) and background (gray) in comparing ARod, Pujols and Griffey against the big 3 is masterly (see right).

There is also a sequence on mapping the San Diego wildfires (slides 2-10), showing how they gathered population data to complement fire data, thus adding context to the threat to highly populated regions.

******************************************

In a different vein, the SSS blog, written by the people at Harvard's Institute for Quantitative Social Science, has published a number of engaging posts on data graphics recently.  Take a look at Visualizing Electoral Data, which coincidentally addresses an issue similar to the NYT party vote share graphic discussed above.

Sss_partisanswing

This graphic plots the degree of party swings by UK parliamentary constituency.  The darker the color, the tighter the stranglehold by one party.  Going from top to bottom, the authors show party swings over successive elections.  The swing constituencies are therefore near the middle of the chart.

 
