Jan 10, 2008

Football rankings 1

The Times' sports pages made wise use of graphics in a series of NFL articles recently.  Here is a rank plot (below left) comparing Jaguars quarterback David Garrard to seven other quarterbacks who started the weekend of January 5.

Nyt_garrard

Simple and effective, this chart does not fuss around in showing us where Garrard ranks relative to the others. 

Redo_garrard

The junkart revision (below right) plays with a different scale: the spacing between the tick marks represents proportional differences in the underlying metric.  This gives us a little more: for example, Garrard's second rank in completion percentage is less remarkable than first thought, as he essentially tied with the 3rd and 4th best while the top six were bunched between 60 and 65 percent.

But Garrard's touchdown to interception ratio stands out, as the next best quarterback attained only about half his ratio.  (Todd Collins, who had not thrown an interception until that time, was omitted; he also had started only four games.)
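The difference between a rank scale and a proportional scale can be sketched in a few lines.  This is only an illustration: the quarterback labels and completion percentages below are made up, not the ESPN figures behind the chart, and the helper names are hypothetical.

```python
def rank_positions(values):
    """Evenly spaced positions: the best value gets 1, the next gets 2, and so on."""
    order = sorted(values, key=values.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(order)}


def proportional_positions(values):
    """Positions on a 0-1 scale, spaced in proportion to the metric itself."""
    lo, hi = min(values.values()), max(values.values())
    return {name: (v - lo) / (hi - lo) for name, v in values.items()}


# Hypothetical completion percentages (assumed for illustration only).
completion_pct = {"QB A": 65.0, "QB B": 64.0, "QB C": 63.9, "QB D": 63.8, "QB E": 57.0}

ranks = rank_positions(completion_pct)
scaled = proportional_positions(completion_pct)
for name in completion_pct:
    print(f"{name}: rank {ranks[name]}, scaled position {scaled[name]:.2f}")
```

On the rank scale, QB B and QB D sit two full steps apart; on the proportional scale they nearly coincide, which is the extra information the revised chart conveys.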


References: "Two Dreams (One Big, One Tiny) Come True", New York Times, Jan 4 2008; ESPN statistics.

Jul 29, 2007

Transgender trends

One of the many gratifications of blogging is to connect with others who have similar interests; so it has been fantastic to receive user submissions (though admittedly I don't check my inbox frequently enough).  The thoughtfulness of these nominations continues to impress me.

Jordanv31970200528yrs_2

Evan sent in 254 charts he created after looking at the post on baby names.  An example is shown on the right.

He is particularly interested in the question of names that are given to both males and females. 

For example, the bottom chart shows that Jordan is primarily a male name, and saw a period of growth followed by decline, although the decline has been more severe on the male side than the female side. 

It's a nice touch to label the most recent year.  I'd also label the values for the most recent year on the axes.

Evan also offers the following solution to the scaling problem we identified in the original WSJ chart:

My solution was just to put two charts on each chart. One at a fixed scale for every chart to give a sense of size and one at a variable scale to better show the shape of the plot.

In other words, for less popular names, the top chart would look much more compressed.

There are many more charts to sift through on his site.  Evan welcomes suggestions.

Jul 26, 2007

Noisy subways

This NYC subway report is impossible to read.
Nyt_subwayreport

However, it is very difficult to find a good way to show the information.  In fact, the data contain very little of it.  Curiously, the ratings are very dispersed, so that each line is graded high on some categories and low on others.  Here's one view of it:

Redo_subwayreport

I have grouped the subway lines together (A/C/E, 4/5/6, etc.).  The metrics are plotted left to right in the same order as in the original.  Is it all noise and no signal?

(I just realized the vertical axis is reversed: best ratings are at the bottom, worst ratings at the top.  Doesn't matter anyway since I can't see any patterns.)

Source: "No. 1 Train is Rated Highest by Commuter Advocates", New York Times, July 24 2007.

PS. Two contributions from readers.  Still looking for insight from this data...

Trains789fg5_2 Trainspotmatrix_2


Jul 12, 2007

More prevalent versus more likely

Aleks pointed to an interesting Business Week chart used to explain what people in different age groups are doing on-line.  This is a pretty chart that does an admirable job with a difficult data set.

Bw_onlinedata

The key to this chart, unfortunately missing, is that the percentages must be read as vertical columns to make sense.  So the top left square says 34% of "Young Teens" who answered the survey said they create web pages on-line.  In addition, the total of each column can be much more than 100% because multiple responses were allowed.

Realizing the above, we should interpret the bottom (grey) row as saying: "Older boomers" and "seniors" are more likely to be "Inactives" than younger people.  A tempting interpretation is: "Inactives" are more likely to be "seniors" and "older boomers".  But this is wrong because the chart hides the age distribution.  While 70% of "Seniors" are inactive, "Seniors" may represent a small proportion of the population, and thus they may not account for a large proportion of "Inactives".  This is the difference between prevalence and incidence rate.  (Another way to grasp this is to add the percentages across a row and try and fail to understand what the row sum could mean.)
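A small calculation shows why the two readings differ.  Only the 70% inactive rate for "Seniors" comes from the chart; the other rates, the lumping of everyone else into one "younger" group, and above all the population shares are assumptions invented for this sketch, since the chart does not report the age distribution.

```python
# Share of each age group that is inactive (read down a column of the chart).
# Only the Seniors figure is from the chart; the rest are assumed.
inactive_rate = {"All younger groups": 0.35, "Older boomers": 0.60, "Seniors": 0.70}

# Assumed share of each age group among all respondents (hidden by the chart).
population_share = {"All younger groups": 0.70, "Older boomers": 0.20, "Seniors": 0.10}

# Fraction of all respondents who are both in the group and inactive.
joint = {g: inactive_rate[g] * population_share[g] for g in inactive_rate}
total_inactive = sum(joint.values())

# Of all inactives, what fraction belongs to each age group?
share_of_inactives = {g: joint[g] / total_inactive for g in joint}

for g in inactive_rate:
    print(f"{g}: {inactive_rate[g]:.0%} are inactive, "
          f"but they make up {share_of_inactives[g]:.0%} of all inactives")
```

Under these assumed shares, seniors have the highest inactive rate yet account for the smallest slice of the inactives, so the tempting interpretation fails.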

The construct of the square grids is less damaging than it seems.  In effect, the data has been rescaled by dividing by 10.  The reader is then forced to apply "rounding".  If you are someone who sees $19.95 as $19, then you'd round down the partial rows.  If you see $19.95 as $20, you'd round up the partial rows.  So the designer has pushed you to think in terms of whole numbers between 0 and 10, in other words, in units of 10%, rather than units of 1% or, horror of horrors, 0.1% or at some other unrealistic precision.

Here's another example where the profile chart shines.  Because the percentages don't sum to 100%, alternatives like stacked bar charts and "Marimekko"/mosaic charts don't work.  (Prior discussion of this issue here.)

Redo_onlinedata

This version gives a column view of the data, the lines linking percentages of each age group performing on-line activities.  The profiles nicely cluster into three groups: the younger people are more likely to say they are "joiners", "spectators" or "creators" but less likely to be "inactives".  We also see that the likelihood of being "Collectors" has little to do with age.

Source: "Inside Innovation -- In Data", Business Week, June 11 2007.


May 17, 2007

People picture

Ind_cancersurvival This graphic appeared on the front page of the British paper, the Independent.  I find it to be effective, although definitely not efficient a la Tufte: the data-ink ratio is abysmal.  Two data points on the entire page, with both data labels drawn in extra large font!

It can be improved if the 24 guys are given a different color so we can see the amount of improvement between 1971 and "NOW".

Some may complain that the use of percentages obscured population growth during this period.  Perhaps there should be fewer men on the left than on the right.  Unfortunately, that would in turn obscure the comparison of percentages.

A bit of research into the data (at Cancer Research UK) reveals that the average survival rate hides a very wide range of rates (by type of cancer, by gender, by gender and type, etc.).  One might argue that the average is quite meaningless for most users.

An alternative construct is a time series chart showing the increase in survival rate over time.  It would plot more data and depict a trend (or lack thereof).  I'd have to agree with the editor that such a chart would look unattractive on the newsstand.

Source: "Cancer: the good news", The Independent, May 16, 2007


May 06, 2007

Visualizing sensitivity

A reader wrote:

I'm a loyal reader who hopes you'll indulge him in just one or two questions.

In finance (valuation, specifically), we often create two-way sensitivity tables. Unfortunately, a three-way sensitivity table is what's most often called for. Of course, we work around this by producing multiple two-way tables.

Now, obviously, it's pretty hard to build a three-way table or chart in two dimensions, and the use-bigger-bubbles method doesn't really make sense in this kind of application-- but can you conceive of a good way to present the data in any other form?

3waydata_2 As he indicated, we typically see multiple two-way data tables for such data.  The virtue of this approach is that the data is exceptionally well-organized; it's great for looking up the outcome given the three dimensions.  (I called them Red, Green and Blue to protect the innocent.)

Further, starting from a baseline i.e. a particular cell in the table, it's easy to move our eyes up, down or jump tables to observe the impact of changing dimensions (so-called sensitivity analysis).

These data tables facilitate "local" sensitivity analysis but obscure "global" sensitivity: staring at those numbers, we feel lost in the trees and can't see the forest.  What's the effect of increasing Green on average?  What's the effect of increasing Green while decreasing Blue? etc. etc.

3waygraph The junkart construct (right) is made to address these questions.  The black stripes establish the baseline, the overall range of values.  Then, if interested in the effect of Red = 0.11, we can compare those red stripes with the black.  Since the spread is wide, we note that Red = 0.11 is not a strong indicator of value, and to the extent it is, it points to lesser values.

What about Red = 0.11 and Green = 2?  Now, we focus on the first red stripes and the first green stripes.  We note that the overlapping region (which is where both conditions apply) is highly concentrated to the low end of value range.  Thus, we conclude that under those conditions, value is low (below 10,000) and further, that it is low primarily because Green = 2.

On and on for any one-way, two-way or three-way effects.
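For readers who want numbers to go with the picture, a "global" question such as the average effect of Green can also be answered by averaging over the other two dimensions.  The sketch below is a numerical companion to the chart, not how the chart was built; the levels and the valuation formula are invented stand-ins, since the reader's actual table is not reproduced here.

```python
from itertools import product
from statistics import mean

# Hypothetical levels for the three dimensions (assumed, not the reader's data).
RED = [0.09, 0.10, 0.11]
GREEN = [2, 4, 8, 16]
BLUE = [0.26, 0.28, 0.30]


def value(red, green, blue):
    """Made-up stand-in for a valuation model."""
    return 1000 * green * blue / red


# The full three-way table: one value per (red, green, blue) scenario.
table = {(r, g, b): value(r, g, b) for r, g, b in product(RED, GREEN, BLUE)}


def average_by(dim_index, table):
    """Average value at each level of one dimension, over all levels of the other two."""
    levels = sorted({key[dim_index] for key in table})
    return {lvl: mean(v for k, v in table.items() if k[dim_index] == lvl)
            for lvl in levels}


green_effect = average_by(1, table)
for level, avg in green_effect.items():
    print(f"Green = {level}: average value {avg:,.0f}")
```

The same helper with `dim_index` 0 or 2 gives the one-way effects of Red and Blue; two-way effects would need averaging over a single remaining dimension.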

Although it's not the purpose of the chart, local sensitivity can also be observed.  For example, the highest value comes from Red = 0.09, Green = 16 and Blue = 0.30.  What if Blue decreases to 0.28?  We start on the Blue = 0.28 layer; going from right to left, as we see a blue stripe, we scan vertically to find the corresponding red and green stripes; the 3rd stripe from the right, we find the scenario of interest.  Such analysis would benefit from adding an interactive vertical guiding line.

Do you prefer 3-D plots?  Contour plots? Feel free to share your ideas!

Apr 08, 2007

Peripherals 1

Like any technology, charts also come with peripherals: I'm talking about legends, data labels, grid-lines and so on.  These things typically give us the most trouble, especially with complex data sets.  The analogy is apt: one may feel inextricably knotted up like bunches of cords and wires.

Interactive graphics is a particularly elegant solution to this problem, and Google Finance has done a fantastic job leading the way.  One trick is to show the legend only when the user asks for it. 
Google_sectorsum_lg

Using bar charts (on the left), Google summarizes neatly the performance of stocks within each industry sector.  The bar chart gives a sense of the dispersion which adds to the average returns printed next to them.  For example, most sectors gained on average but then about 30% of the individual stocks in most sectors actually declined on that day.  So the fact that technology stocks gained 0.48% on average doesn't necessarily mean that the two tech stocks you own gained 0.48% or gained at all.

Typically, we would put a legend on the side or at the bottom of the chart, which, all told, is an ugly duckling next to a well-executed chart.  Here, the legend is hidden behind the "What's this?" link.  The side benefit is that the legend can be as verbose as needed since it doesn't interfere with the chart.

There are a few minor things to consider:

  • "What's this?" is not very informative: Why not call it a "legend" or "key"?
  • The graph designer seems to think that the most important information sought by readers was the extremes, i.e. the percentage of stocks that gained/lost more than 2%.  By darkening the sides of the bar, it draws attention away from the middle which is the boundary between the gainers and the losers.  I'd like to see that boundary delineated.
  • Similar to the above point, I'd sketch out a version which aligns the gainer/loser boundary to the middle so it's easy to see the balance between gainers and losers.  This version, however, would require more space.
  • I'd provide sorting by average return, and by percentage of gainers.

Mar 21, 2007

March mildness

The Times published this great graphic to show 2007 was an upset-starved year in the recent history of the NCAA Basketball tournament, which is on-going.

Nyt_mildness Each box contains the number of upsets in a given year of a given pairing, e.g. in 1998, there was one case of a 9-seed beating an 8-seed.  An upset is defined as a lower seed beating a higher seed although the editorial comment argued that 9 beating 8 is "rarely considered an upset".

The rightmost column (which sums across a row) tells us that the number of upsets fluctuates wildly between the years, ranging from 3 to 13.  (That's why people bet on NCAA pools.)

A couple of improvements will make this chart even more effective:

  • Include a row showing the average number of upsets for each pairing;
  • Include a column of zeroes for 16-1 pairings.

This second point cannot be emphasized enough.  The fact that no 1-seed has ever lost to a 16-seed should not be relegated to a footnote.  Think of it this way: if the results for 15-2 and 16-1 were reversed so that no 15-seed had ever beaten a 2-seed but one 1-seed had lost to a 16-seed, nobody would omit the 15-2 column!

In Visual Explanations, the follow-up to his seminal The Visual Display of Quantitative Information, Tufte discussed the Challenger disaster at considerable length.  A key lesson was that non-events (things not happening) contain important information, and should never be dropped from an analysis without unassailable logic.

The mildly improved chart would look like this.

Redo_mildness

What then to make of the comment that "9 beating 8 is rarely an upset"?  For one thing, 9-8 upsets happen about as frequently as 10-7 upsets, so if the comment refers to the surprise factor, then even 10-7 upsets should be excluded.

But the comment also underlines a deeper issue, which is hindsight.  Obviously, the seeding committee felt, and predicted, that the 8 seed would beat the 9 seed.  It was only after the fact that we found out 9 had beaten 8.  Instead of denying the 9-8 upset, would it make more sense to ask if there was a seeding error?

Reference: "March Mildness", New York Times, March 17, 2007, p.D2.

Mar 01, 2007

Information gain and loss

The previous two posts indicated that CNN, TWC and Intellicast had the best on-line weather forecasting accuracy by looking at the median and mean error in predicting daily low and high temperatures over 41 days.  Is it possible to differentiate between those three?

For that, we need more data, so I switched from summary statistics back to the raw data.  In this new chart, the day-by-day errors were plotted.  The gridlines mark errors within 5 degrees, which is an arbitrary threshold between acceptable and unacceptable.  The three scatters looked remarkably similar, although CNN appeared to hit the bull's eye (the middle square) with less bias (errors more evenly distributed) but not much better accuracy overall (similar number of unacceptable errors).
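The two notions used here, bias and accuracy, can be separated with a short calculation.  The sketch assumes each day contributes a pair of errors (low and high temperature, forecast minus actual) and treats a day as acceptable when both errors fall within 5 degrees; the error values are made up for two hypothetical forecasters and are not the data behind the chart.

```python
from statistics import mean


def summarize(errors, tolerance=5):
    """errors: list of (low_error, high_error) pairs in degrees, forecast minus actual."""
    flat = [e for pair in errors for e in pair]
    # Bias: the average signed error; near zero means errors are spread evenly
    # around the target rather than leaning warm or cold.
    bias = mean(flat)
    # Accuracy: share of days landing in the middle square of the chart.
    acceptable = sum(1 for lo, hi in errors
                     if abs(lo) <= tolerance and abs(hi) <= tolerance)
    return {"bias": bias, "acceptable_share": acceptable / len(errors)}


# Made-up daily errors for two hypothetical forecasters.
site_a = [(1, -2), (-3, 2), (6, 1), (-1, -1)]
site_b = [(2, 3), (4, 1), (7, 2), (3, 4)]

summary_a = summarize(site_a)
summary_b = summarize(site_b)
print("A:", summary_a)
print("B:", summary_b)
```

In this toy data both forecasters miss the middle square equally often, yet forecaster B's errors lean consistently to one side, which is the kind of difference the scatter plots hint at.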

Redoonlineweather3

Feb 13, 2007

Horrid stuff 2

Jp_horridstuff Jon P took my comment on negative correlation and explored it further.  Given the large ranges of values cited in the original Economist chart, Jon concluded that there wasn't enough evidence to make a judgement.

I agree to a large extent.  Apart from the high variability of individual measurements, we also face the tiny sample of 5 cities. 
In his chart, he made an implicit assumption that the correlation of two factors is related to the product of the ranges (variability) of each factor by plotting the rectangles.

A different way of looking at it is to plot only the mid-range values (i.e. ignoring the within-city variability).  The graph on the left hand side shows very little pattern.

Resorting to the formula, I found that the correlation is -0.03, a barely detectable negative correlation.  Let's visualize this.

Redo_pollutant2 On the right graph, I added the mean lines for both variables.  This divides the graph into four quadrants; dots that fall into the lower right and upper left quadrants make the correlation value negative.  There were three of those versus two in the positive quadrants; hence, the tiny negative correlation. 
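The quadrant argument follows from the correlation formula: each dot contributes the product of its deviations from the two means, and that product is negative exactly in the upper left and lower right quadrants.  The sketch below uses five made-up points, not the Economist pollutant data, chosen so that three dots land in the negative quadrants and two in the positive ones.

```python
from math import sqrt


def deviation_products(xs, ys):
    """Product of deviations from the two means, one per dot."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return [(x - mx) * (y - my) for x, y in zip(xs, ys)]


def pearson(xs, ys):
    """Pearson correlation: summed deviation products over the product of spreads."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return sum(deviation_products(xs, ys)) / (sx * sy)


def quadrant_counts(xs, ys):
    """Count dots that pull the correlation up (positive) or down (negative)."""
    products = deviation_products(xs, ys)
    return {"positive": sum(p > 0 for p in products),
            "negative": sum(p < 0 for p in products)}


# Hypothetical mid-range values for five cities (assumed for illustration).
xs = [1, 3, 4, 7, 10]
ys = [2, 6, 7, 8, 2]

r = pearson(xs, ys)
counts = quadrant_counts(xs, ys)
print(f"correlation = {r:.3f}, quadrants = {counts}")
```

Note that the counts alone do not settle the sign: what matters is the sum of the products, so a few dots far from the mean lines can outweigh several dots close to them.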


