Jan 04, 2008

Maps and dots

Happy New Year

The cosmos of university ranking got more interesting recently with the advent of the "brain map" by Wired magazine.  This new league table counts the total number of winners of five prestigious international prizes (Nobel, Fields, Lasker, Turing, Gairdner) in the past 20 years (up to 2007); and the researcher found that almost all winners were affiliated with American institutions.
Wired_brainmap
As discussed before, the map is a difficult graphical object; it acts like a controlling boss.  In this brain map, the concentration of institutions in the North American land mass causes over-crowding, forcing the designer to insert guiding lines drawing our attention in myriad directions.  These lines scatter the data asunder, interfering with the primary activity of comparing universities.

Wired_dots The chain of dots object cannot stand by itself without an implicit structure (e.g. rows of 10).  This limitation was apparent in the hits and misses chart as well.  Sticking fat fingers on paper to count dots is frustrating.  Simple bars allow readers to compare relative strength with less effort.

Redo_brainmap_2

In the junkart version, we ditched the map construct completely,  retaining only the east-west axis.  [For lack of space (and time), I omitted the US East Coast and Washington-St. Louis.]  With this small multiples presentation, one can better contrast institutions.

To help comprehend the row structure, I inserted thin strikes to indicate zero awards.  A limitation of the ranking method is also exposed: UC-SF has a strong medical school and, not surprisingly, has received a fair share of Nobel (medicine), Lasker and Gairdner prizes; but at other institutions, zero Lasker and Gairdner prizes could be due to a less competitive medical school or to having none at all!


Reference: "Mapping Who's Winning the Most Prestigious Prizes in Science and Technology", Wired magazine, Nov 2007.

Sep 23, 2007

Buffer time

As this report from the Department of Transportation makes clear, congestion on our roadways causes travellers to add "buffer time" to their planned journeys.  So, for instance, one may have to allocate 32 minutes for a trip that would have taken 20 minutes in uncongested traffic if one would like to guarantee on-time arrival.  The 12 minutes would either become time spent sitting on the road or wasted time due to arriving too early.
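The arithmetic in the example above can be written out as a small sketch.  The function names are my own, not the Department of Transportation's, and the only figures used are the 32 and 20 minutes quoted above.

```python
def buffer_minutes(planned_minutes: float, uncongested_minutes: float) -> float:
    """Extra time a traveller adds on top of the uncongested travel time."""
    return planned_minutes - uncongested_minutes


def buffer_share(planned_minutes: float, uncongested_minutes: float) -> float:
    """Buffer expressed as a fraction of the uncongested travel time."""
    return buffer_minutes(planned_minutes, uncongested_minutes) / uncongested_minutes


# Figures from the example above: allow 32 minutes for a 20-minute trip.
extra = buffer_minutes(32, 20)
share = buffer_share(32, 20)
print(f"buffer: {extra} minutes ({share:.0%} of the uncongested time)")
```

Expressing the buffer as a share of the uncongested time makes trips of different lengths comparable.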

Buffer time can be applied to graphs too.  Some graphs require readers to spend time fishing out the information.  The chart used to illustrate travel time belongs to this category. 
Dottraveltime_2

The clock analogy fails; in fact, it confuses matters as the hour hand just sits there serving no purpose.  The buffer time between staring and comprehending is too much!

Only four numbers underlie this chart: travel time when uncongested and buffer time to guarantee on-time arrival, for 1982 and 2001.  The following version gets to the point without fuss.

Redotraveltime

It shows that the travel time increased significantly even under uncongested traffic; worse, the buffer time multiplied.

Reducing buffer time is always good but some buffer time may be inevitable.  In the traffic analogy, to eliminate all buffer time would mean lots of unused capacity.  In the context of graphs, more complicated charts would require more time; the key is whether the reader is rewarded for the time spent figuring out the chart.



Source: "Traffic Congestion and Reliability", Department of Transportation.

Aug 17, 2007

As light as Friday

Guardian_penguin


Our readers are on a roll; here is another great submission.  Margaret can't say it any better:

Thought you might enjoy this infographic from The Guardian newspaper, which appears to describe how humans evolved from penguins.

Source: "Prehistoric Penguin", Guardian (UK)

Jun 17, 2007

Foreground, background

Derek C. points us to this effort by a science journalist to use graphs to help "clarify the concept of climate change".  The graph on the left shows that actual greenhouse gas emissions have exceeded the level predicted by the most pessimistic climate models.  The 3D bar chart on the right examines which countries have increased emissions the most since 1990.

Warming

While the bar chart contains many of Tufte's "ducks" (not sorted by percent change, 3D, color, gridlines, sufficiency, etc.), it's the left chart that can be made more powerful.

Redo_warming2

The casual observer does not need to know which model led to which trajectory of predictions; the graph is vastly simplified, and the message much clearer in the junkart version.  (I only included the CDIAC data because I didn't locate the EIA numbers.)

The general point here is recognizing what is foreground, and what is background.  Aside from gridlines, data labels, axis labels and so on, some of the data usually constitute background material, often, as in this case, used to establish comparability.

One message I got out of this chart is that these climate models have done a good job!  (Now, I have no idea if part of the curve included the training period.  It is curious that the predictions were very narrowly contained in the early 1990s.)

Source: The Island of Doubt Blog, June 6, 2007.

Mar 01, 2007

Information gain and loss

The previous two posts indicated that CNN, TWC and Intellicast had the best on-line weather forecasting accuracy by looking at the median and mean error in predicting daily low and high temperatures over 41 days.  Is it possible to differentiate between those three?

For that, we need more data, so I switched from summary statistics back to the raw data.  In this new chart, the day-by-day errors were plotted.  The gridlines mark errors within 5 degrees, which is an arbitrary guideline for acceptable versus unacceptable.  The three scatters looked remarkably similar, although CNN appeared to hit the bull's eye (the middle square) with less bias (errors more evenly distributed) but not much better accuracy overall (similar number of unacceptable errors).

Redoonlineweather3
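For readers who want to try the "middle square" idea on their own numbers, here is a minimal sketch.  The error pairs are made up for illustration (they are not Brandon's data), and treating exactly 5 degrees as acceptable is my assumption.

```python
# Hypothetical (low error, high error) pairs in degrees, one pair per day.
daily_errors = [(-1, 0), (2, -3), (-6, 1), (0, 0), (4, 7)]
TOLERANCE = 5  # the arbitrary "acceptable" threshold, assumed inclusive


def in_middle_square(low_error, high_error, tolerance=TOLERANCE):
    """True when both errors are within +/- tolerance degrees."""
    return abs(low_error) <= tolerance and abs(high_error) <= tolerance


# Accuracy: how many days land inside the middle square.
acceptable = sum(in_middle_square(low, high) for low, high in daily_errors)

# Bias: the average signed error, which is zero when misses are balanced.
mean_low_bias = sum(low for low, _ in daily_errors) / len(daily_errors)
mean_high_bias = sum(high for _, high in daily_errors) / len(daily_errors)

print(f"{acceptable} of {len(daily_errors)} days inside the middle square")
print(f"bias on lows: {mean_low_bias:+.1f}, bias on highs: {mean_high_bias:+.1f}")
```

The count speaks to accuracy while the signed averages speak to bias, which is why a forecaster can look better on one and no better on the other.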

Feb 27, 2007

Mean and median

In the comments of the last post on on-line weather forecasts, Hadley raised the evergreen statistical question of mean vs median.  In this context, median error is unaffected by particular days in which the forecaster makes extreme errors while mean error takes into account the magnitude of every forecasting error in the sample.

Which one to use depends on the situation.  Brandon, who did the original analysis, was motivated by planning a trip to an unfamiliar location.  In this case, he might be better served by lower mean error, which would imply few extremely bad forecasts.

On the other hand, if I am interested in my local weather, then I'd likely be less concerned about a few extremely bad forecasts, and more concerned that the forecast is on the money on most days.  Then perhaps the median error would come into play.
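A quick sketch of the difference, using Python's standard `statistics` module.  The error values are invented for illustration and are not taken from Brandon's data.

```python
from statistics import mean, median

# Hypothetical absolute forecast errors in degrees (not Brandon's data).
typical_days = [1, 0, 2, 1, 1, 0]
with_one_bad_day = typical_days + [15]  # add a single extreme miss

for label, errors in [
    ("typical days", typical_days),
    ("plus one bad day", with_one_bad_day),
]:
    print(f"{label}: mean = {mean(errors):.2f}, median = {median(errors)}")
```

The single bad day more than triples the mean but leaves the median where it was, which is the trade-off described above.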

Redoonlineweather2 It turns out it doesn't much matter for our weather forecast data.  In this new chart, I superimposed the mean error data (in black).  The scatter of points was exactly as it was for median error (in red).  (MSN had a particularly bad forecast for a low temperature one day, which pulled its location to the left.)

This shows further that the difference between CNN, Intellicast and The Weather Channel is negligible.

Feb 25, 2007

Going out on a limb

Earlier in the month, Prof. Gelman linked to Brandon's fascinating analysis of on-line weather forecasting accuracy.  I have done some additional analysis of the data and the result can be visualized as follows.

Redoonlineweather


I'll concentrate my comments on three observations:

  • CNN was the clear winner in forecasting accuracy during this period based on two criteria: its median error in forecasting daily lows, and its median error in forecasting daily highs.  Moreover, both median errors were zero, which gives us confidence about its accuracy.  The Weather Channel (TWC) and Intellicast (INT) were not far behind.
  • The ability to forecast highs was better across the board than that of forecasting lows (except BBC).  I am not sure why this should be so.
  • Overall, our weather forecasters were much too risk-averse.  Notice that the errors were heavily biased toward the lower left quadrant.  A negative error on low temperatures means predicted low is higher than actual low; a negative error on high temperatures means predicted high is lower than actual high.  Taking these together, we observe that the range of actual temperatures has generally been larger than the range of predicted temperatures!  No one was willing to go out on a limb, so to speak, to forecast extremes.

Actually, I believe this inability or unwillingness to forecast extreme values is endemic to all forecasting methodologies.

Before closing, I mention that the graph was based on a subset of Brandon's data.  I only considered same-day forecasts, did not consider Unisys (because they didn't provide forecasts for lows), and also noted that there might be bias since there were breaks in the time series.  Also, I retained the sign information and didn't take absolute values as Brandon did.

Feb 22, 2007

Bubbles of death 2

Here is an alternative way to present the death risk data.  It's a variation of Tukey's stem-and-leaf plot.  Instead of presenting the exact odds, I believe it is sufficient to generalize the data by grouping them into categories.  Not much is to be gained by knowing that the odds of dying from fire and smoke are 1 in 1113 as opposed to the odds being in the range 1 in 1000 to 1 in 10,000 and comparable to those of drowning, motorcycle accident, etc.

Redooddsdying
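The grouping can be sketched in a few lines.  This is one possible way to do it, binning by order of magnitude; only the fire and smoke figure comes from the post, and the other two entries are placeholders with made-up values.

```python
from collections import defaultdict

# "1 in N" odds keyed by cause of death.
odds = {
    "fire and smoke": 1113,         # figure quoted in the post
    "hypothetical cause A": 85,     # made-up value for illustration
    "hypothetical cause B": 56000,  # made-up value for illustration
}


def magnitude_bin(one_in: int) -> str:
    """Label the power-of-ten range that contains a '1 in N' figure."""
    power = len(str(one_in)) - 1  # digit count avoids floating-point logs
    return f"1 in {10 ** power:,} to 1 in {10 ** (power + 1):,}"


# Collect the causes that share each range, like leaves on a stem.
bins = defaultdict(list)
for cause, one_in in odds.items():
    bins[magnitude_bin(one_in)].append(cause)

for label, causes in bins.items():
    print(f"{label}: {', '.join(causes)}")
```

Each range plays the role of a stem and the causes listed against it are the leaves.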


PS. Be sure to look at Derek's chart in the comments.

Feb 21, 2007

Bubbles of death

Thanks to Dustin J for bringing this stupendous chart to our attention.  I have to admit I have trouble understanding it.  The red curve appears to be part of a gigantic circle confirming that all life does end on this earth.  How it is connected to the rest of the chart I am unable to discern.  In addition, the trajectory of the bubbles, the overlaps between bubbles, the separation between bubbles all may or may not carry meaning.

Odds_dying_1

Reference: "What are the odds of dying?", National Safety Council.

Feb 13, 2007

Horrid stuff 2

Jp_horridstuff Jon P took my comment on negative correlation and explored it further.  Given the large ranges of values cited in the original Economist chart, Jon concluded that there wasn't enough evidence to make a judgement.

I agree to a large extent.  Apart from the high variability of individual measurements, we also face the tiny sample of 5 cities.  By plotting the rectangles in his chart, he implicitly assumed that the correlation of two factors is related to the product of the ranges (variability) of each factor.

A different way of looking at it is to plot only the mid-range values (i.e. ignoring the within-city variability).  The graph on the left hand side shows very little pattern.

Resorting to the formula, I found that the correlation = -0.03.  So, a barely detectable negative correlation.  Let's visualize this.

Redo_pollutant2 On the right graph, I added the mean lines for both variables.  This divides the graph into four quadrants; dots that fall into the lower right and upper left quadrants make the correlation value negative.  There were three of those versus two in the positive quadrants; hence, the tiny negative correlation. 
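The quadrant reading can be checked against the formula with a short sketch.  The five points below are invented for illustration (they are not the Economist's city data), and the function names are my own.  Strictly, the sign of the correlation follows the sum of the products of deviations, so the quadrant count is a quick visual guide rather than a guarantee.

```python
from math import sqrt

# Hypothetical mid-range values for five cities (not the Economist data).
xs = [1, 2, 3, 5, 9]
ys = [5, 2, 7, 5, 1]


def deviations(values):
    """Distance of each value from the mean line."""
    centre = sum(values) / len(values)
    return [v - centre for v in values]


def pearson_r(xs, ys):
    """Pearson correlation computed from products of deviations."""
    dxs, dys = deviations(xs), deviations(ys)
    cross = sum(dx * dy for dx, dy in zip(dxs, dys))
    return cross / sqrt(sum(dx * dx for dx in dxs) * sum(dy * dy for dy in dys))


def quadrant_counts(xs, ys):
    """Count dots in the positive (upper right, lower left) and
    negative (upper left, lower right) quadrants; dots on a mean
    line are skipped."""
    positive = negative = 0
    for dx, dy in zip(deviations(xs), deviations(ys)):
        if dx * dy > 0:
            positive += 1
        elif dx * dy < 0:
            negative += 1
    return positive, negative


positive, negative = quadrant_counts(xs, ys)
print(f"positive quadrants: {positive}, negative quadrants: {negative}")
print(f"correlation: {pearson_r(xs, ys):.2f}")
```

With these made-up points, three dots sit in the negative quadrants and two in the positive ones, and the correlation comes out negative.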


