Feb 27, 2007

Mean and median

In the comments of the last post on on-line weather forecasts, Hadley raised the evergreen statistical question of mean vs median.  In this context, median error is unaffected by particular days in which the forecaster makes extreme errors while mean error takes into account the magnitude of every forecasting error in the sample.
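To make the contrast concrete, here is a minimal sketch in Python. The error values are made up for illustration and are not Brandon's data; `summarize` is a hypothetical helper name.

```python
from statistics import mean, median

# Hypothetical signed forecast errors (degrees) for one forecaster.
# These numbers are invented for illustration only.
typical_days = [0, 1, -1, 0, 2, -2, 1, 0, -1]
with_bad_day = typical_days + [-15]  # add one badly missed forecast

def summarize(errors):
    """Return (mean error, median error) for a list of signed errors."""
    return mean(errors), median(errors)

print("typical days:", summarize(typical_days))
print("with one bad day:", summarize(with_bad_day))
```

With these invented numbers, the single bad day moves the mean error from 0 to -1.5 while the median error stays at 0.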

Which one to use depends on the situation.  Brandon, who did the original analysis, was motivated by planning a trip to an unfamiliar location.  In this case, he might be better served by lower mean error, which would imply few extremely bad forecasts.

On the other hand, if I am interested in my local weather, then I'd likely be less concerned about a few extremely bad forecasts, and more concerned that the forecast is on the money on most days.  Then perhaps the median error would come into play.

Redoonlineweather2

It turns out it doesn't much matter for our weather forecast data.  In this new chart, I superimposed the mean error data (in black).  The scatter of points was almost exactly as it was for median error (in red).  (MSN had a particularly bad forecast for a low temperature one day, which pulled its location to the left.)

This shows further that the difference between CNN, Intellicast and The Weather Channel is negligible.

Feb 25, 2007

Going out on a limb

Earlier in the month, Prof. Gelman linked to Brandon's fascinating analysis of on-line weather forecasting accuracy.  I have done some additional analysis of the data and the result can be visualized as follows.

Redoonlineweather


I'll concentrate my comments on three observations:

  • CNN was the clear winner in forecasting accuracy during this period based on two criteria: its median error in forecasting daily lows, and its median error in forecasting daily highs.  Moreover, both median errors were zero, which gives us confidence in its accuracy.  The Weather Channel (TWC) and Intellicast (INT) were not far behind.
  • The ability to forecast highs was better across the board than that of forecasting lows (except BBC).  I am not sure why this should be so.
  • Overall, our weather forecasters were much too risk-averse.  Notice that the errors were heavily biased toward the lower left quadrant.  A negative error on low temperatures means predicted low is higher than actual low; a negative error on high temperatures means predicted high is lower than actual high.  Taking these together, we observe that the range of actual temperatures has generally been larger than the range of predicted temperatures!  No one was willing to go out on a limb, so to speak, to forecast extremes.
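The range comparison in the last observation can be checked directly, without relying on any sign convention for the errors. The sketch below uses invented temperatures, not Brandon's data, and `range_compression` is a hypothetical helper name.

```python
# Hypothetical same-day forecasts and outcomes, in degrees:
# (predicted_low, predicted_high, actual_low, actual_high)
# These values are invented for illustration only.
days = [
    (30, 45, 28, 47),
    (25, 40, 22, 41),
    (33, 50, 33, 52),
    (20, 35, 21, 36),
]

def range_compression(days):
    """Count the days on which the predicted range was narrower than the actual range."""
    narrower = 0
    for pred_low, pred_high, act_low, act_high in days:
        if (pred_high - pred_low) < (act_high - act_low):
            narrower += 1
    return narrower, len(days)

narrower, total = range_compression(days)
print(f"predicted range narrower than actual on {narrower} of {total} days")
```

If forecasters shy away from extremes, the first number should be close to the second.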

Actually, I believe this inability or unwillingness to forecast extreme values is endemic to all forecasting methodologies.

Before closing, I mention that the graph was based on a subset of Brandon's data.  I only considered same-day forecasts, did not consider Unisys (because they didn't provide forecasts for lows), and also noted that there might be bias since there were breaks in the time series.  Also, I retained the sign information and didn't take absolute values as Brandon did.

Feb 13, 2007

Horrid stuff 2

Jp_horridstuff

Jon P took my comment on negative correlation and explored it further.  Given the large ranges of values cited in the original Economist chart, Jon concluded that there wasn't enough evidence to make a judgement.

I agree to a large extent.  Apart from the high variability of individual measurements, we also face the tiny sample of 5 cities. 
In his chart, he made an implicit assumption that the correlation of two factors is related to the product of the ranges (variability) of each factor by plotting the rectangles.

A different way of looking at it is to plot only the mid-range values (i.e. ignoring the within-city variability).  The graph on the left hand side shows very little pattern.

Resorting to the formula, I found that the correlation = -0.03.  So, a barely detectable negative correlation.  Let's visualize this.

Redo_pollutant2

In the graph on the right, I added the mean lines for both variables.  This divides the graph into four quadrants; dots that fall into the lower right and upper left quadrants make the correlation value negative.  There were three of those versus two in the positive quadrants; hence, the tiny negative correlation.
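The quadrant reading can be sketched in a few lines of Python. The five data points below are invented (they are not the Economist figures, so the correlation will not come out to -0.03), and `correlation_by_quadrant` is a hypothetical helper name. Each dot contributes the product of its deviations from the two means; the sign of that product is the quadrant, and its size depends on how far the dot sits from the mean lines.

```python
from math import sqrt

# Hypothetical mid-range values for five cities (NOT the Economist data).
no2 = [10, 20, 25, 45, 50]
particulates = [9, 7, 4, 3, 7]

def correlation_by_quadrant(xs, ys):
    """Return (Pearson r, dots in positive quadrants, dots in negative quadrants)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]  # horizontal distance from the mean line
    dy = [y - mean_y for y in ys]  # vertical distance from the mean line
    products = [a * b for a, b in zip(dx, dy)]
    # Upper right / lower left give positive products; the other two give negative.
    positive = sum(1 for p in products if p > 0)
    negative = sum(1 for p in products if p < 0)
    r = sum(products) / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return r, positive, negative

r, positive, negative = correlation_by_quadrant(no2, particulates)
print(f"r = {r:.2f}; {positive} dots in positive quadrants, {negative} in negative")
```

Note that the counts alone do not fix the value of the correlation: a dot far from both mean lines weighs more than one sitting close to them.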



Feb 12, 2007

Horrid stuff

Ec_smoke

Small multiples can work wonders when data are replicated, as in this case.  The chart accompanied an Economist article on pollution levels in several European cities, as indicated by the concentration of nitrogen dioxide and particulates.

In the junkart version, I plotted the data series side by side, rather than one over the other.  Further, the cities were ordered by decreasing level of NO2, which seemed to be the worse pollutant.  All gridlines were removed except the 30 line, which worked pretty well to separate out the highly polluted cities.

Redopollutant

An odd pattern has now surfaced.  Namely, there is some degree of negative correlation between the concentration of the two pollutants.  Environmental scientists may be able to tell us why.


Reference: "The Big Smoke", Economist, Feb 3 2007.

Oct 14, 2006

Racetrack entertainment

A warm welcome to readers of Science.  (Junk Charts is selected as "Best of the Web" this week.  Also thanks to Mitchell for the nice write-up.)

Wiredgreen

Racetrack graphs were a novelty item here some time ago.  They made an appearance in the October issue of Wired Magazine, known for its design.  We have already discussed information distortion in such charts.

This chart fails the self-sufficiency test, forcing readers to read and interpret the data labels, and to ignore the racetrack construct.

Graphical elements applied as cosmetics?  Charts sacrificing data integrity for entertainment?  This takes us back to our previous discussion: can good charts be entertaining?  Now flipped over: can entertaining charts be good?

Reference: "Good, Green Livin'", Wired Magazine, 10/2006.

Oct 09, 2006

Graphical equity 3

Zuil provides an alternative rendering of the Sankey diagram / flow chart.  This one is surely superior, being easier to understand while capturing more information than the previous example.

Govt_sankey2_1

Ultimately, however, this type of chart will please specialists more than the general reader.

It is designed to be purely descriptive, which explains the absolute equality given to each flow, as indicated by the choice of unique colors and/or patterns for each.

As a data graphic, it can be improved if the designer has a point to make.  In that situation, only the relevant flows can be highlighted while all others stay in the background.

As it stands, this chart murmurs but does not opine.

Reference: "U.S. Energy Flow - 2002", Energy & Environment Directorate, Lawrence Livermore National Laboratory.

Oct 02, 2006

Graphical equity 2

Based on my last post, Zuil and Lope engaged in a lively conversation about "flow charts", apparently also called "Sankey charts" in some circles.  Here is an example Zuil found at the EIA site:

Govt_sankey

Zuil commented that

Though often difficult to draw, Sankey diagrams are IMHO unbeatable to represent any type of lossless flow (energy, money, fluids, etc).

I mostly agree: flow charts are great at tracing flows, and it's easy to figure out proportional sources and uses from this example.  Moreover, as Lope suggested, it's fun (to read).

But... the data content of this chart is lower than that of the network graph or the Marimekko.  Imagine removing all the lines (arcs) in the network graph: that is what the flow chart includes.  It achieves more readability by simplification.


Graphical equity 1

I've been slow checking my email lately: several of you have pointed me to interesting charts; I will work through them over the next week or so.  This post is inspired by John S. who forwarded two charts, illustrating where the U.S. gets its energy and how the U.S. uses its energy.

Govt_energy

The first visualization, created by Energy Information Administration, emphasizes the physical connections between energy sources and energy use sectors.  This construct is known as a "network graph", and widely used by engineers; the ovals/rectangles are called "nodes", the lines "arcs".  It functions well as a map visualizing physical relationships but it fails as a vessel for data.  Problems are multiple:

  • The web of arcs is messy and gets worse with more nodes
  • Here, each node has either an input or an output but not both, keeping it simple.  If a node is allowed to take both input and output (the so-called transhipment node), then the graph gets messier
  • Arcs converging at a node leave little space for data labels

Optimist123_energy

Next, the Skeptical Optimist blog recast the data onto a construct known as "Marimekko" to management consultants.  Deconstructed, these are column charts, such that the width of each column represents the relative size of each energy source.

This one does a fairly effective job showing most of our transportation needs are met with oil, our electricity needs are met with coal, our energy sources are roughly split between oil, gas and coal, and so on.

One weakness of Marimekko is "inequity": by its origin as a column chart, it elevates one variable over the other.  What's the relative size of energy used by the industrial sector (blue)?  That's not a question easily answered by this chart.  Even when the column segments are adjoining, as in the case of electricity use (yellow), it is very taxing to size up the yellow area relative to the total area.
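The inequity shows up in the arithmetic behind the chart. Below is a minimal sketch with an invented source-by-sector table (arbitrary units, not the EIA figures); `marimekko_layout` and `sector_share` are hypothetical helper names. Each source's share is simply a column width, but a sector's share has to be assembled from areas scattered across all the columns.

```python
# Hypothetical energy use by source and sector, in arbitrary units.
# These numbers are invented for illustration and are NOT the EIA data.
usage = {
    "oil":  {"transportation": 30, "industrial": 8,  "electricity": 2},
    "gas":  {"transportation": 0,  "industrial": 10, "electricity": 10},
    "coal": {"transportation": 0,  "industrial": 5,  "electricity": 35},
}

def marimekko_layout(usage):
    """Column width = source share of total; segment height = sector share within source."""
    total = sum(sum(row.values()) for row in usage.values())
    layout = {}
    for source, row in usage.items():
        source_total = sum(row.values())
        layout[source] = {
            "width": source_total / total,
            "heights": {sector: v / source_total for sector, v in row.items()},
        }
    return layout

def sector_share(layout, sector):
    """A sector's overall share is the sum of its segment areas (width x height)."""
    return sum(col["width"] * col["heights"][sector] for col in layout.values())

layout = marimekko_layout(usage)
print("oil column width:", layout["oil"]["width"])
print("industrial share:", round(sector_share(layout, "industrial"), 2))
```

The reader gets the first number at a glance; the second requires mentally adding three rectangles of different widths and heights.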

So it is that we seek a graph that treats the two variables (source, use sector) equitably.  More later.

Update: Jon posted a response here, and points to a tutorial for creating Marimekko type charts.
