Jul 21, 2008

Joining the fun

We hope this is an indication that the British paper the Guardian (with one of the best websites out there) is joining the fun.  It appears that they have quietly debuted an interactive graphics feature.  The first edition addressed the oil price crisis.

This time-series chart has much to be commended:

Guk_blackgold1


The use of inflation-adjusted figures seems obvious but we don't see many of these in the press.  Highlighting the peaks and providing annotation (when moused over) is an excellent touch.  The gridlines and axis labels (especially the year axis) are thankfully restrained.  We don't see the need for the unadjusted series (blue line), however.  The fact that the gap grows larger the further back we go tells us little; it invites readers to read more into it than what it truly is, the time value of money.

Later on, they used an oil barrel object to illustrate the components of the retail oil price.  The height of the cylinder is indeed proportional to the data plotted.  If only they had colored the end of the cylinder gray instead of green!  As it stands, the green portion has about the same area as the red.


Guk_blackgold2


Reference: "Interactive: oil price", Guardian, July 14 2008.

Jul 18, 2008

Seth on bar charts

Seth followed up his post about graphics with a specific post about pie charts versus bar charts.  He prefers pie charts.  We happen to agree with his unhappiness with grouped bar charts.  Unfortunately he compared a univariate pie chart (depicting point-in-time data) with a multivariate bar chart (illustrating time-series data).

Here we present a different example, derived from a NYT article on diabetes in America.  The original chart is a series of pie charts, one for each age group, and one for the aggregate data.

Redo_diabetes

The junkart version uses a bar chart.  Readers can get a more precise comparison of the prevalence rates across age groups because it is easier to judge lengths than areas.  This has been scientifically proven by the likes of Cleveland.

Dirty trick, you might say, because the original chart actually prints the data in each pie.

Nyt_diabetes

So now there is no mistaking the data.  This raises a philosophical question: why bother graphing the data if the reader needs to read the data in order to understand the chart?  We call this the self-sufficiency test.  The graphical elements of a pie chart can't stand on their own.


Reference: "Diabetes - underrated, insidious, and deadly", New York Times, July 18 2008.

Jun 30, 2008

A splitting headache

Fry_baseballsalary

Todd B didn't like this chart showing the correlation between baseball team salaries and their win-loss records.

A few problems are in plain sight:

  • Most importantly, putting a second set of logos next to the salaries column would really help
  • Unclear why the lines should be of varying widths
  • Winning percentage is more telling than win-loss, especially in the middle of a season when there is a slight imbalance in total games played
  • The spread of salaries is so wide (10 times) that reducing the numerical scale to a rank scale meant a big loss of information
  • Each column is sorted by its own metric while the most important sorting variable should be the slope of the lines (i.e. the cost per win)


The interactive feature of individual plots for each day (control bar at the top) of the baseball season is something of a gimmick.  Props though for realizing that the first few days of the season don't tell us anything.  There really is little use for investigating this correlation on a day-by-day basis, particularly when the salaries are given in aggregate.

On the diagram, the blue lines represent teams such as the Devil Rays and Arizona that had better winning records than their salaries would suggest.  Red lines display those teams spending more money than their records would suggest.  The steeper the line, the better/worse the team's cost efficiency.

With so many long steep lines in both colors (directions), one might posit that a negative correlation may exist between salary level and winning record. 

The following scatter plot suggests otherwise:

Redo_baseballsalary

The correlation between salary and winning is very weak.  If one were to fit a linear model, it would show that the higher-salaried teams generally were doing slightly better (black line).  The Yankees were sufficiently outside the range in salaries that I didn't include them in estimating the line.  (However, as the chart shows, the line in fact estimated the Yankees' winning percentage really well.)

Teams above the line are performing better than their salaries would lead us to believe. 
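For readers who want to try this kind of fit themselves, here is a minimal sketch in Python.  The payrolls and winning percentages are made up for illustration (they are not the 2008 figures), the team names are placeholders, and the function names are my own.  It fits an ordinary least squares line while leaving out one salary outlier, then checks which teams sit above the line.

```python
# Hypothetical payrolls ($ millions) and winning percentages; NOT real 2008 data.
teams = {
    "Team A": (45, 0.540),
    "Team B": (60, 0.450),
    "Team C": (75, 0.520),
    "Team D": (90, 0.480),
    "Team E": (105, 0.510),
    "Team F": (120, 0.530),
    "Team G": (135, 0.500),
    "Outlier": (210, 0.550),  # stands in for a team far outside the salary range
}

def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Estimate the line without the outlier, as described in the post.
fit_points = [value for name, value in teams.items() if name != "Outlier"]
intercept, slope = fit_line(fit_points)

# Residual = actual winning percentage minus what the payroll would predict.
residuals = {
    name: pct - (intercept + slope * payroll)
    for name, (payroll, pct) in teams.items()
}
above_line = sorted(name for name, r in residuals.items() if r > 0)

print(f"slope per $1M: {slope:.6f}")
print("above the line:", above_line)
```

A positive residual corresponds to a dot above the line: the team is winning more than its payroll alone would predict.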



Reference: Ben Fry's baseball salary page

Jun 12, 2008

Rise and fall

Via Adam came this "colorful" chart of the rise and fall of house prices since 2000, as measured by the Case-Shiller index.  He commented that this showed the old saw "the taller they are, the harder they fall".

Houseprices

A different chart allows us to test this theory directly.  From the above, we noted that each curve was composed of two phases, a long rise from 2000 to roughly the mid-2000s followed by a steep decline.  We computed two data series: the average monthly growth rate during the inflation phase and the average monthly decline during the deflation phase.  The scatter plot showed the correlation.

Redo_houseprice

The dots displayed pretty strong correlation, confirming that on average, the faster they rise, the steeper they fall.

The diagonal line indicated equal rates of growth and subsequent decline.  The cities above the line, especially Boston and New York, have witnessed declines that were much slower than the earlier rises.  On the other end, cities like Detroit, Cleveland, Atlanta and Dallas suffered price deflation much faster than earlier inflation.  Indeed, the ratio of decline to rise rates is given by the slope from the origin to the dot.
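The two rates can be sketched in a few lines of Python.  The index series below is invented (a steady 1% monthly rise followed by a steady 2% monthly fall), not Case-Shiller data, and I am assuming a compound average rate and a split at the peak; the post does not say exactly how the averages were computed.

```python
def avg_monthly_rate(start_value, end_value, months):
    """Compound average monthly rate of change between two index levels."""
    return (end_value / start_value) ** (1 / months) - 1

def rise_and_fall(series):
    """Split a monthly index series at its peak.

    Returns (average monthly rise, average monthly decline), both as
    positive numbers. Assumes the peak is neither the first nor last month.
    """
    peak = max(range(len(series)), key=series.__getitem__)
    rise = avg_monthly_rate(series[0], series[peak], peak)
    fall = -avg_monthly_rate(series[peak], series[-1], len(series) - 1 - peak)
    return rise, fall

# Hypothetical city: 10 months of 1% growth, then 5 months of 2% decline.
rising = [100 * 1.01 ** month for month in range(11)]
falling = [rising[-1] * 0.98 ** month for month in range(1, 6)]
rise, fall = rise_and_fall(rising + falling)

# Ratio of decline rate to rise rate for this city.
ratio = fall / rise
print(f"rise {rise:.4f}, fall {fall:.4f}, ratio {ratio:.2f}")
```

A ratio above 1 means prices are falling faster than they rose; a ratio below 1 means the decline has been slower than the run-up.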

As for the original chart, it showed all the signs of Excel defaults.  It just does not make sense for a charting program to pick a different color for each time series, no matter how many there are.  Beyond four or five colors, it is impossible for readers to tell the lines apart.  In these situations, we should adopt a foreground / background strategy: decide on the key lines, highlight those with color, gray out the remaining lines.


Reference: Standard & Poor's

May 27, 2008

Back to basics

The holiday weekend permitted me to browse through stacks of unread magazines, and collect more examples of charts out there in the media.  The following chart managed to turn a series of six numbers into a brain-teaser!

Dmnews_confidence

The title "Consumer confidence in a strong US economy" reminded us of the consumer confidence index often cited by the media, except that indices are seldom expressed in terms of percentages.  Further, the use of pies hinted at proportions rather than indices.  The string of data labels above the pies added mystery, especially those red and green (down and up) arrows written next to percentages, indicating growth rates.  Indices, proportions, growth rates: talk about mixed metaphors!

 Big_confidence
Some investigation was in order.  The original press release from BIGresearch provided the solution. Helpfully it even came with a graph (shown left).  It's your typical Excel bar chart with most of the default options plus some shadowy coloring.

DM News managed to turn this pedestrian chart into the series of pies, which looked nicer but was confusing.  Here, we found that the data were proportions, specifically proportions of responders in annual polls who expressed confidence or high confidence in the chance of a strong US economy.  The key message concerned the sharp drop in proportion from 2007 to 2008.

  
Redo_confidence

The least harmful use of pie charts is to depict proportions.  However, as shown clearly here, while pies display proportions adequately, they do a poor job of showing changes in proportions over time.  We show two alternatives.  The stacked bar chart on top is superior if proportions are deemed important enough to depict directly.  If not, one prefers the line chart that brings out the rates of change in proportions over time.

 



Reference: DM News, Feb 25 2008.  BIGresearch Feb 08 Executive Briefing.

May 11, 2008

A matter of timing

A reader, Carly C. from Streetsblog, created the following chart and wanted to know if there are better ways to present the data.  She already disliked the double axes and thought of various options including using a relative scale.

Blog_bikecrash

Generally speaking, dual axes in which each axis takes its own scale are like a football team with two "good" quarterbacks rotating under center, or two "great" CEOs sharing power.  We have never seen those situations work out.

When we have two quantities under comparison, we like to put them on the same scale.  In this case, converting the scale from absolute numbers to relative would do the trick.

The data paint a powerful story: as bike volume increased over time, bike accidents decreased.  The stitching together of two lines at year 1999 was an artifact of manipulating the scales.  What Carly had in mind can be accomplished using an index set at 100 in 1999.  This would lead to the chart shown below.  The substance of this chart and Carly's original is the same but the revised one has a single axis.

Redo_bikecrash1

Indexing time series data is a widely used technique.  Each issue of the Economist, for example, contains many such charts.  This type of chart, however, suffers from a critical and under-appreciated problem: the visible pattern frequently and critically depends on timing.  Specifically, it makes a huge difference which year is selected as the baseline (index=100). 

A lot of mischief is possible by picking a special baseline.  For example, I created the same chart three times, using 1998, 1999 and 2000 respectively as baselines.  When 1999 was 100 (middle chart), a criss-cross pattern showed up between 2001 and 2002, leading readers to conclude that the gap between growth in volume and growth in accidents developed during 2001.  In the other two charts, the gap appeared around 2000.  Also, the bottom chart exhibited a clear growing gap (after dumping the disagreeable data before 2000).

Redo_bikecrash2a_2

Unfortunately, this is a feature of such charts; whether or not timing distorts the information presented depends on how rugged the underlying data is.  Put another way, these charts can be affected by outliers.  (In this example, there were sharp changes in bike volumes in 1998-2000.)
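The baseline effect is easy to reproduce.  Below is a small Python sketch with invented yearly counts (not the Streetsblog data) and helper names of my own choosing.  It indexes both series to 100 at a chosen baseline and reports the gap between the two indexed lines for each year.

```python
# Hypothetical yearly counts; NOT the Streetsblog data.
bike_volume = {1998: 80, 1999: 100, 2000: 90, 2001: 110, 2002: 130}
bike_crashes = {1998: 50, 1999: 45, 2000: 48, 2001: 44, 2002: 40}

def index_series(values_by_year, base_year):
    """Rescale a series so that the base year equals 100."""
    base = values_by_year[base_year]
    return {year: 100 * value / base for year, value in values_by_year.items()}

def gap_by_year(base_year):
    """Volume index minus crash index for each year, given a baseline."""
    volume = index_series(bike_volume, base_year)
    crashes = index_series(bike_crashes, base_year)
    return {year: round(volume[year] - crashes[year], 1) for year in volume}

# Same data, three baselines, three different-looking gaps.
for base_year in (1998, 1999, 2000):
    print(base_year, gap_by_year(base_year))
```

With these made-up numbers, the 1998 baseline puts the volume line above the crash line from 1999 onward, while the 1999 baseline keeps it at or below the crash line through 2000.  The underlying data are identical; only the year pinned to 100 has changed.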

 

Reference: Streetsblog



PS. [5/12/2008] How opportune was Andrew's post on R graphics default headaches.  I was too lazy to figure out the defaults and let R figure out the dimensions (poorly); with Jake's suggestions, the new set of charts looked much better.

Feb 03, 2008

Redundancy

Nick B., who occasionally writes about statistical graphics, found some classic chart junk from a Canadian report on the Afghan army.  Here's one example, together with the junkchart version.

Redoafghan_2

Redundancy is an enemy of good graphics, and incongruous redundancy is worse.  Here, troop level is variously described as "total force size", "strength" and "army growth"; the chart on the right uses only the army concept.  The data labels ("47000 Strength"), the axis labels ("50000 Total Force Size"), and the gridlines all germinate from the five grand data points underlying the entire chart!

Another distorting feature is the use of different-sized time intervals, which we space out appropriately on the right chart.

Ultimately, the key message should be growth in the army size, not the absolute number of troops.  The slopes of the line segments encode this information.  Alternatively, a data table can be rather powerful for simple data like this:

Redoafghan2

By what is called the "end state", there would be 70% more troops than those as of December 2007.

 


Dec 09, 2007

Lacking buzz

Nielsen, they of the ratings, is roughing it in the information age.  When they announced on-line tracking tools, Wired quipped: "It's looking like online video policing companies will have to make room for another deputy."  Last year, cable companies revolted over a service measuring the effectiveness of commercials.

Via the Data Mining blog, I learnt about yet another new on-line offering, called "Hey! Nielsen" for obscure reasons.  (Perhaps Hey! Nielsen is the new Yahoo! !)

The site is an enigma wrapped in a mystery.  The official description says:

Hey! Nielsen is the place to make a name for yourself while trading opinions on TV, movies, music, personalities, web sites and more.

How does one "trade" opinions?

According to the FAQ, the "Hey! Nielsen" score, the cornerstone of the site, is:

a real-time indicator of a topic's impact and value and you play a major role. As the site evolves and users submit their opinions and commentary, the score will rise or fall based on a number of factors including, but not limited to, user opinions, news coverage, and raw data from our sister sites Billboard.com, HollywoodReporter.com, and BlogPulse.com.

Sounds like a product aimed at marketers to help them track public opinion but offering little control over sampling. 

The "Hey! Nielsen" buzz chart (below) captures the change in "Hey! Nielsen" score over time.

Heynielsen

This chart is an unfortunate case of flipping background into foreground.  What grabs our attention are those hideous white circles with numbers in them.  The legend explains that these are the daily numbers of opinions on the subject, in other words, the daily sample sizes.  As they stand now (with the site still in beta), they serve to expose the low level of participation, leading to small sample sizes, and irrelevance.  But what happens when the site becomes super-popular: would the circles say 56234, 19245, 90257, etc.?  Why would visitors care about daily sample sizes anyway?  Mousing over these circles reveals text, but in most cases it is blocked by neighboring white circles.

In the meantime, the circles obscure the line which shows the trend in the "Hey! Nielsen" score over time.  This chart reminds me of that Google toy known as Google Trends.  The Googlers provide no vertical scale so the graphs are unreadable.  "Hey! Nielsen"ers provide a vertical scale -- kind of -- but the graphs are still meaningless: what does a score of 881 mean?  how about 724?  what is the maximum score?  what is the minimum?  Beware numbers without context.

The vertical axis does start from zero but has an odd spacing of tick labels. The gridlines are distracting and serve no purpose.  The orange area under the curve also makes little sense.

We look forward to seeing version 2.0.

 

Oct 17, 2007

Points of comparison

Econ_mortgage

In light of the current housing crisis, arising from mortgage defaults, I pulled this graphic from a Jan 2007 opinion piece that plotted historical default rates of mortgages.  Notice the high degree of stretching on the vertical axis that exaggerates the volatility: essentially, the annual delinquency rate ranged from 1.75% to 2.65% during the last six years or so.  One might be forgiven for thinking that a 2% default rate is quite acceptable.

Nyt_mortgage_2

Compare the above chart to the pair that showed up in the NYT in Oct 2007 (see right).  The default rates here are in the 10-20% range, very alarming indeed.

The two graphics illustrate a key issue of "aggregation" in statistical analysis.  The first graphic is super-aggregated: all types of mortgages of all ages are put together to calculate each year's default rate.  The second graphic homes in on subprime mortgages only.

More importantly, the second graphic presents data in "vintages".  Each line represents loans originated during a particular year (a "vintage").  This establishes comparability.  On the first chart, each point in time represents the default rate of mortgages averaged over all ages (some loans may be only a few months old; others may be 15 years old).  Since the default rate is much higher for very young mortgages than for older mortgages, such averaging hides crucial information.
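To see how pooling hides the vintage story, consider this small Python sketch.  Every number in it is invented, and the record layout (vintage, loan age, loans, loans entering default at that age) is my own assumption rather than how either source organized its data.

```python
# (vintage, loan age in months, loans, loans entering default at that age)
# All numbers are made up for illustration.
records = [
    (2005, 6, 1000, 30),
    (2005, 18, 1000, 10),
    (2006, 6, 1000, 50),
    (2006, 18, 1000, 20),
    (2007, 6, 1000, 90),
]

def pooled_rate(rows):
    """One default rate over all vintages and ages (the aggregated view)."""
    total_loans = sum(loans for _, _, loans, _ in rows)
    total_defaults = sum(defaults for _, _, _, defaults in rows)
    return total_defaults / total_loans

def rate_at_age(rows, age):
    """Default rate per vintage, compared at the same loan age."""
    return {
        vintage: defaults / loans
        for vintage, row_age, loans, defaults in rows
        if row_age == age
    }

pooled = pooled_rate(records)
at_six_months = rate_at_age(records, 6)

print(f"pooled rate: {pooled:.3f}")
print("rate at 6 months by vintage:", at_six_months)
```

In this toy data the pooled rate is a single middling number, while the like-for-like comparison at six months shows each vintage doing worse than the one before it.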

Overall, the NYT graphic very effectively conveys the alarming trend of new mortgages performing much worse, especially those originated in 2007.

Redo_mortgage

It can benefit from two slight edits: adding a few more years, and using vertical lines (the most critical comparisons are default rates for loans of a given age!)  Something like this...


Sources: "As Defaults Rise, Washington Worries", New York Times, Oct 16 2007; "Mounting Mortgage Credit Problems", economy.com, Jan 23 2007.

Aug 28, 2007

Cheers

Nyt_mets07


This is an exemplary chart from the NYT Sports page.  It provides a clear, informative and exciting way to visualize how the baseball season has gone for the Mets this and last year.  It's been mostly up and not much down. 

We can observe the more subtle differences: last season was a steady rise with only two prolonged down periods; this season's curve is driven by two up periods (including right now), outside of which the record has hovered around two levels (0, +3).

Especially commendable is the judicious use of axis labels.  However, I'm not clear on how some of the labels were chosen.  For example, 14 games ahead seems to me a rather arbitrary one.

All in all, a job well done.

Source: "Not Only Yankee Fans Cheering for Week 22", New York Times, Aug 27, 2007
