May 03, 2007

Less is more

Suparse

Derek pointed me to the style.org site, which also parses political speeches.  Their preferred graphic is not the tag cloud but a labeled bar chart.

From top to bottom, each bar represents a sentence, and the length of the bar is the length of that sentence.  Further, the user can specify word pairs for comparison.  Here the red bars are sentences containing the word "freedom"; the blue bars, "security".

It's a good illustration of the "small multiples" principle in constructing comparative graphics.

However, the choice of dimensions is perplexing.  I'd be much more interested in the timing of mentions of those words, rather than which sentence they appeared in.  I also find the length of each sentence to be irrelevant.

Redo_suparse

Here's one concept that brings out the point better.  It uses less space and voluntarily gives up some of the data (the sentence structure).
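For readers who want to try the timing idea, here is a minimal Python sketch.  It is not how style.org or the junkart version was produced; the function name `mention_positions` and the sample text are made up for illustration.  It reports where in a speech each mention falls, as a fraction from 0 (first word) to 1 (last word), and ignores sentence boundaries altogether.

```python
import re


def mention_positions(text, word):
    """Return each mention's position as a fraction of the speech (0 = start, 1 = end)."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    if not tokens:
        return []
    last = max(len(tokens) - 1, 1)  # avoid dividing by zero for a one-word text
    return [i / last for i, tok in enumerate(tokens) if tok == word.lower()]


# Made-up text, only to show the shape of the output.
speech = "Freedom matters. Security matters too. We defend freedom."
freedom = mention_positions(speech, "freedom")
security = mention_positions(speech, "security")
print(freedom, security)
```

Plotting these fractions along a single horizontal line per word would give the timing view without any bars.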

Apr 25, 2007

Shower of bullets

Nyt_gundeaths_sm

Here's one of those infographics that makes the reader work hard (via Dustin J).  The graphic in its full glory is here; it's much too large to be reproduced, and I have clipped off the bottom half.

Much to the designer's credit, he extracted data of interest, rather than trying to cram everything onto the page.  In particular, he was most interested in the distribution of deaths among different age groups, the types of deaths (suicides, homicides) and the identities of the deceased (race, gender).

Just like the election fraud graphic, such rich data lend themselves to multiple levels of aggregation.  Here, the designer focuses on the most detailed level, making it easiest to see facts like "among the 18-25 age group, there were 6 black men murdered per day".

However, it takes much more attention to notice higher-level facts like "homicides per day are relatively flat across age groups while suicides heavily skew toward 40+".

Redo_gundeaths_sm

In the junkart version, I decided to emphasize the more aggregated data, showing the number of deaths of each type across age groups.  The detailed breakdown by race and gender is relegated to parentheses, as it can be skipped by less serious readers.
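The roll-up from the most detailed level to the age-by-type level can be sketched in a few lines of Python.  Only the "6 black men murdered per day" figure for the 18-25 group comes from the graphic as quoted above; every other row below is a made-up placeholder, and the name `roll_up` is my own.

```python
from collections import Counter

# (age group, type of death, race, gender, deaths per day)
# Only the first row is quoted in the post; the rest are hypothetical.
records = [
    ("18-25", "homicide", "black", "male", 6),
    ("18-25", "homicide", "white", "male", 2),
    ("18-25", "suicide", "white", "male", 3),
    ("40+", "homicide", "white", "male", 4),
    ("40+", "suicide", "white", "male", 20),
    ("40+", "suicide", "white", "female", 4),
]


def roll_up(rows):
    """Sum deaths per (age group, type), dropping the race and gender detail."""
    totals = Counter()
    for age, kind, _race, _gender, count in rows:
        totals[(age, kind)] += count
    return totals


totals = roll_up(records)
for (age, kind), count in sorted(totals.items()):
    print(age, kind, count)
```

The detailed rows remain available for the parenthetical notes, while the totals drive the main display.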

The reader who discovers the homicide/suicide pattern described above may surmise that homicide gunfire deaths are more "random" while suicides, being premeditated, may affect older people disproportionately.  More research would be needed to confirm this and other suspicions.

Source: "An Accounting of Daily Gun Deaths", New York Times, April 21 2007.

 

Apr 20, 2007

Embedding logic

Bernard L. (from France) submitted this bubble chart for consideration.  It accompanied an NYT article claiming the absence of evidence of election fraud.  (Of course, as is well-known, absence of evidence is not the same as evidence of absence.  Here, I'm purely interested in data presentation.)

As a seasoned consultant, Bernard asked if a Marimekko chart would be superior.

Nyt_convictions_2

This is one ambitious chart.  Ignoring the bubbles (which are more nuisance than anything), we are asked to interpret data at three different levels of aggregation in one go.

First, there were 95 cases classified into five indictment types.  Second, these cases resulted in either convictions or acquittals/dismissals.  Third, among the cases ending in convictions (the highlighted area), we were shown the occupations of those convicted.

By flattening three levels into one table, some key information is obscured.  For example, how many cases resulted in conviction?  The reader has to compute either 95-25 or 26+31+10+3.  What percent of civil rights violation convictions were committed by party/campaign workers?  It's not 2/3 = 67% (bottom row) but rather 2/2 = 100%.
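The arithmetic above can be checked with a short Python sketch.  The counts are the ones quoted in this post; reading the bottom row as 3 cases, 2 convictions and 2 party/campaign workers is my interpretation of the two fractions, and the helper name `share` is made up.

```python
total_cases = 95
acquitted_or_dismissed = 25
conviction_counts = [26, 31, 10, 3]  # the four numbers the reader would have to add up

# Both routes to the number of convictions should agree.
convictions = total_cases - acquitted_or_dismissed
assert convictions == sum(conviction_counts)


def share(part, whole):
    """Percentage of `whole` made up by `part`, rounded to a whole percent."""
    return round(100 * part / whole)


# Assumed reading of the civil rights row: 3 cases, 2 convictions, 2 by party/campaign workers.
civil_rights_cases = 3
civil_rights_convictions = 2
by_party_workers = 2

wrong_base = share(by_party_workers, civil_rights_cases)        # divides by all cases
right_base = share(by_party_workers, civil_rights_convictions)  # divides by convictions only
print(convictions, wrong_base, right_base)
```

The two percentages differ only in the denominator, which is exactly what the flattened table hides.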

The following junkart brings out the logic that is embedded in the complicated bubble-table.  While there is a lot on the page, the text labels plus the flow directions allow readers to absorb the data one level at a time.

Redo_convictions2

I have not attempted the Marimekko as I am not a fan of such charts.  You're welcome to try.

Source: "In 5-Year Effort, Scant Evidence of Voter Fraud", New York Times, April 2007.

PS. I will be working through the backlog of reader submissions.  Thanks for your patience.  Keep them coming!

 

Remark (Apr 25 2007): Thanks to readers for keeping me honest (see comments below).  The conviction rates shown previously were indeed the inverse.  I have now fixed them.

Apr 12, 2007

Peripherals 2

In terms of interactive charting, Google Finance did much more than hide the legend.  In their main stock price chart, they used a number of neat features.

Google_ahm1

This chart effectively conveys a huge amount of information in a small space.  The bottom strip which shows relative prices for the past two years provides context to interpret the five-day movement shown in the main chart area.  I prefer to see a scale on the bottom strip as well. 

The sliding scrollbar can be dragged to show historical data.  Besides, the width of the window shown in the main area can be controlled.  For instance:

Google_ahm2

Without any effort, we are now looking at a 3-month chart for Q2 2006.  Notice the summary statistic on the top right corner also morphed.  The axis scale changed, and it never did start from zero to begin with.  (This shortcoming is alleviated by the profile chart in the bottom strip.)

Further, by placing the cursor in the chart area, we can highlight a particular day: a dot appears on the price curve, the volume on that day is highlighted, and the text on the top right switches.  That text is what we typically place inside the chart area as a "data label".  The effect of moving it to the corner is similar to hiding the legend: it makes the graph more legible and provides space for longer descriptions.  As we move the cursor from left to right, the graph dynamically adapts.  Marvellous!

Google_ahm3

The amount of data processing needed to implement these sorts of features may not be obvious.  I don't have space to address the data issue but maybe some of our readers can comment on it.
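As one small illustration of that processing, here is a Python sketch of what has to be recomputed each time the window is dragged or resized: the summary statistic and the axis limits.  This is not Google's implementation, which I have not seen; the function `window_summary` and the prices are hypothetical.

```python
def window_summary(closes, start, end):
    """Summarize the closing prices from index `start` to `end`, inclusive."""
    window = closes[start:end + 1]
    if len(window) < 2:
        raise ValueError("window must cover at least two days")
    change_pct = 100 * (window[-1] - window[0]) / window[0]
    return {
        "change_pct": round(change_pct, 2),  # the statistic shown in the corner
        "axis_min": min(window),             # axis follows the window, so it need not start at zero
        "axis_max": max(window),
    }


# Hypothetical daily closing prices.
closes = [20.0, 21.0, 19.0, 22.0, 25.0, 24.0]
summary = window_summary(closes, 1, 4)
print(summary)
```

With years of daily data and a cursor that moves continuously, this recalculation happens many times per second, which hints at why the data side is not trivial.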

Mar 27, 2007

Illusory disparity

The WSJ published a chart with the cheeky title of "Rich Get Richer" (reminiscent of the Economist).  The underlying data concerned one-, three- and ten-year returns for the buyout fund category.  For each return class, the overall mean and the means for the top and bottom 25% funds were depicted.

I won't go into the relevance of the title as I simply could not figure out how it connected with the data.  The following shows the original chart side by side with the junkart version.

Redo_richgetricher

Improvements include:

  • Lines show the comparisons with a minimum of fuss compared with colored bars
  • The overall mean return is placed in the middle of each line segment where it belongs, instead of being the first column
  • The axis label, "annualized return", tells readers what the performance measure is
  • Adding the word "funds" to "top quartile" and "bottom quartile" removes the possible confusion that those represent individual returns of the funds ranked at 25th and 75th percentiles, rather than the average returns of the bottom 25% and top 25% of funds
  • The linear construct paints the correct picture that individual fund returns fall into a continuum

(Thanks to my students for some of these points.)
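The distinction between "top quartile funds" and "top quartile" can be made concrete with a small Python sketch.  The returns below are hypothetical, not the WSJ data.  The sketch compares the average return of the top 25% of funds with the return of the fund sitting at the 75th percentile, and does the same at the bottom.

```python
from statistics import mean, quantiles

# Hypothetical annualized returns (%) for twelve funds.
returns = [-5, 0, 2, 4, 6, 8, 10, 12, 15, 20, 30, 50]

ordered = sorted(returns)
k = len(ordered) // 4  # number of funds in each quartile group

top_quartile_funds_mean = mean(ordered[-k:])    # average of the best 25% of funds
bottom_quartile_funds_mean = mean(ordered[:k])  # average of the worst 25% of funds

# Cut points at the 25th, 50th and 75th percentiles (default "exclusive" method).
p25, p50, p75 = quantiles(ordered, n=4)

print(top_quartile_funds_mean, p75)
print(bottom_quartile_funds_mean, p25)
```

The group averages sit further out than the percentile cut points, so the two readings of "quartile" can give quite different impressions of the spread.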

Reference: Wall Street Journal, Mar 3-4 2007.

Feb 25, 2007

Going out on a limb

Earlier in the month, Prof. Gelman linked to Brandon's fascinating analysis of on-line weather forecasting accuracy.  I have done some additional analysis of the data and the result can be visualized as follows.

Redoonlineweather


I'll concentrate my comments on three observations:

  • CNN was the clear winner in forecasting accuracy during this period based on two criteria: its median error in forecasting daily lows, and its median error in forecasting daily highs.  Moreover, both median errors were zero, which gives us confidence about its accuracy.  The Weather Channel (TWC) and Intellicast (INT) were not far behind.
  • The ability to forecast highs was better across the board than that of forecasting lows (except BBC).  I am not sure why this should be so.
  • Overall, our weather forecasters were much too risk-averse.  Notice that the errors were heavily biased in the lower left quadrant.  A negative error on low temperatures means predicted low is higher than actual low; a negative error on high temperatures means predicted high is lower than actual high.  Taking these together, we observe that the range of actual temperatures has generally been larger than the range of predicted temperatures!  No one was willing to go out on a limb, so to speak, to forecast extremes.

Actually, I believe this inability or unwillingness to forecast extreme values is endemic to all forecasting methodologies.

Before closing, I mention that the graph was based on a subset of Brandon's data.  I only considered same-day forecasts, did not consider Unisys (because they didn't provide forecasts for lows), and also noted that there might be bias since there were breaks in the time series.  Also, I retained the sign information and didn't take absolute values as Brandon did.
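To make the sign conventions concrete, here is a minimal Python sketch using the definitions as worded in the third observation above: the low error is negative when the predicted low is above the actual low, and the high error is negative when the predicted high is below the actual high.  The temperatures are hypothetical, not Brandon's data, and the exact formulas are my reading of that wording.

```python
from statistics import median

# Hypothetical same-day forecasts: (predicted low, actual low, predicted high, actual high)
days = [
    (30, 28, 45, 47),
    (32, 31, 50, 50),
    (25, 22, 40, 44),
    (35, 36, 55, 54),
    (28, 25, 48, 51),
]

# Signed errors, keeping the sign rather than taking absolute values.
low_errors = [actual_low - pred_low for pred_low, actual_low, _, _ in days]
high_errors = [pred_high - actual_high for _, _, pred_high, actual_high in days]

# How much wider the actual range was than the predicted range, day by day.
range_gaps = [
    (actual_high - actual_low) - (pred_high - pred_low)
    for pred_low, actual_low, pred_high, actual_high in days
]

print(median(low_errors), median(high_errors))
print(range_gaps)
```

Under these definitions the range gap equals minus the sum of the two errors, so a point in the lower left quadrant always corresponds to a day on which the actual range was wider than the forecast range.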

Feb 22, 2007

Bubbles of death 2

Here is an alternative way to present the death risk data.  It's a variation of Tukey's stem-and-leaf plot.  Instead of presenting the exact odds, I believe it is sufficient to generalize the data by grouping them into categories.  Not much is to be gained by knowing that the odds of dying from fire and smoke are 1 in 1113 as opposed to the odds being in the range 1 in 1000 to 1 in 10,000 and comparable to those of drowning, motorcycle accident, etc.
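The grouping step can be sketched in Python as follows.  Only the fire-and-smoke odds (1 in 1113) come from the data discussed here; the other two entries are placeholders, and the function name `odds_bucket` is made up.

```python
from collections import defaultdict


def odds_bucket(one_in):
    """Label the power-of-ten band that contains odds of '1 in one_in'."""
    if one_in < 1:
        raise ValueError("odds must be at least 1 in 1")
    exponent = len(str(int(one_in))) - 1  # number of digits minus one
    low, high = 10 ** exponent, 10 ** (exponent + 1)
    return f"1 in {low:,} to 1 in {high:,}"


odds = {
    "fire and smoke": 1113,        # from the post
    "hypothetical cause A": 250,   # placeholder
    "hypothetical cause B": 5000,  # placeholder
}

groups = defaultdict(list)
for cause, one_in in odds.items():
    groups[odds_bucket(one_in)].append(cause)

for band, causes in groups.items():
    print(band, causes)
```

Each band plays the role of a stem, with the causes of death as the leaves.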

Redooddsdying


PS. Be sure to look at Derek's chart in the comments.

Dec 13, 2006

The trouble with percentages

In the aftermath of the Democratic victory in the 2006 mid-term election, the NYT published a column floating the idea that "it was the economy, stupid".  For statistics buffs, this column provides much food for thought. 

Suffice it to say, if you were my student, you would not want to hand this in as an essay.  To the author's credit, he did backload the article with lots of disclaimers.

The key thesis of the piece is:

if your state wasn’t among the best economic performers in the last six years, judged by the growth of personal income, it appears that you were three times as likely to vote to throw the bums out.

Redo_election06b_1

(We'll just assume he didn't mean "you" but "your state".) To help us understand the author's logic, I created a scatter plot, relating the change in state average personal income (2000-2006) to the change in percent of Republican seats.

He first segmented the states into two groups: the red dots had the top 10 income growth rates; the blue dots were the remaining states.  Then for each group, he computed the average drop in % Republican.  For the reds, it was 2%; for the blues, it was 7%.  (These levels are indicated by the horizontal lines.  My data are slightly different from his.)  Case proven -- with disclaimers.

Some of you are already counting the dots.  If you only find 42, you'd have counted correctly.  The following explanation provided by the analyst is classic:

It’s easier to answer this question if you leave out the six states that didn’t elect any Republicans in 2000; after all, they didn’t have any to throw out. If you also remove New Hampshire and South Dakota, where the percentage of Republicans elected dropped to 0 from 100 — New Hampshire only has two seats in the House and South Dakota has one — a pattern starts to appear.

I will leave the emergent pattern thesis to a future post.  For this post, I am interested in the trouble with percentages.  He is right to point out that for those 100% Blue states, the change in %Republican is constrained to be non-negative, from 0% up to 100%.  For most other states, the change can be positive or negative.

Good observation but wrong remedy -- those six states with 0% Republicans in 2000 are not special; removing them from the analysis is wrong-headed.  What about those states with 100% Republicans in 2000?  There, the change in %Republican can only be 0% or negative.  In fact, the possible range for the change in seats for each state is different, and it depends on the Republican proportion in 2000!  For example, if in 2000 the Republicans held 30% of the seats, then in 2006, the change must be between -30% and +70%.

The situation is worse: the range of possible values also depends on the number of seats in each state.  The fewer total seats there are, the fewer values the change can take.  As the author notes, with only 1 seat, you either lose it, gain it or retain it, so that the change will be either -100%, +100% or 0%.  No other values are possible!
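Both constraints can be seen by listing every possible outcome for a state.  The Python sketch below does that; the function name `possible_changes` is made up, and it assumes the state's total number of seats is the same in 2000 and 2006, which is a simplification.

```python
def possible_changes(total_seats, rep_seats_2000):
    """List every possible change in % Republican, in percentage points.

    Assumes the state's total number of seats did not change between elections.
    """
    if total_seats < 1 or not 0 <= rep_seats_2000 <= total_seats:
        raise ValueError("need 0 <= rep_seats_2000 <= total_seats and at least one seat")
    return [
        round(100 * (rep_seats_2006 - rep_seats_2000) / total_seats, 1)
        for rep_seats_2006 in range(total_seats + 1)
    ]


print(possible_changes(1, 1))   # one seat, held by a Republican in 2000
print(possible_changes(1, 0))   # one seat, not held by a Republican in 2000
print(possible_changes(10, 3))  # ten seats, 30% Republican in 2000
```

A one-seat state has only two possible outcomes, while a ten-seat state starting at 30% Republican has eleven, running from -30 to +70.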

Both of the above troubles arise because we use percentages to describe something discrete (number of seats).  This is a difficult problem and I don't know of a general solution.  However, in this example, because the change in seats is small across all states, regardless of the total number involved, I recommend that we avoid percentages and stick with positive, zero and negative changes.

Redo_election06c

The boxplot shows that there is little correlation between income growth and whether Republicans would win or lose House seats in 2006.  Here, the states are divided into three groups depending on whether the Republicans gained, lost or retained seats in the 2006 mid-term election.  The median income growth is similar in all three groups and the boxes overlap heavily.
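The grouping behind the boxplot can be sketched in Python.  The six states and their numbers below are hypothetical, not the data used in the chart, and `seat_change_group` is a made-up helper name.

```python
from collections import defaultdict
from statistics import median


def seat_change_group(rep_seats_2000, rep_seats_2006):
    """Classify a state by the sign of the change in Republican seats."""
    diff = rep_seats_2006 - rep_seats_2000
    if diff > 0:
        return "gained"
    if diff < 0:
        return "lost"
    return "retained"


# Hypothetical rows: (state, Republican seats 2000, Republican seats 2006, income growth %)
states = [
    ("State A", 5, 4, 28.0),
    ("State B", 3, 3, 30.0),
    ("State C", 8, 6, 31.0),
    ("State D", 2, 3, 29.0),
    ("State E", 4, 4, 27.0),
    ("State F", 6, 5, 30.0),
]

groups = defaultdict(list)
for _name, seats_2000, seats_2006, growth in states:
    groups[seat_change_group(seats_2000, seats_2006)].append(growth)

medians = {label: median(values) for label, values in groups.items()}
print(medians)
```

Using only the sign of the change sidesteps both the 0% and 100% ceilings and the small-state problem, at the cost of ignoring how many seats changed hands.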

Reference: "Maybe You Did Vote Your Pocketbook", New York Times, Nov 12 2006.

PS. If you like this post, consider sending me a holiday gift.

 


Dec 05, 2006

Time travel

Cambridge_traveltime_web

One of my scientific heroes and seminal teachers is Professor Frank Kelly at Cambridge.  What a pleasant surprise to see his involvement in a data visualization project.  To cite his wise words:

The travel-time maps are more than just pretty to look at; they also demonstrate an innovative way to use and present existing data. We are entering a world where we have access to vast quantities of data, and ways of turning that data into information, often involving clever ideas about visualisation, are becoming more and more important in science, government and our daily lives.

The little black dot near the center of the map indicates the Mathematics building at Cambridge.  The contours (vaguely visible at our scale) represent intervals of 10 minutes by public transportation away from the black dot.  Any colored dot on the map refers to the time at which a traveller must leave in order to get to the Math building by 9 am, taking into account traffic situation, time of day, and decisions.  The hope of such maps is to help commuters (by public transit) plan their travel.

Professor Kelly has a very nice write-up on the intricacy of generating the data for such a map, which includes techniques of sampling, smoothing, extrapolation and so on.  It is rare that we get insights into the chart-making process.  He also carries a larger version of the travel-time map.

A similar article can be found at Plus magazine.

Dec 01, 2006

Smoking-Screening

Smokeathome2

Behind the smokescreen lies the informative conclusion: among households with smokers, about 40% smoke in residence all the time while about half never smoke in residence.

This graphic, unfortunately chosen, contains many distractions from the main message, including:

  • the liberal sprinkling of colors
  • the inclusion of data for 1, 2, 3, 4, 5, 6 days, almost all of which were effectively zero
  • the redundant vertical scale, as all the data already appeared on the chart itself
  • the comparison of smokers to "total sample" (rather than non-smokers)
     

The last point merits special attention.  The total sample contains households with smokers as well as households without smokers. Any data from the total sample is a weighted average of these two types of households.  It is better to directly compare the two household types than to indirectly compare one type to the overall.

Further, households without smokers should be extremely likely to have no smoking in residence all week.  And if most households have no smokers (76% of this sample), then the statistics of the total sample will mimic those of no-smoker households.  That is to say, the total sample statistics do not add much to the analysis.  Our junkart version below corrects for this as well as other things.
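The weighted-average point can be illustrated with a few lines of Python.  The 76% share of no-smoker households and the "about half" figure for smoker households come from the discussion above; the 98% rate for no-smoker households is purely an assumption for illustration, as is the function name `total_sample_rate`.

```python
def total_sample_rate(share_a, rate_a, rate_b):
    """Weighted average of two groups' rates, where group A makes up `share_a` of the sample."""
    return share_a * rate_a + (1 - share_a) * rate_b


share_no_smoker = 0.76       # from the survey, as quoted above
never_rate_no_smoker = 0.98  # ASSUMED: nearly all no-smoker households have no smoking at home
never_rate_smoker = 0.50     # "about half" of smoker households never smoke in residence

total = total_sample_rate(share_no_smoker, never_rate_no_smoker, never_rate_smoker)
print(round(total, 4))
```

The total-sample figure lands much closer to the no-smoker rate than to the smoker rate, so comparing smokers against it understates the gap between the two household types.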

Redo_smokeathome

One of the key functions of a graph is data reduction, i.e. to aggregate data in such a way as to expose the information contained within.  Typically, a graph that uses aggregated data is clearer and stronger than one that plots every piece of data.  In this example, by combining 1-6 days into a single category ("smokes in residence part of the week"), we have a graph that is much more readable.
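The data reduction step amounts to collapsing eight categories into three.  Here is a minimal Python sketch; the percentages are illustrative values chosen to be roughly in line with the conclusion quoted above (about half never, about 40% all week), not the survey's actual figures, and the function name `collapse` is made up.

```python
def collapse(pct_by_days):
    """Collapse days-per-week categories (0 to 7) into never / part of the week / all week."""
    reduced = {"never": 0, "part of the week": 0, "all week": 0}
    for days, pct in pct_by_days.items():
        if days == 0:
            reduced["never"] += pct
        elif days == 7:
            reduced["all week"] += pct
        else:
            reduced["part of the week"] += pct
    return reduced


# Illustrative percentages of smoker households by days per week of smoking in residence.
pct_by_days = {0: 50, 1: 2, 2: 1, 3: 2, 4: 1, 5: 2, 6: 2, 7: 40}
reduced = collapse(pct_by_days)
print(reduced)
```

Three bars carry the same message as eight, and the six near-zero categories no longer compete for attention.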

I want to thank Dr. Mike Rabinoff for inspiring me to look up these second-hand smoking statistics.  Mike recently published a book called "Ending the Tobacco Holocaust", which tells you more than you want to know about the tobacco industry.


Reference: "Second Hand Smoke Survey: Final Report", Madison Department of Public Health, Dec 2003.
