Apr 05, 2008

Making maps

I love articles that expose the behind-the-scenes of creating complex graphs.  This Wall Street Journal blog post tells us some dirty secrets behind these cartograms that depict the "influence" of different media outlets throughout the world.

[Image: Wsj_mediacartogram]

(Via Andrew Sullivan; he's dissing NYT again)

Additional links of interest:

Original posting at Paris-based L’Observatoire des Médias blog

Boing Boing

Gawker

Online Journalism Blog (warning: this link appears to have been taken over by a rogue script from an advertiser or some other script distributor, so the page wasn't loading when I tried)


Apr 04, 2008

Believe it or not

Via the Social Science Statistics blog, I found this article in the Times about baseball's longest hitting streaks.  The authors ran 10,000 simulations of baseball seasons using historical data to come up with a probability distribution of the longest hitting streak in each season.  They showed the following chart.

[Image: Nyt_streaks]

The record was 56 consecutive games with hits in a season, which in some circles is seen as unbeatable.  These authors, in a fit of scientific skepticism, found that in any season, the simulated longest streak ranged from 39 to 109, with the median at 53 games.  They concluded that "the unlikely becomes likely".


That is sure to turn some heads.  I have a question for them as I can't make sense of these numbers.  A median of 53 means that about 50% (or 5,000 out of 10,000) of the simulated seasons ended up with a longest hitting streak of 53 games or more.  Empirically, according to here, DiMaggio's was the only one to go over 53.  Using the authors' timeline of 1871 to 2005, that would be 134 seasons.  One out of 134 is a 0.75% probability.  0.75 versus 50... sounds like something has gone wrong.

The article doesn't give enough details on the simulation so it is hard to understand what is going on.  I hope I am not misinterpreting their analysis.


 

Source: "A Journey to Baseball's Alternate Universe", Samuel Arbesman and Steven Strogatz, Mar 30 2008.


PS. As readers pointed out, each simulation is of all the seasons.  So the histogram is saying that the particular sequence of 134 seasons that we lived to see is not a rarity considering all the possibilities.  I'm not sure this is telling us much.  It doesn't address the question of how likely it is that the 56-game record will be beaten in the future.  It can't address this question because the particular sequence is now already set; the alternative universes are irrelevant because we can't jump from one universe to another mid-stream.
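A rough check on the readers' point, under my own simplifying assumption (not the authors') that the 134 seasons are independent and each has the same chance q of producing a streak longer than 53 games.  If there is a 50% chance that at least one of the 134 seasons tops 53, the implied per-season chance is tiny:

```python
# Sketch: relate the "per-season" and "across all seasons" probabilities.
# Assumes independent seasons that share the same chance q of producing
# a streak above the threshold (an illustrative simplification).

def per_season_probability(p_any: float, n_seasons: int) -> float:
    """Solve 1 - (1 - q)**n_seasons = p_any for q."""
    return 1.0 - (1.0 - p_any) ** (1.0 / n_seasons)

def any_season_probability(q: float, n_seasons: int) -> float:
    """Chance that at least one of n_seasons independent seasons succeeds."""
    return 1.0 - (1.0 - q) ** n_seasons

q = per_season_probability(0.5, 134)
print(f"implied per-season probability: {q:.2%}")
print(f"empirical rate, 1 in 134:       {1 / 134:.2%}")
```

Under these assumptions the implied per-season figure is about half a percent, the same order of magnitude as the empirical 1 in 134, so the 50% and the 0.75% are not in conflict once the unit of simulation is understood.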

Also, readers want each hitter's probability to be modeled rather than using the historical average; in other words, factor in opposing pitcher, home/away, etc.

I'll throw in another... there must have been an assumption of independence from one game to the next.  One would think the pressure would be so much higher on the hitter once he gets to 45, 50, 53 etc. games and it would be inappropriate to assume the hitting probability would remain the same.

Along those lines, why should the hitting probability be treated as fixed, rather than modeled as a probability distribution, which would account for variance as one of the readers suggested?
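To make these objections concrete, here is a minimal simulation sketch for a single hypothetical hitter.  It is not the authors' model: the per-game hit probability (0.75), the season length (150 games), the pressure threshold (45 games) and the Beta(30, 10) distribution are all numbers I made up for illustration.  Note that the varying probability is drawn once per season; redrawing it independently every game would average out and behave just like the fixed case.

```python
import random
import statistics

def longest_streak(n_games, hit_prob, rng):
    """Longest run of consecutive games with a hit.

    hit_prob(streak) returns the chance of a hit in the next game,
    given the length of the current streak.
    """
    best = current = 0
    for _ in range(n_games):
        if rng.random() < hit_prob(current):
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best

def simulate(n_seasons, n_games, make_hit_prob, seed=0):
    """Simulate many seasons; make_hit_prob(rng) builds each season's model."""
    rng = random.Random(seed)
    return [longest_streak(n_games, make_hit_prob(rng), rng)
            for _ in range(n_seasons)]

# All numbers below are hypothetical.
BASE_P = 0.75    # per-game chance of getting a hit
N_GAMES = 150    # games played in a season

def fixed(rng):
    # Same probability every game, regardless of the streak.
    return lambda streak: BASE_P

def pressure(rng):
    # Hitting gets harder once the streak reaches 45 games.
    return lambda streak: BASE_P if streak < 45 else BASE_P - 0.10

def varying(rng):
    # One probability per season, drawn from Beta(30, 10) (mean 0.75).
    p = rng.betavariate(30, 10)
    return lambda streak: p

for name, model in [("fixed", fixed), ("pressure", pressure), ("varying", varying)]:
    streaks = simulate(2000, N_GAMES, model)
    print(name, "median:", statistics.median(streaks), "max:", max(streaks))
```

Comparing the three outputs gives a feel for how much the tail of the longest-streak distribution depends on the independence and fixed-probability assumptions.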

For more discussion, see this Wall Street Journal discussion.
 

Mar 01, 2008

Don't believe what you see

Mankiw's blog linked to a press release by Congressman Jim Saxton, using CBO data to show "middle income tax burden at lowest level in decades".

[Image: Cbo_taxrate]

The attached graph, as Junk Charts readers will immediately recognize, is classic chartjunk.  Every time the vertical axis does not start at zero, one suspects something is amiss.  And what's with the gridlines and data labels?

"Don't believe it? Check out the data source yourself."  I followed Mankiw's suggestion and was indeed surprised... but not by the great fortune of the "middle class".  The surprise was how the chart painted a dishonest picture of the CBO data.

The original chart plotted only the tax rate experienced by the middle 20% of the population.  The CBO provided data for all five quintiles; why not plot them all?

[Image: Redo_taxrate1]

In this new chart, the "surprise" windfall to the middle 20% proved not to be anything special at all!  All five quintiles, especially the middle three, followed pretty much the same trend over time.  The effect of singling out the middle 20% is to strip away the context by which the data should be interpreted.

Further, what might be the result of the declining middle income tax burden?

[Image: Redo_taxrate3]

The CBO data painted an unexpected picture.  Paradoxically, as the middle 20% see their tax rate decrease, they also earn a smaller share of the nation's after-tax income (black line).  At the same time, the top 1% saw their share of after-tax income double from about 8% to almost 16% (blue line).  The top 20% line is also upward-sloping although less pronounced.  So, the implication that the middle class have had it good is plainly wrong.

What is going on?  Two factors were at play and the Congressman presented only one side of the story (the tax rate).  What he omitted was that during this period, the nation's wealthy took home larger and larger shares of the pre-tax income.  This shift in pre-tax income more than offset any relative reduction in tax rate for the middle 20%.

This distortion can be traced back to the use of quintiles (or more generally, ranks).  We use them to cope with data having extreme distributions but a by-product is losing information about how extreme the extreme values are.  As demonstrated here, the quintiles of old are really different from the quintiles of today because the underlying distribution has become much more extreme.
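A toy sketch of how ranks hide the extremes.  The two ten-person "economies" below are invented for illustration and have nothing to do with the CBO figures; the only difference between them is the top earner's income.

```python
def quintile_shares(incomes):
    """Share of total income held by each fifth of the population, poorest first."""
    ordered = sorted(incomes)
    n = len(ordered)
    total = sum(ordered)
    return [sum(ordered[i * n // 5:(i + 1) * n // 5]) / total for i in range(5)]

# Hypothetical incomes; the middle quintile earns exactly the same in both.
then = [10, 15, 20, 25, 30, 35, 40, 50, 60, 100]
now = [10, 15, 20, 25, 30, 35, 40, 50, 60, 400]

for label, incomes in [("then", then), ("now", now)]:
    shares = quintile_shares(incomes)
    print(label, [f"{s:.1%}" for s in shares])
```

The middle fifth holds the same rank and the same incomes in both cases, yet its share of the total drops sharply once the top pulls away, which is exactly the information that a quintile label alone does not carry.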

Finally, another bit of mystery (to me) is how the middle 20% came to be considered "middle class".  Is there a widely accepted definition?

Reference: "CBO Data Show Middle Income Tax Burden At Lowest Level in Decades", Feb 21 2008.

Feb 12, 2008

What is data?

I did a guest post over at Nathan's Flowing Data site.  See here.

Jan 27, 2008

The buzz

What other statistics and/or graphics blogs are chattering about:

1. Andrew Gelman asks: Does jittering suck?

2. Gelman reports progress on some (really cool) simple statistical methods that solve practical problems, such as scaling regression coefficients and using priors to deal with separable data in logistic regression; he also tells us which commonly used methods he does not like

3. Rindskopf's rules for statistical consulting (delivered at a mini-symposium organized by Gelman)

4. Information Aesthetics looks at travel-time maps (which we also discussed here)

5. EagerEyes considers the makeup of past U.S. presidents (see also older post here), and argues for "expressive visualization", or charts that opine





Dec 25, 2007

Doctoring charts

Reader Chris P. alerted us to a fascinating post from Errol Morris' blog, which presents results in graphical form from a readers' poll related to this other post.  This other post deals with a pair of photographs taken during wartime, previously discussed by Susan Sontag and others.  Sontag believed the pair documented a before-and-after setting: it was alleged that the photojournalist shifted some cannon balls from their natural position between takes. 

Morris polled his readers asking them in which order they thought the photos were taken ("on before off", "off before on", "undecided"), and which factors were used to make the decision.  He presented results in two formats, first plotting frequencies in bar charts and then plotting proportions in pie charts.  He preferred the pie chart construct.

[Image: Nyt_sontag]

Most here would share Chris' reaction: "Oh my.  What people do with Excel."

The biggest problem with these pie charts is the unreasonable baseline.  This is one of those polls that allow respondents to pick any number of factors and clearly, the pie chart creator used the 1,151 responses as the baseline, as opposed to the 910 people who voted.  Consider these two statements:

  • 52% of respondents who decided "on before off" listed "sun shadow" as a decision factor
  • 30% of the decision factors submitted by respondents who decided "on before off" were "sun shadow"

It is tough to figure out what the second statement means.  It is as if a respondent who selects more than one factor gets more than one vote in the final tally.  To put it differently, the 30% is meaningless unless one also knows how many decision factors were selected by each respondent, on average and in distribution.  The 52% is independent of such considerations.
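The difference between the two baselines is easy to see with a toy multi-select poll.  The five respondents and their answers below are invented; they are not Morris's data.

```python
from collections import Counter

# Hypothetical multi-select answers: each respondent may list any number of factors.
responses = [
    {"sun shadow", "cannonballs"},
    {"sun shadow"},
    {"cannonballs", "rocks", "shelling"},
    {"sun shadow", "rocks"},
    set(),  # a respondent who listed no factor at all
]

def share_of_respondents(responses, factor):
    """Fraction of people who listed the factor (baseline: respondents)."""
    return sum(factor in r for r in responses) / len(responses)

def share_of_selections(responses, factor):
    """Fraction of all submitted factors that were this factor (baseline: selections)."""
    counts = Counter(f for r in responses for f in r)
    return counts[factor] / sum(counts.values())

print(f"{share_of_respondents(responses, 'sun shadow'):.0%} of respondents")
print(f"{share_of_selections(responses, 'sun shadow'):.1%} of selections")
```

The second number moves whenever other respondents list more or fewer factors, even if nobody changes their mind about sun shadow; the first does not.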

Combining the data given in the bar charts and pie charts, one discovers that 469 out of 910 respondents could not decide which photo was taken before the other; besides, these respondents on average expressed 0.9 opinions on the decision factors whereas the respondents who made a decision expressed 1.6 opinions.


A simple illustration to show the key decision variables by type of respondents is shown below.

[Image: Redo_sontag_2]

From this chart, one sees that the number and position of the cannon balls were crucial to at least 50% of those who came to a conclusion.  Sun shadow was much more important to those who decided "on before off" while those who decided "off before on" noticed character artistic, shelling and rocks.  Most other factors did not differentiate the three groups.

Source: "Not Your Mum's Apple Pie Chart", Errol Morris, Dec 18, 2007.


 

Dec 09, 2007

Lacking buzz

Nielsen, they of the ratings, is roughing it in the information age.  When they announced on-line tracking tools, Wired quipped: "It's looking like online video policing companies will have to make room for another deputy."  Last year, cable companies revolted over a service measuring the effectiveness of commercials.

Via the Data Mining blog, I learnt about yet another new on-line offering, called "Hey! Nielsen" for obscure reasons.  (Perhaps Hey! Nielsen is the new Yahoo! !)

The site is an enigma wrapped in a mystery.  The official description says:

Hey! Nielsen is the place to make a name for yourself while trading opinions on TV, movies, music, personalities, web sites and more.

How does one "trade" opinions?

According to the FAQ, the "Hey! Nielsen" score, the cornerstone of the site, is:

a real-time indicator of a topic's impact and value and you play a major role. As the site evolves and users submit their opinions and commentary, the score will rise or fall based on a number of factors including, but not limited to, user opinions, news coverage, and raw data from our sister sites Billboard.com, HollywoodReporter.com, and BlogPulse.com.

Sounds like a product aimed at marketers to help them track public opinion but offering little control over sampling. 

The "Hey! Nielsen" buzz chart (below) captures the change in "Hey! Nielsen" score over time.

[Image: Heynielsen]

This chart is an unfortunate case of flipping background into foreground.  What grabs our attention are those hideous white circles with numbers in them.  The legend explains that these are the daily numbers of opinions on the subject, in other words, the daily sample sizes.  As they stand now (with the site still in beta), they serve to expose the low level of participation, leading to small sample sizes, and irrelevance.  But what if the site becomes super-popular: would the circles say 56234, 19245, 90257, etc.?  Why would visitors care about daily sample sizes anyway?  Mousing over these circles reveals text, but in most cases it is blocked by neighboring white circles.

In the meantime, the circles obscure the line which shows the trend in the "Hey! Nielsen" score over time.  This chart reminds me of that Google toy known as Google Trends.  The Googlers provide no vertical scale so the graphs are unreadable.  "Hey! Nielsen"ers provide a vertical scale -- kind of -- but the graphs are still meaningless: what does a score of 881 mean?  how about 724?  what is the maximum score?  what is the minimum?  Beware numbers without context.

The vertical axis does start from zero but has an odd spacing of tick labels. The gridlines are distracting and serve no purpose.  The orange area under the curve also makes little sense.

We look forward to seeing version 2.0.

 

Dec 05, 2007

Lost in translation

Since English is my second language, I have always been intrigued by automatic translation.  My "Turing" test for translation engines is to feed the translated output back into the same engine in the opposite direction.

Case in point: the first sentence of this post is translated by Babelfish into Italian -

Poiché l'inglese è la mia seconda lingua, sono stato incuriosito sempre tramite la traduzione automatica.

Now, Babelfish translates the above Italian text into English, as:

Since English is my second language, has been made curious always through the automatic translation.

Not that bad, really.


The tag line of this blog is "recycling chartjunk into junk art".  What happens in the other direction?  The answer is on this page!

This entry is inspired by Michael M.


Nov 30, 2007

Digging deeper

Two items from other places caught my eye this week as they directly relate to some things we discussed on this blog.

First, I second Andrew's suggestion of a recent NYT article for teaching the concept of margin of error, or how to read political poll coverage intelligently.  Towards the end of this piece is a small gem:

Some pundits began by saying the horse race numbers were close but then tried to marshal evidence that they were not. On ABC's own Web site, Chris Cillizza, wrote: "Among women in the Post poll, Obama actually leads Clinton 32 percent to 31 percent among women. Voters 45 years of age or older are similarly divided, choosing Clinton by a 27 percent to 26 percent margin over Obama. Ditto for those who earn $50,000 or less a year; 29 percent for Clinton, 29 percent for Obama."

Mr. Cillizza failed to mention that if the margin of sampling error is plus or minus five percentage points for all of the likely Democratic caucus goers, then it is even higher for subgroups like women.

In a recent post, I called this the "oft-used device of subgroup support of a hypothesis".  This example illustrates the fallacy more clearly.  It's the "let's dig deeper since we haven't found the gold yet" phenomenon.  Such analysis suffers from two serious statistical problems.  The article deals with the sample size problem: the margin of error at the subgroup level is by definition larger; what this means is the bar for statistical significance has been raised; and rare is the case where such analysis could lead to any further insights.  (Of course, I am assuming the original poll was not designed to be analyzed at the subgroup level.)
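To put rough numbers on the sample size problem, here is a sketch using the textbook margin-of-error formula for a proportion.  It assumes simple random sampling, a 95% confidence level and the worst case p = 0.5; the subgroup fractions are hypothetical, and a real poll's design would change the details.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the confidence interval for a proportion from n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

def sample_size_for(moe, p=0.5, z=1.96):
    """Sample size whose margin of error equals moe (not rounded)."""
    return (z / moe) ** 2 * p * (1 - p)

# Full sample with a margin of plus or minus five points, as in the quoted article.
n_full = sample_size_for(0.05)

# Hypothetical subgroups making up all, half, and a quarter of the sample.
for fraction in (1.0, 0.5, 0.25):
    moe = margin_of_error(n_full * fraction)
    print(f"subgroup = {fraction:.0%} of sample: +/- {moe:.1%}")
```

Under these assumptions a subgroup that is half the sample carries a margin of about seven points, and a quarter of the sample about ten, so a one-point "lead" within a subgroup says very little.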

The other issue -- more difficult to explain and omitted in the article -- is the multiple hypothesis problem.  It is well known that if we dig around long enough, we may get so dizzy that anything that glitters will look like gold.  In other words, false positives.  Like the sample size problem, the remedy is to raise the bar for statistical significance even higher.  In practice, this frequently wipes out the rationale for such analysis.
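A small sketch of why digging through many subgroups produces fool's gold.  It assumes the comparisons are independent and each is tested at the 5% level; the Bonferroni correction shown is one standard way of raising the bar, not necessarily the one any given pollster would use.

```python
def chance_of_false_positive(k, alpha=0.05):
    """Chance that at least one of k independent true-null tests looks significant."""
    return 1 - (1 - alpha) ** k

def bonferroni_alpha(k, alpha=0.05):
    """Per-test significance level that keeps the overall error rate at or below alpha."""
    return alpha / k

for k in (1, 5, 10, 20):
    print(f"{k:2d} comparisons: "
          f"P(at least one false positive) = {chance_of_false_positive(k):.1%}, "
          f"Bonferroni per-test level = {bonferroni_alpha(k):.4f}")
```

With ten independent subgroup comparisons, the chance of at least one spurious "finding" is roughly 40%, and the corrected per-test level drops to 0.005, a bar that small subgroups will rarely clear.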

I will address the other interesting item in a new post.

Nov 21, 2007

Social networking

I've been meaning to write about our "larger blog family", or social network, for a while.  It's taken time since this requires a bit of digging around.  These sites share significant traffic with us, which means readers like you also like these sites:

Statistical Modeling, Causal Inference, and Social Science
aka, the Gelman blog.  Active community of statisticians, with regular commentary on visualization and graphics, and occasional diversions into unexpected topics (spam! art!).  Required reading.

Information Aesthetics
Lovely blog focusing on graphics that are pleasing to the eye.  They like entertaining; we like informing.  Nice complement or counterpoint to our point of view.

Jason Kottke
"kottke.org is a weblog about the liberal arts 2.0 edited by Jason Kottke".  Eclectic. 

Process Trends
D Kelly O'Day's collection of chart tips and links to on-line resources.

Juice Analytics
Website/blog run by a data analysis consulting firm.  Very active posting.  Good on tools.  They even awarded us the Juicy Award for "charts and graphs": a belated thank you!

Edward Tufte
The meeting point of Tufte fans.  Tufte himself also joins in the forums.  For the very serious.

Mahalanobis
Before this blog shut down due to employer interference, two hedge fund guys shared their wit here.

EagerEyes
Robert Kosara's blog on all things visual.  We featured his scribble maps here.

Statistical Graphics and Data Visualization
This blog suffered twice.  First, a certain StatGraphics company forced them to abandon their original URL.  Then, the blog activity withered.

Malaprensa
Josu's Spanish blog ("Bad Press") picking out factual errors in the Spanish press.  (Thanks to Josu and Jorge for correcting my misattribution.)

The next batch includes:

Social Science Statistics
Data Mining: Text Mining, Visualization and Social Media
Science Magazine
Design*Notes
L'economie sans tabou
R Project Wiki

The problem is it's hard to keep up because a lot of other sites are showing up in the recent history.  But then most of you come directly to the site, or through Google, or through an RSS reader, or del.icio.us and so on.
