May 05, 2008

Turning the table

Nyt_runningbacks We recently showed an example of when data tables worked well to clarify the data.  Last week, there was an example from the Times which did the opposite.

The accompanying article boldly claimed that

the 40-yard dash stands above them all as having the strongest correlation to success in the NFL.  The three-cone drill, the shuttle run, the bench press -- none correlate to NFL success.  The 40 is king.

Further, it cited Bill Barnwell from FootballOutsiders.com who created an "index" using both 40 time and body weight that is "an even better predictor than 40 time alone".  In other words, this formula Nyt_runningback_eqt

does the trick.

The data table, shown above, presumably clinched the case.

Redo_runningback1 We were mystified when we put the data to the test, however.  Among the set of 15 running backs, the Index did not predict the Yards Per Carry at all!  The Index explained only 8% of the variation in Yards Per Carry between the backs.
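For readers who want to run the same kind of check, here is a minimal sketch of how a "percent of variation explained" figure can be computed for a single predictor.  It assumes a straight-line fit, in which case the figure is the squared Pearson correlation.  The function name and the toy numbers are illustrative only; they are not the values from the Times table.

```python
def r_squared(x, y):
    """Share of the variance in y explained by a straight-line fit on x.

    For a single predictor this equals the squared Pearson correlation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Toy data (hypothetical, NOT the running-back figures).
perfect_line = r_squared([1, 2, 3], [2, 4, 6])
no_trend = r_squared([1, 2, 3, 4], [1, -1, -1, 1])
```

A value near 1 means the predictor tracks the outcome closely; a value near 0, like the 8% found here, means it explains very little.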

The data table obscures this bivariate relationship.  As it was sorted by the Index, we would look for the column showing Yards Per Carry to be naturally sorted in the same order.  But it is hard to tell the trend from the noise in a table.

What went wrong?  It turned out neither 40 Time nor Body Weight had any relationship with Yards Per Carry.

Redo_runningback2

These variables did not explain the range of Yards Per Carry attained by this set of running backs.

Redo_runningback3 Finally, we found a strong correlation between 40 Time and Body Weight.  (The heavier you are, the slower you run!)  This meant that both variables contained similar information, so any formula combining the two would be unlikely to perform significantly better than either variable alone.

So we are left to turn the table on the table.  More pertinent evidence is needed to prove the case.

The entire analysis suffers from survivorship bias as only the top running backs are examined, and no adjustment is made to deal with wide-ranging tenures.  Apparently, there is more data available in a book.  There is no indication of how the model shown above was validated.

Reference: "The Race of Truth: 40-Yard Times Can Tell the Future", New York Times, April 27, 2008.

 

Aug 22, 2007

The Tufte count

One of the things I picked up from Tufte is the horrible habit of counting the amount of data on a chart.  This is part of the info gathering to estimate the data-ink ratio (amount of data divided by the amount of ink used to depict them).

Leon B, a reader, left this in my inbox, months ago it turned out.  This is the British government's way of informing people how energy-efficient their homes are.  As Leon said:

these charts might be a great example of governments going overboard with colours, bars, letters and numbers and lines for something that really only has four data points.



Ukhomeenergy

In addition, I find the use of two different scales to be confusing and unnecessary.  If it is decided that scores in a particular range can be grouped as A, B, ..., G, then the original scale should be discarded.  52 is E and 70 is C.  (This is especially so since the score ranges are not intuitive, like 69-80 = C ?!)

Even worse, what's the point of citing the 0-100 scale without explaining what the metric is?

A table presentation does a far better job in a fraction of the space:

Redoukenergy_2










Source: Home Information Pack, UK Government.  Graph from Wikipedia.


 

PS. This post set off a torrent of emotions (see the comments).  Another version that I discarded was the simplest table possible.  In my view, there is still way too much distracting "junk" in the original design.  No one has yet explained why the 0-100 scale should be emphasized, or what it means!

Redo2ukenergy

Aug 08, 2007

On the bubble

Nyt_candmins A couple of you noticed this table of bubbles in the Times, and asked what I think of it.  Dustin J suggested that this could be considered a decent application of bubble charts.  I agree, with some reservations.

The data set is the best thing about this chart.  The riches that lie beneath!  Many questions can be addressed, including:

  • Which Presidential candidates are getting the most face time?
  • Are candidates seen equally often across the stations?
  • Are there differences between network and cable stations in terms of total face time?  In terms of individual face time?
  • Are there Democratic/Republican leanings by station?  by type of station?

The intrepid can even build a regression out of it.

The bubble chart contains answers to all those questions but nothing jumps out. Okay, it's easy to see the station that gives each candidate the most face time.  Anything else requires moderate to a lot of effort.  Here's the junkart version.


Redocandmins_2 The list of things done to the data is long:

  • Candidates are grouped together by party
  • Candidates within each party are arranged in order of decreasing maximum face time
  • Stations are arranged by increasing total face time; this order happens to retain the network vs cable divide
  • A heat map construct is used instead of bubbles: the legend is missing but there are four shades for each color: darkest = top 10%; medium = 50th to 90th percentile; light = below the 50th percentile, excluding zeroes; white = no face time.  In raw numbers, 90th percentile = 81 minutes, 50th percentile = 19 minutes.
  • The only data shown are the totals by candidate and totals by station.
  • On the right margin are little bar charts that show the distribution of network/cable for each candidate.
  • On the bottom margin are little column charts showing the distribution of party affiliation by station.
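The shading rule in the heat map can be sketched as a small function.  The two cut-offs (81 and 19 minutes) are the percentile values quoted above; the function name, and the choice to treat a value exactly on a cut-off as belonging to the darker shade, are my assumptions.

```python
# Cut-offs quoted in the post.
P90_MINUTES = 81  # 90th percentile of face time
P50_MINUTES = 19  # 50th percentile of face time


def shade(minutes):
    """Return the heat-map shade for one candidate-station cell.

    Boundary handling (>=) is an assumption; the post does not say
    which shade a value exactly on a cut-off receives.
    """
    if minutes == 0:
        return "white"      # no face time
    if minutes >= P90_MINUTES:
        return "darkest"    # top 10%
    if minutes >= P50_MINUTES:
        return "medium"     # between the 50th and 90th percentiles
    return "light"          # below the 50th percentile, non-zero


examples = {m: shade(m) for m in (0, 5, 30, 100)}
```

Collapsing the raw minutes into four bins is what lets the pattern jump out; the exact numbers survive only in the row and column totals.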

A few observations follow:

  • Cable stations gave much more face time to the candidates in general.  Fox, no surprise, gives Republicans 85% of its time while all the others were roughly equal.
  • The more mainstream the candidate, the more balanced was the time spent on networks versus cable.  John McCain (R), Hillary Clinton (D) and John Edwards (D) had the highest proportion of network time.
  • More time is not necessarily good since McCain was the clear winner but his campaign is struggling.

Source: "Tracking Face Time", New York Times, August 1, 2007.

May 22, 2007

Visualizing web statistics

Tim inquired about:

how to create an elegant graph for Web visitor traffic statistics that shows both how many views a page gets and then how many people click that page to go further ("conversion rate"). Part of the problem is that conversion rates vary from, say, .3% to 50% (a wide range).

Let's work with this sample data set.  Web1 I ordered it from highest to lowest click rate, which is the primary metric of interest.  The number of page views is of interest too as sometimes rarely-visited pages may have high click rates.

At this point, it's important to know the context.  Specifically, who controls the allocation of pages? Did the data come from a randomized experiment? Or did they get a self-selected sample (e.g. web surfers deciding which section of the site to visit)?

Web_lift The first construct I tried is the "lift curve" often used in marketing.  It's the same thing as the Lorenz curve used by demographers but interpreted differently.  Here, we see that Guitar pages accounted for 26% of the page views but 37% of the clicks; House pages accounted for an incremental 44% of the pages and 59% of the clicks; etc.  The relative click rates are immediately clear from the steepness of the line segments.  The lift curve is appropriate for the self-selected case, in which we can take the allocation of page views as fixed.
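For anyone wanting to build the lift curve from raw counts, here is a minimal sketch.  The page names come from the sample data, but the counts are hypothetical (the actual sample sits in an image), and the function name is mine.  It assumes every page has at least one view.

```python
def lift_curve(pages):
    """pages: list of (name, page_views, clicks) tuples.

    Returns one (name, cumulative share of views, cumulative share of
    clicks) point per page, with pages ordered from highest to lowest
    click rate.
    """
    ordered = sorted(pages, key=lambda p: p[2] / p[1], reverse=True)
    total_views = sum(p[1] for p in ordered)
    total_clicks = sum(p[2] for p in ordered)
    points = []
    cum_views = 0
    cum_clicks = 0
    for name, views, clicks in ordered:
        cum_views += views
        cum_clicks += clicks
        points.append((name, cum_views / total_views, cum_clicks / total_clicks))
    return points


# Hypothetical counts, for illustration only.
sample = [("Bicycles", 1000, 10), ("Guitar", 1000, 50), ("House", 2000, 60)]
curve = lift_curve(sample)
```

Plotting the points in order, starting from the origin, gives the curve; each segment's steepness reflects that page's click rate relative to the others.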

Web_scatter If the allocation of page views is a decision to be made, then it doesn't make much sense to accumulate page views.  The second construct is the "scatter plot" of % clicks versus % page views.  The steepness of the line through the origin helps us compare the click rates.  Bicycles is clearly inferior in generating clicks.

Both these constructs are highly efficient; adding new data does not expand the chart at all.

Keen readers will observe that the slope of the line is not the click rate but rather a click rate index (relative to the overall click rate).  This means that any data point above the diagonal has above-average click rate.
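A short sketch shows why the slope is an index rather than a rate: the ratio of a page's share of clicks to its share of views reduces to the page's click rate divided by the overall click rate.  The counts are hypothetical and the function name is my own.

```python
def click_rate_index(views, clicks, total_views, total_clicks):
    """Page click rate divided by the overall click rate.

    Algebraically the same as (clicks / total_clicks) / (views / total_views),
    which is the slope of the line from the origin in the scatter plot.
    """
    return (clicks / views) / (total_clicks / total_views)


# Hypothetical counts, for illustration only: (page views, clicks).
pages = {"Guitar": (1000, 50), "House": (2000, 60), "Bicycles": (1000, 10)}
total_views = sum(v for v, _ in pages.values())
total_clicks = sum(c for _, c in pages.values())

index = {
    name: click_rate_index(v, c, total_views, total_clicks)
    for name, (v, c) in pages.items()
}
```

In this toy set the House pages sit exactly on the diagonal (index of 1), Guitar sits above it and Bicycles below.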

Apr 20, 2007

Embedding logic

Bernard L. (from France) submitted this bubble chart for consideration.  It accompanied an NYT article claiming the absence of evidence of election fraud.  (Of course, as is well-known, absence of evidence is not the same as evidence of absence.  Here, I'm purely interested in data presentation.)

As a seasoned consultant, Bernard asked if a Marimekko chart would be superior.

Nyt_convictions_2 This is one ambitious chart.  Ignoring the bubbles (which are more nuisance than anything), we are asked to interpret data at three different levels of aggregation in one go.

First, there were 95 cases classified into five indictment types.  Second, these cases resulted in either convictions or acquittals/dismissals.  Third, among the cases ending in convictions (the highlighted area), we were shown the occupations of those convicted.

By flattening three levels into one table, some key information is obscured.  For example, how many cases resulted in conviction?  The reader has to compute either 95-25 or 26+31+10+3.  What percent of civil rights violation convictions were committed by party/campaign workers?  It's not 2/3 = 67% (bottom row) but rather 2/2 = 100%.

The following junkart brings out the logic that is embedded in the complicated bubble-table.  While there is a lot on the page, the text labels plus the flow directions allow readers to absorb the data one level at a time.

Redo_convictions2

I have not attempted the Marimekko as I am not a fan of such charts.  You're welcome to try.

Source: "In 5-Year Effort, Scant Evidence of Voter Fraud", New York Times, April 2007.

PS. I will be working through the backlog of reader submissions.  Thanks for your patience.  Keep them coming!

 

Remark (Apr 25 2007): Thanks to readers for keeping me honest (see comments below).  The conviction rates shown previously were indeed the inverse.  I have now fixed them.

Jan 16, 2007

Subjectivity

Irwebfeature_1 When I look at charts like this one, I ponder: Should graph designers adopt "objectivity" as practiced by American journalists?

Is it even possible to make "objective" charts?  Every design choice we make seems to chip away at some of the detachment.  In this chart, the choice to order important web-site features by shopper -- rather than merchant -- ratings is a tacit preference for those ratings.  Bringing out key messages in the data is a subjective act, isn't it?

Are "objective" charts useful?  In our example, the design choices are kept to a minimum, and so it seems is its usefulness.  In comparing shopper and merchant ratings, one would be most interested in identifying the most effective web-site features as well as those features offered by merchants that find little resonance with shoppers-users.  These questions are better addressed by directly plotting the average rank and the ranking gap between merchants and shoppers (see below).

Notice that I said "ranking" rather than "rating".  The footnote discloses that the ratings were obtained from two different surveys conducted by two different companies at two different times.  How should we interpret the difference of 13 percentage points between the 89% of shoppers rating "Free Shipping" "very to extremely helpful" and the 76% of merchants rating "Free Shipping" "somewhat to very valuable"?
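Converting each survey's ratings to ranks sidesteps the mismatch in question wording.  Below is a minimal sketch of the average rank and ranking gap.  Only the two "Free Shipping" percentages come from the article; every other number is hypothetical, as are the function names, the tie-breaking rule, and the sign convention for the gap (merchant rank minus shopper rank).

```python
def ranks(ratings):
    """Rank features within one survey: 1 = highest rating.

    Ties are broken alphabetically by feature name (an assumption).
    """
    ordered = sorted(ratings, key=lambda f: (-ratings[f], f))
    return {feature: i + 1 for i, feature in enumerate(ordered)}


def rank_summary(shopper_ratings, merchant_ratings):
    """Average rank and ranking gap per feature.

    gap = merchant rank - shopper rank, so a positive gap means
    shoppers value the feature more than merchants do.
    """
    s = ranks(shopper_ratings)
    m = ranks(merchant_ratings)
    return {
        f: {"avg_rank": (s[f] + m[f]) / 2, "gap": m[f] - s[f]}
        for f in s
    }


# "Free Shipping" values are from the article; the rest are made up.
shoppers = {"Free Shipping": 89, "Store Locator": 60, "Top Sellers": 30}
merchants = {"Free Shipping": 76, "Store Locator": 20, "Top Sellers": 70}
summary = rank_summary(shoppers, merchants)
```

With these toy numbers, "Free Shipping" has the best average rank and no gap, while the other two features show gaps of opposite sign.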

Redowebfeature In the junkart chart, we can focus on three groups of features:

  • the three top features ("Promo Discounts", "Free Shipping" and "Keyword Search") which attained the best average rank and least ranking gap;
  • the three "orphan" features ("Recommended Products", "Top Sellers", "Gift Selection") created by loving web-site producers, abandoned by independent-minded shoppers;
  • the three "neglected stepchildren" ("Shop the Catalog", "Store Locator", "Product Comparison") whose importance to shoppers was vastly underestimated by the merchants.

Unfortunately, while being "objective",  the data table fails to point out anything of interest to the reader.

Reference: "Consumers want one thing -- merchants are delivering another", Internet Retailer, Jan 2007.

Jan 08, 2007

Table pitfall

Happy New Year to you all!  I'm now back from holiday.

Datatable At work today, I came across this data table (shown right is a small extract from the very large data table, with labels changed to protect the innocent).  I was scanning through the numbers, looking for differences between type A and type B samples.

If your eyes work like mine, you may pick out the "West" region comparison, mainly because of the jump in the leading digit.  But then I circled back, because the right side of my brain wanted both columns to add up to 100% (less rounding) and something had to compensate for the 15-21 jump.  After a moment's search, after finding the 35-30 flip in the "South" region, I let out a sigh of relief.

Even though the above differences were about the same (5 or 6 percentage points), my eyes caught the change in leading digits and stuck to it.  This problem is especially acute when scanning quickly through reams of data tables.
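One defence is to let the computer do the subtraction before the eyes get involved.  A minimal sketch using the two pairs of numbers mentioned above; the function name and the layout of the data are my own.

```python
def compare(a, b):
    """Signed difference between two percentages, plus a flag for
    whether the leading digit changes (the thing the eye latches onto)."""
    return {
        "diff": b - a,
        "leading_digit_changed": str(a)[0] != str(b)[0],
    }


# (type A, type B) percentages for the two regions discussed above.
rows = {"West": (15, 21), "South": (35, 30)}
report = {region: compare(a, b) for region, (a, b) in rows.items()}
```

The two differences are nearly the same size, but only the West pair changes its leading digit, which is why it is the one that gets noticed.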


So data analysts beware!  This includes those who scan financial statements, financial data, computer-generated logs, statistical software output (e.g. SAS), market research data, etc. for a living.  We are easily fooled.

Not convinced?  Let your eyes decide which difference is larger:

Datatable2_6    
 

Nov 04, 2006

Finding dots

Erik W. alerted me to this CNN map that shows FBI statistics about safety of American cities.  As Erik pointed out, this is prototypical of chartjunk a la Tufte.  A lot of ink is used to depict 12 points of data (top 3 cities in safety, crime, improvement and decline).

Cnn_safest Imagine the reader trying to find the 3rd most improved city.  She either has to find all the blue dots and then figure out which is #3; or she needs to find all the #3 dots and figure out which is blue.  As they say, it's "hard work".  In fact, finding the dots among the forest of large text is hard work by itself!

How would I re-make this chart?

  • Highlight only the states containing data (California, Michigan, Missouri, Ohio, Georgia, New Jersey, New York); gray out all other states and their boundaries
  • Separate the states from the cities; only write the State name once for each State; reduce the font size
  • Instead of dots, use numbers.  So the most dangerous city (St Louis) gets a red "1", Oakland gets a purple "3", etc.
  • Remove Mexico, Canada and water from the map

The map gives the false impression that crime is relevant only along the coasts and the lakes, when in fact, the map is just saying that most cities in the U.S. are located along the coasts and the lakes.  Using such a map to depict city-level statistics creates distortion because cities are not evenly distributed across America.

Beyond that, what is the point of this map?  Is it merely a geography class telling us where each city is located?  How is it better than a simple table listing the cities in order?   

Reference: "U.S. City Safety Rankings", CNN, 2006.

Sep 29, 2006

Where are the crimes?

Msn_crime The author of this data table and the readers are asking the same question, "Where are the crimes?", but for different reasons.

While the author wanted to convey regional differences in crime growth, as readers, we are not sure which part of the table to look at; every cell is given equal "weight".

Redo_crime Judging from this "profile plot", we can conclude:

  • the Mid-West (blue line) experienced a crime spurt that is very much worse than the national average (dots) in all categories except forcible rapes and murder
  • the West (red line), in general, had crime increases less severe than the national average
  • that said, the regional profiles are relatively similar, showing few meaningful regional differences (compared to other profile plots I've seen)

Reference: "Communities Grapple With Rise in Violence", MSNBC.com
Thanks to Maya for sending in the link.

Sep 17, 2006

Much data, zero info

The number-crunching college football fans at the Wall Street Journal wondered out loud:

One of the biggest developments in college football in recent years was the decision by Virginia Tech and Miami -- perennial top-20 teams -- to leave the Big East conference and join the Atlantic Coast Conference.  How much has that strengthened the ACC?

Wsj_acc The data table on the right was ostensibly the answer.  Readers were drawn to the bolded numbers, the almost identical winning percentages of ACC and SEC (averaged over the last decade, as the text explained).

The question is a classic one of cause and effect: did the addition of two strong teams cause the ACC to become stronger?  Startlingly, the data cited was useless, and the analysis conducted irrelevant.

First, the difference in winning percentages between ACC and SEC is the wrong metric.  Something more pertinent is, for example, the change in winning percentage of ACC before and after the team additions.

Second, the observation period is seriously mistaken.  The ACC expansion occurred in 2004 so average winning percentages from 1995-2005 have zilch to say about its effect.

Third, a Web search uncovers that major realignment occurred again in the ACC in 2005, making it very difficult to isolate the effect of adding Virginia Tech and Miami in 2004.

Thus, the data table contains zero information for addressing the stated problem.  How to measure the effect properly seems to me a tall order, and a good discussion topic.

Besides the iffy statistics, it is also impossible to read this table.  The data in the lower left triangle mirror those in the upper right triangle, containing no new information.  Head-to-head conference comparisons seem to serve no purpose.  Actual win-loss numbers create clutter while adding no insight.  (Theoretically, the larger the number of contests between any two conferences, the more reliable are the winning percentages.  Confidence intervals are a much better way to present such information but even those would be overkill for our purpose.)
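To make the parenthetical concrete, here is a sketch of one common way to put an interval around a winning percentage, the Wilson score interval.  This is my choice of method, not something used in the article, and the win-loss counts are hypothetical.  It treats each game as an independent trial, which is itself a simplification.

```python
from math import sqrt


def wilson_interval(wins, games, z=1.96):
    """Wilson score interval for a winning percentage.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    p = wins / games
    denom = 1 + z * z / games
    center = (p + z * z / (2 * games)) / denom
    half = z * sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return center - half, center + half


# Hypothetical records with the same 50% winning percentage.
few_games = wilson_interval(5, 10)
many_games = wilson_interval(50, 100)
```

The interval for the 10-game record is much wider than the one for the 100-game record, which is the sense in which more contests make the percentage more reliable.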

Reference: "College Football's Power Struggle", Wall Street Journal, Sept 16-17, 2006.
