May 22, 2007

Visualizing web statistics

Tim inquired about:

how to create an elegant graph for Web visitor traffic statistics that shows both how many views a page gets and then how many people click that page to go further ("conversion rate"). Part of the problem is that conversion rates vary from, say, .3% to 50% (a wide range).

Let's work with this sample data set.  Web1 I ordered it from highest to lowest click rate, which is the primary metric of interest.  The number of page views is of interest too, as rarely-visited pages may sometimes have high click rates.

At this point, it's important to know the context.  Specifically, who controls the allocation of pages? Did the data come from a randomized experiment? Or did they get a self-selected sample (e.g. web surfers deciding which section of the site to visit)?

Web_lift The first construct I tried is the "lift curve" often used in marketing.  It's the same thing as the Lorenz curve used by demographers but interpreted differently.  Here, we see that Guitar pages accounted for 26% of the page views but 37% of the clicks; House pages accounted for an incremental 44% of the pages and 59% of the clicks; etc.  The relative click rates are immediately clear from the steepness of the line segments.  The lift curve is appropriate for the self-selected case, in which we can take the allocation of page views as fixed.
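As a rough sketch of how the points on such a lift curve could be computed: the counts below are made up for illustration (they are not the sample data set from this post), and the function name `lift_curve` is my own.

```python
# Hypothetical (page, views, clicks) counts; NOT the data set from the post.
pages = [
    ("Guitar", 2600, 370),
    ("House", 4400, 590),
    ("Bicycles", 3000, 40),
]


def lift_curve(rows):
    """Return cumulative (view share, click share) points, starting at (0, 0).

    Rows are sorted by click rate, highest first, so the steepest
    line segments come first.
    """
    ordered = sorted(rows, key=lambda r: r[2] / r[1], reverse=True)
    total_views = sum(r[1] for r in ordered)
    total_clicks = sum(r[2] for r in ordered)

    points = [(0.0, 0.0)]
    cum_views = 0
    cum_clicks = 0
    for _, views, clicks in ordered:
        cum_views += views
        cum_clicks += clicks
        points.append((cum_views / total_views, cum_clicks / total_clicks))
    return points


curve = lift_curve(pages)
for view_share, click_share in curve:
    print(f"{view_share:.0%} of page views -> {click_share:.0%} of clicks")
```

Plotting these points and joining them with straight segments gives the curve; a segment steeper than the diagonal marks a page with an above-average click rate.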

Web_scatter If the allocation of page views is a decision to be made, then it doesn't make much sense to accumulate page views.  The second construct is the "scatter plot" of % clicks versus % page views.  The steepness of the line through the origin helps us compare the click rates.  Bicycles is clearly inferior in generating clicks.

Both these constructs are highly efficient; adding new data does not expand the chart at all.

Keen readers will observe that the slope of the line is not the click rate but rather a click rate index (relative to the overall click rate).  This means that any data point above the diagonal has an above-average click rate.
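To see why the slope is an index, here is a small sketch with made-up counts (again, not the data from this post; the function name is hypothetical). The slope from the origin is the click share divided by the view share, which works out to the page's click rate divided by the overall click rate.

```python
# Hypothetical (views, clicks) counts per page; NOT the data from the post.
pages = {
    "Guitar": (2600, 370),
    "House": (4400, 590),
    "Bicycles": (3000, 40),
}

total_views = sum(v for v, _ in pages.values())
total_clicks = sum(c for _, c in pages.values())
overall_rate = total_clicks / total_views


def click_rate_index(views, clicks):
    """Slope of the line from the origin to (% page views, % clicks)."""
    view_share = views / total_views
    click_share = clicks / total_clicks
    return click_share / view_share


for name, (views, clicks) in pages.items():
    slope = click_rate_index(views, clicks)
    ratio = (clicks / views) / overall_rate  # click rate relative to overall
    print(f"{name}: slope {slope:.2f}, click rate / overall rate {ratio:.2f}")
```

The two printed columns agree for every page, and an index above 1 corresponds to a point above the diagonal.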

Mar 21, 2007

March mildness

The Times published this great graphic to show that 2007 was an upset-starved year in the recent history of the NCAA Basketball tournament, which is ongoing.

Nyt_mildness Each box contains the number of upsets in a given year of a given pairing, e.g. in 1998, there was one case of a 9-seed beating an 8-seed.  An upset is defined as a lower seed beating a higher seed although the editorial comment argued that 9 beating 8 is "rarely considered an upset".

The rightmost column (which sums across a row) tells us that the number of upsets fluctuates wildly between the years, ranging from 3 to 13.  (That's why people bet on NCAA pools.)

A couple of improvements will make this chart even more effective:

  • Include a row showing the average number of upsets for each pairing;
  • Include a column of zeroes for 16-1 pairings.

This second point cannot be emphasized enough.  The fact that no 1-seed has ever lost to a 16-seed should not be relegated to a footnote.  Think of it this way: if the results for 15-2 and 16-1 were reversed so that no 15-seed had ever beaten a 2-seed but one 1-seed had lost to a 16-seed, nobody would omit the 15-2 column!

In Visual Explanations, Tufte discussed the Challenger disaster at considerable length.  A key lesson was that non-events (things not happening) contain important information, and should never be dropped from an analysis without unassailable logic.

The mildly improved chart would look like this.

Redo_mildness

What then to make of the comment that "9 beating 8 is rarely an upset"?  For one thing, 9-8 upsets happen about as frequently as 10-7 upsets so if the comment refers to the surprise factor, then even 10-7 upsets should be excluded.

But the comment also underlines a deeper issue, which is hindsight.  Obviously, the seeding committee felt, and predicted, that the 8 seed would beat the 9 seed.  It was only after the fact that we found out 9 had beaten 8.  Instead of denying the 9-8 upset, would it make more sense to ask if there was a seeding error?

Reference: "March Mildness", New York Times, March 17, 2007, p.D2.

Mar 17, 2007

Picking up the right file

The Institutional Investor advises its readers:

Going public may just be the most important -- and nerve racking -- decision any company will make.  Managing and pricing an IPO is tricky, so picking the right underwriter is crucial.  Bankers often boast of their league table prowess to win mandates, but quantity does not necessarily mean quality.

By quantity, they meant the amount of underwriting fees (revenues) earned; and by quality, the average stock performance of the newly-public companies, as of Feb 16, 2007.

Ten banks were compared on the two Qs using this chart, which is best described as the "file folder chart".

Iporanks

Amusingly, its creator sized the height of each file according to the quality metric, which is the return % listed at the top right corner of each file.  The files were sorted by decreasing quality.  Since each file is a parallelogram, its area is proportional to quality.

However, the files overlap, preventing us from comparing the areas of the files.  Besides, the point made in the article about the importance of both Qs is lost since this chart stressed quality over quantity.  Quantity showed up as a low dot on the tallest file and a high dot on the shortest file.

Redo_iporanks The junkart version restores the balance.  The blue lines highlighted several banks that scored high on one metric but low on the other.  The construct is a profile chart, with only two variables.

Curious readers may wonder if there were only 10 banks in the IPO underwriting market.  Far from it.  The chart designer introduced a selection bias because banks were included based on Quantity, and only then was Quality rated.  This means there may well be a boutique firm with small revenues but higher quality than any of the 10 in the plot.

Furthermore, much useful information is missing, including the dispersion of returns, the number of deals, etc.

Reference: "Grading the IPO Underwriters", Institutional Investor, March 2007.

Jan 24, 2007

Convenience charting

Statisticians have long railed against "convenience sampling", that is, the practice of selecting samples based on what's easily available, not at random.  Say, by picking your friends.

Wpost_childmortality Dustin J sent in this example of what can only be called "convenience charting".  Dustin said he had no clue what this chart is saying, and I am not surprised. 

The chart plots a statistical object known as the "survival function".  It is likely that "survival analysis" was done, after which the chart creator  picked up the resulting statistical object and dumped it onto this "convenience chart".

If we take the top line on the "child survival" graph, it shows the probability of one child surviving up to a certain age, if the child belonged to a family with 1-3 kids.  The chance is about 92.5% that the child will survive through age 2, and 88% that the child will survive through age 18.  The difference between those percentages is due to the chance that the child may die between ages 2 and 18.

A slight transformation of the data will make this point much clearer.  What is the probability of a child dying by a certain age?  Using the example, a child has a 12% chance of dying by age 18, and a 7.5% chance of dying between ages 0 and 2.
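The transformation is just one minus the survival probability. A minimal sketch, using only the two survival values read off the top line of the original chart (the variable names are my own):

```python
# Survival probabilities for a child in a 1-3 kid family, as read off the
# original chart: P(survive through age 2) and P(survive through age 18).
survival = {2: 0.925, 18: 0.88}

# Probability of dying by a given age is the complement of surviving it.
die_by = {age: 1 - s for age, s in survival.items()}

# Probability of dying between ages 2 and 18 is the drop in survival.
die_between_2_and_18 = survival[2] - survival[18]

for age, p in die_by.items():
    print(f"P(die by age {age}) = {p:.1%}")
print(f"P(die between ages 2 and 18) = {die_between_2_and_18:.1%}")
```

The gap between the two survival percentages, 92.5 minus 88, is the 4.5% chance of dying between ages 2 and 18.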

Redochildmortality The junkart chart depicts this probability.  (I reverse-engineered the data which explains why the distances between the line segments look strange.)

What this chart doesn't address is how we are to interpret the probability of "a child dying" in a family with more than one child.  Is it a random child dying?  At least one child dying?  Exactly one child dying (the other X-1 surviving)? 

The original chart also committed a number of standard errors.  The child survival function represents probabilities, not percentages.  The third category should be 8-11 kids, not 7-11.  If we are picky, then we would also like to see "confidence intervals" because there must have been many fewer families in the 12+ sample than the 1-3 sample.  In the second chart (which I don't have space to discuss), some data labels are missing, which indicates a presumption that all readers have seen the first chart.

Reference:  "Child, Parents Drive Each Other to Early Graves", Washington Post, Jan 14, 2007. 

Sep 19, 2006

Jamming

Econ_muslims Readers may have noticed that I'm not a fan of the graphics aesthetics of the Economist.  (I love their subtle sarcasm, a way of saying something without saying it.  For example, the title of this chart is "where they are".  They let us read any meaning into the word "they".  As for their charts, I have taken issue on several occasions.)

This particular example uses one of their standard formats, stacked bars with an extra data series tagged on the right, its boxed annotation calling attention to itself.  It's a case of too much apparatus for a simple task.

The chart's purpose is to show that the US and France have the largest Muslim populations by numbers while France is by far the top country by percentage.

Redo_muslims Our junkart version is very much cleaner.  Line segments indicating the low, mid and high estimates replaced the stacked bars (which falsely imply significance in adding the low and high estimates).  As usual, the minimum of gridlines and axes is used.  Instead of jamming two ideas onto one chart, if percentages are more important, then a separate chart should be produced, now ordered by decreasing percentages (see below).

The most crucial improvement is the fine print.  Perhaps extending their subtle sarcasm too far, the chart maker omitted context for interpreting the data: namely, that the low-mid-high range represents estimates by up to 5 different sources, each using potentially different methodologies for estimation.  This partially explains the huge variance in estimates for the US (or does it?).

Redo2_muslims Also missing is a comment on why these particular 6 countries were selected.  It may give a misleading picture of "where they are" in the context of world population.

Reference: "Where They Are", Economist, June 2006.

 

Aug 23, 2006

Unscientific poll?

Nyt_evolution_1 This decent chart adequately brings out a point that some will find shocking: the U.S. ranks next to dead last, owing to our unscientific attitude towards evolution.

I have commented on 3-category bar charts before: putting the "not sure" category in the middle allows the reader to compare "Yes"/"No" responses easily.  I prefer lightly-tinted boxes for "not sure" to help gauge its size.

It's a good idea to provide the 50% label at the top.  It is mischievous to use a guiding line, akin to a tick mark, to indicate the "not sure" legend.  This line segment, while entirely redundant, creates confusion as the reader, exhausted by the height of the chart, would be desperately seeking the 50% mark at the bottom.  Without one, it is taxing to figure out what % of Americans actually answered "yes".

The biggest distortion in this chart is the absence of scale, in particular, population scale.  Half of the U.S. population represents many times as many people as half of Cyprus, for example.  The choice of countries in the survey is also heavily biased toward small European countries.  In fact, Japan appears to be the only non-European country depicted aside from the U.S., while, curiously, the "special" partner of the U.S. is missing.

Reference: NYT, approx. Aug 16, 2006.
Here's a previous post on Science, with a link to Darwin's classics.

Jul 31, 2006

Enigma of the big-buck pitcher

A data table accompanied a recent NYT article pointing out that big-buck pitchers were far from sure wins for the clubs that have taken Scott Boras' pitches.  The table contains a wealth of data but very little information is immediately revealed to the reader.

Nyt_bigcontracts


Sorting by size of contract makes no sense, especially since the key metric of success, i.e. change in winning percentage pre- and post-contract, cannot be discerned without pulling out a calculator.  Further, once the contract size is expressed by dollars per season, it is clear that all these contracts fall into the same range (about $10-13 million per year).
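To make the point concrete, here is a sketch of the two derived columns the table leaves to the reader: dollars per season and the change in winning percentage, in points. The pitchers and figures below are entirely hypothetical and do not come from the NYT table; the class and field names are my own.

```python
from dataclasses import dataclass


@dataclass
class Contract:
    pitcher: str
    total_millions: float  # total contract value, in $ millions
    seasons: int
    wins_before: int
    losses_before: int
    wins_after: int
    losses_after: int

    def millions_per_season(self) -> float:
        return self.total_millions / self.seasons

    def change_in_points(self) -> int:
        """Change in winning percentage, in points (thousandths)."""
        before = self.wins_before / (self.wins_before + self.losses_before)
        after = self.wins_after / (self.wins_after + self.losses_after)
        return round(1000 * (after - before))


# Hypothetical rows; NOT the data from the NYT table.
contracts = [
    Contract("Pitcher A", 91.0, 7, 60, 40, 30, 45),
    Contract("Pitcher B", 52.5, 5, 40, 60, 50, 30),
]

# Sort by the key metric of success rather than by contract size.
ranked = sorted(contracts, key=lambda c: c.change_in_points(), reverse=True)
for c in ranked:
    print(f"{c.pitcher}: ${c.millions_per_season():.1f}M per season, "
          f"{c.change_in_points():+d} points")
```

Once both columns are computed, sorting by the change in winning percentage puts the metric of interest front and center.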

Bigcontracts One graphical alternative is shown on the right.  It brings out the desired message, that big-buck pitchers may or may not perform after signing big-buck contracts.  Several pitchers are annotated because they improved or declined by more than 200 points.

A graph cannot hope to achieve the data density of a data table.  But the process of making a graph forces the designer to focus on the most important data, which itself has great benefits.

Reference: "Big-buck pitchers are often big busts", New York Times, July 16, 2006.
