May 23, 2007

Looking for survival

Daniel W of eSnips has started a collection of graphics on visualizing web statistics.  The following graph is an attempt to capture the ability of the web-site to attract returning customers.

The time axis serves double duty here: it indicates which "cohort" the users belong to (in other words, when they signed up), and it is also the month of the returning visits.

A more typical chart used by statisticians is the survival curve.  As shown here, these are the same curves as above but shifted to share the same starting point.  Now, the time axis is interpreted as the number of months after registration: of 100 members who registered in January, how many returned one month later, two months later, etc.?
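For readers who want to try the realignment themselves, here is a minimal sketch in Python.  The cohort figures are invented for illustration (they are not Daniel's data), and the names `visits_by_calendar_month` and `to_survival_curves` are my own.  The only operation is subtracting each cohort's signup month from the calendar month, so that every curve starts at month 0.

```python
# Hypothetical data, invented purely for illustration: for each cohort
# (keyed by signup month, 0 = January), the percentage of members who
# visited in each calendar month.
visits_by_calendar_month = {
    0: {0: 100, 1: 40, 2: 30, 3: 25},
    1: {1: 100, 2: 45, 3: 35},
    2: {2: 100, 3: 50},
}


def to_survival_curves(by_calendar_month):
    """Re-index every cohort by months since registration."""
    curves = {}
    for signup_month, series in by_calendar_month.items():
        # Shift the time axis so that the signup month becomes month 0.
        curves[signup_month] = {
            month - signup_month: pct for month, pct in sorted(series.items())
        }
    return curves


survival_curves = to_survival_curves(visits_by_calendar_month)
for signup_month, curve in survival_curves.items():
    print(signup_month, curve)
```

After the shift, the oldest cohort has the longest curve and the newest has the shortest, which is what the fading color scheme plays on.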

If the purpose is to evaluate the consistency of retaining customers by cohort, then this graphic is less cluttered.  I also used a fading metaphor to color the lines so that the oldest cohort (also, the longest line) is the faintest.  Line labels are best hidden, and revealed interactively when the user mouses over a line of interest.

Not sure if Daniel was plotting real data; in general, we expect a certain amount of criss-crossing.  If the data is real, then his site has seen uninterrupted improvement every month thus far.

Source: The Web Analytics Graph Collection, eSnips.

May 22, 2007

Visualizing web statistics

Tim inquired about:

how to create an elegant graph for Web visitor traffic statistics that shows both how many views a page gets and then how many people click that page to go further ("conversion rate"). Part of the problem is that conversion rates vary from, say, .3% to 50% (a wide range).

Let's work with this sample data set.  I ordered it from highest to lowest click rate, which is the primary metric of interest.  The number of page views is of interest too, as sometimes rarely-visited pages may have high click rates.

At this point, it's important to know the context.  Specifically, who controls the allocation of pages? Did the data come from a randomized experiment? Or did they get a self-selected sample (e.g. web surfers deciding which section of the site to visit)?

The first construct I tried is the "lift curve" often used in marketing.  It's the same thing as the Lorenz curve used by demographers but interpreted differently.  Here, we see that Guitar pages accounted for 26% of the page views but 37% of the clicks; House pages accounted for an incremental 44% of the page views and 59% of the clicks; etc.  The relative click rates are immediately clear from the steepness of the line segments.  The lift curve is appropriate for the self-selected case, in which we can take the allocation of page views as fixed.
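A rough sketch of how the points on a lift curve could be computed, in Python.  The three pages and their counts are made up for illustration (they are not the sample data above), and `lift_curve` is a name I chose.  Pages are sorted from highest to lowest click rate, and then both page views and clicks are accumulated as shares of their totals.

```python
def lift_curve(pages):
    """pages: list of (name, page_views, clicks) tuples.

    Returns (name, cumulative share of views, cumulative share of clicks)
    for pages ordered from highest to lowest click rate.
    """
    ordered = sorted(pages, key=lambda p: p[2] / p[1], reverse=True)
    total_views = sum(views for _, views, _ in ordered)
    total_clicks = sum(clicks for _, _, clicks in ordered)

    points = []
    cum_views = 0
    cum_clicks = 0
    for name, views, clicks in ordered:
        cum_views += views
        cum_clicks += clicks
        points.append((name, cum_views / total_views, cum_clicks / total_clicks))
    return points


# Hypothetical counts, for illustration only.
sample_pages = [("C", 600, 20), ("A", 100, 50), ("B", 300, 30)]
curve_points = lift_curve(sample_pages)
for name, view_share, click_share in curve_points:
    print(f"{name}: {view_share:.0%} of views, {click_share:.0%} of clicks")
```

Plotting the cumulative click share against the cumulative view share, starting from the origin, gives the curve; the steeper a segment, the higher that page's click rate relative to the rest.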

If the allocation of page views is a decision to be made, then it doesn't make much sense to accumulate page views.  The second construct is the "scatter plot" of % clicks versus % page views.  The steepness of the line through the origin helps us compare the click rates.  Bicycles is clearly inferior in generating clicks.

Both these constructs are highly efficient; adding new data does not expand the chart at all.

Keen readers will observe that the slope of the line is not the click rate but rather a click rate index (relative to the overall click rate).  This means that any data point above the diagonal has above-average click rate.
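To see why the slope is an index, here is a small check in Python with invented numbers (the function name `click_rate_index` is mine).  Dividing a page's share of clicks by its share of page views is algebraically the same as dividing its click rate by the overall click rate.

```python
import math


def click_rate_index(views, clicks, total_views, total_clicks):
    """Slope of the line from the origin to the point (% views, % clicks)."""
    share_of_clicks = clicks / total_clicks
    share_of_views = views / total_views
    return share_of_clicks / share_of_views


# Hypothetical page: 100 views and 50 clicks, on a site with
# 1000 views and 100 clicks in total.
index = click_rate_index(views=100, clicks=50, total_views=1000, total_clicks=100)

page_click_rate = 50 / 100        # 50%
overall_click_rate = 100 / 1000   # 10%
same_index = page_click_rate / overall_click_rate

print(index, same_index)
print(math.isclose(index, same_index))
```

An index above 1 puts the point above the diagonal, that is, an above-average click rate.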

Apr 28, 2007

Cutting through the noise

A terrific application of tag clouds can be seen over at pollster.com, following the first debate of Democratic Presidential hopefuls the other night.  Here is Senator Biden's "tag cloud", depicting the top 50 words that came out of his mouth that night.  The size of each word is proportional to how often he uttered it.

Having not seen the debate, I can use this summary device to get a quick read on what his main points were.  It's clear that he talked about the war ("Iraq", "troops"), education ("teachers", "students"), abortion ("roe", "wade" but, interestingly, not the word "abortion").  Of course, if he had a distinct message, that would have been even better. For what the tag cloud exposed (assuming it was done right) was that he was pretty much all over the place, touching on many different things about equally often.

It is disconcerting that a word like "so-called" made it into the top 50.  Better is that "better" is his #1 word.

It is typical to process text-based data by removing all the most common words that do not carry real meaning (um, ur, the, so-called, etc.) but in this case, keeping them is helpful so the candidates can catch problems like the excessive use of "so-called".

However, the tag cloud would have been improved if "stemming" were used to collapse "talk" and "talking", "teacher" and "teachers", etc.
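As a sketch of what stemming does to the counts behind a tag cloud, consider the Python snippet below.  The `naive_stem` function is a deliberately crude suffix-stripper that I made up for illustration; it is not the Porter stemmer or whatever Pollster might use, and it will mangle plenty of real words.  The word list is invented too.

```python
from collections import Counter


def naive_stem(word):
    """Crude, illustrative stemmer: strips a trailing 'ing' or plural 's'."""
    word = word.lower()
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


# Hypothetical transcript words, for illustration only.
words = ["talk", "talking", "teacher", "teachers", "Iraq"]

raw_counts = Counter(w.lower() for w in words)
stemmed_counts = Counter(naive_stem(w) for w in words)

print(raw_counts)
print(stemmed_counts)
```

Without stemming, each variant gets its own (smaller) entry in the cloud; with it, the variants pool into one larger entry.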

Pollster did tag clouds for every candidate.  Comparing them provides even more insights!  Here's one for Senator Clinton. Her message is much more focused, with quite a lot of time spent proclaiming her "readiness" for "President", quite a bit on "healthcare" and quite a bit on the "war".

As Pollster correctly pointed out, it is unclear if the size of words could be compared across tag clouds.  If so, the setup would be even more powerful.

The entire set of tag clouds can be seen here.  Long-time readers of this blog will remember that we advocated such use back in Jan 2006, when discussing the "concordance" feature at Amazon.  This successful application validates our enthusiasm.

Apr 12, 2007

Peripherals 2

In terms of interactive charting, Google Finance did much more than hide the legend.  In their main stock price chart, they used a number of neat features.

Google_ahm1

This chart effectively conveys a huge amount of information in a small space.  The bottom strip which shows relative prices for the past two years provides context to interpret the five-day movement shown in the main chart area.  I prefer to see a scale on the bottom strip as well. 

The sliding scrollbar can be dragged to show historical data.  In addition, the width of the window shown in the main area can be controlled.  For instance:

Google_ahm2

Without any effort, we are now looking at a 3-month chart for Q2 2006.  Notice the summary statistic on the top right corner also morphed.  The axis scale changed, and it never did start from zero to begin with.  (This shortcoming is alleviated by the profile chart in the bottom strip.)

Further, by placing the cursor in the chart area, we can highlight a particular day: a dot appears on the price curve, the volume on that day is highlighted, and the text on the top right switches.  That text is what we typically place inside the chart area as a "data label".  The effect of moving it to the corner is similar to hiding the legend: it makes the graph more legible and provides space for longer descriptions.  As we move the cursor from left to right, the graph dynamically adapts.  Marvellous!

Google_ahm3

The amount of data processing that has to take place to implement these sorts of features may not be obvious. I don't have space to address the data issue but maybe some of our readers can comment on it.

Apr 08, 2007

Peripherals 1

Like any technology, charts also come with peripherals: I'm talking about legends, data labels, grid-lines and so on.  These things typically give us the most trouble, especially with complex data sets.  The analogy is apt: one may feel inextricably knotted up like bunches of cords and wires.

Interactive graphics is a particularly elegant solution to this problem, and Google Finance has done a fantastic job leading the way.  One trick is to show the legend only when the user asks for it.

Using bar charts (on the left), Google summarizes neatly the performance of stocks within each industry sector.  The bar chart gives a sense of the dispersion which adds to the average returns printed next to them.  For example, most sectors gained on average but then about 30% of the individual stocks in most sectors actually declined on that day.  So the fact that technology stocks gained 0.48% on average doesn't necessarily mean that the two tech stocks you own gained 0.48% or gained at all.

Typically, we would put a legend on the side or at the bottom of the chart, which, all told, is an ugly duckling next to a well-executed chart.  Here, the legend is hidden behind the "What's this?" link.  The side benefit is that the legend can be as verbose as needed since it doesn't interfere with the chart.

There are a few minor things to consider:

  • "What's this?" is not very informative: Why not call it a "legend" or "key"?
  • The graph designer seems to think that the most important information sought by readers was the extremes, i.e. the percentage of stocks that gained/lost more than 2%.  By darkening the sides of the bar, it draws attention away from the middle which is the boundary between the gainers and the losers.  I'd like to see that boundary delineated.
  • Similar to the above point, I'd sketch out a version which aligns the gainer/loser boundary to the middle so it's easy to see the balance between gainers and losers.  This version, however, would require more space.
  • I'd provide sorting by average return, and by percentage of gainers.

Mar 21, 2007

Dot com bubbles

Thanks to Dustin J for the pointer as well as the title of this post.  Dotcom bubbles is the most appropriate name for this overblown chart (featured as the "chart of the day" here).

The chart has no title or axis labels so only the diligent reader will figure out that the data consist of the acquisition values of several high-profile Internet companies in the past three years.

There is less data here than it seems.  Both the heights and the areas of the bubbles indicate the same thing, the deal values.  If we are supposed to see a trend, we are not finding it.

Most of these deals are not directly comparable anyway.  Webex and Ironport are infrastructure type companies with real business models.  Skype is a phone service.  Ask Jeeves is not a leader in its own space. Myspace and YouTube are traffic sites.

Reference: "Chart of the Day: Web deals", Valleywag, Mar 15 2007.

Mar 12, 2007

Lines of death

I've been reading my friend's anti-smoking tome, and traced this "infographic" back to its source (World Health Organization). 

I was very intrigued by the "lines of death" which seemed to make the point that the risk of death had a spatial correlation: specifically, that the death risk for male smokers was higher in the northern hemisphere (above the line), primarily developed countries, as compared to the southern hemisphere, mostly developing nations.

I find that somewhat counter-intuitive but in a fascinating book like this, one that brings together scientific, psychological and societal commentary, I was expecting to learn new things.

Looking at the legend, the red areas were regions in which deaths from tobacco use accounted for over 25% of "total deaths among men and women over 35".  This explained some, as perhaps there were more reasons to die (warfare, other diseases, mine accidents, etc.) in developing nations than in developed nations, or that they had larger populations (so more deaths even at lower rates).

However, the description of the "lines of death" raised my eyebrows.  It is now claimed that more than 25% of middle-aged people (35-69 years old) die from tobacco use in the red regions.

Did they mean that, of the middle-aged people who died, 25% died from smoking?  Or that 25% of all middle-aged folks die from smoking?  A gigantic difference!

Percentages are very tricky things to use.  Every time I see a percentage, the first thing I ask is what is the base population.  Here, the baseline appeared to have gotten lost in translation.
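To make the gap between the two readings concrete, here is a toy calculation in Python.  All three counts are invented for illustration and have nothing to do with the WHO figures; the point is only that the same number of tobacco deaths yields very different percentages depending on the base.

```python
# Hypothetical counts, for illustration only.
middle_aged_population = 1000   # everyone aged 35-69 in some region
middle_aged_deaths = 200        # of whom this many died at those ages
tobacco_deaths = 50             # deaths attributed to tobacco use

# Reading 1: base population = middle-aged people who died.
share_of_deaths = tobacco_deaths / middle_aged_deaths

# Reading 2: base population = all middle-aged people.
share_of_population = tobacco_deaths / middle_aged_population

print(f"{share_of_deaths:.0%} of middle-aged deaths")
print(f"{share_of_population:.0%} of all middle-aged people")
```

With these made-up counts, the first reading gives 25% and the second gives 5%, from exactly the same deaths.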

This set of maps also shows the peril of focusing too much on entertainment value, and losing the plot.

For those concerned about the effect of smoking on our society and our children, I highly recommend Dr. Rabinoff's highly readable new book, "Ending the tobacco holocaust".  It contains lots of interesting tidbits and really brings together every cogent argument that exists, including the common ones you've heard and others you haven't.

Reference: "Ending the tobacco holocaust" by Michael Rabinoff; The Tobacco Atlas by the World Health Organization

Feb 06, 2007

Digging it out

Another sunset photo compilation?  Not quite.

This chart acts and smells like the sunset chart, being generated by many unknowing collaborators, this time, visitors to the content aggregation site, Digg.  For those unfamiliar, web surfers can "digg" any web page they find interesting (by clicking on an image), which causes a link to be generated at Digg's web-site.  We can use the number of Diggs to judge the value or popularity of a web page.

In effect, Digg is a gigantic save folder for the masses.  What happens when we have huge amounts of data?  We have to work really hard to dig out the useful information.  This chart goes quite a long way to answer one specific question.

Digg users are plotted horizontally and the stories they Digged are plotted vertically.  The bright white vertical strip represents suspicious activity; some user digged a large number of stories within the time window of the chart, most likely a bot trying to usurp the mass rating system.

Flickr and Digg are two of the more prominent stories of the so-called "Web 2.0", or mass collaboration on the Web.  Between my last post and this post, I have kind of lost enthusiasm for this type of chart, at least from a statistical perspective.  There is no real collaboration: the photographer who contributed sunset No. 103 does not know the one who uploaded No. 31, for example.  Using this logic, every survey or census ever conducted qualifies as mass collaboration, just because there are many participants providing data.

What's worse, a typical survey brings together results from a random sample.  These charts all have highly biased samples, and I haven't seen any discussion yet of this issue.  They cannot be interpreted without understanding who participated.

Reference: "How Digg Combats Cheaters", Technology Review, Jan 24, 2007.

Jan 16, 2007

Subjectivity

When I look at charts like this one, I ponder: Should graph designers adopt "objectivity" as practiced by American journalists?

Is it even possible to make "objective" charts?  Every design choice we make seems to chip away some of the detachment.  In this chart, the choice to order important web-site features by shopper -- rather than merchant -- ratings is a tacit preference for those ratings.  Bringing out key messages in the data is a subjective act, isn't it?

Are "objective" charts useful?  In our example, the design choices are kept to a minimum, and so, it seems, is its usefulness.  In comparing shopper and merchant ratings, one would be most interested in identifying the most effective web-site features as well as those features offered by merchants that find little resonance with shoppers.  These questions are better addressed by directly plotting the average rank and the ranking gap between merchants and shoppers (see below).

Notice that I said "ranking" rather than "rating".  The footnote discloses that the ratings were obtained from two different surveys conducted by two different companies at two different times.  How should we interpret the difference of 13% between the 89% of shoppers rating "Free Shipping" "very to extremely helpful" and the 76% of merchants rating "Free Shipping" "somewhat to very valuable"?
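Here is a minimal sketch, in Python, of how ratings from two incomparable surveys can be turned into ranks, and then into an average rank and a ranking gap.  Only the two "Free Shipping" percentages come from the article; "Feature X" and "Feature Y" and their numbers are placeholders I invented, and the helper names (`to_ranks`, `rank_summary`) are mine.  Ties are ignored for simplicity.

```python
def to_ranks(ratings):
    """Rank features within one survey; 1 = highest rating (ties ignored)."""
    ordered = sorted(ratings, key=ratings.get, reverse=True)
    return {feature: position + 1 for position, feature in enumerate(ordered)}


def rank_summary(shopper_ratings, merchant_ratings):
    """Return {feature: (average rank, merchant rank minus shopper rank)}."""
    shopper_ranks = to_ranks(shopper_ratings)
    merchant_ranks = to_ranks(merchant_ratings)
    return {
        feature: (
            (shopper_ranks[feature] + merchant_ranks[feature]) / 2,
            merchant_ranks[feature] - shopper_ranks[feature],
        )
        for feature in shopper_ratings
    }


# "Free Shipping" figures are from the article; the rest are hypothetical.
shopper_ratings = {"Free Shipping": 89, "Feature X": 60, "Feature Y": 75}
merchant_ratings = {"Free Shipping": 76, "Feature X": 80, "Feature Y": 50}

summary = rank_summary(shopper_ratings, merchant_ratings)
for feature, (average_rank, gap) in summary.items():
    print(feature, average_rank, gap)
```

With the gap defined this way, a negative value means merchants rank the feature higher than shoppers do, and a positive value means shoppers rank it higher than merchants do.  Because each survey is ranked only against itself, the different question wording matters much less than it does when subtracting raw percentages.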

In the junkart chart, we can focus on three groups of features:

  • the three top features ("Promo Discounts", "Free Shipping" and "Keyword Search") which attained the best average rank and least ranking gap;
  • the three "orphan" features ("Recommended Products", "Top Sellers", "Gift Selection") created by loving web-site producers, abandoned by independent-minded shoppers;
  • the three "neglected stepchildren" ("Shop the Catalog", "Store Locator", "Product Comparison") whose importance to shoppers was vastly underestimated by the merchants.

Unfortunately, while being "objective", the data table fails to point out anything of interest to the reader.

Reference: "Consumers want one thing -- merchants are delivering another", Internet Retailer, Jan 2007.

Oct 06, 2006

For love of Color

Derek C. pointed us to this piece of chartjunk on Wikipedia.  This chart compares the mass of solar system objects, relative to the Earth's mass.

Derek's comment:

The bars are inappropriate, as their length is proportional to the
logarithm of the ratio of the masses of the object and the Earth. Also
the multiple colours are distracting.

I'm also mystified by the first bar called "Solar System".  It seems to convey the idea that the Solar System is much larger than the Earth;  combined with the second bar ("Sun"), it tells us that every object but the Sun pales into insignificance.  If this is true, then the Solar System needs to be labelled differently as it is not a "solar system object".

Derek sent in a much improved chart:

Derekc_solar

His version is much cleaner.  The axis labels, properly oriented, are much easier to read.  The use of color is admirably restrained: I suspect that he is as baffled as I about the asterisks (now blue dots) in the original chart. I'd retain the vertical line through the Earth (relative mass = 1) to help anchor the chart.

But a job well done!  He should send it in to the powers that be at Wikipedia.

