Digging it out

Another sunset photo compilation? Not quite.

This chart acts and smells like the sunset chart: it is generated by many unknowing collaborators, this time visitors to the content aggregation site Digg. For those unfamiliar, web surfers can "digg" any web page they find interesting (by clicking on an image), which causes a link to be generated on Digg's website. We can use the number of diggs to judge the value or popularity of a web page.

In effect, Digg is a gigantic save folder for the masses.  What happens when we have huge amounts of data?  We have to work really hard to dig out the useful information.  This chart goes quite a long way to answer one specific question.

Digg users are plotted horizontally and the stories they dugg are plotted vertically. The bright white vertical strip represents suspicious activity: some user dugg a large number of stories within the time window of the chart, most likely a bot trying to subvert the mass rating system.
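
The white strip amounts to one user's column having far more filled cells than its neighbors. Here is a minimal sketch of that idea in Python, assuming we had a list of (user, story) pairs for the time window. The function name, the 10x-the-median threshold and the sample data are all made up for illustration; this is not how Digg actually detects cheaters.

```python
from collections import Counter
from statistics import median


def flag_heavy_diggers(diggs, ratio=10.0):
    """Return users whose digg count exceeds `ratio` times the median count.

    `diggs` is an iterable of (user, story) pairs; `ratio` is an arbitrary,
    illustrative threshold.
    """
    counts = Counter(user for user, _story in diggs)
    if not counts:
        return []
    typical = median(counts.values())
    return sorted(user for user, n in counts.items() if n > ratio * typical)


# Made-up sample: three ordinary users and one account that diggs 50 stories.
sample = [("ann", "s1"), ("ann", "s2"), ("bob", "s1"), ("cat", "s3")]
sample += [("bot42", f"s{i}") for i in range(50)]

print(flag_heavy_diggers(sample))
```

A count this crude would also flag a merely enthusiastic human, which is one reason the chart, like the code, is only a starting point for asking who is behind the strip.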

Flickr and Digg are two of the more prominent stories of the so-called "Web 2.0", or mass collaboration on the Web. Between my last post and this post, I have kind of lost enthusiasm for this type of chart, at least from a statistical perspective. There is no real collaboration: the photographer who contributed sunset No. 103 does not know the one who uploaded No. 31, for example. By this logic, every survey or census ever conducted qualifies as mass collaboration, just because there are many participants providing data.

What's worse, a typical survey brings together results from a random sample.  These charts all have highly biased samples, and I haven't seen any discussion yet of this issue.  They cannot be interpreted without understanding who participated.

Reference: "How Digg Combats Cheaters", Technology Review, Jan 24, 2007.



Tom Carden

Long comment, sorry :)

It's a shame you're tired of this kind of chart just because you don't think the sites are collaborative (and who said they were?). We (Stamen) have more of these to come, including some that are even more reminiscent of the Flickr sunset chart in your last post.

With regard to the data sample, I don't think it's right to use a loaded word like biased. As much as anything, these kinds of charts serve as thinking aids: part of understanding the bigger picture, not the picture itself.

I also think you're off the mark a bit about the collaboration aspect of things. Flickr isn't a site for collaborating to find out what time sunset is, it's a site for sharing photos and discussing them. Digg isn't a site for collaborating to find out what patterns of activity there are on the web, it's a site to find interesting webpages and share/discuss them.

If you're looking for "web 2.0" collaboration in these sites, it's easy to find though. Some of the groups on Flickr collaborate to collect and identify photos of plants for example. I would argue that the Digg front page is a collaborative effort to find the most interesting pages on the web, right now, with a heavy emphasis on novelty.

All said and done, you're quite right that there's not enough discussion of the sample choice with this kind of chart. Most often it's just the most recently available data, not necessarily a statistically useful slice. I recently talked to a math expert who perceives a small renaissance in university statistics departments, largely thanks to data mining and the internet. Here's hoping it rubs off on those of us involved with sorting the meaning from the mess.


I have to admit that, in my haste, I sounded as if I had lost interest in this area. Far from it! What I really meant was that the initial excitement has been taken over by a dose of reality...

Internet data present lots of challenges for statisticians. Lurking behind the whole discussion of what caused the noise in the sunset chart is the issue of selection bias as well. Is there any control over how many of the sunset photographs are taken from different time zones, etc.? To be more exact, it's the classical statistical problem of self-selection.


I find these "charts" (a simplistic term at best) and the plotting of internet community usage and submission activity, on sites like Digg, completely and entirely fascinating. I can definitely see how web stats can drive a "small renaissance" in university statistics departments.

thanks for the post
