
Listed below are links to weblogs that reference Visualizing web statistics:

Comments

Daniel Waisberg

Hi,

I would like to add your graph to the Web Analytics Collection that I have created.

Daniel Waisberg

I am sorry for the comment above...

The graphs in this post are very interesting and full of insights.

I would like to add them to my Web Analytics graph collection at http://www.esnips.com/web/WebAnalyticsGraphs

Is it ok?

derek

I'm not sure I get the purpose of the slope. If it's to gauge the superiority of Guitars to Home, would not the click rate itself be the best choice of ordinate, as here?

Hadley Wickham

What's up with the funky axis labelling? If we wanted to read off the exact values we could use the table (or interaction). What does adding them gain us? How would it scale for more data points? How can we tell the overall range of the axes? Are the scales equal or different?

Derek, why does your plot lack a bounding box? To my eye, the home observation floats off, unrelated to the rest of the graph.

derek

The answer to both your questions is that many of us are influenced by a classic work of data graphics called The Visual Display of Quantitative Information by Edward Tufte, and we don't feel they need these things. You have a point about it not scaling to larger numbers, but it isn't larger numbers yet, is it?

In the case of large numbers of points, the thing to do is change the strategy. One way to do that is to revert to a conventional scale of three or four numbers equally spaced. The reason Kaiser didn't go for a regular scale is that it almost certainly would have had at least three numbers per scale - how wasteful to present three numbers, and none of them the actual numbers!

An alternative, when there are large numbers involved, is to put statistical data on the scale, such as the minimum, maximum, and median or mean. The first and third quartiles are possibilities as well, and that makes five numbers. This way, the scales are made to work harder; instead of just being passive indicators of the size of space, they do a second job of indicating the one-dimensional distribution of the two-dimensional data on the graph. Other quantile measures are possible as well, like quintiles for economists, and percentiles.
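To make the five-number idea concrete, here is a minimal Python sketch that computes the minimum, quartiles and maximum of one variable so they can serve as that axis's tick positions. The function name and the sample values are made up for illustration, and the "inclusive" quartile method is only one of several reasonable conventions.

```python
import statistics

def five_number_ticks(values):
    """Return (min, Q1, median, Q3, max) for use as axis tick positions."""
    # "inclusive" treats the data as the whole population; other quartile
    # conventions would give slightly different Q1 and Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))

# Hypothetical values, not taken from the post.
page_view_share = [2, 5, 8, 12, 26, 31, 40]
print(five_number_ticks(page_view_share))
```

In a plotting library such as matplotlib, a tuple like this could be handed to `Axes.set_xticks` so that the tick marks themselves show the marginal distribution.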

I thought about a bounding box, but then I thought... nah, what for? However, now I look at my version again, I think I was too hasty omitting an origin, and I may put one in after all.

Timz

I'm Tim who submitted this question, thanks for the answer! These are much better than the direction I was going in. The second one really suits my purposes well.

I might consider changing % of Page Views to # of Page Views since that number will (hopefully) grow over time and is of interest (especially when comparing over periods).

Kaiser

Derek: the lines are actually key to reading the graph! The reason we'd plot two dimensions is that both dimensions factor into our evaluation; the slope provides a one-stop shop to rank order the items. Your version also works but it is not as obvious which points are superior. With many more data points, that would become more of an issue. This is a great issue you are bringing up; thanks!
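The rank ordering by slope can be sketched in a few lines of Python. This is an illustration only: the section names echo the ones mentioned in this thread, but the (x, y) values are invented, and the function assumes every x is greater than zero.

```python
# Hypothetical (x, y) pairs; the post's actual figures are not reproduced here.
sections = {
    "Home": (0.40, 0.20),
    "Guitars": (0.26, 0.37),
    "Amps": (0.10, 0.08),
}

def rank_by_slope(points):
    """Sort labels by the slope y/x of the line from the origin, steepest first."""
    return sorted(points, key=lambda name: points[name][1] / points[name][0], reverse=True)

print(rank_by_slope(sections))
```

The steepest line comes first, which is the ordering the eye picks up from the fan of lines without reading either axis.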

Daniel: you're welcome to use anything on our blog, just provide a link back

derek

Now I see where you're coming from. You have three quantities going on there, the x, the slope, and the y. My three equivalents are the x, the y, and the area of the rectangle whose upper right corner the point defines.

My preferred way of representing three quantities in two dimensions like that, where a=b*c, is to turn both scales logarithmic and overlay a third sloping scale. That means I can't use data scales like here, so I won't try it in this case, but here's an example of an older graph of mine.
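A small Python sketch of why the third scale can be a straight sloping one: since log a = log b + log c, points sharing the same product fall on a straight diagonal once both axes are logarithmic. It assumes b and c are the two plotted axes and a is the overlaid quantity; the numbers are arbitrary.

```python
import math

def log_coords(b, c):
    """Position of the point (b, c) on a plot with both axes in log10."""
    return (math.log10(b), math.log10(c))

# Arbitrary pairs that all share the product a = b * c = 100.
pairs = [(1, 100), (10, 10), (100, 1)]
points = [log_coords(b, c) for b, c in pairs]

# On log axes the coordinates of each point sum to log10(a), so all three
# points lie on one straight line of slope -1.
for x, y in points:
    print(x, y, x + y)
```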

I've drafted another version of my first graph that I hope addresses some of Hadley's concerns.

Hadley Wickham

While I think Tufte produces beautiful graphics, he is, at the end of the day, a graphic artist, not a statistician. Axes and gridlines already have an important role to play - they provide a set of consistent reference points that help us read off observations more accurately, and enhance pattern perception [1]. When we start placing them at arbitrary locations, we lose this important visual reference.

What does displaying selected marginal quantiles really gain us? I might be more convinced if you were arguing for marginal histograms, but still, they can only ever show marginal behaviour, not the joint relationship between the two variables. (Derek: I do like your improved plot much better, however). I prefer my tools to do one thing and do it well, rather than trying to do too many things at once.

A little off topic, but if I'm going to cite Cleveland, I might as well cite his seminal work [2]; if you haven't already read it, you should track down a copy now! Sure, the graphics might not be as pretty as Tufte's, but Cleveland has spent considerable effort researching the efficacy of his guidelines.

[1] W. Cleveland. A model for studying display methods of statistical graphics. Journal of Computational and Graphical Statistics, 2:323–364, 1993.

[2] W. Cleveland and R. McGill. Graphical perception: The visual decoding of quantitative information on graphical displays of data. Journal of the Royal Statistical Society. Series A (General), 150(3):192–229, 1987.

Kaiser

Hadley: I think both Derek and I agree with you. If there were more data, we'd both have used traditional axis labels, plus the marginal histograms. With only three data points here, why not provide more?

It's less useful in the second plot. In the first plot, I've just made it easier to estimate incremental changes.

Derek: like Hadley, I think having overlapping grid systems causes over-crowding in that graph. I'd probably use two graphs in that case.

derek

Sorry, which graph is it you think is overcrowded?

Timz

Anyone have tips on how to build #2 in Excel?

Kaiser

Timz: You'd have to trick Excel into thinking that you're plotting three lines. So add a (0%,0%) before each pair of data (e.g. 26%, 37%). Then use Scatter Plot.
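The data layout behind that trick can be sketched in Python: each item becomes its own two-point series that starts at the origin. Only the (26%, 37%) pair comes from the comment above; the other values and all the names are placeholders.

```python
def series_from_origin(points):
    """Turn each (x, y) pair into a two-point series starting at (0, 0)."""
    return {name: [(0.0, 0.0), (x, y)] for name, (x, y) in points.items()}

# Only the first pair appears in the thread; the others are made up.
data = {"A": (0.26, 0.37), "B": (0.40, 0.20), "C": (0.10, 0.08)}

# Print tab-separated rows that could be pasted into a spreadsheet,
# one block of two rows per series.
for name, rows in series_from_origin(data).items():
    for x, y in rows:
        print(f"{name}\t{x}\t{y}")
```

Each block of two rows would then be added to the scatter plot as a separate series with connecting lines.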

Derek: I was referring to the one where you have the diagonal gridlines.

derek

Oh, but that's nothing more than an artifact of the fact that I was too lazy to make my own log scales, and used the Excel defaults instead, which only offer ranges of a power of ten. Here's an example of what you get when you take the trouble to set the scales properly.

By contrast, here's the same thing using linear scales, an origin, and lines approaching the origin from different angles. Even though the page sizes and text fonts are the same, and the linear graph occupies more area, the data is still squashed up. I fear the same would happen with the web stats after the first few data points.


