Visualizing web statistics
May 22, 2007
Tim inquired about:
how to create an elegant graph for Web visitor traffic statistics that shows both how many views a page gets and then how many people click that page to go further ("conversion rate"). Part of the problem is that conversion rates vary from, say, .3% to 50% (a wide range).
Let's work with this sample data set. I ordered it from highest to lowest click rate, which is the primary metric of interest. The number of page views is of interest too, since rarely visited pages sometimes have high click rates.
At this point, it's important to know the context. Specifically, who controls the allocation of pages? Did the data come from a randomized experiment? Or did they get a self-selected sample (e.g. web surfers deciding which section of the site to visit)?
The first construct I tried is the "lift curve" often used in marketing. It's the same thing as the Lorenz curve used by demographers but interpreted differently. Here, we see that Guitar pages accounted for 26% of the page views but 37% of the clicks; House pages accounted for an incremental 44% of the page views and 59% of the clicks; etc. The relative click rates are immediately clear from the steepness of the line segments. The lift curve is appropriate for the self-selected case, in which we can take the allocation of page views as fixed.
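For readers who want to reproduce the lift curve, here is a minimal sketch of how the cumulative points could be computed. The Guitar and House shares are the ones quoted above; the Bicycles row is an assumption, taken as the remainder needed for each column to reach 100%, and the names `shares` and `lift_curve` are made up for illustration.

```python
from itertools import accumulate

# (page, % of page views, % of clicks), ordered from highest to lowest click rate.
# Guitar and House are quoted in the post; Bicycles is an assumed remainder
# (100 - 26 - 44 page views, 100 - 37 - 59 clicks).
shares = [
    ("Guitar", 26, 37),
    ("House", 44, 59),
    ("Bicycles", 30, 4),
]


def lift_curve(rows):
    """Return cumulative (% page views, % clicks) points, starting at the origin."""
    cum_views = accumulate(views for _, views, _ in rows)
    cum_clicks = accumulate(clicks for _, _, clicks in rows)
    return [(0, 0)] + list(zip(cum_views, cum_clicks))


points = lift_curve(shares)
print(points)
```

Each consecutive pair of points defines one line segment of the curve, and the steepness of a segment reflects the relative click rate of that page group.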
If the allocation of page views is a decision to be made, then it doesn't make much sense to accumulate page views. The second construct is the "scatter plot" of % clicks versus % page views. The steepness of the line through the origin helps us compare the click rates. Bicycles is clearly inferior in generating clicks.
Both these constructs are highly efficient; adding new data does not expand the chart at all.
Keen readers will observe that the slope of the line is not the click rate but rather a click rate index (relative to the overall click rate). This means that any data point above the diagonal has an above-average click rate.
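A small sketch may make the index concrete. The slope of the line from the origin to a point is (% clicks) / (% page views), which works out to the page's click rate divided by the overall click rate. The Bicycles shares and the totals `total_views` and `total_clicks` below are assumptions used only to check that identity; they are not from the data set.

```python
import math

# page -> (% of page views, % of clicks); Bicycles is an assumed remainder.
shares = {"Guitar": (26, 37), "House": (44, 59), "Bicycles": (30, 4)}


def click_rate_index(pct_views, pct_clicks):
    """Slope of the line from the origin to (pct_views, pct_clicks)."""
    return pct_clicks / pct_views


index = {page: click_rate_index(v, c) for page, (v, c) in shares.items()}

# Hypothetical totals, used only to confirm that the slope equals
# (page click rate) / (overall click rate) whatever the totals are.
total_views, total_clicks = 10_000, 500
overall_rate = total_clicks / total_views
for page, (v, c) in shares.items():
    page_rate = (c / 100 * total_clicks) / (v / 100 * total_views)
    assert math.isclose(page_rate / overall_rate, index[page])

# Points above the diagonal have an index greater than 1.
above_average = [page for page, value in index.items() if value > 1]
print(above_average)
```

An index of 1 corresponds to the diagonal, so the ranking of pages by slope is the same as the ranking by click rate.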
Hi,
The graphI would like to add your graph to the Web Analytics Collection that I have created.
Posted by: Daniel Waisberg | May 22, 2007 at 03:57 AM
I am sorry for the comment above...
The graphs in this post are very interesting and full of insights.
I would like to add them to my Web Analytics graph collection at http://www.esnips.com/web/WebAnalyticsGraphs
Is it ok?
Posted by: Daniel Waisberg | May 22, 2007 at 04:00 AM
I'm not sure I get the purpose of the slope. If it's to gauge the superiority of Guitars to Home, would not the click rate itself be the best choice of ordinate, as here?
Posted by: derek | May 22, 2007 at 04:16 AM
What's up with the funky axis labelling? If we wanted to read off the exact values we could use the table (or interaction). What does adding them gain us? How would it scale for more data points? How can we tell the overall range of the axes? Are the scales equal or different?
Derek, why does your plot lack a bounding box? To my eye, the home observation floats off, unrelated to the rest of the graph.
Posted by: Hadley Wickham | May 22, 2007 at 06:13 AM
The answer to both your questions is that many of us are influenced by a classic work of data graphics called The Visual Display of Quantitative Information by Edward Tufte, and we don't feel graphs need these things. You have a point about it not scaling to larger numbers, but it isn't larger numbers yet, is it?
In the case of large numbers of points, the thing to do is change the strategy. One way to do that is to revert to a conventional scale of three or four numbers equally spaced. The reason Kaiser didn't go for a regular scale is that it almost certainly would have had at least three numbers per scale - how wasteful to present three numbers, and none of them the actual numbers!
An alternative when there are large numbers involved is to put statistical data on the scale, such as the minimum, maximum, and median or mean. The first and third quartiles are possibilities as well, and that makes five numbers. This way, the scales are made to work harder; instead of just being passive indicators of the size of space, they do a second job of indicating the one-dimensional distribution of the two-dimensional data on the graph. Other quantile measures are possible as well, like quintiles for economists, and percentiles.
I thought about a bounding box, but then I thought... nah, what for? However, now I look at my version again, I think I was too hasty omitting an origin, and I may put one in after all.
Posted by: derek | May 22, 2007 at 10:07 AM
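The five-number scale described in the comment above could be computed along these lines. This is only a sketch: the function name `five_number_ticks` and the sample values are made up for illustration, and the "inclusive" quartile method is one choice among several.

```python
import statistics


def five_number_ticks(values):
    """Return [min, first quartile, median, third quartile, max] to use as axis ticks."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [min(values), q1, median, q3, max(values)]


# Hypothetical values along one axis, for illustration only.
ticks = five_number_ticks([1, 2, 3, 4, 5])
print(ticks)
```

The resulting list can be passed to whatever charting tool is in use as the tick positions for that axis.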
I'm Tim who submitted this question, thanks for the answer! These are much better than the direction I was going in. The second one really suits my purposes well.
I might consider changing % of Page Views to # of Page Views since that number will (hopefully) grow over time and is of interest (especially when comparing over periods).
Posted by: Timz | May 22, 2007 at 11:38 AM
Derek: the lines are actually key to reading the graph! The reason why we'd plot two dimensions is because both dimensions factor into our evaluation; the slope provides a one-stop shop to rank order the items. Your version also works but it is not as obvious which points are superior. With many more data points, that would become more of an issue. This is a great issue you are bringing up; thanks!
Daniel: you're welcome to use anything on our blog, just provide a link back.
Posted by: Kaiser | May 22, 2007 at 01:04 PM
Now I see where you're coming from. You have three quantities going on there, the x, the slope, and the y. My three equivalents are the x, the y, and the area of the rectangle whose upper right corner the point defines.
My preferred way of representing three quantities in two dimensions like that, where a=b*c, is to turn both scales logarithmic and overlay a third sloping scale. That means I can't use data scales like here, so I won't try it in this case, but here's an example of an older graph of mine.
I've drafted another version of my first graph that I hope addresses some of Hadley's concerns.
Posted by: derek | May 22, 2007 at 02:11 PM
While I think Tufte produces beautiful graphics, he is, at the end of the day, a graphic artist, not a statistician. Axes and gridlines already have an important role to play - they provide a set of consistent reference points that help us read off observations more accurately, and enhance pattern perception [1]. When we start placing them at arbitrary locations, we lose this important visual reference.
What does displaying selected marginal quantiles really gain us? I might be more convinced if you were arguing for marginal histograms, but still, they can only ever show marginal behaviour, not the joint relationship between the two variables. (Derek: I do like your improved plot much better, however). I prefer my tools to do one thing and do it well, rather than trying to do too many things at once.
A little off topic, but if I'm going to cite Cleveland, I might as well cite his seminal work [2]. If you haven't already read it, you should track down a copy now! Sure, the graphics might not be as pretty as Tufte's, but Cleveland has spent considerable effort researching the efficacy of his guidelines.
[1] W. Cleveland. A model for studying display methods of statistical graphics. Journal of Computational and Graphical Statistics, 2:323–364, 1993.
[2] W. Cleveland and R. McGill. Graphical perception: The visual decoding of quantitative information on graphical displays of data. Journal of the Royal Statistical Society. Series A (General), 150(3):192–229, 1987.
Posted by: Hadley Wickham | May 22, 2007 at 04:05 PM
Hadley: I think both Derek and I agree with you. If there were more data, we'd both have used traditional axis labels, plus the marginal histograms. With only three data points here, why not provide more?
It's less useful in the second plot. In the first plot, I've just made it easier to estimate incremental changes.
Derek: like Hadley, I think having overlapping grid systems causes over-crowding in that graph. I'd probably use two graphs in that case.
Posted by: Kaiser | May 22, 2007 at 07:42 PM
Sorry, which graph is it you think is overcrowded?
Posted by: derek | May 23, 2007 at 05:21 AM
Anyone have tips on how to build #2 in Excel?
Posted by: Timz | May 23, 2007 at 06:49 PM
Timz: You'd have to trick Excel into thinking that you're plotting three lines. So add a (0%,0%) before each pair of data (e.g. 26%, 37%). Then use Scatter Plot.
Derek: I was referring to the one where you have the diagonal gridlines.
Posted by: Kaiser | May 23, 2007 at 07:33 PM
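The data layout behind the Excel tip above might look something like the sketch below: one two-point series per page, each starting at the origin. The column names and the `excel_rows` helper are made up for illustration, and the Bicycles row is an assumed remainder rather than a figure from the post.

```python
import csv
import io

# page -> (% of page views, % of clicks). Guitar and House are quoted in the
# post; Bicycles is an assumption (the remainder to 100%).
points = {"Guitar": (26, 37), "House": (44, 59), "Bicycles": (30, 4)}


def excel_rows(points):
    """Build one two-point series per page: the origin, then the data point."""
    rows = [("series", "pct_page_views", "pct_clicks")]
    for page, (views, clicks) in points.items():
        rows.append((page, 0, 0))
        rows.append((page, views, clicks))
    return rows


rows = excel_rows(points)
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Pasting the output into a worksheet and charting each page as its own scatter series with connecting lines should give one line per page through the origin.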
Oh, but that's nothing more than an artifact of the fact that I was too lazy to make my own log scales, and used the Excel defaults instead, which only offer ranges of a power of ten. Here's an example of what you get when you take the trouble to set the scales properly.
By contrast, here's the same thing using linear scales, an origin, and lines approaching the origin from different angles. Even though the page sizes and text fonts are the same, and the linear graph occupies more area, the data is still squashed up. I fear the same would happen with the web stats after the first few data points.
Posted by: derek | May 24, 2007 at 02:28 PM