Dissecting charts from a Big Data study
Mar 28, 2014
Today's post examines an example of a Big Data analysis, submitted by reader Daniel T. The link to the analysis is here. (On the sister blog, I discussed the nature of this type of analysis; this post concerns the graphical elements.)
The analyst looked at "the influence of operating systems and device types on hourly usage behavior". This dataset satisfies four of the five characteristics in the OCCAM framework (link).
Observational: the data are ad impressions from the Chitika Ad Network, observed between February 26 and March 11, 2014. This means users are (unwittingly) being tracked by cookies, pixels, or some other form of tracking device. The analyst did not design this study first and then collect the data for it.
Lacking Controls: There will be a time trend, but what should we compare it against? How do we know whether something is out of the ordinary?
Seemingly Complete: Right up top, we are impressed with the use of "a sample of tens of millions of device-specific online ad impressions". At least they understand this is a sample, not everything.
Adapted: All weblog data are adapted in the sense that web logs originally serve web developers, who are interested in debugging their code. Operating systems and device types are tracked because each variant of OS and device requires customization, and developers need that data to understand how webpages render differently. I wrote about the adaptedness of these data in a separate blog post (link).
The analysis did not require merging data, the fifth element of the framework.
***
Here is the type of chart used to present the analysis. It has many problems.
The conclusion the analyst drew from the above chart is: "North American Android users are more active than their iOS counterparts late at night and during the majority of the workday." In other words, the analyst points out that the blue line sits above the orange line during certain times of the day.
Daniel is very annoyed with the way the data are processed, and rightfully so. The chart does not actually say what it appears to say, because of the indexing.
This simple chart is not so simple to interpret!
This is because each line is "indexed to self". For example, at 12 pm EST, Android users are at 75% of their peak-hour usage while iOS users are at two-thirds of theirs. The trouble is that the peak-hour usage of iOS users is more than 2.5 times that of Android users, so 100% on the blue (Android) line represents less than half the count of 100% on the orange (iOS) line.
Later in the same post, the analyst re-indexed both series to the iOS peak. That chart tells us that iOS users are more active at every hour of the day.
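To make the indexing arithmetic concrete, here is a minimal sketch using the ratios quoted above (the counts themselves are hypothetical):

```python
import numpy as np

# Hypothetical impression counts (in thousands) matching the ratios above:
# two hours only, noon EST and each platform's own peak hour.
android = np.array([75, 100])    # Android: noon, peak
ios     = np.array([167, 250])   # iOS: noon, peak (2.5x the Android peak)

# "Indexed to self": each series is divided by its own maximum.
android_idx = android / android.max()   # [0.75, 1.00]
ios_idx     = ios / ios.max()           # [0.668, 1.00]

# The indexed chart shows blue (Android) above orange (iOS) at noon...
print(android_idx[0] > ios_idx[0])      # True
# ...even though the raw counts say the opposite at every hour.
print(android[0] < ios[0])              # True

# Re-indexing both series to the common (iOS) peak removes the illusion:
common_peak = max(android.max(), ios.max())
print(android / common_peak)            # [0.3 0.4]
print(ios / common_peak)                # [0.668 1.]
```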
***
The Chitika analyst is not doing anything unusual; this type of indexing is pandemic in Web analytics. The worst part is that a lot of Web data is long-tailed, which makes the maximum value an outlier, and indexing data to an outlier isn't wise. (Often the index is used to hide the actual values of the data, typically to keep company secrets, but there are better ways to accomplish that.)
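To illustrate the outlier problem, here is a small sketch on simulated long-tailed data (not any real web metric):

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated long-tailed values, as raw web metrics often are.
x = rng.pareto(1.5, size=10_000)

# Indexed to the maximum, nearly all values are squashed toward zero,
# because the maximum is an outlier:
print(np.median(x / x.max()))          # a tiny number
# A robust reference point keeps the rest of the data legible:
print(np.median(x / np.median(x)))     # 1.0 by construction
```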
***
Digging a little deeper, we've got to note other key assumptions that the analyst must have made in producing this analysis -- and about which we are in the dark.
Are users with both Apple and Microsoft devices counted on both blue and orange lines?
How is "volume" of Web usage determined? Is it strictly number of ad impressions?
Why is total volume displayed? If Microsoft PCs outnumber Macs, and the chart shows the PC line well above the Mac line, does that speak to market share or to the usage pattern of the average user?
How representative is the traffic in the Chitika network?
How did the analyst deal with bot traffic?
***
Finally, using EST (Eastern Standard Time) rather than local time is silly. Think of it this way: if you extract only New York and California users and compare their curves, you can surmise, without even looking at the data, that you will see a similar shape time-shifted by approximately three hours. Ignoring the time difference leads to silly statements like this: "Both sets of users are most active during the workday, with usage volume dropping off in the late evening/early morning."
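A minimal sketch of that reasoning, with a made-up diurnal curve:

```python
import numpy as np

# Hypothetical usage curve in *local* time: one value per hour of the day.
local_curve = 50 + 40 * np.sin((np.arange(24) - 9) * np.pi / 12)

# If New Yorkers and Californians follow the same local-time pattern,
# their curves plotted against EST are circular shifts of each other:
ny_in_est = local_curve               # New York local time is EST
ca_in_est = np.roll(local_curve, 3)   # California lags EST by three hours

# Shifting California back by three hours recovers the shared pattern,
# which is the "same shape, shifted" prediction made without any data:
assert np.allclose(np.roll(ca_in_est, -3), ny_in_est)
```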
***
Andrew from the Chitika Insights research team here. Within the graph and text of every report, we try to make our indices as clear to readers as possible, so that they understand exactly what is being shown, along with what is not. That being said, there are always ways to improve, and admittedly, Daniel and you brought up some excellent points on how to better visualize these hour-by-hour data sets. As such, in the future, we will be employing the following:
- We will utilize hourly raw volume percentages observed over a time period, essentially dividing each hour's volume by the total volume. For example, if we observed roughly 10,000,000 Android impressions for the entire sample, and about 500,000 came at the 0 hour, then the 0 hour for Android would equal 5%. This method keeps the peak hour visible, but in the context of total daily usage.
- To visualize comparative volume, we will use an area chart to highlight what percentage of total relative traffic each data point represents. Again, using the recent data set as an example, if Android makes up 32% of relative traffic (i.e., combined iOS and Android) at the 0 hour, an area chart will show Android taking up 32% of that x-axis point, while iOS takes up the remaining 68%. (A rough sketch of both computations follows this list.)
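As a rough sketch of both computations, with made-up counts purely to illustrate the arithmetic:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up hourly impression counts over 24 hours (illustrative only).
android = rng.integers(200_000, 600_000, size=24).astype(float)
ios     = rng.integers(500_000, 1_500_000, size=24).astype(float)

# Method 1: divide each hour's volume by that platform's total volume.
# Each point is the share of the day's traffic falling in that hour, so
# the peak hour stays visible in the context of total daily usage.
android_pct = android / android.sum() * 100
ios_pct     = ios / ios.sum() * 100

# Method 2: each platform's share of combined traffic, hour by hour,
# for a stacked area chart. If Android is 32% of combined traffic at
# the 0 hour, its band occupies 32% of that x-axis point.
android_share = android / (android + ios) * 100
ios_share     = 100 - android_share
```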
Just to address some other points within your posts: the data utilized as part of our Insights studies are not collected for the purpose of debugging; rather, they are part of the basic information included within a user agent, which we catalog whenever a user loads a webpage containing our ad code, whether or not an ad actually appears. Indeed, we can only observe the activity occurring within our network of around 350,000 websites, and we are only one data source out of many. Regarding controls, while we can't say with absolute certainty that these usage patterns are representative of a typical day, the 14-day sample we averaged into one 24-hour period presented generally consistent usage patterns across the data set. Also, to clarify, these graphs show the usage pattern of the entire aggregate user base of that OS, rather than that of any given user or user type. Additionally, our back-end statistical processes do account for any "invalid traffic" from bots or similar sources, and we disregard it within our sample.
Again, thank you for the feedback. Our aim is always to make these aggregate usage trends as valuable and clear as possible, and these changes will help us better deliver on that moving forward.
Posted by: Andrew | Mar 28, 2014 at 02:12 PM
Hi Andrew, thanks for your note! My top advice: (a) be clear about what the interesting research question is, and (b) be clear about the assumed model of the relationship between OS/devices/etc. and the diurnal usage pattern.
Regarding "invalid" traffic, how do you know your processes removed all of them? This isn't a problem specific to your situation but is an industry wide issue. I don't think any of us can even estimate what proportion of such traffic is caught by these processes. In fact, I keep running into suspicious traffic every time I look at web data.
Great to hear you're making improvements.
Posted by: Kaiser | Mar 31, 2014 at 10:53 PM