Dissecting charts from a Big Data study
Mar 28, 2014
Today's post examines an example of a Big Data analysis, submitted by a reader, Daniel T. The link to the analysis is here. (On the sister blog, I discussed the nature of this type of analysis. This post concerns the graphical element.)
The analyst looked at "the influence of operating systems and device types on hourly usage behavior". This dataset satisfies four of the five characteristics in the OCCAM framework (link).
Observational: the data are ad impressions coming from the Chitika Ad Network observed between February 26 and March 11, 2014. This means users are (unwittingly) being tracked by cookies, pixels, or some other form of tracking device. The analyst did not plan this study and then collect the data.
Lacking Controls: There will be a time trend but what should we compare against? How do we know if something is out of the ordinary or not?
Seemingly Complete: Right up top, we are impressed with the use of "a sample of tens of millions of device-specific online ad impressions". At least they understand this is a sample, not everything.
Adapted: All weblog data are adapted in the sense that web logs originally serve web developers who are interested in debugging their code. Operating systems and device types are tracked because each variant of OS and device requires customization, and we need that data to understand how webpages render differently. I wrote about the adaptedness of this data in a separate blog post. (link)
The analysis did not require merging data, the fifth element of the framework.
***
Here is the chart type used to present the analysis. There are many problems.
The conclusion the analyst drew from the above chart is: "North American Android users are more active than their iOS counterparts late at night and during the majority of the workday." In other words, the analyst points out that the blue line sits on top of the orange line during certain times of the day.
Daniel is very annoyed with the way the data is processed, and rightfully so. The chart actually does not say what it appears to say. This is because of the use of indexing.
This simple chart is not so simple to interpret!
The difficulty is that each line is "indexed to self": every value is expressed as a percentage of that series' own peak-hour usage. For example, at 12 pm EST, Android users are at 75% of their peak-hour usage while iOS users are at two-thirds of their peak-hour usage. The trouble is the peak-hour usage by iOS users is more than 2.5 times as high as the peak-hour usage of Android users, so 100% blue is less than half of 100% orange by count.
Later in the same post, the analyst re-indexed both series to the iOS peak. This chart tells us that iOS users are more active no matter what time of the day.
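To make the arithmetic concrete, here is a minimal Python sketch. The absolute peak values are made up for illustration; only the 75%, two-thirds and 2.5x figures come from the discussion above, and I treat "more than 2.5 times" as exactly 2.5, which understates the gap.

```python
# Hypothetical peak-hour volumes in arbitrary units. Only their ratio
# (iOS peak = 2.5 x Android peak) is taken from the post.
ANDROID_PEAK = 100.0
IOS_PEAK = 2.5 * ANDROID_PEAK

# "Indexed to self": each series as a fraction of its own peak, at 12 pm EST.
android_self_index = 0.75
ios_self_index = 2 / 3

# Undo the self-indexing to recover volumes in the same arbitrary units.
android_noon = android_self_index * ANDROID_PEAK
ios_noon = ios_self_index * IOS_PEAK

# Re-index both series to one common base (the iOS peak).
android_common_index = android_noon / IOS_PEAK
ios_common_index = ios_noon / IOS_PEAK

print(f"self-indexed:   Android {android_self_index:.0%}, iOS {ios_self_index:.0%}")
print(f"common-indexed: Android {android_common_index:.0%}, iOS {ios_common_index:.0%}")
```

On the self-indexed scale Android sits above iOS at noon; on the common scale the order flips. The underlying volumes are identical in both views. Only the denominators differ.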
***
The Chitika analyst is not doing anything unusual. This type of indexing is a pandemic in Web analytics. The worst thing about it is that a lot of Web data is long-tailed and the maximum value is an outlier. Indexing data to an outlier isn't wise: a single extreme value sets the scale for every other point. (Typically, the index is used to hide the actual values of the data, usually to protect company secrets. But there are better ways to accomplish this.)
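Here is a small sketch of what goes wrong when the base is an outlier. The numbers are invented, and indexing to the median is just one possible alternative that still hides the raw values. I am not claiming it is the only or the best one.

```python
from statistics import median

# Made-up long-tailed series: six ordinary values and one extreme spike.
hourly_volume = [40, 42, 45, 43, 41, 44, 400]

def index_to(values, base):
    """Express each value as a percentage of the chosen base."""
    return [round(100 * v / base, 1) for v in values]

# Indexing to the maximum: the spike is 100 and everything else is squashed.
indexed_to_max = index_to(hourly_volume, max(hourly_volume))

# Indexing to the median: a typical value is 100 and the spike stands out.
indexed_to_median = index_to(hourly_volume, median(hourly_volume))

print("indexed to max:   ", indexed_to_max)
print("indexed to median:", indexed_to_median)
```

With the maximum as the base, the six ordinary values are crammed into a narrow band near the bottom of the scale. With the median as the base, they spread out around 100 and the outlier is visibly an outlier. Either way, the reader cannot recover the raw counts without knowing the base.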
***
Digging a little deeper, we've got to note other key assumptions that the analyst must have made in producing this analysis -- and about which we are in the dark.
Are users with both Apple and Microsoft devices counted on both blue and orange lines?
How is "volume" of Web usage determined? Is it strictly number of ad impressions?
Why is total volume displayed? If Microsoft PCs dominate Macs, and the chart shows the PC line well above the Mac line, is it speaking to market share or is it speaking to usage patterns of the average user?
How representative is the traffic in the Chitika network?
How did the analyst deal with bot traffic?
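On the total-volume question, one possible way to separate usage pattern from market share is to express each hour as a share of that platform's own daily total. The sketch below uses invented counts for four time slots; the function name `share_of_day` is mine, not the analyst's.

```python
def share_of_day(counts):
    """Convert hourly counts into each hour's share of the daily total."""
    total = sum(counts)
    return [c / total for c in counts]

# Made-up counts: the PC series is three times the Mac series in every slot.
pc_counts = [300, 600, 900, 600]
mac_counts = [100, 200, 300, 200]

pc_share = share_of_day(pc_counts)
mac_share = share_of_day(mac_counts)

print("PC share: ", pc_share)
print("Mac share:", mac_share)
```

Plotted as total volume, the PC line sits well above the Mac line, which speaks to market share. Plotted as shares, the two curves coincide, which says the time-of-day pattern is the same. This still does not give the usage of the average user, which would require user counts that the analysis does not report.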
***
Finally, using EST (Eastern Standard Time) rather than local time is silly. Think of it this way: if you extract only New York and California users and compare their curves, you can surmise, without even looking at the data, that you will see similar shapes time-shifted by approximately three hours. Ignoring the time difference leads to silly statements like this: "Both sets of users are most active during the workday, with usage volume dropping off in the late evening/early morning."
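As a rough sketch of the fix, the snippet below converts an Eastern-clock timestamp to each user's local hour before any hourly bucketing. It assumes every impression carries the user's time zone, which the analysis does not confirm, and it uses fixed standard-time offsets, so daylight saving time is ignored. The date is an arbitrary one inside the study window.

```python
from datetime import datetime, timedelta, timezone

# Fixed standard-time offsets (an assumption; no daylight saving handling).
EASTERN = timezone(timedelta(hours=-5), "EST")
PACIFIC = timezone(timedelta(hours=-8), "PST")

def local_hour(stamp, user_tz):
    """Return the hour of day on the user's own clock."""
    return stamp.astimezone(user_tz).hour

# One impression logged at noon on the Eastern clock.
logged = datetime(2014, 3, 3, 12, 0, tzinfo=EASTERN)

new_york_hour = local_hour(logged, EASTERN)
california_hour = local_hour(logged, PACIFIC)

print("New York local hour:  ", new_york_hour)
print("California local hour:", california_hour)
```

The same noon-EST impression lands at 12 for a New York user and at 9 for a California user. Bucketing by the local hour is what lets "the workday" mean the same thing for both.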