Avinash Kaushik's masterful post on the "multi-channel attribution problem" in Web analytics is required reading for anyone seeking an understanding of what Big Data is really about. Kaushik's posts are marathons; I provide here a little background, plus some highlights from his post to save you some time. But you absolutely should read the whole thing!
I will start from the elementary. Big Data is big because the Internet was set up in such a way that there are extensive log files (Weblogs) documenting details of how information flows from point A to point B. When you click on a link in your browser, it sends a bunch of requests to a remote server, asking for the elements (images, text, buttons, etc.) that form a Web page. The remote server keeps a log of all such requests from all browsers everywhere around the world. One of these requests goes to the Facebook server (this is why Facebook blankets huge numbers of pages with its Like buttons). The Facebook server then has a record of which browsers asked for which pages; in other words, Facebook knows you navigated to that page. Something called a cookie is involved here; the cookie identifies your browser.
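To make this concrete, here is a minimal sketch of what one Weblog entry might look like and how the cookie is pulled out of it. The log format, field names, and values are all hypothetical; real log formats vary by server configuration.

```python
import re

# One hypothetical entry from a web server access log, in a simplified
# Apache-style combined format with a cookie field appended at the end.
log_line = (
    '203.0.113.7 - - [12/Feb/2014:10:15:32 +0000] '
    '"GET /flowers/roses.html HTTP/1.1" 200 5123 '
    '"https://www.google.com/" "visitor_id=abc123"'
)

# Fields: IP address, timestamp, request, status code, response size,
# referrer (where the visitor came from), and the cookie that
# identifies the browser across visits.
pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]+)" (?P<status>\d+) (?P<size>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<cookie>[^"]*)"'
)

record = pattern.match(log_line).groupdict()
print(record["cookie"])    # the browser identifier
print(record["referrer"])  # where the visitor came from
```

Multiply one line like this by every request from every browser in the world, and you have Big Data.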
Let's say you buy flowers at FTD.com. The retailer uses Google Analytics to track user movement on its website. (This is a hypothetical example; I don't know what tools they actually use.) This means that some computer code is inserted into the web pages at FTD.com, and each time your browser requests a page, a call is made to the Google Analytics server with your cookie information.
The point is that lots of data reside in these Weblogs. They record when you visited which Web page. Eventually, you may end up buying flowers - that is known as an "action". If you purchased flowers, the FTD.com marketing team wants to understand how you arrived at that decision. They assume that the path by which you ended up on the purchase web page provides hints as to why you bought flowers that day. Note that I used the words "assume" and then "hints".
Here is an example of a path analysis:
Take a look at the first row, which represents all the recorded interactions that took place prior to a user making the purchase. Reading from right to left, you move backward in time from the most recent interactions. The "action" of purchasing is off the chart on the right side. The five interactions immediately preceding the purchase are classified as "direct", which basically means we don't know how the user arrived at the site. The next, earlier interaction was traced to "organic search", meaning someone came to the website from (most likely) Google, and (most likely) by searching directly for our company name. Looking further back, we also see a social network generating leads, and six more unknown interactions.
Marketers stare at this diagram and impose a causal explanation on it. Every interaction is assumed to be a contributor to the final purchase, albeit with different levels of importance. (In Chapter 3 of Numbersense, I caution that such a path is not causal. To see this, think counterfactually. If for example you remove Social Network spending, so that the purple arrows disappear, would the purchase still have occurred? If yes, the Social Network interaction had zero causal effect.)
The concept of "attribution" is to distribute credit among one or more of these prior interactions. Kaushik walks through a bunch of models that can be used to divide credit.
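The simplest of these models can be sketched in a few lines. The channel names below are illustrative, and the three models shown (last-interaction, first-interaction, linear) are common examples of the genre, not necessarily the exact set Kaushik walks through.

```python
# One conversion path, ordered earliest to latest touchpoint.
# Channel names are hypothetical.
path = ["social", "organic_search", "direct", "direct"]

def last_interaction(path):
    # All credit goes to the final touchpoint before the purchase.
    return {path[-1]: 1.0}

def first_interaction(path):
    # All credit goes to the touchpoint that started the path.
    return {path[0]: 1.0}

def linear(path):
    # Credit is split equally across all touchpoints; a channel that
    # appears twice accumulates two shares.
    credit = {}
    share = 1.0 / len(path)
    for channel in path:
        credit[channel] = credit.get(channel, 0.0) + share
    return credit

print(last_interaction(path))   # {'direct': 1.0}
print(first_interaction(path))  # {'social': 1.0}
print(linear(path))             # {'social': 0.25, 'organic_search': 0.25, 'direct': 0.5}
```

Notice that the same path yields three entirely different verdicts about which channel "drove" the purchase. Nothing in the data tells you which model is right.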
Kaushik then points out that analyzing the above figure (especially when it has thousands of other rows) is a waste of time: "There are too many paths, and you can't actually control the path that a potential customer can take."
While Kaushik dives into the mechanics, here are some high-level takeaways from his post:
- The attribution problem has no intrinsically correct answer. There is no single right way to divide the credit; every scheme is subjective.
- Many decisions affect the attribution outcomes: which sources are credited and which are not, which positions in the path are privileged and which are not, the time window for eligibility, and what counts as an "action" and what doesn't. Different decisions lead to different attribution schemes.
- Having more data creates more complexity but does not reduce subjectivity. On the contrary, more data creates more levers, resulting in more assumptions.
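To see how a single lever changes the outcome, consider the eligibility time window. The sketch below uses hypothetical touchpoints and dates; it applies the same linear split as before, but only to interactions inside a chosen lookback window.

```python
from datetime import date

# Hypothetical touchpoints and dates; the purchase is on 2014-02-12.
touches = [
    ("social",         date(2014, 1, 5)),
    ("organic_search", date(2014, 2, 1)),
    ("direct",         date(2014, 2, 10)),
]
purchase_day = date(2014, 2, 12)

def linear_credit(touches, window_days):
    # Only interactions within the lookback window are eligible for
    # credit; everything earlier is simply discarded.
    eligible = [channel for channel, day in touches
                if (purchase_day - day).days <= window_days]
    if not eligible:
        return {}
    return {channel: round(1.0 / len(eligible), 2) for channel in eligible}

print(linear_credit(touches, 30))  # social falls outside the window
print(linear_credit(touches, 90))  # all three channels share credit
```

With a 30-day window, the social interaction gets nothing; with a 90-day window, it gets a third of the credit. Same data, same model, different answer - which is exactly the point about subjectivity.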
See Kaushik's full post here.
PS. Let me make a small clarification. I'm not against subjectivity. I'm saying every attribution scheme is subjective. One can spend time arguing over the details of one scheme versus another, but the new scheme is also subjective. See the section on "Perversion of Measurement" in Chapter 2 of Numbersense for further discussion of this important point.