A co-worker pointed me to a Reddit thread that someone put up to showcase a small data-science side project - to establish whether NBA star James Harden's performance is affected by his "affinity for ... strip clubs" (hereafter, SC).
The analytical plan used in this project is typical of many data science projects. In this post (and possibly a few more), I raise issues that analysts should ponder when conceptualizing this type of analysis. These comments are designed to help those who are interested in advancing the sophistication of their analytical work. I am not re-analyzing the data, or debunking what's been done. I am impressed by what "AngryCentrist" has put together - starting with how AC gathered the data to how AC executed an analytical plan resulting in some conclusions.
***
Let me summarize the analytical game plan as follows (you can read the gory details in this Google doc):
- The outcome variable is James Harden's on-court performance, expressed on an inverse scale (i.e. the larger the metric, the worse his performance). The explanatory variable is the quality rating of SCs in the cities where Harden played away games. The analysis looks at the correlation between these two variables.
- Only the last four seasons of away games in which Harden played are used. In each season, there are usually one or two away games against any given team. Team and city are synonymous except for Los Angeles, which fields two teams. (Brooklyn and New York are treated as separate cities.)
- On-court performance is measured as the number of metrics on which Harden's in-city performance is worse than his season average. Only six metrics (e.g. turnovers) are considered over four seasons, so this performance metric is a number between 0 and 24. Zero means Harden out-performed his season average for each of the 6 metrics over each of the 4 seasons. By contrast, if Harden did worse than his season average for every metric over those four seasons, then the performance metric attains the maximum value of 24. The redditor, AngryCentrist, called this the "number of sub-par performances".
- SC ratings are based on the first 10 results with customer ratings returned by a Google search of the form "[city] strip clubs".
- The analyst concluded that the two variables were positively correlated based on fitting a linear regression model. Higher SC quality rating is associated with worse on-court performance. The r-squared was 20 percent (which is not impressive).
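To make the game plan concrete, here is a minimal sketch of how a "number of sub-par performances" count and an r-squared could be computed. This is not AC's code. The metric names other than turnovers are placeholders, the direction of each metric is my assumption, and every number is invented for illustration.

```python
# Sketch only: metric names other than "turnovers" are placeholders,
# and every number is invented for illustration.

# True means a higher value is better; False means a lower value is better.
# These directions are assumptions, not taken from the original analysis.
HIGHER_IS_BETTER = {
    "turnovers": False,
    "metric_2": True,
    "metric_3": True,
    "metric_4": True,
    "metric_5": True,
    "metric_6": True,
}


def count_subpar(season_avgs, city_values):
    """Count (season, metric) pairs where the in-city value is worse than
    the season average. With 6 metrics and 4 seasons the result is 0..24."""
    count = 0
    for avg, city in zip(season_avgs, city_values):
        for metric, higher_better in HIGHER_IS_BETTER.items():
            if higher_better:
                worse = city[metric] < avg[metric]
            else:
                worse = city[metric] > avg[metric]
            count += int(worse)
    return count


def r_squared(x, y):
    """Squared Pearson correlation, which equals the r-squared of a
    one-predictor least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Four identical made-up seasons, for brevity.
season = {m: 5.0 for m in HIGHER_IS_BETTER}
all_worse = {m: (4.0 if hb else 6.0) for m, hb in HIGHER_IS_BETTER.items()}
all_better = {m: (6.0 if hb else 4.0) for m, hb in HIGHER_IS_BETTER.items()}

worst_case = count_subpar([season] * 4, [all_worse] * 4)
best_case = count_subpar([season] * 4, [all_better] * 4)
print(worst_case, best_case)
```

The two extreme cases reproduce the endpoints of the scale described above. In the real analysis, one such count per city would be paired with that city's average SC rating and passed to something like `r_squared`.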
***
For this post, I focus on the explanatory variable, the SC average rating, and the concept of "noise" in our data.
The SC average rating is an example of an indirect metric, a "noisy" proxy for what we want to measure. Notice that the outcome variable is about a particular person (Harden) while the explanatory variable measures a large set of people, of which Harden is allegedly a member.
The "noise" comes from several sources. If Harden was not a member of the set, then the explanatory variable has no direct connection with the outcome. So we hope that Harden not only visited SCs in all those cities, but also reviewed his experiences on Google, thus contributing to the average ratings.
If Harden did enjoy himself at the SCs and submitted reviews to Google, his ratings would still have been immaterial, just one vote in the average. Since the analysis only extracted "top 10" SCs based on a Google search, one expects the selected SCs are the more popular ones, which means that Harden's ratings would be even more insignificant.
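A bit of arithmetic shows how little one vote matters. The sketch below assumes a club's rating is a plain mean of its reviews and that the city rating is a plain mean over 10 clubs; the review counts and ratings are hypothetical.

```python
# Sketch only: assumes simple unweighted averages; all numbers are hypothetical.

def shift_from_one_rating(n_reviews, current_avg, new_rating):
    """Change in a club's average rating when one more review is added.

    New average = (n * avg + r) / (n + 1), so the shift is (r - avg) / (n + 1).
    """
    return (new_rating - current_avg) / (n_reviews + 1)


# A popular club with 499 existing reviews averaging 4.0 stars
# receives a single 1-star review.
club_shift = shift_from_one_rating(499, 4.0, 1.0)

# That club is one of 10 in the city-level average.
city_shift = club_shift / 10

print(f"club average moves by {club_shift:.4f} stars")
print(f"city average moves by {city_shift:.5f} stars")
```

Even a harsh review of a well-reviewed club barely registers at the club level, and the effect shrinks by another factor of ten once the club is averaged with nine others.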
If we don't require Harden's ratings to be included in these SC averages, we would at least require his ratings - if revealed - to be highly correlated with the average ratings among Google reviewers. This assumption has two components - Harden would have to patronize the popular SCs, rather than ones that are more exclusive or discreet; and Harden's opinion of his experiences there would have to be similar to that of the average patron.
Let's pretend that Harden visited one club prior to playing an away game in Miami. The average rating of the top 10 clubs in Miami is then not quite up to the task. That specific club might not even be part of the ten. This gap is also part of the "noise" of indirect measurement.
Timing is yet another source of noise. What if Harden engaged in the adult activity after the game rather than before the game?
Timing enters the picture in another way. Adult entertainment venues come and go, just like any other business. If Harden had visited a club before a match early in the NBA season, and that club subsequently shut down, then this club would not feature in the Google search, and would be absent from the average rating. Conversely, new clubs that opened after Harden's visit would enter the average rating if they became popular enough. Either of these situations causes the indirect metric to stray from its proper value.
***
Here's the big picture. The indirect measure of average quality rating of SCs is a good proxy only under a large set of presumptions. None of the above conditions is directly observable, so each is presumed.
In general, the further the measurement is from what it is supposed to measure, the noisier the measurement.
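A small simulation illustrates the cost of a noisy proxy. It is a sketch under a strong assumption: the proxy equals the true quantity plus independent random error, which is only one of the ways the SC rating could go wrong. The data are simulated and have nothing to do with Harden.

```python
import random

# Sketch only: simulated data; the proxy is modeled as truth plus
# independent Gaussian error, which is an assumption.


def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# The quantity we wish we could measure, and an outcome that depends on it.
true_x = [rng.gauss(0, 1) for _ in range(n)]
outcome = [x + rng.gauss(0, 1) for x in true_x]


def noisy_proxy(values, noise_sd):
    """Return the values with independent measurement error added."""
    return [v + rng.gauss(0, noise_sd) for v in values]


r_true = pearson(true_x, outcome)
r_mild = pearson(noisy_proxy(true_x, 0.5), outcome)
r_heavy = pearson(noisy_proxy(true_x, 2.0), outcome)

print(f"direct measure:       r = {r_true:.2f}")
print(f"mildly noisy proxy:   r = {r_mild:.2f}")
print(f"heavily noisy proxy:  r = {r_heavy:.2f}")
```

The underlying relationship is identical in all three runs; only the measurement gets worse, and the observed correlation shrinks with it. A weak correlation from a noisy proxy therefore cannot distinguish between a weak effect and a poorly measured one.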
A workaround is to change our theory of the link between SC rating and Harden's performance. Maybe the link does not require physical visits to these establishments and subsequent reviews. Perhaps it is Harden's knowledge of a vibrant SC scene, regardless of active participation, that is doing the trick. This link, though, cannot be proven or disproven by the data, and thus is also presumed.
Assumptions are not avoidable. Verifiable assumptions are better than unverifiable ones. A good analysis plan should make the fewest and weakest assumptions.