In a mailing list I subscribe to, some users were unhappy about academic research that used Dropbox data collected on academics, as written up by the co-authors in Harvard Business Review (link).
In a nutshell, the researchers obtained "anonymized" "project-folder-related" data from Dropbox on university-affiliated accounts, ran some simplistic bivariate correlations, and proceeded to draw several conclusions about "best practices" for "successful team collaborations." This type of research is very common in this "Big Data" age, and I have already written extensively about its many challenges.
This Dropbox dataset has all five descriptors of the OCCAM characteristics of "Big Data". It is Observational, seemingly Complete, with no Controls, Adapted from its original use, and Merged (with data from Web of Science). These characteristics cause many problems with the analysis, which I describe below. For more on OCCAM data, check out this post, and my other posts on OCCAM data.
Implicit in their analysis - and in most other uses of "found data" - is the assumption of complete data. The authors believe that because their data consist of tens of thousands of researchers and 500,000 projects, they must have all the data. In this case, the authors knew that there are other platforms out there, but they waved away the inconvenience. This implies they believe they have all of the "informative" data.
The authors also assumed that (a) all relevant collaborative research involves putting all relevant files on one of the major online platforms (e.g. nothing on Slack or emails) and (b) all project collaborators use one and only one platform. Further, they assume that everyone keeps highly organized and structured folder directories within Dropbox, from which an external person who knows nothing about the projects, or a machine, can infer their contents. These problems arise because the researchers did not start with a research question and design the data collection. They chose to adapt "found data" to their own objectives.
Ecological Fallacy, and Story Time
A typical conclusion is "People at higher-performing universities seemed to share work more equally, based on the frequency of instances that collaborators accessed project folders." Top universities are ranked by the aggregate performance of all teams (that use Dropbox, and are identified correctly). It does not follow that every team at a top university is a top team; attributing an aggregate pattern to the individual teams within the aggregate is the ecological fallacy.
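The gap between aggregate and team-level conclusions can be shown with a toy example (the numbers below are made up purely for illustration; they have nothing to do with the actual study):

```python
# Hypothetical illustration of the ecological fallacy:
# a university's aggregate ranking says little about any one of its teams.
uni_a = [90, 85, 40, 95]   # made-up team scores at "top-ranked" university A
uni_b = [70, 72, 68, 71]   # made-up team scores at university B

avg_a = sum(uni_a) / len(uni_a)   # 77.5
avg_b = sum(uni_b) / len(uni_b)   # 70.25

assert avg_a > avg_b              # A ranks higher in aggregate...
assert min(uni_a) < min(uni_b)    # ...yet A's weakest team trails every team at B
```

So even if universities were ranked correctly in aggregate, any conclusion of the form "teams at top universities do X" is not supported without team-level analysis.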
This is an instance of "story time." A piece of data is offered about something related; then, while the reader is dozing off, it is linked to a conclusion that is not directly supported by the data. That conclusion is then elaborated with a lot of words hypothesizing why it must be true. In this case, they say "It’s likely that more frequent collaborations led to positive spillover of information, insights, and team dynamics from one project to another." But they provide no evidence at all for this last statement. That's just a story.
Here's another one of the conclusions: "People at higher-performing universities seemed to share work more equally, based on the frequency of instances that collaborators accessed project folders." They draw a conclusion about work allocation among collaborators, but what they actually measured was the relative frequency with which collaborators accessed the project folders in Dropbox. That's a proxy measure - a convenient one given the "found data" - but not a good proxy.
It does not appear that a multiple regression model was run. The presentation apparently walks through a series of bivariate analyses; the word "control" does not appear in the entire article. So this work suffers from xyopia: in each analysis, the one explanatory variable being analyzed is presumed to be the chief and only variable that influences the outcome.
Causation Creep, and More Story Time
The authors made no attempt to establish causality; they simply interpreted every correlation as causal. So every conclusion is "story time": they present one analysis of the data, then draw a causal conclusion that one would believe only when half-asleep.
People are also upset about data privacy.
- It does not appear that the academic users understood that using Dropbox means they could become subjects of research studies.
- People don't believe the data are truly anonymized; it's pretty clear that the anonymization can easily be reversed. Just take the HBR article for example. If they removed the names of the authors but described them as one junior faculty member at Northwestern Univ. Business School, one senior faculty member at Northwestern Univ. Business School, and one employee of Dropbox, I don't think you can find another article that fits those criteria. So is that anonymous?
- It's unclear how they "anonymized" the folders or analyzed them. There are folders with highly descriptive names, there are folders with partially descriptive names that only the collaborators may be able to decipher, and there are folders with names that do not identify the project (e.g. old_work). If they converted all folder names to alphanumeric strings, then all information about the contents of the folders is lost; if they didn't, then there are clearly privacy concerns.
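The re-identification worry above boils down to quasi-identifiers: even with names stripped, a combination of seemingly harmless attributes can match exactly one person. A minimal sketch, using the HBR byline example (the records and field names are invented for illustration):

```python
# Hypothetical "de-identified" records: names removed, attributes kept.
records = [
    {"rank": "junior faculty", "affiliation": "Northwestern"},
    {"rank": "senior faculty", "affiliation": "Northwestern"},
    {"rank": "employee",       "affiliation": "Dropbox"},
    {"rank": "senior faculty", "affiliation": "MIT"},
]

def matches(quasi_ids):
    """Return every record consistent with the given attribute values."""
    return [r for r in records if all(r[k] == v for k, v in quasi_ids.items())]

# Two harmless-looking attributes pin down exactly one record:
hits = matches({"rank": "junior faculty", "affiliation": "Northwestern"})
assert len(hits) == 1   # unique match => the "anonymized" record is re-identifiable
```

Anonymization that leaves such combinations unique is anonymization in name only.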
It's clear that some kind of IRB review is necessary for Big Data research projects, to make sure that privacy is protected.