



Great article!

Thank you for explaining the process of data scraping in such simple terms. I would like to add another perspective to what you said above.

Idea 1: The need for aggregators when there is missing data and lack of resources

In many countries with a federal system, the union and state governments handle different subject matters. For example, health in India is a state subject, whereas education is an area of overlap between the two. This can leave data scattered across many different sources. In some cases, official authorities may not publish all the data at once: the number of public schools or hospitals in each pincode, for instance, may only be accessible through a paginated interface. This makes it difficult for researchers or data enthusiasts to get a complete picture. In these cases, aggregators can play a valuable role by collecting data from different sources and making it available in a single place.

Sometimes, the reason why official authorities do not publish certain data is because it is expensive or difficult to collect and maintain. For example, it may be costly to call an API millions of times to get all the data you need. Or, the data may not be well-collected or structured, making it difficult to collate into a meaningful picture. In these cases, aggregators can also help by providing a more user-friendly interface for accessing the data.
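The aggregation described above can be sketched in a few lines. This is a minimal illustration, not a real client: the page structure, field names, and records below are all invented stand-ins for a hypothetical paginated source.

```python
# Sketch of how an aggregator might collate a paginated source into one
# dataset. The endpoint and its page layout are hypothetical; real sources
# differ in authentication, rate limits, and schema.

def fetch_page(page):
    # Stand-in for an API call such as GET /hospitals?page=N.
    # Three fake pages of records keep the sketch runnable offline.
    fake_source = {
        1: [{"pincode": "110001", "hospitals": 12}],
        2: [{"pincode": "110002", "hospitals": 7}],
        3: [],  # an empty page signals the end of the data
    }
    return fake_source.get(page, [])

def aggregate_all():
    records, page = [], 1
    while True:
        batch = fetch_page(page)
        if not batch:  # stop when the source runs out of pages
            break
        records.extend(batch)
        page += 1
    return records

dataset = aggregate_all()
print(len(dataset))  # 2 records collated from separate pages
```

The loop is trivial, but running it millions of times against a real API is exactly the cost the paragraph above describes, which is why a single collated file from an aggregator is so much friendlier to work with.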

Idea 2: Who owns the data?

Another important issue to consider is the ownership of data. In the example you gave, the NGO that collects data on the likelihood of a particular disease may own the data. However, if it shares the data with the government, which then uses it to make policy decisions, the government may also be considered a data owner. This is a complex issue with no easy answer. Ultimately, it is up to the data owners to decide who they want to share their data with and how it can be used, but people rarely know all the possible consequences of agreeing to share their data. Often, consent is just a tick on a screen, and most people don't give it much thought.

I would love to hear more about your thoughts on this topic.


R: Thanks for the thoughtful comment. On Idea 1, merging data is a separate issue from scraping. If the separate entities offered datasets for public use, you'd download the datasets from all the sources, and you would still have to merge them because, as you correctly said, these datasets are typically not standardized.
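The merging step can be sketched as follows. The field names, state spellings, and figures are all invented for illustration; the point is only that each source needs normalizing before records can be joined on a common key.

```python
# Sketch of merging two non-standardized downloads on a shared key.
# Both datasets and their schemas are hypothetical examples.

health = [{"State": "Tamil Nadu", "hospitals": 1250}]
education = [{"state_name": "tamil nadu", "schools": 37000}]

def norm(name):
    # Normalize the join key: the two sources spell and case it differently.
    return name.strip().lower()

merged = {}
for row in health:
    merged[norm(row["State"])] = {"hospitals": row["hospitals"]}
for row in education:
    merged.setdefault(norm(row["state_name"]), {})["schools"] = row["schools"]

print(merged["tamil nadu"])  # {'hospitals': 1250, 'schools': 37000}
```

Even in this toy case, the join only works after hand-written normalization of the key, which is the extra work a data enthusiast inherits whenever sources don't publish to a common standard.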

My point is that if the data owner is happy for others to use the data, then it is simpler for the data owner to offer a text file, and much simpler for the data enthusiast to use the text file.

Now, and I think you're getting at this, there may be public-interest projects in which one creates social value by scraping data in defiance of the owner. For example, you'd have to scrape Amazon to prove that it practices discrimination by geography. This supports the gist of the post: it is an example of scraping against the will of the website owner!

I'm particularly worried about a slippery slope. Some questionable practices are being justified by singular use cases that generate social value, while the majority of use cases don't create such value. The former becomes a kind of Trojan horse.

Idea 2 is very important. Facebook, for example, claims ownership of the data on its platform; however, it also claims zero liability for the harm done by such data while profiting from it. It's a tricky area that requires legal expertise, so I don't have much more to say about it.


Kaiser Fung. Business analytics and data visualization expert. Author and Speaker.