There really is not much more to say than to quote the first sentence of this Bloomberg article (link):
Facebook owner Meta Platforms Inc. for years paid a contractor to scrape data from other websites while publicly condemning the practice and suing companies that pulled data from its own social-media platforms.
***
Web scraping is a very strange industry, one that operates in the shadows. By the end of this post, I hope you'll appreciate why the scrapers are hiding.
Let's trace how data land on websites. I'm going to use an example of baseball statistics from Baseball-Reference.com. Here is what one set of data looks like to human eyes:
The browser does not see it this way. This is what it sees (HTML):
Each line is a row of data, and these lines extend far to the right. Here is the next part:
You start to see the numbers embedded among the HTML tags here (49, 27.3, ...).
Not surprisingly, the HTML format is not efficient for storing data. The raw data are stored in databases, and there is a process that extracts the data out of those databases and transforms them into HTML format. The transformation essentially adds formatting to let the browser know how to display the data.
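Here's a rough sketch in Python of that forward step, using a made-up SQLite database file and table name just to show how values get wrapped in tags:

```python
# A minimal sketch of the forward step: rows pulled from a (hypothetical)
# database get wrapped in HTML tags so the browser knows how to display them.
import sqlite3

conn = sqlite3.connect("stats.db")  # assumed database file
rows = conn.execute("SELECT player, wins, era FROM pitching").fetchall()

html = ["<table>", "  <tr><th>Player</th><th>W</th><th>ERA</th></tr>"]
for player, wins, era in rows:
    # a missing value (NULL in the database) becomes a blank cell here
    cells = ["" if v is None else v for v in (player, wins, era)]
    html.append("  <tr>" + "".join(f"<td>{v}</td>" for v in cells) + "</tr>")
html.append("</table>")
print("\n".join(html))
```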
The act of "web scraping" is to grab the HTML-formatted files, then reverse the above process - i.e. strip out the formatting, and reduce the information back to something that looks like an Excel spreadsheet.
***
The mystery is why this reverse process is needed.
What the scraper wants is the original dataset without the formatting. You'd think they could just ask for it directly.
In fact, some websites encourage others to use their data, so they create tools to distribute them. For example, Baseball-Reference.com has a tab called "Share & Export", with these options:
The key point is that all these options place the data directly into the users' hands. There is no scraping, no stripping of tags, etc.
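To see the contrast, here is all the "work" left when a site hands you a CSV export directly (the file name below is simply whatever you saved the export as):

```python
# When the site offers the data directly (e.g. a "Get table as CSV" export),
# there is nothing to reverse: read the file and you are done.
import pandas as pd

pitching = pd.read_csv("pitching_export.csv")
print(pitching.head())
```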
This means that if scraping is necessary, it's usually against the website's wishes. Many websites (apparently including Meta/Facebook) want to "own" their data. These websites typically put up lots of barriers to impede scrapers, so the process is not as smooth as the one I described above. For example, they can easily differentiate between a reader browsing the site and a scraper pulling down every page of every table - and they block the latter. They may use a variety of tools to hide the information from scrapers. (You may have come across webpages that make it impossible to copy data out of tables using Ctrl-C.) They may restrict the number of pages that can be viewed within a time window.
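To give a flavor of one such barrier, here is a toy sketch of per-client rate limiting; the thresholds are invented:

```python
# A toy sketch of one common barrier: capping how many pages a single client
# may request within a time window. The thresholds are invented.
import time
from collections import defaultdict

WINDOW_SECONDS = 60
MAX_REQUESTS = 30                      # roughly what a human reader might do
recent_hits = defaultdict(list)        # client IP -> timestamps of recent requests

def allow_request(client_ip: str) -> bool:
    now = time.time()
    recent_hits[client_ip] = [t for t in recent_hits[client_ip]
                              if now - t < WINDOW_SECONDS]
    if len(recent_hits[client_ip]) >= MAX_REQUESTS:
        return False                   # looks like a scraper: block or throttle
    recent_hits[client_ip].append(now)
    return True
```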
There is a bit of an arms race between scrapers and these data hoarders.
Pretty much all the big tech companies (Meta, Google, Microsoft, LinkedIn, Amazon, etc.) don't want scrapers. Twitter was an exception, although that is quickly changing. That's why scrapers operate in a gray area. It may also be seen as rebellious - upholding a belief that data should be free, or that data should be owned by the people who create them.
***
From a technical perspective, data analysts should prefer to take the data directly out of databases, rather than processing scraped files. The reason is that the insertion of formatting is not always a reversible process. Take for example the following excerpt from the table of pitching statistics:
Notice that on line 9 (Chase Anderson), there is a blank space under the column "GmScA." In a proper database, missing values are not blanks but are usually represented by symbols like "." or NA. When the data are ingested into the HTML table, those missing-value indicators are removed and replaced by blanks, which look better in the browser.
However, the existence of such blank spaces may confuse the scraping tool. As a result, the data might be shifted, as if the blank cell did not exist. It appears easy to spot when I highlight a specific instance like this. But larger datasets may be spread out over dozens if not hundreds of tables, and the missing values can appear anywhere. Another example of an annoying feature of HTML tables is footnote symbols printed next to the data, e.g. 45.6#. Such formatting is for browser presentation and would not exist in the underlying database.
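To make the cleanup concrete, here is a small sketch with invented cell values mirroring those two problems:

```python
# A small sketch of the cleanup a scraper has to do. The cell values are
# invented: a blank cell that must not shift its neighbours, and a footnote
# symbol ("#") that is presentation only and not part of the underlying data.
raw_cells = ["Chase Anderson", "49", "", "27.3", "45.6#"]  # text of the <td> tags

cleaned = []
for cell in raw_cells:
    if cell == "":
        cleaned.append(None)               # keep the hole so columns stay aligned
    else:
        cleaned.append(cell.rstrip("#*"))  # strip footnote markers like '#'

print(cleaned)  # ['Chase Anderson', '49', None, '27.3', '45.6']
```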
Other than working around obstacles planted by websites, this back-and-forth transformation has little value and may introduce impurities.
Great article!
Thank you for explaining the process of data scraping in such simple terms. I would like to add another perspective to what you said above.
Idea 1: The need for aggregators when there is missing data and lack of resources
In many countries with a federal system, the union and state governments handle different subject matters. For example, health in India is a state subject, whereas education is an overlap. This can lead to a lot of data being scattered across different sources. In some cases, official authorities may not even publish all the data at once - such as the number of public schools or hospitals in each pincode - or they may restrict access through pagination. This can make it difficult for researchers or data enthusiasts to get the complete picture. In these cases, aggregators can play a valuable role by collecting data from different sources and making them available in a single place.
Sometimes, the reason why official authorities do not publish certain data is because it is expensive or difficult to collect and maintain. For example, it may be costly to call an API millions of times to get all the data you need. Or, the data may not be well-collected or structured, making it difficult to collate into a meaningful picture. In these cases, aggregators can also help by providing a more user-friendly interface for accessing the data.
Idea 2: Who owns the data?
Another important issue to consider is the ownership of data. In the example you gave, the NGO that collects data on the likelihood of a particular disease may own the data. However, if they share the data with the government, who then uses it to make policy decisions, the government may also be considered a data owner. This can be a complex issue, and there is no easy answer. Ultimately, it is up to the data owners to decide who they want to share their data with and how it can be used, but people often don't know all the possible consequences of agreeing to share the data. Often, it's just a tick on a screen, and most people don't give it much thought.
I would love to hear more about your thoughts on this topic.
Posted by: Rudresh | 09/01/2023 at 03:53 AM
R: Thanks for the thoughtful comment. On Idea 1, merging of data is a separate issue from scraping. If the separate entities were to offer datasets for public use, you'd download the datasets from all the sources, and then you'd still have to perform merging because, as you correctly said, these datasets typically are not standardized.
My point is that if the data owner is happy for others to use the data, then it is simpler for the data owner to offer a text file, and much simpler for the data enthusiast to use the text file.
Now - and I think you're getting at this - there may be public-interest projects in which one creates social value by scraping data in defiance of the owner. For example, you'd have to scrape Amazon to prove that it is practicing discrimination by geography. This supports the gist of the post - it is an example of scraping against the will of the website owner!
I'm particularly worried about a slippery slope. Some questionable practices are being justified by singular use cases that generate social value, while the majority of use cases don't create such value. The former becomes a kind of trojan horse.
Idea 2 is very important. Facebook for example claims ownership of the data on its platforms; however, it also claims zero liability for the harm done by such data while profiting from them. It's a tricky area that requires legal expertise so I don't have much more to say about it.
Posted by: Kaiser | 09/04/2023 at 10:09 PM