



Ethics are as important as both thoughtful, honest analyses and great visualizations. It happens to be against the Amazon Terms of Service to scrape IMDB and encouraging that act seems like a bad idea. Just because you can do something doesn't mean you should. (Strike #3?) IMO we need to teach ethics alongside analysis.

Ultimately it's still your choice or a reader's choice to risk violation of Terms of Service/Use. LinkedIn sued real people last year for scraping and there are other lawsuits from other sites that can be referred to as well. Better to ask permission and/or err on the side of ethics vs risk the consequences.


H: Thanks for your comment. I have opinions about "scraping", which I will put in a future post, now prompted by your note. In short, I have many concerns about this practice.

That said, if you look at the curricula at many universities, including Columbia and NYU, one of the first things they teach in a Python or dataviz course is web scraping! I happen to find this situation rather sad as I don't think scraping is at the core of those subjects.

One additional consideration is that the T&Cs are (probably) aimed at competitive businesses who are using the scraped data against Amazon. I am guessing (although I have not read the Terms) that there may be exceptions for using the data in an educational setting. It's a complicated issue so I will discuss it in a separate post.


They don't make exceptions, but do say that they will allow non-commercial use on application. My thought is that if there isn't a commercial motive then it is very unlikely that they would try to do anything, and just as unlikely that they would actually achieve anything given the rather loose contract between the scraper and the website owner. After all, I'm just grabbing something that is publicly available to do some processing of my own. It's a bit different if I try to sell it, or produce something from the data that I sell, as then there are copyright problems.

One problem that may occur is with ethics approval. I think some institutions have IRBs that feel that, because the original material was created by a person, there should be an ethics review.

John Hall

Hollywood Economics by Arthur De Vany suggests that it will be very hard to predict box office results.

Kaiser Fung. Business analytics and data visualization expert. Author and Speaker.