



Ethics are as important as both thoughtful, honest analyses and great visualizations. It happens to be against the Amazon Terms of Service to scrape IMDB and encouraging that act seems like a bad idea. Just because you can do something doesn't mean you should. (Strike #3?) IMO we need to teach ethics alongside analysis.

Ultimately it's still your choice, or a reader's choice, to risk violating the Terms of Service/Use. LinkedIn sued real people last year for scraping, and there are lawsuits from other sites that can be referred to as well. Better to ask permission and/or err on the side of ethics than risk the consequences.


H: Thanks for your comment. I have opinions about "scraping" which I will put in a future post, now prompted by your note. In short, I have many concerns about this practice.

That said, if you look at the curricula at many universities, including Columbia and NYU, one of the first things they teach in a Python or dataviz course is web scraping! I happen to find this situation rather sad as I don't think scraping is at the core of those subjects.

One additional consideration is that the T&Cs are (probably) aimed at competing businesses that use the scraped data against Amazon. I am guessing (although I have not read the Terms) that there may be exceptions for using the data in an educational setting. It's a complicated issue so I will discuss it in a separate post.


They don't make exceptions, but do say that they will allow non-commercial use on application. My thought is that if there isn't a commercial motive then it is very unlikely that they would try to do anything, and just as unlikely that they would actually achieve anything, given the rather loose contract between the scraper and the website owner. After all, I'm just grabbing something that is publicly available to do some processing of my own. It's a bit different if I try to sell it, or produce something from the data that I sell, as then there are copyright problems.

One problem that may occur is with ethics approval. I think some institutions have IRBs that feel that because the original material was created by a person, there should be an ethics review.

John Hall

Hollywood Economics by Arthur De Vany suggests that it will be very hard to predict box office results.


Business analytics and data visualization expert. Author and Speaker. Founder of Principal Analytics Prep, MS Applied Analytics at Columbia. See my full bio.

Graphics design by Amanda Lee

