
Comments


Hrbrmstr

Ethics are as important as both thoughtful, honest analyses and great visualizations. It happens to be against the Amazon Terms of Service to scrape IMDB and encouraging that act seems like a bad idea. Just because you can do something doesn't mean you should. (Strike #3?) IMO we need to teach ethics alongside analysis.

Ultimately it's still your choice or a reader's choice to risk violation of Terms of Service/Use. LinkedIn sued real people last year for scraping and there are other lawsuits from other sites that can be referred to as well. Better to ask permission and/or err on the side of ethics vs risk the consequences.

Kaiser

H: Thanks for your comment. I have opinions about the topic of "scraping", which I will put in a future post, now prompted by your note. In short, I have many concerns about this practice.

That said, if you look at the curricula at many universities, including Columbia and NYU, one of the first things they teach in a Python or dataviz course is web scraping! I happen to find this situation rather sad as I don't think scraping is at the core of those subjects.

One additional consideration is that the T&Cs are (probably) aimed at competitive businesses who are using the scraped data against Amazon. I am guessing (although I have not read the Terms) that there may be exceptions for using the data in an educational setting. It's a complicated issue so I will discuss it in a separate post.

Ken

They don't make exceptions, but do say that they will allow non-commercial use on application. My thought is that if there isn't a commercial motive then it is very unlikely that they would try to do anything, and just as unlikely that they would actually achieve anything given the rather loose contract between the scraper and the website owner. After all, I'm just grabbing something that is publicly available to do some processing of my own. It's a bit different if I try to sell it, or produce something from the data that I sell, as then there are copyright problems.

One problem that may occur is with ethics approval. I think some institutions have IRBs that feel that because the original material was created by a person, there should be an ethics review.

John Hall

Hollywood Economics by Arthur De Vany suggests that it will be very hard to predict box office results.


Kaiser Fung. Business analytics and data visualization expert. Author and Speaker.