Today I continue to explore the movie dataset, found on Kaggle. To catch up with previous work, see the blog posts 1 and 2.
One of the students came up with an interesting problem. Within the action genre, are there particular plot elements that are correlated with box office? This problem is solvable because the dataset contains a variable called "plot keywords" lifted from IMDB.
Plot keywords are given in a single column in a pipe-delimited form. For example, one entry says "caribbean|curse|governor|pirate|undead." Sounds like one of those Johnny Depp movies.
Clearly, the first order of business is to unbind the keywords. We need to turn these individual words into "dummy variables" (i.e. 1 for the presence of "caribbean", 0 otherwise). Any analytical software will have such a capability. With Python, you might use the split function. With JMP, there is an excellent built-in function that converts such delimited strings into dummy variables with one click.
So now the one plot keyword column is turned into over 1,000 columns, each column relating to a specific keyword.
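For readers who want to try the unbinding step in Python, here is a minimal sketch using only the split function mentioned above. The function name `keywords_to_dummies` and the second sample row are made up for illustration; only the first row is the entry quoted earlier.

```python
def keywords_to_dummies(rows, sep="|"):
    """Turn delimited keyword strings into 0/1 indicator ("dummy") columns."""
    # Split each entry into a set of keywords, ignoring empty pieces.
    split_rows = [
        {kw.strip() for kw in (row or "").split(sep) if kw.strip()}
        for row in rows
    ]
    # One column per distinct keyword seen anywhere in the data.
    vocabulary = sorted(set().union(*split_rows))
    # 1 if the movie has the keyword, 0 otherwise.
    dummies = [
        {kw: int(kw in kws) for kw in vocabulary}
        for kws in split_rows
    ]
    return vocabulary, dummies


# Hypothetical sample: the second row is invented for this example.
plot_keywords = [
    "caribbean|curse|governor|pirate|undead",
    "pirate|ship|treasure",
]
vocabulary, dummies = keywords_to_dummies(plot_keywords)
```

If the data already sits in a pandas Series, `Series.str.get_dummies(sep="|")` should give the same indicator columns in one call.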
In theory, we are in business and can analyze this data using regression, trees, or whatever method one prefers.
***
If it were that simple, I wouldn't be blogging about this. As we like to say, you must look at your data. I ran some quick distribution analyses and was staring at these in confusion:
If the data were to be believed, words like "abduction" and "action hero" showed up in only one or two movies, out of a total of over 400 action films in this dataset.
Here is the entire histogram of how many times each keyword showed up in the data:
There were only twenty keywords that appeared in at least 10 action films: Murder. FBI. Police. Scientist. Assassin. Death. Alien. Future. Battle. CIA. Rescue. Spy. Superhero. Warrior. Escape. Martial Arts. Prison. Revenge. Terrorist. Vampire.
Huge alarm bells should be going off in the analyst's head right around now. There were only eleven movies about vampires? Only eleven martial arts movies? Only twelve movies involving superheroes?
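The frequency check described above can be sketched in a few lines of Python. This is an illustration, not the analysis I actually ran: the function names and the tiny sample are made up, and the threshold of 10 movies is the one used in the list above.

```python
from collections import Counter


def keyword_movie_counts(rows, sep="|"):
    """Count how many movies each keyword appears in."""
    counts = Counter()
    for row in rows:
        # A set ensures a keyword repeated within one movie counts once.
        counts.update({kw.strip() for kw in (row or "").split(sep) if kw.strip()})
    return counts


def frequent_keywords(counts, min_movies=10):
    """Keywords appearing in at least `min_movies` movies, alphabetically."""
    return sorted(kw for kw, n in counts.items() if n >= min_movies)


# Hypothetical sample rows, invented for this example.
sample_rows = ["vampire|revenge", "vampire|police", "police|fbi|vampire"]
counts = keyword_movie_counts(sample_rows)
common = frequent_keywords(counts, min_movies=2)
```

Sorting the counts, or plotting them as a histogram, is what makes an implausibly thin tail of keywords jump out.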
I asked the student to figure out how this variable was collected. Go find the data dictionary to see if the keywords have some special criteria, such as statistically improbable words. Go to the IMDB page and look at what the scraping code might be doing (wrongly).
***
So what did we find? (See if you can spot the problem!) This is the screen shot of the section of the IMDB page for that Pirates of the Caribbean movie from which the plot keywords were lifted:
I have a bad habit of clicking on things when I am prompted to click. So I clicked on the "See All" link and a whole new world opened in front of my eyes!
So the plot keyword column in the Kaggle dataset is completely useless. For each movie, it contains only five or six words out of a list of hundreds of keywords. This also explains, at least partially, why each keyword appears in so few movies.
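One quick check that might have flagged the problem earlier is to tabulate how many keywords each movie carries. The sketch below is hypothetical (the function name and sample rows are invented); the idea is that if nearly every movie has about the same small number of keywords, the column was probably cut off at the source.

```python
from collections import Counter


def keywords_per_movie(rows, sep="|"):
    """Distribution of the number of keywords attached to each movie."""
    return Counter(
        len([kw for kw in (row or "").split(sep) if kw.strip()])
        for row in rows
    )


# Hypothetical sample rows, invented for this example.
sample_rows = [
    "caribbean|curse|governor|pirate|undead",
    "alien|battle|future|rescue|scientist",
    "cia|escape|prison|revenge|spy",
]
length_distribution = keywords_per_movie(sample_rows)
max_keywords = max(length_distribution)
```

A distribution piled up on one or two small values is a hint, not proof; it still takes a look at the source page to confirm what was dropped.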
It pains me to think how many people have analyzed this dataset, and used these keywords to build models.
***
Back to the class project. Now this student was in serious trouble! Remember, the problem she was hoping to solve was to find out which plot elements are predictive of box office. It turns out those keyword lists are severely truncated. So we had to switch gears immediately and concoct a different problem that does not rely on these keywords.
(It was days before the due date. If there had been more time, we could have written a different scraping script to take the full list of keywords from the keyword pages and merge those into the existing dataset. Maybe someone who reads this post will be inspired to do it.)
If you are keeping count, that is strike #2. There are more posts coming.
Ethics are as important as both thoughtful, honest analyses and great visualizations. It happens to be against the Amazon Terms of Service to scrape IMDB and encouraging that act seems like a bad idea. Just because you can do something doesn't mean you should. (Strike #3?) IMO we need to teach ethics alongside analysis.
Ultimately it's still your choice or a reader's choice to risk violation of Terms of Service/Use. LinkedIn sued real people last year for scraping and there are other lawsuits from other sites that can be referred to as well. Better to ask permission and/or err on the side of ethics vs risk the consequences.
Posted by: Hrbrmstr | 01/25/2017 at 10:22 AM
H: Thanks for your comment. I have opinions about the topic of the "scraping" which I will put in a future post, now prompted by your note. In short, I have many concerns about this practice.
That said, if you look at the curricula at many universities, including Columbia and NYU, one of the first things they teach in a Python or dataviz course is web scraping! I happen to find this situation rather sad as I don't think scraping is at the core of those subjects.
One additional consideration is that the T&Cs are (probably) aimed at competitive businesses who are using the scraped data against Amazon. I am guessing (although I have not read the Terms) that there may be exceptions for using the data in an educational setting. It's a complicated issue so I will discuss it in a separate post.
Posted by: Kaiser | 01/25/2017 at 12:10 PM
They don't make exceptions, but do say that they will allow non-commercial use on application. My thought is that if there isn't a commercial motive then it is very unlikely that they would try to do anything, and just as unlikely that they would actually achieve anything given the rather loose contract between the scraper and the website owner. After all, I'm just grabbing something that is publicly available to do some processing of my own. A bit different if I try to sell it, or produce something from the data that I sell, as then there are copyright problems.
One problem that may occur is with ethics approval. I think some institutions have IRBs that feel that because the original material was created by a person, there should be an ethics review.
Posted by: Ken | 01/29/2017 at 04:56 AM
Hollywood Economics by Arthur De Vany suggests that it will be very hard to predict box office results.
Posted by: John Hall | 08/05/2017 at 09:36 AM