One of the students came up with an interesting problem. Within the action-movie genre, are there particular plot elements that are correlated with box office? This problem looks solvable because the dataset contains a variable called "plot keywords" lifted from IMDB.
Plot keywords are given in a single column in a pipe-delimited form. For example, one entry says "caribbean|curse|governor|pirate|undead." Sounds like one of those Johnny Depp movies.
Clearly, the first order of business is to unbind the keywords. We need to turn these individual words into "dummy variables" (e.g. 1 for the presence of "caribbean", 0 otherwise). Any analytical software will have such a capability. With Python, you might use the split function. With JMP, there is an excellent built-in function that converts such delimited strings into dummy variables with one click.
So now the one plot keyword column is turned into over 1,000 columns, each column relating to a specific keyword.
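For readers who want to try the split approach in Python, here is a minimal sketch using only the standard library. The function name keywords_to_dummies and the second and third sample rows are made up for illustration; only the first row comes from the dataset entry quoted above. (If you work in pandas, Series.str.get_dummies(sep="|") does much the same job in one call.)

```python
def keywords_to_dummies(rows, sep="|"):
    """Turn delimited keyword strings into 0/1 dummy columns.

    Returns the sorted keyword vocabulary and one 0/1 row per movie.
    """
    # Split each string; an empty string means the movie has no keywords.
    split_rows = [set(r.split(sep)) if r else set() for r in rows]
    # One column per distinct keyword seen anywhere in the data.
    vocab = sorted(set().union(*split_rows))
    table = [[1 if kw in kws else 0 for kw in vocab] for kws in split_rows]
    return vocab, table


sample = [
    "caribbean|curse|governor|pirate|undead",  # the entry quoted in the post
    "pirate|revenge",                          # hypothetical row
    "",                                        # hypothetical movie with no keywords
]
vocab, table = keywords_to_dummies(sample)
for row in table:
    print(dict(zip(vocab, row)))
```

On the real data, vocab would hold the 1,000-plus keyword columns mentioned above.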
In theory, we are in business and can analyze this data using regression, trees, or whatever method one prefers.
If it were that simple, I wouldn't be blogging about this. As we like to say, you must look at your data. I ran some quick distribution analyses and was staring at these in confusion:
If the data were to be believed, words like "abduction" and "action hero" showed up in only one or two movies, out of a total of over 400 action films in this dataset.
Here is the entire histogram of how many times each keyword showed up in the data:
There were only twenty keywords that appeared in at least 10 action films: Murder. FBI. Police. Scientist. Assassin. Death. Alien. Future. Battle. CIA. Rescue. Spy. Superhero. Warrior. Escape. Martial Arts. Prison. Revenge. Terrorist. Vampire.
Huge alarm bells should be going off in the analyst's head right around now. There were only eleven movies about vampires? Only eleven martial arts movies? Only twelve movies involving superheroes?
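The distribution check that raised the alarm is easy to reproduce. The sketch below is one way to do it, not the exact analysis I ran: it counts how many movies each keyword appears in, then tabulates how many keywords share each count (the histogram described above). The function names and the sample rows are invented for illustration.

```python
from collections import Counter


def keyword_movie_counts(rows, sep="|"):
    """Count the number of movies in which each keyword appears."""
    counts = Counter()
    for r in rows:
        # Use a set so a keyword repeated within one movie counts once.
        counts.update({k for k in r.split(sep) if k})
    return counts


def count_histogram(counts):
    """Map 'number of movies' -> 'number of keywords with that frequency'."""
    return Counter(counts.values())


# Hypothetical rows, not taken from the dataset.
rows = ["vampire|revenge|murder", "murder|police", "vampire|murder|murder", ""]
counts = keyword_movie_counts(rows)
hist = count_histogram(counts)

# Keywords appearing in at least `threshold` movies (10 in the post).
threshold = 2
frequent = sorted(k for k, n in counts.items() if n >= threshold)
print(counts.most_common(), dict(hist), frequent)
```

A histogram piled up at one or two movies per keyword, as it was here, is the signal to stop modeling and start asking how the column was built.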
I asked the student to figure out how this variable was collected. Go find the data dictionary to see if the keywords have some special criteria, such as statistically improbable words. Go to the IMDB page and look at what the scraping code might be doing (wrongly).
So what did we find? (See if you can spot the problem!) This is the screen shot of the section of the IMDB page for that Pirates of the Caribbean movie from which the plot keywords were lifted:
I have a bad habit of clicking on things when I am prompted to click. So I clicked on the "See All" link and a whole new world opened in front of my eyes!
So the plot keyword column in the Kaggle dataset is completely useless. For each movie, it contains only five or six words out of a list of hundreds of keywords. This also explains, at least partially, why the keywords appear so unique.
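In hindsight, a simple check on the number of keywords per movie might have exposed the truncation without a trip to IMDB. Here is a rough sketch, with invented sample rows apart from the first; if nearly every movie tops out at the same small number of keywords, the column has probably been capped by whatever collected it.

```python
from collections import Counter


def keywords_per_movie(rows, sep="|"):
    """Return the number of non-empty keywords in each movie's string."""
    return [len([k for k in r.split(sep) if k]) for r in rows]


sample = [
    "caribbean|curse|governor|pirate|undead",  # the entry quoted in the post
    "alien|future|rescue|scientist|spy",       # hypothetical row
    "prison|escape|revenge|murder",            # hypothetical row
    "",                                        # hypothetical movie with no keywords
]
lengths = keywords_per_movie(sample)
length_distribution = Counter(lengths)
print(max(lengths), dict(length_distribution))
```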
It pains me to think how many people have analyzed this dataset, and used these keywords to build models.
Back to the class project. Now this student was in serious trouble! Remember, the problem she was hoping to solve was to find out which plot elements are predictive of box office. It turns out those keywords are severely truncated. So we had to immediately switch gears, and concoct a different problem that does not rely on these keywords.
(It was days before the due date. If there had been more time, we could have written a different scraping script to take the full list of keywords from the keyword pages and merge those into the existing dataset. Maybe someone who reads this post will be inspired to do it.)
If you are keeping count, that is strike #2. There are more posts coming.