
February talks, and exploratory data analysis using visuals


In February, I am bringing my dataviz lecture to various cities: Atlanta (Feb 7), Austin (Feb 15), and Copenhagen (Feb 28). Click on the links for free registration.

I hope to meet some of you there.


On the sister blog about predictive models and Big Data, I have been discussing aspects of a dataset containing IMDB movie data. Here are previous posts (1, 2, 3).

The latest instalment contains the following chart:


The general idea is that the average rating of the average film on IMDB has declined from about 7.5 to 6.5... but this does not mean that IMDB users like oldies more than recent movies. The problem is a bias in the IMDB user base. Since IMDB's website launched only in 1990, users are much more likely to be reviewing movies released after 1990 than before. Further, if users are reviewing oldies, they are likely reviewing oldies that they like and go back to, rather than the horrible movie they watched 15 years ago.

Modelers should explore and investigate their datasets before building their models. The same goes for anyone doing data visualization! You need to understand the origin of the data and its biases in order to tell the proper story.
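One quick way to start that kind of exploration is to look at how many films and how many votes sit behind the average rating in each period. Below is a minimal sketch of the idea in Python. The records, field layout, and the `summarize_by_decade` helper are all hypothetical and made up for illustration; this is not the IMDB dataset and not the analysis from the sister blog.

```python
from collections import defaultdict

# Hypothetical records: (title, release_year, avg_rating, num_votes).
# These values are invented for illustration; they are NOT IMDB data.
FILMS = [
    ("Film A", 1955, 7.8, 1200),
    ("Film B", 1958, 7.4, 900),
    ("Film C", 1987, 6.9, 15000),
    ("Film D", 1996, 6.4, 80000),
    ("Film E", 1999, 7.1, 120000),
    ("Film F", 2008, 6.2, 150000),
    ("Film G", 2009, 5.8, 60000),
]


def summarize_by_decade(films):
    """Group films by release decade and report how much data backs each average."""
    groups = defaultdict(list)
    for _title, year, rating, votes in films:
        decade = (year // 10) * 10
        groups[decade].append((rating, votes))

    summary = {}
    for decade in sorted(groups):
        rows = groups[decade]
        summary[decade] = {
            "films": len(rows),
            "mean_rating": sum(r for r, _ in rows) / len(rows),
            "total_votes": sum(v for _, v in rows),
        }
    return summary


if __name__ == "__main__":
    for decade, stats in summarize_by_decade(FILMS).items():
        print(
            f"{decade}s: {stats['films']} films, "
            f"mean rating {stats['mean_rating']:.2f}, "
            f"{stats['total_votes']} votes"
        )
```

If the older decades turn out to have far fewer films and votes than the recent ones, that is a hint that the averages are not comparable, and that selection effects like the ones described above deserve a closer look before charting or modeling.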

Click here to read the full post.





Adam Schwartz

It seems like it might be even messier than that. Have you recently watched an episode of any 1980s television show? I remember "Alf" being great as a kid, but I couldn't even get through the first episode when it showed up on one of my streaming services. So if I could go back in time and rate such shows back then, they might have fared quite well, but if you made me review them again now, they'd get much lower ratings.

It seems like quality (at least on a 1-10 scale) may not be an unchanging constant over time for a given movie.


I just want to say that I have really enjoyed this series of posts. My concern with this kind of analysis is more ontological (?), I guess.

While IMDB data is easy to understand, and it's interesting to see patterns in ratings, technically, this isn't "movie data." It's opinion data. It's even worse with Rotten Tomatoes or Metacritic data, because their aggregation and scoring algorithms move them farther from the source: actual movies.

All movie analyses that use crowdsourced rating data rest on the flawed assumption that such ratings have some inherent, meaningful relationship to movie quality.


Data_chefs: You will like my next post in the series. Thanks for the note.
