Dissecting the Netflix Dataset
In case you haven't heard, Netflix announced a public competition ($1 million prize) for a general-purpose machine learning algorithm to predict movie ratings based on users' history (with the assumption that we can learn from similar users). Now, the prize is nice, but the dataset they released caused quite a stir in the Computer Science/Data Mining community all on its own - it is orders of magnitude larger than anything that was available before! Here are some quick stats:
480,189 user IDs, 17,770 movies, and 100,480,507 ratings collected from October 1998 to December 2005.
Compressed, it fits on a single CD; once uncompressed, it becomes a hefty 2GB+ dataset. Fun stuff! Of course, being the CS geek that I am, I couldn't pass up the opportunity, and a number of other grad/PhD students at the University of Waterloo and I started a small team to see how far we can get - watch out, we're coming! :)
One of the first things I wanted to see was whether I could find some interesting patterns without resorting to AI/ML, and indeed I have - nothing unusual, but very interesting nonetheless. After struggling a bit with importing all that data into a database, I finally managed to cram it all in (kudos to MySQL 5) and started running aggregate queries (~30 minutes each, on average, on my home machine). And look at what I found...
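For the DIY-minded: all of the aggregates below run against a simple two-table layout along these lines - a minimal sketch with my own table and column names, not anything from the official README:

```sql
-- Hypothetical schema for the imported dataset (names are my own):
-- one row per movie, one row per individual rating.
CREATE TABLE movies (
  movie_id INT UNSIGNED NOT NULL PRIMARY KEY,
  year     SMALLINT,                     -- year of release
  title    VARCHAR(255)
);

CREATE TABLE ratings (
  user_id  INT UNSIGNED NOT NULL,
  movie_id INT UNSIGNED NOT NULL,
  rating   TINYINT UNSIGNED NOT NULL,    -- 1..5 stars
  rated_on DATE NOT NULL
);

-- Without these, every aggregate below is a full scan over 100M rows.
CREATE INDEX idx_ratings_movie ON ratings (movie_id);
CREATE INDEX idx_ratings_user  ON ratings (user_id);
```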
Nothing unexpected, and running the same query on the IMDB database would yield better results - but look at the exponential growth in the 90's! (My Excel must have gone bonkers - 2005 before 1999?)
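If you want to reproduce that graph, the query is a one-liner against the (assumed) movies table above:

```sql
-- Movies per release year: the 90's explosion shows up immediately.
SELECT year, COUNT(*) AS num_movies
FROM movies
GROUP BY year
ORDER BY year;
```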
Check out the long tail! Top blockbusters claim up to 250,000 ratings each, whereas most movies linger below ~200.
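The underlying numbers come from a simple per-movie count (again, against the assumed schema):

```sql
-- Ratings per movie, most popular first: a handful of blockbusters at the
-- head, thousands of movies with under ~200 ratings out in the tail.
SELECT movie_id, COUNT(*) AS num_ratings
FROM ratings
GROUP BY movie_id
ORDER BY num_ratings DESC;
```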
Now this is an interesting one. The X-axis is the month of the year (starting with 1998) - check out January! It spikes every year; I guess movies make for a popular gift, or we just have a lot more free time to watch movies over the holidays! (Most probably, a combination of both.)
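In query form, this is just a year/month rollup of the rating dates (assumed schema as above):

```sql
-- Ratings per calendar month; January of each year stands out.
SELECT YEAR(rated_on) AS yr, MONTH(rated_on) AS mo, COUNT(*) AS num_ratings
FROM ratings
GROUP BY yr, mo
ORDER BY yr, mo;
```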
An interesting one once again: the mean is 3.8! I guess on average we are satisfied with our movie choices - which only makes it that much harder for an ML algorithm to find an 'even more' interesting movie. One pattern that you cannot observe here is that the mean tends to drift to the right as the number of users grows - in 1998 the average was 3.4; by 2005 it had steadily moved to 3.8. I wonder what accounts for the drift? Are early adopters (techies) more discerning in their ratings/choices? Are the movies getting better? (Doubt it!)
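Both the overall mean and the year-by-year drift fall out of two small aggregates (assumed schema as above):

```sql
-- Overall mean rating (~3.8)...
SELECT AVG(rating) AS mean_rating FROM ratings;

-- ...and its drift over time (~3.4 in 1998 up to ~3.8 by 2005).
SELECT YEAR(rated_on) AS yr, AVG(rating) AS mean_rating
FROM ratings
GROUP BY yr
ORDER BY yr;
```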
This one makes sense once you think about it - the lower your average rating, the more discerning you are with your ratings (hence the wider variance), plus the fact that you're lower on the scale gives you a wider window to rate movies in!
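A quick way to see the mean-vs-spread relationship is a per-user aggregate - a sketch, with an arbitrary cutoff of my own to filter out users with too few ratings for the variance to mean anything:

```sql
-- Per-user mean vs. spread: lower averages tend to come with wider variance.
SELECT user_id,
       AVG(rating)    AS mean_rating,
       STDDEV(rating) AS stddev_rating,
       COUNT(*)       AS num_ratings
FROM ratings
GROUP BY user_id
HAVING num_ratings >= 20;   -- arbitrary cutoff, my own choice
```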
Last but not least, this is the number of ratings (Y) per user - again, a perfect long tail. You can see that 50% of the users have rated fewer than 100 movies. (I chopped off the top 10K users by number of ratings because you can't see the tail with them in.) Nonetheless, this points to a very active community for Netflix - over 200,000 users who have rated over 100 movies! Wow!
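The histogram behind this one is a count-of-counts - bucket users by how many ratings they have (the bucket width of 10 is my own choice):

```sql
-- How many users fall into each ratings-count bucket of width 10.
SELECT bucket * 10 AS ratings_from, COUNT(*) AS num_users
FROM (
  SELECT user_id, FLOOR(COUNT(*) / 10) AS bucket
  FROM ratings
  GROUP BY user_id
) AS per_user
GROUP BY bucket
ORDER BY bucket;
```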
There are a couple more graphs in my Flickr gallery, and you can also grab a couple of quick MS Excel spreadsheets with some of the data below.
Update: Here is a log-log plot of the number of ratings vs. movie popularity. Note that this is a very weak representation - I'm using 'popularity' as determined by the number of ratings, so there is an inherent direct dependency in this graph that shouldn't be there. Nonetheless, look at the 'long' drooping tail - something tells me Netflix should beef up their movie selection to fix that. Then again, Netflix is still subject to the inefficiencies of the 'brick and mortar' world; they can only carry so many titles without stressing the distribution/warehousing/etc. costs.
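For the curious, the rank-vs-count pairs behind that plot can be generated with MySQL's user variables (a sketch; take the logs in Excel or wherever you plot):

```sql
-- Rank movies by popularity and emit (rank, count) pairs,
-- then plot log(pop_rank) vs. log(num_ratings).
SET @pop_rank := 0;
SELECT (@pop_rank := @pop_rank + 1) AS pop_rank, num_ratings
FROM (
  SELECT COUNT(*) AS num_ratings
  FROM ratings
  GROUP BY movie_id
  ORDER BY num_ratings DESC
) AS per_movie;
```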
Loading Netflix Dataset into SQL for DIY-minded people!
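If you'd rather skip the write-up, the import boils down to something like this - a minimal sketch assuming the schema above, and assuming you've first flattened the per-movie rating files (each starts with a header line giving the movie ID, which LOAD DATA won't handle directly) into one big CSV; the file name for that CSV is my own:

```sql
-- Movie titles load directly from the distributed file.
-- Caveat: a few titles contain commas, which trips up a naive comma split.
LOAD DATA LOCAL INFILE 'movie_titles.txt'
INTO TABLE movies
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(movie_id, year, title);

-- 'all_ratings.csv' is the output of the flattening pass (hypothetical name).
LOAD DATA LOCAL INFILE 'all_ratings.csv'
INTO TABLE ratings
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(movie_id, user_id, rating, rated_on);
```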