Loading Netflix Dataset into SQL
Here are a couple of quick Ruby scripts I wrote to load the Netflix dataset into SQL. They may not be the most elegant solution, but they work. So first of all, let's get our database running. Assuming you have some flavor of SQL server installed, use the structure file below to prepare the tables (it should work on just about any SQL server, unless you are using something really exotic). I would strongly recommend disabling InnoDB/transaction support and using the my.cnf-huge (MySQL) config or its equivalent for other databases; a sketch of the relevant settings follows the structure file. Here is the structure file:
/* SQL Structure - should work on all versions */
DROP DATABASE IF EXISTS `netflix`;
CREATE DATABASE `netflix` /*!40100 DEFAULT CHARACTER SET latin1 */;
USE `netflix`;

CREATE TABLE `movies` (
  `id` int(5) NOT NULL DEFAULT '0',
  `year` int(4) DEFAULT '0',
  `title` varchar(255) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

CREATE TABLE `probe` (
  `movie_id` int(5) NOT NULL DEFAULT '0',
  `customer_id` int(6) NOT NULL DEFAULT '0',
  KEY `movie_id` (`movie_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

CREATE TABLE `qualifying` (
  `customer_id` int(6) NOT NULL DEFAULT '0',
  `date` date NOT NULL DEFAULT '0000-00-00',
  `movie_id` int(5) NOT NULL DEFAULT '0',
  KEY `movie_id` (`movie_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

CREATE TABLE `ratings` (
  `movie_id` int(5) NOT NULL DEFAULT '0',
  `customer_id` int(6) NOT NULL DEFAULT '0',
  `rating` int(1) NOT NULL DEFAULT '0',
  `date` date NOT NULL DEFAULT '0000-00-00',
  PRIMARY KEY (`movie_id`,`customer_id`),
  KEY `date` (`date`),
  KEY `movie_id` (`movie_id`),
  KEY `customer_id` (`customer_id`),
  KEY `rating` (`rating`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
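Since all of the tables above are MyISAM, most of the tuning comes down to giving MySQL a generous key buffer and bulk-insert buffer. Here is a minimal sketch of the kind of my.cnf settings I mean, loosely based on the my-huge.cnf sample that ships with MySQL - the exact values are illustrative and depend on how much RAM you have:

[mysqld]
skip-innodb                     # MyISAM only; no transaction overhead for this import
key_buffer_size = 384M          # in-memory cache for MyISAM index blocks
bulk_insert_buffer_size = 64M   # speeds up bulk INSERT / LOAD DATA into MyISAM
myisam_sort_buffer_size = 64M   # used when (re)building MyISAM indexes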
First, let's import the movie titles. To accomplish this I came up with a very quick script to rewrite the provided text file into an SQL import file (exactly as it would look if you used mysqldump). Here you go:
begin
  # drop & create the movies table
  sql = <<-'EOS'
USE `netflix`;
DROP TABLE IF EXISTS `movies`;
CREATE TABLE `movies` (
  `id` int(5) NOT NULL default '0',
  `year` int(4) default '0',
  `title` varchar(255) NOT NULL default '',
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
  EOS

  out = File.open("movies-import.sql", "w")
  out.write(sql)

  File.open("../movie_titles.txt", "r") do |movies|
    movies.each_line do |movie|
      id, year, title = movie.chomp.scan(/(\d+),(\d+|NULL),(.*)/).flatten
      # escape \ and ' characters in the title
      title = title.gsub(/\\/, '\&\&').gsub(/'/, "''")
      out.write("\nINSERT INTO `movies` (`id`,`year`,`title`) VALUES (#{id},#{year},'#{title}');")
    end
  end

  out.close
rescue => err
  puts "Exception: #{err}"
end
Running this script will produce a movies-import.sql
file, which can be loaded into the database by calling: mysql -u USER -p DBNAME < movies-import.sql
Importing the ratings dataset can be tricky. The best way I found was to transform all 17,770 movie files into a directly loadable format and then cram the data in using the native SQL import functionality. The code below opens and transforms each of the 17,770 movie files into a loadable format (a small before/after example follows the script). Warning: this script will leave you with a couple of GBs of new text files; however, you can delete them as soon as you've finished importing the data.
# Transforms each mv_*.txt file into a 'LOAD DATA'-loadable format for MySQL
# - Assuming the ratings table is: [movie_id, customer_id, rating, date]
# - Not the most efficient solution - on unix/linux you can apparently pipe
#   the data into LOAD DATA directly (google is your friend)
begin
  1.upto(17770) do |n|
    out = File.open("data-load/ratings.#{n.to_s.rjust(7, '0')}.txt", "w")
    File.open("../training_set/mv_#{n.to_s.rjust(7, '0')}.txt", "r") do |ratings|
      ratings.each_line do |line|
        # skip the 'movie_id:' header line; every other line is customer_id,rating,date
        if line =~ /(\d+),(\d+),(.*)/
          customer_id, rating, date = line.scan(/(\d+),(\d+),(.*)/).flatten
          out.write("#{n},#{customer_id},#{rating},#{date}\n")
        end
      end
    end
    out.close
  end
rescue => err
  puts "Exception: #{err}"
end
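To make the transform concrete, here is what a single file looks like before and after (the values are illustrative): each mv_*.txt starts with a 'movie_id:' header line followed by customer_id,rating,date rows, and the output simply prepends the movie id to every row so LOAD DATA can map the columns directly onto the ratings table:

../training_set/mv_0000123.txt        data-load/ratings.0000123.txt
123:
1488844,3,2005-09-06          ->      123,1488844,3,2005-09-06
822109,5,2005-05-13           ->      123,822109,5,2005-05-13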
Now that we have the files, we are just going to iterate over them and call 'LOAD DATA':
# This will call LOAD DATA INFILE on each transformed movie file
# - A LOT faster than doing individual inserts.
require "dbi"

movie_count = 0
begin
  # connect to the MySQL server
  dbh = DBI.connect("dbi:Mysql:netflix:localhost", "root", "yourpass")
  row = dbh.select_one("SELECT VERSION()")
  p "Running on version: " + row[0]

  1.upto(17770) do |n|
    movie = File.join(Dir.getwd, "data-load", "ratings.#{n.to_s.rjust(7, '0')}.txt")
    result = dbh.do("LOAD DATA INFILE '#{movie}' INTO TABLE ratings FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")
    p "#{movie_count} : #{result}" if movie_count % 100 == 0
    movie_count += 1
  end
rescue DBI::DatabaseError => e
  p "An error occurred"
  p "Error code: #{e.err}"
  p "Error message: #{e.errstr}"
ensure
  p "Exiting on: #{movie_count}"
  dbh.disconnect if dbh
end
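Once the load finishes, a couple of throwaway queries make for a quick sanity check - the training set should come out to roughly 100 million rows (the exact count is in the dataset's README), spread across the 17,770 movies:

SELECT COUNT(*) FROM ratings;                  -- should be on the order of 100 million
SELECT COUNT(DISTINCT movie_id) FROM ratings;  -- should match the 17,770 movie files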
Loading the data files into the DB took me about 20 minutes. The transform script took another 20-30 minutes, for a grand total of roughly one CPU-hour. At the end of the import my DB had grown by about 5GB. You can cut down on the size and get better load times by removing some of the extra keys on the ratings table. If you look at the provided SQL file carefully, you'll see that I have a key on every field - this obviously impacts the size and import times. I wanted to run aggregate queries on my data, hence I opted to create keys on every attribute; a leaner variant is sketched below.
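If you don't care about per-column lookups, a leaner setup would keep only the composite primary key - either strip the extra KEY lines from the CREATE TABLE before importing (faster load), or drop them afterwards to reclaim space. Here is a sketch of the cleanup, plus the kind of aggregate query the extra keys are there to speed up (average rating per movie is just an example):

ALTER TABLE `ratings`
  DROP KEY `date`,
  DROP KEY `movie_id`,
  DROP KEY `customer_id`,
  DROP KEY `rating`;

SELECT movie_id, AVG(rating) AS avg_rating, COUNT(*) AS num_ratings
FROM ratings
GROUP BY movie_id;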
Note: Dissecting the Netflix Dataset has some interesting stats on the data.