Loading Netflix Dataset into SQL
Here are a couple of quick Ruby scripts I wrote to load the Netflix dataset into SQL. They may not be the most elegant solution, but they work. So first of all, let's get our database running. Assuming you have some flavor of SQL server installed, use the structure file below to prepare the tables (it should work on just about any SQL server, unless you are using something really exotic). I would strongly recommend disabling InnoDB/transaction support and using the my.cnf-huge (MySQL) config or its equivalent for other databases; a sketch of the relevant settings follows the structure file. Here is the structure file:
/* SQL Structure - should work on all versions */
DROP DATABASE IF EXISTS `netflix`;
CREATE DATABASE `netflix` /*!40100 DEFAULT CHARACTER SET latin1 */;
USE `netflix`;

CREATE TABLE `movies` (
  `id` int(5) NOT NULL DEFAULT '0',
  `year` int(4) DEFAULT '0',
  `title` varchar(255) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

CREATE TABLE `probe` (
  `movie_id` int(5) NOT NULL DEFAULT '0',
  `customer_id` int(6) NOT NULL DEFAULT '0',
  KEY `movie_id` (`movie_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

CREATE TABLE `qualifying` (
  `customer_id` int(6) NOT NULL DEFAULT '0',
  `date` date NOT NULL DEFAULT '0000-00-00',
  `movie_id` int(5) NOT NULL DEFAULT '0',
  KEY `movie_id` (`movie_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

CREATE TABLE `ratings` (
  `movie_id` int(5) NOT NULL DEFAULT '0',
  `customer_id` int(6) NOT NULL DEFAULT '0',
  `rating` int(1) NOT NULL DEFAULT '0',
  `date` date NOT NULL DEFAULT '0000-00-00',
  PRIMARY KEY (`movie_id`,`customer_id`),
  KEY `date` (`date`),
  KEY `movie_id` (`movie_id`),
  KEY `customer_id` (`customer_id`),
  KEY `rating` (`rating`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
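Since all of the tables above are MyISAM, most of the tuning comes down to giving MySQL a generous key buffer and bulk-insert buffer. Here is a minimal sketch of the kind of my.cnf settings I mean, loosely based on the my-huge.cnf sample that ships with MySQL - the exact values are illustrative and depend on how much RAM you have:

[mysqld]
skip-innodb                     # MyISAM only; no transaction overhead for this import
key_buffer_size = 384M          # in-memory cache for MyISAM index blocks
bulk_insert_buffer_size = 64M   # speeds up bulk INSERT / LOAD DATA into MyISAM
myisam_sort_buffer_size = 64M   # used when (re)building MyISAM indexes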
First, let's import the movie titles. To accomplish this I came up with a very quick script to rewrite the provided text file into an SQL import file (exactly as it would look if you used mysqldump). Here you go:
begin
  # drop & create the movies table
  sql = <<-'EOS'
USE `netflix`;
DROP TABLE IF EXISTS `movies`;
CREATE TABLE `movies` (
  `id` int(5) NOT NULL default '0',
  `year` int(4) default '0',
  `title` varchar(255) NOT NULL default '',
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
  EOS

  out = File.open("movies-import.sql", "w")
  out.write(sql)

  File.open("../movie_titles.txt", "r") do |movies|
    movies.each_line do |movie|
      id, year, title = movie.chomp.scan(/(\d+),(\d+|NULL),(.*)/).flatten
      # escape \ and ' characters in the title
      title = title.gsub(/\\/, '\&\&').gsub(/'/, "''")
      out.write("\nINSERT INTO `movies` (`id`,`year`,`title`) VALUES (#{id},#{year},'#{title}');")
    end
  end

  out.close
rescue => err
  puts "Exception: #{err}"
end
Running this script will produce a movies-import.sql
file, which can be loaded into the database by calling: mysql -u USER -p DBNAME < movies-import.sql
Importing the ratings dataset can be tricky. The best way I found was to transform all 17,770 movie files into a directly loadable format and then cram the data in using the native SQL import functionality. The code below opens and transforms each of the 17,770 movie files into a loadable format (a small before/after example follows the script). Warning: this script will leave you with a couple of GBs of new text files; however, you can delete them as soon as you've finished importing the data.
# Transforms each mv_*.txt file into a 'LOAD DATA'-loadable format for MySQL
# - Assuming the ratings table is: [movie_id, customer_id, rating, date]
# - Not the most efficient solution - on unix/linux you can apparently pipe
#   the data into LOAD DATA directly (google is your friend)
begin
  1.upto(17770) do |n|
    out = File.open("data-load/ratings.#{n.to_s.rjust(7, '0')}.txt", "w")
    File.open("../training_set/mv_#{n.to_s.rjust(7, '0')}.txt", "r") do |ratings|
      ratings.each_line do |line|
        # skip the 'movie_id:' header line; every other line is customer_id,rating,date
        if line =~ /(\d+),(\d+),(.*)/
          customer_id, rating, date = line.scan(/(\d+),(\d+),(.*)/).flatten
          out.write("#{n},#{customer_id},#{rating},#{date}\n")
        end
      end
    end
    out.close
  end
rescue => err
  puts "Exception: #{err}"
end
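To make the transform concrete, here is what a single file looks like before and after (the values are illustrative): each mv_*.txt starts with a 'movie_id:' header line followed by customer_id,rating,date rows, and the output simply prepends the movie id to every row so LOAD DATA can map the columns directly onto the ratings table:

../training_set/mv_0000123.txt        data-load/ratings.0000123.txt
123:
1488844,3,2005-09-06          ->      123,1488844,3,2005-09-06
822109,5,2005-05-13           ->      123,822109,5,2005-05-13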
Now that we have the files, we are just going to iterate over them and call 'LOAD DATA':
# This will call LOAD DATA INFILE on each transformed movie file
# - A LOT faster than doing individual inserts.
require "dbi"

movie_count = 0
begin
  # connect to the MySQL server
  dbh = DBI.connect("dbi:Mysql:netflix:localhost", "root", "yourpass")
  row = dbh.select_one("SELECT VERSION()")
  p "Running on version: " + row[0]

  1.upto(17770) do |n|
    movie = File.join(Dir.getwd, "data-load", "ratings.#{n.to_s.rjust(7, '0')}.txt")
    result = dbh.do("LOAD DATA INFILE '#{movie}' INTO TABLE ratings FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")
    p "#{movie_count} : #{result}" if movie_count % 100 == 0
    movie_count += 1
  end
rescue DBI::DatabaseError => e
  p "An error occurred"
  p "Error code: #{e.err}"
  p "Error message: #{e.errstr}"
ensure
  p "Exiting on: #{movie_count}"
  dbh.disconnect if dbh
end
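Once the load finishes, a couple of throwaway queries make for a quick sanity check - the training set should come out to roughly 100 million rows (the exact count is in the dataset's README), spread across the 17,770 movies:

SELECT COUNT(*) FROM ratings;                  -- should be on the order of 100 million
SELECT COUNT(DISTINCT movie_id) FROM ratings;  -- should match the 17,770 movie files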
Loading the data files into the DB took me about 20 minutes. The transform script took another 20-30 minutes, for a grand total of roughly one CPU-hour. At the end of the import my DB had grown by about 5GB. You can cut down on the size and get better load times by removing some of the extra keys on the ratings table. If you look at the provided SQL file carefully, you'll see that I have a key on every field - this obviously impacts the size and import times. I wanted to run aggregate queries on my data, hence I opted to create keys on every attribute; a leaner variant is sketched below.
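If you don't care about per-column lookups, a leaner setup would keep only the composite primary key - either strip the extra KEY lines from the CREATE TABLE before importing (faster load), or drop them afterwards to reclaim space. Here is a sketch of the cleanup, plus the kind of aggregate query the extra keys are there to speed up (average rating per movie is just an example):

ALTER TABLE `ratings`
  DROP KEY `date`,
  DROP KEY `movie_id`,
  DROP KEY `customer_id`,
  DROP KEY `rating`;

SELECT movie_id, AVG(rating) AS avg_rating, COUNT(*) AS num_ratings
FROM ratings
GROUP BY movie_id;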
Note: Dissecting the Netflix Dataset has some interesting stats on the data.