Relational Databases

Relational databases allow us to work with tables ("relation" is a fancy word for table). A relational database (Oracle, SQL Server, MySQL, etc.) will store one or more tables on disk. SQLite uses a single file, but other systems use multiple files and a client/server architecture.

The SQL language is used to specify queries on the data, as well as insertions/deletions/updates.

Why would we use a database? Doesn't Pandas already allow us to work with tables?

  • Relational databases run queries on the data that's on the disk. They do load blocks of it into memory, but they rarely have to load the whole database into memory. If you have a very large dataset, relational databases are usually going to be able to work with it more efficiently (or: loading that large dataset into memory may be infeasible)
  • Data on disk is stored in a type of tree called B-Tree that optimizes certain types of data access. The data itself is stored in a format that typically requires little or no transformation to be used (e.g., integers are stored in a binary format, instead of in their string representation which is what we have in CSV/JSON/etc. files)
  • SQL is a very expressive language, and will allow us to express queries that may be hard to express in Pandas. SQL is particularly good at joining multiple tables together (we can do this with Pandas too, but databases are very good at optimizing this particular operation)

The MovieLens dataset contains information about:

  • Movies
  • Genres
  • Ratings (made by users)
  • Tags (added by users)

We will be using a small dataset (9,000 movies, 100,000 ratings) that is only 4MB, but the full dataset is a 1GB file (40,000 movies, 24,000,000 ratings).

Basic SELECT queries

Movies table:

CREATE TABLE movies (
  movieId INTEGER PRIMARY KEY,
  title TEXT NOT NULL,
  year INTEGER
);

Open the database:

sqlite3 movielens-small.sqlite

Enable pretty output:

.headers on
.mode column

All the movies from 1933:

SELECT * FROM movies WHERE year = 1933;

movieId     title       year      
----------  ----------  ----------
1256        Duck Soup   1933      
2366        King Kong   1933      
3932        Invisible   1933      
4921        Little Wom  1933      
7080        42nd Stree  1933      
7081        I'm No Ang  1933      
8256        Queen Chri  1933      
8670        Testament   1933      
25795       Dinner at   1933      
25801       She Done H  1933      
48301       Private Li  1933      
71102       Three Musk  1933      
80748       Alice in W  1933   

The titles from all the movies from 1933:

SELECT title FROM movies WHERE year = 1933;

title     
----------
Duck Soup 
King Kong 
Invisible 
Little Wom
42nd Stree
I'm No Ang
Queen Chri
Testament 
Dinner at 
She Done H
Private Li
Three Musk
Alice in W

All the movies (there's 9000+ movies, so we use LIMIT to only print the top n rows)

SELECT * FROM movies LIMIT 10;

movieId     title       year      
----------  ----------  ----------
1           Toy Story   1995      
2           Jumanji     1995      
3           Grumpier O  1995      
4           Waiting to  1995      
5           Father of   1995      
6           Heat        1995      
7           Sabrina     1995      
8           Tom and Hu  1995      
9           Sudden Dea  1995      
10          GoldenEye   1995    

Python + SQLite

In [2]:
import sqlite3

conn = sqlite3.connect("movielens-small.sqlite")

c = conn.execute("SELECT * FROM movies LIMIT 10")

for row in c:
    print(row)
(1, 'Toy Story', 1995)
(2, 'Jumanji', 1995)
(3, 'Grumpier Old Men', 1995)
(4, 'Waiting to Exhale', 1995)
(5, 'Father of the Bride Part II', 1995)
(6, 'Heat', 1995)
(7, 'Sabrina', 1995)
(8, 'Tom and Huck', 1995)
(9, 'Sudden Death', 1995)
(10, 'GoldenEye', 1995)

Basic Aggregation

Find the number of movies.

SELECT COUNT(*) FROM movies;

Use "AS" to name the resulting column.

SELECT COUNT(*) AS num FROM movies;

Find the number of movies from 1933.

SELECT COUNT(*) AS num FROM movies WHERE year = 1933;

Find the number of movies per year.

  SELECT year, COUNT(*) AS num 
    FROM movies 
GROUP BY year;

Order it in descending order.

  SELECT year, COUNT(*) AS num 
    FROM movies 
GROUP BY year 
ORDER BY num DESC;

Include only years with more than 150 movies. Note that WHERE filters by row, and HAVING BY filters by group.

  SELECT year, COUNT(*) AS num 
    FROM movies 
GROUP BY year 
  HAVING num > 150
ORDER BY num DESC;

Find all the movies that have been remade:

  SELECT title, COUNT(*) AS num 
    FROM movies 
GROUP BY title 
  HAVING num > 1;

Joins

The ratings table contains one row per rating entered by one user for one movie (i.e., a given user cannot rate a movie more than once).

CREATE TABLE ratings(
  userId INTEGER,
  movieId INTEGER,
  rating REAL,
  timestamp TIMESTAMP,
  FOREIGN KEY(movieId) REFERENCES movies(movieId),
  PRIMARY KEY (userId, movieId)
);

Note: Ratings are from 0 to 5.

movieId is a foreign key: the value of movieId in the ratings table references a movie with the same movieId in the movies table.

Foreign keys allow us to model 1-N relationships in the database. In this case, every move can have multiple (N) ratings, but each rating is associated with only one (1) movie.

Show table:

SELECT * FROM ratings LIMIT 10;

Joining two tables:

SELECT *
  FROM movies JOIN ratings
    ON movies.movieId = ratings.movieId
 LIMIT 10;     

If column names are the same, we can just write USING (note the parentheses):

SELECT *
  FROM movies JOIN ratings USING (movieId)
 LIMIT 10;     

Computing the average rating:

  SELECT title, AVG(rating)
    FROM movies JOIN ratings USING (movieId)
GROUP BY ratings.movieId
   LIMIT 10;

Compute the top-rated movies:

  SELECT title, AVG(rating) AS avg_rating
    FROM movies JOIN ratings USING (movieId)
GROUP BY ratings.movieId
ORDER BY avg_rating DESC
   LIMIT 20;

That doesn't seem right. I don't recognize any of those movies! Let's also include the number of ratings:

  SELECT title, AVG(rating) AS avg_rating, COUNT(rating) AS num_ratings
    FROM movies JOIN ratings USING (movieId)
GROUP BY ratings.movieId
ORDER BY avg_rating DESC
   LIMIT 20;

Let's only count the movies with 50+ ratings:

  SELECT title, AVG(rating) AS avg_rating, COUNT(rating) AS num_ratings
    FROM movies JOIN ratings USING (movieId)
GROUP BY ratings.movieId
  HAVING num_ratings >= 50
ORDER BY avg_rating DESC
   LIMIT 20;

N-M relations

The database has an N-M relation between movies and genres. This requires using an intermediate movie_genre table:

CREATE TABLE genres(
  genreId INTEGER PRIMARY KEY,
  genre TEXT NOT NULL
);

CREATE TABLE movie_genre(
  movieId INTEGER,
  genreId INTEGER,
  FOREIGN KEY(movieId) REFERENCES movies(movieId),
  FOREIGN KEY(genreId) REFERENCES movies(genreId),
  PRIMARY KEY (movieId, genreId)
);

Show the genres for each movie:

  SELECT title, genre
    FROM movies 
    JOIN movie_genre USING (movieId)
    JOIN genres USING (genreId)
   LIMIT 20;

Number of movies made per genre in a given year.

  SELECT genre, COUNT(*) as num_movies
    FROM movies 
    JOIN movie_genre USING (movieId)
    JOIN genres USING (genreId)
   WHERE year = 2015
GROUP BY genre
ORDER BY num_movies DESC;       

Nested Queries

Here are all the movies with their average ranking:

  SELECT year, title, AVG(rating) AS avg_rating
    FROM movies JOIN ratings USING (movieId)
GROUP BY ratings.movieId
  HAVING COUNT(rating) >= 10

(we lower the threshold to 10 since movies in recent years don't have that many ratings)

What if we want to find the top-rated title of the year? We can do a nested query, and write a query on the results of the above query:

    SELECT year, title, MAX(avg_rating)
      FROM (

              SELECT year, title, AVG(rating) AS avg_rating
                FROM movies JOIN ratings USING (movieId)
            GROUP BY ratings.movieId
              HAVING COUNT(rating) >= 10

           )
  GROUP BY year
  ORDER BY year DESC;

Nested queries can also be used instead of joins when we are only doing the join for the purposes of filtering rows, and not actually including any columns from the joined tables. For example, if we wanted to obtain the title and year of every movie that has only received ratings less than or equal to 1:

  SELECT title, year 
    FROM movies JOIN ratings USING(movieId) 
GROUP BY movieId 
  HAVING MAX(rating) <= 1;

We could instead use a subquery to first obtain the movies with a max rating of 1, and then use that to filter the movies in the movies table:

  SELECT title, year 
    FROM movies 
   WHERE movieId IN (
                        SELECT movieId 
                          FROM ratings 
                      GROUP BY movieId 
                        HAVING MAX(rating) <= 1
                    );                  

INSERT, UPDATE, DELETE

Inserting a new row into the movies table:

INSERT INTO movies VALUES (164980, "Hello world!", 2016);

We can also insert a row with only some of the values. If the table specification allows it, the other values will be given NULL values.

INSERT INTO movies (movieID, title) VALUES (164981, "Hello Universe")

Updating a row:

UPDATE movies SET year=2016 WHERE movieID=164981;

Deleting rows:

DELETE FROM movies WHERE movieID=164981 OR movieID=164980;