Relational databases allow us to work with tables ("relation" is a fancy word for table). A relational database (Oracle, SQL Server, MySQL, etc.) will store one or more tables on disk. SQLite uses a single file, but other systems use multiple files and a client/server architecture.
The SQL language is used to specify queries on the data, as well as insertions/deletions/updates.
Why would we use a database? Doesn't Pandas already allow us to work with tables?
The MovieLens dataset contains information about:
We will be using a small dataset (9,000 movies, 100,000 ratings) that is only 4MB, but the full dataset is a 1GB file (40,000 movies, 24,000,000 ratings).
Movies table:
CREATE TABLE movies (
movieId INTEGER PRIMARY KEY,
title TEXT NOT NULL,
year INTEGER
);
Open the database:
sqlite3 movielens-small.sqlite
Enable pretty output:
.headers on
.mode column
All the movies from 1933:
SELECT * FROM movies WHERE year = 1933;
movieId title year
---------- ---------- ----------
1256 Duck Soup 1933
2366 King Kong 1933
3932 Invisible 1933
4921 Little Wom 1933
7080 42nd Stree 1933
7081 I'm No Ang 1933
8256 Queen Chri 1933
8670 Testament 1933
25795 Dinner at 1933
25801 She Done H 1933
48301 Private Li 1933
71102 Three Musk 1933
80748 Alice in W 1933
The titles from all the movies from 1933:
SELECT title FROM movies WHERE year = 1933;
title
----------
Duck Soup
King Kong
Invisible
Little Wom
42nd Stree
I'm No Ang
Queen Chri
Testament
Dinner at
She Done H
Private Li
Three Musk
Alice in W
All the movies (there's 9000+ movies, so we use LIMIT
to only print the top n rows)
SELECT * FROM movies LIMIT 10;
movieId title year
---------- ---------- ----------
1 Toy Story 1995
2 Jumanji 1995
3 Grumpier O 1995
4 Waiting to 1995
5 Father of 1995
6 Heat 1995
7 Sabrina 1995
8 Tom and Hu 1995
9 Sudden Dea 1995
10 GoldenEye 1995
import sqlite3
conn = sqlite3.connect("movielens-small.sqlite")
c = conn.execute("SELECT * FROM movies LIMIT 10")
for row in c:
print(row)
Find the number of movies.
SELECT COUNT(*) FROM movies;
Use "AS" to name the resulting column.
SELECT COUNT(*) AS num FROM movies;
Find the number of movies from 1933.
SELECT COUNT(*) AS num FROM movies WHERE year = 1933;
Find the number of movies per year.
SELECT year, COUNT(*) AS num
FROM movies
GROUP BY year;
Order it in descending order.
SELECT year, COUNT(*) AS num
FROM movies
GROUP BY year
ORDER BY num DESC;
Include only years with more than 150 movies. Note that WHERE filters by row, and HAVING BY filters by group.
SELECT year, COUNT(*) AS num
FROM movies
GROUP BY year
HAVING num > 150
ORDER BY num DESC;
Find all the movies that have been remade:
SELECT title, COUNT(*) AS num
FROM movies
GROUP BY title
HAVING num > 1;
The ratings table contains one row per rating entered by one user for one movie (i.e., a given user cannot rate a movie more than once).
CREATE TABLE ratings(
userId INTEGER,
movieId INTEGER,
rating REAL,
timestamp TIMESTAMP,
FOREIGN KEY(movieId) REFERENCES movies(movieId),
PRIMARY KEY (userId, movieId)
);
Note: Ratings are from 0 to 5.
movieId
is a foreign key: the value of movieId
in the ratings
table references a movie with the same movieId
in the movies
table.
Foreign keys allow us to model 1-N relationships in the database. In this case, every move can have multiple (N) ratings, but each rating is associated with only one (1) movie.
Show table:
SELECT * FROM ratings LIMIT 10;
Joining two tables:
SELECT *
FROM movies JOIN ratings
ON movies.movieId = ratings.movieId
LIMIT 10;
If column names are the same, we can just write USING (note the parentheses):
SELECT *
FROM movies JOIN ratings USING (movieId)
LIMIT 10;
Computing the average rating:
SELECT title, AVG(rating)
FROM movies JOIN ratings USING (movieId)
GROUP BY ratings.movieId
LIMIT 10;
Compute the top-rated movies:
SELECT title, AVG(rating) AS avg_rating
FROM movies JOIN ratings USING (movieId)
GROUP BY ratings.movieId
ORDER BY avg_rating DESC
LIMIT 20;
That doesn't seem right. I don't recognize any of those movies! Let's also include the number of ratings:
SELECT title, AVG(rating) AS avg_rating, COUNT(rating) AS num_ratings
FROM movies JOIN ratings USING (movieId)
GROUP BY ratings.movieId
ORDER BY avg_rating DESC
LIMIT 20;
Let's only count the movies with 50+ ratings:
SELECT title, AVG(rating) AS avg_rating, COUNT(rating) AS num_ratings
FROM movies JOIN ratings USING (movieId)
GROUP BY ratings.movieId
HAVING num_ratings >= 50
ORDER BY avg_rating DESC
LIMIT 20;
The database has an N-M relation between movies and genres. This requires using an intermediate movie_genre
table:
CREATE TABLE genres(
genreId INTEGER PRIMARY KEY,
genre TEXT NOT NULL
);
CREATE TABLE movie_genre(
movieId INTEGER,
genreId INTEGER,
FOREIGN KEY(movieId) REFERENCES movies(movieId),
FOREIGN KEY(genreId) REFERENCES movies(genreId),
PRIMARY KEY (movieId, genreId)
);
Show the genres for each movie:
SELECT title, genre
FROM movies
JOIN movie_genre USING (movieId)
JOIN genres USING (genreId)
LIMIT 20;
Number of movies made per genre in a given year.
SELECT genre, COUNT(*) as num_movies
FROM movies
JOIN movie_genre USING (movieId)
JOIN genres USING (genreId)
WHERE year = 2015
GROUP BY genre
ORDER BY num_movies DESC;
Here are all the movies with their average ranking:
SELECT year, title, AVG(rating) AS avg_rating
FROM movies JOIN ratings USING (movieId)
GROUP BY ratings.movieId
HAVING COUNT(rating) >= 10
(we lower the threshold to 10 since movies in recent years don't have that many ratings)
What if we want to find the top-rated title of the year? We can do a nested query, and write a query on the results of the above query:
SELECT year, title, MAX(avg_rating)
FROM (
SELECT year, title, AVG(rating) AS avg_rating
FROM movies JOIN ratings USING (movieId)
GROUP BY ratings.movieId
HAVING COUNT(rating) >= 10
)
GROUP BY year
ORDER BY year DESC;
Nested queries can also be used instead of joins when we are only doing the join for the purposes of filtering rows, and not actually including any columns from the joined tables. For example, if we wanted to obtain the title and year of every movie that has only received ratings less than or equal to 1:
SELECT title, year
FROM movies JOIN ratings USING(movieId)
GROUP BY movieId
HAVING MAX(rating) <= 1;
We could instead use a subquery to first obtain the movies with a max rating of 1, and then use that to filter the movies in the movies table:
SELECT title, year
FROM movies
WHERE movieId IN (
SELECT movieId
FROM ratings
GROUP BY movieId
HAVING MAX(rating) <= 1
);
Inserting a new row into the movies
table:
INSERT INTO movies VALUES (164980, "Hello world!", 2016);
We can also insert a row with only some of the values. If the table specification allows it, the other values will be given NULL values.
INSERT INTO movies (movieID, title) VALUES (164981, "Hello Universe")
Updating a row:
UPDATE movies SET year=2016 WHERE movieID=164981;
Deleting rows:
DELETE FROM movies WHERE movieID=164981 OR movieID=164980;