CS 122, Winter 2017: SQL¶
The objective of this lab is to give you practice in writing SQL queries and using the sqlite3
program to work with databases.
To pick up the files you will need for this lab, go to your individual repository and do a git pull upstream master
. You will then have a lab3
directory seeded with .csv
files.
We’ll start by reviewing the syntax of SQL that you will need to know,
and provide instructions on how to use the sqlite3
program. Then,
we’ll ask you to write seven different queries over a sample
dataset. These queries build from simpler to more complex, giving you
experience with a range of SQL features and applications.
SQL basics¶
Using sqlite3¶
We will be using sqlite3
. The command:
sqlite3 stocks.db
will fire up the interactive version of sqlite3
with the database
stocks.db
pre-loaded. If the file stocks.db
does not exist, then
sqlite will create an empty database.
At the sqlite3
command line, you can run SQL queries directly by typing them in; their results will display as soon as they complete. You can also issue commands to sqlite3
by entering them in; to distinguish between SQL and sqlite3
commands, commands that are not SQL queries begin with a period.
You can run queries from a file using the .read
command. For example:
.read task1.sql
If you make a mistake and need to remove a table, you can use the SQL
DROP
command. For example, the statement:
DROP TABLE candidate;
drops the candidate table from the current database.
You can also delete a database entirely using the linux rm
command once you have quit out of sqlite3
.
The sqlite command .quit
ends an interactive session.
The command:
sqlite3 stocks.db < some-query.sql
will load the database stocks.db
, run the query specified in the
file some-query.sql
, print the result, and then exit.
Creating tables¶
To create a table, you will use the CREATE TABLE
command. Here is
a create statement that we might use for a table of political candidates:
CREATE TABLE candidate
(committee_id VARCHAR(10),
lastname VARCHAR(50),
firstname VARCHAR(50),
party VARCHAR(6),
city VARCHAR(20),
state VARCHAR(2),
zip integer,
cand_id VARCHAR(10),
district VARCHAR(2),
CONSTRAINT pk_candidate PRIMARY KEY (committee_id));
The type varchar is used for strings. Other types include integer, float, etc. See here for a list of legal data types.
The default column separator used by sqlite3
when it imports from a file is the vertical line (|) character. If you are working with a file, like our sample files, that uses a comma, then you first need to:
.separator ,
To import data into a table, you will use the sqlite command
.import
. For example:
.import candidate.csv candidate
loads the data from the file candidate.csv
into the candidate
table.
Note that we have provided this file, and contributions.csv
in the
distribution if you would like to try our examples yourself.
The file create-fec.sql
will load them into tables with the
appropriate schemas:
sqlite3 fec.db < create-fec.sql
at the main Linux command-line will create a database pre-populated with both tables.
Selection and Projection¶
The two most basic SQL operations are selection and projection. A selection operation specifies a condition and table and the output is a table that contains only the rows from the original table that meet the condition. For example, the following SQL query:
SELECT * FROM candidate WHERE state = "IL";
extracts all of the rows for candidates who live in “IL”.
The projection operation specifies a set of column names and a table and returns a table with the specified columns from the original table. For example, the following SQL query:
SELECT lastname, party FROM candidate;
yields a table with two columns: one containing the last names of the candidates and another containing their parties.
We can combine these operations to get the names and parties of all candidates from Illinois:
SELECT lastname, party FROM candidate WHERE state = "IL";
To remove duplicates, you can add the keyword DISTINCT. For example, the query:
SELECT DISTINCT party FROM candidate WHERE state = "IL";
returns a table with one entry for each party that has at least one candidate in Illinois.
String equality in SQL is case sensitive, so the string "IL"
will not
match the strings "il"
and "Il"
. The clause COLLATE NOCASE
can be used
to tell SQL to ignore the case when comparing two strings. For example:
SELECT DISTINCT party
FROM candidate
WHERE state = "IL"
COLLATE NOCASE;
returns a table with one entry for each party that has at least one
candidate in Illinois, even if the only candidate in a party has the
abbreviation "Il"
listed for his or her home state.
Aggregation and GROUP BY¶
The GROUP BY
command is used to identify subsets of the rows and
is usually combined with an aggregation operation (COUNT, AVG, MIN, MAX,
etc.). For example, the following SQL query:
SELECT party, COUNT(*) AS num
FROM candidate
GROUP BY party
COLLATE NOCASE;
determines the number of candidates in each party. The AS num
clause specifies that the resulting table will use the name num
for the column that contains the number of candidates in each party.
Often we would like to see the output in sorted order. The ORDER BY
clause allows us to specify a sorting order for the output. For example,
we can add a clause ORDER BY num
:
SELECT party, COUNT(*) AS num
FROM candidate
GROUP BY party
COLLATE NOCASE
ORDER BY num;
to sort the parties in increasing order by the number of candidates in
the party. Adding the keyword DESC
after COUNT
in the
ORDER BY
clause yields a table that is sorted in decreasing order.
If we want the three parties with the most candidates, we could write:
SELECT party, COUNT(*) AS num
FROM candidate
GROUP BY party
COLLATE NOCASE
ORDER BY num DESC
LIMIT 3;
Notice that we changed the sorting order to decreasing and added a
LIMIT
clause to limit the number of entries in the output.
The HAVING
clause allows us to select rows after the group by
operation has been performed. For example, the following query:
SELECT party, COUNT(*) AS num FROM candidate GROUP BY party
HAVING num >= 1000 COLLATE NOCASE ORDER BY num;
returns a table with the parties that have at least 1000 candidates.
JOINS¶
Selection, projection and group by are useful operations, but the real power of relational databases comes from combining information from multiple tables using the JOIN operation. There are different types of JOINS. We will only be using inner joins, which are the default in sqlite. An inner join operation does a cross product and then keeps only the resulting rows that match the join condition. For example, here’s a query that contains the name and amount for each contribution made to a candidate:
SELECT lastname, amount
FROM candidate JOIN contribution
ON candidate.cand_id = contribution.cand_id;
On occasion, it can be useful to join a table with itself (called a self-join). For example, if we wanted to determine whether a candidate had more than one candidate id, we could use the following query:
SELECT l.lastname, l.firstname, l.state, l.city, l.zip, l.cand_id
FROM candidate AS l JOIN candidate AS r
ON l.lastname = r.lastname AND
l.firstname = r.firstname AND
l.city = r.city AND
l.state = r.state AND
l.zip = r.zip
WHERE l.cand_id != r.cand_id
COLLATE NOCASE
ORDER BY l.lastname, l.firstname;
Nested queries¶
Queries can be nested. For example, the following query:
SELECT DISTINCT lastname, firstname
FROM candidate JOIN contribution ON candidate.cand_id = contribution.cand_id
WHERE amount >= 3 * (SELECT AVG(amount) AS amt FROM contribution);
generates a table that contains the names of candidates who received a contribution that was at least 3 times the average contribution. In this case, the inner select returns a table with one column and one row and the value can be used as a single integer.
Here’s a query that counts the number of candidates who have at least 10 contributions:
SELECT COUNT(*)
FROM candidate
WHERE candidate.cand_id IN
(SELECT cand_id FROM contribution GROUP BY cand_id HAVING COUNT(*) >= 10);
Notice that the sub-query yields a table that is likely to have more
than one entry. The IN
operator returns true of the value of the left
operand (contribution
) occurs in the right operand (the sub-query),
which is a table. You should think of this as a set membership test, in mathematical terms.
Finally, a sub-query can be used anywhere the name of a table can appear.
The rest of this document describes how to get started, the data you will be working with, using sqlite3, and your tasks.
Your tasks¶
Getting started¶
If you haven’t already, git pull upstream master
to pick up the .csv
files you will need.
For each task, your solution should go in the file with the
corresponding name. For example, the solution for task 1 goes in the
file named task1.sql
.
Data¶
The data we will be using represents companies, their stocks, and the stock market indices that they belong to.
We will be working with three types of data: company information, index information, and price information. Each row in the company data contains a ticker symbol (7), name (100), and sector (50) for a company. Each row in the index data contains a ticker symbol and the name of an index (10). And finally, a row in the price data includes a ticker symbol and a price. The numbers in parenthesis specify an upper bound on the length of the longest string for that field.
You may assume that the ticker symbol is a key for the company data and the price data. In contrast, a given ticker symbol may appear more than once in the index data. For example, all the stocks in the DJIA are also in the SP500 index.
Tasks¶
Task 1
Write a set of SQL statements and sqlite import commands to create three
tables—company_info, index_info, and prices—and import the
associated data into the tables from the files we have provided in your lab3
directory.
Task 2
Write a SQL query to identify the names of all the companies in the “Services” sector.
Task 3
Write a query that produces a table with the names of all the sectors.
Task 4
Write a query to determine the number of companies in each sector. The output should be sorted in increasing order by sector name.
Task 5
Write a query to determine the names of the stocks in the Dow Jones Industrial Average (DJIA).
Task 6
Write a query to determine the top five sectors (by number of stocks) in the SP500 index.
Task 7
Write a query to determine the names of the companies that appear in
both the DJIA and the SP500 index. Hint: use a three-way join
(x JOIN y JOIN z
).
Submission¶
Please git add
all of your .sql
files (one for each task), git commit
and git push
them to the server. While this lab is not graded, we may ask you to demonstrate that you have completed it if you have questions about using SQL.