CS 122, Winter 2017: SQL =============================== The objective of this lab is to give you practice in writing SQL queries and using the ``sqlite3`` program to work with databases. To pick up the files you will need for this lab, go to your individual repository and do a ``git pull upstream master``. You will then have a ``lab3`` directory seeded with ``.csv`` files. We'll start by reviewing the syntax of SQL that you will need to know, and provide instructions on how to use the ``sqlite3`` program. Then, we'll ask you to write seven different queries over a sample dataset. These queries build from simpler to more complex, giving you experience with a range of SQL features and applications. SQL basics ---------- Using sqlite3 ~~~~~~~~~~~~~ We will be using ``sqlite3``. The command:: sqlite3 stocks.db will fire up the interactive version of ``sqlite3`` with the database ``stocks.db`` pre-loaded. If the file ``stocks.db`` does not exist, then sqlite will create an empty database. At the ``sqlite3`` command line, you can run SQL queries directly by typing them in; their results will display as soon as they complete. You can also issue commands to ``sqlite3`` by entering them in; to distinguish between SQL and ``sqlite3`` commands, commands that are not SQL queries begin with a period. You can run queries from a file using the ``.read`` command. For example:: .read task1.sql If you make a mistake and need to remove a table, you can use the SQL ``DROP`` command. For example, the statement: .. code:: sql DROP TABLE candidate; drops the candidate table from the current database. You can also delete a database entirely using the linux ``rm`` command once you have quit out of ``sqlite3``. The sqlite command ``.quit`` ends an interactive session. The command:: sqlite3 stocks.db < some-query.sql will load the database ``stocks.db``, run the query specified in the file ``some-query.sql``, print the result, and then exit. Creating tables ~~~~~~~~~~~~~~~ To create a table, you will use the ``CREATE TABLE`` command. Here is a create statement that we might use for a table of political candidates: .. code:: sql CREATE TABLE candidate (committee_id VARCHAR(10), lastname VARCHAR(50), firstname VARCHAR(50), party VARCHAR(6), city VARCHAR(20), state VARCHAR(2), zip integer, cand_id VARCHAR(10), district VARCHAR(2), CONSTRAINT pk_candidate PRIMARY KEY (committee_id)); The type varchar is used for strings. Other types include integer, float, etc. See `here `_ for a list of legal data types. The default column separator used by ``sqlite3`` when it imports from a file is the vertical line (|) character. If you are working with a file, like our sample files, that uses a comma, then you first need to:: .separator , To import data into a table, you will use the sqlite command ``.import``. For example:: .import candidate.csv candidate loads the data from the file ``candidate.csv`` into the ``candidate`` table. Note that we have provided this file, and ``contributions.csv`` in the distribution if you would like to try our examples yourself. The file ``create-fec.sql`` will load them into tables with the appropriate schemas:: sqlite3 fec.db < create-fec.sql at the main Linux command-line will create a database pre-populated with both tables. Selection and Projection ~~~~~~~~~~~~~~~~~~~~~~~~ The two most basic SQL operations are selection and projection. A selection operation specifies a condition and table and the output is a table that contains only the rows from the original table that meet the condition. For example, the following SQL query: .. code:: sql SELECT * FROM candidate WHERE state = "IL"; extracts all of the rows for candidates who live in "IL". The projection operation specifies a set of column names and a table and returns a table with the specified columns from the original table. For example, the following SQL query: .. code:: sql SELECT lastname, party FROM candidate; yields a table with two columns: one containing the last names of the candidates and another containing their parties. We can combine these operations to get the names and parties of all candidates from Illinois: .. code:: sql SELECT lastname, party FROM candidate WHERE state = "IL"; To remove duplicates, you can add the keyword DISTINCT. For example, the query: .. code:: sql SELECT DISTINCT party FROM candidate WHERE state = "IL"; returns a table with one entry for each party that has at least one candidate in Illinois. String equality in SQL is case sensitive, so the string ``"IL"`` will not match the strings ``"il"`` and ``"Il"``. The clause ``COLLATE NOCASE`` can be used to tell SQL to ignore the case when comparing two strings. For example: .. code:: sql SELECT DISTINCT party FROM candidate WHERE state = "IL" COLLATE NOCASE; returns a table with one entry for each party that has at least one candidate in Illinois, even if the only candidate in a party has the abbreviation ``"Il"`` listed for his or her home state. Aggregation and GROUP BY ~~~~~~~~~~~~~~~~~~~~~~~~ The ``GROUP BY`` command is used to identify subsets of the rows and is usually combined with an aggregation operation (COUNT, AVG, MIN, MAX, etc.). For example, the following SQL query: .. code:: sql SELECT party, COUNT(*) AS num FROM candidate GROUP BY party COLLATE NOCASE; determines the number of candidates in each party. The ``AS num`` clause specifies that the resulting table will use the name ``num`` for the column that contains the number of candidates in each party. Often we would like to see the output in sorted order. The ``ORDER BY`` clause allows us to specify a sorting order for the output. For example, we can add a clause ``ORDER BY num``: .. code:: sql SELECT party, COUNT(*) AS num FROM candidate GROUP BY party COLLATE NOCASE ORDER BY num; to sort the parties in increasing order by the number of candidates in the party. Adding the keyword ``DESC`` after ``COUNT`` in the ``ORDER BY`` clause yields a table that is sorted in decreasing order. If we want the three parties with the most candidates, we could write: .. code:: sql SELECT party, COUNT(*) AS num FROM candidate GROUP BY party COLLATE NOCASE ORDER BY num DESC LIMIT 3; Notice that we changed the sorting order to decreasing and added a ``LIMIT`` clause to limit the number of entries in the output. The ``HAVING`` clause allows us to select rows after the group by operation has been performed. For example, the following query: .. code:: sql SELECT party, COUNT(*) AS num FROM candidate GROUP BY party HAVING num >= 1000 COLLATE NOCASE ORDER BY num; returns a table with the parties that have at least 1000 candidates. JOINS ~~~~~ Selection, projection and group by are useful operations, but the real power of relational databases comes from combining information from multiple tables using the JOIN operation. There are different types of JOINS. We will only be using inner joins, which are the default in sqlite. An inner join operation does a cross product and then keeps only the resulting rows that match the join condition. For example, here's a query that contains the name and amount for each contribution made to a candidate: .. code:: sql SELECT lastname, amount FROM candidate JOIN contribution ON candidate.cand_id = contribution.cand_id; On occasion, it can be useful to join a table with itself (called a self-join). For example, if we wanted to determine whether a candidate had more than one candidate id, we could use the following query: .. code:: sql SELECT l.lastname, l.firstname, l.state, l.city, l.zip, l.cand_id FROM candidate AS l JOIN candidate AS r ON l.lastname = r.lastname AND l.firstname = r.firstname AND l.city = r.city AND l.state = r.state AND l.zip = r.zip WHERE l.cand_id != r.cand_id COLLATE NOCASE ORDER BY l.lastname, l.firstname; Nested queries ~~~~~~~~~~~~~~ Queries can be nested. For example, the following query: .. code:: sql SELECT DISTINCT lastname, firstname FROM candidate JOIN contribution ON candidate.cand_id = contribution.cand_id WHERE amount >= 3 * (SELECT AVG(amount) AS amt FROM contribution); generates a table that contains the names of candidates who received a contribution that was at least 3 times the average contribution. In this case, the inner select returns a table with one column and one row and the value can be used as a single integer. Here's a query that counts the number of candidates who have at least 10 contributions: .. code:: sql SELECT COUNT(*) FROM candidate WHERE candidate.cand_id IN (SELECT cand_id FROM contribution GROUP BY cand_id HAVING COUNT(*) >= 10); Notice that the sub-query yields a table that is likely to have more than one entry. The ``IN`` operator returns true of the value of the left operand (``contribution``) occurs in the right operand (the sub-query), which is a table. You should think of this as a set membership test, in mathematical terms. Finally, a sub-query can be used anywhere the name of a table can appear. The rest of this document describes how to get started, the data you will be working with, using sqlite3, and your tasks. Your tasks ---------- Getting started ~~~~~~~~~~~~~~~ If you haven't already, ``git pull upstream master`` to pick up the ``.csv`` files you will need. For each task, your solution should go in the file with the corresponding name. For example, the solution for task 1 goes in the file named ``task1.sql``. Data ~~~~ The data we will be using represents companies, their stocks, and the stock market indices that they belong to. We will be working with three types of data: company information, index information, and price information. Each row in the company data contains a ticker symbol (7), name (100), and sector (50) for a company. Each row in the index data contains a ticker symbol and the name of an index (10). And finally, a row in the price data includes a ticker symbol and a price. The numbers in parenthesis specify an upper bound on the length of the longest string for that field. You may assume that the ticker symbol is a key for the company data and the price data. In contrast, a given ticker symbol may appear more than once in the index data. For example, all the stocks in the DJIA are also in the SP500 index. Tasks ~~~~~ **Task 1** Write a set of SQL statements and sqlite import commands to create three tables---company\_info, index\_info, and prices---and import the associated data into the tables from the files we have provided in your ``lab3`` directory. **Task 2** Write a SQL query to identify the names of all the companies in the "Services" sector. **Task 3** Write a query that produces a table with the names of all the sectors. **Task 4** Write a query to determine the number of companies in each sector. The output should be sorted in increasing order by sector name. **Task 5** Write a query to determine the names of the stocks in the Dow Jones Industrial Average (DJIA). **Task 6** Write a query to determine the top five sectors (by number of stocks) in the SP500 index. **Task 7** Write a query to determine the names of the companies that appear in both the DJIA and the SP500 index. Hint: use a three-way join (``x JOIN y JOIN z``). Submission ---------- Please ``git add`` all of your ``.sql`` files (one for each task), ``git commit`` and ``git push`` them to the server. While this lab is not graded, we may ask you to demonstrate that you have completed it if you have questions about using SQL.