CS 122, Winter 2017: SQL
===============================

The objective of this lab is to give you practice in writing SQL queries and using the ``sqlite3`` program to work with databases.

To pick up the files you will need for this lab, go to your individual repository and do a ``git pull upstream master``. You will then have a ``lab3`` directory seeded with ``.csv`` files.

We'll start by reviewing the syntax of SQL that you will need to know,
and provide instructions on how to use the ``sqlite3`` program. Then,
we'll ask you to write seven different queries over a sample
dataset. These queries build from simpler to more complex, giving you
experience with a range of SQL features and applications.

SQL basics
----------

Using sqlite3
~~~~~~~~~~~~~

We will be using ``sqlite3``. The command::

    sqlite3 stocks.db

will fire up the interactive version of ``sqlite3`` with the database
``stocks.db`` pre-loaded. If the file ``stocks.db`` does not exist, then
sqlite will create an empty database.

At the ``sqlite3`` command line, you can run SQL queries directly by typing them in; their results will display as soon as they complete. You can also issue commands to ``sqlite3`` by entering them in; to distinguish between SQL and ``sqlite3`` commands, commands that are not SQL queries begin with a period.

You can run queries from a file using the ``.read``
command. For example::

    .read task1.sql

If you make a mistake and need to remove a table, you can use the SQL
``DROP`` command. For example, the statement:

.. code:: sql

    DROP TABLE candidate;

drops the candidate table from the current database.

You can also delete a database entirely using the linux ``rm`` command once you have quit out of ``sqlite3``.

The sqlite command ``.quit`` ends an interactive session.

The command::

    sqlite3 stocks.db < some-query.sql

will load the database ``stocks.db``, run the query specified in the
file ``some-query.sql``, print the result, and then exit.

Creating tables
~~~~~~~~~~~~~~~

To create a table, you will use the ``CREATE TABLE`` command. Here is
a create statement that we might use for a table of political candidates:

.. code:: sql

    CREATE TABLE candidate
      (committee_id VARCHAR(10),
       lastname VARCHAR(50),
       firstname VARCHAR(50),
       party VARCHAR(6),
       city VARCHAR(20),
       state VARCHAR(2),
       zip integer,
       cand_id VARCHAR(10),
       district VARCHAR(2),
       CONSTRAINT pk_candidate PRIMARY KEY (committee_id));

The type varchar is used for strings. Other types include integer,
float, etc. See `here <http://www.sqlite.org/datatype3.html>`_ for a
list of legal data types.

The default column separator used by ``sqlite3`` when it imports from a file is the vertical line (|) character. If you are working with a file, like our sample files, that uses a comma, then you first need to::

    .separator ,

To import data into a table, you will use the sqlite command
``.import``. For example::

    .import candidate.csv candidate

loads the data from the file ``candidate.csv`` into the ``candidate``
table.
Note that we have provided this file, and ``contributions.csv`` in the
distribution if you would like to try our examples yourself.
The file ``create-fec.sql`` will load them into tables with the 
appropriate schemas::

    sqlite3 fec.db < create-fec.sql 

at the main Linux command-line will create a database pre-populated with both tables.

Selection and Projection
~~~~~~~~~~~~~~~~~~~~~~~~

The two most basic SQL operations are selection and projection. A
selection operation specifies a condition and table and the output is a
table that contains only the rows from the original table that meet the
condition. For example, the following SQL query:

.. code:: sql

    SELECT * FROM candidate WHERE state = "IL";

extracts all of the rows for candidates who live in "IL".

The projection operation specifies a set of column names and a table and
returns a table with the specified columns from the original table. For
example, the following SQL query:

.. code:: sql

    SELECT lastname, party FROM candidate;

yields a table with two columns: one containing the last names of the
candidates and another containing their parties.

We can combine these operations to get the names and parties of all
candidates from Illinois:

.. code:: sql

    SELECT lastname, party FROM candidate WHERE state = "IL";

To remove duplicates, you can add the keyword DISTINCT. For example, the
query:

.. code:: sql

    SELECT DISTINCT party FROM candidate WHERE state = "IL";

returns a table with one entry for each party that has at least one
candidate in Illinois.

String equality in SQL is case sensitive, so the string ``"IL"`` will not
match the strings ``"il"`` and ``"Il"``. The clause ``COLLATE NOCASE`` can be used
to tell SQL to ignore the case when comparing two strings. For example:

.. code:: sql

    SELECT DISTINCT party
    FROM candidate
    WHERE state = "IL"
    COLLATE NOCASE;

returns a table with one entry for each party that has at least one
candidate in Illinois, even if the only candidate in a party has the
abbreviation ``"Il"`` listed for his or her  home state.

Aggregation and GROUP BY
~~~~~~~~~~~~~~~~~~~~~~~~

The ``GROUP BY`` command is used to identify subsets of the rows and
is usually combined with an aggregation operation (COUNT, AVG, MIN, MAX,
etc.). For example, the following SQL query:

.. code:: sql

    SELECT party, COUNT(*) AS num
    FROM candidate
    GROUP BY party
    COLLATE NOCASE;

determines the number of candidates in each party. The ``AS num``
clause specifies that the resulting table will use the name ``num``
for the column that contains the number of candidates in each party.

Often we would like to see the output in sorted order. The ``ORDER BY``
clause allows us to specify a sorting order for the output. For example,
we can add a clause ``ORDER BY num``:

.. code:: sql

    SELECT party, COUNT(*) AS num
    FROM candidate
    GROUP BY party
    COLLATE NOCASE
    ORDER BY num;

to sort the parties in increasing order by the number of candidates in
the party. Adding the keyword ``DESC`` after ``COUNT`` in the
``ORDER BY`` clause yields a table that is sorted in decreasing order.

If we want the three parties with the most candidates, we could write:

.. code:: sql

    SELECT party, COUNT(*) AS num
    FROM candidate
    GROUP BY party
    COLLATE NOCASE
    ORDER BY num DESC
    LIMIT 3;

Notice that we changed the sorting order to decreasing and added a
``LIMIT`` clause to limit the number of entries in the output.

The ``HAVING`` clause allows us to select rows after the group by
operation has been performed. For example, the following query:

.. code:: sql

    SELECT party, COUNT(*) AS num FROM candidate GROUP BY party
    HAVING num >= 1000 COLLATE NOCASE ORDER BY num;

returns a table with the parties that have at least 1000 candidates.

JOINS
~~~~~

Selection, projection and group by are useful operations, but the real
power of relational databases comes from combining information from
multiple tables using the JOIN operation. There are different types of
JOINS. We will only be using inner joins, which are the default in
sqlite. An inner join operation does a cross product and then keeps only
the resulting rows that match the join condition. For example, here's a
query that contains the name and amount for each contribution made to a
candidate:

.. code:: sql

    SELECT lastname, amount
    FROM candidate JOIN contribution
    ON candidate.cand_id = contribution.cand_id;

On occasion, it can be useful to join a table with itself (called a
self-join). For example, if we wanted to determine whether a candidate
had more than one candidate id, we could use the following query:

.. code:: sql

    SELECT l.lastname, l.firstname, l.state, l.city, l.zip, l.cand_id
    FROM candidate AS l JOIN candidate AS r
    ON l.lastname = r.lastname AND
       l.firstname = r.firstname AND
       l.city = r.city AND
       l.state = r.state AND
       l.zip = r.zip
    WHERE l.cand_id != r.cand_id
    COLLATE NOCASE
    ORDER BY l.lastname, l.firstname;

Nested queries
~~~~~~~~~~~~~~

Queries can be nested. For example, the following query:

.. code:: sql

    SELECT DISTINCT lastname, firstname
    FROM candidate JOIN contribution ON candidate.cand_id = contribution.cand_id
    WHERE amount >= 3 * (SELECT AVG(amount) AS amt FROM contribution);

generates a table that contains the names of candidates who received a
contribution that was at least 3 times the average contribution. In this
case, the inner select returns a table with one column and one row and the
value can be used as a single integer.

Here's a query that counts the number of candidates who have at least 10
contributions:

.. code:: sql

    SELECT COUNT(*)
    FROM candidate
    WHERE candidate.cand_id IN
       (SELECT cand_id FROM contribution GROUP BY cand_id HAVING COUNT(*) >= 10);

Notice that the sub-query yields a table that is likely to have more
than one entry. The ``IN`` operator returns true of the value of the left
operand (``contribution``) occurs in the right operand (the sub-query),
which is a table. You should think of this as a set membership test, in mathematical terms.

Finally, a sub-query can be used anywhere the name of a table can
appear.

The rest of this document describes how to get started, the data you will be working with, using
sqlite3, and your tasks.

Your tasks
----------

Getting started
~~~~~~~~~~~~~~~

If you haven't already, ``git pull upstream master`` to pick up the ``.csv`` files you will need.

For each task, your solution should go in the file with the
corresponding name. For example, the solution for task 1 goes in the
file named ``task1.sql``.


Data
~~~~

The data we will be using represents companies, their stocks, and the stock market indices that they belong to.

We will be working with three types of data: company information, index
information, and price information. Each row in the company data
contains a ticker symbol (7), name (100), and sector (50) for a company.
Each row in the index data contains a ticker symbol and the name of an
index (10). And finally, a row in the price data includes a ticker
symbol and a price. The numbers in parenthesis specify an upper bound on
the length of the longest string for that field.

You may assume that the ticker symbol is a key for the company data and
the price data. In contrast, a given ticker symbol may appear more than
once in the index data. For example, all the stocks in the DJIA are also
in the SP500 index.

Tasks
~~~~~

**Task 1**

Write a set of SQL statements and sqlite import commands to create three
tables---company\_info, index\_info, and prices---and import the
associated data into the tables from the files we have provided in your ``lab3`` directory.

**Task 2**

Write a SQL query to identify the names of all the companies in the
"Services" sector.

**Task 3**

Write a query that produces a table with the names of all the sectors.

**Task 4**

Write a query to determine the number of companies in each sector. The
output should be sorted in increasing order by sector name.

**Task 5**

Write a query to determine the names of the stocks in the Dow Jones
Industrial Average (DJIA).

**Task 6**

Write a query to determine the top five sectors (by number of stocks) in
the SP500 index.

**Task 7**

Write a query to determine the names of the companies that appear in
both the DJIA and the SP500 index. Hint: use a three-way join
(``x JOIN y JOIN z``).


Submission
----------

Please ``git add`` all of your ``.sql`` files (one for each task), ``git commit`` and ``git push`` them to the server. While this lab is not graded, we may ask you to demonstrate that you have completed it if you have questions about using SQL.