Relational Databases: Lecture Notes
===================================

Terms
-----

-  SQL -- query language
-  row == tuple
-  column == attribute or field
-  table == relation

-  declarative language
-  data definition language
-  data manipulation language

Relational databases were introduced in the 1970s... why has the idea
survived so long?  query optimization!

We will be using ``sqlite3``, a light weight relational database
implementation.

We will be working with a database that has information about
contributions to candidates for federal office.

Commands that start with . contain instructions to sqlite3, rather than
queries, etc.

::

    $ run sqlite3 fec.sqlite3 < create-fec.sql to create the fec database.
    $ sqlite3 fec.sqlite3

    sqlite> .tables
    candidate     contribution

    .mode column   /* line up the columns in output */
    .header ON     /* output headers with results */

Schema::

    sqlite> .schema candidate
    CREATE TABLE candidate
      (committee_id varchar(10),
       lastname varchar(50),
       firstname varchar(50),
       party varchar(6),
       city varchar(20),
       state varchar(2),
       zip integer,
       cand_id varchar(10),
       district varchar(2),
       constraint pk_candidate primary key (committee_id));    

The data has already been loaded by the script that created the tables.
Here's how you would load the data in bulk, if necessary::

    sqlite> .separator ","
    sqlite> .import candidate.csv candidate


You can type in queries from the command-line::

    sqlite> SELECT * FROM candidate WHERE party = "REP" LIMIT 10;

or read queries from a file::

    sqlite> .read q1.sql

where the file ``q1.sql`` contains::

    -- Pandas: df[df.party == "REP"] 

    SELECT * FROM candidate WHERE party = "REP" LIMIT 10;

Please note the line that starts with ``--`` is a comment and is
ignored by the query engine.


SQL Basics
----------

Selection operation
~~~~~~~~~~~~~~~~~~~

Selection: use a boolean expression to describe which tuples should be
included in the result.  For example, list all candidates FROM a
specific party::

    SELECT * FROM candidate WHERE party = "REP";

Use collate to handle differences in case: rep, Rep, REP::

    SELECT * FROM candidate WHERE party = "rep" COLLATE NOCASE;

In the case of the candidate data, all of the party names are in all
caps, so the collate is unnecessary.

Note the preferred style is to use all CAPS for keywords, but it is
not required.


Projection
~~~~~~~~~~

Projection is used to describe which attributes should be included
in the result.  For example, list the names of all the candidates::


  SELECT lastname, firstname FROM candidate;


Combine ideas from selection and projection to find the names of all
the candidates in the Republican party::

  SELECT lastname, firstname FROM candidate WHERE party = "REP";


Use limit to reduce the number that are included in the result::

  SELECT lastname, firstname FROM candidate WHERE party = "REP" LIMIT 10;


Aggregation
~~~~~~~~~~~

The simplest form of aggregation is counting rows::

  sqlite> SELECT count(*) FROM candidate WHERE party = "REP";
  count(*)  
  ----------
  2367      


Use the AS operator to name an attribute::

  sqlite> SELECT count(*) AS cnt FROM candidate WHERE party = "REP";
  cnt       
  -------
  2367


Group by provides another way to do aggregation: group the data using
some criteria and then aggregate over the groups.  For example, we
might want to compute a table with the number of candidates per party::

  SELECT party, count(*) AS cnt 
  FROM candidate 
  GROUP BY party;


We can sort the result, in increasing order by default::

  SELECT party, count(*) AS cnt 
  FROM candidate 
  GROUP BY party 
  ORDER BY cnt;

Notice the use of AS to give a name to the count and the use of that
name in the ORDER BY clause.

To sort in decreasing ORDER BY number of candidates add DESC::

  SELECT party, count(*) AS cnt FROM candidate 
  GROUP BY party 
  ORDER BY cnt DESC;


To find the top 10 parties by candidate, add in a use of LIMIT::

  SELECT party, count(*) AS cnt 
  FROM candidate 
  GROUP BY party  
  ORDER BY cnt DESC 
  LIMIT 10;


Finally, use HAVING to select which groups/aggregate results to
include in the result.  For example: find parties that have at least
100 candidates::

  SELECT party, count(*) AS cnt 
  FROM candidate 
  GROUP BY party 
  HAVING cnt > 100 
  ORDER BY cnt;

The where clause specifies the rows to include BEFORE grouping.
Having specifies which results to include AFTER grouping/aggregation.

A few extra examples
~~~~~~~~~~~~~~~~~~~~

What was the total number of contributions made by residents of each
state?

::

    SELECT state, count(*) AS total
    FROM contribution
    GROUP BY state
    ORDER BY state;

What is the smallest positive contribution in each state::

    SELECT state, min(amount) AS total
    FROM contribution
    WHERE amount > -1
    GROUP BY state
    ORDER BY state;


Joins
~~~~~

Joins allow us to combine information from different tables or even
using the same table (self-join).

Here is information about employees in a tiny firm.  Each row contains
an employee number and a pay per period::

    001,1850
    003,2000
    004,1975
    005,2450

Here's the schema for this data::

    CREATE TABLE employee_records
      (employee_number varchar(3),
       amount integer);

The employee number serves as a primary key for the employee table:
that is, each employee number uniquely identifies a tuple.

Here's some sample payroll data from a tiny company::

  C1,001,1850
  C2,002,2200
  C3,003,1000
  C4,003,1000

The first column contains a check number, the second is an employee
number, and the third is an amount.  We can create this table using the
following schema::

    CREATE TABLE payroll_ledger
      (check_number varchar(10),
       employee_number varchar(3),
       pay_per_period integer);

The check number serves the primary key for this table.

Let's combine these two tables using a JOIN without specifying a join
condition::

    sqlite> SELECT * FROM payroll_ledger JOIN employee_records;
    C1,001,1850,001,1850
    C1,001,1850,003,2000
    C1,001,1850,004,1975
    C1,001,1850,005,2450
    C2,002,2200,001,1850
    C2,002,2200,003,2000
    C2,002,2200,004,1975
    C2,002,2200,005,2450
    C3,003,1000,001,1850
    C3,003,1000,003,2000
    C3,003,1000,004,1975
    C3,003,1000,005,2450
    C4,003,1000,001,1850
    C4,003,1000,003,2000
    C4,003,1000,004,1975
    C4,003,1000,005,2450

The result is the cross-product: every row in payroll_ledger is
matched with every row in employee_record.  This result is not all
that useful.  Let's add a JOIN condition to keep only the rows that
contain payroll and employee information for the same employee (as
specified by the employee number)::

    sqlite> SELECT * FROM
       ...> payroll_ledger JOIN employee_records 
       ...> ON payroll_ledger.employee_number = employee_records.employee_number;
    C1,001,1850,001,1850
    C3,003,1000,003,2000
    C4,003,1000,003,2000

Notice that the result is much smaller.  Also, notice that there is no
entry for check ``C2``.  Check ``C2`` references employee number
``002``, who does not appear in the employee records table.  Also,
notice employees ``004`` and ``005`` appear in the employee records
table, but not in the payroll ledger.  As a result, the join does not
include entries for them.

We can use a LEFT JOIN to make sure that row that appears in table on
the left of the JOIN operator appears in the result::

    sqlite> SELECT * FROM
       ...> payroll_ledger LEFT JOIN employee_records
       ...> ON payroll_ledger.employee_number = employee_records.employee_number;
    C1,001,1850,001,1850
    C2,002,2200,,
    C3,003,1000,003,2000
    C4,003,1000,003,2000

Notice that check ``C2`` appears in the result with nulls for the
employee record values.  Sqlite3 does not support RIGHT JOIN, which
includes all rows from the right operand of the JOIN at least once or
OUTER JOIN, which includes all rows from both operands at least once.

Here are some sample queries that use JOIN on the candidate and
contribution data.

Combine data from candidate and contribution tables using the
candidate ID as the join key::

    SELECT *
    FROM candidate JOIN contribution
    ON candidate.cand_id = contribution.cand_id
    LIMIT 10;

Same as the previous example, except use projection to get rid of
duplicate information::

    SELECT cand.lastname, cand.firstname, cand.state AS cand_state, 
           cont.state AS cont_state, cont.amount  
    FROM candidate AS cand JOIN contribution AS cont 
    ON cand.cand_id = cont.cand_id
    LIMIT 10;


Add in grouping: find the top 3 candidates in terms of total
contributions.  Exclude negative amounts.

::

    SELECT cand.cand_id, cand.lastname, cand.firstname, SUM(cont.amount) as total
    FROM candidate AS cand JOIN contribution AS cont 
    ON cand.cand_id = cont.cand_id
    WHERE (cont.amount > 0) and (cand.state = cont.state)
    GROUP BY cand.cand_id
    ORDER BY total DESC
    LIMIT 3;


Nested queries
~~~~~~~~~~~~~~

Queries can be nested in a variety of ways.  Here's a simple example:
how many candidates received at least one positive contribution?

::

    SELECT COUNT(*)
    FROM candidate
    WHERE candidate.cand_id IN (SELECT DISTINCT cand_id FROM contribution WHERE amount > 0);


Here's a sightly more complex query: list candidates who had at least
one contribution of more than three times the average contribution::

    SELECT DISTINCT lastname, firstname
    FROM candidate JOIN contribution ON candidate.cand_id = contribution.cand_id
    WHERE amount > 3 * (SELECT AVG(amount) FROM contribution);


DISTINCT eliminates duplicates from the result.


The CASE clause allows us to include different values for an attribute
based on a boolean condition.  The general form is::

    CASE WHEN [condition] THEN [expression1] ELSE [expression2] END

The attribute will have the value of expression1 if the condition is
true and the value of expression2 otherwise.  Here is an example query
that yields a table that includes columns for lastname, homestate,
candidate id, an in-state contribution amount, and out of state
contribution amount (note that only one of these two amounts will be
non-zero)::

    SELECT lastname, candidate.state AS homestate, candidate.cand_id AS cand_id,
            CASE WHEN candidate.state= contribution.state THEN amount else 0 END AS isc,
            CASE WHEN candidate.state != contribution.state THEN amount else 0 end AS oosc
    FROM candidate INNER JOIN contribution 
    ON candidate.cand_id = contribution.cand_id;

This query can be nested within another to produce a table of the
total amount of in-state contributions to a candidate and the total
amount of out-of-state contributions::

    SELECT lastname, homestate, SUM(isc), SUM(oosc)
    FROM (SELECT lastname, candidate.state AS homestate, candidate.cand_id AS cand_id,
            CASE WHEN candidate.state= contribution.state THEN amount else 0 END AS isc,
            CASE WHEN candidate.state != contribution.state THEN amount else 0 end AS oosc
          FROM candidate INNER JOIN contribution ON candidate.cand_id = contribution.cand_id)
    GROUP BY cand_id;


Inserting/updating/deleting rows
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here's the general form for inserting rows::

    INSERT INTO <table>
    VALUES (X,Y,Z)

Here's an example::

    INSERT INTO candidate
    VALUES ("HHHH", "DUPONT", "TODD", "LIB", "CHICAGO", "IL", "60637", "CCCCC", "01");

This form requires that we supply a value for every attribute in the
schema.

Here's a different way to insert rows::

    INSERT INTO <table> (<col name>, <col name>, <col name>)
    VALUES (X,Y,Z)

and here's a sample use::

    INSERT INTO candidate (lastname, firstname, party, city, state, zip, cand_id, district)
    VALUES ("FOO", "BAR", "DEM", "CHICAGO", "IL", "60637", "CCCCC", "01");

This form does not require a value for every attribute.  Notice that
we did not supply a value for the committee ID, for example.  Values
that are not supplied will be set to NULL.


Here's the general form for updating rows::

    UPDATE <table>
    SET <col name> = <expr>, <col name> = <expr>, ...
    WHERE <expr>

and here's an example::

    UPDATE candidate
    SET committee_id = "H55555"
    WHERE lastname="DUPONT";

Note that *every* row that matches the where clause will be updated!


Here's the general form for deleting rows::

    DELETE FROM <table>
    WHERE <expr>

and here's an example::

    DELETE FROM candidate
    WHERE lastname="DUPONT";

Again, *every* row that matches the where clause will be deleted!