Relational Databases: Lecture Notes =================================== Terms ----- - SQL -- query language - row == tuple - column == attribute or field - table == relation - declarative language - data definition language - data manipulation language Relational databases were introduced in the 1970s... why has the idea survived so long? query optimization! We will be using ``sqlite3``, a light weight relational database implementation. We will be working with a database that has information about contributions to candidates for federal office. Commands that start with . contain instructions to sqlite3, rather than queries, etc. :: $ run sqlite3 fec.sqlite3 < create-fec.sql to create the fec database. $ sqlite3 fec.sqlite3 sqlite> .tables candidate contribution .mode column /* line up the columns in output */ .header ON /* output headers with results */ Schema:: sqlite> .schema candidate CREATE TABLE candidate (committee_id varchar(10), lastname varchar(50), firstname varchar(50), party varchar(6), city varchar(20), state varchar(2), zip integer, cand_id varchar(10), district varchar(2), constraint pk_candidate primary key (committee_id)); The data has already been loaded by the script that created the tables. Here's how you would load the data in bulk, if necessary:: sqlite> .separator "," sqlite> .import candidate.csv candidate You can type in queries from the command-line:: sqlite> SELECT * FROM candidate WHERE party = "REP" LIMIT 10; or read queries from a file:: sqlite> .read q1.sql where the file ``q1.sql`` contains:: -- Pandas: df[df.party == "REP"] SELECT * FROM candidate WHERE party = "REP" LIMIT 10; Please note the line that starts with ``--`` is a comment and is ignored by the query engine. SQL Basics ---------- Selection operation ~~~~~~~~~~~~~~~~~~~ Selection: use a boolean expression to describe which tuples should be included in the result. For example, list all candidates FROM a specific party:: SELECT * FROM candidate WHERE party = "REP"; Use collate to handle differences in case: rep, Rep, REP:: SELECT * FROM candidate WHERE party = "rep" COLLATE NOCASE; In the case of the candidate data, all of the party names are in all caps, so the collate is unnecessary. Note the preferred style is to use all CAPS for keywords, but it is not required. Projection ~~~~~~~~~~ Projection is used to describe which attributes should be included in the result. For example, list the names of all the candidates:: SELECT lastname, firstname FROM candidate; Combine ideas from selection and projection to find the names of all the candidates in the Republican party:: SELECT lastname, firstname FROM candidate WHERE party = "REP"; Use limit to reduce the number that are included in the result:: SELECT lastname, firstname FROM candidate WHERE party = "REP" LIMIT 10; Aggregation ~~~~~~~~~~~ The simplest form of aggregation is counting rows:: sqlite> SELECT count(*) FROM candidate WHERE party = "REP"; count(*) ---------- 2367 Use the AS operator to name an attribute:: sqlite> SELECT count(*) AS cnt FROM candidate WHERE party = "REP"; cnt ------- 2367 Group by provides another way to do aggregation: group the data using some criteria and then aggregate over the groups. For example, we might want to compute a table with the number of candidates per party:: SELECT party, count(*) AS cnt FROM candidate GROUP BY party; We can sort the result, in increasing order by default:: SELECT party, count(*) AS cnt FROM candidate GROUP BY party ORDER BY cnt; Notice the use of AS to give a name to the count and the use of that name in the ORDER BY clause. To sort in decreasing ORDER BY number of candidates add DESC:: SELECT party, count(*) AS cnt FROM candidate GROUP BY party ORDER BY cnt DESC; To find the top 10 parties by candidate, add in a use of LIMIT:: SELECT party, count(*) AS cnt FROM candidate GROUP BY party ORDER BY cnt DESC LIMIT 10; Finally, use HAVING to select which groups/aggregate results to include in the result. For example: find parties that have at least 100 candidates:: SELECT party, count(*) AS cnt FROM candidate GROUP BY party HAVING cnt > 100 ORDER BY cnt; The where clause specifies the rows to include BEFORE grouping. Having specifies which results to include AFTER grouping/aggregation. A few extra examples ~~~~~~~~~~~~~~~~~~~~ What was the total number of contributions made by residents of each state? :: SELECT state, count(*) AS total FROM contribution GROUP BY state ORDER BY state; What is the smallest positive contribution in each state:: SELECT state, min(amount) AS total FROM contribution WHERE amount > -1 GROUP BY state ORDER BY state; Joins ~~~~~ Joins allow us to combine information from different tables or even using the same table (self-join). Here is information about employees in a tiny firm. Each row contains an employee number and a pay per period:: 001,1850 003,2000 004,1975 005,2450 Here's the schema for this data:: CREATE TABLE employee_records (employee_number varchar(3), amount integer); The employee number serves as a primary key for the employee table: that is, each employee number uniquely identifies a tuple. Here's some sample payroll data from a tiny company:: C1,001,1850 C2,002,2200 C3,003,1000 C4,003,1000 The first column contains a check number, the second is an employee number, and the third is an amount. We can create this table using the following schema:: CREATE TABLE payroll_ledger (check_number varchar(10), employee_number varchar(3), pay_per_period integer); The check number serves the primary key for this table. Let's combine these two tables using a JOIN without specifying a join condition:: sqlite> SELECT * FROM payroll_ledger JOIN employee_records; C1,001,1850,001,1850 C1,001,1850,003,2000 C1,001,1850,004,1975 C1,001,1850,005,2450 C2,002,2200,001,1850 C2,002,2200,003,2000 C2,002,2200,004,1975 C2,002,2200,005,2450 C3,003,1000,001,1850 C3,003,1000,003,2000 C3,003,1000,004,1975 C3,003,1000,005,2450 C4,003,1000,001,1850 C4,003,1000,003,2000 C4,003,1000,004,1975 C4,003,1000,005,2450 The result is the cross-product: every row in payroll_ledger is matched with every row in employee_record. This result is not all that useful. Let's add a JOIN condition to keep only the rows that contain payroll and employee information for the same employee (as specified by the employee number):: sqlite> SELECT * FROM ...> payroll_ledger JOIN employee_records ...> ON payroll_ledger.employee_number = employee_records.employee_number; C1,001,1850,001,1850 C3,003,1000,003,2000 C4,003,1000,003,2000 Notice that the result is much smaller. Also, notice that there is no entry for check ``C2``. Check ``C2`` references employee number ``002``, who does not appear in the employee records table. Also, notice employees ``004`` and ``005`` appear in the employee records table, but not in the payroll ledger. As a result, the join does not include entries for them. We can use a LEFT JOIN to make sure that row that appears in table on the left of the JOIN operator appears in the result:: sqlite> SELECT * FROM ...> payroll_ledger LEFT JOIN employee_records ...> ON payroll_ledger.employee_number = employee_records.employee_number; C1,001,1850,001,1850 C2,002,2200,, C3,003,1000,003,2000 C4,003,1000,003,2000 Notice that check ``C2`` appears in the result with nulls for the employee record values. Sqlite3 does not support RIGHT JOIN, which includes all rows from the right operand of the JOIN at least once or OUTER JOIN, which includes all rows from both operands at least once. Here are some sample queries that use JOIN on the candidate and contribution data. Combine data from candidate and contribution tables using the candidate ID as the join key:: SELECT * FROM candidate JOIN contribution ON candidate.cand_id = contribution.cand_id LIMIT 10; Same as the previous example, except use projection to get rid of duplicate information:: SELECT cand.lastname, cand.firstname, cand.state AS cand_state, cont.state AS cont_state, cont.amount FROM candidate AS cand JOIN contribution AS cont ON cand.cand_id = cont.cand_id LIMIT 10; Add in grouping: find the top 3 candidates in terms of total contributions. Exclude negative amounts. :: SELECT cand.cand_id, cand.lastname, cand.firstname, SUM(cont.amount) as total FROM candidate AS cand JOIN contribution AS cont ON cand.cand_id = cont.cand_id WHERE (cont.amount > 0) and (cand.state = cont.state) GROUP BY cand.cand_id ORDER BY total DESC LIMIT 3; Nested queries ~~~~~~~~~~~~~~ Queries can be nested in a variety of ways. Here's a simple example: how many candidates received at least one positive contribution? :: SELECT COUNT(*) FROM candidate WHERE candidate.cand_id IN (SELECT DISTINCT cand_id FROM contribution WHERE amount > 0); Here's a sightly more complex query: list candidates who had at least one contribution of more than three times the average contribution:: SELECT DISTINCT lastname, firstname FROM candidate JOIN contribution ON candidate.cand_id = contribution.cand_id WHERE amount > 3 * (SELECT AVG(amount) FROM contribution); DISTINCT eliminates duplicates from the result. The CASE clause allows us to include different values for an attribute based on a boolean condition. The general form is:: CASE WHEN [condition] THEN [expression1] ELSE [expression2] END The attribute will have the value of expression1 if the condition is true and the value of expression2 otherwise. Here is an example query that yields a table that includes columns for lastname, homestate, candidate id, an in-state contribution amount, and out of state contribution amount (note that only one of these two amounts will be non-zero):: SELECT lastname, candidate.state AS homestate, candidate.cand_id AS cand_id, CASE WHEN candidate.state= contribution.state THEN amount else 0 END AS isc, CASE WHEN candidate.state != contribution.state THEN amount else 0 end AS oosc FROM candidate INNER JOIN contribution ON candidate.cand_id = contribution.cand_id; This query can be nested within another to produce a table of the total amount of in-state contributions to a candidate and the total amount of out-of-state contributions:: SELECT lastname, homestate, SUM(isc), SUM(oosc) FROM (SELECT lastname, candidate.state AS homestate, candidate.cand_id AS cand_id, CASE WHEN candidate.state= contribution.state THEN amount else 0 END AS isc, CASE WHEN candidate.state != contribution.state THEN amount else 0 end AS oosc FROM candidate INNER JOIN contribution ON candidate.cand_id = contribution.cand_id) GROUP BY cand_id; Inserting/updating/deleting rows ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Here's the general form for inserting rows:: INSERT INTO