CSPP 53001: Assignment 5

Project is due by 11:59pm on Friday, November 16, 2012

Steps 4 and 5 of Your TBP App

The fourth part of your TBP project is to generate some more data and populate your MySQL database with it. The fifth part is to design and develop several join and aggregation queries on this larger database.

You also have the optional choice to redesign your database based on the feedback you have received from the instructor and the TAs and your experience developing the TBP App so far. If you do choose to redesign your database in a significant way, please, include your new ER diagram, named TBP_ERD_v2.??? with your submission. Note that adding or deleting several attributes or adding a few new relationships is not a significant redesign. If you do choose these option, please, make sure that the relations in your database correspond (i.e. have the same names) to the entity sets and relationships in your ER diagram.

Develop a larger amount of data for your database and load it into your relations using the SQL load command. The data you generate and load should be on the order of:

At least two relations with thousands of tuples
At least one additional relation with hundreds of tuples

To create more data for your relations, you can write a program in any programming language you like. The program should create large files consisting of records in an ASCII text-only format acceptable to the MySQL bulk loader. Then you should load the data into your TBP relations. If you are using real data, your program will need to transform the data into files of records conforming to your TBP schema. Otherwise you will write a program to fabricate data: your program will generate either random or nonrandom (e.g., sequential) records conforming to your schema. Note that it is acceptable for your data values to be meaningless (i.e. random sequence of characters instead of actual names).

If the semantics of your schema includes relations that are expected to be relatively small (e.g., states in the US), it is fine to use some small relations, but please ensure that you have relations of the sizes prescribed above as well. When writing a program to fabricate data, there are two important points to keep in mind:

Be sure not to generate duplicate values for your key attributes.
Your TBP certainly includes relations that are expected to join with each other. For example, if the key of the Users relation is username, then it is expected to join with the username_leader and username_follower attributes of the Follows relation. In generating data, be sure to generate values that actually do join -- otherwise all of your interesting queries will have empty results! One way to guarantee joinability is to generate the values in one relation (corresponding to an entity set in your E/R diagram), then use the generated values to select joining values for the other relation (corresponding to a relationship in your E/R diagram). For example, you could generate usernames first (either sequentially or randomly), then use these usernames to fill in the username_leader and username_follower values in the Follows relation.

You will need to submit your new create_db_large.sql and populate_db_large.sql scripts along with all data files referenced in your populate_db_large.sql script.

To demonstrate the size of your new relations, please, generate a script file called count.sql that finds the number of tuples in every one of your relations.

Extra credit: Develop a single query that returns the name of every table along with its size. For example, if you had only three tables: Users, Tweets, and Follows and their sizes were 10000, 50000, and 100000, the result of your single query should look like this:

+------------+------------+
| table_name | table_size |
+------------+------------+
| Users      |      10000 |
+------------+------------+
| Tweets     |      50000 |
+------------+------------+
| Follows    |     100000 |
+------------+------------+

Develop and test at least ten SQL queries using the JOIN syntax discussed in class. At least five of your queries should involve aggregation. To receive full credit, you should use all JOIN variations discussed in class (ON, USING, NATURAL, OUTER) and aggregation with and without GROUP BY clause. Please, generate a script file called query5.sql with all of your join and aggregation queries. Your queries must not produce empty results and also must complete in a reasonable amount of time (a few minutes). If any of your queries takes a long time to complete, consider modifying it by adding more conditions that limit the size of the intermediate results or final result. For example, if your query finds all pairs of users that follow at least one user in common:

SELECT DISTINCT F_1.username_follower AS user1, 
F_2.username_follower AS user2
FROM Follows F_1 JOIN Follows F_2
USING (username_leader)
WHERE F_1.username_follower <> F_2.username_follower;

and it takes 30 minutes or more to produces millions of results, consider adding some conditions that limit the users from F_1 and F_2. For example, you can only consider usernames starting with the letters "a" and "b":

SELECT DISTINCT F_1.username_follower AS user1, 
F_2.username_follower AS user2
FROM Follows F_1 JOIN Follows F_2
ON (F_1.username_leader = F_2.username_leader
AND F_1.username_follower LIKE "a%"
AND F_2.username_follower LIKE "b%")
WHERE F_1.username_follower <> F_2.username_follower;

Note that you do need to submit the scripts that populates your large database as well as the corresponding data files

You will submit your homework using the hwsubmit command. Please, create a new directory called hw5_<your_username>, e.g. hw5_evtimov, and move there all of the relevant scripts and all data files. For this assignment, there should be the four scripts: create_db_large.sql, populate_db_large.sql, count.sql and query5.sql, all the large data files and, if you have chosen to redesign your database, your new ER diagram TBP_ERD_v2.???. Then, you can submit your work by typing (<path_to> is the directory path to your hw5 directory).

hwsubmit cspp53001 <path_to>/hw5_<your_username>

Please make sure that all of the relevant files and no others are in that directory.

CSPP 53001: Assignment 5

Project is due by 11:59pm on Friday, November 16, 2012

Steps 4 and 5 of Your TBP App

There is no Gradiance Problem Set this week