CMSC 12300 and CAPP 30123: Computer Science with Applications III

The University of Chicago, Spring 2016

Cheat Sheet

Read-only Public Snapshots on AWS

The public data sets are read-only. If you need to do anything other than simply reading the files as-is (including converting or decompressing them), you first need to copy the particular file(s) to the local hard drive:

(in directory with files of interest)
$ mkdir ~/name-of-directory-to-hold-data
$ cp file-name ~/name-of-directory-to-hold-data
(note, you can do things like *.zip to copy all zip files in the current directory)
$ cd ~/name-of-directory-to-hold-data

Note that the local hard drive is obliterated when you terminate your instance (which you should do if you won't be using it for a while). The instructions above are suitable for casual exploration, but for long-term use, you should create an external volume and copy things there instead. This gives you more space and more persistence. See instructions further down the page. (If you follow those instructions, replace "~" above with "/mnt/whatever".)

Decompressing files

If files in a data set are compressed, then you will need to follow the above instructions to copy them. Next, once you cd to your new directory on the local hard drive, you can decompress them.

Compressed files end with ".zip", ".gz", ".tgz", ".bz", ".bz2", ".tar.bz", or ".tar.bz2". The instructions depend upon the format:

.zip:
$ unzip filename.zip

.gz:
$ gunzip filename.gz

.bz or .bz2 *without* tar:
$ bunzip2 filename.bz (or .bz2)

.tar.gz or .tgz:
$ tar -xzf filename.tar.gz (or .tgz)

.tar.bz or .tar.bz2:
$ tar -xjf filename.tar.bz (or .bz2)

After decompressing, you may delete the compressed copies on the local hard drive if you wish.

Running out of space

If you receive an out-of-disk-space error while copying files to the local hard drive or decompressing them, you will need to create a new, bigger volume and use that for your storage instead. If you encounter this, see further down the page for instructions.
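
If you want to check how much room is left before that happens, df -h reports, for each mounted filesystem, how much space is used and how much is free (the local hard drive is the one mounted at /):

$ df -h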

Combining CSV files

If your data set is in CSV format, but consists of multiple files, you have a couple of options. You can either write your scripts to expect and work with separate files, or concatenate them into one file (if they have the same columns). The following instructions explain how to concatenate.

Generally speaking, if you want all the CSV files to be in one big CSV file, you can concatenate the files. But, you have to be a little careful, because the first line is the header that describes all the fields, and if you just concatenate the files, you will get header lines interspersed throughout the data set.

Here is a way to strip the header off a file:

$ wc -l filename.csv
1234 filename.csv
$ tail -n 1233 filename.csv > filename-withoutheader.csv
What you did here is count the number of lines, then take the last n-1 lines from the file and put them in the other file. Once you do this for all the CSV files, you can concatenate them together into one file:
$ cat *withoutheader.csv > onebigfile.csv
But, one issue is that manually issuing the "wc -l" command and the "tail" command for many files is tedious and error-prone. You can accomplish all of this automatically using a "for" loop in the shell:
$ for f in `ls *.csv`; do len=`wc -l $f | awk '{print $1}'`; lenm1=$((len - 1)); tail -n $lenm1 $f > $f.wh; done
Here, we have a for loop, with the loop variable named f. We list (using ls) all the CSV files and loop over them one at a time, setting f to the current one. We set the variable len to be the result of taking the line count using "wc -l" on the file $f (you do use the dollar sign when accessing a variable, don't use it when setting it.) The problem is that wc prints two things, the number of lines and the filename. We "pipe" the result of "wc" into an "awk" command that only prints out the first column (the number of lines). We then set a new variable, lenm1, to the value of len minus one. We then do tail with the line count we calculated and file name, creating a new file without the header named the same as the original file but with .wh at the end of the file name. We then end the body of the for loop.
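
An equivalent shortcut, if you would rather skip the wc/awk arithmetic, is tail's "+" form: "tail -n +2" means "print from line 2 onward," i.e., everything but the header:

$ for f in *.csv; do tail -n +2 $f > $f.wh; done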

This will result in a bunch of .csv.wh files which you can concatenate:

$ cat *.wh > onebigfile.csv
You can remove the individual .wh and even the .csv files if you only care to keep around the final combined file. (But think about what you are doing; removing all the CSV files will remove the "onebigfile.csv" you just made. One solution is to temporarily name that file something not ending in .csv.)

If you want to selectively use only some of the CSV files rather than all of them (for instance, if files that don't belong together are in the same directory), you can use shell wildcard patterns to specify the file names. This is another topic, but for instance, you can do things like:

  highway*.csv
instead of:
  *.csv
to only select the files beginning with "highway".

Copying files between EC2 and your VirtualBox

You can use the scp command to copy files back and forth between EC2 and your local computer.

virtual-box $ scp -i cslab.pem ec2-user@public-dns:/mnt/dataset/abc.csv ./
copies from AWS to your VirtualBox. Here you need to get the cslab.pem file onto your virtual machine image first (only once). Instead of public-dns you put whatever the Public DNS name is from the EC2 management console. After the colon you give the path to the file of interest. If it is in your home directory, then you would instead, after the colon, have something like "~/file". If you want to copy a whole directory, write "-r" after the "scp", with spaces before and after.

If you want to copy from your virtual machine to EC2:

virtual-box $ scp -i cslab.pem filename ec2-user@public-dns:~/
Again, you may use "-r" for a directory.

To get the cslab.pem file to your VirtualBox:

virtual-box $ scp cnet@linux.cs.uchicago.edu:~/cslab.pem ./
and adjust for any differences in location or filename.

To be clear, all of the commands would be entered into the terminal on your virtual machine, and not into a terminal that is connected to Amazon at the moment.

To move files between VirtualBox and your Mac or Windows machine, there are various options. Perhaps the easiest would be to sign into a web site like Dropbox or a webmail account and use that.

Creating an EBS volume for storage space

The AWS Public Data Sets are on read-only volumes, so if you want to make modified, filtered, or processed copies, you need to store them elsewhere. Or, you may be working with a dataset you have obtained elsewhere.

You can try storing them in your home directory (~/) on an EC2 instance, but there is only 6.6 GB available on a freshly started t2.micro instance. If you need more, you're going to need to store everything elsewhere. In addition, the home directory is obliterated when you terminate an EC2 instance (which you should be doing if you won't be using it for a prolonged period), so you would like some independent storage space for that reason as well.

Go to the EC2 management console. In the column on the left, choose Volumes under Elastic Block Store. Click the Create Volume button. For the type, General Purpose (SSD) is fine. For the size, I would recommend you choose something with room to grow. You don't want to extravagantly waste space, but you should anticipate needing, for instance, a couple of copies of your data set, so I would recommend very substantial wiggle room. Yes, you can always make a bigger volume and copy your files later, but it's better to plan ahead now. Note also that there is bound to be some overhead; for instance, I made a 100 GB volume and there was only 94 GB available for actual use. Accept the other defaults and Create. (But, if you have an EC2 instance running already, be sure the volume will be in the same Availability Zone.)

If you don't already have an EC2 instance up, get one going in the same Availability Zone.

Once the volume's state is available and your EC2 instance is warmed up, right-click it and Attach Volume. Choose your instance and Attach.
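
Depending on the AMI, the device you attached as /dev/sdf may actually show up under the name /dev/xvdf. You can check which block devices the instance sees with:

$ lsblk

If it is listed as xvdf, just use /dev/xvdf in place of /dev/sdf in the commands below.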

In your terminal window that is ssh'd in to your instance:

$ sudo mkfs.ext4 /dev/sdf
$ sudo mkdir /mnt/whatever
$ sudo mount /dev/sdf /mnt/whatever
$ cd /mnt
$ sudo chmod a+w whatever
Now you should be able to read and write in this volume by cd'ing into whatever and doing whatever you want there.

(The mkfs.ext4 formats the drive with a blank filesystem of type ext4, one of the standard Linux filesystems and the one also being used for the boot drive. The chmod command makes the volume writeable by the ec2-user account; it is not by default.)

When you won't be using your node for a prolonged period, you can and indeed should still terminate it from the EC2 Instance console. To be clear, this will unmount, detach, and shut down the EC2 instance, while retaining the volume and its contents. It is very important to note that anything in the home directory will be lost, but everything in the /mnt/whatever directory will still exist. When you later create a new instance (a minimal remount sequence is sketched after the list below), you:

Do not create a new volume
Do Attach the existing volume to the new instance
Do NOT issue the "mkfs" command, which will erase your data. Again, do NOT do this.
Do issue the mkdir command.
Do issue the mount command.
Do not issue the chmod command.
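
On the new instance, once the existing volume is attached, the remount boils down to just two of the commands from above (no mkfs, no chmod; the filesystem and its permissions are already on the volume):

$ sudo mkdir /mnt/whatever
$ sudo mount /dev/sdf /mnt/whatever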

Note that another place to store your data is in S3. In fact, Elastic MapReduce can, and likes to, read straight out of S3. This is explained in more detail in Lab 3. But, note that you can store data in an EBS volume like the one we just made and use it for MapReduce, albeit with an extra copying step. I'll discuss storing and using things on S3 more next.

Using S3 to store data, and considerations vs. EBS

There are two ways to store data on AWS: EBS and S3.

Elastic Block Store (EBS) is what your EC2 instances use for their virtual hard drives, and what is used to store the public snapshots. Instructions earlier on this page guide you through formatting and mounting a larger EBS virtual hard drive for storing your data persistently, even when you terminate your EC2 instance.

EBS volumes really are like a hard drive in the sense that a hard drive is only hooked up to a single machine at a time. This is problematic if you want a dataset to be available on several nodes in a cluster at once, and have them be able to write new files to a shared volume.

If you are using MapReduce, then you can start a mrjob from one of your EC2 instances, specifying a file stored on an EBS volume. mrjob will then copy that file to an S3 bucket that is available to all the nodes in the cluster. So, mrjob takes care of this for you and you don't have to worry about it.

But, if you aren't using mrjob, you have to take care of this yourself. If you want the jobs to write large output files to be stored alongside the input files, that isn't a MapReduce kind of thing to do, so you'd also need to make your own arrangements in this case. In addition, if you plan to run a number of MapReduce jobs over a data set, it is more efficient to copy your files into S3 once and specify a data set already in S3 to mrjob, avoiding the on-demand copy into S3 each time.
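
As a sketch (assuming the mrjob/EMR setup from Lab 3; the script and bucket names here are placeholders), the first command below has mrjob upload a local input file for you, while the second reads input you have already placed in S3:

$ python my_mrjob_script.py -r emr /mnt/whatever/onebigfile.csv > results.txt
$ python my_mrjob_script.py -r emr s3://name-of-s3-bucket/onebigfile.csv > results.txt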

First, let's discuss S3 a bit more. S3, unlike EBS, is intended to be accessible by multiple nodes at once. So it is more appropriate for sharing a data set across a cluster. The bad news is that it isn't exactly intended to be interacted with using a Unix command line, with tools like "cd," "ls," "cp," and so on. It turns out that Amazon is working on a new service called Elastic File System, which provides a networked, cluster-friendly storage service that, unlike S3, can be manipulated using standard command-line tools. The only problem is that it was just announced in April and isn't ready for prime time yet (this text is from 2015 and I haven't investigated where the situation stands a year later).

So, S3 is your best option for shared access to the same storage, for now.

To create an S3 bucket, you can use the S3 management console. To upload files into S3 from your own computer, you can use that console as well. However, if you need to download a huge data set, it may be better to download straight from within AWS.
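
If you would rather work from the command line, the aws tool that comes with Amazon Linux can also create buckets and copy files into them, once you have run "aws configure" to give it your access keys. A sketch, reusing the placeholder bucket name from below:

$ aws s3 mb s3://name-of-s3-bucket
$ aws s3 cp onebigfile.csv s3://name-of-s3-bucket/
$ aws s3 sync /mnt/whatever/dataset s3://name-of-s3-bucket/dataset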

If you need to access files on S3 as input to a MapReduce job, you can specify an S3 URL as shown in Lab 3 and need do nothing more. In fact, if this is the case, then you should do it this way.

If you want to download a large file over the internet into S3, or need to have read and write access to S3 from the command line or from software other than a MapReduce job, you will want to install some special software, called s3fs-fuse, that allows you to treat an S3 bucket like a mounted volume even though it is not. The instructions below are how to do this. Run the following on your EC2 instance (if trying to get the S3 bucket mounted so you can download something into it), or on each of your EC2 instances in a cluster (if you are trying to run non-S3-aware software on each node in a cluster).

# install compilers and the libraries s3fs-fuse needs
sudo yum -y groupinstall "Development Tools"
sudo yum -y install fuse fuse-devel autoconf automake curl-devel libxml2-devel openssl-devel mailcap
# download, build, and install s3fs-fuse from source
wget https://github.com/s3fs-fuse/s3fs-fuse/archive/v1.78.tar.gz
tar xzf v1.78.tar.gz
cd s3fs-fuse-1.78
./autogen.sh
./configure
make
sudo make install
# store your AWS credentials where s3fs expects them, readable only by you
cd
echo "access-key-id:secret" > .passwd-s3fs
chmod 600 .passwd-s3fs
# create a mount point and mount the bucket there
mkdir name-of-mount-directory
s3fs name-of-s3-bucket name-of-mount-directory

Replace "name-of-mount-directory", "name-of-s3-bucket", and the access key id and secret with your own. Once you do this, everything within the mount directory will be the contents of the S3 bucket.

Note that if you are writing data files into S3 from multiple nodes at once, it is essential that they be separate files for each node. You don't want to be trying to write to the same file at the same time from multiple nodes.
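
One simple convention, if a shell script drives the work on each node, is to build that node's output filename from something unique to the node, such as its hostname (the script name here is a placeholder):

$ python process_chunk.py > ~/name-of-mount-directory/results-$(hostname).csv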

The distinctions and considerations involved in choosing the right storage for your needs are subtle. Please do not hesitate to ask me for advice or clarifications.

Installing extra software

The AWS Linux machines have a truly minimal software installation. You can install whatever you want or need, however. You may have already gotten into the habit of doing this on your VM, using sudo pip install (possibly with a Python version number after "pip") or with sudo apt-get install for non-Python packages.

You can use sudo pip just the same on AWS. But, the apt-get tool is not available on the particular flavor of Linux provided by Amazon. Instead, you will need to use a tool called yum. It functions similarly to apt-get.

You may find that the names and availability of packages for yum differ from apt-get, however.
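
For example, to look for a package and then install it (git is just an illustration; substitute whatever you actually need):

$ yum search git
$ sudo yum -y install git

and, for Python packages:

$ sudo pip install numpy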

If a suitable package is not available for the tool you need, you may need to find a source code download for the program in question. These can be found by performing a web search. Often, the source is available on a site like sourceforge or github. You will, generally, get a .tar.gz, .tar.bz2, or (rarely) .zip archive when you download the source. See the top of this page for instructions on how to unpack these.

Before compiling the downloaded software and installing it, you will need to install basic development tools on the machine, since Amazon does not provide them by default. To do this, run the command sudo yum groupinstall "Development Tools" (be sure to include the quotes). You need only do this once per machine, even if installing multiple tools from source.

Having done so, cd into the newly extracted directory. Virtually all packages come with either an INSTALL or a README file, but here is the Reader's Digest version that is applicable for most packages:
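
For a typical autotools-style package, the sequence is usually the following (some packages need a ./autogen.sh step first, as in the s3fs-fuse example above, and some use a different build system entirely, so defer to the package's own instructions if they differ):

$ ./configure
$ make
$ sudo make install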

If these instructions do not work, check the package's README or INSTALL file, or inspect the error messages if the build made some progress before halting.

Whether software is installed with yum, pip, or make install, you will need to redo this process each time you start a new machine. If your installation process is lengthy, consider writing a shell script to automate the downloading and installation of all needed software. You can create this script on your own machine, then scp it in each time you launch a new machine.
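
A minimal sketch of such a script (the package names are placeholders; list whatever your project actually depends on):

#!/bin/bash
# setup.sh -- run on each freshly launched instance with: bash setup.sh
set -e                                      # stop at the first error
sudo yum -y groupinstall "Development Tools"
sudo yum -y install git                     # example yum package
sudo pip install mrjob                      # example Python package

Then, each time you launch a new machine:

virtual-box $ scp -i cslab.pem setup.sh ec2-user@public-dns:~/
(then, on the EC2 instance)
$ bash setup.sh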