CS 122 Lab 5: Shell

The goal of this lab is to learn how to accomplish common file-processing tasks with the shell and shell scripting.

Introduction

The shell is the program that you have interacted with whenever you open a terminal window on a Unix-style system. The terminal program itself is the program that draws the window and the text (emphasis on “drawing”). The shell is the program that generates that text, such as the prompt at which you enter a command, and that parses what you type.

Most commands we type on the shell, like git or python3, result in launching a separate program; once that program is launched, we interact with it directly. But, it is the shell that understands what we intend when we type the name of such a program, and actually contains the code to launch the auxiliary program. While a program like python3 is running, the shell remains present in the background; once the program quits, the shell regains control and prints another prompt to await your next order. (It is also possible to interact with the shell while a program is running, although this topic is outside the scope of this lab.)

Many shell programs have existed over the years on Unix-based systems. Most of these alternatives behave essentially identically when used for basic tasks like launching programs, redirecting output, and piping between commands.

Shells also offer the opportunity to write shell scripts in a shell-specific programming language. This is the main point of differentiation between the various choices of shell.

The bash shell has long been the de facto standard on Linux systems. Due to a preference for software with more commercially-friendly licensing terms, Apple originally adopted a different shell (tcsh) for Mac OS X; it later switched to bash, and more recently has made zsh the default, though bash remains available.

Generally speaking, it suffices to become familiar with the syntax of a single shell, because you can use that shell exclusively for all your scripting needs; you would only need to understand another shell’s syntax if you had to read code someone else wrote for it. Learning bash syntax acquaints you with the most common shell today, and maximizes the chances that any script you encounter written by someone else is in a syntax you already know.

Shells are strongest at text-processing tasks, and for manipulating (e.g. copying, moving, renaming, or deleting) files. To support text manipulation and abstraction, they support variables, which are typically intended to be used for strings. To enable repeated work, such as performing a task on every file in a directory, they provide loops. And, to allow behavior to be selective, they support conditionals.

Getting started

During this lab, we will create a number of dummy files, so that we can practice doing various things on them. Please make a directory within your home directory, but not within your repository, to house these files; we want to minimize the chances that anything you will do during this lab will affect anything else. For instance:

cd
mkdir shell-practice
cd shell-practice

Remain in this sub-directory for the entire lab.

Creating dummy files

Let’s start by making some dummy files to work with. For this lab, it is okay for the files to only have boilerplate data in them; we can just imagine that they might have something in them of value. We’ll be focusing on the names of the files.

Let’s build up to this task.

Recall that there is a shell command named echo that prints a string to the screen. Try the following (do not remove or add spaces anywhere):

echo Hello, world!
FAV_COLOR=blue
echo "My favorite color is $FAV_COLOR"

Recall also that we can redirect the output of a command to a file:

echo File this away > example.txt

(Open this file up with a text editor and confirm it contains the string we just wrote to it.)

Now, let’s see how to write a for loop in bash syntax:

for i in a b c 1 2 3; do
  echo $i
done

Note that this loop can be entered on multiple lines; when you press return after the first or second line, bash will wait for you to complete the loop before executing it. Once you do, you should see lines with a, b, c, 1, 2, and 3 before a new prompt.

Picking this apart, this loop looks similar to one in Python: we start with for, then give the name of a loop variable – in this case, i. This variable will take on different values during each iteration.

After in, we then give the list of values to loop over. Here, the syntax is slightly different from Python’s: we just jump in with the list, without an opening delimiter like [. Successive values are separated only by spaces. Although there is no opening delimiter, we do need to be clear about where the list ends. We indicate this with a semicolon (;). This lets the shell distinguish do as a keyword from a list of values to loop over that literally contains the word do.

The keyword do plays a role similar to the colon (:) in a Python loop. Starting on the next line, we give one or more commands to execute for each iteration. At any point, we can refer to the loop variable by giving its name, preceded by a dollar sign ($).

We mark the end of the loop with the keyword done. This is because, unlike Python, the shell does not assign semantic meaning to indentation. While it is appropriate to indent the body of the loop for readability, the shell does not interpret this as delineating where the loop body begins and ends the way Python does. Rather, it is only the done that actually provides the mechanism for ending the loop body (and do indicates where it begins).

Suppose you want one of the lines that will be printed out by this loop to be “CS 122”. Try inserting this phrase amidst the other values in the list and see what happens.

The lesson here is that the shell, by default, breaks everything up into words. If we want a multi-word phrase (anything containing spaces) to be treated as a single unit, we must put quotes around it:

for i in a b c "CS 122" 1 2 3; do
  echo $i
done

Explicitly specifying a list to iterate over is useful, but often we want the list to be generated dynamically, typically from the output of some command. For instance, try executing the following:

seq 1 8

We can use the backtick key (the one near the tab and escape keys on your keyboard) to denote that we want to run a command and then use its output for something else in a script. For instance, try this:

for i in `seq 1 8`; do
  echo "The next number is: $i"
done

Had we simply said seq 1 8 in our loop header, we would have created a list of length three with the contents seq, 1, and 8. Had we used quotes, we would have created a list with a single entry: the string seq 1 8. It is the use of backticks, specifically, that asks the shell to run the command and use its output as the list.

Let’s put these ideas together to create several files:

for i in `seq 1 8`; do
  echo "This was originally file$i" > file$i
done

You should now have eight sequentially-numbered files in your directory; use ls to check. (If you’ve been trying every example, you’ll also have example.txt, from earlier.)

Renaming files

Imagine you just downloaded these eight files from a web site and they contain data useful for your project. But, they aren’t named the way you wanted; you really want them to end with .txt.

You could manually rename each of the files; it only takes a minute, although it is tedious. But... what if there are eight files per state? That’s 400 files. In other words, think of these eight files as stand-ins for what could have been a lot more. Whatever your personal threshold for pain, we can think of scenarios that would cross it.

Let’s use a loop to rename them, instead. Remember that the mv command, used in the form mv a b, moves file a to file b. If these two filenames are in the same directory, this would more accurately be thought of as renaming than moving.

Let’s go ahead and rename all of them. We want only to rename the files that have names beginning with file, not the separate example.txt. To loop over all the files matching some pattern, we can use a third style of loop header:

for i in file*; do
  mv $i $i.txt
done

Here, we ask for all the files whose names begin with file in the current directory. The asterisk (*) character acts as a wildcard, and this single pattern is expanded to a list of all files whose names match the pattern.

This wildcard and pattern call for a little explanation. The shell’s matching patterns resemble regular expressions in some ways; for instance, we can use brackets for character ranges. But, there are some subtle differences: for instance, we write only the asterisk to denote what would be, in a proper regular expression, the period (to represent any possible character) followed by the star (to represent possible repetitions of it). A full description of the differences between the shell’s matching patterns and regular expressions is beyond the scope of this lab. And, in practice, we tend not to need particularly esoteric patterns in the shell anyway.
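For instance, brackets can specify a range of characters allowed at one position. Assuming you have already renamed the files to end in .txt as above, the following should list just the first three:

ls file[1-3].txt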

Imagine we also want to make a backup copy of these files, with file names like file1.txtbackup. Try this (don’t be alarmed if it doesn’t work):

for i in file*txt; do
  cp $i $ibackup
done

You should have received eight complaints from the cp command explaining how to use it properly, and if you do an ls, you will see that nothing has actually happened.

What went wrong? You may have been suspicious when you typed $ibackup. How would the shell know if you want to take $i and then append backup to it, as opposed to retrieving the value of some variable named ibackup?

Answer: it wouldn’t know what you intended; it indeed tried to copy the files to a new file named after the contents of the variable ibackup. Add to this the fact that the shell does not complain about undefined variables; it simply treats them as empty strings. In the end, you tried to run the cp command with only one filename; for instance:

cp file1.txt

which, if you were to try it, would print the same error you just saw repeatedly.

The point of this is that we need some way to delineate the end of a variable name and the beginning of a suffix. The trick, as it turns out, is to wrap the variable name in braces:

for i in file*txt; do
  cp $i ${i}backup
done

But wait: when we appended .txt to all the files, we didn’t have to do this. Why not? Because variable names cannot contain periods, so $i.txt was not ambiguous. But, you needn’t memorize the rules for variable names – you are welcome to always use braces, even if they are not necessary in a given scenario.
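You can see the difference for yourself at the prompt, using a throwaway variable (this is just an illustration and does not change any files):

i=file1.txt
echo $i.txt       # unambiguous, since variable names cannot contain periods
echo ${i}backup   # the braces make the end of the variable name explicit
echo $ibackup     # prints a blank line: this looks up a variable named ibackup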

After doing all this, I just realized that I meant for all these files to be named fileN.csv, not fileN.txt. Oops.

As you might imagine, trimming off suffixes (and, sometimes, prefixes) from filenames is a relatively common shell task. Thus, bash provides syntax to accomplish this. Given a variable which might end with a certain suffix, the syntax ${VARIABLE%suffix} gives us the variable’s contents, minus the suffix. (Any variable not ending with the specified suffix passes through unmodified.)

Let’s use this to fix our mistake:

for i in file*txt; do
  mv $i ${i%txt}csv
done

In one fell swoop, we have trimmed off the txt and then appended csv. Check with ls to confirm we fixed our mistake. (Note that the backup files are unmodified, because they do not end in txt, as the pattern in the loop header requires.)

A similar method is used to strip prefixes: ${variable#prefix}. Using this syntax, write a loop to change the names of our files from fileN.csv to dataN.csv.
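As a quick illustration of the prefix-stripping syntax at the prompt, using a throwaway variable:

NAME=file3.csv
echo ${NAME#file}

This should print 3.csv; the prefix file has been trimmed off.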

Creating scripts

The loops we’ve written have been useful, but they’ve also been a lot to type. If we expect to have to do things like this frequently, it would be nice to create a tool once that we can use over and over again.

A shell script is a file that contains a series of commands to be executed by a shell, just like a Python program is a series of commands to be executed by Python (especially if it does not contain any function definitions). To make a shell script, we simply use a text editor to write the commands and save them in a file; sometimes, but not always, we choose to use a file name ending in .sh. (This is a matter of preference rather than conferring some specific technical benefit.) And, we make the first line of a bash shell script:

#!/bin/bash

Let’s try this out. Create a file named csv2txt. The first line should be the special directive given above. The remainder of the file should contain the code for a loop that changes any file in the current directory whose name ends with .csv to an equivalently-named file ending only with .txt.
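If you get stuck, one possible version is essentially the loop from the previous section with the suffixes swapped; here is a sketch (yours may differ):

#!/bin/bash
# Rename every file ending in csv so it ends in txt instead.
for i in *csv; do
  mv $i ${i%csv}txt
done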

Once you have saved the file, go to the shell and:

chmod u+x csv2txt

This marks the file as an executable program, rather than just data. Picking this command apart, chmod (change mode) changes the permissions or modes of the file. The letter u is short for “user” (that would be you). The plus symbol means that we want to add a permission, not remove it; x stands for execute.

Now, run the script:

./csv2txt

If all worked, ls should demonstrate your success.

(To explain how we ran the script: as when we run any program, we gave the name of the program. But, because it was not installed in a standard location where programs are stored, we had to say where to find it; the shell does not look in the current directory by default. The current directory can be referenced with a single period, just as the parent directory is referenced by two (as when we say cd ..). The slash is a delimiter between the directory name and the file name.)

This script was useful for a narrowly-defined purpose, but it would be nice to make it a bit more flexible. How about being able to specify the filename extensions (suffixes) from which and to which we want to change?

Before we write that script, let’s understand how command-line arguments to shell scripts work.

When a shell script is launched with command-line arguments, they are simply stored in numbered variables. The first argument is in $1, the second in $2, and so on.

Try making a shell script like this one:

#!/bin/bash
echo "The first arg was $1 and the second was $2"

Save it, chmod it, and run it with some test arguments. For instance, if you named it args, you might try:

./args alpha bravo

But, also try running it with zero, one, and three arguments.

What happened, and why? Remember, undefined variables don’t result in a NameError like they do in Python; they just give us a blank value.

Because we take pride in our code, we want it to be robust. We also understand that robust code saves us from trying to debug inscrutable errors down the road. If we intend our script always to be used with two arguments, then let’s force the issue.

There are two ingredients you need to accomplish this: a way to count the number of arguments, and conditional execution.

As it turns out, the first is easy: alongside $1 and its ilk, shell scripts also receive the variable $#, which holds the number of arguments that were provided by the user.
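As a quick experiment, you might add a line like this one to the args script from before:

echo "I received $# arguments"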

As far as conditionals go, here is the general syntax for one:

if [ test ]; then
  commands
fi

Taking this piece-by-piece: if matches our expectations from Python. The specific test condition is enclosed in brackets; note the precise spacing here, which must be adhered to scrupulously. We’ll discuss what can be filled in for test in a moment. The semicolon is less well-motivated than in a for loop, where it marked the end of a list, but we keep it for uniformity. The keyword then, like the do in a for loop, acts like the colon in a Python if statement. The commands that are conditionally executed are indented by convention, but, as with the for loop, the shell does not assign semantic meaning to indentation; rather, the body of the conditional begins with then and ends with fi.

(Why fi? It’s if backwards. Someone apparently had a little chuckle about this when they designed the shell syntax. We’ll leave it to you to decide if you find this cute or just weird.)

Here are some of the tests we can use:

  • =, !=, \<, \>: compare two strings for equality (note we use only a single equals sign), non-equality, or which string comes first in alphabetical order. (Note that < and > are normally used for redirection, so we need to escape them with a backslash when using them for comparisons.)
  • -eq, -ne, -lt, -le, -gt, and -ge: compare two variables mathematically, interpreting them as numbers rather than strings. (This is just like the difference between sort and sort -n.) In order: equals, not equals, less than, less than or equal to, greater than, greater than or equal to.

Examples:

if [ $NAME = Alice ]; ...
if [ $AGE -lt 18 ]; ...
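To see a complete conditional in context, here is a small self-contained script; the variable and the threshold are made up purely for illustration:

#!/bin/bash
# A contrived example: print a message only if AGE is below 18.
AGE=16
if [ $AGE -lt 18 ]; then
  echo "You are a minor"
fi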

With this new syntax in mind, modify your script that prints out the arguments so that it only does so if exactly two arguments were specified, and nothing otherwise.

Once you have that working, it seems a little unhelpful to fail silently if the wrong number of arguments were given. What we really want is an else clause. Here is the syntax for it:

if [ test ]; then
  commands
else
  commands
fi

Modify your script to print out a helpful error message (remember to use echo for this) in the case of anything other than two arguments.

For completeness, here is how you would have an “else-if” clause:

if [ test ]; then
  commands
elif [ test ]; then
  commands
else
  commands
fi

(Note that the else clause at the end is not obligatory.)

It’s time to make your csv2txt script more general:

cp csv2txt chsuffix

Open up chsuffix in your text editor. We want this script to be able to be run like this:

chsuffix txt csv

and change all the files in the current directory that end with .txt to end with .csv – or to be able to do this with any other extensions we might desire.

Replacing the hard-coded suffixes in our original csv2txt file with the variables that hold the command-line arguments does not require any special syntax; we can simply plug the variables in where the hard-coded values were. The result looks a bit tortuous, but it works. Don’t forget to make the script robust by checking that the right number of arguments were provided; if not, print out a message explaining how to use the script.
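If you would like something to compare your work against, here is one possible shape for chsuffix; this is only a sketch, not the only correct answer:

#!/bin/bash
# Rename every file ending in the first suffix so it ends in the second.
if [ $# -ne 2 ]; then
  echo "usage: chsuffix OLDSUFFIX NEWSUFFIX"
else
  for i in *$1; do
    mv $i ${i%$1}$2
  done
fi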

Checking downloaded files

Remember that you (imaginarily) downloaded all these files from some web site to use for your project. You are concerned that some might be missing: perhaps you forgot to click on one of the links, or the server had an error, or the data set has gaps. It would be good to do an automated scan over the files (again, imagine there are hundreds or thousands of them, not eight). Also, suppose you know how long each file should be; any file not matching that length is suspect and should be flagged.

Go ahead and corrupt the files slightly:

  • Delete a couple of them.
  • Open up one of them in a text editor and add an extra line.
  • Open up another and delete everything, so that it is empty.

Remember that the files were created with sequentially-numbered file names. So, to check for the presence of all the expected files, you could write a for loop, using seq, to generate all the expected numbers. In the body of the loop, generate the whole filename that is expected, and check if it exists. If it does, then silently continue; if not, print an error indicating the name of the missing file.

To accomplish this, you’ll need a new type of test within your conditional:

  • -e name checks whether something exists with the given name, whether it is a file or a directory (in other words, it passes for either)
  • -f filename checks whether the given file exists and is a regular file, not a directory
  • -d dirname checks whether the given directory exists and is actually a directory, not a file

For instance:

if [ -e foo.txt ]; ...

To negate a condition:

if [ ! -e bar.txt ]; ...

This seems like another tool that might be useful over and over again, so write it in a script rather than just entering it in the terminal; this also avoids retyping everything while you debug.

Once you get it working, this script should detect the missing files, but not the corrupted ones.
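If you get stuck, here is one possible sketch of such a checker; the filename pattern data$i.txt is only an assumption about what your files are currently called, so adjust it to match yours:

#!/bin/bash
# Report any expected file that is not present.
for i in `seq 1 8`; do
  if [ ! -e data$i.txt ]; then
    echo "Missing file: data$i.txt"
  fi
done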

Remember that wc -l (word count, lines) is a tool that counts the number of lines in a specified file. When run, it prints an output like:

34 foo.txt

Unfortunately, we only want the first column of this output, not the filename. Fortunately, there is a command we can use to select only a specific column: awk. This tool can do many things; we’ll see some of them in class later. For now, suffice it to say that the following will give you back only the first column:

awk '{print $1}'
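For instance, piping the output of wc through this command extracts just the count; using the example.txt file from the start of the lab (which contains a single line), the following should print 1:

wc -l example.txt | awk '{print $1}'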

Let’s put this all together. In your script, you did nothing in the case of an existing file, and printed an error in the case of a missing one. But now, in the case of an existing file, you should check whether that file has the correct number of lines (in this case, one).

To do this:

  • Use wc -l to count the number of lines.
  • Pipe its output into the awk command, above, to get only the actual count.
  • Surround all of this in backticks, as we do when we want to use the output of a command (or series of commands) as a variable or value.
  • Write an if condition that mathematically compares this to the numerical value 1 and prints a warning if they are not equal.
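Here is a sketch of how these steps might fit into the checker script from before; as above, the filename pattern and the expected length of one line are assumptions you may need to adjust:

#!/bin/bash
# Report files that are missing or do not have exactly one line.
for i in `seq 1 8`; do
  if [ ! -e data$i.txt ]; then
    echo "Missing file: data$i.txt"
  else
    LINES=`wc -l data$i.txt | awk '{print $1}'`
    if [ $LINES -ne 1 ]; then
      echo "Warning: data$i.txt has $LINES lines instead of 1"
    fi
  fi
done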

Conclusion

Shell scripting allows us to perform tasks and make reusable tools to accomplish file manipulation and text processing. Automating these tasks avoids tedium and reduces the chances to make mistakes.

While everything that can be accomplished with shell scripting can also be done in Python, the shell often lets us express certain kinds of tasks more succinctly than we could in a heavier-weight language.

We hope you can imagine scenarios where the specific scripts we wrote today are not just instructional examples, but very much useful for a project you are working on. Many similar tasks can also be tackled with shell scripting, and investing the time to learn the basics is very likely to be rewarded.