Shell Scripting

The goal of this lab is to learn how to accomplish a few common file-processing tasks using shell scripting.

Introduction

The shell is the program that you interact with whenever you open a terminal window on a Unix-style system. The terminal program itself simply draws the window and text (emphasis on “drawing”). The shell is the program that generates the text, such as the prompt to enter a command, and then processes the commands you enter.

The shell handles most commands that you enter, like git or python3, by launching the relevant program and handing it control over the terminal. Once the program exits, control is passed back to the shell, which prints a new prompt and waits for a new command. (You can put a command in the background by appending an ampersand (&) to the end of the line, but uses of this feature are beyond the scope of this lab.)

The shell also offers the opportunity to write shell scripts in a shell-specific programming language. We’ll focus on shell-scripting in this lab.

There are many different shells available (bash, ksh, tcsh, etc). The majority of these alternatives behave identically when used to perform basic tasks like launching a program, redirecting output, or connecting commands using pipes. In this lab, we’ll focus on writing bash scripts, since bash is the de facto standard on Linux and MacOS systems.

Shell scripting is most useful for text-processing tasks, and for manipulating (e.g. copying, moving, renaming, or deleting) files. To support text manipulation and abstraction, shell scripting languages support variables, which are typically intended to be used for strings. To enable repeated work, such as performing a task on every file in a directory, they provide loops. And, to allow behavior to be selective, they support conditionals.

Before we get started, we’d like to point you to a useful tool: explainshell.com allows you to enter a Linux command-line and then see the help text that matches each piece of syntax.

Also, you may want to review the Linux Lab from CAPP 30121. The sections on redirection and piping are particularly useful.

Getting started

During this lab, we will create a number of dummy files to use for practice. To start, please make a directory within your home directory, but not within your repository, to house these files; we want to minimize the chances that anything you will do during this lab will affect your repository. For instance:

$ cd
$ mkdir shell-practice
$ cd shell-practice

You will remain in this sub-directory for the entire lab.

As is our convention, we will use $ to indicate the Linux command-line prompt. You should not include the $ when you run these commands.

Shell variables

Shell scripting languages, like bash, typically include a way to define shell variables. These variables typically have string values, but can, depending on the context, be interpreted as integers, etc.

Here’s a sample variable definition and a sample use:

$ FAV_COLOR=blue
$ echo $FAV_COLOR
blue

The first line defines a variable named FAV_COLOR and assigns it the initial value blue. Notice that there are no spaces on either side of the equal sign. The syntax is quite strict: the shell will generate an error message if you add a space before or after the equal sign. Also, notice that since the desired value (blue) does not include any embedded spaces, we do not need to use quotes around it.

To evaluate the value of a shell variable, we prepend it with a $ as in $FAV_COLOR. Recall that the echo command simply prints the value of its command-line arguments to standard out. So, the second line will print blue, the value of FAV_COLOR, to the terminal or, more generally, to standard out.

Variable names are case sensitive, so FAV_COLOR and fav_color are different variables. Also, evaluating a variable that has not been defined yields an empty string. So, evaluating the command:

$ echo $fav_color

will yield a blank line.

There are times when we want to use a variable to construct a larger string. For example, we might want to add the word “bell” after the value of favorite color. Here’s an example that illustrates this process:

$ echo ${FAV_COLOR}bell
bluebell

Adding curly braces around the variable name allows the shell interpreter to separate the variable name (FAV_COLOR) from the rest of the string (bell). The curly braces are only required in this situation, but it is always OK to include them for clarity.

Here are two examples that illustrate another important aspect of shell variable use:

$ echo "${FAV_COLOR}bells grow in spring."
bluebells grow in spring.
$ echo '${FAV_COLOR}bells grow in spring.'
${FAV_COLOR}bells grow in spring.

As the first of these examples illustrates, the bash interpreter replaces variables that occur in double-quoted strings with their corresponding values. It, however, does not replace uses of variables that occur in strings surrounded by single quotes. This behavior is illustrated by the second example.

It can be useful to capture the result of running a command in a variable. To do so, surround the command with back ticks (the quote key near the tab and escape keys on your keyboard). For example, to capture the current working directory, which we can get from the pwd command, in a variable named WDIR, we would run the command:

$ WDIR=`pwd`
$ echo ${WDIR}
/home/student/shell-practice

The variables we have defined thus far are only visible in the current shell. We can export a variable to make it visible to programs that we might run from the current shell (including other instances of the shell that we might launch):

$ export FAV_COLOR=blue
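To see what export actually changes, you can ask a brand-new copy of bash to print a variable (bash -c runs a single command in a fresh shell). The sketch below uses a hypothetical new variable, FAV_FOOD; before the export, the child shell prints only a blank line, and afterwards it sees the value. The single quotes matter: they keep our shell from substituting the variable itself, so the child shell is the one doing the substitution.

$ FAV_FOOD=pizza
$ bash -c 'echo $FAV_FOOD'

$ export FAV_FOOD
$ bash -c 'echo $FAV_FOOD'
pizza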

There are a set of environment variables defined by default. For example:

$ echo $SHELL
/bin/bash

You can use the printenv command to get a list of the variables currently defined.
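For instance, printenv with no arguments lists every environment variable, one per line, while printenv followed by a name prints just that variable’s value:

$ printenv SHELL
/bin/bash
$ printenv | grep FAV_COLOR
FAV_COLOR=blue

(The second command will only find FAV_COLOR if you exported it above; variables that were never exported do not appear in printenv’s output.)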

Creating dummy files

Now let’s make some dummy files to work with. For this lab, it is okay for the files to contain boilerplate data; we can just imagine that they might contain something more useful. We’ll be focusing on the names of the files.

We’ll use the echo command and redirection to construct a simple file:

$ echo File this away > example.txt

Recall that the greater-than symbol indicates that standard out should be redirected to the specified file. We can use the cat command to see the contents of the resulting file:

$ cat example.txt
File this away

Now, let’s see how to write a for loop in bash syntax:

$ for i in a b c 1 2 3; do
>   echo $i
> done

Note that this loop can be entered on multiple lines; when you press return after the first or second line, bash will print a special prompt (>) and wait for you to complete the loop before executing it. Give this loop a try. Once you do, you should see lines with a, b, c, 1, 2, and 3 followed by a new prompt. (Do not include the > in your code.)

Picking this code apart, this loop looks similar to a Python loop: we start with for, then give the name of a loop variable – in this case, i. This variable will take on different values during each iteration.

After in, we then give the list of values to loop over. Here, the syntax is slightly different than Python’s; we just specify a list of values (no opening delimiter, like [, is required). Successive values are separated by spaces. Although there is no opening delimiter, we do need to be clear about where the list ends. We mark the end of the list with a semicolon (;). Adding the semicolon indicates that do is a keyword rather than one of the values in the list.

The keyword do plays a role similar to the colon (:) in a Python loop. Starting on the next line, we give one or more commands to execute for each iteration. At any point, we can refer to the loop variable using its name preceded by a dollar sign ($).

We mark the end of the loop with the keyword done. We need a way to mark the end of the loop because, unlike Python, bash does not assign semantic meaning to indentation. While it is appropriate to indent the body of the loop for readability, the shell does not interpret this indentation as delineating where the loop body begins and ends the way Python does. Rather, it is the done that actually signals the end of the loop body. (Similarly, the do indicates where the loop begins.)

Suppose you want one of the lines that will be printed out by this loop to be “CS 122”. Try inserting this phrase amidst the other values in the list and see what happens.

The lesson here is that the shell, by default, breaks everything up into words. If we want a multi-word phrase (anything containing spaces) to be treated as a single unit, we must use quotes around it:

$ for i in a b c "CS 122" 1 2 3; do
>  echo $i
> done
a
b
c
CS 122
1
2
3

Explicitly specifying a list to iterate over is useful, but often we want this list to be generated dynamically. To generate dynamic lists, we often use the output from some command. For instance, try executing the following command:

$ seq 1 8

As noted earlier, we use back ticks to denote that we want to run a command and then use its output for something else in a script. For instance, try this:

$ for i in `seq 1 8`; do
>  echo "The next number is: $i"
> done

Had we simply written seq 1 8 in our loop header, we would have created a list of length three with the contents seq, 1, and 8. Had we used quotes, we would have created a list with a single entry: the string seq 1 8. It is the use of back ticks, specifically, that asks the shell to run the command and use its output as the list of values.

Let’s put these ideas together to create several files:

$ for i in `seq 1 8`; do
>  echo "This was originally file$i" > file$i
> done

After you run this loop, you should have eight sequentially-numbered files in your directory; use ls to check. (If you’ve been trying every example, you’ll also have example.txt, from earlier.)

Renaming files

Imagine you just downloaded these eight files from a web site and they contain data useful for your project. But, they aren’t named the way you want; you really want them to end with .txt.

You could manually rename each file; it only takes a minute, although it is tedious. But… what if there are eight files per state? That’s 400 files. In other words, think of these eight files as stand-ins for what could have been a lot more. Whatever your personal threshold for pain, we can think of scenarios that would cross it.

Let’s use a loop to rename them, instead. Remember that the mv command, used in the form mv a b, moves file a to file b. If these two filenames are in the same directory, this command would more accurately be thought of as renaming than moving.

Let’s go ahead and rename all of them. We want only to rename the files that have names beginning with file, not the separate example.txt. To loop over all the files matching some pattern, we can use a third style of loop header:

$ for i in file*; do
>  mv $i $i.txt
> done

Here, we ask for all the files whose names begin with file in the current directory. The asterisk (*) character acts as a wildcard, and this single pattern is expanded to a list of all files whose names match the pattern.

This wildcard and pattern beg a little explanation. The shell’s filename patterns (often called globs) look a lot like regular expressions; for instance, we can use brackets for ranges. But, there are some subtle differences: for instance, we write only the asterisk to denote what would be, in a proper regular expression, a period (to represent any possible character) followed by a star (to represent possible repetitions of it). A full description of the differences between the shell’s matching patterns and regular expressions is beyond the scope of this lab. And, in practice, we tend not to have occasion to use particularly esoteric patterns in the shell.
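For instance, brackets match a single character from a range. At this point in the lab, the files are named file1.txt through file8.txt, so the following should list just the first three of them (a small demonstration; skip it if your directory looks different):

$ ls file[1-3]*
file1.txt  file2.txt  file3.txt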

Imagine we also want to make a backup copy of these files, with file names like file1.txtbackup. Recall that we need to surround variables with curly braces when constructing larger strings:

$ for i in file*txt; do
>  cp $i ${i}backup
> done

After running this command, we realize that we actually meant for all these files to be named fileN.csv, not fileN.txt. Oops.

As you might imagine, trimming off suffixes (and, sometimes, prefixes) from filenames is a relatively common shell task. Thus, bash provides syntax to accomplish it. Given a variable which might end with a certain suffix, the syntax ${VARIABLE%suffix} gives us the variable’s contents, minus the suffix. (Any variable not ending with the specified suffix passes through unmodified.)

Let’s use this to fix our mistake:

$ for i in file*txt; do
>  mv $i ${i%txt}csv
> done

In one fell swoop, we have trimmed off the txt and then appended csv. Check with ls to confirm we fixed our mistake. (Note that the backup files are unmodified, because they do not end in txt, as the pattern in the loop header requires.)

A similar method is used to strip prefixes: ${variable#prefix}. Using this syntax, write a loop to change the names of our files from fileN.csv to dataN.csv.
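If it helps, here is the prefix-stripping syntax on its own, applied to a hypothetical variable NAME; writing the actual loop is left to you:

$ NAME=file3.csv
$ echo data${NAME#file}
data3.csv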

Creating scripts

The loops we’ve written have been useful, but they’ve also been a lot to type. If we expect to have to do tasks like this frequently, it would be nice to create a tool once that we can use over and over again.

A shell script is a file that contains a series of commands to be executed by a shell, just like a Python program is a series of commands to be executed by Python (especially if it does not contain any function definitions). To create a shell script, we simply use a text editor to write the commands and save them in a file; sometimes, but not always, we choose to use a file name ending in .sh. (This is a matter of preference rather than conferring some specific technical meaning.) The first line of a bash shell script should contain:

#!/bin/bash

Let’s try out this process. Create a file named csv2txt.sh. The first line should be the special directive given above. The remainder of the file should contain the code for a loop that changes any file in the current directory whose name ends with .csv to an equivalently-named file ending only with .txt.
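If you get stuck, one possible version of the script might look something like this (a sketch; your own solution may look a little different):

#!/bin/bash
# Rename every file in the current directory ending in .csv to end in .txt instead.
for i in *.csv; do
  mv $i ${i%.csv}.txt
done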

Once you have saved the file, go to the shell and run the command:

$ chmod u+x csv2txt.sh

This command marks the file as an executable program. Picking this command apart, chmod (change mode) changes the permissions or modes of the file. The letter u is short for “user” (that would be you). The plus symbol means that we want to add a permission, not remove it; x stands for execute. (See Changing permissions, owner, and group from the CAPP 30121 Linux lab for a more detailed discussion of permissions.)

Now, you can run the script with this command:

$ ./csv2txt.sh

If all worked properly, ls should demonstrate your success.

(To explain how we ran the script: as when we run any program, we gave the name of the program. But, because it was not installed in a standard location where programs are stored, we had to say where to find it; the shell does not look in the current directory by default. The current directory can be referenced with a single period, just as the enclosing directory is referenced by two periods (as when we write cd ..). The slash is a delimiter between the directory name and the file name.)

This script was useful for a narrowly-defined purpose, but it would be nice to make it a bit more flexible. How about being able to specify the filename extensions (suffixes) from which and to which we want to change?

Before we write that script, let’s understand how command-line arguments to shell scripts work.

When a shell script is launched with command-line arguments, they are simply stored in numbered variables. The first argument is in $1, the second in $2, and so on.

Try making a shell script like this one:

#!/bin/bash
echo "The first arg was $1 and the second was $2"

Save it, change its permissions using chmod, and run it with some test arguments. For instance, if you named it args.sh, you might try:

./args.sh alpha bravo

But, also try running it with zero, one, and three arguments.

What happened, and why? Remember, undefined variables don’t result in a NameError as they do in Python; they just give us an empty string.

Because we take pride in our code, we want it to be robust. We also understand that robust code saves us from trying to debug inscrutable errors down the road. If we intend our script always to be used with two arguments, then let’s force the issue.

There are two ingredients you need to accomplish this task: a way to count the number of arguments and conditional execution.

As it turns out, the first is easy: alongside $1 and its ilk, shell scripts also receive the variable $#, which holds the number of arguments that were provided by the user.
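A quick way to see this variable in action is to add a line like the following to the args.sh script from above (or to a new throwaway script):

echo "This script was given $# argument(s)"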

As far as conditionals go, here is the general syntax for a conditional:

if [ test ]; then
  commands
fi

Reviewing this statement piece-by-piece: if matches our expectations from Python. The specific test condition is enclosed in brackets; note that the spaces just inside the brackets (after [ and before ]) are required, and the shell will complain if they are missing. We’ll discuss what can be filled in for test in a moment. The semicolon here is less well-motivated than in the case of a for loop, where it indicated the end of a list, but we keep it for uniformity. The keyword then, similarly to the do in a for loop, acts like the colon in a Python if statement. The commands that are conditionally executed are indented by convention, but like the for loop, the shell does not actually confer semantic meaning to indentation; rather, the body of the conditional begins with then and ends with fi.

(Why fi? It’s if backwards. Someone apparently had a little chuckle about this when they designed the shell syntax. We’ll leave it to you to decide if you find this cute or just weird.)

Here are some of the tests we can use:

  • =, !=, \<, \>: compare two strings for equality (note we use only a single equals sign), non-equality, or which string comes first in alphabetical order. (Note that < and > are used for redirection, so we need to escape them with the backslash when used for comparisons.)
  • -eq, -ne, -lt, -le, -gt, and -ge: compare two variables mathematically, interpreting them as numbers rather than strings. (This is just like the difference between sort and sort -n.) In order: equals, not equals, less than, less than or equal to, greater than, greater than or equal to.

Examples:

if [ $NAME = Alice ]; ...
if [ $AGE -lt 18 ]; ...

With this new syntax in mind, modify your script that prints out the arguments so that it only does so if exactly two arguments were specified, and does nothing otherwise.

Once you have that working, it seems a little unhelpful to fail silently if the wrong number of arguments were given. To resolve this limitation, use an else clause. Here is the syntax for else:

if [ test ]; then
  commands
else
  commands
fi

Modify your script to print out a helpful error message (remember to use echo for this purpose) in the case of anything other than two arguments.
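For reference, the finished script might end up looking something like this (a sketch; your error message will no doubt differ):

#!/bin/bash
if [ $# -eq 2 ]; then
  echo "The first arg was $1 and the second was $2"
else
  echo "Please provide exactly two arguments."
fi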

For completeness, here is how you would have an “else-if” clause:

if [ test ]; then
  commands
elif [ test ]; then
  commands
else
  commands
fi

(Note that the else clause at the end is optional.)

It’s time to make your csv2txt.sh script more general:

$ cp csv2txt.sh chsuffix.sh

Open up chsuffix.sh in your text editor. We want this script to be able to be run like this:

$ chsuffix.sh txt csv

and change all the files in the current directory that end with .txt to end with .csv instead – or to be able to do this with any other extensions we might desire.

Replacing the hard-coded suffixes in our original csv2txt file with the variables that contain arguments from the command-line does not require any special syntax; we can just plug in the variables where the hard-coded values were. This results in some tortuous syntax, but it does work. Don’t forget to make the script robust by making sure the right number of parameters were provided, though. And make sure to print out a message explaining how to use the script.
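One possible shape for chsuffix.sh, with the argument check included, might be something like this (a sketch, not the only reasonable answer):

#!/bin/bash
# Usage: ./chsuffix.sh OLD_SUFFIX NEW_SUFFIX
if [ $# -ne 2 ]; then
  echo "usage: ./chsuffix.sh OLD_SUFFIX NEW_SUFFIX"
else
  for i in *$1; do
    mv $i ${i%$1}$2
  done
fi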

Checking downloaded files

Remember that you (imaginarily) downloaded all these files from some web site to use for your project. You are concerned that some file might be missing: perhaps you forgot to click on one of the links, or the server had an error, or the data set has gaps. It would be good to do an automated scan over the files (again, imagine there are hundreds or thousands of them, not eight). Also, suppose you know how long each file should be; any file not matching that length is suspect and should be flagged.

Go ahead and corrupt the files slightly:

  • Delete a couple of them.
  • Open up one of them in a text editor and add an extra line.
  • Open up another and delete everything, so that it is empty.

Remember that the files were created with sequentially-numbered file names. So, to check for the presence of all the expected files, you could write a for loop, using seq, to generate all the expected numbers. In the body of the loop, generate the whole filename that is expected, and check if it exists. If it does, then silently continue; if not, print an error indicating the name of the missing file.

To accomplish this, you’ll need a new type of test within your conditional:

  • -e name checks whether something with the given name exists (it passes whether the name refers to a regular file or a directory)
  • -f filename checks whether the given file exists and is a regular file (not a directory)
  • -d dirname checks whether the specified directory exists and is actually a directory, not a regular file

For instance:

if [ -e foo.txt ]; ...

To negate a condition, use !:

if [ ! -e bar.txt ]; ...

This tool seems like another one that might be useful over and over again, so write it in a script rather than just entering the code in the terminal; this approach also avoids retyping everything while you debug.
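As a starting point, the skeleton of such a script might look something like this (a sketch; it assumes the files are currently named data1.csv through data8.csv, so adjust the name and suffix to match whatever your files are actually called at this point):

#!/bin/bash
# Report any of the expected, sequentially-numbered files that are missing.
for i in `seq 1 8`; do
  if [ ! -e data$i.csv ]; then
    echo "Missing file: data$i.csv"
  fi
done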

Once you get it working, this script should detect the missing files, but not the corrupted ones.

Remember that wc -l (word count, lines) is a tool that counts the number of lines in a specified file. When run, it prints an output like this:

34 foo.txt

Unfortunately, we only want the first column of this output, not the filename. Fortunately, there is a command we can use to choose only a specific column: awk. awk is a full text-processing language in its own right, but here we only need one small piece of it. The following command will print back only the first column of its input:

awk '{print $1}'
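For instance, piping the output of wc -l into this awk command leaves us with just the count. Assuming example.txt from the beginning of the lab is still around under that name (substitute any one-line file if you renamed it along the way):

$ wc -l example.txt | awk '{print $1}'
1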

Let’s put this all together. In your script, you did nothing in the case of an existing file, and printed an error in the case of a missing one. But now, in the case of an existing file, you should check whether that file has the correct number of lines (in this case, one).

To do this task:

  • Use wc -l to count the number of lines.
  • Pipe its output into the awk command, above, to get only the actual count.
  • Surround all of this code in back ticks, as we do when we want to use the output of a command (or series of commands) as a variable or value.
  • Write an if condition that mathematically compares this value to the numerical value 1 and prints a warning if they are not equal.
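Putting these steps together, the checking loop might end up looking something like this (a sketch; as before, it assumes the files are named data1.csv through data8.csv and should each contain exactly one line):

#!/bin/bash
# Flag files that are missing or that do not contain exactly one line.
for i in `seq 1 8`; do
  if [ ! -e data$i.csv ]; then
    echo "Missing file: data$i.csv"
  else
    LINES=`wc -l data$i.csv | awk '{print $1}'`
    if [ $LINES -ne 1 ]; then
      echo "Warning: data$i.csv has $LINES lines instead of 1"
    fi
  fi
done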

Conclusion

Shell scripting allows us to perform tasks and make reusable tools to accomplish file manipulation and text processing. Automating these tasks avoids tedium and reduces the chances to make mistakes.

While everything that can be accomplished with shell scripting can also be performed in Python, the shell often affords us the ability to write certain types of tasks more succinctly than we could in a heavier-weight language.

We hope you can imagine scenarios where the specific scripts we wrote today are not just instructional examples, but genuinely useful for a project you are working on. Many similar tasks can also be tackled with shell scripting, and investing the time to learn the basics is very likely to be rewarded.

Acknowledgments

Matthew Wachs designed the original version of this lab.