Shell Scripting¶
The goal of this lab is to learn how to accomplish a few common file-processing tasks using shell scripting.
Introduction¶
The shell is the program that you interact with whenever you open a terminal window on a Unix-style system. The terminal program itself simply draws the window and text (emphasis on “drawing”). The shell is the program that generates the text, such as the prompt to enter in a command, and then processes the commands you enter.
The shell handles most commands that you enter, like git or python3, by launching the relevant program and handing it control over the terminal. Once the program exits, control is passed back to the shell, which prints a new prompt and waits for a new command. (You can put a command in the background by appending an ampersand (&) to the end of the line, but uses of this feature are beyond the scope of this lab.)
The shell also offers the opportunity to write shell scripts in a shell-specific programming language. We’ll focus on shell-scripting in this lab.
There are many different shells available (bash, ksh, tcsh, etc.). The majority of these alternatives behave identically when used to perform basic tasks like launching a program, redirecting output, or connecting commands using pipes. In this lab, we’ll focus on writing bash scripts, since bash is the de facto standard on Linux systems and is also available on macOS.
Shell scripting is most useful for text-processing tasks, and for manipulating (e.g. copying, moving, renaming, or deleting) files. To support text manipulation and abstraction, shell scripting languages support variables, which are typically intended to be used for strings. To enable repeated work, such as performing a task on every file in a directory, they provide loops. And, to allow behavior to be selective, they support conditionals.
Before we get started, we’d like to point you to a useful tool: explainshell.com allows you to enter a Linux command-line and then see the help text that matches each piece of syntax.
Also, you may want to review the Linux Lab from CAPP 30121. The sections on redirection and piping are particularly useful.
Getting started¶
During this lab, we will create a number of dummy files to use for practice. To start, please make a directory within your home directory, but not within your repository, to house these files; we want to minimize the chances that anything you will do during this lab will affect your repository. For instance:
$ cd
$ mkdir shell-practice
$ cd shell-practice
You will remain in this sub-directory for the entire lab.
As is our convention, we will use $ to indicate the Linux command-line prompt. You should not include the $ when you run these commands.
Shell variables¶
Shell scripting languages, like bash, typically include a way to define shell variables. These variables typically have string values, but can, depending on the context, be interpreted as integers, etc.
Here’s a sample variable definition and a sample use:
$ FAV_COLOR=blue
$ echo $FAV_COLOR
blue
The first line defines a variable named FAV_COLOR and assigns it the initial value blue. Notice that there are no spaces on either side of the equal sign. The syntax is quite strict: the shell will generate an error message if you add a space before or after the equal sign. Also, notice that since the desired value (blue) does not include any embedded spaces, we do not need to use quotes around it.
To evaluate a shell variable, we prefix its name with a $, as in $FAV_COLOR. Recall that the echo command simply prints its command-line arguments to standard out. So, the second line will print blue, the value of FAV_COLOR, to the terminal or, more generally, to standard out.
Variable names are case sensitive, so FAV_COLOR and fav_color are different variables. Also, evaluating a variable that has not been defined yields an empty string. So, evaluating the command:
$ echo $fav_color
will yield a blank line.
There are times when we want to use a variable to construct a larger string. For example, we might want to add the word “bell” after the value of favorite color. Here’s an example that illustrates this process:
$ echo ${FAV_COLOR}bell
bluebell
Adding curly braces around the variable name allows the shell interpreter to separate the variable name (FAV_COLOR) from the rest of the string (bell). The curly braces are only required in this situation, but it is always OK to include them for clarity.
Here are two examples that illustrate another important aspect of shell variable use:
$ echo "${FAV_COLOR}bells grow in spring."
bluebells grow in spring.
$ echo '${FAV_COLOR}bells grow in spring.'
${FAV_COLOR}bells grow in spring.
As the first of these examples illustrates, the bash interpreter replaces variables that occur in double-quoted strings with their corresponding values. It does not, however, replace uses of variables that occur in strings surrounded by single quotes. This behavior is illustrated by the second example.
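To see braces and both kinds of quotes side by side, here is a small sketch that you could save in a file and run with bash. The variable names (FLOWER_COLOR and the four result variables) are made up for this illustration.

```shell
#!/bin/bash
# Hypothetical variable, used only in this sketch.
FLOWER_COLOR=blue

# Without braces, the shell looks for a variable named FLOWER_COLORbell.
# No such variable is defined, so this expands to an empty string.
no_braces="$FLOWER_COLORbell"

# With braces, the variable name ends at the closing brace.
with_braces="${FLOWER_COLOR}bell"

# Double quotes allow expansion; single quotes keep the text literal.
double_quoted="${FLOWER_COLOR}bells grow in spring."
single_quoted='${FLOWER_COLOR}bells grow in spring.'

echo "no braces:     [$no_braces]"
echo "with braces:   [$with_braces]"
echo "double quoted: $double_quoted"
echo "single quoted: $single_quoted"
```

The first line of output should show empty brackets, which is a common symptom of forgetting the braces.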
It can be useful to capture the result of running a command in a variable. To do so, surround the command with back ticks (the quote key near the tab and escape keys on your keyboard). For example, to capture the current working directory, which we can get from the pwd command, in a variable named WDIR, we would run the command:
$ WDIR=`pwd`
$ echo ${WDIR}
/home/student/shell-practice
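bash also accepts a second spelling for the same operation, $(command), which behaves like back ticks and tends to be easier to read when one command is nested inside another. A brief sketch, with variable names invented for the example:

```shell
#!/bin/bash
# Both forms run pwd and capture what it prints.
WDIR_TICKS=`pwd`
WDIR_PARENS=$(pwd)

# Nesting needs no extra escaping with $( ): the inner command runs
# first, and its output becomes the argument of basename.
LAST_PART=$(basename "$(pwd)")

echo "back ticks:  $WDIR_TICKS"
echo "parentheses: $WDIR_PARENS"
echo "last part:   $LAST_PART"
```

The rest of this lab uses back ticks, so treat the second form as an optional alternative.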
The variables we have defined thus far are only visible in the current shell. We can export a variable to make it visible to programs that we might run from the current shell (including other instances of the shell that we might launch):
$ export FAV_COLOR=blue
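One way to observe the difference is to launch a child shell and ask it to print both an exported and a non-exported variable. This is only a sketch: the variable names are hypothetical, and sh -c is used here simply as a convenient way to start another shell.

```shell
#!/bin/bash
# Hypothetical variables: one plain, one exported.
LOCAL_ONLY=red
export SHARED=green

# sh -c starts a child shell. The single quotes keep the current shell
# from expanding the variables, so the child expands them itself.
child_view=$(sh -c 'echo "local=[$LOCAL_ONLY] shared=[$SHARED]"')

echo "$child_view"
```

The child shell should report an empty value for LOCAL_ONLY and green for SHARED, because only the exported variable is passed along.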
There is a set of environment variables defined by default. For example:
$ echo $SHELL
/bin/bash
You can use the printenv command to get a list of the environment variables currently defined.
Creating dummy files¶
Now let’s make some dummy files to work with. For this lab, it is okay for the files to contain boilerplate data; we can just imagine that they might contain something more useful. We’ll be focusing on the names of the files.
We’ll use the echo command and redirection to construct a simple file:
$ echo File this away > example.txt
Recall that the greater-than symbol indicates that standard out should be redirected to the specified file. We can use the cat command to see the contents of the resulting file:
$ cat example.txt
File this away
Now, let’s see how to write a for loop in bash syntax:
$ for i in a b c 1 2 3; do
> echo $i
> done
Note that this loop can be entered on multiple lines; when you press return after the first or second line, bash will print a special prompt (>) and wait for you to complete the loop before executing it. Give this loop a try. Once you do, you should see lines with a, b, c, 1, 2, and 3 followed by a new prompt. (Do not include the > in your code.)
Picking this code apart, this loop looks similar to a Python loop: we start with for, then give the name of a loop variable (in this case, i). This variable will take on different values during each iteration.
After in, we then give the list of values to loop over. Here, the syntax is slightly different than Python’s; we just specify a list of values (no delimiter, like [, required). Successive values are separated by spaces. Although there is no opening delimiter, we do need to be clear about when the list ends. We mark the end of the list with a semicolon (;). Adding the semicolon indicates that do is a keyword rather than one of the values in the list.
The keyword do plays a role similar to the colon (:) in a Python loop. Starting on the next line, we give one or more commands to execute for each iteration. At any point, we can refer to the loop variable using its name preceded by a dollar sign ($).
We mark the end of the loop with the keyword done. We need a way to mark the end of the loop because, unlike Python, bash does not assign semantic meaning to indentation. While it is appropriate to indent the body of the loop for readability, the shell does not interpret this indentation as delineating where the loop body begins and ends the way Python does. Rather, it is the done that actually signals the end of the loop body. (Similarly, the do indicates where the loop body begins.)
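Because do and done, not indentation, mark the body, the same loop can also be written on a single line, with a semicolon taking the place of each line break. The sketch below captures the output of both spellings using $(...), which, like back ticks, runs a command and keeps what it prints; the variable names are ours.

```shell
#!/bin/bash
# The loop written across several lines, indented for readability.
multi_line=$(
    for i in a b c; do
        echo "$i"
    done
)

# The same loop on one line; semicolons stand in for the line breaks.
one_line=$(for i in a b c; do echo "$i"; done)

echo "Multi-line version printed:"
echo "$multi_line"
echo "One-line version printed:"
echo "$one_line"
```

Both versions should print the same three lines. The one-line form is handy at the prompt, while the indented form is usually easier to read in a script.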
Suppose you want one of the lines that will be printed out by this loop to be “CS 122”. Try inserting this phrase amidst the other values in the list and see what happens.
The lesson here is that the shell, by default, breaks everything up into words. If we want a multi-word phrase (anything containing spaces) to be treated as a single unit, we must use quotes around it:
$ for i in a b c "CS 122" 1 2 3; do
> echo $i
> done
a
b
c
CS 122
1
2
3
Explicitly specifying a list to iterate over is useful, but often we want this list to be generated dynamically. To generate dynamic lists, we often use the output from some command. For instance, try executing the following command:
$ seq 1 8
As noted earlier, we use back ticks to denote that we want to run a command and then use its output for something else in a script. For instance, try this:
$ for i in `seq 1 8`; do
> echo "The next number is: $i"
> done
Had we simply written seq 1 8 in our loop header, we would have created a list of length three with the contents seq, 1, and 8. Had we used quotes, we would have created a list with a single entry: the string seq 1 8. It is the use of back ticks, specifically, that asks to run the command and store its results as a list.
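The three loop headers can be compared directly by counting how many times each loop body runs. In this sketch the counter names are hypothetical, and the counters are incremented with $(( )), bash's arithmetic syntax, which is not otherwise covered in this lab.

```shell
#!/bin/bash
# Plain words: the list is the three words seq, 1, and 8.
plain=0
for i in seq 1 8; do
    plain=$((plain + 1))
done

# Quoted: the list is the single string "seq 1 8".
quoted=0
for i in "seq 1 8"; do
    quoted=$((quoted + 1))
done

# Back ticks: the list is the output of running seq 1 8.
ticks=0
for i in `seq 1 8`; do
    ticks=$((ticks + 1))
done

echo "plain words: $plain, quoted: $quoted, back ticks: $ticks"
```

The counts should come out as 3, 1, and 8, matching the description above.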
Let’s put these ideas together to create several files:
$ for i in `seq 1 8`; do
> echo "This was originally file$i" > file$i
> done
After you run this loop, you should have eight sequentially-numbered files in your directory; use ls to check. (If you’ve been trying every example, you’ll also have example.txt, from earlier.)
Renaming files¶
Imagine you just downloaded these eight files from a web site and they contain data useful for your project. But, they aren’t named the way you want; you really want them to end with .txt.
You could manually rename each file; it only takes a minute, although it is tedious. But… what if there are eight files per state? That’s 400 files. In other words, think of these eight files as stand-ins for what could have been a lot more. Whatever your personal threshold for pain, we can think of scenarios that would cross it.
Let’s use a loop to rename them, instead. Remember that the mv command, used in the form mv a b, moves file a to file b. If these two filenames are in the same directory, this command would more accurately be thought of as renaming than moving.
Let’s go ahead and rename all of them. We want only to rename the files that have names beginning with file, not the separate example.txt. To loop over all the files matching some pattern, we can use a third style of loop header:
$ for i in file*; do
> mv $i $i.txt
> done
Here, we ask for all the files whose names begin with file in the current directory. The asterisk (*) character acts as a wildcard, and this single pattern is expanded to a list of all files whose names match the pattern.
This wildcard and pattern call for a little explanation. The shell’s filename patterns (often called globs) look similar to regular expressions and share some of their syntax; for instance, we can use brackets for ranges. But they are a distinct, simpler notation: for instance, we write only the asterisk to denote what would be, in a proper regular expression, a period (to represent any possible character) followed by a star (to represent possible repetitions of it). A full description of the differences between the shell’s matching patterns and regular expressions is beyond the scope of this lab. And, in practice, we tend not to have occasion to use particularly esoteric patterns in the shell.
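To get a feel for how patterns expand, the following sketch builds a throwaway directory (so that none of your practice files are touched) and prints what a few patterns match there. The directory and variable names are hypothetical; mktemp -d and touch are standard utilities that create a temporary directory and empty files, respectively.

```shell
#!/bin/bash
# Work in a throwaway directory so no real files are touched.
scratch=$(mktemp -d)
touch "$scratch/file1" "$scratch/file2" "$scratch/file3" "$scratch/example.txt"

# Each $( ) runs in a subshell, so the cd does not change the
# directory of the surrounding script.
all_files=$(cd "$scratch" && echo file*)
some_files=$(cd "$scratch" && echo file[1-2])
text_files=$(cd "$scratch" && echo *.txt)

echo "file*     -> $all_files"
echo "file[1-2] -> $some_files"
echo "*.txt     -> $text_files"

rm -r "$scratch"
```

Note that echo never sees the pattern itself: the shell replaces each pattern with the matching names before the command runs.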
Imagine we also want to make a backup copy of these files, with file names like file1.txtbackup. Recall that we need to surround variables with curly braces when constructing larger strings:
$ for i in file*txt; do
> cp $i ${i}backup
> done
After running this command, I just realized that I meant for all these files to be named fileN.csv, not fileN.txt. Oops.
As you might imagine, trimming off suffixes (and, sometimes, prefixes) from filenames is a relatively common shell task. Thus, bash provides syntax to accomplish it. Given a variable which might end with a certain suffix, the syntax ${VARIABLE%suffix} gives us the variable’s contents, minus the suffix. (Any variable not ending with the specified suffix passes through unmodified.)
Let’s use this to fix our mistake:
$ for i in file*txt; do
> mv $i ${i%txt}csv
> done
In one fell swoop, we have trimmed off the txt and then appended csv. Check with ls to confirm we fixed our mistake. (Note that the backup files are unmodified, because they do not end in txt, as the pattern in the loop header requires.)
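Suffix trimming can be tried out on plain strings before it is trusted with real files. The sketch below uses two made-up names and only prints the results; nothing is renamed.

```shell
#!/bin/bash
# Sample names only; no files are created or renamed here.
name=file3.txt
backup=file3.txtbackup

# The suffix txt is removed, and then csv is appended.
renamed="${name%txt}csv"

# backup does not end in txt, so % leaves it unchanged.
untouched="${backup%txt}"

echo "$name -> $renamed"
echo "$backup -> $untouched"
```

Printing the new names first, and only then switching echo for mv, is a cautious habit when a loop is about to rename many files.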
A similar method is used to strip prefixes: ${variable#prefix}. Using this syntax, write a loop to change the names of our files from fileN.csv to dataN.csv.
Creating scripts¶
The loops we’ve written have been useful, but they’ve also been a lot to type. If we expect to have to do tasks like this frequently, it would be nice to create a tool once that we can use over and over again.
A shell script is a file that contains a series of commands to be executed by a shell, just like a Python program is a series of commands to be executed by Python (especially if it does not contain any function definitions). To create a shell script, we simply use a text editor to write the commands and save them in a file; sometimes, but not always, we choose to use a file name ending in .sh. (This is a matter of preference rather than conferring some specific technical meaning.) The first line of a bash shell script should contain:
#!/bin/bash
Let’s try out this process. Create a file named csv2txt.sh. The first line should be the special directive given above. The remainder of the file should contain the code for a loop that changes any file in the current directory whose name ends with .csv to an equivalently-named file ending only with .txt.
Once you have saved the file, go to the shell and run the command:
$ chmod u+x csv2txt.sh
This command marks the file as an executable program. Picking this command apart, chmod (change mode) changes the permissions or modes of the file. The letter u is short for “user” (that would be you). The plus symbol means that we want to add a permission, not remove it; x stands for execute. (See “Changing permissions, owner, and group” from the CAPP 30121 Linux lab for a more detailed discussion of permissions.)
Now, you can run the script with this command:
$ ./csv2txt.sh
If all worked properly, ls should demonstrate your success.
(To explain how we ran the script: as when we run any program, we gave the name of the program. But, because it was not installed in a standard location where programs are stored, we had to say where to find it; the shell does not look in the current directory by default. The current directory can be referenced with a single period, just as the enclosing directory is referenced by two periods, as in when we write cd .. to move up a level. The slash is a delimiter between the directory name and the file name.)
This script was useful for a narrowly-defined purpose, but it would be nice to make it a bit more flexible. How about being able to specify the filename extensions (suffixes) from which and to which we want to change?
Before we write that script, let’s understand how command-line arguments to shell scripts work.
When a shell script is launched with command-line arguments, they are simply stored in numbered variables. The first argument is in $1, the second in $2, and so on.
Try making a shell script like this one:
#!/bin/bash
echo "The first arg was $1 and the second was $2"
Save it, change its permissions using chmod, and run it with some test arguments. For instance, if you named it args.sh, you might try:
./args.sh alpha bravo
But, also try running it with zero, one, and three arguments.
What happened, and why? Remember, undefined variables don’t result in a NameError as they do in Python; they just give us an empty string.
Because we take pride in our code, we want it to be robust. We also understand that robust code saves us from trying to debug inscrutable errors down the road. If we intend our script always to be used with two arguments, then let’s force the issue.
There are two ingredients you need to accomplish this task: a way to count the number of arguments and conditional execution.
As it turns out, the first is easy: alongside $1 and its ilk, shell scripts also receive the variable $#, which holds the number of arguments that were provided by the user.
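One way to experiment with $1, $2, and $# without editing and re-running a script each time is to wrap the echo in a shell function. Functions are not otherwise covered in this lab; inside a function, the numbered variables and $# describe the arguments passed to the function, in the same way that they describe a script's arguments at the top level of a script. The function name show_args is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: reports how many arguments it received
# and what the first two were.
show_args() {
    echo "count=$# first=[$1] second=[$2]"
}

show_args alpha bravo
show_args alpha
show_args
show_args alpha bravo charlie
```

Missing arguments show up as empty brackets, and an extra argument is counted by $# even though nothing prints it.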
As far as conditionals go, here is the general syntax for a conditional:
if [ test ]; then
commands
fi
Reviewing this statement piece-by-piece: if matches our expectations from Python. The specific test condition is enclosed in brackets; note the precise spacing we have here, which must be adhered to scrupulously. We’ll discuss what can be filled in for test in a moment. The semicolon plays much the same role as in a for loop, where it indicated the end of a list: here it marks the end of the test, so that then is recognized as a keyword. The keyword then, similarly to the do in a for loop, acts like the colon in a Python if statement. The commands that are conditionally executed are indented by convention, but as with the for loop, the shell does not actually confer semantic meaning to indentation; rather, the body of the conditional begins with then and ends with fi.
(Why fi? It’s if backwards. Someone apparently had a little chuckle about this when they designed the shell syntax. We’ll leave it to you to decide if you find this cute or just weird.)
Here are some of the tests we can use:
- =, !=, \<, \>: compare two strings for equality (note we use only a single equals sign), non-equality, or which string comes first in alphabetical order. (Note that < and > are used for redirection, so we need to escape them with the backslash when used for comparisons.)
- -eq, -ne, -lt, -le, -gt, and -ge: compare two variables mathematically, interpreting them as numbers rather than strings. (This is just like the difference between sort and sort -n.) In order: equals, not equals, less than, less than or equal to, greater than, greater than or equal to.
Examples:
if [ $NAME = Alice ]; ...
if [ $AGE -lt 18 ]; ...
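The difference between the string tests and the numeric tests shows up clearly with two values that are equal as numbers but spelled differently. This is a minimal sketch with made-up variable names. The double quotes around each variable are a precaution we add here: if a variable were empty, an unquoted test such as [ $NAME = Alice ] would lose one of its operands and the shell would report an error.

```shell
#!/bin/bash
# Two sample values that are equal as numbers but not as strings.
a=07
b=7

# String comparison: the characters differ, so this test fails.
string_result="different"
if [ "$a" = "$b" ]; then
    string_result="same"
fi

# Numeric comparison: both are the number seven, so this test passes.
numeric_result="different"
if [ "$a" -eq "$b" ]; then
    numeric_result="same"
fi

echo "as strings: $string_result"
echo "as numbers: $numeric_result"
```

The script should report that the values differ as strings but are the same as numbers.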
With this new syntax in mind, modify your script that prints out the arguments so that it only does so if exactly two arguments were specified, and nothing otherwise.
Once you have that working, it seems a little unhelpful to fail silently if the wrong number of arguments was given. To resolve this limitation, use an else clause. Here is the syntax for else:
if [ test ]; then
commands
else
commands
fi
Modify your script to print out a helpful error message (remember to use echo for this purpose) in the case of anything other than two arguments.
For completeness, here is how you would have an “else-if” clause:
if [ test ]; then
commands
elif [ test ]; then
commands
else
commands
fi
(Note that the else clause at the end is optional.)
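Here is the if, elif, and else chain filled in with concrete tests. The AGE value, the thresholds, and the group names are arbitrary choices for this sketch; try changing AGE to see each branch run.

```shell
#!/bin/bash
# Sample value; change it to exercise the other branches.
AGE=15

# The tests are tried in order, and only the first passing branch runs.
if [ "$AGE" -lt 13 ]; then
    group="child"
elif [ "$AGE" -lt 18 ]; then
    group="teen"
else
    group="adult"
fi

echo "An age of $AGE falls in the group: $group"
```

Because the tests are tried in order, the elif branch does not need to repeat the check that AGE is at least 13.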
It’s time to make your csv2txt.sh script more general:
$ cp csv2txt.sh chsuffix.sh
Open up chsuffix.sh in your text editor. We want this script to be able to be run like this:
$ ./chsuffix.sh txt csv
and change all the files in the current directory that end with .txt to end with .csv instead, or to be able to do this with any other extensions we might desire.
Replacing the hard-coded suffixes in our original csv2txt.sh file with the variables that contain arguments from the command-line does not require any special syntax; we can just plug in the variables where the hard-coded values were. This results in some tortuous syntax, but it does work. Don’t forget to make the script robust by making sure the right number of parameters was provided, though. And make sure to print out a message explaining how to use the script.
Checking downloaded files¶
Remember that you (imaginarily) downloaded all these files from some web site to use for your project. You are concerned that some file might be missing: perhaps you forgot to click on one of the links, or the server had an error, or the data set has gaps. It would be good to do an automated scan over the files (again, imagine there are hundreds or thousands of them, not eight). Also, suppose you know how long each file should be; any file not matching that length is suspect and should be flagged.
Go ahead and corrupt the files slightly:
- Delete a couple of them.
- Open up one of them in a text editor and add an extra line.
- Open up another and delete everything, so that it is empty.
Remember that the files were created with sequentially-numbered file names. So, to check for the presence of all the expected files, you could write a for loop, using seq, to generate all the expected numbers. In the body of the loop, generate the whole filename that is expected, and check if it exists. If it does, then silently continue; if not, print an error indicating the name of the missing file.
To accomplish this, you’ll need a new type of test within your conditional:
- -e name checks whether the given name is the name of a file or directory within the current directory (in other words, it passes if the name is either a regular file or a directory)
- -f filename checks whether the given file exists and is not a directory
- -d dirname checks whether the specified directory exists and is actually a directory, not a regular file
For instance:
if [ -e foo.txt ]; ...
To negate a condition, use !:
if [ ! -e bar.txt ]; ...
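The three file tests, and negation, can be compared on a throwaway directory that contains one regular file and one sub-directory. Everything named in this sketch (scratch, present.txt, subdir, missing.txt, report) is hypothetical; mktemp -d is a standard utility that creates a temporary directory and prints its name.

```shell
#!/bin/bash
# Build a throwaway directory holding one file and one sub-directory.
scratch=$(mktemp -d)
echo "sample" > "$scratch/present.txt"
mkdir "$scratch/subdir"

# Record the label of every test that passes.
report=""
if [ -e "$scratch/present.txt" ]; then report="$report e:file"; fi
if [ -f "$scratch/present.txt" ]; then report="$report f:file"; fi
if [ -d "$scratch/present.txt" ]; then report="$report d:file"; fi
if [ -e "$scratch/subdir" ]; then report="$report e:dir"; fi
if [ -f "$scratch/subdir" ]; then report="$report f:dir"; fi
if [ -d "$scratch/subdir" ]; then report="$report d:dir"; fi
if [ ! -e "$scratch/missing.txt" ]; then report="$report not-e:missing"; fi

echo "Tests that passed:$report"

rm -r "$scratch"
```

The labels d:file and f:dir should be absent from the output, since a regular file is not a directory and a directory is not a regular file.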
This tool seems like another one that might be useful over and over again, so write it in a script rather than just entering the code in the terminal; this approach also avoids retyping everything while you debug.
Once you get it working, this script should detect the missing files, but not the corrupted ones.
Remember that wc -l (word count, lines) is a tool that counts the number of lines in a specified file. When run, it prints an output like this:
34 foo.txt
Unfortunately, we only want the first column of this output, not the filename. Fortunately, there is a command we can use to choose only a specific column: awk. This tool allows us to do various things. The following command will give you back only the first column:
awk '{print $1}'
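As a quick check of how wc -l and awk fit together, this sketch counts the lines of a temporary two-line file. The names sample and line_count are made up, and mktemp and printf are standard utilities used here only to build the test file.

```shell
#!/bin/bash
# Create a throwaway two-line file to count.
sample=$(mktemp)
printf 'first line\nsecond line\n' > "$sample"

# wc -l prints the count followed by the file name;
# awk keeps only the first column, which is the count.
line_count=`wc -l "$sample" | awk '{print $1}'`

echo "The sample file has $line_count lines."

rm "$sample"
```

The captured value is a bare number, which makes it suitable for the numeric tests such as -eq and -ne.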
Let’s put this all together. In your script, you did nothing in the case of an existing file, and printed an error in the case of a missing one. But now, in the case of an existing file, you should check whether that file has the correct number of lines (in this case, one).
To do this task:
- Use wc -l to count the number of lines.
- Pipe its output into the awk command, above, to get only the actual count.
- Surround all of this code in back ticks, as we do when we want to use the output of a command (or series of commands) as a variable or value.
- Write an if condition that mathematically compares this value to the numerical value 1 and prints a warning if they are not equal.
Conclusion¶
Shell scripting allows us to perform tasks and make reusable tools to accomplish file manipulation and text processing. Automating these tasks avoids tedium and reduces the chances to make mistakes.
While everything that can be accomplished with shell scripting can also be performed in Python, the shell often lets us express certain types of tasks more succinctly than we could in a heavier-weight language.
We hope you can imagine scenarios where the specific scripts we wrote today are not just instructional examples, but very much useful for a project you are working on. Many similar tasks can also be tackled with shell scripting, and investing the time to learn the basics is very likely to be rewarded.
Acknowledgments¶
Matthew Wachs designed the original version of this lab.