CS 123, Spring 2021: C, strings and ciphers

Goals

This week we're working in C. C is a stripped down language that is much closer to the bare hardware than Python. While this language can be frustrating at times, it is fast and has the potential to reduce scaling headaches by avoiding unnecessary overhead. Today, we will:

Please git pull upstream master to get the lab4 distribution, which contains the skeleton files you will use for this exercise.

Compiling and running C

C code can be compiled and run from the command line using the compiler clang, with the following commands:

clang <SOURCE FILE> -Wall -std=gnu99 -o <EXECUTABLE>
./<EXECUTABLE>

where <SOURCE FILE> is replaced by the name of the source file and <EXECUTABLE> is replace by the name for the executable.

The first command compiles the code (translates it into an executable file that can then be run; the second command does just that: runs the executable. Note that for the programming assignment, we will provide a Makefile that saves you from having to type out the (somewhat long) first command; this is customary for larger projects (in fact, for the programming assignment, because of the many files that we will provide, the first command would be more complex than the one above).

As an example, here are commands to compile and run the code in mycode.c:

clang mycode.c -Wall -std=gnu99 -o mycode
./mycode

To start you off, try compiling and running the program hello.c in lab4.

Before attempting to run the compiled code, ensure that no errors have been reported by the compiler. If there were errors, the executable is the version left over from the last time you successfully compiled, and does not reflect any intended changes.

Working with chars

A char represents a single letter or character. In a computer, these values are represented numerically, because ultimately, computers always work with numbers. For example, the character 'A' is stored on disk as 65 (actually, as a binary representation of the number 65). The number 65, as a number, would also have been stored in this manner. For this reason, in C,

'A' == 65

The assignment of numerical values to characters that is most commonly used for basic English text is called ASCII.

Because 'B' comes just after 'A' in the ASCII representation, we find that 'B'==66, 'C'==67, etc. In fact, The programming language C allows you to treat characters as integers and integers as characters. This design means a programmer can perform arithmetic on characters as if they were numbers. If you try the following:

char c = '!' + '!';
printf("%c", c);

you will discover that in character arithmetic, '!' + '!' == 'B', because '!' is 33 and 'B' is 66. Similarly '%'*2 is 'J', because the numerical values of '%' and 'J' are 37 and 74 respectively.

Here are some additional truthful expressions that represent some useful properties of characters:

    'A' + 3 == 'D'
  'f' - 'a' == 5
        '0' == 48
       '\0' == 0
        '7' != 7
  '7' - '0' == 7
'G' < 'M'

How does C know, then, whether to treat a char as a character or as a number,? The answer is in the context. If you use printf with the %c format specifier, you are asking C to treat it as a character. If you use %d instead, you are asking C to treat the char as a number. This context is important when printing a char to the screen. It is not important when doing arithmetic, because the underlying calculation is the same, whether the value being manipulated should be interpreted as a character or as a number.

Why is it called a char, then, if it represents a number? A char is the smallest (uses least memory) integer type in C, yet sufficiently large to represent any possible character (in English text). Thus, the name is more a suggestion about how it might be used rather than a dictate. A char can be used to store a small number, and an int can be used (wastefully, in terms of memory) to represent a character.

You can view a chart of the basic ASCII character set by typing the following into a terminal:

man ascii

Press q to return to the terminal.

The lesson is this: characters can be treated as numbers in many useful ways. You can add and subtract from characters to move around the ASCII chart. You can compare characters using greater-than and less-than relations, and because of the way the ASCII character set is laid out, letters will often have the quantitative properties you expect: that is, 'b' is greater than 'a' but less than 'c'. In fact, 'b' + 1 == 'c'.

Tasks

In this lab, you will do the following tasks.

Working with pointers and strings

Strings in C are an interesting example because they can be viewed as arrays of characters or pointers to characters (variables that hold the location, address in memory, of a character). In our first example we're going to create a function that computes the length of a string. Note that there are no native operations in C that handle entire string objects as a whole, all operations must be built out of smaller operations on individual characters. There is no string class/type in C. When we say "string", we mean a pointer to an array of characters with a specific format.

char *str = "Hello";

The variable str will store a single pointer (memory address), that points only to the first character 'H':

        +----+----+---+---+---+---+---+----+----+----+
Memory: | .. | .. | H | e | l | l | o | \0 | .. | .. |
        +----+----+---+---+---+---+---+----+----+----+
                    ^
                   str

Notice the '\0' after the 'o'. This character, which is called the NUL terminator, is used to mark the end of the string. Functions that work on strings use the NUL terminator to know when to stop.

Strings in C can be viewed both as pointers to characters, which makes sense given the type, and as arrays of characters. The expressions for reading a character from a string are different depending on which way you want to think about it.

Let's look at the pointer version first. Given the variable str declared above, the expression *str will return the value 'H', because the value of str is the address of the first character of the string and the * operator, which expects an address as an operand, returns the value stored at that address. To get the ith element of str, you can use the expression *(str+i). The expression (str+i) takes the value of str, which is an address, and adds enough bytes to get the address of the ith character. Once we have that address, applying the * operator yields its contents, that is, the ith character.

The array version is, in some ways, more straightforward. Given the variable str, the expression str[0] will yield the first character in the string and the expression str[i] will yield the ith character.

Let's combine these ideas to compute the length of a string. Here is a version of mystrlen that uses the array notation:

int mystrlen(char *s) {
    int i = 0;
    int len = 0;
    while (s[i] != '\0') {
      len++;
      i++;
    }
    return len;
}

Here is a pointer version that is a direct translation of the array version:

int mystrlen(char *s) {
    int i = 0;
    int len = 0;
    while (*(s+i) != '\0') {
      len++;
      i++;
    }
    return len;
}

Notice that the only difference between the two is in the way s is used in the loop test. Finally, here is a version that a practiced C programmer would likely write:

int mystrlen(char *s) {
    int len = 0;
    while (*s != '\0') {
      len++;
      s++;
    }
    return len;
}

This version eliminates the use of the index variable i in favor of incrementing s by one character for each iteration of the loop. Note that by incrementing s, we no longer have access to the start of the string in mystrlen, which is fine in this case, but can cause problems in others.

Malloc

In the next task, you will need to allocate space for a new string. The library function malloc is used to allocate space. It takes a number of bytes as an argument, allocates the requested number of bytes, and returns a pointer of type void * (a generic pointer) to the first byte. Here is an example call:

char *s = (char *) malloc(sizeof(char) * N);

This statement calls malloc to allocate enough space for N characters, where N is some variable, casts the resulting pointer from the generic void * to the desired char *, and initializes s to refer to the allocated space.

Follow the pattern above when you need to allocate space for this lab.

String concatenation

Task 2: Your task in this section is to create a function that concatenates/adds two strings together. It should have the following header:

char *mystrcat(char *A, char *B) // strcat for string concatenation

It takes in two string inputs, and returns the concatenation of those same strings. The call mystrcat("Hello", "World") should return "HelloWorld". Note that the naive solution:

char *output = A + B; // This code does not work

does not work. This statement will only create a third pointer to some bizarre location in memory that is the addition of the two locations A and B (pointers are just numeric locations in memory).

Instead you need to recognize that A and B point to locations in memory, as illustrated above for str. You will need to allocate space to hold the new string.

In this particular case you would need to allocate space for 5+5+1 == 11 characters and fill that space with:

{'H', 'e', 'l', 'l', 'o', 'W', 'o', 'r', 'l', 'd', '\0'}

You will need to transfer each letter individually, add a terminator at the end and then finally return the pointer to the beginning 'H'.

First you will need to determine the length of the two strings. Then, you will need to allocate enough space to hold the addition of the two strings plus a terminating character (using malloc). Then, copy the characters in the first string. Then copy the characters in the second string. Then you will need to write a terminator character '\0' and finally return the pointer to the beginning of the result string.

Including the header file, string.h, will get you access to a library of string functions. For your purposes today, the function strlen, which takes a string and returns the number of characters in the string, will be useful. For example, strlen("Hello") will return 5. Note that the representation for the string "Hello" actually takes 6 characters: five for the characters of Hello and one for the NUL terminator. A very common bug is to forget to allocate space for the NUL terminator in the string.

The file mystrcat.c contains a skeleton for your implementation of mystrcat along with some test code.

Caesar Cipher

Task 3: Your task in this section is to write functions implementing a Caesar Cipher. The Caesar Cipher is a rudimentary encryption scheme whereby each letter is shifted by a constant up or down. For example we could move the word "Hop" up by one to "Ipq" H->I, o->p, p->q. This method was invented (they say) by Julius Caesar and was used in communication with his generals. It is not recommended for use today as it is very easy to crack.

To write this easily we'll take advantage of characters' numerical representation in the ASCII format, 'A' = 65, 'B' = 66, 'C' = 67, etc. Because characters are the same as one-byteintegers we can abuse the type system and type 'A'+2 to obtain 'C'.

We have included a file caesar.c in lab4 that contains skeletons for some of the functions you will write in this part of the lab along with some test code.

Small functions to identify letters

We'll want to shift letters but not punctuation or spaces. In addition we will need to distinguish upper-case from lower-case letters. Write two functions:

bool is_upper(char c)
bool is_lower(char c)

They should return true if the character is an upper case letter (between 'A' and 'Z') or is a lower case letter (between 'a' and 'z'), respectively. Because statements like 'B'>'A' are themselves boolean values, these functions can be written as simple one-liners. As an example function, is_letter, could be written using the above functions as:

bool is_letter(char c) {
    return is_upper(c) || is_lower(c);
}

Making these sorts of functions manually is good practice. However, for future reference, there are plenty of useful functions for working with chars in the standard library file ctypes.h.

Note that in class, we saw the use of int, and the numbers 0 and 1, for booleans. This is, in fact, the original way of supporting booleans in C: use integers to store booleans, and interpret any non-zero integer as true, and zero as false. But, for clarity, we can retrofit a type named bool to C, and gain access to constants true and false, by including stdbool.h, and that is what we have done here. To be clear, behind the scenes, integers are still being used; it is just that your code uses aliases for int, 0, and 1 by including this library.

Shifting a character

Write a function, shiftchar, that takes in a character, c, and an integer, shift, and returns that character shifted up or down by the integer. The alphabet should cycle around so that shiftchar('Z', 1) should yield 'A'.

Test that this function works on a variety of inputs. There are easy cases like shiftchar('A', 1)=='B' and more challenging cases like shiftchar('z', 3)=='c' or shiftchar('a', -3)=='x'. Make sure that it correctly encrypts both upper case and lower case letters but does not affect non-letters, i.e. shiftChar('!', 5) == '!'.

This part of the lab is not the most educational. If you're spending a lot of time on this function, ask for help. We'd rather you focus more on learning pointers.

Putting it all together

Build a function encryptinplace that takes an input string and shift integer, goes through the string and shifts each character appropriately. Do this in place, that is, change the input string directly. Do not shift non-letters. Your function will not need to return anything. Test your function from your main code. Does it work? Why can we do this? Inputs to functions in C are copied and changes to them are not relayed to the outside. Why is our string changed in our main function then? What exactly was our input to the function? Was it changed?

Potential error: To test your code you may be tempted to make a string as follows:

char *teststring = "Testing both UPPER and lower case letters, as well as punctuation.\n";

Unfortunately static strings that we specify directly in quotes are usually stored in read-only memory and can not be modified. Modifying them will result in a Segmentation Fault (a crash due to a memory access issue). To fix this, simply create a dynamic copy of your string using:

teststring = strdup(teststring); // duplicate teststring dynamically

This string is stored in the space reserved for dynamically-allocated memory and can be changed. Notice however that you are now also responsible for releasing the memory when you are done with it, by calling free(teststring) as will be described further down.

Again, encryptinplace should not need to return anything. It will change the memory pointed to by one of the inputs. It should work as follows:

char *message = strdup("Hello World!");
printf("message    : %s\n", message);
encryptinplace(message, 5); // this changes message
printf("ciphertext : %s\n", message);

Now, build a function encryptnewmemory that performs the same task but does not change the input string. Rather it should allocate space for a new string of the appropriate size and fill it with the shifted characters. Do not simply copy your string and then call your previous encryption function. Pretend that this must be done quickly and the waste of writing the information twice is not acceptable (if you copy then you're writing once in string-copy, and then a second time when you shift). Return the newly allocated string, i.e. this function should work in the following way:

char *message = strdup("Hello World!");
printf("message    : %s\n", message);
char *ciphertext = encryptnewmemory(message, 5);
printf("ciphertext : %s\n", ciphertext);

Now build decryption functions for each of your previous encryption functions. These should each be one line long.

Free

Allocating memory dynamically is done by calling malloc. Each call to malloc must be matched with a call to free, when the memory is no longer needed. free releases memory back into the common pool. This operation is especially important if your program creates lots of temporary arrays. If you don't free them then you'll quickly run out of space. We didn't need to think about this issue in Python because the language uses garbage collection to handle memory deallocation automatically. Think about the memory space that you've malloc'ed in your program. Free those variables when they are no longer needed. The syntax below allocates memory for 10 double variables and then finally releases it:

int length = 10;
double *A = (double *)malloc(length * sizeof(double));

// do stuff with A
A[0] = 10.5;
A[1] = 0.124;

// finished with A
free(A);

Notice that there are some functions that will call malloc for us and return the memory allocated (and thus handing to us the responsibility of freeing it). In these examples, there won't be an obvious malloc/free pair, which means we have to be extra careful. We saw that an example of this is when using strdup, but also for our own function mystrcat.