CMSC 15100: Lab 5 (Huffman code)

Due Friday, Oct 30, 10pm

1 Introduction

A computer's digital electronic circuitry processes high/low electrical signals whose states can be put in correspondence with the two-valued (0 and 1) binary system. Therefore, all information we process must have such a representation, as a sequence of binary digits (bits). For text messages, the common ASCII encoding represents each character symbol as a sequence of 8 bits. For example:



Character Encoding
s 01110011
h 01101000
e 01100101
l 01101100
a 01100001


The encoding of the message "she" will then be "01110011 01101000 01100101".

In this lab we will explore an alternative more efficient encoding technique that generates much shorter message encodings. A basic observation is that we would benefit from using up only enough binary space to represent the characters that our message contains. For example our message has only 5 distinct characters, so 3 binary bits should suffice (since those allow us to 2^3 combinations). Instead of using a fixed-size encoding for each character, the encoding we will look at will use variable length encoding for characters. A second observation is that we would like our more frequent characters to have shorter encodings. These two aspects are key to the Huffman encoding.

To determine the Huffman codes for our characters, we start by counting their frequency in the message. Consider the earlier message "shesellsseashells". The corresponding frequencies would be (a 1), (h 2), (e 4), (l 4), (s 6).

With this list of frequencies we will construct a Huffman tree by following this procedure:

1. The current sorted list becomes a list of single node Huffman trees, with weights given by the calculated frequencies.

2. We extract the 2 trees with the smallest weight from this list.

2. We merge the 2 trees into a new tree, whose weight is the combined weight of the two subtrees.

3. We reinsert this newly created tree back into the list such that the list still remains sorted by weight.

4. We repeat from step 2, until only 1 tree remains in the list. This is our Huffman tree.

For our example we would get:

We can now determine the encoding for each of our characters by following paths from the root to them and creating a corresponding 0/1 sequence by adding a 0 for every time we choose a left branch and a 1 for every time we choose a right branch. For example for, h we get 001. We can concatenate encodings of individual characters to get to the final encoding of our message.

For decoding, we follow through the 0/1 sequence and along the Huffman tree until we reach a leaf tree. We keep the character of the leaf tree and repeat the process until the entire message is decoded.

1 Frequency counts

For the first part of the lab, we will write functions that will transform our initial message into a sorted list of structures representing the frequency of each character. For extra credit you will have to write functions to take the message and transform it into an unsorted list of character frequencies. Otherwise, you may assume you are already starting with such a list of characters and write the sorting function.

Please make sure to select the Intermediate Student language.

Assume all our characters are lower case. We will define a struct to be used to collect the character frequencies:
;;(make-charfreq #\t 4)
(define-struct charfreq (ch freq))

Frequency counts (Extra credit)

Write a function called count-freq that takes a string representing our message and creates a list of charfreq. For example we should have:
> (count-freq "shesellsseashells")
(list
 (make-charfreq #\s 6)
 (make-charfreq #\h 2)
 (make-charfreq #\e 4)
 (make-charfreq #\l 4)
 (make-charfreq #\a 1))
We should define helper functions as needed. The inbuilt function string->list can be used to transform the input string into a list of characters. To compare characters, use char=? and define helper functions as needed.

Sorting

We will sort the list of frequencies in ascending order. To break ties, we sort items with the same frequency in ascending order of the characters. To achieve this we will rely on the inbuilt function sort. This function consumes a list and a function that establishes the order that we want to impose on any 2 items of the list. For example to sort numbers a simple example could be:
define (sort-rule e1 e2)

  (< e1 e2))



(sort (list 4 1 0 2) sort-rule)

which will produce
(list 0 1 2 4)
Copy the following function in your definitions window:
(define (sort-freqlist l)

  (sort l order-freq))
Define the function order-freq to implement our sorting order on the frequency elements. To check if characters are in ascending order we can use char<?.

2 Building the Huffman tree

We use the following data definition for our Huffman tree:
;;A Huffman tree is either
;;false (make-hufftree character weight lefthufftree righthufftree)

(define-struct hufftree (ch weight left right))

From the sorted list of frequencies we built in the previous step, we first create a list of single node Huffman trees. Use the following code to get this done:
(define (charfreq-to-treelist chlist)

   (cond

     [(empty? chlist) (list )]

     [else (append

       (list (make-hufftree (charfreq-ch (first chlist)) (charfreq-freq (first chlist)) false false))

      (charfreq-to-treelist (rest chlist)))]))

The frequency counts now correspond to the weight field in the hufftree.

The procedure is the following:

1. Remove the 2 trees with the smallest weight. The weight of a tree can always be known from the weight field of its root.

2. Merge the 2 trees in the following way: create a new root node with the character field set to #\- ,the weight field set to the sum of the weights of the trees being merged and the left and right subtrees set to the first and second list items that were removed in step 1.

3. Reinsert this newly created tree into the list such that it still remains sorted by weight.

4. Repeat until only 1 tree remains in the list. This is our Huffman tree.



Define the helper function mergetrees that consumes 2 Huffman trees and produces the result of merging them as explained in step 2.

Define a helper function instree that reinserts the Huffman tree into a sorted list of Huffman trees.

Define a function build-tree-helper that uses mergetrees and instree to create the final Huffman tree.

Finally put everything together into a function build-tree that takes our unsorted list of frequency counts and calls the appropriate helper functions to produce a Huffman tree. If you have worked on the extra credit part, create a function build-tree-extra that relies on count-freq and build-tree to transform our string message directly into a Huffman tree.

A call to create a Huffman tree from the frequencies in the string "shesellsseashells" results in:
(make-hufftree
 #\-
 17
 (make-hufftree
  #\-
  7
  (make-hufftree
   #\-
   3
   (make-hufftree #\a 1 false false)
   (make-hufftree #\h 2 false false))
  (make-hufftree #\e 4 false false))
 (make-hufftree
  #\-
  10
  (make-hufftree #\l 4 false false)
  (make-hufftree #\s 6 false false)))

3 (Optional) Encode/Decode messages

The encoding for each one of our original characters can be determined by following a path to it from the root and adding a 0 to the 0/1 sequence if we follow a left branch and a 1 if we follow a right branch. For example for character h in the pictured tree, we have the code 001 For the decoding process, we repeat the following procedure for the entire length of the 0/1 sequence:

Start at the root of the Huffman tree. Consume elements from the 0/1 sequence and follow the corresponding path in the tree, where 0 means take the left branch and 1 take the right branch. Stop when we reach a leaf node and record the character found there.

Design a function decode that uses a huffman tree and a list of 0/1 values corresponding to the encoding of our original message to decode and retrieve the original message.

4 Submission

Please sumbit your work by Friday October 30th 10pm using the DrScheme handin mechanism. You submission should include the following functions: sort-freqlist, order-freq, charfreq-to-treelist,mergetrees, instree,build-tree-helper and build-tree. If you have worked on the extra credit part, your submission should also include count-freq and build-tree-extra.