CMSC 16100: Honors Introduction to Computer Programming I (Autumn 2003)

Homework

Assignment #3

Assignment #3 is due at 11:59 PM on Monday, 20 October 2003.

This assignment is an exercise in recursive manipulation of list structure. There are a number of steps in this assignment, and it is easy to fall into a black hole. Start early, and work systematically.

0. Recursive descent parsing

In Assignment #2, we unparsed numerical formulae. In this assignment, we parse them, which is substantially harder. The basic idea is still to represent the recursion in the grammar through recursive definitions of Scheme functions associated with the nonterminals. But the structure of the recursion for parsing is more complicated and less uniform than the structure for unparsing.

The basic idea of recursive descent parsing is to associate with each nonterminal symbol in a grammar (e.g., <fmla>), a function that finds the longest prefix of a token list derivable from that nonterminal symbol, and parses it into abstract syntax form. But, to make the recursion work, the function must return the list of remaining tokens as well as the abstract syntax. There are several ways to achieve this, each with advantages and disadvantages. In this assignment, we let each function return a list of two things: an abstract syntax parsed from the beginning of the token list, and a shorter token list for further parsing.

Everybody who has tried this exercise so far (i.e. your wise and ancient instructor, and one of your young and brilliant TAs) has reflexively used rest to extract the list of remaining tokens from the outputs of parsing functions. Since that output is a list of two things, the correct extractor is second, of course. Try to make more interesting mistakes than that one.

1. Fully parenthesized parsing

As with unparsing, the fully parenthesized form is easier than the one with optional parentheses. Recall the grammar for fully parenthesized numerical formulae:

<fmla> ::= <fmla> + <fmla>
        |  <fmla> - <fmla>
        |  <fmla> * <fmla>
        |  <fmla> / <fmla>
        |  <fmla> ^ <fmla>
        |  ( <fmla> )
        |  fib ( <fmla> , <fmla> , <fmla> )
        |  <number>

I wrote up a complete solution in

fp_parse_formula-1.0.scm (error in string->token_list missing "-" as a token delimiter)
fp_parse_formula-1.1.scm (corrected error in string->token_list)

Look carefully at the structure of my solution.

I used the string.ss and list.ss libraries from MzScheme. Just copy over the require function calls into your definitions.
I defined string->token_list to perform lexical analysis---transforming a string into a list of tokens. Don't worry about how this function works. It uses a horrible grungy form of pattern-matching that was imported into MzScheme, but which doesn't follow usual Scheme style at all. You can study this more if you like to go into some of the Unix commands grep, egrep, awk, ... For this assignment, just copy over my function.
I defined number_string? to test whether a given string represents a number in Scheme. I did it in the easiest possible way. If we ever get to the point of really polishing up details, we might replace this (fat chance).
The real action is in parse_fmla. Make sure you understand the simple way in which fp_string->formula and fp_token_list->formula use parse_fmla.
To understand parse_fmla, look at the end of the definition first. Unfortunately, Scheme syntax requires this part to go last, while the definition might be easier to follow if it came first:
```
;; Body of parse_fmla
    
(cond ((number_string? (first toklist)) (parse_fmla_start_number toklist)) ;; <number>
          
      ((string=? (first toklist) "(")   (parse_fmla_start_paren  toklist)) ;; (
          
      ((string=? (first toklist) "fib") (parse_fmla_start_fib    toklist)) ;; fib
          
      )
```
In order to work from left to right across the token list, we have to keep checking the first symbol and narrowing down the productions that we might apply. This conditional structure gets tangled up with the grammatical recursion, and it's easy to get confused. I found that the computation for each of these three cases (number, left parenthesis, fib) was so long that it was very hard to see the basic conditional structure, so I made each of the three computations into a call on another function. The use of parse_fmla_start_number etc. is purely organizational. It is not required by the structure of the recursion, but I found the code much easier to get right once I organized it this way.
Look carefully at the way I used let*, for example:
```
(let* ((parse_fmla_1             (parse_fmla (rest toklist)))         ;; <fmla>
       (operand_1                (first  parse_fmla_1))
       (suffix_after_op_1        (second parse_fmla_1))
             
       (operator                 (first suffix_after_op_1))           ;; one of +-*/^
       (suffix_after_operator    (rest  suffix_after_op_1))
             
       (parse_fmla_2             (parse_fmla suffix_after_operator))  ;; <fmla>
       (operand_2                (first  parse_fmla_2))
       (suffix_after_op_2        (second parse_fmla_2))
             
       (right_paren              (first suffix_after_op_2))           ;; )
       (suffix_after_right_paren (rest  suffix_after_op_2))
       )

...

      )
```
Each of the blocks of 2 or 3 name-value pairs (separated by empty lines) corresponds to one step of left-right parsing across one or more right-hand sides of productions in the grammar. I give the nonterminal or terminal being parsed in a comment in the right margin. I cooked up a naming scheme for parse_fmla_1, operand_1, suffix_after_op_1, etc., and followed it consistently wherever this form arose. Without that consistency, chaos sets in pretty quickly.

The let* forms are crucial in these definitions, both to achieve reasonable time efficiency (notice that otherwise there would be multiple recursive evaluations of (parse_fmla (rest toklist)) and other similar expressions), and to keep the definitions readable. If you haven't done much programming, you may be laughing, or weeping in frustration, that I call these definitions "readable." Well, as programs go, they are. You can learn to read and deal with this sort of definition code. Without the careful organization around the let* form, they would be really unreadable.
Notice the few cases of error reporting. Good error reporting is a difficult topic, and the definition of "good" depends on a lot of context that we don't have. Error reporting from parsing is even more difficult than most sorts of error reporting. I only provided error reports when I had to check the identity of a symbol to insure a correct parse, and it was irresistibly natural to provide an alternative when the identity was wrong. I recommend that you continue that practice of minimal error reporting, in the style that I established here. It's not a terribly good style, but it stands for potential future improvement about as well as we can manage now.
Cut and paste parse_fmla out into its own definitions window, and experiment with it. Then take the whole definition of fp_string->formula and observe its behavior.
When you have a general idea how the definitions work, go back and review all of the comments containing productions or fragments of right-hand sides. Figure out how information in productions is scattered about the definitions, using the hints in those comments.

1. Optionally parenthesized parsing

Your mission is to parse the normal sorts of optionally parenthesized strings into abstract syntax representations of formulae, using the recursive descent parsing technique illustrated above for fully parenthesized concrete syntax. Stick as closely as possible to the style of my definitions above. Cut and paste my code whenever you can.

Although our recursive descent parser is small compared to the huge amount of code in a fully deliverable application program, it is plenty big to lead you into the pit of chaos (and despair) if you attack it willy-nilly. I have worked it out into a series of steps, each one of which is manageable. I recommend very strongly that you test each step before moving on to the next. "Short cuts lead to long delays." Since there are several steps, it is important to complete the first ones early in the week, to leave time for a thoughtful pursuit of the remaining ones. I'm pretty sure that 1.3 parse_term is the hardest step, so don't leave that one too late.

Every time you complete a step, save a working file of definitions and a test suite in case you mess up the next step. This will help you back out of a tar pit. It will also give you something to hand in for substantial partial credit if you don't finish the whole thing.

For reference, here is the grammar:

<fmla> ::= <fmla> + <term>
        |  <fmla> - <term>
        |  <term>

<term> ::= <term> * <factor>
        |  <term> / <factor>
        |  <factor>

<factor> ::= <base> ^ <factor>
          |  <base>

<base>  ::= <number>
         |  fib ( <fmla> , <fmla> , <fmla> )
         |  ( <fmla> )

Watch out. I'm going to change the grammar after the first two steps. It won't change the language, but it will help you write the right function definitions.

Implement recursive descent parsing for this grammar by defining a function associated with each nonterminal: parse_fmla, parse_term, parse_factor, parse_base, and two more that will come in when we adjust the grammar. Each of these functions takes a list of tokens, finds the longest prefix derivable from the associated nonterminal, and returns a list of two things: the abstract syntax for that prefix, and the list of tokens remaining when that prefix is removed.

The way to avoid chaos is to implement these functions from the bottom up: parse_base first, then parse_factor, etc.

1.0 `string->formula`, `token_list->formula`

Convert my old functions fp_string->formula and fp_token_list->formula into string->formula and token_list->formula. Nothing to it. Then, as you work on the various parse_ functions, you can observe them both directly, and through string->formula. I strongly suggest that you use both views of the parsing process. When you're stuck with an error, you can compose string->token_list with your functions to test them more easily.

These functions are trivial, but test them anyway. Trivial testing has a paradoxical effect. When you do it, you didn't need to. When you don't, you did.

Save your working definitions and a small test suite. If you are familiar with RCS or CVS or some other version management system, you may use that. If you have no better idea, create a subdirectory called 1.0, and save a completely working version of the trivial preliminaries in there before proceeding. Yeah, it seems goofy at this trivial stage, but it will help start a good habit.

1.1 `parse_base`

The main point of this step is to define and test parse_base, in exactly its final form. To do so, we need a phony version of parse_fmla:


;; For each nonterminal <nt> in the grammar, there is a function
;;
;; parse_<nt>: list_of_tokens -> (list formula list_of_tokens)
;;
;; (parse_<nt> toklist) is a list of two things: a formula parsed from
;; the longest possible prefix of toklist derivable from <nt>, and the
;; suffix of toklist remaining after that prefix is removed.

(define (parse_fmla toklist)

  ;; <fmla> ::= <base>
  
  (parse_base toklist)
    
  ) ;; end parse_fmla

This phony definition of parse_fmla implements the phony production <fmla> ::= <base>. We have to change it later. But it allows us to implement exactly the right parse_base, including its recursive calls to parse_fmla. Steal my definition above, exactly as it is, and make it a global definition (not contained in any other definition).

Now, implement parse_base, in the form:


(define (parse_base toklist)
  
  ;; Parse a token list into a <base> starting with a number
  
  (define (parse_base_start_number toklist)

    ...

    ;; <base> ::= <number>

    ...

    ) ;; end parse_base_start_number
  
  ;; Parse a token list into a <base> starting with a left parenthesis
  
  (define (parse_base_start_paren toklist)

    ...

    ;; <base> ::= ( <fmla> )

    ...

    ) ;; End parse_base_start_paren
  
  ;; Parse a token list into a <base> starting with fib
  
  (define (parse_base_start_fib toklist)

    ...

    ;; <base> ::= fib ( <fmla> , <fmla> , <fmla> )

    ...

    ) ;; end parse_base_start_fib
  
  ;; Body of parse_base
  
  (cond ((number_string? (first toklist)) (parse_base_start_number toklist)) ;; <number>
        
        ((string=? (first toklist) "(")   (parse_base_start_paren  toklist)) ;; (
        
        ((string=? (first toklist) "fib") (parse_base_start_fib    toklist)) ;; fib
        
        )
  
  ) ;; end parse_base

The structure of parse_base here is a lot like the structure of parse_fmla from the fully parenthesized version. You can (and should) cut and paste most of the code from there. Notice the key difference: the production with parentheses does not also introduce a binary operator. So you actually cut out a lot of the code from the old parse_fmla.

Test parse_base thoroughly before proceeding. Your definitions so far should be able to parse any expression involving only numbers, parentheses, and applications of fib. Make sure that it parses even the weirdest nested forms correctly. Notice that the optionally parenthesized grammar actually allows some forms that are not allowed by the fully parenthesized grammar, such as "(((24)))". Whenever a string is parsable by both grammars, the abstract syntax is the same.

Save your working definitions and a test suite into a subdirectory named 1.1, or use your own favorite version control system. This one is already worth backing up to in case you goof up the rest.

1.2 `parse_factor`

Now define and test parse_factor, in exactly its final form. Change the phony version of parse_fmla to:


;; For each nonterminal <nt> in the grammar, there is a function
;;
;; parse_<nt>: list_of_tokens -> (list formula list_of_tokens)
;;
;; (parse_<nt> toklist) is a list of two things: a formula parsed from
;; the longest possible prefix of toklist derivable from <nt>, and the
;; suffix of toklist remaining after that prefix is removed.

(define (parse_fmla toklist)

  ;; <fmla> ::= <factor>
  
  (parse_factor toklist)
    
  ) ;; end parse_fmla

parse_factor is actually simpler than parse_base, and it doesn't need the extra layer of function definition (parse_base_start_paren, etc.) to organize it clearly. Here's the basic structure:


(define (parse_factor toklist)
  
  (let* (

    ...

         )

    (cond ((and (not (empty? suffix_after_op_1))
                (string=? (first suffix_after_op_1) "^")                      ;;
 ^   
                )
           
           (let* (

                  ...

                  )
             
             ;; <factor> ::= <base> ^ <factor>
             
             ...
              
             )
           )
          
          (else
           
           ;; <factor> ::= <base>
           
           ...

           )
          
          )
    )
  
  ) ;; end parse_factor

Notice the interleaved structure of let* and cond. This structure arises because both of the productions with <factor> on the left-hand side start with <base>, so the first let* structure recursively parses the longest prefix of toklist derivable from <base>. Then, the next token (if any---the remaining list of tokens might be empty) determines whether to go on and use the production <factor> ::= <base> ^ <factor>, or to stop here and use the production <factor> ::= <base>.

Test parse_factor thoroughly before proceeding. Your definitions should now parse any formula involving numbers, parentheses, fib, and ^.

Save your working definitions and test suite in a subdirectory named 1.2.

1.3 `parse_term`

By now, you should know how to fix up a new phony version of parse_fmla. Do it.

A serious complication arises when we add in the productions with <term> on the left-hand side. Notice that two of these productions also have <term> as the leftmost item on the right-hand side. That didn't happen with <base> or <factor>. This phenomenon is called left recursion, and it messes up the recursive descent method of parsing. If you implement parse_term simple-mindedly in the style of parse_factor, you'll get an infinite loop. In general, left-recursion may be indirect, going through the leftmost symbols on the right-hand sides of several different productions before closing the loop. But in this grammar all of the left recursions are direct and obvious.

Think this one out carefully before you start coding. Quick and dirty fiddling leads to chaos here. My favorite way to think it out is to modify the grammar, replacing the productions with <term> on the left-hand side by:

I mistyped these productions. Corrected them 17 October 3:35 PM.

<term> ::= <factor> <term_tail>

<term_tail> ::= * <factor> <term_tail>
             |  / <factor> <term_tail>
             |  <empty>

<empty> stands for the empty sequence of tokens.

The modified grammar has no more left recursion in the productions for <term> (we'll deal with left recursion in <fmla> later). It derives the same concrete syntax strings as the old version. But it appears to impose right associativity on * and /, instead of the left associativity that we require. We fix this in the program, rather than in the grammar. The function parse_term_tail takes an extra argument, giving the abstract syntax of any <term> material that has already been parsed to the left. It incorporates that abstract syntax into the material that it passes on to the right.

So, define parse_term and parse_term_tail as suggested in the discussion above, using the following form:


(define (parse_term toklist)
  
  (let* (

         ...                                         ;; <factor>

         )
    
    ...                                              ;; <term_tail>
    
    ;; <term> ::= <factor> <term_tail>
    
    )
  
  ) ;; end parse_term

;; (parse_term_tail term_so_far toklist) parses the longest
;; possible prefix of toklist into a formula derivable from
;; <term_tail>, and returns a list of two things:
;;
;; 1. the abstract syntax for that formula, treating * and /
;; left-associatively, and incorporating the abstract syntax
;; given by term_so_far as an additional operand on the left;
;;
;; 2. the suffix of toklist remaining after that prefix is removed.

(define (parse_term_tail term_so_far toklist)
  
  (cond (

        ...                                                                    ;; *

         (let* (

                ...                                                            ;; <factor>

                )
           
           (parse_term_tail (list '* term_so_far operand_1) suffix_after_op_1) ;; <term_tail>
           
           ;; <term_tail> ::= * <factor> <term_tail>
           
           )
         )

         (

         ...                          ;; /

         (let* (

                ...                   ;; <factor>

                )
           
            ...                       ;; <term_tail>
           
           ;; <term_tail> ::= / <factor> <term_tail>
           
           )
         )
        
        (
         
         ;; <term_tail> ::= <empty>

         ...
         
         )
        )
  ) ;; end parse_term_tail

Since <term> has only one production, parse_term is quite simple, and doesn't need any conditional structure. The real action is in parse_term_tail. I provided one of the recursive calls---the one that constructs a multiplication---to help you get off the ground.

Now, it's really really really important to test and get parse_term and parse_term_tail right before continuing. You should be able to figure out precisely which sorts of formulae can be parsed by the definitions so far.

Save your working definitions and test suite in a subdirectory named 1.3.

1.4 `parse_fmla`

Yee hi! parse_fmla is just the same as parse_term, with a few changes of symbols and names of functions in recursive calls. Copy and paste your definition of parse_term, and change the right stuff, including the comments. Test it all out.

Save your working definitions and test suite in a subdirectory named 1.4.

1.5 Reorganizing local definitions

Cut and paste the definitions of parse_base etc. into the definition of token_list->formula so that they are local definitions, instead of global ones. It's easy, but rerun your test suite just in case you mismoused.

Save your working definitions and test suite in a subdirectory named 1.5. This is the version that you'll hand in, but it's always best to follow good habits to the end.

2. What to hand in

A file named parse_formula.scm containing the final versions of all the function definitions discussed above. Do not hand in the intermediate steps (unless you don't finish: see the end of section 2).
A file named parse_formula.tst.scm containing a test suite demonstrating the behavior of string->formula. This is a bit harder than demonstrating the internal functions, but you can do it.
Optionally, if you have anything to report (including acknowledgments of where you got ideas), a file named comments.txt with your comments.

If you don't finish, it is much better to hand in a reasonably well tested solution to one of the intermediate steps, than a pile of code that is intended to solve the whole problem, but doesn't. So I hope you saved intermediate steps as I suggested. If you hand in an intermediate version, include a file named comments.txt explaining what you accomplished.

3. Connecting to the PAC GUI

This section is optional in case you're getting impatient with the lack of a GUI to give context to the project. There is nothing further to hand in.

I will fill in this part soon.

Copyright Michael J. O'Donnell <michael_odonnell@acm.org>. Licensed for free use.
This page is generated from PHP source code, with supporting files.

Last modified: Tue Oct 21 09:47:48 CDT 2003