[CS Dept., U Chicago] Courses


CMSC 16100: Honors Introduction to Computer Programming I (Autumn 2003)

Homework

Assignment #3



Assignment #3 is due at 11:59 PM on Monday, 20 October 2003.

This assignment is an exercise in recursive manipulation of list structure. There are a number of steps in this assignment, and it is easy to fall into a black hole. Start early, and work systematically.

0. Recursive descent parsing

In Assignment #2, we unparsed numerical formulae. In this assignment, we parse them, which is substantially harder. The basic idea is still to represent the recursion in the grammar through recursive definitions of Scheme functions associated with the nonterminals. But the structure of the recursion for parsing is more complicated and less uniform than the structure for unparsing.

The basic idea of recursive descent parsing is to associate with each nonterminal symbol in a grammar (e.g., <fmla>), a function that finds the longest prefix of a token list derivable from that nonterminal symbol, and parses it into abstract syntax form. But, to make the recursion work, the function must return the list of remaining tokens as well as the abstract syntax. There are several ways to achieve this, each with advantages and disadvantages. In this assignment, we let each function return a list of two things: an abstract syntax parsed from the beginning of the token list, and a shorter token list for further parsing.

Everybody who has tried this exercise so far (i.e. your wise and ancient instructor, and one of your young and brilliant TAs) has reflexively used rest to extract the list of remaining tokens from the outputs of parsing functions. Since that output is a list of two things, the correct extractor is second, of course. Try to make more interesting mistakes than that one.

1. Fully parenthesized parsing

As with unparsing, the fully parenthesized form is easier than the one with optional parentheses. Recall the grammar for fully parenthesized numerical formulae:

<fmla> ::= <fmla> + <fmla>
        |  <fmla> - <fmla>
        |  <fmla> * <fmla>
        |  <fmla> / <fmla>
        |  <fmla> ^ <fmla>
        |  ( <fmla> )
        |  fib ( <fmla> , <fmla> , <fmla> )
        |  <number>

I wrote up a complete solution in

Look carefully at the structure of my solution.

1. Optionally parenthesized parsing

Your mission is to parse the normal sorts of optionally parenthesized strings into abstract syntax representations of formulae, using the recursive descent parsing technique illustrated above for fully parenthesized concrete syntax. Stick as closely as possible to the style of my definitions above. Cut and paste my code whenever you can.

Although our recursive descent parser is small compared to the huge amount of code in a fully deliverable application program, it is plenty big to lead you into the pit of chaos (and despair) if you attack it willy-nilly. I have worked it out into a series of steps, each one of which is manageable. I recommend very strongly that you test each step before moving on to the next. "Short cuts lead to long delays." Since there are several steps, it is important to complete the first ones early in the week, to leave time for a thoughtful pursuit of the remaining ones. I'm pretty sure that 1.3 parse_term is the hardest step, so don't leave that one too late.

Every time you complete a step, save a working file of definitions and a test suite in case you mess up the next step. This will help you back out of a tar pit. It will also give you something to hand in for substantial partial credit if you don't finish the whole thing.

For reference, here is the grammar:

<fmla> ::= <fmla> + <term>
        |  <fmla> - <term>
        |  <term>

<term> ::= <term> * <factor>
        |  <term> / <factor>
        |  <factor>

<factor> ::= <base> ^ <factor>
          |  <base>

<base>  ::= <number>
         |  fib ( <fmla> , <fmla> , <fmla> )
         |  ( <fmla> )

Watch out. I'm going to change the grammar after the first two steps. It won't change the language, but it will help you write the right function definitions.

Implement recursive descent parsing for this grammar by defining a function associated with each nonterminal: parse_fmla, parse_term, parse_factor, parse_base, and two more that will come in when we adjust the grammar. Each of these functions takes a list of tokens, finds the longest prefix derivable from the associated nonterminal, and returns a list of two things: the abstract syntax for that prefix, and the list of tokens remaining when that prefix is removed.

The way to avoid chaos is to implement these functions from the bottom up: parse_base first, then parse_factor, etc.

1.0 string->formula, token_list->formula

Convert my old functions fp_string->formula and fp_token_list->formula into string->formula and token_list->formula. Nothing to it. Then, as you work on the various parse_ functions, you can observe them both directly, and through string->formula. I strongly suggest that you use both views of the parsing process. When you're stuck with an error, you can compose string->token_list with your functions to test them more easily.

These functions are trivial, but test them anyway. Trivial testing has a paradoxical effect. When you do it, you didn't need to. When you don't, you did.

Save your working definitions and a small test suite. If you are familiar with RCS or CVS or some other version management system, you may use that. If you have no better idea, create a subdirectory called 1.0, and save a completely working version of the trivial preliminaries in there before proceeding. Yeah, it seems goofy at this trivial stage, but it will help start a good habit.

1.1 parse_base

The main point of this step is to define and test parse_base, in exactly its final form. To do so, we need a phony version of parse_fmla:


;; For each nonterminal <nt> in the grammar, there is a function
;;
;; parse_<nt>: list_of_tokens -> (list formula list_of_tokens)
;;
;; (parse_<nt> toklist) is a list of two things: a formula parsed from
;; the longest possible prefix of toklist derivable from <nt>, and the
;; suffix of toklist remaining after that prefix is removed.

(define (parse_fmla toklist)

  ;; <fmla> ::= <base>
  
  (parse_base toklist)
    
  ) ;; end parse_fmla

This phony definition of parse_fmla implements the phony production <fmla> ::= <base>. We have to change it later. But it allows us to implement exactly the right parse_base, including its recursive calls to parse_fmla. Steal my definition above, exactly as it is, and make it a global definition (not contained in any other definition).

Now, implement parse_base, in the form:


(define (parse_base toklist)
  
  ;; Parse a token list into a <base> starting with a number
  
  (define (parse_base_start_number toklist)

    ...

    ;; <base> ::= <number>

    ...

    ) ;; end parse_base_start_number
  
  ;; Parse a token list into a <base> starting with a left parenthesis
  
  (define (parse_base_start_paren toklist)

    ...

    ;; <base> ::= ( <fmla> )

    ...

    ) ;; End parse_base_start_paren
  
  ;; Parse a token list into a <base> starting with fib
  
  (define (parse_base_start_fib toklist)

    ...

    ;; <base> ::= fib ( <fmla> , <fmla> , <fmla> )

    ...

    ) ;; end parse_base_start_fib
  
  ;; Body of parse_base
  
  (cond ((number_string? (first toklist)) (parse_base_start_number toklist)) ;; <number>
        
        ((string=? (first toklist) "(")   (parse_base_start_paren  toklist)) ;; (
        
        ((string=? (first toklist) "fib") (parse_base_start_fib    toklist)) ;; fib
        
        )
  
  ) ;; end parse_base

The structure of parse_base here is a lot like the structure of parse_fmla from the fully parenthesized version. You can (and should) cut and paste most of the code from there. Notice the key difference: the production with parentheses does not also introduce a binary operator. So you actually cut out a lot of the code from the old parse_fmla.

Test parse_base thoroughly before proceeding. Your definitions so far should be able to parse any expression involving only numbers, parentheses, and applications of fib. Make sure that it parses even the weirdest nested forms correctly. Notice that the optionally parenthesized grammar actually allows some forms that are not allowed by the fully parenthesized grammar, such as "(((24)))". Whenever a string is parsable by both grammars, the abstract syntax is the same.

Save your working definitions and a test suite into a subdirectory named 1.1, or use your own favorite version control system. This one is already worth backing up to in case you goof up the rest.

1.2 parse_factor

Now define and test parse_factor, in exactly its final form. Change the phony version of parse_fmla to:


;; For each nonterminal <nt> in the grammar, there is a function
;;
;; parse_<nt>: list_of_tokens -> (list formula list_of_tokens)
;;
;; (parse_<nt> toklist) is a list of two things: a formula parsed from
;; the longest possible prefix of toklist derivable from <nt>, and the
;; suffix of toklist remaining after that prefix is removed.

(define (parse_fmla toklist)

  ;; <fmla> ::= <factor>
  
  (parse_factor toklist)
    
  ) ;; end parse_fmla

parse_factor is actually simpler than parse_base, and it doesn't need the extra layer of function definition (parse_base_start_paren, etc.) to organize it clearly. Here's the basic structure:


(define (parse_factor toklist)
  
  (let* (

    ...

         )

    (cond ((and (not (empty? suffix_after_op_1))
                (string=? (first suffix_after_op_1) "^")                      ;;
 ^   
                )
           
           (let* (

                  ...

                  )
             
             ;; <factor> ::= <base> ^ <factor>
             
             ...
              
             )
           )
          
          (else
           
           ;; <factor> ::= <base>
           
           ...

           )
          
          )
    )
  
  ) ;; end parse_factor

Notice the interleaved structure of let* and cond. This structure arises because both of the productions with <factor> on the left-hand side start with <base>, so the first let* structure recursively parses the longest prefix of toklist derivable from <base>. Then, the next token (if any---the remaining list of tokens might be empty) determines whether to go on and use the production <factor> ::= <base> ^ <factor>, or to stop here and use the production <factor> ::= <base>.

Test parse_factor thoroughly before proceeding. Your definitions should now parse any formula involving numbers, parentheses, fib, and ^.

Save your working definitions and test suite in a subdirectory named 1.2.

1.3 parse_term

By now, you should know how to fix up a new phony version of parse_fmla. Do it.

A serious complication arises when we add in the productions with <term> on the left-hand side. Notice that two of these productions also have <term> as the leftmost item on the right-hand side. That didn't happen with <base> or <factor>. This phenomenon is called left recursion, and it messes up the recursive descent method of parsing. If you implement parse_term simple-mindedly in the style of parse_factor, you'll get an infinite loop. In general, left-recursion may be indirect, going through the leftmost symbols on the right-hand sides of several different productions before closing the loop. But in this grammar all of the left recursions are direct and obvious.

Think this one out carefully before you start coding. Quick and dirty fiddling leads to chaos here. My favorite way to think it out is to modify the grammar, replacing the productions with <term> on the left-hand side by:

I mistyped these productions. Corrected them 17 October 3:35 PM.

<term> ::= <factor> <term_tail>

<term_tail> ::= * <factor> <term_tail>
             |  / <factor> <term_tail>
             |  <empty>

<empty> stands for the empty sequence of tokens.

The modified grammar has no more left recursion in the productions for <term> (we'll deal with left recursion in <fmla> later). It derives the same concrete syntax strings as the old version. But it appears to impose right associativity on * and /, instead of the left associativity that we require. We fix this in the program, rather than in the grammar. The function parse_term_tail takes an extra argument, giving the abstract syntax of any <term> material that has already been parsed to the left. It incorporates that abstract syntax into the material that it passes on to the right.

So, define parse_term and parse_term_tail as suggested in the discussion above, using the following form:


(define (parse_term toklist)
  
  (let* (

         ...                                         ;; <factor>

         )
    
    ...                                              ;; <term_tail>
    
    ;; <term> ::= <factor> <term_tail>
    
    )
  
  ) ;; end parse_term

;; (parse_term_tail term_so_far toklist) parses the longest
;; possible prefix of toklist into a formula derivable from
;; <term_tail>, and returns a list of two things:
;;
;; 1. the abstract syntax for that formula, treating * and /
;; left-associatively, and incorporating the abstract syntax
;; given by term_so_far as an additional operand on the left;
;;
;; 2. the suffix of toklist remaining after that prefix is removed.

(define (parse_term_tail term_so_far toklist)
  
  (cond (

        ...                                                                    ;; *

         (let* (

                ...                                                            ;; <factor>

                )
           
           (parse_term_tail (list '* term_so_far operand_1) suffix_after_op_1) ;; <term_tail>
           
           ;; <term_tail> ::= * <factor> <term_tail>
           
           )
         )

         (

         ...                          ;; /

         (let* (

                ...                   ;; <factor>

                )
           
            ...                       ;; <term_tail>
           
           ;; <term_tail> ::= / <factor> <term_tail>
           
           )
         )
        
        (
         
         ;; <term_tail> ::= <empty>

         ...
         
         )
        )
  ) ;; end parse_term_tail

Since <term> has only one production, parse_term is quite simple, and doesn't need any conditional structure. The real action is in parse_term_tail. I provided one of the recursive calls---the one that constructs a multiplication---to help you get off the ground.

Now, it's really really really important to test and get parse_term and parse_term_tail right before continuing. You should be able to figure out precisely which sorts of formulae can be parsed by the definitions so far.

Save your working definitions and test suite in a subdirectory named 1.3.

1.4 parse_fmla

Yee hi! parse_fmla is just the same as parse_term, with a few changes of symbols and names of functions in recursive calls. Copy and paste your definition of parse_term, and change the right stuff, including the comments. Test it all out.

Save your working definitions and test suite in a subdirectory named 1.4.

1.5 Reorganizing local definitions

Cut and paste the definitions of parse_base etc. into the definition of token_list->formula so that they are local definitions, instead of global ones. It's easy, but rerun your test suite just in case you mismoused.

Save your working definitions and test suite in a subdirectory named 1.5. This is the version that you'll hand in, but it's always best to follow good habits to the end.

2. What to hand in

  1. A file named parse_formula.scm containing the final versions of all the function definitions discussed above. Do not hand in the intermediate steps (unless you don't finish: see the end of section 2).

  2. A file named parse_formula.tst.scm containing a test suite demonstrating the behavior of string->formula. This is a bit harder than demonstrating the internal functions, but you can do it.

  3. Optionally, if you have anything to report (including acknowledgments of where you got ideas), a file named comments.txt with your comments.

If you don't finish, it is much better to hand in a reasonably well tested solution to one of the intermediate steps, than a pile of code that is intended to solve the whole problem, but doesn't. So I hope you saved intermediate steps as I suggested. If you hand in an intermediate version, include a file named comments.txt explaining what you accomplished.

3. Connecting to the PAC GUI

This section is optional in case you're getting impatient with the lack of a GUI to give context to the project. There is nothing further to hand in.

I will fill in this part soon.


Valid HTML 4.0!


Last modified: Tue Oct 21 09:47:48 CDT 2003