Assignment #3 is due at 11:59 PM on Monday, 20 October 2003.
This assignment is an exercise in recursive manipulation of list structure. There are a number of steps in this assignment, and it is easy to fall into a black hole. Start early, and work systematically.
In Assignment #2, we unparsed numerical formulae. In this assignment, we parse them, which is substantially harder. The basic idea is still to represent the recursion in the grammar through recursive definitions of Scheme functions associated with the nonterminals. But the structure of the recursion for parsing is more complicated and less uniform than the structure for unparsing.
The basic idea of recursive descent parsing is to associate with each nonterminal symbol in a grammar (e.g., <fmla>), a function that finds the longest prefix of a token list derivable from that nonterminal symbol, and parses it into abstract syntax form. But, to make the recursion work, the function must return the list of remaining tokens as well as the abstract syntax. There are several ways to achieve this, each with advantages and disadvantages. In this assignment, we let each function return a list of two things: an abstract syntax parsed from the beginning of the token list, and a shorter token list for further parsing.
Everybody who has tried this exercise so far (i.e. your
wise and ancient instructor, and one of your young and brilliant TAs)
has reflexively used rest
to extract the list of
remaining tokens from the outputs of parsing functions. Since that
output is a list of two things, the correct extractor is
second
, of course. Try to make more interesting mistakes
than that one.
As with unparsing, the fully parenthesized form is easier than the one with optional parentheses. Recall the grammar for fully parenthesized numerical formulae:
<fmla> ::= <fmla> + <fmla> | <fmla> - <fmla> | <fmla> * <fmla> | <fmla> / <fmla> | <fmla> ^ <fmla> | ( <fmla> ) | fib ( <fmla> , <fmla> , <fmla> ) | <number>
I wrote up a complete solution in
fp_parse_formula-1.0.scm
(error in string->token_list
missing "-
"
as a token delimiter)fp_parse_formula-1.1.scm
(corrected error in string->token_list
)Look carefully at the structure of my solution.
string.ss
and list.ss
libraries from MzScheme. Just copy over the
require
function calls into your definitions.string->token_list
to perform
lexical analysis---transforming a string into a list of
tokens. Don't worry about how this function works. It uses a horrible
grungy form of pattern-matching that was imported into
MzScheme, but which doesn't follow usual Scheme
style at all. You can study this more if you like to go into some of
the Unix commands grep
, egrep
,
awk
, ... For this assignment, just copy over my
function.number_string?
to test whether a given
string represents a number in Scheme. I did it in the easiest
possible way. If we ever get to the point of really polishing up
details, we might replace this (fat chance).parse_fmla
. Make sure you
understand the simple way in which fp_string->formula
and fp_token_list->formula
use
parse_fmla
.parse_fmla
, look at the end of the
definition first. Unfortunately, Scheme syntax requires this
part to go last, while the definition might be easier to follow if it
came first:
In order to work from left to right across the token list, we have to keep checking the first symbol and narrowing down the productions that we might apply. This conditional structure gets tangled up with the grammatical recursion, and it's easy to get confused. I found that the computation for each of these three cases (number, left parenthesis, fib) was so long that it was very hard to see the basic conditional structure, so I made each of the three computations into a call on another function. The use of;; Body of parse_fmla (cond ((number_string? (first toklist)) (parse_fmla_start_number toklist)) ;; <number> ((string=? (first toklist) "(") (parse_fmla_start_paren toklist)) ;; ( ((string=? (first toklist) "fib") (parse_fmla_start_fib toklist)) ;; fib )
parse_fmla_start_number
etc. is purely organizational. It is not required by the structure of
the recursion, but I found the code much easier to get right
once I organized it this way.let*
, for example:
Each of the blocks of 2 or 3 name-value pairs (separated by empty lines) corresponds to one step of left-right parsing across one or more right-hand sides of productions in the grammar. I give the nonterminal or terminal being parsed in a comment in the right margin. I cooked up a naming scheme for(let* ((parse_fmla_1 (parse_fmla (rest toklist))) ;; <fmla> (operand_1 (first parse_fmla_1)) (suffix_after_op_1 (second parse_fmla_1)) (operator (first suffix_after_op_1)) ;; one of +-*/^ (suffix_after_operator (rest suffix_after_op_1)) (parse_fmla_2 (parse_fmla suffix_after_operator)) ;; <fmla> (operand_2 (first parse_fmla_2)) (suffix_after_op_2 (second parse_fmla_2)) (right_paren (first suffix_after_op_2)) ;; ) (suffix_after_right_paren (rest suffix_after_op_2)) ) ... )
parse_fmla_1
,
operand_1
, suffix_after_op_1
, etc., and
followed it consistently wherever this form arose. Without that
consistency, chaos sets in pretty quickly.let*
forms are crucial in these definitions,
both to achieve reasonable time efficiency (notice that
otherwise there would be multiple recursive evaluations of
(parse_fmla (rest toklist))
and other similar
expressions), and to keep the definitions readable. If you
haven't done much programming, you may be laughing, or weeping in
frustration, that I call these definitions "readable." Well, as
programs go, they are. You can learn to read and deal with this sort
of definition code. Without the careful organization around the
let*
form, they would be really
unreadable.parse_fmla
out into its own definitions
window, and experiment with it. Then take the whole definition of
fp_string->formula
and observe its
behavior.Your mission is to parse the normal sorts of optionally parenthesized strings into abstract syntax representations of formulae, using the recursive descent parsing technique illustrated above for fully parenthesized concrete syntax. Stick as closely as possible to the style of my definitions above. Cut and paste my code whenever you can.
Although our recursive descent parser is small compared to the huge
amount of code in a fully deliverable application program, it is
plenty big to lead you into the pit of chaos (and despair) if you
attack it willy-nilly. I have worked it out into a series of steps,
each one of which is manageable. I recommend very strongly
that you test each step before moving on to the next. "Short
cuts lead to long delays." Since there are several steps, it is
important to complete the first ones early in the week, to leave time
for a thoughtful pursuit of the remaining ones. I'm pretty sure that
1.3 parse_term
is the hardest step, so don't leave that
one too late.
Every time you complete a step, save a working file of definitions and a test suite in case you mess up the next step. This will help you back out of a tar pit. It will also give you something to hand in for substantial partial credit if you don't finish the whole thing.
For reference, here is the grammar:
<fmla> ::= <fmla> + <term> | <fmla> - <term> | <term> <term> ::= <term> * <factor> | <term> / <factor> | <factor> <factor> ::= <base> ^ <factor> | <base> <base> ::= <number> | fib ( <fmla> , <fmla> , <fmla> ) | ( <fmla> )
Watch out. I'm going to change the grammar after the first two steps. It won't change the language, but it will help you write the right function definitions.
Implement recursive descent parsing for this grammar by defining a
function associated with each nonterminal: parse_fmla
,
parse_term
, parse_factor
,
parse_base
, and two more that will come in when we adjust
the grammar. Each of these functions takes a list of tokens, finds the
longest prefix derivable from the associated nonterminal, and
returns a list of two things: the abstract syntax for that prefix, and
the list of tokens remaining when that prefix is removed.
The way to avoid chaos is to implement these functions from the
bottom up: parse_base
first, then
parse_factor
, etc.
string->formula
, token_list->formula
Convert my old functions fp_string->formula
and
fp_token_list->formula
into
string->formula
and
token_list->formula
. Nothing to it. Then, as you work
on the various parse_
functions, you can observe them
both directly, and through string->formula
. I strongly
suggest that you use both views of the parsing process. When you're
stuck with an error, you can compose
string->token_list
with your functions to test them
more easily.
These functions are trivial, but test them anyway. Trivial testing has a paradoxical effect. When you do it, you didn't need to. When you don't, you did.
Save your working definitions and a small test suite. If
you are familiar with RCS or CVS or some other
version management system, you may use that. If you have no better
idea, create a subdirectory called 1.0
, and save a
completely working version of the trivial preliminaries in there
before proceeding. Yeah, it seems goofy at this trivial stage, but it
will help start a good habit.
parse_base
The main point of this step is to define and test
parse_base
, in exactly its final form. To do so,
we need a phony version of parse_fmla
:
;; For each nonterminal <nt> in the grammar, there is a function ;; ;; parse_<nt>: list_of_tokens -> (list formula list_of_tokens) ;; ;; (parse_<nt> toklist) is a list of two things: a formula parsed from ;; the longest possible prefix of toklist derivable from <nt>, and the ;; suffix of toklist remaining after that prefix is removed. (define (parse_fmla toklist) ;; <fmla> ::= <base> (parse_base toklist) ) ;; end parse_fmla
This phony definition of parse_fmla
implements the
phony production <fmla> ::= <base>. We have to change it
later. But it allows us to implement exactly the right
parse_base
, including its recursive calls to
parse_fmla
. Steal my definition above, exactly as it is,
and make it a global definition (not contained in any other
definition).
Now, implement parse_base
, in the form:
(define (parse_base toklist) ;; Parse a token list into a <base> starting with a number (define (parse_base_start_number toklist) ... ;; <base> ::= <number> ... ) ;; end parse_base_start_number ;; Parse a token list into a <base> starting with a left parenthesis (define (parse_base_start_paren toklist) ... ;; <base> ::= ( <fmla> ) ... ) ;; End parse_base_start_paren ;; Parse a token list into a <base> starting with fib (define (parse_base_start_fib toklist) ... ;; <base> ::= fib ( <fmla> , <fmla> , <fmla> ) ... ) ;; end parse_base_start_fib ;; Body of parse_base (cond ((number_string? (first toklist)) (parse_base_start_number toklist)) ;; <number> ((string=? (first toklist) "(") (parse_base_start_paren toklist)) ;; ( ((string=? (first toklist) "fib") (parse_base_start_fib toklist)) ;; fib ) ) ;; end parse_base
The structure of parse_base
here is a lot like the
structure of parse_fmla
from the fully parenthesized
version. You can (and should) cut and paste most of the code from
there. Notice the key difference: the production with parentheses does
not also introduce a binary operator. So you actually cut out a lot of
the code from the old parse_fmla
.
Test parse_base
thoroughly before
proceeding. Your definitions so far should be able to parse
any expression involving only numbers, parentheses, and applications
of fib
. Make sure that it parses even the weirdest nested
forms correctly. Notice that the optionally parenthesized grammar
actually allows some forms that are not allowed by the fully
parenthesized grammar, such as "(((24)))"
. Whenever a
string is parsable by both grammars, the abstract syntax is the
same.
Save your working definitions and a test suite into a
subdirectory named 1.1
, or use your own favorite version
control system. This one is already worth backing up to in case you
goof up the rest.
parse_factor
Now define and test
parse_factor
, in exactly its final form.
Change the phony version of parse_fmla
to:
;; For each nonterminal <nt> in the grammar, there is a function ;; ;; parse_<nt>: list_of_tokens -> (list formula list_of_tokens) ;; ;; (parse_<nt> toklist) is a list of two things: a formula parsed from ;; the longest possible prefix of toklist derivable from <nt>, and the ;; suffix of toklist remaining after that prefix is removed. (define (parse_fmla toklist) ;; <fmla> ::= <factor> (parse_factor toklist) ) ;; end parse_fmla
parse_factor
is actually simpler than
parse_base
, and it doesn't need the extra layer of
function definition (parse_base_start_paren
, etc.) to
organize it clearly. Here's the basic structure:
(define (parse_factor toklist) (let* ( ... ) (cond ((and (not (empty? suffix_after_op_1)) (string=? (first suffix_after_op_1) "^") ;; ^ ) (let* ( ... ) ;; <factor> ::= <base> ^ <factor> ... ) ) (else ;; <factor> ::= <base> ... ) ) ) ) ;; end parse_factor
Notice the interleaved structure of let*
and
cond
. This structure arises because both of the
productions with <factor> on the left-hand side start with
<base>, so the first let*
structure recursively
parses the longest prefix of toklist
derivable from
<base>. Then, the next token (if any---the remaining list of
tokens might be empty
) determines whether to go on and
use the production <factor> ::= <base> ^ <factor>,
or to stop here and use the production <factor> ::=
<base>.
Test parse_factor
thoroughly before
proceeding. Your definitions should now parse any formula
involving numbers, parentheses, fib
, and
^
.
Save your working definitions and test suite in a
subdirectory named 1.2
.
parse_term
By now, you should know how to fix up a new phony version of
parse_fmla
. Do it.
A serious complication arises when we add in the productions with
<term> on the left-hand side. Notice that two of these
productions also have <term> as the leftmost item on the
right-hand side. That didn't happen with <base> or
<factor>. This phenomenon is called left recursion, and
it messes up the recursive descent method of parsing. If you implement
parse_term
simple-mindedly in the style of
parse_factor
, you'll get an infinite loop. In general,
left-recursion may be indirect, going through the leftmost symbols on
the right-hand sides of several different productions before closing
the loop. But in this grammar all of the left recursions are direct
and obvious.
Think this one out carefully before you start coding. Quick and dirty fiddling leads to chaos here. My favorite way to think it out is to modify the grammar, replacing the productions with <term> on the left-hand side by:
I mistyped these productions. Corrected them 17 October 3:35 PM.
<term> ::= <factor> <term_tail> <term_tail> ::= * <factor> <term_tail> | / <factor> <term_tail> | <empty>
<empty> stands for the empty sequence of tokens.
The modified grammar has no more left recursion in the productions
for <term> (we'll deal with left recursion in <fmla>
later). It derives the same concrete syntax strings as the old
version. But it appears to impose right associativity on * and /,
instead of the left associativity that we require. We fix this in the
program, rather than in the grammar. The function
parse_term_tail
takes an extra argument, giving the
abstract syntax of any <term> material that has already been
parsed to the left. It incorporates that abstract syntax into the
material that it passes on to the right.
So, define parse_term
and parse_term_tail
as suggested in the discussion above, using the following form:
(define (parse_term toklist) (let* ( ... ;; <factor> ) ... ;; <term_tail> ;; <term> ::= <factor> <term_tail> ) ) ;; end parse_term ;; (parse_term_tail term_so_far toklist) parses the longest ;; possible prefix of toklist into a formula derivable from ;; <term_tail>, and returns a list of two things: ;; ;; 1. the abstract syntax for that formula, treating * and / ;; left-associatively, and incorporating the abstract syntax ;; given by term_so_far as an additional operand on the left; ;; ;; 2. the suffix of toklist remaining after that prefix is removed. (define (parse_term_tail term_so_far toklist) (cond ( ... ;; * (let* ( ... ;; <factor> ) (parse_term_tail (list '* term_so_far operand_1) suffix_after_op_1) ;; <term_tail> ;; <term_tail> ::= * <factor> <term_tail> ) ) ( ... ;; / (let* ( ... ;; <factor> ) ... ;; <term_tail> ;; <term_tail> ::= / <factor> <term_tail> ) ) ( ;; <term_tail> ::= <empty> ... ) ) ) ;; end parse_term_tail
Since <term> has only one production, parse_term
is quite simple, and doesn't need any conditional structure. The real
action is in parse_term_tail
. I provided one of the
recursive calls---the one that constructs a multiplication---to help
you get off the ground.
Now, it's really really really important to test and get
parse_term
and parse_term_tail
right before
continuing. You should be able to figure out precisely which
sorts of formulae can be parsed by the definitions so far.
Save your working definitions and test suite in a
subdirectory named 1.3
.
parse_fmla
Yee hi! parse_fmla
is just the same as
parse_term
, with a few changes of symbols and names of
functions in recursive calls. Copy and paste your definition of
parse_term
, and change the right stuff, including the
comments. Test it all out.
Save your working definitions and test suite in a
subdirectory named 1.4
.
Cut and paste the definitions of parse_base
etc. into
the definition of token_list->formula
so that they are
local definitions, instead of global ones. It's easy, but rerun your
test suite just in case you mismoused.
Save your working definitions and test suite in a
subdirectory named 1.5
. This is the version that you'll
hand in, but it's always best to follow good habits to the
end.
parse_formula.scm
containing the final
versions of all the function definitions discussed above. Do not hand
in the intermediate steps (unless you don't finish: see the end of
section 2).parse_formula.tst.scm
containing a test
suite demonstrating the behavior of
string->formula
. This is a bit harder than
demonstrating the internal functions, but you can do it.comments.txt
with your comments.If you don't finish, it is much better to hand in
a reasonably well tested solution to one of the intermediate steps,
than a pile of code that is intended to solve the whole problem, but
doesn't. So I hope you saved intermediate steps as I suggested. If you
hand in an intermediate version, include a file named
comments.txt
explaining what you accomplished.
This section is optional in case you're getting impatient with the lack of a GUI to give context to the project. There is nothing further to hand in.
I will fill in this part soon.
![]() | |