<?php echo $HTML

Courses", $CS_COURSES_ROOT_URL); ?>

Assignment #2 is due at 11:59 PM on Monday, 13 October 2003.

This assignment exercises recursion on nested lists, and explores the connection between syntactic structure and string notation.

This is a draft, subject to further revision. I will finalize it on Wednesday.

All programming language implementations that I have ever seen have a significant component in charge of co-ordinating the internal structure in which a program is manipulated with the external form in which it is typed and displayed. In the olden days, before mathematical logic and way before computer science, linguists used the word "syntax" to refer to the structure underlying a spoken or written text (usually a single sentence).

Mathematical logicians corrupted the word "syntax" to mean the typed out form in which a text is presented, and computer scientists followed their lead. Later, computer scientists needed a phrase to describe syntax (in the old sense of the word). So, they started calling the syntax of a text the "abstract syntax." To distinguish the typed out form of presentation, they often call it the "concrete syntax," although traditionally it isn't a syntax at all.

This foolishness hit its limit (I hope) when an otherwise wonderful colleague of mine proposed a "syntax-free" programming language with a graphical presentation, instead of a typed out presentation. Of course his language has syntax, or it would have been a total flop. Oh well. From now on, we'll use the corrupt modern terms "concrete syntax" and "abstract syntax" for notation (which isn't syntax at all) and structure (which used to be called "syntax").

Noam Chomsky suggested that the connection of texts to syntax in natural languages could be described by a sort of formal grammars, which are now called "context-free grammars" (CFGs). The very notion of such formal descriptions of natural language has sparked bitter arguments and worked its way in and out of favor in linguistics. Meanwhile, CFGs have become essentially the only way to describe the connection between notation and syntax---I mean concrete syntax and abstract syntax---in programming languages. They have proved extremely useful for that purpose.

Don't even think hard about the phrase "context-free." Context does matter in a "context-free" grammar. It just doesn't matter in one particular way that distinguishes CFGs from "context-sensitive grammars." The terminology is totally bogus, so just think of CFG as a proper noun, and not a description.

CFGs describe "syntax" (yes, the CS community refers to the whole can of worms as "syntax") by describing the relationships between certain symbols, called terminals or tokens, that appear in concrete syntax, and other symbols, variously called variables or nonterminals (in CS, we usually say "nonterminals"), that stand for syntactically significant classes of subtexts. In a grammar for the English language, there are nonterminals standing for the grammatical categories sentence, subject, predicate, direct object, prepositional phrase, etc.

A CFG consists of a bunch of rules, called productions, each of which describes one way in which a particular nonterminal symbol can be refined into a sequence of tokens and other nonterminals. In a grammar for English, there is probably a production allowing one to refine the nonterminal for sentence into a sequence of subject predicate direct object, and another one allowing one to refine noun into the token "dog".

Here's what a typical CFG looks like in the CS programming language business:

<formula> ::= <formula> + <formula>
           |  <formula> - <formula>
           |  <formula> * <formula>
           |  <formula> / <formula>
           |  <formula> ^ <formula>
           |  ( <formula> )
           |  fib ( <formula> , <formula> , <formula> )
           |  <number>

Symbols surrounded by pointy brackets (<...>) are nonterminals. The wacky symbols ::= and | join together the parts of productions. Other symbols (+, -, *, /, ^, fib, comma, and parentheses in this case) are terminals. Each line is a production. The ::= symbol indicates that the nonterminal on its left may be replaced by the stuff on its right. The | symbol inherits the nonterminal from the previous production, because we get tired of typing it and looking at it over and over. This particular notation for CFGs was proposed by Backus and Naur because it could be typed onto IBM punch cards. So people in "practical" CS call it a Backus-Naur form, or BNF. Linguistics and theoretical CS write grammars differently, but they mean the same thing.

The CFG/BNF above is incomplete, because it doesn't contain the productions for <number>. There are too many of them---one for each number that you're allowed to write. OK, here's one of them:

<number> ::= 473263

Token classes, such as number, are sometimes defined in a restricted form of BNF notation, and sometimes in a different notation based on regular expressions (a sort of textual patterns). We can ignore that level of detail for now.

Given such a BNF (assuming that you know exactly which tokens are numbers), you can derive every possible "formula" by a process, called a derivation: start with the nonterminal symbol <formula> all by itself. In each step, choose one nonterminal to replace, choose a production with that nonterminal on the left side, and replace it by the stuff on the right side. Stop when you have replaced all nonterminals, and are staring at a sequence of terminals. Example derivation, with the nonterminal to be replaced emboldened in each step:

<formula>
<formula> * <formula>
<number> * <formula>
<number> * <formula> + <formula>
<number> * <number> + <formula>
<number> * 42 + <formula>
<number> * 42 + ( <formula> )
<number> * 42 + ( <formula> / <formula> )
<number> * 42 + ( <formula> / <number> )
<number> * 42 + ( <formula>; / 2 )
<number> * 42 + ( <number> / 2 )
8 * 42 + ( <number> / 2 )
8 * 42 + ( 6 / 2 )

I did the steps in a stupid, whimsical order just to make the point that it doesn't matter. If you're going to write out derivations (which you usually shouldn't), you should usually do it systematically, always replacing the leftmost nonterminal, or always the rightmost.

If you've been really alert, you'll be wondering, "where's the (abstract) syntax?" A CFG determines very precisely which sequences of tokens you are and are not allowed to write down in a program, or a portion of a program. The abstract syntax is implicit, and not completely there. For the most part, linear sequential derivations (as in the example above) are bogus. The right way to think of a derivation is to conceive it as a tree:

      <formula>
      /   |   \
     /    |    ------
    |     |          \       
    |     |       <formula>
    |     |       /   |   \
    |     |      /    |    --------
    |     |     /     |            \
    |     |    |      |         <formula>
    |     |    |      |         /    |  \
    |     |    |      |    -----     |   -------
    |     |    |      |   /      <formula>      \
    |     |    |      |  /       /   |   \       \
<formula> | <formula> | |  <formula> |  <formula> |
    |     |    |      | |     |      |     |      |
<number>  | <number>  | |  <number>  |  <number>  |
    |     |    |      | |     |      |     |      |
    8     *    42     + (     6      /     2      )

The tree shows important information about the derivation, while leaving out irrelevant choices about the order of certain steps. It also suggests an abstract syntactic structure. It doesn't really tell us the structure completely, because we need to know that the 8, 42, 6, and 2 represent operands, the *, +, and / indicate which operations are performed on the operands, and the parentheses plus the derivations that stick them determine which structure we get, but are not actually part of that structure. Using all of that information, we can infer an abstract syntactic structure that looks like:

multiply
  /   \
 /    add
|    /   \
|   |     divide
|   |      / \
8   42    6   2

Work out for yourself how the abstract syntax tree is a sort of collapsed version of the derivation tree.

You hardly ever see an actual abstract syntax tree in a programming language manual. You see a lot of BNFs, which imply derivation trees, which suggest abstract syntax trees. As the concrete syntax gets more complicated, the correspondence becomes trickier.

LISP/Scheme have maintained such a simple connection between concrete and abstract syntax that you hardly need a BNF. Essentially, the parenthesis grouping determines the abstract syntax directly.

So far, we have treated each substantially meaningful symbol, including single character symbols such as +, -, *, /, ^, and also multicharacter symbols for numbers, such as 43876. Just as linguistics treats spelling of words in a different level of description from the syntax of sentences made up of words, programming language descriptions usually give syntax in terms of lists substantial symbols, called tokens, which might be longer than a single character. A separate level of derivation, called lexical analysis, associates sequences of tokens with character strings containing the spelled out tokens and some extra separators, such as blank spaces.

For now, our PAC works with numerical formulae built up from numbers, and the binary operators +, -, *, /, ^. Abstract syntax is really a tree structure, not a typed out structure. But everything we look at ends up being typed out in some form. To understand the point of this exercise, you need to think of the abstract syntax as a hidden internal form in our calculations, and the concrete syntax as a tool for presenting it to users. You can't trust the superficial look of things as your programming, since the token lists will actually look rather ugly, while Scheme's presentation of its internal forms is relatively nice. But when the formula->... functions are deployed in a complete application, they will lead to nice readable concrete forms for the user, and nice manipulable abstract internal forms for the programs.

Scheme programmers have a pretty standard form for representing trees as nested lists. If there is only one element in a tree (e.g., the number 3), then that element stands for the small tree. If a tree has a particular <operator> at the root, and two <argument>s hanging below it, then the tree is represented by:

(list <operator> <argument> <argument>)

The arguments themselves may be single elements, or lists representing further tree structure.

Numbers are represented directly by Scheme numbers. Our numerical operators are represented by the Scheme symbols '+, '-, '*, '/, 'expt, 'fib. Here are some examples of numerical formulae in the abstract syntax form:


347
(list '+ 2 3)
(list '+ (list '* 4 5) (list 'expt 2 3))
(list '+ 1 (list '+ (list '+ 2 3) 4))
(list '+ (list 'fib 0 1 6) (list 'fib 0 1 5))

This form looks a lot like the form in which you type a Scheme expression in order to compute its value:


347
(+ 2 3)
(+ (* 4 5) (expt 2 3))
(+ 1 (+ (+ 2 3) 4))
(+ (fib 0 1 6) (fib 0 1 5))

We will explore the difference when we dig a bit deeper into the design of Scheme, and find that it's even less than it appears to be here.

I have not given a precise definition of the representation of numerical formulae in the PAC, but there is enough information here to figure it out. If you are puzzled by any questions regarding what is and is not a legitimate representation of a formula, please ask in class and post to the online discussion.

Notice how a grammar looks sort of like a recursive function definition. That similarity cuts fairly deep, and we exploit it in this assignment and the next to unparse (translate from abstract to concrete syntax) and to parse (translate from concrete to abstract syntax) numerical formulae. Unparsing is much easier than parsing, which is why we do it first.

Our Perfectly Accurate Calculator will use the usual familiar concrete syntax for formulae typed in by a user. The example grammar above describes numerical formulae made up of numbers combined with the binary operations +, -, *, /, and ^ (for exponentiation), with optional parentheses. Unfortunately, it is highly ambiguous. There may be several different derivation trees, and several different abstract syntaxes, for the same concrete syntax (e.g., 3 + 2 * 4). To get off the ground very easily, we'll start with fully parenthesized formulae, described by the grammar:

<fmla> ::= ( <fmla> + <fmla> )
        |  ( <fmla> - <fmla> )
        |  ( <fmla> * <fmla> )
        |  ( <fmla> / <fmla> )
        |  ( <fmla> ^ <fmla> )
        |  fib ( <fmla> , <fmla> , <fmla> )
        |  <number>

I formula->full_paren_token_list ", "unparse_fp.scm") . " "; ?> to unparse according to this grammar. I also defined stringlist->string because the Scheme presentation of lists of strings is a bit hard on the eyes.

Demonstrate my function stringlist->string with a small test suite. 3 test cases are probably enough.

When you are confident about stringlist->string, use it to test my function formula->full_paren_token_list. This one needs a larger number of test cases to exercise all of the productions in the grammar.

Look carefully at the way in which clauses in the function definition correspond very precisely to productions in the grammar. Because our current grammar offers no choices in either parsing or unparsing (i.e., it gives a 1-1 correspondence between concrete and abstract syntax), the order in which I treated the productions was not very important. But I used the right sort of order anyway, to prepare for the next step.

Also notice my conditions, e.g.

(and (list? fmla) (symbol=? (first fmla) '-))

The (list? fmla) clause is not terribly important at this step, since every formula that is not just a number is represented by a list. But it becomes important in the next step, because the possibility of a number doesn't come up in every one of the functions. The order of the two (and ...) clauses is important. When the first clause of an and has the value false, Scheme doesn't evaluate the second clause, which in this case would fail. Experiment on your own with and to see how this works.

Full parenthesization makes (un)parsing rather easy, but it's a pain for the user to type in. We reduce parentheses by establishing certain precedence rules:

^ has highest precedence,
a ^ on the right takes precedence over a ^ on the left,
* and / have second precedence,
a * or / on the left takes precedence over a * or / on the right,
+ and - have lowest precedence,
a + or - on the left takes precedence over a + or - on the right.

These rules are rather typical for programming languages, except that left precedence is common, even for ^. There is a good mathematical reason why I gave ^ right precedence---maybe you can figure it out. I don't have to mention any precedence for fib, because it always comes with parentheses anyway.

Most programming language manuals say that +, -, * and / are left-associative binary operators, while (according to my requirement) ^ is a right-associative binary operator. I dislike these phrases a lot, but we're stuck with them. According to algebra, + and * are associative binary operators, which means that the value of an expression is not affected by their association to the left vs. right (although important computational qualities, such as overflow of intermediate values, may be affected). Associativity is an algebraic quality of the mathematical operations. So-called left- and right-associativity are not qualities of the operations, but only of the notational rules under which we write them.

Here is a grammar that enforces the precedence rules described above:

<fmla> ::= <fmla> + <term>
        |  <fmla> - <term>
        |  <term>

<term> ::= <term> * <factor>
        |  <term> / <factor>
        |  <factor>

<factor> ::= <base> ^ <factor>
          |  <base>

<base>  ::= <number>
         |  fib ( <fmla> , <fmla> , <fmla> )
         |  ( <fmla> )

The names term and factor are fairly standard in programming language manuals, but my fmla is usually called expression. This grammar is unambiguous (each concrete syntax corresponds to a unique abstract syntax), but it allows more than one concrete syntax for a given abstract syntax. In effect, it requires enough parentheses to overcome the effect of the precedence rules when necessary, but it allows additional parentheses at the user's discretion.

Do a few derivation trees with this grammar on your own, to see how it enforces the precedence rules. Notice that higher precedence corresponds to coming lower in the derivation tree. Also notice that the derivation trees are more complicated than those for the previous two grammars. The abstract syntax drops a lot of the detail from the derivation tree.

Now, define your own function formula->min_paren_token_list, to unparse a formula into a list of tokens, using parentheses only when they are required to overcome the precedence rules. Derive your definition from the grammar above:

define a separate function corresponding to each nonterminal (e.g., term->min_paren_token_list),
in each function, provide a conditional clause for each production with the corresponding nonterminal on the left side.

Keep the correspondence between productions and conditional clauses just as precise as it was in formula->full_paren_token_list. There are some rearrangements that work, and shorten the definition a bit, but they obscure the meaning.

The particular form of the conditions matters. The grammar has an infinite loop:

<fmla>
<term>
<factor>
<base>
( <fmla> )
( <term> )
...

The right function definition avoids this loop, and follows the shortest derivation (with the least parentheses) for the given formula. The right definition is much more natural to write than the wrong one, but see if you can figure out how to get it wrong.

Demonstrate your definitions with an appropriate test suite. Use all of the functions in the test suite, not just formula->min_paren_token_list.

Suppose that the number of binary operators in a formula is n. Roughly how many steps does it take to unparse the formula?

A file named collapse_stringlist-tst.scm containing your test suite for my stringlist->string function.
One or more appropriately named files containing your test suites for my formula->full_paren_token_list function
A file named unparse_mp.scm containing your definition of formula->min_paren_token_list and the other definitions that you need to make that one work.
One or more appropriately named files containing your test suites for your minimally parenthesized unparser.
A file named discussion.txt (or appropriate variant if you use a fancier format), containing a very brief discussion of the rationales for your test suites, and the rough number of steps required for unparsing.

Last modified: Wed Oct 8 11:21:08 CDT 2003