<?php require_once "fs_inner_context.inc"; ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
     "http://www.w3.org/TR/html4/loose.dtd">
<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">

<title><?php echo $HTML_TITLE; ?></title>

</head>

<?php echo html_bgcolor_body($BG_COLOR); ?>

<!-- Standard hierarchical header structure -->

<?php echo $CS_HEADER_HTML; ?>

<?php echo html_linked_text("<em>Courses</em>", $CS_COURSES_ROOT_URL); ?>

<?php echo $HTML_HEADER; ?>

<hr><hr>

<p>Assignment #2 is due at 11:59 PM on Monday, 13 October 2003.</p>

<p>This assignment exercises recursion on nested lists, and explores
the connection between syntactic structure and string notation.</p>

<strong>This is a draft, subject to further revision. I will finalize
it on Wednesday.</strong>

<?php echo html_header($HTML_HEADER_LEVEL + 1,
                       "0. Notation and Structure"); ?>

<?php echo html_header($HTML_HEADER_LEVEL + 2,
                       "0.1. Tangled Terminology"); ?>

<p>All programming language implementations that I have ever seen have
a significant component in charge of co-ordinating the internal
structure in which a program is manipulated with the external form in
which it is typed and displayed. In the olden days, before
mathematical logic and way before computer science, linguists used the
word "syntax" to refer to the structure underlying a spoken or written
text (usually a single sentence).</p>

<p>Mathematical logicians corrupted the word "syntax" to mean the
typed out form in which a text is presented, and computer scientists
followed their lead. Later, computer scientists needed a phrase to
describe syntax (in the old sense of the word). So, they started
calling the syntax of a text the "abstract syntax." To distinguish the
typed out form of presentation, they often call it the "concrete
syntax," although traditionally it isn't a syntax at all.</p>

<p>This foolishness hit its limit (I hope) when an otherwise wonderful
colleague of mine proposed a "syntax-free" programming language with a
graphical presentation, instead of a typed out presentation. Of course
his language has syntax, or it would have been a total flop. Oh
well. From now on, we'll use the corrupt modern terms "concrete
syntax" and "abstract syntax" for <em>notation</em> (which isn't
syntax at all) and structure (which used to be called "syntax").</p>

<?php echo html_header($HTML_HEADER_LEVEL + 2,
                       "0.2. Formal Grammars"); ?>

<p>Noam Chomsky suggested that the connection of texts to syntax in
natural languages could be described by a sort of formal grammars,
which are now called "context-free grammars" (CFGs). The very notion
of such formal descriptions of natural language has sparked bitter
arguments and worked its way in and out of favor in
linguistics. Meanwhile, CFGs have become essentially the only way to
describe the connection between notation and syntax---I mean concrete
syntax and abstract syntax---in programming languages. They have
proved extremely useful for that purpose.</p>

<p>Don't even think hard about the phrase "context-free." Context
<em>does</em> matter in a "context-free" grammar. It just doesn't
matter in one particular way that distinguishes CFGs from
"context-sensitive grammars." The terminology is totally bogus, so
just think of CFG as a proper noun, and not a description.</p>

<p>CFGs describe "syntax" (yes, the CS community refers to the whole
can of worms as "syntax") by describing the relationships between
certain symbols, called <em>terminals</em> or <em>tokens</em>, that
appear in concrete syntax, and other symbols, variously called
<em>variables</em> or <em>nonterminals</em> (in CS, we usually say
"nonterminals"), that stand for syntactically significant classes of
subtexts. In a grammar for the English language, there are
<em>nonterminals</em> standing for the grammatical categories
<em>sentence</em>, <em>subject</em>, <em>predicate</em>, <em>direct
object</em>, <em>prepositional phrase</em>, etc.</p>

<p>A CFG consists of a bunch of rules, called <em>productions</em>,
each of which describes one way in which a particular
<em>nonterminal</em> symbol can be refined into a sequence of
<em>tokens</em> and other <em>nonterminals</em>. In a grammar for
English, there is probably a production allowing one to refine the
nonterminal for <em>sentence</em> into a sequence of <em>subject</em>
<em>predicate</em> <em>direct object</em>, and another one allowing
one to refine <em>noun</em> into the token "dog".</p>

<p>Here's what a typical CFG looks like in the CS programming language
business:</p>

<blockquote><pre>
&lt;formula&gt; ::= &lt;formula&gt; + &lt;formula&gt;
           |  &lt;formula&gt; - &lt;formula&gt;
           |  &lt;formula&gt; * &lt;formula&gt;
           |  &lt;formula&gt; / &lt;formula&gt;
           |  &lt;formula&gt; ^ &lt;formula&gt;
           |  ( &lt;formula&gt; )
           |  fib ( &lt;formula&gt; , &lt;formula&gt; , &lt;formula&gt; )
           |  &lt;number&gt;
</pre></blockquote>

<p>Symbols surrounded by pointy brackets (&lt;...&gt;) are
<em>nonterminals</em>. The wacky symbols ::= and | join together the
parts of productions. Other symbols (+, -, *, /, ^, fib, comma, and
parentheses in this case) are <em>terminals</em>. Each line is a
<em>production</em>. The ::= symbol indicates that the
<em>nonterminal</em> on its left may be replaced by the stuff on its
right. The | symbol inherits the <em>nonterminal</em> from the
previous production, because we get tired of typing it and looking at
it over and over. This particular notation for CFGs was proposed by
Backus and Naur because it could be typed onto IBM punch cards. So
people in "practical" CS call it a Backus-Naur form, or
BNF. Linguistics and theoretical CS write grammars differently, but
they mean the same thing.</p>

<p>The CFG/BNF above is incomplete, because it doesn't contain the
productions for &lt;number&gt;. There are too many of them---one for
each number that you're allowed to write. OK, here's one of them:</p>

<blockquote><pre>
&lt;number&gt; ::= 473263
</pre></blockquote>

<p>Token classes, such as <em>number</em>, are sometimes defined in a
restricted form of BNF notation, and sometimes in a different notation
based on <em>regular expressions</em> (a sort of textual patterns). We
can ignore that level of detail for now.</p>

<p>Given such a BNF (assuming that you know exactly which tokens are
numbers), you can derive every possible "formula" by a process, called
a <em>derivation</em>: start with the <em>nonterminal</em> symbol
&lt;formula&gt; all by itself. In each step, choose one
<em>nonterminal</em> to replace, choose a <em>production</em> with
that <em>nonterminal</em> on the left side, and replace it by the
stuff on the right side. Stop when you have replaced all
<em>nonterminals</em>, and are staring at a sequence of
<em>terminals</em>. Example derivation, with the nonterminal to be
replaced emboldened in each step:</p>

<blockquote><pre>
<strong>&lt;formula&gt;</strong>
<strong>&lt;formula&gt;</strong> * &lt;formula&gt;
&lt;number&gt; * <strong>&lt;formula&gt;</strong>
&lt;number&gt; * <strong>&lt;formula&gt;</strong> + &lt;formula&gt;
&lt;number&gt; * <strong>&lt;number&gt;</strong> + &lt;formula&gt;
&lt;number&gt; * 42 + <strong>&lt;formula&gt;</strong>
&lt;number&gt; * 42 + ( <strong>&lt;formula&gt;</strong> )
&lt;number&gt; * 42 + ( &lt;formula&gt; / <strong>&lt;formula&gt;</strong> )
&lt;number&gt; * 42 + ( &lt;formula&gt; / <strong>&lt;number&gt;</strong> )
&lt;number&gt; * 42 + ( <strong>&lt;formula&gt</strong>; / 2 )
<strong>&lt;number&gt;</strong> * 42 + ( &lt;number&gt; / 2 )
8 * 42 + ( <strong>&lt;number&gt;</strong> / 2 )
8 * 42 + ( 6 / 2 )
</pre></blockquote>

<p>I did the steps in a stupid, whimsical order just to make the point
that it doesn't matter. If you're going to write out derivations
(which you usually shouldn't), you should usually do it
systematically, always replacing the leftmost <em>nonterminal</em>, or
always the rightmost.</p>

<p>If you've been <em>really</em> alert, you'll be wondering, "where's
the (abstract) syntax?" A CFG determines very precisely which
sequences of tokens you are and are not allowed to write down in a
program, or a portion of a program. The abstract syntax is implicit,
and not completely there. For the most part, linear sequential
derivations (as in the example above) are bogus. The right way to
think of a derivation is to conceive it as a tree:</p>

<blockquote><pre>
      &lt;formula&gt;
      /   |   \
     /    |    ------
    |     |          \       
    |     |       &lt;formula&gt;
    |     |       /   |   \
    |     |      /    |    --------
    |     |     /     |            \
    |     |    |      |         &lt;formula&gt;
    |     |    |      |         /    |  \
    |     |    |      |    -----     |   -------
    |     |    |      |   /      &lt;formula&gt;      \
    |     |    |      |  /       /   |   \       \
&lt;formula&gt | &lt;formula&gt; | |  &lt;formula&gt; |  &lt;formula&gt; |
    |     |    |      | |     |      |     |      |
&lt;number&gt;  | &lt;number&gt;  | |  &lt;number&gt;  |  &lt;number&gt;  |
    |     |    |      | |     |      |     |      |
    8     *    42     + (     6      /     2      )
</pre></blockquote>

<p>The tree shows important information about the derivation, while
leaving out irrelevant choices about the order of certain steps. It
also <em>suggests</em> an abstract syntactic structure. It doesn't
really tell us the structure completely, because we need to know that
the 8, 42, 6, and 2 represent operands, the *, +, and / indicate which
operations are performed on the operands, and the parentheses plus the
derivations that stick them determine which structure we get, but are
not actually part of that structure. Using all of that information, we
can infer an abstract syntactic structure that looks like:</p>

<blockquote><pre>
multiply
  /   \
 /    add
|    /   \
|   |     divide
|   |      / \
8   42    6   2
</pre></blockquote>

<p>Work out for yourself how the abstract syntax tree is a sort of
collapsed version of the derivation tree.</p>

<p>You hardly ever see an actual abstract syntax tree in a programming
language manual. You see a lot of BNFs, which imply derivation trees,
which suggest abstract syntax trees. As the concrete syntax gets more
complicated, the correspondence becomes trickier.</p>

<p><em>LISP</em>/<em>Scheme</em> have maintained such a simple
connection between concrete and abstract syntax that you hardly need a
BNF. Essentially, the parenthesis grouping determines the abstract
syntax directly.</p>

<?php echo html_header($HTML_HEADER_LEVEL + 2,
                       "0.3 Tokens and Lexical Analysis"); ?>

<p>So far, we have treated each substantially meaningful symbol,
including single character symbols such as +, -, *, /, ^, and also
multicharacter symbols for numbers, such as 43876. Just as linguistics
treats spelling of words in a different level of description from the
syntax of sentences made up of words, programming language
descriptions usually give syntax in terms of lists substantial
symbols, called <em>tokens</em>, which might be longer than a single
character. A separate level of derivation, called <em>lexical
analysis</em>, associates sequences of tokens with character strings
containing the spelled out tokens and some extra separators, such as
blank spaces.</p>

<?php echo html_header($HTML_HEADER_LEVEL + 1,
                       "1. Abstract Syntax for the PAC"); ?>

<p>For now, our PAC works with numerical formulae built up from
numbers, and the binary operators +, -, *, /, ^. Abstract syntax is
really a <em>tree</em> structure, not a typed out structure. But
everything we look at ends up being typed out in some form. To
understand the point of this exercise, you need to think of the
abstract syntax as a hidden internal form in our calculations, and the
concrete syntax as a tool for presenting it to users. You can't trust
the superficial look of things as your programming, since the token
lists will actually look rather ugly, while <em>Scheme</em>'s
presentation of its internal forms is relatively nice. But when the
<code>formula->...</code> functions are deployed in a complete
application, they will lead to nice readable concrete forms for the
user, and nice manipulable abstract internal forms for the
programs.</p>

<p><em>Scheme</em> programmers have a pretty standard form for
representing trees as nested lists. If there is only one element in a
tree (e.g., the number 3), then that element stands for the small
tree. If a tree has a particular
<code><var>&lt;operator&gt;</var></code> at the root, and two
<code><var>&lt;argument&gt;</var></code>s hanging below it, then the
tree is represented by:</p>

<blockquote>
<code>(list <var>&lt;operator&gt;</var> <var>&lt;argument&gt;</var> <var>&lt;argument&gt;</var>)</code>
</blockquote>

<p>The arguments themselves may be single elements, or lists
representing further tree structure.</p>

<p>Numbers are represented directly by <em>Scheme</em> numbers. Our
numerical operators are represented by the <em>Scheme</em> symbols
<code>'+</code>, <code>'-</code>, <code>'*</code>, <code>'/</code>,
<code>'expt</code>, <code>'fib</code>. Here are some examples of
numerical formulae in the abstract syntax form:</p>

<blockquote><pre><code>
347
(list '+ 2 3)
(list '+ (list '* 4 5) (list 'expt 2 3))
(list '+ 1 (list '+ (list '+ 2 3) 4))
(list '+ (list 'fib 0 1 6) (list 'fib 0 1 5))
</code></pre></blockquote>

<p>This form looks a lot like the form in which you type a
<em>Scheme</em> expression in order to compute its value:</p>

<blockquote><pre><code>
347
(+ 2 3)
(+ (* 4 5) (expt 2 3))
(+ 1 (+ (+ 2 3) 4))
(+ (fib 0 1 6) (fib 0 1 5))
</code></pre></blockquote>

<p>We will explore the difference when we dig a bit deeper into the
design of <em>Scheme</em>, and find that it's even less than it
appears to be here.</p>

<p>I have not given a precise definition of the representation of
numerical formulae in the PAC, but there is enough information here to
figure it out. If you are puzzled by any questions regarding what is
and is not a legitimate representation of a formula, please ask in
class and post to the online discussion.</p>

<?php echo html_header($HTML_HEADER_LEVEL + 1,
                       "2. Fully Parenthesized Concrete Syntax for the PAC"); ?>

<p>Notice how a grammar looks sort of like a recursive function
definition. That similarity cuts fairly deep, and we exploit it in
this assignment and the next to <em>unparse</em> (translate from
abstract to concrete syntax) and to <em>parse</em> (translate from
concrete to abstract syntax) numerical formulae. Unparsing is much
easier than parsing, which is why we do it first.</p>

<p>Our Perfectly Accurate Calculator will use the usual familiar
concrete syntax for formulae typed in by a user. The example grammar
above describes numerical formulae made up of numbers combined with
the binary operations +, -, *, /, and ^ (for exponentiation), with
optional parentheses. Unfortunately, it is highly ambiguous. There may
be several different derivation trees, and several different abstract
syntaxes, for the same concrete syntax (e.g., 3 + 2 * 4). To get off
the ground very easily, we'll start with fully parenthesized formulae,
described by the grammar:</p>

<blockquote><pre>
&lt;fmla&gt; ::= ( &lt;fmla&gt; + &lt;fmla&gt; )
        |  ( &lt;fmla&gt; - &lt;fmla&gt; )
        |  ( &lt;fmla&gt; * &lt;fmla&gt; )
        |  ( &lt;fmla&gt; / &lt;fmla&gt; )
        |  ( &lt;fmla&gt; ^ &lt;fmla&gt; )
        |  fib ( &lt;fmla&gt; , &lt;fmla&gt; , &lt;fmla&gt; )
        |  &lt;number&gt;
</pre></blockquote>

<p>I

<?php echo html_linked_text("defined <code>formula->full_paren_token_list</code> ", "unparse_fp.scm") . " "; ?>

to unparse according to this grammar. I also defined
<code>stringlist->string</code> because the <em>Scheme</em>
presentation of lists of strings is a bit hard on the eyes.</p>

<p>Demonstrate my function <code>stringlist->string</code> with a small
test suite. 3 test cases are probably enough.</p>

<p>When you are confident about <code>stringlist->string</code>, use
it to test my function
<code>formula->full_paren_token_list</code>. This one needs a larger
number of test cases to exercise all of the productions in the
grammar.</p>

<p>Look carefully at the way in which clauses in the function
definition correspond very precisely to productions in the
grammar. Because our current grammar offers no choices in either
parsing or unparsing (i.e., it gives a 1-1 correspondence between
concrete and abstract syntax), the order in which I treated the
productions was not very important. But I used the right sort of order
anyway, to prepare for the next step.</p>

<p>Also notice my conditions, e.g.</p>

<blockquote><code>
(and (list? fmla) (symbol=? (first fmla) '-))
</code></blockquote>

<p>The <code>(list? fmla)</code> clause is not terribly important at
this step, since every formula that is not just a number is
represented by a list. But it becomes important in the next step,
because the possibility of a number doesn't come up in every one of
the functions. The order of the two <code>(and ...)</code> clauses is
important. When the first clause of an <code>and</code> has the value
<code>false</code>, <em>Scheme</em> doesn't evaluate the second
clause, which in this case would fail. Experiment on your own with
<code>and</code> to see how this works.</p>

<?php echo html_header($HTML_HEADER_LEVEL + 1,
                       "3. Minimally Parenthesized Concrete Syntax for the PAC"); ?>

<p>Full parenthesization makes (un)parsing rather easy, but it's a
pain for the user to type in. We reduce parentheses by establishing
certain <em>precedence</em> rules:</p>

<ul>

<li><code>^</code> has highest precedence,<br><br></li>

<li>a <code>^</code> on the right takes precedence over a
<code>^</code> on the left,<br><br></li>

<li><code>*</code> and <code>/</code> have second precedence,<br><br></li>

<li>a <code>*</code> or <code>/</code> on the left takes precedence
over a <code>*</code> or <code>/</code> on the right,<br><br></li>

<li><code>+</code> and <code>-</code> have lowest precedence,<br><br></li>

<li>a <code>+</code> or <code>-</code> on the left takes precedence
over a <code>+</code> or <code>-</code> on the right.</li>

</ul>

<p>These rules are rather typical for programming languages, except
that left precedence is common, even for <code>^</code>. There is a
good mathematical reason why I gave <code>^</code> right
precedence---maybe you can figure it out. I don't have to mention any
precedence for <code>fib</code>, because it always comes with
parentheses anyway.</p>

<p>Most programming language manuals say that <code>+</code>,
<code>-</code>, <code>*</code> and <code>/</code> are
<em>left-associative</em> binary operators, while (according to my
requirement) <code>^</code> is a <em>right-associative</em> binary
operator. I dislike these phrases a lot, but we're stuck with
them. According to algebra, <code>+</code> and <code>*</code> are
<em>associative</em> binary operators, which means that the value of
an expression is not affected by their association to the left
vs. right (although important <em>computational</em> qualities, such
as overflow of intermediate values, may be
affected). <em>Associativity</em> is an algebraic quality of the
mathematical operations. So-called <em>left-</em> and
<em>right-associativity</em> are <em>not</em> qualities of the
operations, but only of the notational rules under which we write
them.</p>

<p>Here is a grammar that enforces the precedence rules described
above:</p>

<blockquote><pre>
&lt;fmla&gt; ::= &lt;fmla&gt; + &lt;term&gt;
        |  &lt;fmla&gt; - &lt;term&gt;
        |  &lt;term&gt;

&lt;term&gt; ::= &lt;term&gt; * &lt;factor&gt;
        |  &lt;term&gt; / &lt;factor&gt;
        |  &lt;factor&gt;

&lt;factor&gt; ::= &lt;base&gt; ^ &lt;factor&gt;
          |  &lt;base&gt;

&lt;base&gt;  ::= &lt;number&gt;
         |  fib ( &lt;fmla&gt; , &lt;fmla&gt; , &lt;fmla&gt; )
         |  ( &lt;fmla&gt; )
</pre></blockquote>

<p>The names <em>term</em> and <em>factor</em> are fairly standard in
programming language manuals, but my <em>fmla</em> is usually called
<em>expression</em>. This grammar is unambiguous (each concrete syntax
corresponds to a unique abstract syntax), but it allows more than one
concrete syntax for a given abstract syntax. In effect, it
<em>requires</em> enough parentheses to overcome the effect of the
precedence rules when necessary, but it <em>allows</em> additional
parentheses at the user's discretion.</p>

<p>Do a few derivation trees with this grammar on your own, to see how
it enforces the precedence rules. Notice that <em>higher</em>
precedence corresponds to coming <em>lower</em> in the derivation
tree. Also notice that the derivation trees are more complicated than
those for the previous two grammars. The abstract syntax drops a lot
of the detail from the derivation tree.</p>

<p>Now, define your own function
<code>formula->min_paren_token_list</code>, to unparse a formula into
a list of tokens, using parentheses only when they are required to
overcome the precedence rules. Derive your definition from the grammar
above:</p>

<ul>

<li>define a separate function corresponding to each
<em>nonterminal</em> (e.g.,
<code>term->min_paren_token_list</code>),<br><br></li>

<li>in each function, provide a conditional clause for each
<em>production</em> with the corresponding <em>nonterminal</em> on the
left side.</li>

</ul>

<p>Keep the correspondence between productions and conditional clauses
just as precise as it was in
<code>formula->full_paren_token_list</code>. There are some
rearrangements that work, and shorten the definition a bit, but they
obscure the meaning.</p>

<p>The particular form of the conditions matters. The grammar has an
      infinite loop:</p>

<blockquote><pre>
&lt;fmla&gt;
&lt;term&gt;
&lt;factor&gt;
&lt;base&gt;
( &lt;fmla&gt; )
( &lt;term&gt; )
...
</pre></blockquote>

<p>The right function definition avoids this loop, and follows the
shortest derivation (with the least parentheses) for the given
formula. The right definition is much more natural to write than the
wrong one, but see if you can figure out how to get it wrong.</p>

<p>Demonstrate your definitions with an appropriate test suite. Use
all of the functions in the test suite, not just
<code>formula->min_paren_token_list</code>.</p>


<?php echo html_header($HTML_HEADER_LEVEL + 1,
                       "4. Time Efficiency"); ?>

<p>Suppose that the number of binary operators in a formula is
<em>n</em>. Roughly how many steps does it take to unparse the
formula?</p>


<?php echo html_header($HTML_HEADER_LEVEL + 1,
                       "5. What to Hand In"); ?>

<ol>

<li>A file named <code>collapse_stringlist-tst.scm</code> containing
your test suite for my <code>stringlist->string</code>
function.<br><br></li>

<li>One or more appropriately named files containing your test suites
for my <code>formula->full_paren_token_list</code>
function<br><br></li>

<li>A file named <code>unparse_mp.scm</code> containing your
definition of <code>formula->min_paren_token_list</code> and the other
definitions that you need to make that one work.<br><br></li>

<li>One or more appropriately named files containing your test suites
for your minimally parenthesized unparser.<br><br></li>

<li>A file named <code>discussion.txt</code> (or appropriate variant
if you use a fancier format), containing a very brief discussion of
the rationales for your test suites, and the rough number of steps
required for unparsing.</li>

</ol>

<?php echo html_footer("index", 0, 1); ?>
<!-- hhmts start -->
Last modified: Wed Oct  8 11:21:08 CDT 2003
<!-- hhmts end -->

</body>

</html>
