CS223 Homework 6 Due Midnight, Thursday, May 31 A programming project: an interactive calculator. Build an interactive calculator that basically does the following (with indicated point values): 1. [25] inputs expressions to be evaluated, written in an algebraic notation (e.g. "3 * (2 - 1)"). The expression notation should support the usual conventions of operator precedence. Input initially will be assumed to be from stdin. 2. [10] evaluate the expressions entered in step 1 3. [5] print the result (be default on stdout) We will assume that we are dealing with integer arithmetic, supporting the following set of operations: +, -, *, div, mod plus a unary negation operator ~ (as in SML). Once the basic functionality is working, with just simple arithmetic expressions, we can start adding features: (a) [15] Simple variable definitions like let x = 3 * 5 and the use of such variables in expressions (e.g. 2 * x -1) (b) [20] simple unary function definitions, e.g. fun f x = x * 3 - 2 along with the use of such defined functions in expressions: f(3) + 2. (c) [15] relational operators like ==, < and conditional expressions of the form "if x < 3 then 2 else g y" (but not while loops, since these would not make sense because there are no side effects in the calculator language). Just == and < will suffice, but you can add others like >, <=, >=. (d) [5] recursive function definitions, like factorial. For up to 20 extra credit points, you can implement a type checker for the calculator language (once you have function and/or boolean values in addition to integer values, so that there are types to discriminate). You may not be able to complete all parts of the project, but do as many as you can. Quality is important. ********************************************* IMPORTANT: Please make your code as neat, well-formated, and readable as you can. Use mneumonic function and variable names. Include comments to explain how the code works and what your assumptions are. Your solution should have a README file explaining the overall structure of your implementation, how to build it, and how to run it. ********************************************* This project can be implemented in either SML or Haskell, or both (for 30% extra credit). It is essential that your code compile correctly -- very little credit will be given for code with syntax or type errors. It is also important that you show that you have done some testing to prove that your implementation works correctly, by providing test cases and testing transcripts. You can mine code from earlier homeworks, the simple language implementation example, and the Hutton calculator.lhs example on the Programs page. ================================================================== Step 1. Input and simple parsing. --------------------------------- An example of a very simple calculator is provided in the file IO-parsing.hs (added to the Programs page) that shows how the input readers can be used to process characters read using the IO monad (by the getChar action in particular). This example uses a variant of the Reader monad given in the file Reader.hs. At the end of Reader.hs, I define a Reader action expr which parses and evaluates extremely simple arithmetic expressions of the for "3", "1+2", "4+7+6", i.e. expressions formed by adding numbers. This expr Reader is used to process the input gathered by the top-level IO action getInput. This illustrates how to divide the task of character input from the task of input processing. The Reader.hs file has been rewritten mainly to define Reader as a Monad instance from the beginning, and it uses the usual "return" and ">>=" names for monad operations. This makes it possible to use the "do" notation to build Readers. For SML we have in stream.sml and inputfn.sml how to connect character input to input processing. Try adding an SML version of the expr reader to inputfn.sml. Files provided: IO-parsing.hs, Reader.hs, stream.sml, and inputfn.sml. Step 2. Lexical analysis (or tokenization) ------------------------------------------- The sample expr Reader at the end of Reader.hs goes directly from an input stream of characters to the integer result of expression evaluation. It analyzes the character stream into meaningful units like the "+" symbol and natural numbers (digit sequences), and then interprets the sequence of these elements as an expression and evaluates it as it parses. Naively, this process looks like (ignoring the complexities of the Reader monad): Instream ---------------------> Int expr To support more complicated terms and definitions, it will be convenient to factor this process into several stages. The parsing of input characters into expressions is typically split into two phases (1) lexical analysis, which splits the character stream into meaningful segments representing operators, numbers, identifiers, keyworks, and punctuation (e.g. parentheses). These units are usually called "tokens". For instance, "2+3" would break down into three tokens: (number 2, operator +, number 3). Tokens may or may not be separated by white space. (2) parsing the resulting sequence of tokens into expression structures, usually represented by values of an expression abstract syntax. For instance, (number 2, operation +, number 3) might be parsed into Plus(Nat 2, Nat 3) where Plus and Nat are data constructors for an expression datatype. We have seen such a datatype before in Problem 6 of Homework 1. Once we have expressions represented by such an abstract syntax, it is fairly straightforward to write an evaluator for them. This evaluator can take the form of a direct interpreter: eval :: Expr -> Int (val eval : expr -> int) or if we want to be fancy we could "compile" the expression into a sequence of abstract machine instructions as we did in Homework 1 with compilation of expressions to the RPN abstract machine. So we want to define a type of tokens and then tokenization will be a transformation of a character stream (char list, respectively [Char] in SML and Haskell) into a token stream. tokenize :: Reader TokenStream (Haskell) val tokenize : tokenstream reader (SML) where type TokenStream = [Token] (Haskell) type tokenstream = token list (SML) This will be based on a token reader: tokenRdr :: Reader Token (Haskell) tokenRdr : token reader (SML) The reader tokenRdr will be defined as a multi-way choice between readers for the various kinds of token, e.g. 324 -- natural number token x23 -- alphanumeric identifier + -- plus operator (and similarly for the other operators) ( -- left parenthesis ) -- right parenthesis These different forms of token can be represented by constructors for a token datatype: data Token = ID String | ... | PLUS | ... | LPAR | RPAR Lets assume the following set of basic arithmetic operations: +, - additive *, /, % multipicative (% = mod) ~ (arithmetic) negation These are all binary except for ~, which is unary. (3) Parsing expressions ----------------------- We define an datatype for the abstract syntax of expressions. In Haskell this would be something like: type Variable = String data Expr = Nat Int | ... | Plus Expr Expr | ... Next we want to take the result of tokenization, a token stream and write parsing actions that consume the token stream and produce expression abstract syntax (values of the Expr type). We can express parsing as yet another monad (e.g. in Haskell): data Parser a = P (TokenStream -> Maybe (a, TokenStream)) instance Monad Parser where return v = P (\ toks -> Just(v, toks)) P p >>= f = P (\ toks -> ...) fail s = by analogy with the Reader monad defined in Reader.hs. The Parser actions will be something like the expr :: Reader Int reader action at the end of Reader.hs, except that instead of computing the numeric result, they will build abstract syntax values representing the expressions that they parse. Here is the Haskell interface: getToken :: Parser Token -- like inputc for Reader parse :: Parser a -> (TokenStream -> Maybe(a,TokenStream)) -- perform a parse, analagous to performRdr isToken :: Token -> Parser () -- recognize a given token -- analagous to the symbol Reader natExpr :: Parser Expr identExpr :: Parser Expr expr :: Parser Expr term :: Parser Expr factor :: Parser Expr Here expr is the top-level parser action, which parses a complete top-level expression. But in order to implement the usual operator precedence rules, we need the auxiliary expression classes given by the following grammar (omiting the unary negation operator (~)): expr ::= term "+" expr expr ::= term "-" expr expr ::= term term ::= factor "*" term term ::= factor "/" term -- div term ::= factor "%" term -- mod term ::= factor factor ::= "(" expr ")" factor ::= "nat" -- nat token of form: Nat n factor ::= "id" -- id token of form: Id s Here expr, term, and factor are known as "nonterminals", while the quoted symbols "+", "-", "nat", "id", etc. represent terminal symbols, which are tokens. This grammar is designed to enforce the usual precedence rules, where the multiplicative operators "*", "/", "%" Using some regular expression notation, this grammar can be written more succinctly as: empty ::= expr ::= term ("+" expr | "-" expr | empty) term ::= factor ("*" term | "/" term | "%" term | empty) factor ::= "(" expr ")" | "nat" | "id" Now each of the nonterminals expr, term, and factor is translated into a Parser action. Here, for example, is how the grammar production for expr translates into a Parser action: expr :: Parser Expr expr = do t <- term do isToken PLUS e <- expr return (Plus t e) +++ do isToken MINUS e <- expr return (Minus t e) +++ return t Observe that the three parser actions are mutually recursive. ------------------- SML Note: In the SML version of the parser, the parser type can be a simple abbreviation for a function type: type 'a parser = tokenStream -> ('a * tokenStream) option Thus the parser actions expr, term, and factor will be defined as functions, and so can be mutually recursive: fun expr toks = chain term (fn t => chain (isToken T.PLUS) (fn _ => chain expr (fn e => return (Plus(t, e)))) +++ chain (isToken T.MINUS) (fn _ => chain expr (fn e => return (Minus(t, e)))) +++ return t) toks and term toks = (...) toks and factor toks = (...) toks Note that we have to "eta-expand" the functions so that they are in the right form for using the recursive "fun" declaration. --------------------- (4) Evaluating Expressions -------------------------- Once we have captured an expression as an abstract syntax value of type Expr, we can evaluate it using a simple expression interpreter: --------- module Eval where import Exprs eval :: Expr -> Int eval (Nat n) = n ... -------- Here we are ignoring identifiers (eval(Id s) can remain undefined initially) -- we'll come back to these later after identifier definitions have been added. (5) Top level calculator command -------------------------------- Now we put all these pieces together in the top-level calculator function, which we will call calc: Haskell version: ------------------------------ file: Calc.hs ------------------------------ -- Calc.hs module Calc where import Reader import Tokens import Exprs import Parser import Eval calc :: Instream -> Maybe Int calc instr = case (performRdr tokenize instr) of Nothing -> Nothing Just (toks,_) -> case (parse expr toks) of Nothing -> Nothing Just(v,_) -> Just (eval v) ------------------------------ SML version: ------------------------------ file: calc.sml ------------------------------ (* calc.sml *) structure Calc = struct structure R = Reader structure T = Tokens structure E = Exprs structure P = Parser structure V = Eval (* calc : R.instream -> int option *) fun calc (instr : R.instream) = case (T.tokenize instr) of NONE => NONE | SOME (toks,_) => (case (P.expr toks) of NONE => NONE | SOME(v,_) => SOME (V.eval v)) end (* structure Calc *) ------------------------------ For SML, we also need a CM description file to build the multifile program: ------------------------------ file: calc.cm ------------------------------ Group is $/basis.cm instream.sig listinstream.sml readerfn.sml reader.sml tokens.sml exprs.sml parser.sml eval.sml calc.sml ------------------------------ This file is located in the same directory as the source files that it lists, and we compile the calculator by running the command: - CM.make "calc.cm"; Here the reader.sml file just contains the following declaration: structure Reader : READER = ReaderFn(ListInstream) and we could change the underlying implementation of instreams by replacing ListInstream with, say, StreamInstream. The calculator process can now be broken into the following chain: instream -----------------------------> tokenStream tokenize tokenStream ---------------------------> expr parse expr ----------------------------------> int eval (6) Adding unary negation ------------------------- We are using "~" as the symbol denoting unary negation (to avoid some complications involved in overloading "-" as both the subtraction and the negation operators). We will assume that "~" should have higher precedence, i.e. should bind tighter, than any of the binary operations. We can add it to the grammar by adding a new nonterminal "signed" signed ::= "~" factor | factor and revising the production for term to be: term ::= signed ("*" term | "/" term | "%" term | empty) After adding the corresponding signed Parser action to parser.sml, we can calculate using negation: - Calc.calc (ListInstream.fromString "2*(3+~1)"); val it = SOME 4 : int option (7) Identifiers and definitions ------------------------------- Let's introduce identifiers as basic elements of expressions. At the token level, we have a token constructor ID data Token = ... | ID String | ... datatype toke = ... | ID of string | ... which takes the name of the identifier as an argument. So the identifier x is represented by the token (ID "x"). We already included these in the expression grammar above (section (3)) as elements of factors. The parser will translate a token like (ID "x" :: Token) into abstract syntax for an elementary expression: (Id "x" :: Expr) (note the use of ID for the token constructor, and Id for a variable expression constructor). So assuming we can parse an expression like "x+3" into abstract syntax like Plus(Id "x", Nat 3), how do we evaluate such an expression? Clearly we need to know what integer value is represented by the variable x. This information is obtained from an "environment", which is a finite map from variables to their values (in this case values will always be integers). We have seen a suitable data structure for representing environments before: association lists (alists) with strings for keys. Given an environment env = [..., ("x", 3), ...] we can evaluate "x+2" by looking up the value of x in env, which gives 3, and then adding 3 and 2 to get 5. So evaluation of an expression will now require an environment as an additional parameter: eval :: Expr -> Env -> Int val eval : (expr * env) -> int The next question is: where do environments come from? We get variable bindings by processing declarations. We can use the following conventional syntax for simple variable declarations: let variable = expr E.g. let x = 3-1 Lexically, we need to add two new token constructors: data Token = ... | LET | EQ where the token reader will translate the string "let" into the token LET, and the string "=" into EQ. The grammar needs to be extended with a new nonterminal, "decl", and a new rule: decl ::= "let" id "=" expr or decl ::= LET (ID s) EQ expr --------------------- *** Keyword tokens *** LET (standing for "let") is an example of a keyword token. The tokenize function must look for such keywords before recognizing general identifiers like "x" or "abc", because otherwise "let" would be translated as (ID "let"), i.e. as a general identifier. This goes for other keywords to be introduced later, like "fun" and "if", "then", "else". --------------------- The parser will translate this into a new abstract syntax type for declarations, say: data Decl = Let Variable Expr datatype decl = Let of variable * expr Now the calculator input can take two forms: (1) an expression to be evaluated, or (2) a declaration to be processed. We can call a calculator input form a "statement", and define the statement concrete and abstract syntax by stmt ::= decl | expr data Stmt = D Decl | E Expr datatype stmt = D of decl | E of expr Finally we can combine a sequence of declarations and expressions into a "program": type Prog = [Stmt] type prog = stmt list To evaluate a statement, we have a function evalStmt :: Stmt -> Env -> (Int, Env) For instance: evalStmt (E(Plus (Id "x") (Nat 2)), env) ==> (5, env) if env maps "x" to 3. For declarations, evalStmt (D(Decl "x" (Nat 3)), env) ==> (3, ("x",3):env) i.e. the value returned for a declaration is the value of the defining expression (the "definiens"), and the environment is the original environment extended with the new binding of "x". We can process programs is a couple of ways. First we can assume that each "line" of input is going to contain one statement, and have the top-level evaluation loop process one line after another, passing the environment as a parameter through the loop. Alternatively, we can add concrete syntax for sequences of statements and modify the parser to process them into statement lists. prog ::= stmt (";" prog | empty) Then a single "line" of input could have multiple statements separated by the semicolon token. (8) Function definitions and applications ----------------------------------------- Lets start with a very simple form of function definition fun f x = x+3 using the reserved token "fun" to introduce the function definition, followed by the function name (a variable), the parameter variable, and then "=" followed by the function body expression. The grammar rule for declarations then becomes decl ::= "let" id "=" expr | "fun" id id = expr or decl ::= LET ID EQ expr | FUN ID ID EQ expr (FUN is now a new basic token, like LET). The grammar for expressions is modified by adding a function application "f(expr)" as a new form of factor: factor ::= "(" expr ")" | id "(" expr ")" | nat | id Evaluating a function definition should add a function name to function value binding to the environment. A function value will be a lambda-abstraction "closure": FunVal(Fn("x", expr), env) where the abstract syntax for a function expression (corresponding to a lambda expression), is defined by datatype funExpr = Fn of variable * body and env is the current environment at the point where the function definition is evaluated -- it should contain bindings for any variables (other than the parameter "x") that appear free in the body expression. Fval is a dataconstructor for a more general "value" datatype datatype value = IntVal of int | BoolVal of bool | FunVal of funExpr * env For instance, after let y = 8 the declaration fun f x = y + x should add the binding ("f", FunVal(Fn("x", Plus(Id "y", Id "x")), ("y", IntVal 8)::env)) to the current environment env. The environment now can contain two kinds of bindings -- function bindings and integer bindings, so we use the value datatype defined above for bound values (the boolean values (BoolVal) won't appear in environments, but we need them as the value variant produced when evaluating relational expressions). To evaluate a factor expression of the form f(argexpr) in and environment env we will follow the following procedure: (i) look up the function variable f in env, obtaining a function value funval of the form funval = FunVal(Fn(var, body), env') (ii) evaluate the argument argexpr in env, obtaining a value argval. (iii) add a binding (var, argval) to env', yielding an environment env'' (this augments the closure environment env' so that it binds _all_ the variable appearing in body, including the function parameter var) (iv) evaluate the body expression in env'' to obtain the value of the application expression. (9) Relational operators and conditional expressions ---------------------------------------------------- We'll assume that two relational operators, "==" and "<" will suffice, since the others can be defined in terms of these, using conditional expressions. The conditional operators will be infix operators of lower precedence than any of the arithmetic operators. This can be expressed in the grammar by adding another nonterminal rexpr with the following rule: rexpr ::= expr "==" expr | expr "<" expr Then conditional expressions, which will become the new top-level form of expression, can be handled by introducing one more nonterminal, cexpr, with grammar rule cexpr ::= "if" rexpr "then" expr "else" expr | expr This grammar limits the role of relational expressions (rexpr) to be used only in the condition subexpression of conditional expressions. Allowing more general use of relational expressions would require a more complicated grammar. We need to introduce three new "keyword" tokens for "if" (IF), "then" (THEN), and "else". The infix operators "==" and "<" can be recognized by the symbol token reader as was done for the arithmetic operators. The relational and conditional expressions also have to be added to the expression abstract syntax type. They can be added as new constructors for the general expr type: datatype expr = Nat of int ... | Equal of expr * expr | Less of expr * expr | If of expr * expr * expr Evaluation of these new expressions is straightforward, but the type for expression values needs to be expanded to cover the boolean values of relational expressions. (10) Recursive functions Recursive functions do not need any additional syntax, either at the grammer level or the abstract syntax level. Recursion can be supported simply by adding the function binding as well as the parameter binding to the closure environment before evaluating the function body expression. In other words, we modify step (iii) in the procedure for function application as follows: (iii) add bindings (var, argval) and (f, funval) to env', yielding an environment env'' (this augments the closure environment env' so that it binds _all_ the variable appearing in body, including the function parameter var and the called function symbol itself, so it can be called recursively in the body). (11) Summary * Grammar Here is the final grammar for the calculator language: -- conditional expressions, the top-level form of expressions cexpr ::= "if" rexpr "then" expr "else" expr | expr -- relational expressions, where the arguments are arithmetic rexpr ::= expr "==" expr | expr "<" expr -- arithmetic expressions empty ::= expr ::= term ("+" expr | "-" expr | empty) term ::= signed ("*" term | "/" term | "%" term | empty) signed ::= "~" factor | factor factor ::= "(" expr ")" | id "(" expr ")" | nat | id nat ::= Nat n -- literal Nat token id ::= Id s -- literal Id token -- declarations, simple and function decl ::= "let" id "=" cexpr | "fun" id id = cexpr -- top-level "statements", which may be either (conditional) -- expressions, or declarations stmt ::= decl | cexpr -- programs, sequences of statments separated by ":" prog ::= stmt (";" prog | empty) The prog phrase is optional. The calculator can be designed to read one statement per line, in which case the line break is the implicit separator. * Top-level REPL loop Here is a sample calculator top-level loop (SML version) ------------------------------------------------------------------------ (* calc.sml *) structure Calc = struct structure S = ListInstream structure R = Reader structure T = Tokens structure E = Exprs structure P = Parser structure V = Eval fun printVal (V.IntVal n) = print (Int.toString n) | printVal (V.BoolVal b) = print (Bool.toString b) | printVal (V.FunVal _) = print ("") fun printStmt (V.ExprVal v) = (print ">> "; printVal v; print "\n") | printStmt (V.DeclVal(var,v)) = (print (">> "^var^" = "); printVal v; print "\n") (* repl : Env.env -> unit * the calculator's read-eval-print loop *) fun repl (env) = (* input one line *) case TextIO.inputLine TextIO.stdIn of NONE => () (* no input, exit *) | SOME line => (* tokenize the line *) (case (T.tokenize(S.fromString line)) of NONE => (print "Error - tokenize\n"; repl env) | SOME (toks,_) => (* parse token stream *) (case (P.stmt toks) of NONE => (print ("Error - parse failure: "^line^"\n"); repl env) | SOME(s,_) => (* parsed one statement, evaluate it *) (let val (res,env') = V.evalStmt env s in printStmt res; repl env' end handle Env.Unbound s => (print ("Error - unbound var: "^s^"\n"); repl env)))) fun calc () = repl (Env.empty) end (* structure Calc *) ------------------------------------------------------------------------