CS326 - Project 3 : Symbol tables and type-checking

Due Date: Friday, May 4, 11:59 p.m.

In this part of the project, you are going to extend your parser with a symbol table and add some semantic checks.

Introduction

In FLAME programs, identifiers are used to refer to function names and variables. For example, consider the following program:

fun gcd(x:int, y:int)
    g: int;
    begin
       g := y;
       while x > 0 do
           begin
              g := x;
              x := y - (y/x)*x;
              y := g
           end;
       return g
    end

fun main()
   x : int;
   y : int;
   r : float;
   begin
      print("Enter two numbers\n");
      read(x);
      read(y);
      r := gcd(x,y);
      write(v)
   end

In the the gcd() function, the identifiers x, y, and g are used. In main(), the identifiers x,y,r,v, and gcd are used.

Identifiers must always be declared before they are used. In this program, identifiers are either declared by a function definition, function parameters, or by a declaration statement such as "g:int;". If an undeclared identifier appears, it is a programming error. For example, the "write(v)" statement in main() is an error because "v" is never declared.

Identifiers also have scope. For example, the identifier "g" declared in gcd() is only visible inside the gcd() function. It can't be accessed in main() or any other functions. Similarly, the identifiers "x" and "y" which are used in both gcd() and main() are different. That is, the "x" declared in main() is a complelely different variable than the "x" declared as a parameter of gcd().

To keep track of identifiers, your compiler needs to manage a symbol table. The symbol table simply keeps track of everything that is known about each identifier that appears in a program. The interface to the symbol table primarily occurs in two places:

The lexer. Whenever an identifier is found on the input, a symbol table lookup is performed. If a symbol table entry is found, it is attached to the token as a special attribute. If no symbol table entry is found, a new symbol table entry is created and attched to the token. To do this in your flamelex.py file, you might modify your ID rule to consult the symbol table and attach the corresponding entry as the ".symtab" attribute of any identifier token.
The parser. As the parser runs, it fills in additional information about each symbol table entry. For example, if the following declaration appeared:
```
   x : int;
```
the lexer would first attach the symbol table entry for "x" to the corresponding ID token. The parser would then take this entry and add information about the type "int" to the symbol to indicate that "x" is an integer. It is important to note that the parser also uses the symbol table entry to perform error checking. For instance, if "x" already had a datatype assigned, it must mean that "x" was already declared. In this case, the parser would generate an error messages such as "x previously declared."
The parser also needs to take steps to manage scoping. For example, whenever a new function is defined, the parser should create a new symbol table to contain information about the local variables and parameters that are declared in that function. This is how the parser would keep track of the different variables defined in the gcd() and main() functions from the earlier example. Similarly, whenever the end of a function is reached, the parser should restore the previous scope.

By adding the symbol table to your parser, you will be able to perform a variety of semantic checks to your parser. Specifically, you will be able to identify the following errors such as the following:

Undeclared identifiers.
Redeclared identifiers.
Type errors in expressions.
Type errors in array indices.
Type errors in relational operations.

Part 1 - Fix Project 2

Modify your parser to fix all of the tests you failed in Project 2. Specific attention should be given to tests in which valid FLAME programs fail to parse correctly! You will recover half of the points deducted in Project 2 by passing the tests when Project 3 is handed in. Please note: since Project 3 adds additional error messages to the compiler, the project 2 tests will be modified to expect these error messages as normal compiler output.

Part 2 - Symbol Table

Create a file "symtab.py" that implements a simple symbol table for FLAME. Minimally, your implementation should probably include the following features:

A class or some kind of object for storing information about each symbol. Keep in mind that the parser is going to attach attributes to each symbol as it runs.
An object that maps identifier names to symbol table entries. Use a Python dictionary.
A global symbol table for containing the symbols of functions defined at program scope ("gcd" and "main" in the earlier example).
A function that creates a new symbol table. This function will be called whenever the parser enters a new FLAME function.
A function that removes the current symbol table and restores the previous symbol table. This function will be called whenever the parser leaves a FLAME function. Your function should probably return the symbol table being removed so that the parser can save it for later use (as an attribute of the function perhaps).
A function that creates a new symbol in the current symbol table. Used by the lexer when attaching symbol table entries to identifiers.
A function that looks up a symbol in the current symbol table. Used by the lexer when attaching symbol table entries to identifiers.
A function that looks up a symbol in all enclosing symbol tables. This will be used by the parser to retrieve symbols not defined locally. For example, in the earlier example, the main() function calls gcd(), but "gcd" is not a locally defined symbol. Instead, it's defined in the global symbol table. You will also need to this to manage nested functions correctly (nested functions can access variables declared in outer functions).

Part 3 - Modify your lexer

Modify your lexer to attach symbol table entries to each identifier token. If no symbol table entry exists, you should create a new entry and attach it to the token.

Keep in mind that the lexer interface is fairly dumb. All it does is find an existing symbol table entry (if any) and return it with the token. If no such entry exists, it creates an empty entry and returns it. The primary reason for doing this in the lexer is that the parser is always going to examine the symbol table for every identifier that it encounters. Rather than trying to modify the parser to do the lookup, it is usually easier and more efficient to do it in the lexer and pass the symbol table entry along with the token.

Part 4 - Modify your parser

Modify your parser to manage the symbol table as follows:

Whenever a new function is defined, a new symbol table should be created.
At the end of each function, the previous symbol table should be restored.

To do this, you will need to be a little sneaky with your grammar. Here is one way to do it:

def p_function(t):
    'function : fundecl ID LPAREN parmlist RPAREN .... statements '
    ... do whatever ...

    # Restore old symbol table
    symtab.pop_symtab()

def p_fundecl(t):
    'fundecl : FUN'
    # Create a new symbol table
    symtab.new_symtab()

In this case, a new symbol table will be created as soon as the "FUN" token is seen (and reduced) on the input. Afterwards, all identifiers inside the function will be placed into the new symbol table. The only exception is that the "ID" right after "funcdecl" is still placed in the old symbol table. However, not to worry---this is exactly the behavior you want. Finally, when the entire function declaration is reduced, the symbol table will be restored.

Now, modify your parser as follows:

Parameter and variable declarations such record type information for the corresponding identifier. Remember that a type may be "int", "float", or an array of some size. The size of an array is part of the type so int[4] is a different type than int[10].
If a parameter or declaration is redeclared, your parser should issue an error.
If an undeclared identifier is encountered anywhere in the body of a function, an "undeclared identifier" error should be issued. You will need to check in all of the places where an identifier can appear (expressions, locations, etc.).

Note: there are some subtle issues of scope. For instance, if a symbol is not declared locally, but is declared globally, your parser will have to check for this in selected locations.

Part 5 - Type Checking

Now that you've got basic identifier and type handling working, it is time to add some type checking to the compiler. Most of this work involves modifications to your arithmetic expression handling.

Make the type of an integer literal an "int".
Make the type of a floating point literal a "float".
Make the type of an identifier the same as its declaration.
The type of an array location is the same as the base type. For example, if a is declared as a : int[10], the type of "a" is "int[10]", and the type of "a[3]" is "int".
Modify the expression rules so that both sides of an operation must be the same type. Otherwise, a type error occurs. The type of an operation is the same type as its operands. For example, in "a + 3", "a" must be an integer and the result will be an integer.
Add rules that require array indices to be integers. If not, generate a type error.
Modify your assignment rule to check for types. The type of the left hand side, must match that of the expression on the right.
Relational operators follow the same rules as arithmetic expressions. For example, a < b is only valid if a and b are the same type. The result of a comparison is a boolean (or int). Therefore, a < b and c > d is valid even if the types in each comparison are different (e.g., if a,b are integers and c,d are floats).
For arithmetic expressions involving only numeric literals, go ahead and evaluate the literal in the compiler. For example, if someone writes a : int[40*50], go ahead and evaluate 40*50 and store 2000 as the expression value.

Implementation Hints

An easy way to represent types is as a tuple (type,dimension) where type is a string like "int" or "float" and dimension is an array dimension or None if the type is not an array.
Type checking code for arithmetic expressions and relational operators is essentially the same for each operator. Write some general purpose functions for handling the type checking.

Handin procedures

Your parser must be contained in a file 'flameparse.py'. We must be able to run this file as follows to run a test:
```
% python flameparse.py testname.flm
```
Make sure you have a README file that includes your name and anything notable about your implementation.
Create a Unix tar file for your project. This tar file should contain the 'flame' directory and be based on your login name. For example:
```
% tar -cf yourlogin.tar flame
```
On classes.cs.uchicago.edu, copy your tar file to the directory /stage/classes/current/CS326/handin/project3 For example:
```
% cp yourlogin.tar /stage/classes/current/CS326/handin/project3
```
Late assignments are not accepted. The handin directory will be closed after the due date.

IF WE CAN NOT UNPACK OR RUN YOUR HANDIN USING THE SUPPLIED TEST SCRIPTS, YOU WILL RECEIVE NO CREDIT!