CS326 - Project 3 : Symbol tables and type-checking
Due Date: Friday, May 4, 11:59 p.m.
In this part of the project, you are going to extend your parser with a
symbol table and add some semantic checks.
Introduction
In FLAME programs, identifiers are used to refer to function names and variables. For example, consider
the following program:
fun gcd(x:int, y:int)
g: int;
begin
g := y;
while x > 0 do
begin
g := x;
x := y - (y/x)*x;
y := g
end;
return g
end
fun main()
x : int;
y : int;
r : float;
begin
print("Enter two numbers\n");
read(x);
read(y);
r := gcd(x,y);
write(v)
end
In the the gcd() function, the identifiers x, y, and g are used. In main(), the identifiers
x,y,r,v, and gcd are used.
Identifiers must always be declared before they are used. In this program, identifiers are either declared by a function
definition, function parameters, or by a declaration statement such as "g:int;". If an undeclared identifier appears,
it is a programming error. For example, the "write(v)" statement in main() is an error because "v" is never declared.
Identifiers also have scope. For example, the identifier "g" declared in gcd() is only visible inside the gcd() function.
It can't be accessed in main() or any other functions. Similarly, the identifiers "x" and "y" which are used in both
gcd() and main() are different. That is, the "x" declared in main() is a complelely different variable than the
"x" declared as a parameter of gcd().
To keep track of identifiers, your compiler needs to manage a symbol table. The symbol table simply keeps track of
everything that is known about each identifier that appears in a program. The interface to the symbol table primarily occurs
in two places:
- The lexer. Whenever an identifier is found on the input, a symbol table lookup is performed. If a symbol
table entry is found, it is attached to the token as a special attribute. If no symbol table entry is found, a new
symbol table entry is created and attched to the token. To do this in your flamelex.py file, you
might modify your ID rule to consult the symbol table and attach the corresponding entry as the ".symtab" attribute
of any identifier token.
- The parser. As the parser runs, it fills in additional information about each symbol table entry. For example,
if the following declaration appeared:
x : int;
the lexer would first attach the symbol table entry for "x" to the corresponding ID token. The parser would then
take this entry and add information about the type "int" to the symbol to indicate that "x" is an integer.
It is important to note that the parser also uses the symbol table entry to perform error checking. For instance,
if "x" already had a datatype assigned, it must mean that "x" was already declared.
In this case, the parser would generate an error messages such as "x previously declared."
The parser also needs to take steps to manage scoping. For
example, whenever a new function is defined, the parser should create a new
symbol table to contain information about the local variables and
parameters that are declared in that function. This is how the parser
would keep track of the different variables defined in the gcd() and
main() functions from the earlier example. Similarly, whenever the end of a
function is reached, the parser should restore the previous scope.
By adding the symbol table to your parser, you will be able to perform a variety of semantic
checks to your parser. Specifically, you will be able to identify the following errors
such as the following:
- Undeclared identifiers.
- Redeclared identifiers.
- Type errors in expressions.
- Type errors in array indices.
- Type errors in relational operations.
Part 1 - Fix Project 2
Modify your parser to fix all of the tests you failed in Project 2. Specific attention should be given to
tests in which valid FLAME programs fail to parse correctly! You will recover half of the points
deducted in Project 2 by passing the tests when Project 3 is handed in. Please note: since Project 3
adds additional error messages to the compiler, the project 2 tests will be modified to expect these
error messages as normal compiler output.
Part 2 - Symbol Table
Create a file "symtab.py" that implements a simple symbol table for FLAME. Minimally, your implementation
should probably include the following features:
- A class or some kind of object for storing information about each symbol. Keep in mind that the parser
is going to attach attributes to each symbol as it runs.
- An object that maps identifier names to symbol table entries. Use a Python dictionary.
- A global symbol table for containing the symbols of functions defined at program scope
("gcd" and "main" in the earlier example).
- A function that creates a new symbol table. This function will be called whenever
the parser enters a new FLAME function.
- A function that removes the current symbol table and restores the previous
symbol table. This function will be called whenever the parser leaves a FLAME function.
Your function should probably return the symbol table being removed so that the parser
can save it for later use (as an attribute of the function perhaps).
- A function that creates a new symbol in the current symbol table. Used by
the lexer when attaching symbol table entries to identifiers.
- A function that looks up a symbol in the current symbol table. Used by
the lexer when attaching symbol table entries to identifiers.
- A function that looks up a symbol in all enclosing symbol tables. This
will be used by the parser to retrieve symbols not defined locally. For example, in
the earlier example, the main() function calls gcd(), but "gcd" is not a locally
defined symbol. Instead, it's defined in the global symbol table. You will also
need to this to manage nested functions correctly (nested functions can access
variables declared in outer functions).
Part 3 - Modify your lexer
Modify your lexer to attach symbol table entries to each identifier token.
If no symbol table entry exists, you should create a new entry and attach it to the
token.
Keep in mind that the lexer interface is fairly dumb. All it does is
find an existing symbol table entry (if any) and return it with the
token. If no such entry exists, it creates an empty entry and returns
it. The primary reason for doing this in the lexer is that the parser
is always going to examine the symbol table for every identifier that
it encounters. Rather than trying to modify the parser to do the
lookup, it is usually easier and more efficient to do it in the lexer
and pass the symbol table entry along with the token.
Part 4 - Modify your parser
Modify your parser to manage the symbol table as follows:
- Whenever a new function is defined, a new symbol table should be created.
- At the end of each function, the previous symbol table should be restored.
To do this, you will need to be a little sneaky with your grammar. Here is
one way to do it:
def p_function(t):
'function : fundecl ID LPAREN parmlist RPAREN .... statements '
... do whatever ...
# Restore old symbol table
symtab.pop_symtab()
def p_fundecl(t):
'fundecl : FUN'
# Create a new symbol table
symtab.new_symtab()
In this case, a new symbol table will be created as soon as the "FUN" token is seen (and reduced)
on the input. Afterwards, all identifiers inside the function will be placed into the new symbol table. The
only exception is that the "ID" right after "funcdecl" is still placed in the old symbol table.
However, not to worry---this is exactly the behavior you want. Finally, when the entire function declaration
is reduced, the symbol table will be restored.
Now, modify your parser as follows:
- Parameter and variable declarations such record type information for the
corresponding identifier. Remember that a type may be "int", "float", or an array of some
size. The size of an array is part of the type so int[4] is a different
type than int[10].
- If a parameter or declaration is redeclared, your parser should issue an error.
- If an undeclared identifier is encountered anywhere in the body of a function,
an "undeclared identifier" error should be issued. You will need to check
in all of the places where an identifier can appear (expressions, locations, etc.).
Note: there are some subtle issues of scope. For instance, if a symbol is not declared
locally, but is declared globally, your parser will have to check for this in selected
locations.
Part 5 - Type Checking
Now that you've got basic identifier and type handling working, it is time to
add some type checking to the compiler. Most of this work involves
modifications to your arithmetic expression handling.
- Make the type of an integer literal an "int".
- Make the type of a floating point literal a "float".
- Make the type of an identifier the same as its declaration.
- The type of an array location is the same as the base type.
For example, if a is declared as a : int[10], the type of "a" is
"int[10]", and the type of "a[3]" is "int".
- Modify the expression rules so that both sides of an operation
must be the same type. Otherwise, a type error occurs. The type
of an operation is the same type as its operands. For example,
in "a + 3", "a" must be an integer and the result will be an integer.
- Add rules that require array indices to be integers. If not,
generate a type error.
- Modify your assignment rule to check for types. The type of the left hand side,
must match that of the expression on the right.
-
Relational operators follow the same rules as arithmetic expressions.
For example, a < b is only valid if a and b are the same type.
The result of a comparison is a boolean (or int). Therefore,
a < b and c > d is valid even if the types in each comparison are different
(e.g., if a,b are integers and c,d are floats).
- For arithmetic expressions involving only numeric literals, go ahead
and evaluate the literal in the compiler. For example, if someone
writes a : int[40*50], go ahead and evaluate 40*50 and store 2000 as the
expression value.
Implementation Hints
- An easy way to represent types is as a tuple (type,dimension) where
type is a string like "int" or "float" and dimension is an array dimension
or None if the type is not an array.
- Type checking code for arithmetic expressions and relational operators
is essentially the same for each operator. Write some general purpose
functions for handling the type checking.
Handin procedures
-
Your parser must be contained in a file 'flameparse.py'. We must be able to
run this file as follows to run a test:
% python flameparse.py testname.flm
-
Make sure you have a README file that includes your name and anything notable about your implementation.
-
Create a Unix tar file for your project. This tar file should contain the 'flame' directory
and be based on your login name. For example:
% tar -cf yourlogin.tar flame
- On classes.cs.uchicago.edu, copy your tar file to the directory /stage/classes/current/CS326/handin/project3
For example:
% cp yourlogin.tar /stage/classes/current/CS326/handin/project3
- Late assignments are not accepted. The handin directory will be closed
after the due date.
IF WE CAN NOT UNPACK OR RUN YOUR HANDIN USING THE SUPPLIED TEST SCRIPTS, YOU WILL RECEIVE NO CREDIT!