Homework #2¶
Due: Wednesday October 18th at 11:59pm
This homework is intended to serve as an introduction to understanding the process of lexical analysis. Specifically, we will look at requirements needed to build the scanner that makes up the lexical analysis phase.
Getting started¶
For each assignment, a Git repository will be created for you on GitHub. However, before that repository can be created for you, you need to have a GitHub account. If you do not yet have one, you can get an account here: https://github.com/join.
To actually get your private repository, you will need this invitation URL:
When you click on an invitation URL, you will have to complete the following steps:
You will need to select your CNetID from a list. This will allow us to know what student is associated with each GitHub account. This step is only done for the very first invitation you accept.
Note
If you are on the waiting list for this course you will not have a repository made for you until you are admitted into the course. I will post the starter code on Ed so you can work on the assignment until you are admitted into the course.
You must click “Accept this assignment” or your repository will not actually be created.
After accepting the assignment, Github will take a few minutes to create your repository. You should receive an email from Github when your repository is ready. Normally, it’s ready within seconds and you can just refresh the page.
- You now need to clone your repository (i.e., download it to your machine).
Make sure you’ve set up SSH access on your GitHub account.
For each repository, you will need to get the SSH URL of the repository. To get this URL, log into GitHub and navigate to your project repository (take into account that you will have a different repository per project). Then, click on the green “Code” button, and make sure the “SSH” tab is selected. Your repository URL should look something like this: git@github.com:mpcs51300-win23/hw1-GITHUB-USERNAME.git.
If you do not know how to use
git clone
to clone your repository then follow this guide that Github provides: Cloning a Repository
If you run into any issues, or need us to make any manual adjustments to your registration, please let us know via Ed Discussion.
Transition Diagrams¶
You will need to draw transition diagrams for this homework assignment. You are free to use any application that makes it easy to draw them (e.g., PowerPoint, Keynote, Photoshop, etc.). You can also search online for free online applications that allow you to draw diagram. However, to make it easy for us to grade/see them use must produce them digital and not by hand.
Short Answer Questions¶
Each assignment may include a section dedicated to answering a few short answer questions. Please turn in a readable text file named saqs
(e.g., saqs.pdf
). Store this file inside the hw2/saqs
directory. You will place your answers to the following questions in this document. The file is not required to be a PDF but any text document that is easy to open is fine. You may place your solution in more than one file but make sure to label file clearly so we know what question your file is referring too.
SAQ 1¶
Write a regular expression for each of the following languages:
Given an alphabet = {0, 1}, L is the set of all strings of alternating pairs of 0s and pairs of 1s.
Given an alphabet = {0, 1}, L is the set of all strings of 0s and 1s that contain an even number of 0s or an even number of 1s.
Given the lowercase English alphabet, L is the set of all strings in which the letters appear in ascending lexicographical order.
SAQ 2¶
Consider the regular expression (note the spaces are just for presentation purposes and not part of the expression):

Use Thompson’s construction to construct an NFA for the regular expression. You do not need to convert the NFA to a DFA but you can for practice.
The Cal Language¶
For this assignment, we will build a direct-coded scanner for a simple toy language called Cal. The informal specification of the language is described below (a more formal specification will be provided in a future homework assignment):
The only values in the language are integers. Integers must begin with a digit from 1 to 9 and zero or more digits (0 to 9) can follow the first digit. All integers are positive.
The language contains variables (i.e., identifiers). Each identifier must begin with a letter from the English alphabet. The letter can be capitalized or lowercase. An identifier can contain zero or more integer digits (0 to 9), lowercase or uppercase English letters following the first letter.
The language contains arithmetic expressions that work on integer values. The operators for these expressions are
+
(addition),-
(subtraction),*
(multiplication), and/
(division). Expressions can contain a mixture of literal integers and variables as operands to the operators.The language contains three types of statements. Each statement ends with a semicolon (
;
). The following are the three types of statements:A variable declaration statement that has the structure:
let IDENTIFIER = EXPRESSION;
. The keywordlet
in all lowercase letters is required for a variable declaration.A assignment statement that has the structure:
IDENTIFIER = EXPRESSION;
.A print statement that has the structure:
print EXPRESSION;
. Notice thatprint
is a keyword and not a function call like in other languages.
Note
Some of this information is not required in order to implement the scanner but is given for clarity purposes.
Here are a few examples of Cal programs
Sample 1
print 3+4;
Sample 2
let a = 4;
a = 7;
print a;
Sample 3
let a = 4 + 12 * 123 - 34 / 6; print a;
let c = a + 4;
print c;
We will add a few more constructs in the next homework assignment but for now the language contains these main constructs.
Programming Question: Simple Scanner¶
As mentioned in the previous section, you will implement a direct-coded scanner for the Cal language. You must build the scanner using the method described in the M2 videos. As you may recall, here are the steps to implement the scanner:
Write down the RE for the input language
Build a big NFA
Build the DFA that simulates the NFA
Turn it into actual code
You don’t need to automate/code steps 1-3 or provide written work for those steps. You can simply create your DFAs by doing the conversions by hand and then using those DFAs to code the scanner. We actually don’t recommend doing 1-2 because your NFAs will be very large. However, you can for practice at later time.
To help you get started with the structure of scanner, we have provided some Go template code inside the hw2/template
directory. You can implement the scanner in C++,C,Go,Python or Java. However, all template code for all assignments will be given in Go but feel free to translate them into your preferred language of choice. Please note that this template code will not run as it is given to you. It only acts as means to give you some insight about how to structure your code. Lets first look at the hw2/template/token.go
file
type TokenType string
const (
ILLEGAL = "ILLEGAL"
INT = "INT"
)
type Token struct {
Type TokenType
Literal string
}
This provides a struct
that allows you to create tokens once you identify that a substring is a token. You can update the const
declaration with additional token types based on the Cal language. The Literal
field is the substring that you identified for the token.
The actual implementation of the Go scanner will be defined in the hw2/template/scanner.go
file. Inside this file, you have the following code
type Scanner struct {
// TODO: IMPLEMENT ME!
}
func New(input string) *Scanner {
//Feel free to change this implementation
scanner := &Scanner{}
return scanner
}
func (l *Scanner) NextToken() token.Token {
// TODO: Implement Me
return token.Token{token.INT, "0"}
}
You may update the Scanner
struct with additional fields to help you implement the scanner. The New
initializes and allocates a new scanner. You should think of the argument input
as the entire source code for a Cal program. You want to hold on to this input because you’ll use it in the NextToken()
function. However, You may modify this function as you please. The NextToken
function represents the token stream described in lecture. Each time the function is called it returns the next token identified from the input
source code. You can modify this code as you please.
You will need to define a main
program that will produce string representations for your token types when provided with a Cal source code. You do need to include line numbers as metadata along with any value data for token types that should tag along the associated value. You do not have to include column numbers. An example of this is shown in the next section.
Building, Running, and Testing the Scanner¶
You will need to provide a README
file that clearly and explicitly explains how we can build and run your scanner program on the CS linux servers. Your scanner is required to work correctly on a CS linux machine. Your scanner program is required to take in a single command line file that represents a Cal program. For example, if we had a file called simple1.cal
that contained the following code
let a = 23;
let c = a + 4;
print c;
then we can run the program as follows in Go and Python, assuming your scanner’s main code lives in a file scanner.go
/ scanner.py
$ go run scanner.go simple1.cal
LET(1)
ID(1, "a")
EQUALS(1)
INT_LIT(1,23)
SEMICOLON(1)
LET(2)
ID(2, "c")
EQUALS(2)
ID(2, "a")
PLUS(2)
INT_LIT(2,4)
SEMICOLON(2)
PRINT(3)
ID(3, "c")
SEMICOLON(3)
$ python3 scanner.py simple1.cal
LET(1)
ID(1, "a")
EQUALS(1)
INT_LIT(1,23)
SEMICOLON(1)
LET(2)
ID(2, "c")
EQUALS(2)
ID(2, "a")
PLUS(2)
INT_LIT(2,4)
SEMICOLON(2)
PRINT(3)
ID(3, "c")
SEMICOLON(3)
How you organize the structure of your code in your repository is up to you. You can also git rm
the template code for your final submission.
Grading¶
For this assignment, the weights will be:
SAQ 1 5%
SAQ 2 5%
Scanner:Completeness 50%
Scanner:Correctness 30%
Scanner:Design & Style 10%
Design, Style and Cleaning up¶
Before you submit your final solution, you should, remove
any print statements that you added for debugging purposes and
all in-line comments of the form: “YOUR CODE HERE” and “TODO …”
Think about your function decomposition. No code duplication. This homework assignment is relatively small so this shouldn’t be a major problem but could be in certain problems.
As you clean up, you should periodically save your file and run your code through the tests to make sure that you have not broken it in the process.
Submission¶
Before submitting, make sure you’ve added, committed, and pushed all your code to GitHub. You must submit your final work through Gradescope (linked from our Canvas site) in the “Homework #2” assignment page via two ways,
Uploading from Github directly (recommended way): You can link your Github account to your Gradescope account and upload the correct repository based on the homework assignment. When you submit your homework, a pop window will appear. Click on “Github” and then “Connect to Github” to connect your Github account to Gradescope. Once you connect (you will only need to do this once), then you can select the repsotiory you wish to upload and the branch (which should always be “main” or “master”) for this course.
Uploading via a Zip file: You can also upload a zip file of the homework directory. Please make sure you upload the entire directory and keep the initial structure the same as the starter code; otherwise, you run the risk of not passing the automated tests.
Depending on the assignment, once you submit your work, an “autograder” will run. This autograder should produce the same test results as when you run the code yourself; if it doesn’t, please let us know so we can look into it. A few other notes:
You are allowed to make as many submissions as you want before the deadline.
Please make sure you have read and understood our Late Submission Policy.
Your completeness score is determined solely based on the automated tests, but we may adjust your score if you attempt to pass tests by rote (e.g., by writing code that hard-codes the expected output for each possible test input).
Gradescope will report the test score it obtains when running your code. If there is a discrepancy between the score you get when running our grader script, and the score reported by Gradescope, please let us know so we can take a look at it.