Course Search Engine: Crawling the catalog¶
Due: Saturday, January 29th at 4:30pm
You must work alone on this assignment.
In this assignment, you will build a course search engine. You will build a web crawler that crawls a shadow copy of the college catalog to construct a simple index. The purpose of this assignment is to give you more Python programming experience and to have you work with HTML documents extracted from the web.
Getting started¶
Using the invitation URL provided on Ed Discussion, create a repository for your PA #2 files.
After accepting the assignment, GitHub will take a few minutes to create your repository. You should receive an email from GitHub when your repository is ready. Normally, it's ready within seconds and you can just refresh the page.
Next, make sure you've set the GITHUB_USERNAME variable by running the following (replacing replace_me with your GitHub username):

GITHUB_USERNAME=replace_me

(Remember, you can double-check whether the variable is properly set by running echo $GITHUB_USERNAME.)
And finally, run these commands (if you don’t recall what some of these commands do, they are explained at the end of the Git Tutorial):
cd ~/capp30122
git clone git@github.com:uchicago-CAPP30122-win-2022/pa2-$GITHUB_USERNAME.git
cd pa2-$GITHUB_USERNAME
You will find the files you need for the programming assignment directly in the root of your repository.
Before describing the specifics of your task, we will briefly explain how to work with URLs and grab pages from the web.
Working with URLs¶
URL stands for uniform resource locator. A URL, for example:
http://www.classes.cs.uchicago.edu/archive/2022/winter/30122-1/pa/index.html
has the following format:
protocol://site address/path/filename
The protocol field (http, in our example) specifies which protocol should be used to interact with the resource. Common protocols include http, https, and ftp. We will be working with http and https, the hypertext transfer protocols. The site address specifies the host computer name (www), the domain name (classes.cs.uchicago), and the top-level domain (edu). The path specifies part of the hierarchical location (archive/2022/winter/30122-1/pa) of the desired file (index.html) in the file system on the host computer. The path to the archive directory is supplied by the configuration of the web server program running on the host machine. The full path in the department's file system for the example above is:

/stage/classes/archive/2022/winter/30122-1/pa/index.html
URLs
can also include a port number, which we will not be using.
Within a web page, a link can refer to an absolute URL (that is, it starts with http:// or https://) or a relative URL, which can
be converted into an absolute URL. A relative URL refers to a page
with a location that is relative to the current page (much like
relative path names in Linux). For example, here is a URL taken from
the index page for assignments mentioned above:
pa1/index.html
The absolute URL for this page is:
http://www.classes.cs.uchicago.edu/archive/2022/winter/30122-1/pa/pa1/index.html
URLs can use a fragment (denoted by a #) to refer to a specific location (known as an anchor) on a page. For our purposes, we only need the part of the URL that comes before the #.
We have provided some useful functions for working with URLs:
util.is_absolute_url(url):
Takes a URL and returns true if it is an absolute URL and false otherwise.

util.convert_if_relative_url(url1, url2):
Takes the URL for a page (url1) and a URL found on that page (url2). If url2 is an absolute URL, then this function will just return it. If url2 is a relative URL, the function will return an equivalent absolute URL. The function will return None if it is unable to convert the URL. The following code, for example:

url1 = "http://www.classes.cs.uchicago.edu/archive/2022/winter/30122-1/pa/index.html"
url2 = "pa1/index.html"
util.convert_if_relative_url(url1, url2)

would yield:

http://www.classes.cs.uchicago.edu/archive/2022/winter/30122-1/pa/pa1/index.html

as expected.
util.remove_fragment(url):
Takes a URL and removes the # and any text following it in the URL. For example, given the relative URL index.html#minorprogramincomputerscience, the function would return index.html. If the URL does not contain a fragment, the function just returns the original URL.
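These helpers can be combined to normalize a link found on a page. Here is a minimal sketch (the link value below is illustrative):

import util

page_url = "http://www.classes.cs.uchicago.edu/archive/2022/winter/30122-1/pa/index.html"
link = "pa1/index.html#example"

# Drop the fragment, then convert to an absolute URL (a no-op if the URL
# is already absolute).  Note that convert_if_relative_url may return None.
absolute = util.convert_if_relative_url(page_url, util.remove_fragment(link))
if absolute is not None:
    print(absolute)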
Grabbing a web page¶
We will be using requests, a Python package, for making http connections and retrieving documents. We have supplied the following wrapper functions:

util.get_request(url):
Takes a URL and returns a request object. This function will return None if the request fails.

util.read_request(request):
Takes a request object and returns a string containing the HTML document retrieved by the request.
This function may generate a warning of the form:
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
You may ignore this warning.
util.get_request_url(request):
Takes a request object and returns the associated URL. Note that the returned URL may be different than the URL provided in the original call to get_request. This seeming anomaly occurs when the original URL redirects to another URL.
Note that your crawler should check for failures (that is, check whether the output of any of these functions is None before proceeding further).
Note also that even though a recent lab used a different library to download web pages, for this assignment you should use the wrapper functions above.
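For example, here is a minimal sketch of retrieving a page with these wrappers (the URL and variable names are illustrative):

import util

request = util.get_request("http://www.classes.cs.uchicago.edu/archive/2022/winter/30122-1/pa/index.html")
if request is not None:
    html = util.read_request(request)
    true_url = util.get_request_url(request)   # may differ if the request was redirected
    if html is not None and true_url is not None:
        # safe to parse html and to record true_url as visited
        pass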
Catalog pages¶
For this assignment, we are interested in three types of HTML tags: a, div, and p. We will use the first to find links to other pages, the second to find sections of the page pertinent to a particular course, and the third to extract actual course titles and descriptions from within the div tags.
For example, here are three links from a page in the course catalog:
<a href="http://college.uchicago.edu">The University of Chicago</a>
<a href="../../azindex/index.html">AZ Index</a>
<a href="index.html#minorprogramincomputerscience"></a>
The first specifies a link using an absolute URL. The second specifies a link using a relative URL. And the third specifies a link using a relative URL with a fragment.
Course descriptions are enclosed in a div tag of the form <div class="courseblock main">. For example, here is the entry for Software Construction:
<div class="courseblock main">
<p class="courseblocktitle">
<strong>CMSC 22001. Software Construction. 100 Units.</strong>
</p>
<p class="courseblockdesc">
Large software systems are difficult to build. The course discusses
...
will be expected to actively participate in team projects in this
course.
</p>
<p class="courseblockdetail">
Instructor(s): S. Lu     Terms Offered: Autumn<br />
Prerequisite(s): CMSC 15400<br />
</p>
</div>
Course titles and descriptions can be found in p tags nested within the div tag.
You will construct an index that maps words to lists of course identifiers. A course identifier is an integer that uniquely identifies a specific course code ("CMSC 12200", for example). We will provide a dictionary that maps course codes to course identifiers. Course codes can be found at the beginning of each course title. The string &nbsp; represents a single HTML character entity (a non-breaking space). Certain reserved characters in HTML must be replaced with character entities. We saw in lecture that in order to include a < as text, we need to use the &lt; character entity. This is required since < is a keyword of the HTML syntax (i.e., it's part of defining a tag). In this assignment, you only need to worry about the &nbsp; entity. However, the beautifulsoup4 .text attribute will convert the &nbsp; into a Unicode non-breaking space character for you. You will only need to replace it with a regular space when you construct a course code.
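For example, here is a minimal sketch of constructing a course code from a title string, assuming the title always begins with the course code followed by a period and a space (the title below is illustrative):

# "\xa0" is the Unicode non-breaking space that bs4 substitutes for &nbsp;
title_text = "CMSC\xa022001. Software Construction. 100 Units."
course_code = title_text.replace("\xa0", " ").split(". ")[0]   # "CMSC 22001"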
You might wonder why we need both course codes and course identifiers. In the next assignment, you will be using a relational database that will contain an index like the one you are constructing along with information that we have scraped from the UChicago time schedules. The course identifiers will be used in this database to link different types of information about a course, such as the title, section dates, times, and locations for a given quarter.
Sequences¶
If a group of courses forms a sequence, the UChicago course catalog lists the course title and description for both the sequence (class="courseblock main") and the individual courses associated with that sequence (class="courseblock subsequence"). We will refer to the individual courses as subsequences. See below for an example:
<div class="courseblock main">
<p class="courseblocktitle"><strong>CMSC 12100-12200-12300.
Computer Science with Applications I-II-III.</strong>
</p>
<p class="courseblockdesc">
This three-quarter sequence teaches computational thinking and
skills to students who are majoring in the sciences, mathematics,
and economics. ... Students learn Java, Python, R and C++.
</p>
</div>
<div class="courseblock subsequence">
<p class="courseblocktitle"><strong>CMSC 12100.
Computer Science with Applications I. 100 Units.</strong>
</p>
<p class="courseblockdesc">
</p>
<p class="courseblockdetail">
Instructor(s): A. Rogers     Terms Offered: Autumn<br />
Prerequisite(s): Placement into MATH 15200 or higher, or consent of instructor<br />
Note(s): This course meets the general education requirement in the mathematical sciences. <br />
</p>
</div>
<div class="courseblock subsequence">
<p class="courseblocktitle"><strong>CMSC 12200.
Computer Science with Applications II. 100 Units.</strong>
</p>
<p class="courseblockdesc">
</p>
<p class="courseblockdetail">
Instructor(s): A. Rogers     Terms Offered: Winter<br />
Prerequisite(s): CMSC 12100<br />
Note(s): This course meets the general education requirement in the
mathematical sciences. <br />
</p>
</div>
<div class="courseblock subsequence">
<p class="courseblocktitle"><strong>CMSC 12300.
Computer Science with Applications III. 100 Units.</strong>
</p>
<p class="courseblockdesc">
The course revolves around core ideas behind the management and
computation of large volumes of data ("Big Data"). ...
Hadoop. ... the entire dataset.<br />
</p>
<p class="courseblockdetail">
Terms Offered: Spring<br />
Prerequisite(s): CMSC 12200<br />
</p>
</div>
Working with HTML¶
You will use Beautiful Soup to parse and navigate HTML documents with the parser html5lib. Here is the reference for beautifulsoup4 that provides more details on the material covered in class.
You may need to upgrade beautifulsoup4 and html5lib. You can do so with the following commands on the Linux command-line:
pip3 install --upgrade beautifulsoup4
pip3 install --upgrade html5lib
You will import Beautiful Soup as bs4. To create a soup object from a string text containing the text of an HTML document, you can use the statement:

soup = bs4.BeautifulSoup(text, "html5lib")

Note that in our case, the parameter text will be the string returned from a call to read_request.
Once you have a soup object, you can use the find_all method to extract a list of the tag objects in the soup that match a particular pattern. For example, the expression:

soup.find_all("div")

would return a list of tags, one for each div block that occurs in the text. You can narrow the search by including information about attributes and their values. For example, the expression:

soup.find_all('div', class_="courseblock main")

would return a list of tag objects, one for each div block in which the class attribute has the value courseblock main.
Note the use of class_ as the attribute name rather than the expected name class. The underscore is required because class is a reserved word in Python.
There are various useful operations you can perform on a tag object t and attribute name a:

t.text
Extracts the text associated with the tag t.

t.has_attr(a)
Returns true if the tag contains an attribute a.

t[a]
Returns the value associated with attribute a in the tag. This expression will fail if a is not an attribute of t.
You can also call find_all
on tag objects to extract nested tags.
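Putting these operations together, here is a minimal sketch of extracting the title and description text from each courseblock main block, assuming html is the string returned by util.read_request (the printed labels are illustrative):

import bs4

soup = bs4.BeautifulSoup(html, "html5lib")
for div in soup.find_all("div", class_="courseblock main"):
    for p in div.find_all("p"):
        # p["class"] is a list, because class is a multi-valued HTML attribute
        if p.has_attr("class") and "courseblocktitle" in p["class"]:
            print("Title:", p.text)
        elif p.has_attr("class") and "courseblockdesc" in p["class"]:
            print("Description:", p.text)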
Your task¶
A search engine relies on an index that maps words to the pages in which they appear. This index is built by crawling the web and indexing the information on the pages the crawler visits. The crawler starts at a specified page and follows the links on that page to other pages and from those pages to yet more pages and so on until it has visited all pages reachable from the initial page.
Your task is to complete the function go in crawler.py. This function takes a number of pages to crawl, the name of a JSON file that contains the course code to course identifier dictionary, and the output filename as arguments. Your implementation should crawl the course catalog to create an index that maps words to lists of course identifiers, and then generate a CSV file from the index.
Mini crawler
Your crawler should:
Start at the specified starting URL.
Only visit pages that have URLs that are okay to follow (see below).
Stop after it has visited the specified maximum number of pages.
Visit the pages in order of occurrence (first in, first out).
Visit a page only once to avoid cycles (for example, A->B->C->A).
Avoid calling get_request on the same URL multiple times.
We have provided a function, util.is_url_ok_to_follow(url, limiting_domain), that takes a URL and a domain and returns True if the URL:
is an absolute URL,
falls within a specified domain,
does not contain an “@” or “mailto:” symbol,
does not direct the crawler to archived course catalog pages, and
has a file name that ends with either no extension or the extension “html”.
The provided code also includes the appropriate starting URL and limiting domain.
You should visit the links in “first-in, first-out” order, which means that a queue is the appropriate data structure for managing the links. A queue typically has operations to add values to the end, to remove values from the front or head, and to determine whether the queue is empty. You can implement a queue with a few simple list operations or you can use the Python Queue library.
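Here is a minimal sketch of the crawl loop, using collections.deque as the queue; the function name and structure are illustrative, not required:

import collections
import util

def crawl(start_url, limiting_domain, max_pages):
    queue = collections.deque([start_url])
    visited = set()
    num_visited = 0
    while queue and num_visited < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        request = util.get_request(url)
        if request is None:
            continue
        # the request may have been redirected, so track both URLs as visited
        true_url = util.get_request_url(request)
        if true_url is None or true_url in visited:
            continue
        visited.add(url)
        visited.add(true_url)
        num_visited += 1
        html = util.read_request(request)
        if html is None:
            continue
        # ... parse html here, index any courses found, and append each link
        # for which util.is_url_ok_to_follow(link, limiting_domain) is True
    return visited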
Starting at the main page of our shadow catalog with a maximum number of pages to visit of 1000, our implementation visited fewer than 100 pages. You can find a list of the links visited (in the order visited) here.
When tracking visited pages, make sure to handle the fact that, as noted above, the URL returned by get_request_url may differ from the URL used in the call to get_request.
Do not build a recursive crawler. Instead, you should use a queue to keep track of which pages still need to be visited.
Mini indexer
You will need to write code that produces an index that maps a word to a list of course identifiers. A course identifier should be included in the index entry for a word if the word appears in the title or the description of the course in the course catalog. Your implementation should construct the index as your crawler visits the pages.
Your indexer should be case insensitive. For example, your indexer should not distinguish between "Foo" and "foo". Recall that you can convert a string, s, to lower case using the expression s.lower().
For the purposes of this assignment, we will define a word to be a string of length at least one that starts with a letter (A-Z or a-z) and contains only letters, digits (0-9), and/or underscores (_). Be sure to remove any of the following punctuation marks at the end of a word: !, ., or :. You only need to worry about those punctuation marks. You may use a regular expression (if you know about them), but it is not necessary.
Some words occur very frequently in normal English text ("a", "and", "of", etc.) and some occur very frequently in the catalog ("include", "students", "units", etc.). We have defined a constant (INDEX_IGNORE) with a list of words that your indexer should not include in the index.
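As an illustration, here is one possible tokenizer under these rules; a sketch only, and the regular expression is optional (INDEX_IGNORE is the constant defined in the provided code):

import re

def find_words(text):
    # Return the set of indexable words in text, lower-cased.
    words = set()
    for token in text.lower().split():
        token = token.rstrip("!.:")   # strip trailing !, ., or :
        if re.fullmatch(r"[a-z][a-z0-9_]*", token) and token not in INDEX_IGNORE:
            words.add(token)
    return words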
If subsequences exist for a given courseblock main, your indexer should map words in the main title and description to all the courses in the sequence. In addition, it should map words that appear only in the description of a subsequence to the course identifier for that subsequence, not to the identifiers for all courses in the sequence. For example, the unique identifiers that correspond to "CMSC 12100", "CMSC 12200", and "CMSC 12300" should appear in the index entries for "sciences", "mathematics", and "economics", while only the unique identifier for "CMSC 12300" should appear in the index entry for "hadoop".
We have provided a useful function, named util.find_sequence(tag), for working with subsequences. This function takes a bs4 tag and checks for associated subsequences. If subsequences exist, the function returns a list of the div tag objects for the subsequences; otherwise, it returns an empty list.
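For instance, the sequence logic might look roughly like the sketch below, where div is a courseblock main tag, index maps words to sets of course identifiers, find_words is the tokenizer sketched above, and identifier_for is a hypothetical helper that looks up a block's course code in the provided dictionary:

import util

def index_course_block(div, index):
    subsequences = util.find_sequence(div)
    if subsequences:
        # words in the main title/description map to every course in the sequence
        ids = [identifier_for(sub) for sub in subsequences]   # hypothetical helper
        for word in find_words(div.text):
            for course_id in ids:
                index.setdefault(word, set()).add(course_id)
        # words appearing only in a subsequence block map to that course alone
        for sub, course_id in zip(subsequences, ids):
            for word in find_words(sub.text):
                index.setdefault(word, set()).add(course_id)
    else:
        course_id = identifier_for(div)   # hypothetical helper
        for word in find_words(div.text):
            index.setdefault(word, set()).add(course_id)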
Output
In part 2 of this assignment, you will be loading the index generated in this part of the assignment into a relational database. To facilitate this process, you need to create a CSV file from your index. Each line in the file should contain one course identifier and a word, in that order, separated by a bar (|). Here are the first few lines of the file generated by our implementation:
1866|aanl
1867|aanl
1868|aanl
1869|aanl
1870|aanl
1985|abandon
2194|abandon
484|abandoned
2336|abandoned
1175|abandonment
Notice that the same word may appear in multiple lines of the file (corresponding to the word appearing in the descriptions of multiple courses). Any given course identifier/word pair should occur at most once. (FYI, aanl is a department code, not a true English word, but it fits our definition of "word".)
Even though the output above is ordered by word (primary) and course identifier (secondary), you are not required to produce output in sorted order.
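One simple way to produce the file, assuming index maps each word to a set of course identifiers and output_filename is the output filename passed to go (a sketch, not a required approach):

import csv

with open(output_filename, "w", newline="") as f:
    writer = csv.writer(f, delimiter="|")
    for word, course_ids in index.items():
        for course_id in course_ids:
            writer.writerow([course_id, word])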
Testing the Crawler/Indexer
We have provided test code for this task in test_crawler.py.
Code organization¶
Think carefully about the appropriate decomposition of work for this task. You should not do everything in one function. Our implementation of crawler.py has five functions, including go, and is about 150 lines of code (not including headers or comments).
Grading¶
Programming assignments will be graded according to a general rubric. Specifically, we will assign points for completeness, correctness, design, and style. (For more details on the categories, see our PA Rubric page.)
The exact weights for each category will vary from one assignment to another. For this assignment, the weights will be:
Completeness: 65%
Correctness: 10%
Design: 15%
Style: 10%
Obtaining your test score¶
The completeness part of your score will be determined using automated tests. To get your score for the automated tests, simply run the following from the Linux command-line. (Remember to leave out the $ prompt when you type the command.)
$ py.test
$ ./grader.py
Notice that we're running py.test without the -k or -x options: we want it to run all the tests. If you're still failing some tests and don't want to see the output from all the failed tests, you can add the --tb=no option when running py.test:
$ py.test --tb=no
$ ./grader.py
Take into account that the grader.py program will look at the results of the last time you ran py.test, so if you make any changes to your code, you need to make sure to re-run py.test. You can also run py.test followed by the grader on one line by running this:
$ py.test --tb=no; ./grader.py
Cleaning up¶
Before you submit your final solution, you should remove any print statements that you added for debugging purposes, and all in-line comments of the form "YOUR CODE HERE" and "REPLACE …".
Also, check your code against the style guide. Did you use good variable names? Do you have any lines that are too long? And so on.
Do not remove header comments, that is, the triple-quote strings that describe the purpose, inputs, and return values of each function.
As you clean up, you should periodically save your file and run your code through the tests to make sure that you have not broken it in the process.
Submission¶
You will be submitting your work through Gradescope (linked from our Canvas site). The process will be the same as with the short exercises: Gradescope will fetch your files directly from your PA2 repository on GitHub, so it is important that you remember to commit and push your work! You should also get into the habit of making partial submissions as you make progress on the assignment; remember that you're allowed to make as many submissions as you want before the deadline.
To submit your work, go to the "Gradescope" section on our Canvas site. Then, click on "Programming Assignment #2". Connecting your GitHub repository to Gradescope follows the same process as last quarter. You will see a "Connect to GitHub" button. Pressing that button will take you to a GitHub page asking you to confirm that you want to authorize Gradescope to access your GitHub repositories. Just click on the green "Authorize gradescope" button.
Then, under “Repository”, make sure to select your uchicago-CAPP30122-win-2022/pa2-$GITHUB_USERNAME.git
repository.
Under “Branch”, just select “main”.
Finally, click on “Upload”. An autograder will run, and will report back a score. Please note that this autograder runs the exact same tests (and the exact same grading script) described in the section above. If there is a discrepancy between the tests when you run them on your computer, and when you submit your code to Gradescope, please let us know.
Acknowledgments: This assignment was written by Anne Rogers and originally inspired by an assignment written by Mari Hearst at UC Berkeley. Trevor Coyle suggested the idea of building a course search engine and helped with the necessary revisions. The Django interface was built by Gustav Larsson.