==========================================
Course Search Engine: Crawling the catalog
==========================================
**Due: Tuesday, Jan 19th at 5pm**
You may work alone or in a pair on this assignment.
Over the course of the next two assignments, you will be building
parts of a course search engine. We have put together a simple web
application that will use your code to do the work of finding courses
that match the user's criteria.
.. image:: screenshot.png
:height: 4in
:align: center
In this part of the assignment, you will be building a web crawler
that crawls a shadow copy of the college catalog to construct a simple
index. The purpose of this assignment is to give you more Python
programming experience and to have you work with HTML documents
extracted from the web. You will use the index in the second part of
the assignment.
Before describing the specifics of your task, we will briefly explain
how to work with URLs and grab pages from the web.
Working with URLs
-----------------
URL stands for uniform resource locator. A URL, for example::
http://www.classes.cs.uchicago.edu/archive/2016/winter/12200-1/index.html
has the following format::
protocol://site address/path/filename
The protocol field (``http``, in our example) specifies which protocol
should be used to interact with the resource. Common protocols include
http, https, ftp, etc. We will be working with http and https, the
hypertext transfer protocols. The site address specifies the host
computer name (``www``), the domain name (``classes.cs.uchicago``),
and the top level domain (``edu``). The path specifies part of the
hierarchical location (``archive/2016/winter/12200-1``) of the desired
file (``index.html``) in the file system on the host
computer. The path to the ``archive`` directory is supplied by the
host machine. The full path in the department's file system for the
example above is:
``/stage/classes/archive/2016/winter/12200-1/index.html``. URLs
can also include a port number, which we will not be using.
Within a web page, a link can refer to an absolute URL (that is, it
starts with ``http://`` or ``https://``) or a relative URL, which can be converted into
an absolute URL. A relative URL refers to a page with an address that
is relative to the current page (much like relative path names in
Linux). For example, here is a relative URL found on the index page
mentioned above::
pa/pa1/index.html
The absolute URL for this page is::
http://www.classes.cs.uchicago.edu/archive/2016/winter/12200-1/pa/pa1/index.html
URLs can use a fragment (denoted by a ``#``) to refer to a specific
location (known as an anchor) on a page. For our purposes, we only
need the part of the URL that comes before the ``#``.
We have provided some useful functions for working with URLs:
``util.is_absolute_url(url):``
Takes a URL and returns true if it is an absolute URL and false
otherwise.
``util.convert_if_relative_url(url1, url2):``
Takes the URL for a page (``url1``) and a URL found on that page (``url2``). If ``url2`` is an absolute URL, then this function will just return it. If ``url2`` is a relative URL, the function will return an equivalent absolute URL. The function will return ``None`` if it is unable to convert the URL.
The following code, for example::
url1 = "http://www.classes.cs.uchicago.edu/archive/2016/winter/12200-1/index.html"
url2 = "pa/pa1/index.html"
util.convert_if_relative_url(url1, url2)
would yield::
http://www.classes.cs.uchicago.edu/archive/2016/winter/12200-1/pa/pa1/index.html
as expected.
``util.remove_fragment(url):``
Takes a URL and removes the ``#`` and any text following it in the URL. For example, given the relative URL ``index.html#minorprogramincomputerscience``, the function would return ``index.html``. If the URL does not contain a fragment, the function just returns the original URL.
Grabbing a web page
-------------------
We will be using ``requests``, a Python package, for making http
connections and retrieving documents. We have supplied the following
wrapper functions:
``util.get_request(url):``
Takes a URL and returns a request object. This function will return ``None`` if the request fails.
``util.read_request(request):``
Takes a request object and returns a string containing the HTML document retrieved by the request.
This function may generate a warning of the form::
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
You may ignore this warning.
``util.get_request_url(request):``
Takes a request object and returns the associated URL. Note that
the returned URL may be different than the URL provided to the
original call to ``get_request``. This change occurs when the
original URL redirects to another URL.
Note that your crawler should be checking for failures (that is, checking
if the output of any of these functions is ``None`` before proceeding
further).
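As a concrete example, here is a minimal sketch of fetching a single
page with these wrappers (the helper ``fetch_page`` is our own name,
not part of the provided code)::

    import util

    def fetch_page(url):
        """Return (true_url, text) for a page, or None if any step fails."""
        request = util.get_request(url)
        if request is None:          # connection failed
            return None
        text = util.read_request(request)
        if text is None:             # read failed
            return None
        # The URL associated with the request may differ from the one we
        # asked for if the server redirected us.
        return (util.get_request_url(request), text)
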
Catalog Pages
-------------
For this assignment, we will be interested in three different types of
HTML tags: ``a``, ``div``, and ``p``. We will use the first to find
links to other pages, the second to find course descriptions, and the
third to extract actual course titles and descriptions from within
the ``div`` tags.
For example, here are three links of the kind found on a page in the
course catalog (the ``href`` values shown here are illustrative)::

    <a href="http://www.uchicago.edu/">The University of Chicago</a>
    <a href="thecollege/index.html">The College</a>
    <a href="azindex.html#A">AZ Index</a>

The first specifies a link using an absolute URL. The second specifies
a link using a relative URL. And the third specifies a link using a
relative URL with a fragment.
Course descriptions are enclosed in a ``div`` tag of the form
``<div class="courseblock main">``. For example, here is the entry for
Software Construction (tag attributes other than the ``div`` class
are omitted)::

    <div class="courseblock main">
      <p>CMSC&nbsp;22001. Software Construction. 100 Units.</p>
      <p>Large software systems are difficult to build. The course discusses
      ... will be expected to actively participate in team projects in this
      course.</p>
      <p>Instructor(s): S. Lu Terms Offered: Autumn
      Prerequisite(s): CMSC 15400</p>
    </div>

Course titles and descriptions can be found in ``p`` tags nested
within the ``div`` tag.
You will be constructing an index that maps words to lists of course
identifiers. A `course identifier` is an integer that uniquely
identifies a specific course code ("CMSC 12200", for example). We
will provide a dictionary that maps course codes to course
identifiers. Course codes can be found at the beginning of each
course title. The string ``&nbsp;`` represents a single HTML
character entity (a non-breaking space). You will need to replace it
with a regular space when you construct a course code.
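For example, here is a sketch of constructing a course code from a
title string. Note that, depending on how you extract the text, the
entity may appear as the literal string ``&nbsp;`` or as the decoded
non-breaking space character (``"\xa0"`` in Python)::

    # Title text with the non-breaking space already decoded to "\xa0".
    title = "CMSC\xa022001. Software Construction. 100 Units."
    # Replace the non-breaking space and keep the text up to the first
    # period: "CMSC 22001".
    course_code = title.replace("\xa0", " ").split(".")[0]
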
You might wonder why we need both course codes and course identifiers.
In the next assignment, you will be using a relational database that will
contain an index like the one you are constructing along with
information that we have scraped from the UChicago time schedules.
The course identifiers will be used in this database to link different
types of information about a course, such as the title, section
dates, times, and locations for a given quarter.
Sequences
~~~~~~~~~
If a group of courses forms a sequence, the UChicago course catalog
lists the course title and description for both the sequence (``class="courseblock main"``) and the individual courses associated with
that sequence (``class="courseblock subsequence"``). We will refer to
the individual courses as subsequences. See below for an example::
CMSC 12100-12200-12300.
Computer Science with Applications I-II-III.
This three-quarter sequence teaches computational thinking and
skills to students who are majoring in the sciences, mathematics,
and economics. ... Students learn Java, Python, R and C++.
CMSC 12100.
Computer Science with Applications I. 100 Units.
Instructor(s): A. Rogers Terms Offered: Autumn
Prerequisite(s): Placement into MATH 15200 or higher, or consent of instructor
Note(s): This course meets the general education requirement in the mathematical sciences.
CMSC 12200.
Computer Science with Applications II. 100 Units.
Instructor(s): A. Rogers Terms Offered: Winter
Prerequisite(s): CMSC 12100
Note(s): This course meets the general education requirement in the
mathematical sciences.
CMSC 12300.
Computer Science with Applications III. 100 Units.
The course revolves around core ideas behind the management and
computation of large volumes of data ("Big Data"). ...
Hadoop. ... the entire dataset.
Terms Offered: Spring
Prerequisite(s): CMSC 12200
Working with HTML
-----------------
We will be using a Python library named Beautiful Soup to parse and
navigate HTML documents. Here is a brief summary of the material on
Beautiful Soup that we will discuss in class. You may need to install
Beautiful Soup on your VM. You can install it using the following
command on the Linux command-line::
sudo pip3 install beautifulsoup4
You will import Beautiful Soup as ``bs4``. To create a soup object
from a string ``text`` containing the text of an HTML document, you
can use the statement::
soup = bs4.BeautifulSoup(text, "html5lib")
Note that in our case, the string ``text`` will be the value returned
from a call to ``read_request``.
Once you have a soup object, you can use the ``find_all`` method to
extract a list of the tag objects in the soup that match a particular
pattern. For example, the expression::
soup.find_all("div")
would return a list of tags, one for each ``div`` block that occurs in
the text. You can narrow the search by including information about
attributes and their values. For example, the expression::
soup.find_all('div', class_="courseblock main")
would return a list of tag objects, one for each ``div`` block in
which the ``class`` attribute has the value ``courseblock main``.
Note the use of ``class_`` as the attribute name rather than the
expected name of ``class``. The underscore is required because
``class`` is a reserved word in Python.
There are various useful operations you can do on a tag object ``t`` and
attribute name ``a``:
``t.text``
Extract the text associated with the tag ``t``.
``t.has_attr(a)``
Returns true if the tag contains an attribute ``a``.
``t[a]``
Returns the value associated with attribute ``a`` in the tag. This expression will fail if ``a`` is not an attribute in ``t``.
You can also call ``find_all`` on tag objects to extract nested tags.
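Putting these pieces together, here is a short sketch that prints the
text of every paragraph inside each main course block, assuming the
string ``text`` holds the HTML of a catalog page::

    import bs4

    soup = bs4.BeautifulSoup(text, "html5lib")
    for div in soup.find_all("div", class_="courseblock main"):
        # Course titles and descriptions live in p tags nested in the div.
        for p in div.find_all("p"):
            print(p.text)
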
Your task
---------
A search engine relies on an index that maps words to the pages in
which they appear. This index is built by crawling the web and
indexing the information on the pages the crawler visits. The crawler
starts at a specified page and follows the links on that page to other
pages and from those pages to yet more pages and so on until it has
visited all pages reachable from the initial page.
Your task is to complete the function ``go`` in ``crawler.py``. This
function takes the maximum number of pages to crawl, the name of a JSON file
that contains the course code to course identifier dictionary, and the
output filename as arguments. Your implementation should crawl the
course catalog to create an index that maps words to lists of course
identifiers and then generate a CSV file from the index.
**Mini crawler**
Your crawler should:
#. Start at the specified starting URL.
#. Only visit pages that have URLs that are OK to follow (see below).
#. Stop after it has visited the specified maximum number of pages.
#. Visit the pages in order of occurrence (first in, first out).
#. Visit a page only once to avoid cycles (for example, A->B->C->A).
#. Avoid calling ``get_request`` on the same URL multiple times.
We have provided a function, ``is_url_ok_to_follow(url,
limiting_domain)``, that takes a URL and a domain and returns true if
the URL:
- is an absolute URL,
- falls within a specified domain,
- does not contain an "@" or "mailto:" symbol,
- does not direct the crawler to archived course catalog pages, and
- has a file name that ends with either no extension or the extension
"html".
The provided code also includes the appropriate starting URL and
limiting domain.
You should visit the links in "first-in, first-out" order, which means
that a queue is the appropriate data structure for managing the links.
A queue typically has operations to add values to the end, to remove
values from the front or head, and to determine whether the queue is
empty. You can implement a queue with a few simple list operations or
you can use the Python `queue library
<https://docs.python.org/3/library/queue.html>`_.
Starting at the main page of our shadow catalog with a maximum number
of pages to visit of 1000, our implementation visited fewer than 100
pages. You can find a list of the links visited (in the order
visited) `here `_.
When tracking visited pages, make sure to handle the fact that, as
noted above, the URL returned by ``get_request_url`` may differ from
the URL used in the call to ``get_request``.
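Here is one possible sketch of the crawl loop. It is not the required
design: it assumes ``is_url_ok_to_follow`` is available in
``crawler.py`` and elides the indexing work described below::

    import bs4
    import util
    from collections import deque

    def crawl(starting_url, limiting_domain, max_pages_to_crawl):
        queue = deque([starting_url])
        visited = set()
        num_pages = 0
        while queue and num_pages < max_pages_to_crawl:
            url = queue.popleft()
            if url in visited:
                continue
            visited.add(url)
            request = util.get_request(url)
            if request is None:
                continue
            # The request may have been redirected; track the true URL
            # as well so that the same page is not processed twice.
            true_url = util.remove_fragment(util.get_request_url(request))
            if true_url != url:
                if true_url in visited:
                    continue
                visited.add(true_url)
            text = util.read_request(request)
            if text is None:
                continue
            num_pages += 1
            # ... index the courses on this page (see below) ...
            soup = bs4.BeautifulSoup(text, "html5lib")
            for tag in soup.find_all("a"):
                if not tag.has_attr("href"):
                    continue
                href = util.remove_fragment(tag["href"])
                next_url = util.convert_if_relative_url(true_url, href)
                if (next_url is not None and
                        next_url not in visited and
                        is_url_ok_to_follow(next_url, limiting_domain)):
                    queue.append(next_url)
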
**Mini indexer**
You will need to write code that produces an index that maps a word to
a list of course identifiers. A course identifier should be included
in the index entry for a word if the word appears in the title or the
description of the course in the course catalog. **Your implementation
should construct the index as your crawler visits the pages.**
Your indexer should be case *insensitive*. You can convert a
string, ``s``, to lower case using the expression ``s.lower()``.
For the purposes of this assignment, we will define a word to be a
string of length at least one that starts with a letter and contains
only letters and digits. We recommend using a regular expression and
the regular expression library (``re``) to extract a list of words
from course titles and descriptions.
Some words occur very frequently in normal English text ("a", "and",
"of", etc.) and some occur very frequently in the catalog ("include",
"students", "units", etc). We have defined a constant
(``INDEX_IGNORE``) with a list of words that your indexer should
not include in the index.
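For instance, here is one reasonable sketch of a word-extraction
helper; the exact pattern is our reading of the definition above, and
``INDEX_IGNORE`` is the provided constant::

    import re

    # A word: a letter followed by zero or more letters and digits.
    WORD_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")

    def extract_words(text):
        """Extract the indexable words from a title or description."""
        return [word for word in WORD_RE.findall(text.lower())
                if word not in INDEX_IGNORE]
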
If subsequences exist for a given ``courseblock main``, your indexer
should map words in the main title and description to all the courses
in the sequence. In addition, it should map words that appear only in
the description of a subsequence to the course identifier for that
subsequence, not to the identifiers of all the courses in the sequence.
For example, the unique identifiers that correspond to "CMSC 12100",
"CMSC 12200", and "CMSC 12300" should appear in the index entries for
"sciences", "mathematics", and "economics" , while only the unique
identifier for "CMSC 12300" should appear in the index entry for
"hadoop".
We have provided a useful function, named
``util.find_sequence(tag)``, for working with subsequences. This
function takes a bs4 tag and checks for associated subsequences. If
subsequences exist, the function returns a list of the ``div`` tag objects
for the subsequences; otherwise, it returns an empty list.
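Here is a sketch of how these rules might fit together for a single
``courseblock main`` tag. The helpers are hypothetical:
``extract_words`` is sketched above, and ``get_course_id`` stands for
code that finds the course code in a block's title and looks up its
identifier in the provided dictionary::

    def index_courseblock(div, index):
        """Add the words of one courseblock main tag to the index."""
        main_words = extract_words(div.text)
        subsequences = util.find_sequence(div)
        if not subsequences:
            add_words(index, get_course_id(div), main_words)
        else:
            for sub in subsequences:
                # Main-block words map to every course in the sequence;
                # a subsequence's own words map only to that course.
                add_words(index, get_course_id(sub),
                          main_words + extract_words(sub.text))

    def add_words(index, course_id, words):
        """Map each word to course_id; index maps words to sets of ids."""
        for word in words:
            index.setdefault(word, set()).add(course_id)
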
**Output**
In part 2 of this assignment, you will be loading the index generated
in this part of the assignment into a relational database. To
facilitate this process, you need to create a CSV file from your index.
Each line in the file should contain one course identifier and a
word, in that order, separated by a vertical bar (``|``). Here are the
first few lines of the file generated by our implementation::
1866|aanl
1867|aanl
1868|aanl
1869|aanl
1870|aanl
1985|abandon
2194|abandon
484|abandoned
2336|abandoned
1175|abandonment
Notice that the same word may appear in multiple lines of the file
(corresponding to the word appearing in the descriptions of multiple
courses). Any given course identifier/word pair should occur at most
once. (FYI, ``aanl`` is a department code, not a true English word,
but it fits our definition of "word".)
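For example, assuming ``index`` maps words to sets of course
identifiers and ``output_filename`` is the output filename passed to
``go``, the file could be written with the ``csv`` library; sorting
makes the output deterministic, which simplifies comparing your file
with ours::

    import csv

    with open(output_filename, "w", newline="") as f:
        writer = csv.writer(f, delimiter="|")
        for word in sorted(index):
            for course_id in sorted(index[word]):
                writer.writerow([course_id, word])
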
Code organization
-----------------
Think carefully about the appropriate decomposition of work for this
task. You should **not** do everything in one function. Our
implementation of ``crawler.py`` is about 150 lines of code.
To test your code, we will run the Linux command::
python3 crawler.py
and then compare the resulting file ``catalog-index.csv`` with the
result of our implementation.
Getting started
---------------
Follow `these start-up instructions `_ if you plan to work alone.
Follow `these start-up instructions `_ if you are going to work in a pair.
Submission
---------------
Follow `these submission instructions `_ if you are working alone.
Follow `these submission instructions `_ if you are working in a pair.
*Acknowledgment: This assignment was originally inspired by an
assignment written by Marti Hearst at UC Berkeley. Trevor Coyle
suggested the idea of building a course search engine and helped with
the necessary revisions. The Django interface was built by Gustav Larsson.*