CS 122 Lab 1: BeautifulSoup¶
The goal of this lab is to learn how to scrape web pages using BeautifulSoup.
Introduction¶
BeautifulSoup is a Python library that parses HTML files and allows you to extract information from them. HTML files are the files that are used to represent web pages.
If you are interested in some question, whether it be related to finance, meteorology, sports, etc., there are almost certainly web sites that provide access to data that can be used to explore your question and gain insight.
Sometimes, you are lucky and find a website that both has the data you want and that helpfully provides the data in a machine-readable format. For instance, it may have a mechanism that allows you to submit queries and download CSV or JSON files back. A website that offers this service is said to “provide an API.”
Sadly, most websites are not so helpful. To gather data from them, you will instead need to go rogue and “scrape” them – submit requests as if they were coming from a web browser being operated by a human, but save the raw HTML output and interpret it programmatically in a Python script instead. When this is necessary, you will need to be able to parse the HTML code for the web page. And when you need to parse HTML, BeautifulSoup is the library to use.
Unfortunately, web pages are designed to present information to humans; web designers rely on the ability of humans to make sense of the data, no matter how it may be formatted. BeautifulSoup cannot analyze an entire web page and understand where the data is; it can only help you extract the data once you know where to find it. Therefore, applying BeautifulSoup to a scraping task involves:
- inspecting the source code of the web page in a text editor to infer its structure
- using information about the structure to write code that pulls the data out, employing BeautifulSoup
References¶
Here are links to reference material on HTML and BeautifulSoup that you may find useful during the course of this lab:
Example 1: Aviation Weather Observations¶
If you are interested in working with weather data, you may find METAR data to be useful.
Pilots rely on accurate weather information to operate aircraft safely. For instance, wind speeds and directions affect landing technique, because the pilot must compensate for the fact that the wind may be blowing the aircraft off course. And, the altitude of clouds and presence or absence of fog determines whether a pilot can safely approach an airport and visually identify the location of the runway without flying dangerously close to the ground.
Therefore, airports around the country publish hourly weather observations, called METARs. These reports are formatted as plain text in a highly abbreviated syntax that any trained pilot knows how to decode. Because they follow a standard formatting, they are suitable for use in programs, as well. Unfortunately, the web page that provides METARs places them as a single line of text amidst other elements, such as a request form and various weather-related links. We want to be able to extract just the weather observation from this cluttered page.
Step 1: Examining the Page¶
Start by going to the web site that provides this data.
On the right side of the page, there is a form field labelled “IDs” and a button named “Get METAR data”. In the IDs field, you need to fill in the ICAO code of a US airport. The ICAO code is simply the three-letter code you are probably used to (known as an IATA code), preceded by a K. For instance, the ICAO code for Midway is KMDW, and the ICAO code for O’Hare is KORD. Please enter a single ICAO code for an airport of your choosing, leave the other options at their defaults, and press the button to get the current result.
If you entered in a valid airport code, you should get back a page that contains a line like:
KMDW 082353Z 17008KT 10SM SCT110 BKN180 BKN250 M09/M18 A3056
RMK AO2 SLP372 60000 T10891178 11067 21100 58016
Interpreting a METAR is obviously outside the scope of the course, but, for what it’s worth: this line gives the ICAO code, then the date and time of the observation (in the UTC time zone), then the wind direction and speed, followed by the visibility (in miles), information on the current cloud cover, the temperature and dewpoint in Celsius, then the barometric pressure, followed by other remarks.
You could simply visit this web site by hand, and then copy and paste the pertinent line into a file. But, if you wanted to automatically collect these observations every hour, or for hundreds or thousands of airports, that does not sound tenable.
Let’s now explore the technical details of this web page.
First, take a look at the URL of the web page. It should look something like:
http://www.aviationweather.gov/metar/data?ids=kmdw&format=raw&date=0&hours=0
(Your URL may be different, depending upon what airport you used and whether you typed the ICAO code in lower- or upper-case letters.)
We have already learned something useful: for this web site, the URL encodes the specific airport being queried. So, if we wanted to write a Python function that generates a URL for a given airport, we do not need to somehow fill in and submit the form on the web site and see which page it sends us to; all we need to do is concatenate a couple of strings together.
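For instance, here is a minimal sketch of such a function (the name make_metar_url is just a placeholder), assuming the URL keeps the format shown above and only the ids value changes from airport to airport:
def make_metar_url(icao_code):
    # Only the ids query parameter varies; the rest are left at their defaults.
    return ("http://www.aviationweather.gov/metar/data?ids=" + icao_code
            + "&format=raw&date=0&hours=0")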
Now, use your web browser’s File menu to save a copy of the web page to disk. Open this file up with a text editor to view the raw HTML formatting of the web page.
You can start by scanning through the file to get a sense of what it looks like in general. However, because it is fairly large and most of the content is irrelevant, you will quickly want to home in on the portion of interest.
Referring back to your web browser window, determine the first few characters of the information we want to extract from the web page. For instance, for the above example, this is `KMDW 08`. Use your text editor’s Find function to locate this text within the web page source.
Now, examine the HTML tags in the vicinity of this text. Look for distinctive landmarks that would help you programmatically identify the location of the weather observation without searching for the actual contents of the observation itself.
There are different ways to approach this task. Here is what I noticed, however. Before the observation, there is a paragraph tag with the attribute `clear="both"`. I searched for any other paragraph tags with the same attribute in the page, but didn’t find any. Within this paragraph is a human-readable date, but then the paragraph is closed with a `</p>` tag. Following the paragraph is a newline, then a comment reading `Data starts here`. Next is a newline, then the actual observation.
Step 2: Parsing the HTML¶
I could then locate the weather observation using this approach:
- Find all paragraph (`p`) tags with attribute `clear="both"`. (There is only one.)
- This gives me a list of length one. Select the first entry in this list.
- I could then navigate the tree of HTML, starting with this paragraph tag, to get to the next sibling. Critically, this skips over everything nested within the paragraph tag, including the bold (`b`) tag and the human-readable date.
- It turns out that the next sibling is the newline character after the closing of the paragraph tag. Not what I want. I’ll move on to the next sibling of that.
- That entry turns out to be the comment (`Data starts here`). I’ll move on to the next sibling one more time.
- Now I get the data I want. But it has an extraneous newline character attached; I recall I can use the `strip` function to trim that off.
Using a Python interpreter, try writing lines of code to go through this process. Here are fragments of code that you will find useful:
import bs4
html = open(filename).read()
soup = bs4.BeautifulSoup(html)
tag_list = soup.find_all("p", clear="both")
...
tag = tag.next_sibling
...
Note that when you try the line that initializes the `soup` object, you will get a warning message, along with advice on how to improve your code by adding an extra parameter to the function call. Go ahead and follow this advice.
Combine these snippets, the step-by-step process for getting the weather observations listed above, and your knowledge of Python to write a series of lines of code that successfully retrieve the weather observation string.
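If you get stuck, here is one possible sketch that follows the steps above; it assumes the page is saved to a file (the filename "metar.html" is a placeholder) and is structured exactly as described, so your own version may differ:
import bs4
# Read the saved copy of the web page from disk
html = open("metar.html").read()
# The second argument names a parser explicitly, as the warning message advises
soup = bs4.BeautifulSoup(html, "html.parser")
# Find the paragraph tag with clear="both" and take the first (only) match
tag = soup.find_all("p", clear="both")[0]
# Skip the newline after </p>, then the "Data starts here" comment
tag = tag.next_sibling
tag = tag.next_sibling
# The next sibling is the observation text itself, with an extra newline attached
observation = tag.next_sibling.strip()
print(observation)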
Step 3: Automating Queries for Any Airport¶
Now, here is a code snippet that loads HTML from a URL:
import urllib3
pm = urllib3.PoolManager()
...
html = pm.urlopen(url=myurl, method="GET").data
Take these snippets, combine them with your BeautifulSoup-based parsing code and some additional code to build a URL through string concatenation, and create a single function `current_weather` that takes one parameter, an airport code, and returns a string which is the latest weather observation from that airport.
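Here is one possible shape for such a function, offered only as a sketch: it reuses the URL format from Step 1 and the parsing steps from Step 2, and it assumes the page structure described earlier has not changed:
import bs4
import urllib3
def current_weather(icao_code):
    # Build the query URL by concatenating strings, as discussed in Step 1
    url = ("http://www.aviationweather.gov/metar/data?ids=" + icao_code
           + "&format=raw&date=0&hours=0")
    # Fetch the raw HTML for that URL
    pm = urllib3.PoolManager()
    html = pm.urlopen(url=url, method="GET").data
    # Parse and navigate to the observation, as in Step 2
    soup = bs4.BeautifulSoup(html, "html.parser")
    tag = soup.find_all("p", clear="both")[0]
    return tag.next_sibling.next_sibling.next_sibling.strip()
print(current_weather("KMDW"))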
Once you have this working, take a moment to understand the power of the function you just wrote: although you only inspected the format of the web page for one airport weather observation at one time, you found characteristic landmarks in the formatting of this page that will allow your function to work for any airport at any time, and to be used millions of times, if desired.
Example 2: Climate Data¶
This web page contains climate data for Chicago, formatted as a table.
Please review the web page as it is presented in your web browser, then save the file to disk and examine it again in your text editor. To find the code associated with the table, you can search for the word “January”. (Though be careful: it may appear more than once, in different contexts.)
Unlike the previous example, this page unsurprisingly uses a web table, with the `table`, `tr`, and `td` tags. Write a Python function that takes two parameters: the URL of a page formatted like this one (in other words, if you went to the web site and found the corresponding page for another city, you could use that URL instead), and the name of a CSV file. The function should write the table out to a CSV file that can be opened in a spreadsheet program. (A rough skeleton for this function appears after the notes below.)
Your approach for this example will be fairly similar to the previous one. Here are some other pieces of BeautifulSoup syntax that you may find valuable:
- This table has a unique `id` attribute. You can search for all the tags of type `t` with attribute `a` equal to `v` with: `soup.find_all(t, a=v)`
- You can perform `find_all` not just on the entire web page, but also on only the part that is nested within a given tag: call the `find_all` method on that tag rather than on the whole `soup`. This might be useful for finding all the rows in a given table and putting each row tag in a list (see the short sketch after this list).
- Given a tag `t` that surrounds text (`<t>interesting text here</t>`), writing `t.text` yields the text.
- Don’t forget to make your life easier when it comes to writing out the CSV by using the `csv` library.
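Here is a short, self-contained sketch of the second and third ideas above, using a tiny made-up table rather than the real page:
import bs4
# A made-up two-row table, just to demonstrate the syntax
html = "<table id='demo'><tr><td>January</td><td>30</td></tr><tr><td>February</td><td>34</td></tr></table>"
soup = bs4.BeautifulSoup(html, "html.parser")
table = soup.find_all("table", id="demo")[0]
# Calling find_all on a tag searches only within that tag
rows = table.find_all("tr")
for row in rows:
    # .text yields the text nested inside each cell
    print([cell.text for cell in row.find_all("td")])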
Here are some other notes:
- The column headers have line break tags in the middle to place multi-word titles on consecutive lines. You won’t want HTML tags in your final CSV output. Does BeautifulSoup do something about these tags, or do you need to handle them yourself?
- The names of months in the table are links. You’ll want to just get the name of the month itself for your own purposes, so you’ll need to look inside the `a` tag and just get the text that is there.
- To format special characters like the degree sign, the web page creator has used special directives. Do you need to interpret these somehow, or does BeautifulSoup convert them back into the appropriate symbols for you?
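Putting these pieces together, one possible skeleton for the function is sketched below. The id value "climate_table" is only a stand-in for whatever id attribute the real table uses, and this version does not yet deal with the line-break tags, the month links, or the special characters mentioned in the notes above:
import csv
import bs4
import urllib3
def climate_to_csv(url, csv_filename):
    # Fetch and parse the page
    pm = urllib3.PoolManager()
    html = pm.urlopen(url=url, method="GET").data
    soup = bs4.BeautifulSoup(html, "html.parser")
    # "climate_table" is a placeholder: use the actual id you found in the source
    table = soup.find_all("table", id="climate_table")[0]
    with open(csv_filename, "w", newline="") as f:
        writer = csv.writer(f)
        for row in table.find_all("tr"):
            # Grab both th and td cells, in case header rows use th
            cells = row.find_all(["th", "td"])
            writer.writerow([cell.text.strip() for cell in cells])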
Example 3: Chicago ‘L’ Lines¶
The CTA maintains web pages for each of the ‘L’ lines in the city. For instance, here is the one for the Red Line.
Scroll down to the section titled “Route Diagram and Guide” and have a look. This section seems to be chock full of detail about this transit line: a list of stations, information about wheelchair accessibility and parking, and rail and bus transfers.
If you were doing a project on public transportation in Chicago, this page (and the corresponding ones for the other lines) seems like it would be a treasure trove of data. Unfortunately, much of it appears to be graphical. For instance, you can transfer to the Yellow and Purple lines at Howard, but short of writing code to interpret an image (a very complicated task), determining this in your code rather than by visual inspection seems out of reach. Similarly, while there are wheelchair and parking icons for some stations, they are icons, not text.
The HTML tag for images, however, allows a web site designer to provide an alternative textual representation of an image. This is accomplished by adding an `alt` attribute to an `img` tag. For various reasons, it is considered good practice to do this, and a well-designed web site will follow this protocol.
Save this page and open it in your text editor. Search for the table in the “Route Diagram and Guide” section and take a look at the image tags. We’re in luck!
When we are navigating the tree of elements that results from parsing a web page with BeautifulSoup, we can retrieve the value of a tag’s attribute with the syntax `tag["attr_name"]`. For instance, if `t` is an `img` tag, we could use `t["alt"]` to retrieve its `alt` text.
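As a quick, self-contained illustration of this syntax, using a made-up img tag rather than the real page:
import bs4
html = '<img src="icons/wheelchair.png" alt="accessible station">'
soup = bs4.BeautifulSoup(html, "html.parser")
t = soup.find_all("img")[0]
# Prints: accessible station
print(t["alt"])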
This table seems to be a little harder to distinctively identify. But there is a nearby `a` tag that might serve as a landmark based on one of its attributes.
To take advantage of this, here’s what I tried:
soup.find_all("a", name="map")
Please try it yourself.
This didn’t work; Python gave me (and, most likely, you) an error message. What happened here?
It turns out that the first argument, the one that stores the name of the tag itself (in this case `a`), is itself named `name`. This conflicts with our attempt to specify the value of an attribute named `name`.
This is similar to, but distinct from, the requirement that you write `class_` instead of `class` when matching against an attribute named `class`, because that keyword is a Python reserved word.
There is a workaround, however: you can make a dictionary of attributes to look for, and just have a single entry:
soup.find_all("a", attrs={"name":"map"})
Once you have found this `a` tag, you can use `.parent` and `.next_sibling` to get to the table. Examine the nesting of tags carefully to understand what the tree of nested tags looks like in this vicinity.
Write code to scrape the table and turn it into a useful in-memory representation. Here is one reasonable choice:
- There is a list of stations in order down the line, as presented on the page. (For instance, for the Red Line, this is north to south, from Howard to 95th/Dan Ryan.)
- Each entry in this list is a dictionary. It has a key for the name, a key for the “amenities” (accessible, parking), a key for the URL for the station page, a key for the ‘L’ transfers (colored lines), and a key for the other connections (bus and Metra rail lines). The values associated with each of these keys could simply be text strings or lists of strings retrieved from the table, with processing to remove HTML coding. Here is an example for the Howard station, for instance:
{'L-transfers': ['transfer to yellow line', 'transfer to purple line'],
'amenities': ['accessible station', 'automobile parking available'],
'connections': '\n CTA Buses #22, #97, #147, #151, #201, #205, #206\n
Pace Buses #215, #290\n ',
'URL': '/travel_information/station.aspx?StopId=71', 'name': 'Howard'}
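Here is one possible starting point, hedged heavily: it assumes you have saved the Red Line page to a file (the filename is a placeholder), it reaches the table with .parent and .next_sibling as described above (you may find you need a different number of navigation steps), and it only collects the station link and the raw alt texts from each row; splitting those alt texts into amenities, ‘L’ transfers, and connections is left to you:
import bs4
html = open("redline.html").read()   # placeholder filename for the saved page
soup = bs4.BeautifulSoup(html, "html.parser")
# Use the <a name="map"> tag as a landmark, then navigate to the table.
# Depending on the actual nesting, you may need more or fewer steps here.
anchor = soup.find_all("a", attrs={"name": "map"})[0]
table = anchor.parent.next_sibling
stations = []
for row in table.find_all("tr"):
    links = row.find_all("a")
    if not links:
        continue   # skip rows that do not describe a station
    stations.append({"name": links[0].text.strip(),
                     "URL": links[0]["href"],
                     # all alt texts in this row, to be sorted into the keys above
                     "icons": [img["alt"] for img in row.find_all("img")]})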
Conclusion¶
For any web page that holds data you want, you should start by seeing if the web site offers JSON or CSV data. But if it doesn’t, you have the option of retrieving HTML and using BeautifulSoup.
Doing so involves examining the HTML source code for the page, finding landmarks that allow you to programmatically locate the data of interest within the page, and then extracting it. Working through the page involves using navigation facilities like `find_all`, `parent`, and `next_sibling`; extracting data involves unwrapping the tags around it to get to the text, while respecting and maintaining the structure of the data.
This process can be painstaking, but in the end, working with HTML is manageable, and made easier thanks to BeautifulSoup.
The experience you have gained in this lab is very likely to help you with the upcoming programming assignment, and in the course project.