CS 122 Lab 7: Graphs¶
The goal of this lab is to gain some experience working with graphs; learn and use the networkx
Python library; and see an example application of graphs.
Introduction¶
Graphs are a type of data that allows entities and their relationships to be modeled and analyzed. They consist of a set of vertices (singular: vertex; often informally referred to as nodes), which represent entities; and edges, which link two vertices that are somehow related.
For instance, vertices could represent different people, and edges could indicate who knows whom personally. Or, vertices could correspond to cities, and edges could indicate which cities have direct flights between them.
Graphs can be directed or undirected: a directed graph has edges that are not necessarily bidirectional, while an undirected graph always has reciprocal links. For instance, if nodes represent street intersections and edges, the streets that can be used to get from one intersection to another, then an undirected graph assumes that all streets are two-way, while a directed graph can accommodate one-way streets.
Graphs can be weighted or unweighted: a weighted graph has numeric weights assigned individually to each edge, capturing some aspect of the relationship, while an unweighted graph treats all relationships as equivalent. For instance, we could use edge weights to track the number of miles between two cities, or the number of journal articles on which two physicians have collaborated.
There are many ways to represent and work with graphs:
- We can use adjacency matrices, which store the presence or absence of edges between pairs of vertices, and possibly their weights. It turns out that mathematical matrix operations, such as matrix multiplication, and even determination of eigenvectors, allow valuable analyses to be performed. We will not consider this approach in this lab, but keep in mind that these tools exist.
- We can store nodes with adjacency lists: we can have node objects corresponding to each individual node, each of which has a list of adjacent nodes (i.e., those nodes with which the current node has a mutual edge). You may have seen this approach in class.
- We can use a graph library, such as
networkx
, that supports graph representation and analysis. We needn’t concern ourselves with how such a package internally represents graphs; we just learn and use its API to construct them and then perform analyses on them. We will follow this approach today.
The airline transportation network¶
As suggested earlier, graphs are suitable for modeling transportation networks: vertices naturally correspond to waypoints, and edges to connections between them. One such example is the airline transportation network.
We can represent airports as vertices in this model. Then, we can consider all routes on which at least one airline operates regularly-scheduled service, and create edges between the airports from which a given flight departs and to which it arrives.
This graph should be directed: an airline is not obligated to fly in both directions between a pair of cities, although obviously an airline must somehow ensure equipment makes its way back to the origin city in the case of a one-way relationship.
This graph could be weighted: we could label edges with the great-circle distances between corresponding airports, the single-segment airfares between cities with non-stop flights, the flights’ block time (duration from takeoff to landing), etc. In this lab, however, we will just use an unweighted graph and consider basic connectivity.
There are various sources of data on the air transportation network. One such source is OpenFlights, which has freely-downloadable information on airports, airlines, and scheduled routes from around the world, in CSV format. We have downloaded the airport and route data files and seeded your repository with them; please do a git pull upstream master
to pick up the lab7
directory.
To limit the scope and complexity of this lab, we will only consider airports in the contiguous 48 United States. This limitation then implies that we should only consider routes that remain within that geographic area, or else we would not have vertices for one or both of the airports associated with a flight. (You are more than welcome, even encouraged, to tinker with the entire data set on your own.)
Recall from an earlier lab that airports have two designators: an IATA code and an ICAO code. The IATA code is the three-letter code that you have doubtless used many times when booking flights: for instance, ORD
. (The International Air Transport Association is an industry trade group consisting of airlines, and promulgates the codes that they use in their passenger reservation systems.) The ICAO code is a four-letter code that is used by pilots, air traffic controllers, and others involved in the technical operation of flights (the International Civil Aviation Organization is a United Nations agency that coordinates technical aspects of aviation).
ICAO codes are assigned in a hierarchical manner; airports in the contiguous 48 US states begin with K
. (Ones in Alaska and Hawaii begin with P
.) The remaining three letters often, but not always, match the IATA code: KORD
, KMDW
. The significance of the first letter will make it trivial to identify airports within our desired region.
Each route that is actively being operated by an airline consists of information like the airline identifier, the source and destination airports, and the type of equipment typically used. For the purposes of this lab, we will only consider the source and destination airports, although you could imagine modeling different airlines separately and comparing their networks, or analyzing the redundancy between different carriers, etc.
The OpenFlights data¶
Although we have already downloaded and provided you with the CSV files we will be using, the site from whence we obtained them has information on their formats which you should examine (we are only using the airport and route data).
The creators of this data have tried to facilitate the use of this information in a relational database. Thus, they have assigned integer primary keys to each airport (which they refer to as a “unique identifier”). For the routes, the unique identifiers for the source and destination airports are again specified, as a foreign key.
The data on airports also includes both the IATA and ICAO codes. The data on routes always specifies the foreign key (unique airport ID) for the source and destination airports, but also one (but not both) of the IATA or ICAO codes for the source and destination airports.
We need to be able to reference a route back to its associated airports. We could track both sets of codes for each airport, and then, for a given route, resolve which airports it connects by comparing both types of codes for each airport to the ones specified for the route. But, the creators of this data have simplified this process by establishing consistent, unique identifiers for airports and using them religiously for each route. When the providers of data go out of their way to make our lives easier, let’s say “please and thank you” and use it – we will use IATA codes to understand what airport we are looking at, but we will use the unique IDs to associate airports and routes.
networkx¶
The networkx
library is a Python package that allows us to store, manipulate, and analyze graphs.
It is already installed on the CSIL machines we checked; if you receive an ImportError
, run the following to install it:
pip3 install --user networkx
If you want networkx
on your VM:
sudo pip3 install networkx
To import it, the convention is to:
import networkx as nx
We will now intersperse basic information on working with networkx
and tasks for you to perform. If you’d like to explore the full documentation for this library, it can be found here. (We have linked to one of the most helpful introductory pages, but you can follow links to further details.)
The networkx
package uses different types of objects to represent undirected and directed graphs. To make an empty undirected graph:
g = nx.Graph()
while, to make an initial directed graph (sometimes abbreviated as a digraph):
g = nx.DiGraph()
As we mentioned earlier, a digraph makes the most sense for this data, although at one point we will have occasion to convert our digraph to an undirected graph. (Some analyses in networkx
do not support digraphs.)
When working with networkx
, we need to decide how to refer to a given vertex. This is the identifier that we will use whenever we want to reference it; it is somewhat akin to the index in a list or the key in a dictionary. An obvious choice for airports would be the IATA or ICAO code, and indeed, networkx
supports strings in this role. But, as discussed earlier, we should use the unique integer ID that has been helpfully provided by OpenFlights. This has the benefit of avoiding any record linkage problems, but does complicate looking up an airport, or understanding which airport we are looking at in a human-readable manner. We’ll come back to this later.
To make a node:
g.add_node(17)
where 17
is the identifier for a particular node, such as the unique OpenFlights ID for an airport.
To construct an edge:
g.add_edge(23, 48)
where 23
and 48
are the identifiers for the nodes, created earlier, that should be linked with an edge; in the case of a digraph, the first parameter represents the source of the edge, and the second, the destination.
Task 1: Load data¶
Write a function to:
- Use the
csv
module to load the airport data. - Skip over any airport not in the 48 contiguous US states (IATA code begins with “K”).
- Add nodes to a digraph corresponding to each K-prefixed airport. The identifier should be the OpenFlights unique ID.
Recall that id
is a reserved word in Python; be careful not to try to store an airport identifier in a variable with this name, as tempting as it may seem. This can result in very-hard-to-debug behavior.
Then, write a function that takes the partially-constructed graph, and:
- Reads the list of routes using the
csv
library. - Skips over any route whose source airport, destination airport, or both, are not in the graph
- Adds edges between the airports for each route that does remain within our set of airports.
To determine if a given airport integer ID is in our set of created nodes:
x in g.nodes()
There is a subtle issue here: OpenFlights specifies routes on an airline-by-airline basis. But, multiple airlines often fly routes between the same pair of cities, and we are not distinguishing routes among airlines. This will result in duplicate edges between vertices, which are not allowed in standard graphs. But, if you attempt to add a redundant edge,, networkx
simply ignores the duplicate request, so we needn’t worry about this.
Here is a much bigger issue: you should convert these IDs to integers rather than keeping them as strings. But, there are some rows in the file for which no value is filled in. OpenFlights represents these with a backslash followed by the letter n
or N
. Skip these rows.
Task 2: Busy airports¶
Which airports in the US have the most arrivals? the most destinations?
In graph theory, the degree of a node is the number of edges coming into or going out of it. In a directed graph, the in-degree specifically refers to incoming edges, while the out-degree refers to outgoing ones.
The networkx
library allows us to retrieve the degrees of each node in the graph. If we wanted the in-degree of the airport with ID 3830
(which happens to be O’Hare):
g.in_degree(3830)
If we wanted to get the out-degrees of all the airports:
g.out_degree()
This gives us back a dictionary with keys (airport IDs) mapping to out-degrees (values). Note that it is not suspicious if many airports have degrees of zero; this data set lists so-called “general aviation” airports as well, but only includes scheduled airline service. General-aviation airports may exclusively serve flight training, hobbyist, and charter activity.
But, there is a broader problem: we need to know the ID number for an airport to look it up, and if we get a dictionary with data on all nodes, its keys are hard-to-interpret numbers. We need a way to map between airport IDs (integers) and IATA codes.
We can use a pair of dictionaries (one for each direction) to accomplish this. But, there is a feature of networkx
that comes in handy here, so we’ll do it a little differently.
We’ll still use one dictionary, though. Go back and modify your function that creates the nodes in the graph; have it maintain a dictionary mapping from IATA codes to integer airport IDs, and return this.
But, we’ll make a further modification to this function: we’ll use a feature of networkx
that allows us to assign attributes to vertices. In effect, networkx
provides a dictionary for each node, and we can store whatever keys and values we like in it.
To do so, add an additional parameter to your add_node
call:
g.add_node(airport_id, icao=icao_code)
In the above, airport_id
is a variable containing the integer ID (which you may have named differently), icao
is the name of the key we’ll use in the attributes dictionary, and icao_code
is the name of a variable (use the name you chose instead) with the actual code.
Then, later, whenever we have an airport integer ID x
, we can look up the code with the syntax:
g.node[x]["icao"]
You should now be able to programmatically convert back and forth between ICAO codes and integer airport IDs.
Now, write a function that takes the graph and dictionary, and a value of k, and prints the top-k airports in the 48 US states, based on arrivals; separately, the top-k based on departures. Both tables should be in reverse sorted order and specify the ICAO code and number of flight routes.
Task 3: Non-stop flights¶
The following code will give us back a list of all the destinations of flights leaving O’Hare:
g.successors(3830)
Replace successors
with predecessors
to get inbound flights.
Write a function that takes in the graph, dictionary, and an ICAO code, and prints (separately) the non-stop destinations to which, and from which, you can fly at that airport. Of course, this printout should give ICAO codes, not numbers.
(Note that the data set has a subtlety involving stops for some routes. Just assume that all routes given are non-stop, for simplicity.)
You can try this function with your favorite US airports, though if your favorite is a busy one, be prepared for a long printout. If you happen to be familiar with an airport in a smaller locale, you might want to try that. One interesting one is KPWM (Portland, Maine), which is particularly noteworthy in its asymmetry.
Task 4: Connected components¶
A connected component of a graph is the subset of vertices that are reachable, directly or indirectly (transitively via multi-node paths) from each other. For instance, the North / Central / South American street network is a connected component within the worldwide street network. The street network within the Azores (an island in the Atlantic) is a separate connected component, because you cannot drive between them (successfully).
One imagines that the US airline system is a connected component. Are there any small airlines that serve small airports without connecting flights operated by other carriers? Might this represent a separate connected component? Let’s find out.
The networkx
package can give you back a list of connected components. Unfortunately, it only provides this feature for undirected graphs. To convert:
ug = nx.Graph(g)
Then, you can:
nx.connected_components(ug)
This gives you back a list of the one or more connected components. Each entry in this list is a list of the node IDs in that component.
Try this out. You’ll find a inconvenient fact: the data set contains many general-aviation airports, as previously discussed. These airports have no routes. Each, technically, is a (very small) connected component of one.
Throw out all the connected components of size 1. How many connected components remain? For any other than the main one that represents the huge US aviation network, what airports are served by the smaller, isolated network(s)? (If you are curious, you can investigate the airports and airline. Wikipedia is a good source.)
Task 5: Routings¶
Earlier, we looked at non-stop flights between airports. But, for small airports, it is likely we will need to connect via a hub to get to another destination. In some cases, we might even need multiple layovers. (Ugh.)
In a graph-theoretic sense, this corresponds to paths in the graph: routes that use one, or more, edges. A simple path is one that doesn’t do anything silly, like double back to the same airport again. (It might still do something silly, like fly from between the east and west coasts multiple times. We should incorporate information about distances and consider shortest paths of we want to avoid this. Coming soon to a lecture near you.)
The networkx
function:
nx.all_simple_paths(g, src, dst, cutoff=k)
finds all paths in the graph g
from airport ID src
to ID dst
with number of edges no more than k
. It yields a list of paths, each a list of airports visited in order.
Write a function that takes in a pair of ICAO codes and a value of k and prints out all the paths available.
You will probably want to test this function with a small airport that has limited destinations. Portland is again a suitable choice, if you don’t have another one in mind.
Conclusion¶
Many types of data concern relationships between entities. Graphs allow us to model this type of data, just as trees allow us to model hierarchical data.
There are many graph theoretic concepts and analyses. We have seen a very small subset of them, but already, the ones we have seen allow us to answer real and interesting questions about data. In class, we will learn about some others, including those that allow us to reason about relationships between people and their importance.
The networkx
package allows us to store, manipulate, and analyze graphs in Python in a straightforward manner.
Many types of questions concern data that should be represented as a graph, and can be answered using graph analysis. Knowing the concepts, tools, and approaches is valuable and important.