NumPy
Throughout this course, we have seen a number of examples of manipulating lists of numbers. NumPy is a Python library that makes such manipulations easy. NumPy supports numerical multi-dimensional arrays (standing in for lists of lists, lists of lists of lists, etc.) where each axis has a fixed length. This lab will introduce NumPy and give an overview of some of its key functionality.
You need to import NumPy before you can use it. We usually give the library a shorter name by using the import-as mechanism:
import numpy as np
Once this import is done, you can use functions from the numpy library using np as the qualifier.
Getting started
To get started, open up a terminal and navigate (cd) to your cmsc12100-aut-20-username directory. Run git pull upstream master to collect the lab materials and git pull to sync with your personal repository. The lab5 directory contains a file named lab5.py.
This file includes a function, read_file, that takes the name of a CSV file as an argument and returns a list of the column names and a two-dimensional array of data. It also includes a call to the function that loads a dataset you will use in PA #5. This dataset contains one row for every region of a city (the exact regions are unimportant), with all but the last column containing the number of complaints of a given type reported through 311 (e.g., graffiti, potholes, etc.). The last column contains the total number of crimes reported in that region of the city.
Fire up ipython3 and run lab5.py to get started. This run will print some output, which you can ignore for now.
One-dimensional arrays in NumPy
We’ll start by looking at one-dimensional arrays in NumPy. Unlike Python lists, all of the values in a NumPy array must have the same type. We can create a one-dimensional NumPy array from a list using the function np.array. For example,
In [10]: a1 = np.array([10, 20, 30, 40])
In [11]: a1
Out[11]: array([10, 20, 30, 40])
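As a small illustration of the same-type rule (the variable names here are made up for this example), NumPy picks a single element type, called the dtype, when it builds an array. The sketch below shows what happens when the input list mixes integers and floats, and when a float is stored into an integer array:

```python
import numpy as np

# All integers: NumPy chooses an integer dtype (int64 on most platforms).
ints = np.array([10, 20, 30, 40])
print(ints.dtype)

# Mixed integers and floats: every value is converted to a float.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)   # float64
print(mixed)

# Storing a float into an integer array drops the fractional part,
# because the array's dtype does not change after creation.
ints[0] = 7.9
print(ints[0])       # 7
```

If you need floats from the start, passing dtype=float to np.array is one way to ask for them explicitly.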
NumPy arrays are distinct from lists, and you should use NumPy’s built-in functions and attributes to determine sizes. For example, if we evaluate a1.shape, we get:
In [12]: a1.shape
Out[12]: (4,)
In [13]: a1.shape[0]
Out[13]: 4
Note that the result of a1.shape is a tuple that describes the size of the array in all dimensions (a1 is a one-dimensional array, so it is a tuple of size 1).
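To make the shape tuple concrete, here is a brief sketch that compares a one-dimensional array with a two-dimensional one. The array named grid is a made-up example and is not part of lab5.py:

```python
import numpy as np

a1 = np.array([10, 20, 30, 40])
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])    # hypothetical 2 x 3 example

print(a1.shape)     # (4,)
print(grid.shape)   # (2, 3)

# len() only reports the size of the first axis, which is why
# shape is the safer way to ask about multi-dimensional arrays.
print(len(grid))    # 2

# The tuple can be unpacked like any other tuple.
n_rows, n_cols = grid.shape
print(n_rows, n_cols)   # 2 3
```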
We can access/update the ith element of the array using [] notation:
In [14]: a1[0]
Out[14]: 10
In [15]: a1[2]
Out[15]: 30
In [17]: a1[2] = 50
In [18]: a1
Out[18]: array([10, 20, 50, 40])
Operations on NumPy arrays are element-wise. For example, the expression:
In [23]: a1 * 2
Out[23]: array([ 20, 40, 100, 80])
yields a new NumPy array where the ith element of the result is equal to the ith element of a1 times 2. Note that a given operator (e.g., *) can have a different meaning depending on the data type to which it is applied. For example, try making a1 a list, rather than a NumPy array, and repeat the same operation.
Similarly,
In [25]: a1
Out[25]: array([10, 20, 50, 40])
In [26]: a2 = np.array([100, 200, 300, 400])
In [27]: a1 + a2
Out[27]: array([110, 220, 350, 440])
yields a new array where the ith element is the sum of the ith elements of a1 and a2. Again, the plus operator has a very different meaning for lists. Try applying the + operator to two lists to compare what happens. Be careful about Boolean operations on NumPy arrays as they are also applied element-wise:
In [28]: a1 > 20
Out[28]: array([False, False, True, True])
In [29]: a2 = np.array([10,20,30,40])
In [30]: a1 == a2
Out[30]: array([ True, True, False, True])
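The contrast between list operators and array operators described above can be sketched side by side. The names values and arr are chosen for this example only. The last few lines show some standard ways to reduce an element-wise Boolean result to a single True or False, which is what an if statement needs:

```python
import numpy as np

values = [10, 20, 50, 40]
arr = np.array(values)

# On a list, * repeats the list and + concatenates.
print(values * 2)        # [10, 20, 50, 40, 10, 20, 50, 40]
print(values + [1, 2])   # [10, 20, 50, 40, 1, 2]

# On an array, the same operators work element by element.
print((arr * 2).tolist())                        # [20, 40, 100, 80]
print((arr + np.array([1, 2, 3, 4])).tolist())   # [11, 22, 53, 44]

# A comparison yields an array of Booleans, not a single Boolean.
mask = arr > 20
print(mask.tolist())     # [False, False, True, True]

# Writing "if arr > 20:" raises a ValueError because the truth value
# of a multi-element array is ambiguous. Reduce it explicitly instead.
print(mask.any())        # True: at least one element is greater than 20
print(mask.all())        # False: not every element is greater than 20

# To test whether two arrays match everywhere, use np.array_equal.
print(np.array_equal(arr, np.array([10, 20, 30, 40])))   # False
```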
NumPy also provides useful methods for operating on arrays, such as sum and mean:
In [28]: a1.sum()
Out[28]: 120
In [29]: a1.mean()
Out[29]: 30.0
which add up the values in the array and compute its mean respectively. These operations can also be written using notation that looks more like a function call:
In [32]: np.mean(a1)
Out[32]: 30.0
In [33]: np.sum(a1)
Out[33]: 120
Task 1: Write a function:
def var(y):
that computes the variance of y, where y is a NumPy array. We will define variance to be:
\[\mathrm{Var}(\mathbf{y}) = \frac{1}{N} \sum_{n=1}^{N} (y_n - \bar y)^2\]
where \(\mathbf{y} = (y_1, y_2, \dots, y_N)\), and \(\bar y\) denotes the mean of all of the \(y_n\). Your solution should not include an explicit loop.
The code in lab5.py will call var on an array called graffiti, which contains the graffiti column from the city data set that you will use in Programming Assignment #5, and on a garbage array, which contains the garbage column from the same data set. Here’s the output of our implementation:
GRAFFITI 409854.475818
GARBAGE 3159.33473311
Two-dimensional arrays
One-dimensional arrays are useful, but the real power of NumPy becomes more apparent when working with data that looks more like a matrix. For example, here’s a matrix represented using a list-of-lists:
m = [[0, 1, 4, 9],
     [16, 25, 36, 49],
     [64, 81, 100, 121],
     [144, 169, 196, 225],
     [256, 289, 324, 361],
     [400, 441, 484, 529]]
We can convert this data into a two-dimensional array as follows:
In [34]: b = np.array(m)
where the value of b will be:
In [34]: b
array([[ 0, 1, 4, 9],
[ 16, 25, 36, 49],
[ 64, 81, 100, 121],
[144, 169, 196, 225],
[256, 289, 324, 361],
[400, 441, 484, 529]])
Accessing elements of a two-dimensional (2D) NumPy array can be done using the same syntax as a list-of-lists, that is, the expression b[i][j] will yield the jth element of the ith row of b. More conveniently, you can use a tuple to access the elements of a NumPy array. That is, the expression b[i, j] will also yield the jth element of the ith row of b.
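Here is a short sketch of the two indexing styles on a small array. The array squares is invented for this example (it is not the b used in this lab) and is built with np.arange and reshape:

```python
import numpy as np

# A 3 x 4 array holding the squares of 0 through 11.
squares = np.arange(12).reshape(3, 4) ** 2

# Both forms select row 1, column 2.
print(squares[1][2])     # 36
print(squares[1, 2])     # 36

# Negative indices count from the end, just as with lists.
print(squares[-1, -1])   # 121

# The tuple form also works for assignment.
squares[1, 2] = -1
print(squares[1].tolist())   # [16, 25, -1, 49]
```

The tuple form is usually preferred: b[i][j] first builds the intermediate row b[i] and then indexes into it, while b[i, j] does the lookup in one step.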
NumPy arrays also support slicing and other more advanced forms of indexing. For example, the expression b[1:4] will yield:
In [35]: b[1:4]
array([[ 16, 25, 36, 49],
[ 64, 81, 100, 121],
[144, 169, 196, 225]])
that is, rows 1, 2, and 3 from b. The expression b[1:4, 2:4] will yield columns 2 and 3 from rows 1, 2, and 3 of b:
In [36]: b[1:4, 2:4]
array([[ 36, 49],
[100, 121],
[196, 225]])
As with list slicing, a colon (:) can be used to indicate that you wish to include all the indices in a particular dimension. For example, b[:,2:4] will yield a slice of b with columns 2 and 3 from all the rows. Recall that a slice excludes its endpoint.
In addition to slicing, you can also specify a list of indices as an index. For example, the expression b[:, [1, 3]] will yield columns 1 and 3 from b:
In [37]: b[:, [1, 3]]
array([[ 1, 9],
[ 25, 49],
[ 81, 121],
[169, 225],
[289, 361],
[441, 529]])
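The slicing and index-list forms above can be mixed freely. The sketch below uses a made-up 4 x 5 array named t (not the lab's b) so that each result is easy to check by eye:

```python
import numpy as np

# Rows are [0..4], [5..9], [10..14], [15..19].
t = np.arange(20).reshape(4, 5)

# A slice on the first axis selects a contiguous run of rows.
print(t[1:3].tolist())          # [[5, 6, 7, 8, 9], [10, 11, 12, 13, 14]]

# A colon keeps every row; the second slice picks columns 1 and 2.
print(t[:, 1:3].shape)          # (4, 2)

# A list of indices selects rows that are not adjacent.
print(t[[0, 3]].tolist())       # [[0, 1, 2, 3, 4], [15, 16, 17, 18, 19]]

# A slice and a list can be combined: columns 0 and 4 of rows 1 and 2.
print(t[1:3, [0, 4]].tolist())  # [[5, 9], [10, 14]]
```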
One thing to keep in mind with NumPy arrays is that you will lose a dimension if you specify a single column or row as an index. For example, notice that the results of the following two expressions are both one-dimensional arrays:
In [38]: b[1, :]
Out[38]: array([16, 25, 36, 49])
In [39]: b[:, 1]
Out[39]: array([ 1, 25, 81, 169, 289, 441])
If you wish to retain the dimension, you can use list indexing:
In [40]: b[:, [1]]
Out[40]:
array([[ 1],
[ 25],
[ 81],
[169],
[289],
[441]])
In [41]: b[[1], :]
Out[41]: array([[16, 25, 36, 49]])
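One way to see the difference is to inspect the shape of each result. This sketch uses a stand-in 6 x 4 array built with np.arange rather than the b defined earlier, and also shows that a length-one slice keeps the dimension, just as a one-element list does:

```python
import numpy as np

grid = np.arange(24).reshape(6, 4)   # stand-in 6 x 4 array

col_1d = grid[:, 1]         # single index: the column axis is dropped
col_2d = grid[:, [1]]       # one-element list: the axis is kept
col_slice = grid[:, 1:2]    # length-one slice: the axis is also kept

print(col_1d.shape)      # (6,)
print(col_2d.shape)      # (6, 1)
print(col_slice.shape)   # (6, 1)

row_1d = grid[1, :]
row_2d = grid[[1], :]
print(row_1d.shape)      # (4,)
print(row_2d.shape)      # (1, 4)
```

This matters whenever a function expects a two-dimensional array: a single extracted column has to keep its second axis to qualify.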
Task 2: In the task2 function in lab5.py, write expressions to extract the following subarrays of b:
rows 0, 1, and 2
rows 0, 1, and 5
columns 0, 1, and 2
columns 0, 1, and 3
columns 0, 1, and 2 from rows 2 and 3.
Task 3: We have imported the functions linear_regression and prepend_ones_column (which you will use in PA #5) in lab5.py. Write code to call linear_regression using a column of all ones and columns 2 (RODENTS) and 3 (GARBAGE) of city_data as the value for X and column 7 (CRIME_TOTALS) as the value for y. This function expects a two-dimensional NumPy array for the value of X and a one-dimensional NumPy array for the value of y.
Hint: you can do this task in a single line of code.
The result should be:
array([ 66.60834501 0.58072845 16.82863941])
Do not worry if you do not understand what this result means. The goal is to ensure you understand how to perform basic array operations. In PA #5 we will dig deeper into the linear regression involved in this computation.
Task 4: Write code to call linear_regression using a column of all ones and column 0 (GRAFFITI) of city_data as the value for X and column 7 (CRIME_TOTALS) as the value for y.
The result should be:
array([ 593.77754996 0.76632378])
Other useful operations
You can find the number of dimensions, shape, and the number of elements in a NumPy array using the ndim, shape, and size attributes, respectively.
In [42]: b.ndim
Out[42]: 2
In [43]: b.shape
Out[43]: (6, 4)
In [44]: b.size
Out[44]: 24
As noted above, you can compute the mean of the elements using the mean method.
In [54]: b.mean()
Out[54]: 180.16666666666666
You can also compute the per-column mean and the per-row mean using the mean method by specifying an axis: axis 0 averages down the rows to give one mean per column, and axis 1 averages across the columns to give one mean per row:
In [55]: b.mean(0)
Out[55]: array([ 146.66666667, 167.66666667, 190.66666667, 215.66666667])
In [56]: b.mean(1)
Out[56]: array([ 3.5, 31.5, 91.5, 183.5, 307.5, 463.5])
In [57]: np.mean(b, axis=0)
Out[57]: array([ 146.66666667, 167.66666667, 190.66666667, 215.66666667])
In [58]: np.mean(b, axis=1)
Out[58]: array([ 3.5, 31.5, 91.5, 183.5, 307.5, 463.5])
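Because the axis argument is easy to mix up, here is a small sketch on a 2 x 3 array whose means can be checked by hand. The array scores is a made-up example. The same axis argument is accepted by other reductions, such as sum:

```python
import numpy as np

scores = np.array([[1, 2, 3],
                   [4, 5, 6]])

# No axis: a single mean over all six elements.
print(scores.mean())                  # 3.5

# axis=0 collapses the rows, leaving one value per column.
print(scores.mean(axis=0).tolist())   # [2.5, 3.5, 4.5]

# axis=1 collapses the columns, leaving one value per row.
print(scores.mean(axis=1).tolist())   # [2.0, 5.0]

# The result has one entry for each index along the axis that remains.
print(scores.mean(axis=0).shape)      # (3,)
print(scores.mean(axis=1).shape)      # (2,)

# Other reductions follow the same convention.
print(scores.sum(axis=0).tolist())    # [5, 7, 9]
```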
When Finished
When finished with the lab please check in your work (assuming you are inside the lab directory):
git add lab5.py
git commit -m "Finished with lab5"
git push
No, we’re not grading this; we just want to look for common errors.