NumPy
Throughout this course, we have seen a number of examples of manipulating lists of numbers. NumPy is a Python library that makes such manipulations easy. NumPy supports numerical multi-dimensional arrays (standing in for lists of lists, lists of lists of lists, etc.) where each axis has a fixed length. This lab will introduce NumPy and give an overview of some of its key functionality.
You need to import NumPy before you can use it. We usually give the library a shorter name by using the import-as mechanism:
import numpy as np
Once this import is done, you can use functions from the numpy library with np as the qualifier.
Getting started
To get started, open up a terminal and navigate (cd) to your cmsc12100-aut-19-username directory. Run git pull upstream master to collect the lab materials and git pull to sync with your personal repository. The lab6 directory contains a file named lab6.py.
This file includes a function, read_file, that takes the name of a CSV file as an argument and returns a list of the column names and a two-dimensional array of data. It also includes a call to that function that loads the training data from the city dataset for PA #5.
Fire up ipython3 and run lab6.py to get started. This run will print some output, which you can ignore for now.
One-dimensional arrays in NumPy
We’ll start by looking at one-dimensional arrays in NumPy. Unlike Python lists, all of the values in a NumPy array must have the same type. We can create a one-dimensional NumPy array from a list using the function np.array. For example,
In [10]: a1 = np.array([10, 20, 30, 40])
In [11]: a1
Out[11]: array([10, 20, 30, 40])
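To see the single-type rule in action, here is a minimal sketch. The names ints and mixed are illustrative and are not part of lab6.py. NumPy chooses one type for the whole array, which you can inspect through the dtype attribute:

```python
import numpy as np

# A list of integers becomes an array with an integer dtype
# (the exact integer type, e.g. int64, depends on the platform).
ints = np.array([10, 20, 30, 40])
print(ints.dtype)

# Mixing integers and floats: every value is converted to a float.
mixed = np.array([10, 2.5, 30])
print(mixed.dtype)   # float64
print(mixed)
```

Note that the integers in mixed were converted rather than rejected: NumPy picks a type that can represent every value in the list.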
NumPy arrays are distinct from lists, and you should use NumPy’s built-in functions and attributes to determine sizes. For example, if we evaluate a1.shape, we get:
In [12]: a1.shape
Out[12]: (4,)
In [13]: a1.shape[0]
Out[13]: 4
Note that the result of a1.shape is a tuple that describes the size of the array in all dimensions (a1 is a one-dimensional array, so it is a tuple of size 1).
We can access/update the ith element of the array using [] notation:
In [14]: a1[0]
Out[14]: 10
In [15]: a1[2]
Out[15]: 30
In [17]: a1[2] = 50
In [18]: a1
Out[18]: array([10, 20, 50, 40])
Operations on NumPy arrays are element-wise. For example, the expression:
In [23]: a1 * 2
Out[23]: array([ 20, 40, 100, 80])
yields a new NumPy array where the ith element of the result is equal to the ith element of a1 times 2. Note that a given operator (e.g., *) can have a different meaning depending on the data type to which it is applied. For example, try making a1 a list, rather than a NumPy array, and repeat the same operation.
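If you want to check your experiment, the contrast can be sketched as follows. The variable names values, as_list, and as_array are chosen here for illustration only:

```python
import numpy as np

values = [10, 20, 50, 40]

# On a list, * means repetition: the list is repeated twice.
as_list = values * 2

# On a NumPy array, * means element-wise multiplication.
as_array = np.array(values) * 2

print(as_list)    # [10, 20, 50, 40, 10, 20, 50, 40]
print(as_array)   # [ 20  40 100  80]
```

The list result has eight elements, while the array result keeps the original length of four.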
Similarly,
In [25]: a1
Out[25]: array([10, 20, 50, 40])
In [26]: a2 = np.array([100, 200, 300, 400])
In [27]: a1 + a2
Out[27]: array([110, 220, 350, 440])
yields a new array where the ith element is the sum of the ith elements of a1 and a2. Again, the plus operator has a very different meaning for lists. Try applying the + operator to two lists to compare what happens. Be careful about Boolean operations on NumPy arrays as they are also applied element-wise:
In [28]: a1 > 20
Out[28]: array([False, False, True, True])
In [29]: a2 = np.array([10,20,30,40])
In [30]: a1 == a2
Out[30]: array([ True,  True, False,  True])
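Because a comparison produces one Boolean per element, it cannot be used directly as a condition. The sketch below shows a few ways to reduce the element-wise result to a single Boolean. It uses the all and any methods and the np.array_equal function, which this lab does not otherwise cover, and the name matches is illustrative:

```python
import numpy as np

a1 = np.array([10, 20, 50, 40])
a2 = np.array([10, 20, 30, 40])

matches = a1 == a2               # element-wise: one Boolean per position
print(matches)                   # [ True  True False  True]

print(matches.all())             # False: not every element matches
print(matches.any())             # True: at least one element matches
print(np.array_equal(a1, a2))    # False: the arrays are not identical

# Using a multi-element Boolean array as an if condition raises an error.
try:
    if a1 == a2:
        pass
except ValueError as err:
    print("ValueError:", err)
```

Which reduction is appropriate depends on the question being asked: all checks that every position matches, while any checks that at least one does.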
NumPy also provides useful methods for operating on arrays, such as sum and mean:
In [28]: a1.sum()
Out[28]: 120
In [29]: a1.mean()
Out[29]: 30.0
which add up the values in the array and compute its mean respectively. These operations can also be written using notation that looks more like a function call:
In [32]: np.mean(a1)
Out[32]: 30.0
In [33]: np.sum(a1)
Out[33]: 120
Task 1: Write a function:
def var(y):
that computes the variance of y, where y is a NumPy array. We will define variance to be:
\[\mathrm{Var}(\mathbf{y}) = \frac{1}{N} \sum_{n=1}^{N} (y_n - \bar y)^2\]
where \(\mathbf{y} = (y_1, y_2, \dots, y_N)\), and \(\bar y\) denotes the mean of all of the \(y_n\). Your solution should not include an explicit loop.
The code in lab6.py will call var on an array called graffiti, which contains the graffiti column from the city data set that you will use in Programming Assignment #5, and on a garbage array, which contains the garbage column from the same data set. Here’s the output of our implementation:
GRAFFITI 409854.475818
GARBAGE 3159.33473311
Two-dimensional arrays
One-dimensional arrays are useful, but the real power of NumPy becomes more apparent when working with data that looks more like a matrix. For example, here’s a matrix represented using a list-of-lists:
m = [[0, 1, 4, 9],
[16, 25, 36, 49],
[64, 81, 100, 121],
[144, 169, 196, 225],
[256, 289, 324, 361],
[400, 441, 484, 529]]
We can convert this data into a two-dimensional array as follows:
In [34]: b = np.array(m)
where the value of b will be:
In [34]: b
array([[ 0, 1, 4, 9],
[ 16, 25, 36, 49],
[ 64, 81, 100, 121],
[144, 169, 196, 225],
[256, 289, 324, 361],
[400, 441, 484, 529]])
Accessing elements of a two-dimensional (2D) NumPy array can be done using the same syntax as a list-of-lists, that is, the expression b[i][j] will yield the jth element of the ith row of b. More conveniently, you can use a tuple to access the elements of a NumPy array. That is, the expression b[i, j] will also yield the jth element of the ith row of b.
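As a quick check that the two indexing forms agree, here is a small sketch using the first three rows of the matrix above. The names chained and direct are illustrative:

```python
import numpy as np

b = np.array([[0, 1, 4, 9],
              [16, 25, 36, 49],
              [64, 81, 100, 121]])

chained = b[1][2]   # select row 1 first, then element 2 of that row
direct = b[1, 2]    # select the same element in one indexing step
print(chained, direct)   # 36 36

# Tuple indexing also works on the left-hand side of an assignment.
b[1, 2] = -1
print(b[1])         # [16 25 -1 49]
```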
NumPy arrays also support slicing and other more advanced forms of indexing. For example, the expression b[1:4] will yield:
In [35]: b[1:4]
array([[ 16, 25, 36, 49],
[ 64, 81, 100, 121],
[144, 169, 196, 225]])
rows 1, 2, and 3 from b. The expression b[1:4, 2:4] will yield columns 2 and 3 from rows 1, 2, and 3 of b:
In [36]: b[1:4, 2:4]
array([[ 36, 49],
[100, 121],
[196, 225]])
As with list slicing, a colon (:) can be used to indicate that you wish to include all the indices in a particular dimension. For example, b[:,2:4] will yield a slice of b with columns 2 and 3 from all the rows. Recall that a slice excludes its endpoint.
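The following sketch shows the all-rows slice described above. It rebuilds b with np.arange and reshape, which are not introduced in this lab and are used here only as a shortcut for constructing the same 6-by-4 array. The name cols is illustrative:

```python
import numpy as np

# Same values as b above: the squares of 0..23 arranged in 6 rows of 4.
b = (np.arange(24) ** 2).reshape(6, 4)

# A bare colon selects every row; 2:4 selects columns 2 and 3
# (the endpoint, column 4, is excluded).
cols = b[:, 2:4]
print(cols.shape)   # (6, 2)
print(cols)
```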
In addition to slicing, you can also specify a list of indices as an index. For example, the expression b[:, [1, 3]] will yield columns 1 and 3 from b:
In [37]: b[:, [1, 3]]
array([[ 1, 9],
[ 25, 49],
[ 81, 121],
[169, 225],
[289, 361],
[441, 529]])
One thing to keep in mind with NumPy arrays is that you will lose a dimension if you specify a single column or row as an index. For example, notice that the results of the following two expressions are both one-dimensional arrays:
In [38]: b[1, :]
Out[38]: array([16, 25, 36, 49])
In [39]: b[:, 1]
Out[39]: array([ 1, 25, 81, 169, 289, 441])
If you wish to retain the dimension, you can use list indexing:
In [40]: b[:, [1]]
Out[40]:
array([[ 1],
[ 25],
[ 81],
[169],
[289],
[441]])
In [41]: b[[1], :]
Out[41]: array([[16, 25, 36, 49]])
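Comparing shapes makes the lost dimension easy to see. In this sketch, b is rebuilt with np.arange and reshape purely for convenience, and the variable names are illustrative. It also shows that a one-element slice keeps the dimension, just as a one-element list does:

```python
import numpy as np

# Same values as b above: the squares of 0..23 arranged in 6 rows of 4.
b = (np.arange(24) ** 2).reshape(6, 4)

col_1d = b[:, 1]        # single integer index: the dimension is dropped
col_2d = b[:, [1]]      # one-element list: the dimension is kept
col_slice = b[:, 1:2]   # one-element slice: the dimension is also kept
row_2d = b[[1], :]      # one-element list on the row axis

print(col_1d.shape)     # (6,)
print(col_2d.shape)     # (6, 1)
print(col_slice.shape)  # (6, 1)
print(row_2d.shape)     # (1, 4)
```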
Task 2: In the task2 function in lab6.py, write expressions to extract the following subarrays of b:
- rows 0, 1, and 2
- rows 0, 1, and 5
- columns 0, 1, and 2
- columns 0, 1, and 3
- columns 0, 1, and 2 from rows 2 and 3.
Task 3: We have imported the functions linear_regression and prepend_ones_column from PA #5 in lab6.py. Write code to call linear_regression using a column of all ones and columns 2 (RODENTS) and 3 (GARBAGE) of city_data as the value for X and column 7 (CRIME_TOTALS) as the value for y. This function expects a two-dimensional NumPy array for the value of X and a one-dimensional NumPy array for the value of y.
Hint: you can do this task in a single line of code.
The result should be:
array([ 66.60834501 0.58072845 16.82863941])
Task 4: Write code to call linear_regression using a column of all ones and column 0 (GRAFFITI) of city_data as the value for X and column 7 (CRIME_TOTALS) as the value for y.
The result should be:
array([ 593.77754996 0.76632378])
Other useful operations
You can find the number of dimensions, shape, and the number of elements in a NumPy array using the ndim, shape, and size attributes, respectively.
In [42]: b.ndim
Out[42]: 2
In [43]: b.shape
Out[43]: (6, 4)
In [44]: b.size
Out[44]: 24
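These three attributes are related: ndim is the length of the shape tuple, and size is the product of its entries. A brief sketch, again rebuilding b with np.arange and reshape only as a shortcut:

```python
import numpy as np

# Same values as b above: the squares of 0..23 arranged in 6 rows of 4.
b = (np.arange(24) ** 2).reshape(6, 4)

print(b.ndim == len(b.shape))               # True: 2 dimensions
print(b.size == b.shape[0] * b.shape[1])    # True: 6 * 4 == 24

# The same attributes work for a one-dimensional array.
a1 = np.array([10, 20, 50, 40])
print(a1.ndim, a1.shape, a1.size)           # 1 (4,) 4
```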
As noted above, you can compute the mean of the elements using the mean method.
In [54]: b.mean()
Out[54]: 180.16666666666666
You can also compute the per-column mean and the per-row mean using the mean method by specifying an axis. Axis 0 averages down the rows, giving one mean per column, and axis 1 averages across the columns, giving one mean per row:
In [55]: b.mean(0)
Out[55]: array([ 146.66666667, 167.66666667, 190.66666667, 215.66666667])
In [56]: b.mean(1)
Out[56]: array([ 3.5, 31.5, 91.5, 183.5, 307.5, 463.5])
In [57]: np.mean(b, axis=0)
Out[57]: array([ 146.66666667, 167.66666667, 190.66666667, 215.66666667])
In [58]: np.mean(b, axis=1)
Out[58]: array([ 3.5, 31.5, 91.5, 183.5, 307.5, 463.5])
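One way to keep the axis argument straight is to look at the shape of the result: the axis you name is the one that disappears. The sketch below, with illustrative variable names and b rebuilt via np.arange and reshape as a shortcut, checks this and recomputes one per-column mean by hand:

```python
import numpy as np

# Same values as b above: the squares of 0..23 arranged in 6 rows of 4.
b = (np.arange(24) ** 2).reshape(6, 4)

col_means = b.mean(axis=0)   # axis 0 (length 6) is averaged away
row_means = b.mean(axis=1)   # axis 1 (length 4) is averaged away
print(col_means.shape)       # (4,): one mean per column
print(row_means.shape)       # (6,): one mean per row

# Recompute the mean of column 0 by hand: sum of the column / number of rows.
manual = b[:, 0].sum() / b.shape[0]
print(np.isclose(manual, col_means[0]))   # True
```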
When Finished
When finished with the lab, please check in your work (assuming you are inside the lab directory):
git add lab6.py
git commit -m "Finished with lab6"
git push
No, we’re not grading this, we just want to look for common errors.