NumPy

Throughout this course, we have seen a number of examples of manipulating lists of numbers. NumPy is a Python library that makes such manipulations easy. NumPy supports numerical multi-dimensional arrays (standing in for lists of lists, lists of lists of lists, etc.) where each axis has a fixed length. This lab introduces NumPy and gives an overview of some of its key functionality.

You need to import NumPy before you can use it. We usually give the library a shorter name by using the import-as mechanism:

import numpy as np

Once this import is done, you can use functions from the numpy library using np as the qualifier.

Getting started

To get started, open up a terminal and navigate (cd) to your cmsc12100-aut-19-username directory. Run git pull upstream master to collect the lab materials and git pull to sync with your personal repository. The lab6 directory contains a file named lab6.py.

This file includes a function, read_file, that takes the name of a CSV file as an argument and returns a list of the column names and a two-dimensional array of data. It also includes a call to read_file that loads the training data from the city dataset for PA #5.

Fire up ipython3 and run lab6.py to get started. This run will print out some output which you can ignore for now.

One-dimensional arrays in NumPy

We’ll start by looking at one-dimensional arrays in NumPy. Unlike Python lists, all of the values in a NumPy array must have the same type. We can create a one-dimensional NumPy array from a list using the function np.array. For example,

In [10]: a1 = np.array([10, 20, 30, 40])

In [11]: a1
Out[11]: array([10, 20, 30, 40])
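As a small sketch of the same-type rule (the values are made up, and the dtype attribute is not otherwise covered in this lab): when a list mixes integers and floats, np.array converts every element to one common type.

```python
import numpy as np

ints = np.array([10, 20, 30, 40])
mixed = np.array([10, 20.5, 30])   # the integers are converted to floats

# dtype reports the single element type shared by the whole array.
print(np.issubdtype(ints.dtype, np.integer))   # True
print(mixed.dtype)                             # float64
print(mixed[0])                                # 10.0
```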

NumPy arrays are distinct from lists and you should use NumPy’s built-in functions and attributes to determine sizes. For example, if we call a1.shape, we get:

In [12]: a1.shape
Out[12]: (4,)

In [13]: a1.shape[0]
Out[13]: 4

Note that the result of a1.shape is a tuple that describes the size of the array in all dimensions (a1 is a one-dimensional array, so the tuple has length 1).

We can access/update the ith element of the array using [] notation:

In [14]: a1[0]
Out[14]: 10

In [15]: a1[2]
Out[15]: 30

In [17]: a1[2] = 50

In [18]: a1
Out[18]: array([10, 20, 50, 40])

Operations on NumPy arrays are element-wise. For example, the expression:

In [23]: a1 * 2
Out[23]: array([ 20,  40, 100,  80])

yields a new NumPy array where the ith element of the result is equal to the ith element of a1 times 2. Note that a given operator (e.g., *) can have a different meaning depending on the data type to which it is applied. For example, try making a1 a list, rather than a NumPy array, and repeat the same operation.
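If you want to check your experiment, here is a minimal side-by-side sketch of the * operator on a list and on an array holding the same values (the variable names are chosen just for this illustration):

```python
import numpy as np

values = [10, 20, 50, 40]
arr = np.array(values)

# For a list, * means repetition: the list is concatenated with itself.
print(values * 2)   # [10, 20, 50, 40, 10, 20, 50, 40]

# For a NumPy array, * means element-wise multiplication.
print(arr * 2)      # [ 20  40 100  80]
```

The array behavior is the one this lab relies on: every arithmetic operator is applied element by element.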

Similarly,

In [25]: a1
Out[25]: array([10, 20, 50, 40])

In [26]: a2 = np.array([100, 200, 300, 400])

In [27]: a1 + a2
Out[27]: array([110, 220, 350, 440])

yields a new array where the ith element is the sum of the ith elements of a1 and a2. Again, the plus operator has a very different meaning for lists. Try applying the + operator to two lists to compare what happens. Be careful about Boolean operations on NumPy arrays as they are also applied element-wise:

In [28]: a1 > 20
Out[28]: array([False, False,  True,  True])

In [29]: a2 = np.array([10,20,30,40])
In [30]: a1 == a2
Out[30]: array([ True,  True, False,  True])
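Because a comparison produces an array of Booleans rather than a single True or False, it cannot be used directly as the condition of an if statement. The sketch below uses two small made-up arrays, x and y, and shows a few standard ways to reduce a Boolean array to a single value; the all and any methods and np.array_equal are NumPy features that this lab does not otherwise cover.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([1, 5, 3])

same = x == y                  # element-wise: [ True False  True]

print(same.all())              # False: not every pair matches
print(same.any())              # True: at least one pair matches
print(np.array_equal(x, y))    # False: the arrays are not identical
print((x > 1).sum())           # 2: each True counts as 1 when summed

# Using the Boolean array itself as a condition is ambiguous,
# so NumPy raises a ValueError instead of guessing.
try:
    if x == y:
        pass
except ValueError as err:
    print("ValueError:", err)
```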

NumPy also provides useful methods for operating on arrays, such as sum and mean:

In [28]: a1.sum()
Out[28]: 120

In [29]: a1.mean()
Out[29]: 30.0

which add up the values in the array and compute its mean respectively. These operations can also be written using notation that looks more like a function call:

In [32]: np.mean(a1)
Out[32]: 30.0

In [33]: np.sum(a1)
Out[33]: 120

Task 1: Write a function:

def var(y):

that computes the variance of y, where y is a NumPy array. We will define variance to be:

\[\mathrm{Var}(\mathbf{y}) = \frac{1}{N}{\sum_{n=1}^N} (y_n - \bar y)^2,\]

where \(\mathbf{y} = (y_1, y_2, \dots, y_N)\), and \(\bar y\) denotes the mean of all of the \(y_n\). Your solution should not include an explicit loop.

The code in lab6.py will call var on an array called graffiti, which contains the graffiti column from the city data set that you will use in Programming Assignment #5, and on a garbage array, which contains the garbage column from the same data set. Here’s the output of our implementation:

GRAFFITI 409854.475818
GARBAGE 3159.33473311
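One possible way to check your var function, offered as a suggestion rather than as part of the lab: NumPy's own np.var divides by N by default, which matches the definition above, so the two should agree on any array. The sketch below uses a small made-up array whose variance is easy to work out by hand.

```python
import numpy as np

y = np.array([10, 20, 50, 40])

# By hand: the mean is 30, the deviations are -20, -10, 20, and 10,
# their squares sum to 400 + 100 + 400 + 100 = 1000, and 1000 / 4 = 250.
expected = np.var(y)   # divides by N by default (ddof=0)
print(expected)        # 250.0

# Once var is written, np.isclose(var(y), expected) should be True.
```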

Two-dimensional arrays

One-dimensional arrays are useful, but the real power of NumPy becomes more apparent when working with data that looks more like a matrix. For example, here’s a matrix represented using a list-of-lists:

m = [[0, 1, 4, 9],
     [16, 25, 36, 49],
     [64, 81, 100, 121],
     [144, 169, 196, 225],
     [256, 289, 324, 361],
     [400, 441, 484, 529]]

We can convert this data into a two-dimensional array as follows:

In [34]: b = np.array(m)

where the value of b will be:

In [34]: b
array([[  0,   1,   4,   9],
       [ 16,  25,  36,  49],
       [ 64,  81, 100, 121],
       [144, 169, 196, 225],
       [256, 289, 324, 361],
       [400, 441, 484, 529]])

Accessing elements of a two-dimensional (2D) NumPy array can be done using the same syntax as a list-of-lists, that is, the expression b[i][j] will yield the jth element of the ith row of b. More conveniently, you can use a tuple to access the elements of a NumPy array. That is, the expression b[i, j] will also yield the jth element of the ith row of b.
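A short sketch of the two indexing forms. To keep the example self-contained, b is rebuilt here with np.arange and reshape, a shortcut that is not used elsewhere in this lab; it produces the same squares 0, 1, 4, ..., 529 as the list-of-lists m.

```python
import numpy as np

# Squares of 0..23 arranged as 6 rows of 4, matching b above.
b = (np.arange(24) ** 2).reshape(6, 4)

print(b[2][3])   # 121: index row 2 first, then element 3 of that row
print(b[2, 3])   # 121: the same element, selected in one step

b[2, 3] = -1     # the tuple form also works for assignment
print(b[2])      # [ 64  81 100  -1]
```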

NumPy arrays also support slicing and other more advanced forms of indexing. For example, the expression b[1:4] will yield:

In [35]: b[1:4]
array([[ 16,  25,  36,  49],
       [ 64,  81, 100, 121],
       [144, 169, 196, 225]])

rows 1, 2, and 3 from b. The expression b[1:4, 2:4] will yield columns 2 and 3 from rows 1, 2, and 3 of b:

In [36]: b[1:4, 2:4]
array([[ 36,  49],
       [100, 121],
       [196, 225]])

As with list slicing, a colon (:) by itself indicates that you wish to include all the indices in a particular dimension. For example, b[:,2:4] will yield a slice of b with columns 2 and 3 from all the rows. Recall that a slice excludes its endpoint.
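The b[:,2:4] slice described above can be sketched as follows. Here b is rebuilt with np.arange and reshape so that the example runs on its own; that shortcut is an assumption of this sketch, not something the lab requires.

```python
import numpy as np

# Squares of 0..23 arranged as 6 rows of 4, matching b above.
b = (np.arange(24) ** 2).reshape(6, 4)

cols = b[:, 2:4]     # every row, columns 2 and 3 (column 4 is excluded)
print(cols.shape)    # (6, 2)
print(cols)
# [[  4   9]
#  [ 36  49]
#  [100 121]
#  [196 225]
#  [324 361]
#  [484 529]]
```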

In addition to slicing, you can also specify a list of indices as an index. For example, the expression b[:, [1, 3]] will yield columns 1 and 3 from b:

In [37]: b[:, [1, 3]]
array([[  1,   9],
       [ 25,  49],
       [ 81, 121],
       [169, 225],
       [289, 361],
       [441, 529]])

One thing to keep in mind with NumPy arrays is that you will lose a dimension if you specify a single column or row as an index. For example, notice that the results of the following two expressions are both one-dimensional arrays:

In [38]: b[1, :]
Out[38]: array([16, 25, 36, 49])

In [39]: b[:, 1]
Out[39]: array([  1,  25,  81, 169, 289, 441])

If you wish to retain the dimension, you can use list indexing:

In [40]: b[:, [1]]
Out[40]:
array([[  1],
       [ 25],
       [ 81],
       [169],
       [289],
       [441]])

In [41]: b[[1], :]
Out[41]: array([[16, 25, 36, 49]])
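Printing the shape of each result makes the difference easy to see. The sketch below rebuilds b with np.arange and reshape (a shortcut assumed only for this example) and also shows a one-element slice, which is another standard NumPy way to keep the dimension.

```python
import numpy as np

# Squares of 0..23 arranged as 6 rows of 4, matching b above.
b = (np.arange(24) ** 2).reshape(6, 4)

print(b[:, 1].shape)     # (6,)   integer index: the column dimension is dropped
print(b[:, [1]].shape)   # (6, 1) list index: the dimension is kept
print(b[:, 1:2].shape)   # (6, 1) one-element slice: the dimension is kept

print(b[1, :].shape)     # (4,)   integer index: the row dimension is dropped
print(b[[1], :].shape)   # (1, 4) list index: the dimension is kept
```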

Task 2: In the task2 function in lab6.py, write expressions to extract the following subarrays of b:

  • rows 0, 1, and 2.
  • rows 0, 1, and 5
  • columns 0, 1, and 2
  • columns 0, 1, and 3
  • columns 0, 1, and 2 from rows 2 and 3.

Task 3: We have imported the functions linear_regression and prepend_ones_column from PA #5 in lab6.py. Write code to call linear_regression using a column of all ones and columns 2 (RODENTS) and 3 (GARBAGE) of city_data as the value for X and column 7 (CRIME_TOTALS) as the value for y. This function expects a two-dimensional NumPy array for the value of X and a one-dimensional NumPy array for the value of y.

Hint: you can do this task in a single line of code.

The result should be:

array([ 66.60834501   0.58072845  16.82863941])

Task 4: Write code to call linear_regression using a column of all ones and column 0 (GRAFFITI) of city_data as the value for X and column 7 (CRIME_TOTALS) as the value for y.

The result should be:

array([ 593.77754996    0.76632378])

Other useful operations

You can find the number of dimensions, the shape, and the number of elements in a NumPy array using the ndim, shape, and size attributes, respectively.

In [42]: b.ndim
Out[42]: 2

In [43]: b.shape
Out[43]: (6, 4)

In [44]: b.size
Out[44]: 24

As noted above, you can compute the mean of the elements using the mean method.

In [54]: b.mean()
Out[54]: 180.16666666666666

You can also compute the per-column mean and the per-row mean using the mean method by specifying an axis. Passing 0 averages down the rows and yields one mean per column; passing 1 averages across the columns and yields one mean per row:

In [55]: b.mean(0)
Out[55]: array([ 146.66666667,  167.66666667,  190.66666667,  215.66666667])

In [56]: b.mean(1)
Out[56]: array([   3.5,   31.5,   91.5,  183.5,  307.5,  463.5])

In [57]: np.mean(b, axis=0)
Out[57]: array([ 146.66666667,  167.66666667,  190.66666667,  215.66666667])

In [58]: np.mean(b, axis=1)
Out[58]: array([   3.5,   31.5,   91.5,  183.5,  307.5,  463.5])
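One way to remember which axis is which is to look at the shape of the result: the axis you name is the one that disappears. The sketch below rebuilds b with np.arange and reshape (a shortcut assumed only for this example) and applies the same idea to sum, which accepts an axis in the same way as mean.

```python
import numpy as np

# Squares of 0..23 arranged as 6 rows of 4, matching b above.
b = (np.arange(24) ** 2).reshape(6, 4)

print(b.shape)           # (6, 4)
print(b.mean(0).shape)   # (4,): axis 0 (length 6) is collapsed, one value per column
print(b.mean(1).shape)   # (6,): axis 1 (length 4) is collapsed, one value per row

print(b.sum(axis=0))     # [ 880 1006 1144 1294]
print(b.sum(axis=1))     # [  14  126  366  734 1230 1854]
```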

When Finished

When finished with the lab please check in your work (assuming you are inside the lab directory):

git add lab6.py
git commit -m "Finished with lab6"
git push

No, we’re not grading this, we just want to look for common errors.