NumPy

Throughout this course, we have seen a number of examples of manipulating lists of numbers. NumPy is a Python library that makes such manipulations easy. NumPy supports numerical multi-dimensional arrays (standing in for lists of lists, lists of lists of lists, etc.) where each axis has a fixed length. This lab will introduce NumPy and give an overview of some of its key functionality.

You need to import NumPy before you can use it. We usually give the library a shorter name by using the import-as mechanism:

import numpy as np

Once this import is done, you can use functions from the numpy library using np as the qualifier.

Getting started

To get started, open up a terminal and navigate (cd) to your cmsc12100-aut-20-username directory. Run git pull upstream master to collect the lab materials and git pull to sync with your personal repository. The lab5 directory contains a file named lab5.py.

This file includes a function, read_file, that takes the name of a CSV file as an argument and returns a list of the column names and a two-dimensional array of data. The file also includes a call to read_file that loads a dataset you will use in PA #5. This dataset contains one row for every region of a city (the exact regions are unimportant). All but the last column contain the number of complaints of a given type reported through 311 (e.g., graffiti, pot holes, etc.). The last column contains the total number of crimes reported in that region of the city.

Fire up ipython3 and run lab5.py to get started. This run will print out some output which you can ignore for now.

One-dimensional arrays in NumPy

We’ll start by looking at one-dimensional arrays in NumPy. Unlike Python lists, all of the values in a NumPy array must have the same type. We can create a one-dimensional NumPy array from a list using the function np.array. For example,

In [10]: a1 = np.array([10, 20, 30, 40])

In [11]: a1
Out[11]: array([10, 20, 30, 40])
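To make the same-type rule concrete, here is a minimal sketch that inspects the dtype attribute, which records the element type of an array. The names ints and mixed are illustrative and do not appear in lab5.py.

```python
import numpy as np

# All integers: NumPy picks an integer element type
# (the exact width, e.g. int64, depends on the platform).
ints = np.array([10, 20, 30, 40])
print(ints.dtype)

# Mixing integers and floats: every element is converted to a float
# so that the array still holds a single type.
mixed = np.array([10, 20.5, 30])
print(mixed.dtype)   # float64
```

If you build an array from values of different types, it is worth checking dtype to see which common type NumPy chose.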

NumPy arrays are distinct from lists and you should use NumPy’s built-in functions and attributes to determine sizes. For example, if we call a1.shape, we get:

In [12]: a1.shape
Out[12]: (4,)

In [13]: a1.shape[0]
Out[13]: 4

Note that the result of a1.shape is a tuple that describes the size of the array in all dimensions (a1 is a one-dimensional array, so it is a tuple of size 1).

We can access/update the ith element of the array using [] notation:

In [14]: a1[0]
Out[14]: 10

In [15]: a1[2]
Out[15]: 30

In [17]: a1[2] = 50

In [18]: a1
Out[18]: array([10, 20, 50, 40])

Operations on NumPy arrays are element-wise. For example, the expression:

In [23]: a1 * 2
Out[23]: array([ 20,  40, 100,  80])

yields a new NumPy array where the ith element of the result is equal to the ith element of a1 times 2. Note that a given operator (e.g., *) can have a different meaning depending on the data type to which it is applied. For example, try making a1 a list, rather than a NumPy array, and repeat the same operation.

Similarly,

In [25]: a1
Out[25]: array([10, 20, 50, 40])

In [26]: a2 = np.array([100, 200, 300, 400])

In [27]: a1 + a2
Out[27]: array([110, 220, 350, 440])

yields a new array where the ith element is the sum of the ith elements of a1 and a2. Again, the plus operator has a very different meaning for lists. Try applying the + operator to two lists to compare what happens. Be careful about Boolean operations on NumPy arrays as they are also applied element-wise:

In [28]: a1 > 20
Out[28]: array([False, False,  True,  True])

In [29]: a2 = np.array([10,20,30,40])
In [30]: a1 == a2
Out[30]: array([ True,  True, False,  True])
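Because a comparison yields an array of Booleans rather than a single True or False, it usually needs to be reduced to one value before it can serve as the condition of an if statement. The sketch below shows a few common ways to do that with the all and any methods and the np.array_equal function. The names x, y, and same are illustrative.

```python
import numpy as np

x = np.array([10, 20, 50, 40])
y = np.array([10, 20, 30, 40])

# Element-wise comparison: one Boolean per position.
same = x == y

print(same.all())             # False: not every pair of elements matches
print(same.any())             # True: at least one pair matches
print(np.array_equal(x, y))   # False: the two arrays are not identical
```

Which reduction is appropriate depends on the question being asked: all for "do they agree everywhere?" and any for "do they agree anywhere?".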

NumPy also provides useful methods for operating on arrays, such as sum and mean:

In [28]: a1.sum()
Out[28]: 120

In [29]: a1.mean()
Out[29]: 30.0

which add up the values in the array and compute its mean respectively. These operations can also be written using notation that looks more like a function call:

In [32]: np.mean(a1)
Out[32]: 30.0

In [33]: np.sum(a1)
Out[33]: 120

Task 1: Write a function:

def var(y):

that computes the variance of y, where y is a NumPy array. We will define variance to be:

\[\mathrm{Var}(\mathbf{y}) = \frac{1}{N}{\sum_{n=1}^N} (y_n - \bar y)^2,\]

where \(\mathbf{y} = (y_1, y_2, \dots, y_N)\), and \(\bar y\) denotes the mean of all of the \(y_n\). Your solution should not include an explicit loop.
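As a sanity check for Task 1, it can help to evaluate the formula on a small array where the answer is easy to work out by hand. The sketch below uses an explicit loop on purpose, so it is not a valid Task 1 solution; the function var_with_loop and the sample values are our own illustration. It also compares against np.var, which divides by N by default and so matches the definition above.

```python
import numpy as np

def var_with_loop(y):
    """Reference-only variance using an explicit loop (not a Task 1 solution)."""
    n = y.shape[0]
    y_bar = y.sum() / n
    total = 0.0
    for value in y:
        total += (value - y_bar) ** 2
    return total / n

sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# By hand: the mean is 5.0, the squared deviations sum to 32.0,
# and 32.0 / 8 = 4.0.
print(var_with_loop(sample))   # 4.0
print(np.var(sample))          # 4.0
```

Your loop-free var should produce the same value on this sample.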

The code in lab5.py will call var on an array called graffiti, which contains the graffiti column from the city data set that you will use in Programming Assignment #5, and on a garbage array, which contains the garbage column from the same data set. Here’s the output of our implementation:

GRAFFITI 409854.475818
GARBAGE 3159.33473311

Two-dimensional arrays

One-dimensional arrays are useful, but the real power of NumPy becomes more apparent when working with data that looks more like a matrix. For example, here’s a matrix represented using a list-of-lists:

m = [[0, 1, 4, 9],
     [16, 25, 36, 49],
     [64, 81, 100, 121],
     [144, 169, 196, 225],
     [256, 289, 324, 361],
     [400, 441, 484, 529]]

We can convert this data into a two-dimensional array as follows:

In [34]: b = np.array(m)

where the value of b will be:

In [34]: b
array([[  0,   1,   4,   9],
       [ 16,  25,  36,  49],
       [ 64,  81, 100, 121],
       [144, 169, 196, 225],
       [256, 289, 324, 361],
       [400, 441, 484, 529]])

Accessing elements of a two-dimensional (2D) NumPy array can be done using the same syntax as a list-of-lists, that is, the expression b[i][j] will yield the jth element of the ith row of b. More conveniently, you can use a tuple to access the elements of a NumPy array. That is, the expression b[i, j] will also yield the jth element of the ith row of b.
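A minimal sketch of the two indexing styles side by side. To keep the example self-contained, it rebuilds b with np.arange and reshape (a shortcut of ours, not something lab5.py does); the resulting values are the same as in the matrix m.

```python
import numpy as np

# Same values as the matrix m: the squares of 0..23 in 6 rows of 4.
b = np.arange(24).reshape(6, 4) ** 2

print(b[2][3])   # 121: select row 2 first, then element 3 of that row
print(b[2, 3])   # 121: a single indexing step with a (row, column) tuple
```

Both expressions return the same element; the tuple form is the one the rest of this lab uses.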

NumPy arrays also support slicing and other more advanced forms of indexing. For example, the expression b[1:4] will yield:

In [35]: b[1:4]
array([[ 16,  25,  36,  49],
       [ 64,  81, 100, 121],
       [144, 169, 196, 225]])

rows 1, 2, and 3 from b. The expression b[1:4, 2:4] will yield columns 2 and 3 from rows 1, 2, and 3 of b:

In [36]: b[1:4, 2:4]
array([[ 36,  49],
       [100, 121],
       [196, 225]])

As with list slicing, a colon (:) by itself indicates that you wish to include all the indices in a particular dimension. For example, b[:, 2:4] will yield a slice of b with columns 2 and 3 from all the rows. Recall that a slice excludes its endpoint.
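The all-rows slice can be sketched as follows. The array b is rebuilt here with np.arange so that the example runs on its own, and the name cols is illustrative.

```python
import numpy as np

# Same values as b in the text: the squares of 0..23 in 6 rows of 4.
b = np.arange(24).reshape(6, 4) ** 2

# All rows, columns 2 and 3 (the endpoint 4 is excluded).
cols = b[:, 2:4]

print(cols.shape)   # (6, 2): six rows, two columns
print(cols[0])      # [4 9]: columns 2 and 3 of row 0
```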

In addition to slicing, you can also specify a list of indices as an index. For example, the expression b[:, [1, 3]] will yield columns 1 and 3 from b:

In [37]: b[:, [1, 3]]
array([[  1,   9],
       [ 25,  49],
       [ 81, 121],
       [169, 225],
       [289, 361],
       [441, 529]])

One thing to keep in mind with NumPy arrays is that you will lose a dimension if you specify a single column or row as an index. For example, notice that the results of the following two expressions are both one-dimensional arrays:

In [38]: b[1, :]
Out[38]: array([16, 25, 36, 49])

In [39]: b[:, 1]
Out[39]: array([  1,  25,  81, 169, 289, 441])

If you wish to retain the dimension, you can use list indexing:

In [40]: b[:, [1]]
Out[40]:
array([[  1],
       [ 25],
       [ 81],
       [169],
       [289],
       [441]])

In [41]: b[[1], :]
Out[41]: array([[16, 25, 36, 49]])
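Checking the shape attribute is a quick way to see whether a dimension was kept or dropped. The sketch below rebuilds b so that it runs on its own. It also shows that a slice of length one, such as 1:2, keeps the dimension; that variant is not used elsewhere in this lab and is included only for comparison.

```python
import numpy as np

# Same values as b in the text: the squares of 0..23 in 6 rows of 4.
b = np.arange(24).reshape(6, 4) ** 2

print(b[:, 1].shape)     # (6,): single index, column dimension dropped
print(b[:, [1]].shape)   # (6, 1): list index, still two-dimensional
print(b[:, 1:2].shape)   # (6, 1): length-one slice, still two-dimensional

print(b[1, :].shape)     # (4,): single index, row dimension dropped
print(b[[1], :].shape)   # (1, 4): list index, still two-dimensional
```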

Task 2: In the task2 function in lab5.py, write expressions to extract the following subarrays of b:

  • rows 0, 1, and 2

  • rows 0, 1, and 5

  • columns 0, 1, and 2

  • columns 0, 1, and 3

  • columns 0, 1, and 2 from rows 2 and 3

Task 3: We have imported the functions linear_regression and prepend_ones_column (which you will use in PA #5) in lab5.py. Write code to call linear_regression using a column of all ones and columns 2 (RODENTS) and 3 (GARBAGE) of city_data as the value for X and column 7 (CRIME_TOTALS) as the value for y. This function expects a two-dimensional NumPy array for the value of X and a one-dimensional NumPy array for the value of y.

Hint: you can do this task in a single line of code.

The result should be:

array([ 66.60834501,   0.58072845,  16.82863941])

Do not worry if you do not understand what this result means. The goal is to ensure you understand how to perform basic array operations. In PA #5 we will dig deeper into the linear regression involved in this computation.

Task 4: Write code to call linear_regression using a column of all ones and column 0 (GRAFFITI) of city_data as the value for X and column 7 (CRIME_TOTALS) as the value for y.

The result should be:

array([ 593.77754996,    0.76632378])

Other useful operations

You can find the number of dimensions, the shape, and the number of elements in a NumPy array using the ndim, shape, and size attributes, respectively.

In [42]: b.ndim
Out[42]: 2

In [43]: b.shape
Out[43]: (6, 4)

In [44]: b.size
Out[44]: 24

As noted above, you can compute the mean of the elements using the mean method.

In [54]: b.mean()
Out[54]: 180.16666666666666

You can also compute the per-column mean and the per-row mean using the mean method by specifying an axis. Passing 0 averages down the rows and yields one mean per column; passing 1 averages across the columns and yields one mean per row:

In [55]: b.mean(0)
Out[55]: array([ 146.66666667,  167.66666667,  190.66666667,  215.66666667])

In [56]: b.mean(1)
Out[56]: array([   3.5,   31.5,   91.5,  183.5,  307.5,  463.5])

In [57]: np.mean(b, axis=0)
Out[57]: array([ 146.66666667,  167.66666667,  190.66666667,  215.66666667])

In [58]: np.mean(b, axis=1)
Out[58]: array([   3.5,   31.5,   91.5,  183.5,  307.5,  463.5])
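One way to keep the axis argument straight is to look at the shape of the result: the axis you pass is the one that disappears. A minimal sketch, rebuilding b so that it runs on its own (col_means and row_means are illustrative names):

```python
import numpy as np

# Same values as b in the text: the squares of 0..23 in 6 rows of 4.
b = np.arange(24).reshape(6, 4) ** 2

col_means = b.mean(axis=0)   # collapses the 6 rows
row_means = b.mean(axis=1)   # collapses the 4 columns

print(col_means.shape)   # (4,): one mean per column
print(row_means.shape)   # (6,): one mean per row

# Each entry matches the mean of the corresponding column or row slice.
print(np.isclose(col_means[0], b[:, 0].mean()))   # True
print(np.isclose(row_means[0], b[0, :].mean()))   # True
```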

When Finished

When finished with the lab please check in your work (assuming you are inside the lab directory):

git add lab5.py
git commit -m "Finished with lab5"
git push

No, we’re not grading this; we just want to look for common errors.