Team Tutorial #5: NumPy

Throughout this course, we have seen a number of examples of manipulating lists of numbers. NumPy is a Python library that makes such manipulations easy. NumPy supports numerical multi-dimensional arrays (standing in for lists of lists, lists of lists of lists, etc.) where each axis has a fixed length. This tutorial will introduce NumPy and give an overview of some of its key functionality.

You need to import NumPy before you can use it. We usually give the library a shorter name by using the import-as mechanism:

import numpy as np

Once this import is done, you can use functions from the numpy library with np as the qualifier.

Getting started

If this is your first time working through a Team Tutorial, please see the “Getting started” section of Team Tutorial #1 to set up your Team Tutorials repository.

To get the files for this tutorial, set a variable GITHUB_USERNAME with your GitHub username, navigate to your Team Tutorials repository, and then pull the new material from the upstream repository:

cd ~/capp30121
cd team-tutorials-$GITHUB_USERNAME
git pull upstream main

You will find the files you need in the tt5 directory.

One-dimensional arrays in NumPy

We’ll start by looking at one-dimensional arrays in NumPy. Unlike Python lists, all of the values in a NumPy array must have the same type. We can create a one-dimensional NumPy array from a list using the function np.array. For example,

In [2]: a1 = np.array([10, 20, 30, 40])

In [3]: a1
Out[3]: array([10, 20, 30, 40])

We can also create one-dimensional NumPy arrays in the following ways:

In [4]: a2 = np.arange(5)

In [5]: a2
Out[5]: array([0, 1, 2, 3, 4])

In [6]: a3 = np.arange(1,10,2)

In [7]: a3
Out[7]: array([1, 3, 5, 7, 9])

In [8]: a4 = np.ones(5)

In [9]: a4
Out[9]: array([1., 1., 1., 1., 1.])

In [10]: a5 = np.zeros(5)

In [11]: a5
Out[11]: array([0., 0., 0., 0., 0.])
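The single-type rule mentioned above explains why np.ones and np.zeros print their values with a trailing dot. The short sketch below is an illustration, not part of tt5.py; the variable names are made up for this example. It shows how NumPy picks one type for mixed input and how the dtype parameter requests a specific type.

```python
import numpy as np

# Mixing ints and floats: NumPy picks one type that can hold every value.
mixed = np.array([1, 2.5, 3])
print(mixed)
print(mixed.dtype)  # float64

# np.ones and np.zeros produce floats by default, which is why their
# output is printed with trailing dots (1., 0.).
ones_float = np.ones(5)
ones_int = np.ones(5, dtype=int)  # ask for integers explicitly
print(ones_float.dtype, ones_int.dtype)
```

The dtype attribute is a quick way to check what an array actually stores when a result looks different from what you expected.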

NumPy arrays are distinct from lists and you should use NumPy’s built-in functions and attributes to determine sizes. For example, if we call a1.shape, we get:

In [12]: a1.shape
Out[12]: (4,)

In [13]: a1.shape[0]
Out[13]: 4

Note that the result of a1.shape is a tuple that describes the size of the array in all dimensions (a1 is a one-dimensional array, so it is a tuple of size 1).

Similarly, you can find the number of dimensions and the number of elements in a NumPy array using the ndim and size attributes, respectively. The built-in len function also works on arrays and returns the length of the first dimension.

In [14]: a1.ndim
Out[14]: 1

In [15]: a1.size
Out[15]: 4

In [16]: len(a1)
Out[16]: 4

We can access/update the ith element of the array using [] notation:

In [17]: a1[0]
Out[17]: 10

In [18]: a1[2]
Out[18]: 30

In [19]: a1[2] = 50

In [20]: a1
Out[20]: array([10, 20, 50, 40])

Operations on NumPy arrays are element-wise. For example, the expression:

In [21]: a1 * 2
Out[21]: array([ 20,  40, 100,  80])

yields a new NumPy array where the ith element of the result is equal to the ith element of a1 times 2.

Note that a given operator (e.g., *) can have a different meaning depending on the data type to which it is applied. For example, try making a1 a list, rather than a NumPy array, and repeat the same operation.

In [22]: a1_list = [10, 20, 50, 40]

In [23]: a1_list * 2
Out[23]: [10, 20, 50, 40, 10, 20, 50, 40]

Similarly,

In [24]: a1
Out[24]: array([10, 20, 50, 40])

In [25]: a2 = np.array([100, 200, 300, 400])

In [26]: a1 + a2
Out[26]: array([110, 220, 350, 440])

yields a new array where the ith element is the sum of the ith elements of a1 and a2.

Again, the plus operator has a very different meaning for lists. Try applying the + operator to two lists to compare what happens.

In [27]: a1_list = [10, 20, 50, 40]

In [28]: a2_list = [100, 200, 300, 400]

In [29]: a1_list + a2_list
Out[29]: [10, 20, 50, 40, 100, 200, 300, 400]
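To make the contrast concrete, the following sketch shows what it takes to get element-wise behavior from plain lists and compares it with the NumPy version. It is only an illustration; the variable names are invented for this example.

```python
import numpy as np

values = [10, 20, 50, 40]
others = [100, 200, 300, 400]

# Element-wise doubling and addition with plain lists need explicit iteration.
doubled_list = [v * 2 for v in values]
summed_list = [v + o for v, o in zip(values, others)]

# The NumPy versions express the same computation without a loop.
doubled_arr = np.array(values) * 2
summed_arr = np.array(values) + np.array(others)

print(doubled_list, doubled_arr.tolist())
print(summed_list, summed_arr.tolist())
```

Both approaches produce the same numbers; the NumPy form is shorter and is the style the tasks in this tutorial ask for.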

Be careful about Boolean operations on NumPy arrays as they are also applied element-wise:

In [30]: a1 > 20
Out[30]: array([False, False,  True,  True])

In [31]: a2 = np.array([10,20,30,40])

In [32]: a1 == a2
Out[32]: array([ True,  True, False,  True])

In contrast, here is how the same Boolean operations behave with lists:

In [33]: a1_list
Out[33]: [10, 20, 50, 40]

In [34]: a1_list > 20
TypeError: '>' not supported between instances of 'list' and 'int'

In [35]: a2_list
Out[35]: [100, 200, 300, 400]

In [36]: a1_list == a2_list
Out[36]: False
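Because a comparison between arrays yields one Boolean per element, you usually need to reduce that result before using it as a single condition; using a multi-element Boolean array directly in an if statement raises a ValueError. The sketch below shows three common reductions. The variable names are chosen for this example only.

```python
import numpy as np

a1 = np.array([10, 20, 50, 40])
a2 = np.array([10, 20, 30, 40])

matches = a1 == a2  # element-wise comparison, one Boolean per position
print(matches)

# Reduce the Boolean array to a single True/False value.
all_equal = bool(matches.all())  # True only if every position matches
any_equal = bool(matches.any())  # True if at least one position matches
same = np.array_equal(a1, a2)    # True if shapes and all values match

print(all_equal, any_equal, same)
```

np.array_equal is the closest counterpart to comparing two lists with ==, since it returns a single Boolean.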

NumPy also provides useful methods for operating on arrays, such as sum() and mean():

In [37]: a1.sum()
Out[37]: 120

In [38]: a1.mean()
Out[38]: 30.0

which add up the values in the array and compute its mean respectively. These operations can also be written using notation that looks more like a function call:

In [39]: np.mean(a1)
Out[39]: 30.0

In [40]: np.sum(a1)
Out[40]: 120

You can also apply these functions to lists:

In [41]: np.mean(a1_list)
Out[41]: 30.0

In [42]: np.sum(a1_list)
Out[42]: 120

However, lists do not have sum and mean methods, so calling them on a list raises an error:

In [43]: a1_list.sum()
AttributeError: 'list' object has no attribute 'sum'

In [44]: a1_list.mean()
AttributeError: 'list' object has no attribute 'mean'

Task 1: Write a function:

def var(y):

that computes the variance of y, where y is a NumPy array. We will define variance to be:

\[\mathrm{Var}(\mathbf{y}) = \frac{1}{N}{\sum_{n=1}^N} (y_n - \bar y)^2,\]

where \(\mathbf{y} = (y_1, y_2, \dots, y_N)\), and \(\bar y\) denotes the mean of all of the \(y_n\). Your solution should not include an explicit loop.

The code in tt5.py will call var on an array called graffiti, which contains the graffiti column from the city data set that you will use in Programming Assignment #5, and on a garbage array, which contains the garbage column from the same data set. Here’s the output of our implementation:

GRAFFITI 409854.475818
GARBAGE 3159.33473311

Here is a simple test example you can use for manual testing:

In [44]: y_arr = np.arange(10)

In [45]: y_arr
Out[45]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [46]: var(y_arr)
Out[46]: 8.25
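If the loop-free style is unfamiliar, the sketch below applies it to a different statistic, the mean absolute deviation, so that it does not give away Task 1. The function name mean_abs_dev is made up for this example and is not part of tt5.py.

```python
import numpy as np

def mean_abs_dev(y):
    """Return the mean absolute deviation of a one-dimensional array y."""
    deviations = y - y.mean()         # subtract the mean from every element
    return np.abs(deviations).mean()  # average the absolute deviations

y_arr = np.arange(10)
mad = mean_abs_dev(y_arr)
print(mad)
```

The pattern is the same one Task 1 needs: an element-wise expression on the whole array, followed by a reduction such as mean or sum.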

Two-dimensional arrays

One-dimensional arrays are useful, but the real power of NumPy becomes more apparent when working with data that looks more like a matrix. For example, here’s a matrix represented using a list-of-lists:

m = [[0, 1, 4, 9],
     [16, 25, 36, 49],
     [64, 81, 100, 121],
     [144, 169, 196, 225],
     [256, 289, 324, 361],
     [400, 441, 484, 529]]

We can convert this data into a two-dimensional array as follows:

In [47]: b = np.array(m)

where the value of b will be:

In [48]: b
Out[48]: array([[  0,   1,   4,   9],
                [ 16,  25,  36,  49],
                [ 64,  81, 100, 121],
                [144, 169, 196, 225],
                [256, 289, 324, 361],
                [400, 441, 484, 529]])

Accessing elements of a two-dimensional (2D) NumPy array can be done using the same syntax as a list-of-lists, that is, the expression b[i][j] will yield the jth element of the ith row of b. More conveniently, you can use a tuple to access the elements of a NumPy array. That is, the expression b[i, j] will also yield the jth element of the ith row of b.
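As a quick check that the two indexing forms agree, the sketch below rebuilds the same matrix and reads one element both ways. It uses reshape, which this tutorial does not cover, purely as a compact way to construct the array; np.array(m) would work equally well.

```python
import numpy as np

# Same values as b above: the squares of 0 through 23 in a 6 x 4 grid.
b = (np.arange(24) ** 2).reshape(6, 4)

row, col = 2, 3
chained = b[row][col]  # index the row first, then the element within it
direct = b[row, col]   # index both axes at once
print(chained, direct)
```

The tuple form b[i, j] is generally preferred because it extends naturally to slices on either axis.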

NumPy arrays also support slicing and other more advanced forms of indexing. For example, the expression b[1:4] will yield:

In [49]: b[1:4]
Out[49]: array([[ 16,  25,  36,  49],
                [ 64,  81, 100, 121],
                [144, 169, 196, 225]])

that is, rows 1, 2, and 3 from b. The expression b[1:4, 2:4] will first select rows 1, 2, and 3, and then select columns 2 and 3 from those rows:

In [50]: b[1:4, 2:4]
Out[50]: array([[ 36,  49],
                [100, 121],
                [196, 225]])

As with list slicing, a colon (:) can be used to indicate that you wish to include all the indices in a particular dimension. For example, b[1:4, :] will yield rows 1, 2, and 3 with all of their columns, and b[:, 2:4] will yield a slice of b with columns 2 and 3 from all the rows. Recall that a slice excludes its endpoint.

In [51]: b[1:4, :]
Out[51]: array([[ 16,  25,  36,  49],
                [ 64,  81, 100, 121],
                [144, 169, 196, 225]])

In [52]: b[:, 2:4]
Out[52]: array([[  4,   9],
                [ 36,  49],
                [100, 121],
                [196, 225],
                [324, 361],
                [484, 529]])

In addition to slicing, you can also specify a list of indices as an index. For example, the expression b[:, [1, 3]] will yield columns 1 and 3 from b:

In [53]: b[:, [1, 3]]
Out[53]: array([[  1,   9],
                [ 25,  49],
                [ 81, 121],
                [169, 225],
                [289, 361],
                [441, 529]])

One thing to keep in mind with NumPy arrays is that you will lose a dimension if you specify a single column or row as an index. For example, notice that the results of the following two expressions are both one-dimensional arrays:

In [54]: b[1, :]
Out[54]: array([16, 25, 36, 49])

In [55]: b[:, 1]
Out[55]: array([  1,  25,  81, 169, 289, 441])

If you wish to retain the dimension, you can use list indexing:

In [56]: b[:, [1]]
Out[56]: array([[  1],
                [ 25],
                [ 81],
                [169],
                [289],
                [441]])

In [57]: b[[1], :]
Out[57]: array([[16, 25, 36, 49]])
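Checking the shape attribute makes the lost dimension easy to see. The sketch below compares three ways of selecting column 1; the length-one slice is an additional option not shown above, and the variable names are chosen for this example.

```python
import numpy as np

b = (np.arange(24) ** 2).reshape(6, 4)  # same values as b above

col_1d = b[:, 1]       # integer index: the column axis is dropped
col_list = b[:, [1]]   # list index: the column axis is kept
col_slice = b[:, 1:2]  # length-one slice: the column axis is also kept

print(col_1d.shape, col_list.shape, col_slice.shape)
```

This matters whenever a function expects a two-dimensional array, since a one-dimensional result may be rejected or interpreted differently.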

Task 2: In the task2 function in tt5.py, write expressions to extract the following subarrays of b:

  • rows 0, 1, and 2

  • rows 0, 1, and 5

  • columns 0, 1, and 2

  • columns 0, 1, and 3

  • columns 0, 1, and 2 from rows 2 and 3.

Task 3: We have imported the functions linear_regression and prepend_ones_column (which you will use in PA #5) in tt5.py. The linear_regression function expects a two-dimensional NumPy array for the value of X and a one-dimensional NumPy array for the value of y. To construct X, you need to call the prepend_ones_column function to prepend a column of all ones to columns 2 (RODENTS) and 3 (GARBAGE) of city_data (a NumPy array containing a dataset you’ll be using in Programming Assignment #5). To construct y, you need to select column 7 (CRIME_TOTALS).

Hint: you can do this task in a single line of code.

The result should be:

array([ 66.60834501,   0.58072845,  16.82863941])

Do not worry if you do not understand what this result means. The goal is to ensure you understand how to perform basic array operations. In PA #5 we will dig deeper into the linear regression involved in this computation.

Task 4: Write code to call linear_regression using a column of all ones and column 0 (GRAFFITI) of city_data as the value for X and column 7 (CRIME_TOTALS) as the value for y.

The result should be:

array([ 593.77754996,    0.76632378])
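To see the shapes involved without using city_data, here is a toy sketch. Everything in it is an assumption made for illustration: add_ones_column is a hypothetical stand-in for prepend_ones_column, and np.linalg.lstsq (ordinary least squares) is used in place of the course's linear_regression, whose implementation is not shown in this tutorial.

```python
import numpy as np

def add_ones_column(X):
    """Hypothetical stand-in for prepend_ones_column: put a column of ones first."""
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones, X])

# Toy data (not city_data): y = 1 + 2 * x exactly.
x = np.array([[0.0], [1.0], [2.0], [3.0]])  # two-dimensional, one column
y = np.array([1.0, 3.0, 5.0, 7.0])          # one-dimensional

X = add_ones_column(x)
# np.linalg.lstsq stands in for the course's linear_regression here.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(X.shape)
print(beta)
```

Note that x is kept two-dimensional by construction. When you select a single column from city_data, you will need the dimension-preserving indexing described earlier to get the same effect.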

Practice other useful operations

You can find the number of dimensions, shape, and the number of elements in a NumPy array using the ndim, shape and size properties respectively.

In [58]: b.ndim
Out[58]: 2

In [59]: b.shape
Out[59]: (6, 4)

In [60]: b.size
Out[60]: 24

You can perform many arithmetic operations on arrays, as well as common mathematical functions:

In [86]: aa = np.array([1, 3, 5])

In [87]: aa**2
Out[87]: array([ 1,  9, 25])

In [88]: 2**aa
Out[88]: array([ 2,  8, 32])

In [89]: np.sqrt(aa)
Out[89]: array([1.        , 1.73205081, 2.23606798])

In [90]: np.exp(aa)
Out[90]: array([  2.71828183,  20.08553692, 148.4131591 ])

In [91]: np.log2(aa)
Out[91]: array([0.        , 1.5849625 , 2.32192809])

As noted above, you can compute the mean of the elements using the mean method.

In [61]: b.mean()
Out[61]: 180.16666666666666

You can also compute the per-column mean and the per-row mean using the mean method by specifying an axis. Passing axis 0 averages down the rows and yields one mean per column; passing axis 1 averages across the columns and yields one mean per row:

In [62]: b.mean(0)
Out[62]: array([ 146.66666667,  167.66666667,  190.66666667,  215.66666667])

In [63]: b.mean(1)
Out[63]: array([   3.5,   31.5,   91.5,  183.5,  307.5,  463.5])

In [64]: np.mean(b, axis=0)
Out[64]: array([ 146.66666667,  167.66666667,  190.66666667,  215.66666667])

In [65]: np.mean(b, axis=1)
Out[65]: array([   3.5,   31.5,   91.5,  183.5,  307.5,  463.5])
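The axis argument is easier to follow on an array small enough to check by hand. The sketch below uses a made-up 2 x 3 array and shows that other reductions, such as sum, accept axis in the same way.

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])  # 2 rows, 3 columns

col_means = small.mean(axis=0)  # collapse the rows: one mean per column
row_means = small.mean(axis=1)  # collapse the columns: one mean per row
col_sums = small.sum(axis=0)    # sum accepts axis in the same way

print(col_means)
print(row_means)
print(col_sums)
```

A useful rule of thumb is that the axis you name is the one that disappears from the shape of the result: a (2, 3) array reduced along axis 0 has shape (3,).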

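Unlike list slicing, slicing a NumPy array does not create a copy: the slice is a view that shares its data with the original array. Plain assignment, as with lists, only gives the same array a second name. So, if you want an independent copy of an array (or of a slice of an array), you’ll need to use the np.copy() function: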

In [66]: a1 = np.array([1, 2, 3])
In [67]: a2 = a1
In [68]: a3 = np.copy(a1)

In [69]: a1[0] = 10

In [70]: a2
Out[70]: array([10,  2,  3])

In [71]: a3
Out[71]: array([1, 2, 3])

In [72]: id(a1), id(a2), id(a3)
Out[72]: (140232405246480, 140232405246480, 140232405249744)
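The session above demonstrates assignment versus np.copy. The sketch below covers the slicing case: it writes through a slice and through a copy of the same slice and then inspects the original. The variable names are chosen for this example; np.shares_memory is a NumPy function that reports whether two arrays overlap in memory.

```python
import numpy as np

original = np.array([1, 2, 3, 4, 5])

view = original[1:4]                # a slice shares data with original
independent = original[1:4].copy()  # an explicit copy does not

view[0] = 99         # this write is visible through original
independent[1] = -1  # this write is not

print(original)
print(np.shares_memory(original, view), np.shares_memory(original, independent))
```

If you modify a slice inside a function and do not want the caller's array to change, copy the slice first.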

The NumPy library is huge, and we can’t possibly teach you every single feature it provides. So, you will often find yourself having to look up new methods and examples in the NumPy Reference. To help you get acquainted with the NumPy documentation, we have included below a few examples of functions we haven’t covered in class. It will be up to you to look them up in the documentation to understand exactly what they’re doing.

What is the difference between np.min() and np.argmin()?

In [73]: a1 = np.array([3, 2, 6])
In [74]: np.min(a1)
Out[74]: 2

In [75]: np.argmin(a1)
Out[75]: 1

What is the difference between np.max() and np.argmax()?

In [76]: a1 = np.array([3, 2, 6])
In [77]: np.max(a1)
Out[77]: 6

In [78]: np.argmax(a1)
Out[78]: 2

What is the difference between np.where() and np.argwhere()?

In [79]: np.where(a1>2)
Out[79]: (array([0, 2]),)

In [80]: a1[np.where(a1>2)]
Out[80]: array([3, 6])

In [81]: a1[a1>2]
Out[81]: array([3, 6])

In [82]: np.argwhere(a1>2)
Out[82]: array([[0],
                [2]])

In [83]: a = np.array([10, 20, 30])
In [83]: b = np.array([10, 30, 30])

In [84]: np.argwhere(a == b)
Out[84]: array([[0],
                [2]])


In [85]: np.argwhere(a == b).flatten()
Out[85]: array([0, 2])
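As a starting point for reading the documentation, the sketch below puts the return types side by side. With a single argument, np.where returns a tuple containing one index array per dimension, while np.argwhere returns a two-dimensional array with one row per matching element. The three-argument form of np.where, which chooses between two values element by element, is an extra not shown above.

```python
import numpy as np

a1 = np.array([3, 2, 6])

idx_tuple = np.where(a1 > 2)       # tuple with one index array per dimension
idx_rows = np.argwhere(a1 > 2)     # one row of indices per matching element
clipped = np.where(a1 > 2, a1, 0)  # keep values above 2, replace the rest with 0

print(idx_tuple)
print(idx_rows.shape)
print(clipped)
```

For a one-dimensional array, idx_tuple[0] and idx_rows.flatten() hold the same indices; the two forms differ more visibly for arrays with two or more dimensions.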

When Finished

When you are finished with the tutorial, run the following commands from the Linux command line inside your tt5 directory:

git add tt5.py
git commit -m "Finished with tt5"
git push

Again, following these steps will help ensure your repository is in a clean state and that your work is saved to GitHub.