Numpy¶

Numpy is a Python library that supports multi-dimensional arrays.

You need to import Numpy before you can use it. It is traditional to give the library a shorter name using the import-as mechanism:

import numpy as np

Once this import is done, you can use functions from the numpy library with np as the qualifier.
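As a minimal sketch of what the qualifier looks like in practice, the snippet below calls two standard numpy functions through np. The variable names zeros and root are only illustrative and do not appear in lab6.py.

```python
import numpy as np

# Every numpy function is reached through the np qualifier.
zeros = np.zeros(3)    # a one-dimensional array holding three 0.0 values
root = np.sqrt(16)     # square root, computed by numpy

print(zeros)
print(root)
```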

Getting started¶

To get started, open up a terminal and navigate (cd) to your cmsc12100-aut-17-username directory. Run git pull upstream master to collect the lab materials and git pull to sync with your personal repository. The lab6 directory contains a file named lab6.py.

This file includes a function, read_file, that takes the name of a CSV file as an argument and returns a list of the column names and a two-dimensional array of data. It also includes a call to read_file that loads the training data from the city dataset for PA #5.

Fire up ipython3 and run lab6.py to get started. The run prints some output, which you can ignore for now.

One-dimensional arrays in numpy¶

We’ll start by looking at one-dimensional arrays in Numpy. Unlike Python lists, all of the values in a Numpy array must have the same type. We can create a one-dimensional numpy array from a list using the function np.array. For example,

In [10]: a1 = np.array([10, 20, 30, 40])

In [11]: a1
Out[11]: array([10, 20, 30, 40])
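To see the same-type rule in action, here is a small sketch that inspects the dtype attribute of two arrays. The names ints and mixed are made up for illustration. When a list mixes integers and floats, numpy converts every value to a float so that the array has a single type.

```python
import numpy as np

ints = np.array([10, 20, 30, 40])
mixed = np.array([1, 2.5, 3])   # the integers 1 and 3 are converted to floats

print(ints.dtype)    # an integer type (the exact width depends on the platform)
print(mixed.dtype)   # float64
print(mixed)
```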

We can compute the length and shape of this array as follows:

In [12]: len(a1)
Out[12]: 4

In [13]: a1.shape
Out[13]: (4,)

And we can access/update the ith element of the array using the [] notation:

In [14]: a1[0]
Out[14]: 10

In [15]: a1[2]
Out[15]: 30

In [17]: a1[2] = 50

In [18]: a1
Out[18]: array([10, 20, 50, 40])

Operations on numpy arrays are element-wise. For example, the expression:

In [23]: a1*2
Out[23]: array([ 20,  40, 100,  80])

yields a new numpy array where the ith element of the result is equal to the ith element of a1 times 2. Note that a given operator (e.g., *) can have a different meaning depending on the data type to which it is applied. For example, try making a1 a list, rather than a Numpy array, and repeat the same operation.

Similarly,

In [25]: a1
Out[25]: array([10, 20, 50, 40])

In [26]: a2 = np.array([100, 200, 300, 400])

In [27]: a1+a2
Out[27]: array([110, 220, 350, 440])

yields a new array where the ith element is the sum of the ith elements of a1 and a2. Again, the plus operator has a very different meaning for lists. Try applying the + operator to two lists to compare what happens.
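If you want to check what you observed, the following sketch places the list and array versions of both operators side by side. The names values and arr are illustrative only.

```python
import numpy as np

values = [10, 20, 50, 40]
arr = np.array(values)

# On a list, * repeats the list and + concatenates.
print(values * 2)        # [10, 20, 50, 40, 10, 20, 50, 40]
print(values + values)   # an 8-element list

# On a numpy array, both operators work element by element.
print(arr * 2)           # each value doubled
print(arr + arr)         # each value added to itself
```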

Numpy also provides useful methods for operating on arrays, such as sum and mean:

In [28]: a1.sum()
Out[28]: 120

In [29]: a1.mean()
Out[29]: 30.0

which add up the values in the array and compute its mean, respectively. These operations can also be written using notation that looks more like a function call:

In [32]: np.mean(a1)
Out[32]: 30.0

In [33]: np.sum(a1)
Out[33]: 120

Task 1: Write a function:

def var(y):

that computes the variance of y, where y is a numpy array. We will define variance to be:

\[\mathrm{var}(y) = \frac{1}{N}{\sum_{n=1}^N} (y_n - \bar y)^2,\]

where \(\bar y\) denotes the mean of all y’s. Your solution should not include an explicit loop.

Then run it on graffiti, which contains the graffiti column from the city data set, and on garbage, which contains the garbage column from the city data set. Here’s the output of our implementation:

GRAFFITI 409854.475818
GARBAGE 3159.33473311
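One way to sanity-check your var function is to compare it against numpy's built-in np.var, which by default divides by N, as in the definition above. The sketch below uses a small made-up array rather than the lab data, so the arithmetic can be followed by hand.

```python
import numpy as np

y = np.array([10, 20, 50, 40])

# The mean is 30, so the squared deviations are 400, 100, 400, and 100.
# Their average is 1000 / 4 = 250.
print(np.var(y))    # 250.0
```

Your own var(y) should return the same value for this array.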

Two-dimensional arrays¶

One-dimensional arrays are useful, but the real power of numpy becomes more apparent when working with data that looks more like a matrix. For example, here’s a matrix represented using a list-of-lists:

m = [[0, 1, 4, 9],
     [16, 25, 36, 49],
     [64, 81, 100, 121],
     [144, 169, 196, 225],
     [256, 289, 324, 361],
     [400, 441, 484, 529]]

We can convert this data into a two-dimensional array as follows:

In [34]: b = np.array(m)

where the value of b will be:

In [34]: b
array([[  0,   1,   4,   9],
       [ 16,  25,  36,  49],
       [ 64,  81, 100, 121],
       [144, 169, 196, 225],
       [256, 289, 324, 361],
       [400, 441, 484, 529]])

Accessing elements of a 2D numpy array can be done using the same syntax as a 2D list, that is, the expression b[i][j] will yield the jth element of the ith row of b. More conveniently, you can use a tuple to access the elements of a numpy array. That is, the expression b[i, j] will also yield the jth element of the ith row of b.
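The two indexing styles can be compared with a short sketch. It uses only the first three rows of m, and the name b_small is illustrative.

```python
import numpy as np

m = [[0, 1, 4, 9],
     [16, 25, 36, 49],
     [64, 81, 100, 121]]
b_small = np.array(m)

print(m[1][2])          # 36, list-of-lists indexing
print(b_small[1][2])    # 36, the same syntax works on an array
print(b_small[1, 2])    # 36, tuple indexing
# m[1, 2] would raise a TypeError, because lists do not accept tuple indices.
```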

Numpy arrays also support slicing and other more advanced forms of indexing. For example, the expression b[1:4] will yield:

In [35]: b[1:4]
array([[ 16,  25,  36,  49],
       [ 64,  81, 100, 121],
       [144, 169, 196, 225]])

rows 1, 2, and 3 from b. The expression b[1:4, 2:4] will yield columns 2 and 3 from rows 1, 2, and 3 of b:

In [36]: b[1:4, 2:4]
array([[ 36,  49],
       [100, 121],
       [196, 225]])

As with list slicing, a colon (:) can be used to indicate that you wish to include all the indices in a particular dimension. For example, b[:,2:4] will yield a slice of b with columns 2 and 3 from all the rows. Recall that a slice excludes its endpoint.
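The all-rows slice can be tried out with the sketch below. It rebuilds b from the squares of 0 through 23 so that the snippet runs on its own; in the lab, b is already defined for you. The name cols is illustrative.

```python
import numpy as np

# Same values as b in the text: the squares of 0..23 arranged in 6 rows of 4.
b = (np.arange(24) ** 2).reshape(6, 4)

cols = b[:, 2:4]     # all rows, columns 2 and 3 (column 4 is excluded)
print(cols.shape)    # (6, 2)
print(cols)
```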

In addition to slicing, you can also specify a list of indices as an index. For example, the expression b[:, [1,3]] will yield columns 1 and 3 from b:

In [37]: b[:, [1,3]]
array([[  1,   9],
       [ 25,  49],
       [ 81, 121],
       [169, 225],
       [289, 361],
       [441, 529]])

One thing to keep in mind with Numpy arrays is that you will lose a dimension if you specify a single column or row as an index. For example, notice that the results of the following two expressions are both one-dimensional arrays:

In [38]: b[1,:]
Out[38]: array([16, 25, 36, 49])

In [39]: b[:,1]
Out[39]: array([  1,  25,  81, 169, 289, 441])

If you wish to retain the dimension, you can use list indexing:

In [40]: b[:,[1]]
Out[40]:
array([[  1],
       [ 25],
       [ 81],
       [169],
       [289],
       [441]])

In [41]: b[[1], :]
Out[41]: array([[16, 25, 36, 49]])
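Printing the shape of each result makes the lost or retained dimension easy to see. This sketch rebuilds b so that it runs on its own; in the lab, b is already defined.

```python
import numpy as np

# Same values as b in the text: the squares of 0..23 arranged in 6 rows of 4.
b = (np.arange(24) ** 2).reshape(6, 4)

# A single integer index drops that dimension.
print(b[1, :].shape)      # (4,)
print(b[:, 1].shape)      # (6,)

# A one-element list keeps the dimension, with length 1.
print(b[:, [1]].shape)    # (6, 1)
print(b[[1], :].shape)    # (1, 4)
```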

Task 2: Write expressions to extract the following subarrays of b, which is defined for you in lab6.py:

rows 0, 1, and 2.
rows 0, 1, and 5
columns 0, 1, and 2
columns 0, 1, and 3
columns 0, 1, and 2 from rows 2 and 3.

Task 3: We have imported the linear_regression function from PA #5 in lab6.py. Write code to call linear_regression using columns 2 (RODENTS) and 3 (GARBAGE) as the value for X and column 7 (CRIME_TOTALS) as the value for Y. This function expects a two-dimensional numpy array for the value of X and a one-dimensional numpy array for the value of Y.

Hint: you can do this task in a single line of code.

The result should be:

array([ 66.60834501   0.58072845  16.82863941])

Task 4: Write code to call linear_regression using column 0 (GRAFFITI) as the value for X and column 7 (CRIME_TOTALS) as the value for Y.

The result should be:

array([ 593.77754996    0.76632378])

Other useful operations¶

You can find the number of dimensions, shape, and the number of elements in a numpy array using the ndim, shape and size properties respectively.

In [42]: b.ndim
Out[42]: 2

In [43]: b.shape
Out[43]: (6, 4)

In [44]: b.size
Out[44]: 24

As noted above, you can compute the mean of the elements using the mean method.

In [54]: b.mean()
Out[54]: 180.16666666666666

You can also compute the per-column mean and the per-row mean using the mean method by specifying an axis. Axis 0 averages over the rows and yields one mean per column, while axis 1 averages over the columns and yields one mean per row:

In [55]: b.mean(0)
Out[55]: array([ 146.66666667,  167.66666667,  190.66666667,  215.66666667])

In [56]: b.mean(1)
Out[56]: array([   3.5,   31.5,   91.5,  183.5,  307.5,  463.5])
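The axis argument works the same way for other reductions, such as sum. The sketch below rebuilds b so that it runs on its own, and checks the shapes of the results: reducing along an axis removes that axis from the shape.

```python
import numpy as np

# Same values as b in the text: the squares of 0..23 arranged in 6 rows of 4.
b = (np.arange(24) ** 2).reshape(6, 4)

print(b.mean(axis=0).shape)    # (4,), one mean per column
print(b.mean(axis=1).shape)    # (6,), one mean per row
print(b.sum(axis=0))           # one total per column
```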

When Finished¶

When finished with the lab please check in your work (assuming you are inside the lab directory):

git add lab6.py
git commit -m "Finished with lab6"
git push

No, we’re not grading this; we just want to look for common errors.