Introduction

In [1]:
import numpy as np

NumPy is a Python library that supplies two major features: better support for large multi-dimensional arrays and high-level mathematical functions (most of which operate on NumPy arrays). We’re going to focus mostly on how to use arrays in NumPy.

How are arrays different from what we’ve seen so far?

Native array type

Python doesn’t include an array type. Instead, we have to use lists, lists-of-lists, lists-of-lists-of-lists, etc. Although these can be used to represent multi-dimensional arrays, they can get pretty messy. For example, here's a matrix as a list-of-lists:

In [2]:
m1 = [[  0,   1,   4], [  9,  16,  25], [ 36,  49,  64], [ 81, 100, 121]]

Let's say I want to create a new matrix, which is just m1 with all its numbers multiplied by 2. Maybe this will work?

In [3]:
print(m1*2)
[[0, 1, 4], [9, 16, 25], [36, 49, 64], [81, 100, 121], [0, 1, 4], [9, 16, 25], [36, 49, 64], [81, 100, 121]]

That just concatenated the list with itself! If I want to actually multiply each element by 2 (in place) I have to do this:

In [4]:
m2 = []
for i in range(len(m1)):
    l = []
    for j in range(len(m1[0])):
        l.append(m1[i][j]*2)
    m2.append(l)

print(m2)
[[0, 2, 8], [18, 32, 50], [72, 98, 128], [162, 200, 242]]

Or, if I want to be a bit fancier:

In [5]:
m2 = [ [x*2 for x in row] for row in m1]

print(m2)
[[0, 2, 8], [18, 32, 50], [72, 98, 128], [162, 200, 242]]

However, with NumPy arrays, this is much simpler:

In [6]:
ma1 = np.array(m1)
print(ma1*2)
[[  0   2   8]
 [ 18  32  50]
 [ 72  98 128]
 [162 200 242]]

Other operations are also much simpler:

In [7]:
print(ma1-1)
[[ -1   0   3]
 [  8  15  24]
 [ 35  48  63]
 [ 80  99 120]]
In [8]:
print(ma1**2)
[[    0     1    16]
 [   81   256   625]
 [ 1296  2401  4096]
 [ 6561 10000 14641]]
In [9]:
print(np.sin(ma1))
[[ 0.          0.84147098 -0.7568025 ]
 [ 0.41211849 -0.28790332 -0.13235175]
 [-0.99177885 -0.95375265  0.92002604]
 [-0.62988799 -0.50636564  0.99881522]]

Basically, NumPy allows us to work with arrays more "naturally"

However, NumPy arrays do have some restrictions that lists don’t: all elements must be of the same type and, once created, an array cannot grow in size.

Better Performance

NumPy’s implementation of arrays (and the operations on them) are very efficient (almost as good as writing them in C).

For example, let's create an array with 10,000,000 integers and a list with 10,000,000 integers.

In [10]:
X = np.arange(10000000)
L = range(10000000)

If we add up all the integers in each, the array is an order of magnitude faster:

In [11]:
timeit X.sum()
100 loops, best of 3: 5.6 ms per loop
In [12]:
timeit sum(L)
10 loops, best of 3: 84.6 ms per loop

Creating and accessing arrays

As we saw earlier, we can create an array simply calling array() with a list or list-of-lists as a parameter. We can use as many levels of list nesting as dimensions in the array.

In [13]:
# One-dimensional array
a1 = np.array([1,2,3])
print(a1)
[1 2 3]
In [16]:
# Two-dimensional array
a2 = np.array([ [1,2,3] , [4,5,6] ])
print(a2)
[[1 2 3]
 [4 5 6]]
In [17]:
# Three-dimensional array
a3 = np.array([ [ [1,2,3] , [4,5,6] ] , [ [10,20,30] , [40,50,60] ] ])
print(a3)
[[[ 1  2  3]
  [ 4  5  6]]

 [[10 20 30]
  [40 50 60]]]

Accessing individual elements is the done the same way as with lists

In [18]:
print(a1[2])
3
In [19]:
print(a2[1][2])
6
In [20]:
print(a3[1][1][2])
60

We can also create ranges of numbers. Numpy provides arange() and linspace(), which are more versatile than Python’s range().

arange creates ranges of numbers, including floats

In [21]:
print(np.arange(5))
[0 1 2 3 4]
In [22]:
print(np.arange(1, 3, 0.4))
[ 1.   1.4  1.8  2.2  2.6]

linspace is similar, but we specify the amount of numbers in the range, instead of the interval

In [23]:
print(np.linspace(0,100,7) )
[   0.           16.66666667   33.33333333   50.           66.66666667
   83.33333333  100.        ]

We can also take a range of numbers and create a multidimensional array of a given shape

In [24]:
print(np.arange(12).reshape(3,4))
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

Basic indexing/slicing

Indexing arrays is similar to indexing lists. We just use the brackets operator. However, indexing multidimensional arrays is a bit simpler (no need for multiple brackets, just use a tuple). Additionally, we can specify slices in each dimension.

Let's take this array with 12 numbers

In [25]:
a = np.array([  0,   1,   4,   9,  16,  25,  36,  49,  64,  81, 100, 121])

Indexing and slicing is the same as with lists

In [26]:
print(a[4])
16
In [27]:
print(a[1:7])
[ 1  4  9 16 25 36]
In [28]:
print(a[3:10:2])
[ 9 25 49 81]

Indexing/slicing gets interesting when we have multi-dimensional arrays

In [29]:
b = np.array([[  0,   1,   4,   9],
              [ 16,  25,  36,  49],
              [ 64,  81, 100, 121],
              [144, 169, 196, 225],
              [256, 289, 324, 361],
              [400, 441, 484, 529]])

Instead of multiple brackets [x][y][z] we can just use a tuple

In [30]:
print(b[2,3])
121
In [31]:
print(b[1,1])
25

We can use slicing with rows...

In [32]:
print(b[1:4])
[[ 16  25  36  49]
 [ 64  81 100 121]
 [144 169 196 225]]

Or with as many dimensions as we want:

In [33]:
print(b[1:4,0:2])
[[ 16  25]
 [ 64  81]
 [144 169]]
In [34]:
print(b[1:4,1])
[ 25  81 169]

Note how this is actually the "second column of rows 1,2,3"

Operators

As we saw in the first example, we can apply operators like + to arrays. When the operands are an array and a number, the operator (with the number) is applied to every element of the array. Note that this also works with relational operators like >, <, ==, etc.

We can use arithmetic operators...

In [35]:
print(a1 + 2)
[3 4 5]
In [36]:
print(a1 * 2)
[2 4 6]

...and relational operators.

In [37]:
print(b>100)
[[False False False False]
 [False False False False]
 [False False False  True]
 [ True  True  True  True]
 [ True  True  True  True]
 [ True  True  True  True]]

We also use these operators on two arrays:

In [38]:
c = np.array([10,20,30])
In [39]:
print(a1 + c)
[11 22 33]

What do you think is the result of using the * operator?

In [40]:
d = np.array([[1, 2], [3, 4]])
e = np.array([[1, 0], [0, 1]])
In [41]:
print(d)
[[1 2]
 [3 4]]
In [42]:
print(e)
[[1 0]
 [0 1]]
In [43]:
print(d*e)
[[1 0]
 [0 4]]

It's element-wise multiplication, not matrix multiplication!

If we want to do a matrix product, you need to use the dot() method

In [44]:
print(np.dot(d,e))
[[1 2]
 [3 4]]

We can also apply NumPy math functions (again: elementwise) on the array

In [45]:
f = np.array([[1,-1],[np.pi,-np.pi]])
print(f)
[[ 1.         -1.        ]
 [ 3.14159265 -3.14159265]]
In [46]:
print(np.cos(f))
[[ 0.54030231  0.54030231]
 [-1.         -1.        ]]

Fancy Indexing

NumPy provides fancier ways of indexing that are more powerful than those for regular Python lists.

We can specify the indexes we want with a list

In [47]:
print( a[ [1,3,6] ] )
[ 1  9 36]

Do the same but, if we provide a multidimensional array, the values at those indices are returned in an array of that same size.

In [48]:
print( a[ np.array([ [1,3] , [10, 7] ]) ] )
[[  1   9]
 [100  49]]

Specify an array of booleans. This returns a flattened array (1-dimensional) with all the values in the array where the specified boolean array had a True value.

In [49]:
c = np.array([100,200,300])
print( c[np.array([True, False, True])] )
[100 300]

This is mostly useful in combination with relational operators

In [50]:
print( b[b>100] )
[121 144 169 196 225 256 289 324 361 400 441 484 529]

When we use this with assignments, it acts as a filter for what indices the assignment should apply to.

Set all elements greater than 100 to 0

In [51]:
b[b>100] = 0
print(b)
[[  0   1   4   9]
 [ 16  25  36  49]
 [ 64  81 100   0]
 [  0   0   0   0]
 [  0   0   0   0]
 [  0   0   0   0]]
In [52]:
b[b%2==1] = -1
print(b)
[[  0  -1   4  -1]
 [ 16  -1  36  -1]
 [ 64  -1 100   0]
 [  0   0   0   0]
 [  0   0   0   0]
 [  0   0   0   0]]

Reduction Methods

Reduction methods allow us to “reduce” an array to a single value. For example, the mean of all the values, the stdev of all the values, etc. We can also do this only for specific dimensions (e.g., the mean of each row, of each column, etc.)

In [53]:
# Let's reset b...
b = (np.arange(24)**2).reshape(6,4)
print(b)
[[  0   1   4   9]
 [ 16  25  36  49]
 [ 64  81 100 121]
 [144 169 196 225]
 [256 289 324 361]
 [400 441 484 529]]

Mean of all elements of b

In [54]:
b.mean()
Out[54]:
180.16666666666666

Mean of the 0-th dimension (or "axis 0" in NumPy lingo) i.e. for each column, add up all the values and divide by the number of rows

In [55]:
print(b.mean(0))
[ 146.66666667  167.66666667  190.66666667  215.66666667]

Mean of the 1-th dimension (or "axis 1" in NumPy lingo) i.e. for each row, add up all the values and divide by the number of columns

In [56]:
print(b.mean(1))
[   3.5   31.5   91.5  183.5  307.5  463.5]

Standard deviation

In [57]:
b.std()
Out[57]:
164.84883648023995

Bringing it all together

If we put all of the above together, we can do some pretty elaborate computations with arrays/matrices in just a few lines

In [59]:
# Normalize to mean 0, stdev 1
b2 = b - b.mean()
z = b2 / b2.std()
In [60]:
print(z)
[[-1.09292046 -1.0868543  -1.06865581 -1.03832499]
 [-0.99586185 -0.94126637 -0.87453858 -0.79567845]
 [-0.704686   -0.60156122 -0.48630411 -0.35891468]
 [-0.21939291 -0.06773883  0.09604759  0.27196633]
 [ 0.4600174   0.6602008   0.87251652  1.09696457]
 [ 1.33354495  1.58225765  1.84310269  2.11608005]]

All values one standard deviations above the mean:

In [61]:
print( b[b>(b.mean() + b.std())] )
[361 400 441 484 529]

Or:

In [62]:
print( b[b2/b2.std() >= 1] )
[361 400 441 484 529]

Filter out all outliers (more than one stdev away from the mean)

In [63]:
print( b[abs(b2/b2.std()) < 1] )
[ 16  25  36  49  64  81 100 121 144 169 196 225 256 289 324]

Linear Algebra

NumPy also includes a linear algebra library.

In [64]:
import numpy.linalg as la
In [65]:
X = np.array([ [1,3] , [2, 8] ])
print(X)
[[1 3]
 [2 8]]

Computing the determinant of a matrix:

In [66]:
# |X| = (1*8) - (3*2) = 2
la.det(X)
Out[66]:
2.0

Computing the inverse of a matrix:

In [68]:
iX = la.inv(X)
print(iX)
[[ 4.  -1.5]
 [-1.   0.5]]
In [70]:
print(np.dot(X,iX))
[[ 1.  0.]
 [ 0.  1.]]

Solving a system of equations:

\begin{align} 1\cdot x_0 + 3\cdot x_1 & = 11 \\ 2\cdot x_0 + 8\cdot x_1 & = 28 \end{align}\begin{equation*} \mathbf{X} = \left( \begin{matrix} 1 & 3 \\ 2 & 8\end{matrix} \right) \;\;\;\; \mathbf{Y} = \left( \begin{matrix} 11 \\ 28\end{matrix} \right) \end{equation*}

The solution should be $x_0=2, x_1=3$

In [71]:
Y = np.array([ [11], [28] ])
print(Y)
[[11]
 [28]]

We can solve the system with matrix operations:

In [72]:
print(np.dot(iX,Y))
[[ 2.]
 [ 3.]]

Or using linalg.solve:

In [73]:
print(la.solve(X,Y))
[[ 2.]
 [ 3.]]

In this case, they both perform pretty much the same, but linalg.solve is preferable since it can detect when a system of linear equations can be solved more efficiently that by doing the above matrix operation.

What about matrices?

Numpy has a matrix type, which has the advantage that some of the operators "make more sense" with matrices. e.g., * will do matrix multiplication, not element-wise multiplication

In [74]:
d = np.matrix([[1, 2], [3, 4]])
e = np.matrix([[1, 0], [0, 1]])
In [75]:
print(d*e)
[[1 2]
 [3 4]]

However, most NumPy developers recommend using the more general array type (anything you can do with a matrix, you can do with an array; e.g., matrix multiplication is just the dot() method)