=====
NumPy
=====

Throughout this course, we have seen a number of examples of manipulating lists of numbers. NumPy is a Python library that makes such manipulations easy. NumPy supports numerical multi-dimensional arrays (standing in for lists of lists, lists of lists of lists, etc.) where each axis has a fixed dimension. This lab introduces NumPy and gives an overview of some of its key functionality.

You need to import NumPy before you can use it. We usually give the library a shorter name by using the import-as mechanism:

::

    import numpy as np

Once this import is done, you can use functions from the ``numpy`` library using ``np`` as the qualifier.

Getting started
---------------

To get started, open up a terminal and navigate (``cd``) to your |repo_name| directory. Run ``git pull upstream master`` to collect the lab materials and ``git pull`` to sync with your personal repository.

The ``lab6`` directory contains a file named ``lab6.py``. This file includes a function, ``read_file``, that takes the name of a CSV file as an argument and returns a list of the column names and a two-dimensional array of data. It also includes a call to the function that loads the training data from the city dataset for PA #5.

Fire up ``ipython3`` and run ``lab6.py`` to get started. This run will print some output, which you can ignore for now.

One-dimensional arrays in NumPy
-------------------------------

We'll start by looking at one-dimensional arrays in NumPy. Unlike Python lists, all of the values in a NumPy array must have the same type. We can create a one-dimensional NumPy array from a list using the function ``np.array``. For example,

::

    In [10]: a1 = np.array([10, 20, 30, 40])

    In [11]: a1
    Out[11]: array([10, 20, 30, 40])

NumPy arrays are distinct from lists and you should use NumPy's built-in functions and attributes to determine sizes.
For example, if we call ``a1.shape``, we get:

::

    In [12]: a1.shape
    Out[12]: (4,)

    In [13]: a1.shape[0]
    Out[13]: 4

Note that the result of ``a1.shape`` is a tuple that describes the size of the array in all dimensions (``a1`` is a one-dimensional array, so it is a tuple of size 1).

We can access/update the ith element of the array using ``[]`` notation:

::

    In [14]: a1[0]
    Out[14]: 10

    In [15]: a1[2]
    Out[15]: 30

    In [17]: a1[2] = 50

    In [18]: a1
    Out[18]: array([10, 20, 50, 40])

Operations on NumPy arrays are element-wise. For example, the expression:

::

    In [23]: a1 * 2
    Out[23]: array([ 20,  40, 100,  80])

yields a new NumPy array where the ith element of the result is equal to the ith element of ``a1`` times 2. Note that a given operator (e.g., ``*``) can have a different meaning depending on the data type to which it is applied. For example, try making ``a1`` a list, rather than a NumPy array, and repeat the same operation.

Similarly,

::

    In [25]: a1
    Out[25]: array([10, 20, 50, 40])

    In [26]: a2 = np.array([100, 200, 300, 400])

    In [27]: a1 + a2
    Out[27]: array([110, 220, 350, 440])

yields a new array where the ith element is the sum of the ith elements of ``a1`` and ``a2``. Again, the plus operator has a very different meaning for lists. Try applying the ``+`` operator to two lists to compare what happens.

Be careful about Boolean operations on NumPy arrays as they are also applied element-wise:

::

    In [28]: a1 > 20
    Out[28]: array([False, False,  True,  True])

    In [29]: a2 = np.array([10,20,30,40])

    In [30]: a1 == a2
    Out[30]: array([ True,  True, False,  True])

NumPy also provides useful methods for operating on arrays, such as ``sum`` and ``mean``:

::

    In [28]: a1.sum()
    Out[28]: 120

    In [29]: a1.mean()
    Out[29]: 30.0

which add up the values in the array and compute its mean respectively.
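As a small illustration of how element-wise comparisons combine with ``sum`` and ``mean``, the sketch below counts the elements that satisfy a condition and checks the mean against a manual computation. The variable names (``above``, ``n_above``, ``frac_above``, ``manual_mean``) are made up for this example and do not appear in ``lab6.py``.

```python
import numpy as np

# Illustrative array with the same values as a1 after the update above
a1 = np.array([10, 20, 50, 40])

# Element-wise comparison produces a Boolean array, not a single True/False
above = a1 > 20

# In a sum, True counts as 1 and False as 0, so this counts the matches
n_above = above.sum()

# The mean of a Boolean array is the fraction of elements that are True
frac_above = above.mean()

# The mean is the sum divided by the number of elements along the only axis
manual_mean = a1.sum() / a1.shape[0]

print(n_above, frac_above, manual_mean)
```

The same pattern works for any element-wise condition, because the comparison always yields an array with the same shape as ``a1``.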
The ``sum`` and ``mean`` operations can also be written using notation that looks more like a function call:

::

    In [32]: np.mean(a1)
    Out[32]: 30.0

    In [33]: np.sum(a1)
    Out[33]: 120

**Task 1:** Write a function:

::

    def var(y):

that computes the variance of ``y``, where ``y`` is a NumPy array. We will define variance to be:

.. math::

   \mathrm{Var}(\mathbf{y}) = \frac{1}{N} \sum_{n=1}^N (y_n - \bar y)^2,

where :math:`\mathbf{y} = (y_1, y_2, \dots, y_N)`, and :math:`\bar y` denotes the mean of all of the :math:`y_n`. Your solution should **not** include an explicit loop.

The code in ``lab6.py`` will call ``var`` on an array called ``graffiti``, which contains the graffiti column from the city data set that you will use in Programming Assignment #5, and on a ``garbage`` array, which contains the garbage column from the same data set.

Here's the output of our implementation:

::

    GRAFFITI 409854.475818
    GARBAGE 3159.33473311

Two-dimensional arrays
----------------------

One-dimensional arrays are useful, but the real power of NumPy becomes more apparent when working with data that looks more like a matrix. For example, here's a matrix represented using a list-of-lists:

::

    m = [[0, 1, 4, 9],
         [16, 25, 36, 49],
         [64, 81, 100, 121],
         [144, 169, 196, 225],
         [256, 289, 324, 361],
         [400, 441, 484, 529]]

We can convert this data into a two-dimensional array as follows:

::

    In [34]: b = np.array(m)

where the value of ``b`` will be:

::

    In [34]: b
    array([[  0,   1,   4,   9],
           [ 16,  25,  36,  49],
           [ 64,  81, 100, 121],
           [144, 169, 196, 225],
           [256, 289, 324, 361],
           [400, 441, 484, 529]])

Accessing elements of a two-dimensional (2D) NumPy array can be done using the same syntax as a list-of-lists, that is, the expression ``b[i][j]`` will yield the jth element of the ith row of ``b``. More conveniently, you can use a tuple to access the elements of a NumPy array. That is, the expression ``b[i, j]`` will also yield the jth element of the ith row of ``b``.
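The two indexing styles described above can be compared side by side. This is a minimal sketch: the list comprehension simply rebuilds the same squares as ``m``, and the names ``row``, ``col``, ``chained`` and ``direct`` are chosen only for illustration.

```python
import numpy as np

# Rebuild the 6x4 matrix of squares 0, 1, 4, ..., 529 used in this section
m = [[(4 * i + j) ** 2 for j in range(4)] for i in range(6)]
b = np.array(m)

row, col = 2, 3

# List-of-lists style: select the row first, then the element within it
chained = b[row][col]

# Tuple style: a single indexing step with both indices
direct = b[row, col]

print(chained, direct)  # both print 121

# Tuple indexing also works on the left-hand side of an assignment
b[row, col] = -1
print(b[row])
```

Both forms read the same element; the tuple form does so in one indexing step.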
NumPy arrays also support slicing and other more advanced forms of indexing. For example, the expression ``b[1:4]`` will yield:

::

    In [35]: b[1:4]
    array([[ 16,  25,  36,  49],
           [ 64,  81, 100, 121],
           [144, 169, 196, 225]])

that is, rows 1, 2, and 3 from ``b``. The expression ``b[1:4, 2:4]`` will yield columns 2 and 3 from rows 1, 2, and 3 of ``b``:

::

    In [36]: b[1:4, 2:4]
    array([[ 36,  49],
           [100, 121],
           [196, 225]])

As with list slicing, a colon (``:``) can be used to indicate that you wish to include all the indices in a particular dimension. For example, ``b[:, 2:4]`` will yield a slice of ``b`` with columns 2 and 3 from all the rows. Recall that a slice excludes its endpoint.

In addition to slicing, you can also specify a list of indices as an index. For example, the expression ``b[:, [1, 3]]`` will yield columns 1 and 3 from ``b``:

::

    In [37]: b[:, [1, 3]]
    array([[  1,   9],
           [ 25,  49],
           [ 81, 121],
           [169, 225],
           [289, 361],
           [441, 529]])

One thing to keep in mind with NumPy arrays is that you will lose a dimension if you specify a single column or row as an index. For example, notice that the results of the following two expressions are both one-dimensional arrays:

::

    In [38]: b[1, :]
    Out[38]: array([16, 25, 36, 49])

    In [39]: b[:, 1]
    Out[39]: array([  1,  25,  81, 169, 289, 441])

If you wish to retain the dimension, you can use list indexing:

::

    In [40]: b[:, [1]]
    Out[40]:
    array([[  1],
           [ 25],
           [ 81],
           [169],
           [289],
           [441]])

    In [41]: b[[1], :]
    Out[41]: array([[16, 25, 36, 49]])

**Task 2:** In the ``task2`` function in ``lab6.py``, write expressions to extract the following subarrays of ``b``:

* rows 0, 1, and 2
* rows 0, 1, and 5
* columns 0, 1, and 2
* columns 0, 1, and 3
* columns 0, 1, and 2 from rows 2 and 3

**Task 3:** We have imported the functions ``linear_regression`` and ``prepend_ones_column`` from PA #5 in ``lab6.py``.
Write code to call ``linear_regression`` using a column of all ones and columns 2 (``RODENTS``) and 3 (``GARBAGE``) of ``city_data`` as the value for ``X`` and column 7 (``CRIME_TOTALS``) as the value for ``y``. This function expects a two-dimensional NumPy array for the value of ``X`` and a one-dimensional NumPy array for the value of ``y``. Hint: you can do this task in a single line of code.

The result should be:

::

    array([ 66.60834501 0.58072845 16.82863941])

**Task 4:** Write code to call ``linear_regression`` using a column of all ones and column 0 (``GRAFFITI``) of ``city_data`` as the value for ``X`` and column 7 (``CRIME_TOTALS``) as the value for ``y``.

The result should be:

::

    array([ 593.77754996 0.76632378])

Other useful operations
-----------------------

You can find the number of dimensions, the shape, and the number of elements in a NumPy array using the ``ndim``, ``shape`` and ``size`` properties respectively.

::

    In [42]: b.ndim
    Out[42]: 2

    In [43]: b.shape
    Out[43]: (6, 4)

    In [44]: b.size
    Out[44]: 24

As noted above, you can compute the mean of the elements using the ``mean`` method.

::

    In [54]: b.mean()
    Out[54]: 180.16666666666666

You can also compute the per-column mean and the per-row mean using the ``mean`` method by specifying an *axis*. Axis 0 averages down the rows, giving one mean per column, and axis 1 averages across the columns, giving one mean per row:

::

    In [55]: b.mean(0)
    Out[55]: array([ 146.66666667,  167.66666667,  190.66666667,  215.66666667])

    In [56]: b.mean(1)
    Out[56]: array([   3.5,   31.5,   91.5,  183.5,  307.5,  463.5])

    In [57]: np.mean(b, axis=0)
    Out[57]: array([ 146.66666667,  167.66666667,  190.66666667,  215.66666667])

    In [58]: np.mean(b, axis=1)
    Out[58]: array([   3.5,   31.5,   91.5,  183.5,  307.5,  463.5])

When Finished
-------------

When finished with the lab, please check in your work (assuming you are inside the lab directory):

.. code::

    git add lab6.py
    git commit -m "Finished with lab6"
    git push

No, we're not grading this; we just want to look for common errors.