=====
NumPy
=====

Throughout this course, we have seen a number of examples of manipulating
lists of numbers. NumPy is a Python library that makes such manipulations
easy. NumPy supports numerical multi-dimensional arrays (standing in for
lists of lists, lists of lists of lists, etc.) where each axis has a fixed
length. This lab introduces NumPy and gives an overview of some of its key
functionality.

You need to import NumPy before you can use it.  We usually give the library a
shorter name by using the import-as mechanism:

::

    import numpy as np

Once this import is done, you can use functions from the ``numpy``
library using ``np`` as the qualifier.

Getting started
---------------

To get started, open up a terminal and navigate (``cd``) to your
|repo_name| directory. Run ``git pull upstream
master`` to collect the lab materials and ``git pull`` to sync with
your personal repository.  The ``lab6`` directory contains a file
named ``lab6.py``.

This file includes a function, ``read_file``, that takes the name of a
CSV file as an argument and returns a list of the column names and a
two dimensional array of data, and a call to the function that loads
the training data from the city dataset for PA #5.

Fire up ``ipython3`` and run ``lab6.py`` to get started. This run will
print out some output which you can ignore for now.


One-dimensional arrays in NumPy
-------------------------------

We'll start by looking at one-dimensional arrays in NumPy.  Unlike
Python lists, all of the values in a NumPy array must have the same
type.  We can create a one-dimensional NumPy array from a list using
the function ``np.array``.  For example,

::

    In [10]: a1 = np.array([10, 20, 30, 40])

    In [11]: a1
    Out[11]: array([10, 20, 30, 40])
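
To see the single-type rule in action, here is a small sketch (the
names ``ints`` and ``mixed`` are ours, chosen only for illustration).
When a list mixes integers and floats, ``np.array`` converts every
value to one common type, and the ``dtype`` attribute reports which
type was chosen:

.. code:: python

    import numpy as np

    ints = np.array([10, 20, 30, 40])
    mixed = np.array([1, 2.5, 3])

    # Every element of an array shares one type, reported by dtype.
    print(ints.dtype.kind)   # 'i' means a signed integer type
    print(mixed.dtype)       # float64: the integers were converted to floats
    print(mixed)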

NumPy arrays are distinct from lists and you should use NumPy's built-in
functions and attributes to determine sizes. For example, if we call
``a1.shape``, we get:

::

    In [12]: a1.shape
    Out[12]: (4,)
    
    In [13]: a1.shape[0]
    Out[13]: 4

Note that the result of ``a1.shape`` is a tuple that describes the
size of the array in all dimensions (``a1`` is a one-dimensional array, so
it is a tuple of length 1).

We can access or update the ith element of the array using ``[]``
notation:

::

    In [14]: a1[0]
    Out[14]: 10

    In [15]: a1[2]
    Out[15]: 30

    In [17]: a1[2] = 50

    In [18]: a1
    Out[18]: array([10, 20, 50, 40])


Operations on NumPy arrays are element-wise.  For example, the
expression:

::

    In [23]: a1 * 2
    Out[23]: array([ 20,  40, 100,  80])

yields a new NumPy array where the ith element of the result is equal
to the ith element of ``a1`` times 2. Note that a given operator
(e.g., ``*``) can have a different meaning depending on the data type to which
it is applied. For example, try making ``a1`` a list, rather than a NumPy array,
and repeat the same operation.

Similarly,

::

    In [25]: a1
    Out[25]: array([10, 20, 50, 40])

    In [26]: a2 = np.array([100, 200, 300, 400])

    In [27]: a1 + a2
    Out[27]: array([110, 220, 350, 440])

yields a new array where the ith element is the sum of the ith
elements of ``a1`` and ``a2``.  Again, the plus operator has a very
different meaning for lists. Try applying the ``+`` operator to two
lists to compare what happens. Be careful about Boolean operations on
NumPy arrays as they are also applied element-wise:

::
                                                              
    In [28]: a1 > 20
    Out[28]: array([False, False,  True,  True])

    In [29]: a2 = np.array([10, 20, 30, 40])

    In [30]: a1 == a2
    Out[30]: array([ True,  True, False,  True])
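
One consequence is worth seeing concretely: because a comparison
produces an array of Booleans rather than a single ``True`` or
``False``, using it directly as an ``if`` condition raises a
``ValueError``. The sketch below (with illustrative names ``x``, ``y``,
and ``matches``) shows the error and some common ways to reduce the
result to a single Boolean using ``all``, ``any``, or
``np.array_equal``:

.. code:: python

    import numpy as np

    x = np.array([10, 20, 50, 40])
    y = np.array([10, 20, 30, 40])

    matches = (x == y)           # element-wise comparison, one Boolean per pair

    try:
        if matches:              # ambiguous: which element should decide?
            print("equal")
    except ValueError as err:
        print("ValueError:", err)

    print(matches.all())         # False: not every pair matches
    print(matches.any())         # True: at least one pair matches
    print(np.array_equal(x, y))  # False: the arrays differ somewhere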


NumPy also provides useful methods for operating on arrays, such as
``sum`` and ``mean``:

::

    In [28]: a1.sum()
    Out[28]: 120

    In [29]: a1.mean()
    Out[29]: 30.0


which add up the values in the array and compute its mean
respectively.  These operations can also be written using notation
that looks more like a function call:

::

    In [32]: np.mean(a1)
    Out[32]: 30.0

    In [33]: np.sum(a1)
    Out[33]: 120


**Task 1:** Write a function:

::

    def var(y):

that computes the variance of ``y``, where ``y`` is a NumPy array.  We
will define variance to be:

.. math:: \mathrm{Var}(\mathbf{y}) = \frac{1}{N}{\sum_{n=1}^N} (y_n - \bar y)^2,

where :math:`\mathbf{y} = (y_1, y_2, \dots, y_N)`, and :math:`\bar y` denotes the mean of all of the :math:`y_n`.  Your solution
should **not** include an explicit loop.

The code in ``lab6.py`` will call ``var`` on an array called ``graffiti``,
which contains the graffiti column from the city data set that you will use in Programming Assignment #5,
and on a ``garbage`` array, which contains the garbage
column from the same data set.  Here's the output of our
implementation:

::

    GRAFFITI 409854.475818
    GARBAGE 3159.33473311


Two-dimensional arrays
----------------------

One-dimensional arrays are useful, but the real power of NumPy becomes
more apparent when working with data that looks more like a matrix.
For example, here's a matrix represented using a list-of-lists:

::

    m = [[0, 1, 4, 9],
         [16, 25, 36, 49],
         [64, 81, 100, 121],
         [144, 169, 196, 225],
         [256, 289, 324, 361],
         [400, 441, 484, 529]]

We can convert this data into a two-dimensional array as follows:

::

    In [34]: b = np.array(m)

where the value of ``b`` will be:

::

    In [34]: b
    array([[  0,   1,   4,   9],
           [ 16,  25,  36,  49],
           [ 64,  81, 100, 121],
           [144, 169, 196, 225],
           [256, 289, 324, 361],
           [400, 441, 484, 529]])


Accessing elements of a two-dimensional (2D) NumPy array can be done using the same
syntax as a list-of-lists, that is, the expression ``b[i][j]`` will yield
the jth element of the ith row of ``b``.  More conveniently, you can
use a tuple to access the elements of a NumPy array.  That is, the
expression ``b[i, j]`` will also yield the jth element of the ith row
of ``b``.
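
As a quick check, the following sketch rebuilds ``b`` from the same
values and confirms that the two indexing forms select the same
element:

.. code:: python

    import numpy as np

    b = np.array([[0, 1, 4, 9],
                  [16, 25, 36, 49],
                  [64, 81, 100, 121],
                  [144, 169, 196, 225],
                  [256, 289, 324, 361],
                  [400, 441, 484, 529]])

    # Both forms select row 2, column 1.
    print(b[2][1])   # 81
    print(b[2, 1])   # 81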

NumPy arrays also support slicing and other more advanced forms of
indexing.  For example, the expression ``b[1:4]`` will yield:

::

    In [35]: b[1:4]
    array([[ 16,  25,  36,  49],
           [ 64,  81, 100, 121],
           [144, 169, 196, 225]])

that is, rows 1, 2, and 3 from ``b``.  The expression ``b[1:4, 2:4]`` will
yield columns 2 and 3 from rows 1, 2, and 3 of ``b``:

::

    In [36]: b[1:4, 2:4]
    array([[ 36,  49],
           [100, 121],
           [196, 225]])


As with list slicing, a colon (``:``) can be used to indicate
that you wish to include all the indices in a particular dimension.
For example, ``b[:, 2:4]`` will yield a slice of ``b`` with columns 2
and 3 from all the rows. Recall that a slice excludes its endpoint.
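
Here is a short sketch of that column slice, using the same values as
``b`` (the name ``cols`` is ours). The shape confirms that all six rows
are kept and only two columns remain:

.. code:: python

    import numpy as np

    b = np.array([[0, 1, 4, 9],
                  [16, 25, 36, 49],
                  [64, 81, 100, 121],
                  [144, 169, 196, 225],
                  [256, 289, 324, 361],
                  [400, 441, 484, 529]])

    cols = b[:, 2:4]    # all rows; columns 2 and 3 (column 4 is excluded)
    print(cols)
    print(cols.shape)   # (6, 2)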

In addition to slicing, you can also specify a list of indices as an
index.  For example, the expression ``b[:, [1, 3]]`` will yield
columns 1 and 3 from ``b``:

::

    In [37]: b[:, [1, 3]]
    array([[  1,   9],
           [ 25,  49],
           [ 81, 121],
           [169, 225],
           [289, 361],
           [441, 529]])

One thing to keep in mind with NumPy arrays is that you will lose a dimension
if you specify a single column or row as an index.  For example,
notice that the results of the following two expressions are both
one-dimensional arrays:

::

    In [38]: b[1, :]
    Out[38]: array([16, 25, 36, 49])

    In [39]: b[:, 1]
    Out[39]: array([  1,  25,  81, 169, 289, 441])

If you wish to retain the dimension, you can use list indexing:

::

    In [40]: b[:, [1]]
    Out[40]: 
    array([[  1],
           [ 25],
           [ 81],
           [169],
           [289],
           [441]])

    In [41]: b[[1], :]
    Out[41]: array([[16, 25, 36, 49]])
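
Comparing the ``shape`` of each result makes the difference easy to
see. This sketch rebuilds ``b`` from the same values and prints the
shape produced by each style of indexing:

.. code:: python

    import numpy as np

    b = np.array([[0, 1, 4, 9],
                  [16, 25, 36, 49],
                  [64, 81, 100, 121],
                  [144, 169, 196, 225],
                  [256, 289, 324, 361],
                  [400, 441, 484, 529]])

    print(b[:, 1].shape)     # (6,)   one-dimensional: a dimension was dropped
    print(b[:, [1]].shape)   # (6, 1) two-dimensional: a single column
    print(b[1, :].shape)     # (4,)   one-dimensional: a dimension was dropped
    print(b[[1], :].shape)   # (1, 4) two-dimensional: a single row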


**Task 2:** In the ``task2`` function in ``lab6.py``,
write expressions to extract the following subarrays of ``b``:

* rows 0, 1, and 2
* rows 0, 1, and 5
* columns 0, 1, and 2
* columns 0, 1, and 3
* columns 0, 1, and 2 from rows 2 and 3.

**Task 3:** We have imported the functions ``linear_regression`` and ``prepend_ones_column`` from PA #5
in ``lab6.py``.  Write code to call ``linear_regression`` using a column of all ones
and columns 2 (``RODENTS``) and 3 (``GARBAGE``) of ``city_data`` as the value for ``X`` and column
7 (``CRIME_TOTALS``) as the value for ``y``.  This function expects a
two-dimensional NumPy array for the value of ``X`` and a
one-dimensional NumPy array for the value of ``y``.

Hint: you can do this task in a single line of code.

The result should be:

::

    array([ 66.60834501,   0.58072845,  16.82863941])


**Task 4:** Write code to call ``linear_regression`` using a column of all ones and column 0
(``GRAFFITI``) of ``city_data`` as the value for ``X`` and column 7 (``CRIME_TOTALS``) as the
value for ``y``.

The result should be:

::

    array([ 593.77754996,    0.76632378])
     

Other useful operations
-----------------------

You can find the number of dimensions, shape, and the number of
elements in a NumPy array using the ``ndim``, ``shape`` and ``size``
properties respectively.

::

    In [42]: b.ndim
    Out[42]: 2

    In [43]: b.shape
    Out[43]: (6, 4)

    In [44]: b.size
    Out[44]: 24
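
These three attributes are related, as the following sketch
illustrates (the names ``rows`` and ``cols`` are ours): ``ndim`` is the
length of the ``shape`` tuple, and ``size`` is the product of its
entries.

.. code:: python

    import numpy as np

    b = np.array([[0, 1, 4, 9],
                  [16, 25, 36, 49],
                  [64, 81, 100, 121],
                  [144, 169, 196, 225],
                  [256, 289, 324, 361],
                  [400, 441, 484, 529]])

    rows, cols = b.shape            # unpack the shape tuple
    print(rows, cols)               # 6 4
    print(b.ndim == len(b.shape))   # True
    print(b.size == rows * cols)    # True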


As noted above, you can compute the mean of the elements using the
``mean`` method.

::

    In [54]: b.mean()
    Out[54]: 180.16666666666666


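You can also compute the per-column mean and the per-row mean using
the mean method by specifying an *axis*. Axis 0 runs down the rows, so
averaging along it yields one mean per column; axis 1 runs across the
columns, so averaging along it yields one mean per row: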

::


    In [55]: b.mean(0)
    Out[55]: array([ 146.66666667,  167.66666667,  190.66666667,  215.66666667])

    In [56]: b.mean(1)
    Out[56]: array([   3.5,   31.5,   91.5,  183.5,  307.5,  463.5])
    
    In [57]: np.mean(b, axis=0)
    Out[57]: array([ 146.66666667,  167.66666667,  190.66666667,  215.66666667])

    In [58]: np.mean(b, axis=1)
    Out[58]: array([   3.5,   31.5,   91.5,  183.5,  307.5,  463.5])
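
One way to convince yourself which axis is which is to compare the
result against the mean of a single column or row slice. The sketch
below does this with the same values as ``b``; the names
``col_means`` and ``row_means`` are ours, and ``np.isclose`` is used
to compare floating-point values:

.. code:: python

    import numpy as np

    b = np.array([[0, 1, 4, 9],
                  [16, 25, 36, 49],
                  [64, 81, 100, 121],
                  [144, 169, 196, 225],
                  [256, 289, 324, 361],
                  [400, 441, 484, 529]])

    col_means = b.mean(axis=0)   # one value per column
    row_means = b.mean(axis=1)   # one value per row

    print(col_means.shape)       # (4,)
    print(row_means.shape)       # (6,)

    # Each entry matches the mean of the corresponding column or row slice.
    print(np.isclose(col_means[0], b[:, 0].mean()))   # True
    print(np.isclose(row_means[0], b[0, :].mean()))   # True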


When finished
-------------

When finished with the lab, please check in your work (assuming you are
inside the lab directory):

.. code::

    git add lab6.py
    git commit -m "Finished with lab6"
    git push

No, we're not grading this; we just want to look for common errors.