Linear Regression¶
Due: Friday, November 20 at 3pm CST.
In this assignment, you will fit linear regression models and implement a few simple variable selection algorithms. The assignment will give you experience with NumPy and more practice with using classes and functions to support code reuse.
You must work alone on this assignment.
Introduction¶
At the heart of the assignment is a table, where each column is a variable and each row is a sample unit. As an example, in a health study, each sample unit might be a person, with variables like height, weight, sex, etc. In your analysis, you will build models that, with varying levels of accuracy, can predict the value of one of the variables as a function of the others.
Good predictions are only possible if variables are related somehow. As an example, look at this plot of recorded crimes against logged complaint calls about garbage to 311.
Each point describes a sample unit, which in this example represents a geographical region of Chicago. Each region is associated with variables, such as the number of crimes or complaint calls during a fixed time frame. Given this plot, if you were asked how many crimes you think were recorded for a region that had 150 complaint calls about garbage, you would follow the general trend and probably say something like 3,000 recorded crimes. To formalize this prediction, we need a model for the data that relates a dependent variable (e.g., crimes) to a set of predictor variables (e.g., complaint calls). Our model will assume a linear dependence between dependent and predictor variables.
To make this precise, we will use the following notation:
- \(N\)
the total number of sample units.
- \(K\)
the total number of predictor variables. In the example above, \(K = 1\).
- \(n\)
the sample unit that we are currently considering (an integer from \(0\) to \(N - 1\)).
- \(k\)
the predictor variable that we are currently considering (an integer from \(0\) to \(K - 1\)).
- \(x_{nk}\)
an observation of a predictor variable \(k\) for sample unit \(n\), e.g., the number of complaint calls about garbage.
- \(y_n\)
an observation of the dependent variable for sample unit \(n\), e.g., the total number of crimes.
- \(\hat y_n\)
our prediction for the dependent variable for sample unit \(n\), based on our observation of the predictor variables. This value corresponds to a point on the red line.
- \(\varepsilon_n = y_n - \hat y_n\)
The residual or observed error, that is, the difference between the actual observed value of the dependent variable, and our prediction for it. Ideally, our predictions would match the observations, so that \(\varepsilon_n\) would always be zero. In practice, there will be some discrepancy, for several reasons. First, there may be noise in our data, which arises from errors measuring the independent or dependent variables. Additionally, our model will assume a linear dependence between the predictor variables and the dependent variable, while in reality the relationship is unlikely to be exactly linear. So, even when we do have direct access to the observations of the dependent variable, we will not have \(\varepsilon_n\) equal to zero. That said, linear models are a surprisingly good fit to a wide variety of real-world data, and are an excellent baseline to consider before moving on to more advanced models.
Our prediction for the dependent variable will be given by a linear equation:
\[\hat y_n = \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_K x_{nK} \tag{1}\]
where the coefficients \(\beta_0, \beta_1, \dots, \beta_K\) are real numbers. We would like to select values for these coefficients that result in small residuals \(\varepsilon_n\).
We can rewrite this equation more concisely using vector notation. We define:
- \(\newcommand{\vbeta}{\pmb{\beta}} \vbeta = \begin{pmatrix} \beta_0 & \beta_1 & \beta_2 & \cdots & \beta_K\end{pmatrix}^T\)
a column vector of the regression coefficients, where \(\beta_0\) is the intercept and \(\beta_k\) (for \(1 \leq k \leq K\)) is the coefficient associated with the \(k\mathrm{th}\) predictor. This vector describes the red line in the figure above. Note that a positive value of a coefficient suggests a positive correlation with the dependent variable. The same is true for a negative value and a negative correlation.
- \(\mathbf{x}_n = \begin{pmatrix} 1 & x_{n1} & x_{n2} & \cdots & x_{nK}\end{pmatrix}^T\)
a column vector representation of all the predictors for a given sample unit. Note that a 1 has been prepended to the vector. This will allow us to rewrite equation (1) in vector notation without having to treat \(\beta_0\) separately from the other coefficients \(\beta_k\).
We can then rewrite equation (1) as:
\[\hat y_n = \mathbf{x}_n^T \vbeta \tag{2}\]
This equation can be written for all sample units at the same time using matrix notation. We define:
- \(\newcommand{\vy}{\mathbf{y}} \vy = \begin{pmatrix} y_0 & y_1 & \cdots & y_{N - 1}\end{pmatrix}^T\)
a column vector of observations of the dependent variable.
- \(\newcommand{\vyhat}{\mathbf{\hat y}} \vyhat = \begin{pmatrix} \hat y_0 & \hat y_1 & \cdots & \hat y_{N - 1}\end{pmatrix}^T\)
a column vector of predictions for the dependent variable.
- \(\newcommand{\veps}{\pmb{\varepsilon}} \veps = \begin{pmatrix} \varepsilon_0 & \varepsilon_1 & \cdots & \varepsilon_{N - 1}\end{pmatrix}^T\)
a column vector of the residuals (observed errors).
- \(\mathbf{X}\)
an \(N \times (K + 1)\) matrix where each row is one sample unit. The first column of this matrix is all ones.
We can then write equations (1) and (2) for all sample units at once as
\[\vyhat = \mathbf{X} \vbeta \tag{3}\]
And, we can express the residuals as
\[\veps = \vy - \vyhat \tag{4}\]
Matrix multiplication
Equations (2) and (3) above involve matrix multiplication. If you are unfamiliar with matrix multiplication, you will still be able to do this assignment. Just keep in mind that to make the calculations less messy, the matrix \(\mathbf{X}\) contains not just the observations of the predictor variables, but also an initial column of all ones. The data we provide does not yet have this column of ones, so you will need to prepend it.
Model fitting
There are many possible candidates for \(\vbeta,\) some that fit
the data better than others. Finding the best value of \(\vbeta\)
is referred to as fitting the model. For our purposes, the “best”
value of \(\vbeta\) is the one that minimizes the residuals
\(\veps = \vy - \vyhat\) in the least-squares sense. That is, we
want the value for \(\vbeta\) such that the predicted values
\(\mathbf{\hat y} = \mathbf{X} \vbeta\) are as close to the
observed values \(\vy\) as possible (in a statistically-motivated
way using maximum likelihood). We will provide a function that
computes this value of \(\vbeta\); see “The linear_regression
function” below.
The linear_regression function¶
We are interested in the value of \(\vbeta\) that best fits the data. To simplify your job, we have provided code for fitting a linear model. The function linear_regression(X, y) in util.py finds the best value of \(\vbeta\) as described in the introduction. This function accepts as input a two-dimensional NumPy array of floats X containing the observations of the predictor variables (\(\mathbf{X}\)) and a one-dimensional NumPy array of floats y containing the observations of the dependent variable (\(\vy\)). It returns \(\vbeta\) as a one-dimensional NumPy array beta.
We have also provided a function, apply_beta(beta, X), for making predictions using the model. It takes beta, the array generated by our linear regression function, and X as above, applies the linear function described by beta to each row in X, and returns a one-dimensional NumPy array of the resulting values (that is, it returns \(\mathbf{X} \vbeta\)).
Finally, we have provided a function prepend_ones_column(X) to prepend the necessary column of ones that allows linear_regression to fit an intercept \(\beta_0\) as discussed above.
Here are sample calls to prepend_ones_column, linear_regression, and apply_beta:
In [1]: import numpy as np
In [2]: from util import *
In [3]: predictors = np.array([[5, 2], [3, 2], [6, 2.1], [7, 3]])
In [4]: X = prepend_ones_column(predictors)
In [5]: X
Out[5]:
array([[1. , 5. , 2. ],
[1. , 3. , 2. ],
[1. , 6. , 2.1],
[1. , 7. , 3. ]])
In [6]: y = np.array([5, 2, 6, 6]) # observations of dependent variable
In [7]: beta = linear_regression(X, y)
In [8]: beta
Out[8]: array([ 1.20104895, 1.41083916, -1.6958042 ])
In [9]: apply_beta(beta, X) # yhat_n = 1.2 + 1.4 * x_n1 - 1.7 * x_n2
Out[9]: array([4.86363636, 2.04195804, 6.1048951 , 5.98951049])
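If you are curious, the kind of least-squares fit that linear_regression performs can be expressed very compactly in NumPy. The sketch below is for intuition only and is not necessarily how util.py implements it; in your code, always call the provided linear_regression function.
import numpy as np

def least_squares_sketch(X, y):
    '''Sketch only: find beta minimizing the squared residuals ||y - X @ beta||^2.

    X is an (N, K + 1) array whose first column is all ones, and y is a
    length-N array of observations of the dependent variable.
    '''
    # np.linalg.lstsq solves the least-squares problem directly; we ignore
    # the diagnostic values it also returns.
    beta, _residuals, _rank, _sing_vals = np.linalg.lstsq(X, y, rcond=None)
    return beta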
Getting started¶
Before you start working on the assignment’s tasks, please take a moment to follow the steps described on the Coursework Basics page to get the files for this assignment (these steps will be the same as the ones you followed to get the files for previous assignments). Please note that you will not be able to start working on the assignment until you fetch the assignment files (which will appear in a pa5 directory in your repository).
The pa5 directory contains the following files:
regression.py
Python file where you will write your code.
util.py
Python file with several helper functions, some of which you will need to use in your code.
output.py
This file is described in detail in the “Evaluating your model output” section below.
test_regression.py
Python file with the automated tests for this assignment.
The pa5 directory also contains a data directory which, in turn, contains two sub-directories: city and lifeexpect.
Data¶
In this assignment you will write code that can be used on many different datasets. We have provided a couple of sample datasets for testing purposes.
- City
Predicting crime rate from the number of complaint calls to 311.
- Life Expectancy
Predicting average life expectancy for a country based on other health metrics.
More information about each dataset can be found by clicking the links above.
For this assignment, a dataset is stored in a directory that contains two files: a CSV (comma-separated values) file called data.csv and a JSON file called parameters.json. The CSV file contains a table where each column corresponds to a variable and each row corresponds to a sample unit. The first row contains the column names. The JSON file contains a few parameters:
name
: the name of the dataset,
predictor_vars
: the column indices of the predictor variables,
dependent_var
: the column index of the dependent variable,
training_fraction
: the fraction of the data that should be used to train models, and
seed
: a random number generator seed.
The last two parameters will be used to help split the data into two sets: one that you will use to fit or train models and one that you will use to evaluate how well different models predict outcomes on new-to-the-model data. We describe this process below.
Task 0: The Model class¶
In this assignment, you will be working with datasets and models, so it will make sense for us to have a DataSet class and a Model class, the latter of which you will implement in regression.py (we have provided a basic skeleton for the Model class in that file).
This task describes the requirements for the Model class and, while you should give some thought to the design of this class before moving on to the rest of the assignment, it is very likely you will have to reassess your design throughout the assignment (and you should not get discouraged if you find yourself doing this). We suggest you start by writing a first draft of the Model class, and then fine-tune it as you work through the rest of the assignment.
Representing Datasets
The DataSet class we provide for you handles loading in the above-mentioned data files (CSV and JSON). It reads the data into a NumPy array and uses the JSON file to set various attributes. Additionally, it partitions the data into two different NumPy arrays, a training_data array and a testing_data array.
Why not use all the data to fit the models? It is easy to fit a model that does a good job of predicting the dependent variable for the sample units that were used to train it, but does a terrible job of predicting the dependent variable for other data. How do we determine how well a model does with new-to-it data? We can hold out some of the data and use the model to predict the values of the dependent variable for the held-out sample units. Since we know the actual value of the dependent variable for this held-out data, we can use it to evaluate the effectiveness of the model on previously-unseen data.
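The provided DataSet class takes care of this split for you, using the training_fraction and seed parameters from parameters.json. Purely to illustrate the idea (this is a sketch, not the provided implementation, and the actual split may be computed differently), a seeded split could look like this:
import numpy as np

def split_sketch(data, training_fraction, seed):
    '''Sketch only: shuffle the rows of data and split them into
    training and testing sets.'''
    rng = np.random.default_rng(seed)        # seeded, so the split is reproducible
    order = rng.permutation(len(data))       # a random ordering of the row indices
    num_training = int(training_fraction * len(data))
    training_data = data[order[:num_training]]
    testing_data = data[order[num_training:]]
    return training_data, testing_data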
Representing models
Implementing a Model class will allow us to write functions that take Model objects as parameters, and which can return Model objects as well, instead of having to pass around lists or dictionaries with the information for a given model.
Each model will be constructed from a dataset and a subset of the dataset’s predictor variables. This means that your Model class will need to take a dataset and a list of the indices of predictor variables as arguments to its constructor. We encourage you to think carefully about what attributes and methods to include in your class.
Take into account that, for the tests to work, your Model class must have at least the following public attributes (don’t worry if you don’t understand what these mean right now; they will become clearer in subsequent tasks):
dep_var
: The index of the dependent variable.
pred_vars
: A list of integers containing the indices of the predictor variables.
beta
: A NumPy array with the model’s \(\vbeta\).
To compute beta, you will use the linear_regression function that we discussed above. As we will describe later, it can also be very helpful to include a __repr__ method.
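To make these requirements concrete, here is a rough first-draft sketch of what such a class could look like. The DataSet attribute names used below (dataset.dependent_var, dataset.training_data) are assumptions for illustration; use whatever names the provided DataSet class actually exposes, and expect to refine this design as you work through the later tasks.
from util import prepend_ones_column, linear_regression

class Model:
    '''Sketch of a first draft; not a complete solution.'''

    def __init__(self, dataset, pred_vars):
        self.dep_var = dataset.dependent_var    # assumed DataSet attribute name
        self.pred_vars = pred_vars
        # Pull out the predictor columns and the dependent column from the
        # training data, and prepend the column of ones before fitting.
        X = prepend_ones_column(dataset.training_data[:, pred_vars])
        y = dataset.training_data[:, self.dep_var]
        self.beta = linear_regression(X, y)
        # Task 1 will add an R2 attribute computed from X, y, and beta.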
Testing your code¶
As usual, you will be able to test your code from IPython and by using py.test.
When using IPython, make sure to enable autoreload before importing the regression.py module:
In [1]: %load_ext autoreload
In [2]: %autoreload 2
In [3]: import regression
You might then try creating a new Model
object by creating a DataSet
object (let’s assume
we create a dataset
variable for this) and running the following:
In [3]: dataset = regression.DataSet('data/city')
In [4]: model = regression.Model(dataset, [0, 1, 2])
You can also use the automated tests. For example, you might run:
$ py.test -vk task0
A note on array dimensions
The parameter of the function prepend_ones_column (as well as the first parameter to the functions linear_regression and apply_beta described above) is required to be a two-dimensional array, even when that array only has one column. NumPy makes a distinction between one-dimensional and two-dimensional arrays, even if they contain the same information. For instance, if we want to extract the second column of A as a one-dimensional array, we do the following:
>>> A
array([[5. , 2. ],
[3. , 2. ],
[6. , 2.1],
[7. , 3. ]])
>>> A[:, 1]
array([2. , 2. , 2.1, 3. ])
>>> A[:, 1].shape
(4,)
The resulting shape will not be accepted by prepend_ones_column.
To retrieve a 2D column subset of A, you can use a list of integers as the index. This mechanism keeps A two-dimensional, even if the list of indices has only one value:
>>> A[:, [1]]
array([[2. ],
[2. ],
[2.1],
[3. ]])
>>> A[:, [1]].shape
(4, 1)
In general, you can specify a slice containing a specific subset of columns as a list. For example, let Z be:
>>> Z = np.array([[1, 2, 3, 4], [11, 12, 13, 14], [21, 22, 23, 24]])
>>> Z
array([[ 1, 2, 3, 4],
[11, 12, 13, 14],
[21, 22, 23, 24]])
Evaluating the expression Z[:, [0, 2, 3]] will yield a new 2D array with columns 0, 2, and 3 from the array Z:
>>> Z[:, [0, 2, 3]]
array([[ 1, 3, 4],
[11, 13, 14],
[21, 23, 24]])
or more generally, we can specify any expression for the slice that yields a list:
>>> l = [0, 2, 3]
>>> Z[:, l]
array([[ 1, 3, 4],
[11, 13, 14],
[21, 23, 24]])
Task 1: Computing the Coefficient of Determination (R2)¶
Blindly regressing on all available predictors is bad practice and can lead to over-fit models. Over-fitting happens when your model picks up on random structure in the training data that does not represent the underlying phenomenon (Wikipedia has a nice explanation). This situation is made even worse if some of your predictors are highly correlated (see multicollinearity), since it can ruin the interpretation of your coefficients. As an example, consider a study that looks at the effects of smoking and drinking on developing lung cancer. Smoking and drinking are highly correlated, so the regression can arbitrarily pick up either as the explanation for the high prevalence of lung cancer. As a result, drinking could be assigned a high coefficient while smoking a low one, leading to the questionable conclusion that drinking causes lung cancer while smoking does not.
This example illustrates the importance of variable selection. In order to compare two different models (subsets of predictors) against each other, we need a measurement of their goodness of fit, i.e., how well they explain the data. A popular choice is the coefficient of determination, R2, which first considers the variance of the dependent variable, \(\DeclareMathOperator{\Var}{Var} \Var(\vy)\).
In this example, we revisit the regression of the total number of crimes, \(y_n\), on the number of calls about garbage to 311, \(x_n\). To the right, a histogram over the \(y_n\) is shown and their variance is calculated. The variance is a measure of how spread out the data is. It is calculated as follows:
\[\Var(\vy) = \frac{1}{N} \sum_{n=0}^{N-1} (y_n - \bar y)^2 \tag{5}\]
where \(\bar y\) denotes the mean of all of the \(y_n\).
We now subtract the red line from each point. These new points are the residuals and represent the deviations from the predicted values under our model. If the variance of the residuals, \(\Var(\vy - \vyhat),\) is zero, it means that all residuals have to be zero and thus all sample units lie perfectly on the fitted line. On the other hand, if the variance of the residuals is the same as the variance of \(\vy\), it means that the model did not explain anything away and we can consider the model useless. These two extremes represent R2 values of 1 and 0, respectively. All R2 values in between these two extremes will be cases where the variance was reduced some, but not entirely. Our example is one such case:
As we have hinted, the coefficient of determination, R2, measures how much the variance was reduced by subtracting the model. More specifically, it calculates this reduction as a percentage of the original variance:
\[R^2 = \frac{\Var(\vy) - \Var(\vy - \vyhat)}{\Var(\vy)} = 1 - \frac{\Var(\vy - \vyhat)}{\Var(\vy)} \tag{6}\]
which finally becomes:
\[R^2 = 1 - \frac{\sum_{n=0}^{N-1} (y_n - \hat y_n)^2}{\sum_{n=0}^{N-1} (y_n - \bar y)^2} \tag{7}\]
In equation (7) we omitted the normalization constants N since they cancel each other out. We also did not subtract the mean when calculating the variance of the residuals, since it can be shown mathematically that if the model has been fit by least squares, the sum of the residuals is always zero.
There are two properties of R2 that are good to keep in mind when checking the correctness of your calculations:
\(0 \leq R^2 \leq 1\).
If model A contains a superset of predictors of model B, then the R2 value of model A is greater than or equal to the R2 value of model B.
These properties are only true if R2 is computed using the same data that was used to train the model parameters. If you calculate R2 on a held-out testing set, the R2 value could decrease with more predictors (due to over-fitting) and R2 is no longer guaranteed to be greater than or equal to zero. Furthermore, equations (6) and (7) would no longer be equivalent (and the intuition behind R2 would need to be reconsidered, but we omit the details here). You should always use equation (7) to compute R2.
A good model should explain as much of the variance as possible, so we will be favoring higher R2 values. However, since using all predictors gives the highest R2, we must balance this goal with a desire to use few predictors. In general it is important to be cautious and not blindly interpret R2; take a look at these two generated examples:
This data was generated by selecting points on the dashed blue function, and adding random noise. To the left, we see that the fitted model is close to the one we used to generate the data. However, the variance of the residuals is still high, so the R2 is rather low. To the right is a model that clearly is not right for the data, but still manages to record a high R2 value. This example also shows that linear regression is not always the right model choice (although, in many cases, a simple transformation of the predictors or the dependent variable can resolve this problem, making the linear regression paradigm still relevant).
Task 1¶
You must add the following attribute to your Model class and compute it:
R2
: The value of R2 for the model.
Note that when we ask for the R2 value for a model, we mean the value obtained by computing R2 using the model’s \(\vbeta\) on the data that was used to train it.
Note
You may not use the NumPy var
method to compute R2 for
two reasons: (1) our computation uses the biased variance while
the var
method computes the unbiased variance and will lead
you to generate the wrong answers and (2) we want you to get
practice working with NumPy arrays.
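For reference, a direct translation of equation (7) into NumPy operations might look like the sketch below, where y holds the observed values and yhat the predictions (e.g., from apply_beta). The function name is for illustration only; in your class this computation will likely live in the constructor or a helper method.
import numpy as np

def r_squared_sketch(y, yhat):
    '''Sketch only: compute R2 as in equation (7), without using np.var.'''
    ybar = y.mean()                          # mean of the observed values
    ss_resid = np.sum((y - yhat) ** 2)       # sum of squared residuals
    ss_total = np.sum((y - ybar) ** 2)       # total sum of squares
    return 1 - ss_resid / ss_total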
Task 2: Model evaluation using the Coefficient of Determination (R2)¶
Task 2a¶
You must implement this function in regression.py:
def compute_single_var_models(dataset):
    '''
    Computes all the single-variable models for a dataset

    Inputs:
        dataset: (DataSet object) a dataset

    Returns:
        List of Model objects, each representing a single-variable model
    '''
More specifically, given a dataset with P predictor variables, you will construct P univariate (i.e., one variable) models of the dataset’s dependent variable, one for each of the possible predictor variables (in the same order that the predictor variables are listed), and compute their associated R2 values.
Most of the work in this task will be in your Model class. In fact, if you’ve designed Model correctly, your implementation of compute_single_var_models shouldn’t be longer than three lines (ours is actually just one line long).
Task 2b¶
You must implement this function in regression.py:
def compute_all_vars_model(dataset):
    '''
    Computes a model that uses all the predictor variables in the dataset

    Inputs:
        dataset: (DataSet object) a dataset

    Returns:
        A Model object that uses all the predictor variables
    '''
More specifically, you will construct a single model that uses all of the dataset’s predictor variables to predict the dependent variable. According to the second property of R2, the R2 for this model should be the largest R2 value you will see for the training data.
At this point, you should make sure that your code to calculate a model’s R2 value is general enough to handle models with multiple predictor variables (i.e., multivariate models).
Take into account that, if you have a good design for Model, then implementing Task 2b should be simple after implementing Task 2a. If it doesn’t feel that way, ask on Piazza or come to office hours so we can give you some quick feedback on your design.
Evaluating your model output¶
Note that IPython can be incredibly helpful when testing Task 2 and subsequent tasks.
For example you can test Task 2a code by creating a DataSet
object (let’s assume
we create a dataset
variable for this) and running the following:
In [3]: dataset = regression.DataSet('data/city')
In [4]: univar_models = regression.compute_single_var_models(dataset)
univar_models should then contain a list of Model objects. However, checking the R2 values manually from IPython can be cumbersome, so we have included a Python program, output.py, that will print out these (and other) values for all of the tasks. You can run it from the command-line like this:
$ python3 output.py data/city
If your implementation of Task 2a is correct, you will see this:
City Task 2a
!!! You haven't implemented the Model __repr__ method !!!
R2: 0.1402749161031348
!!! You haven't implemented the Model __repr__ method !!!
R2: 0.6229070858532733
!!! You haven't implemented the Model __repr__ method !!!
R2: 0.5575360783921093
!!! You haven't implemented the Model __repr__ method !!!
R2: 0.7831498392992615
!!! You haven't implemented the Model __repr__ method !!!
R2: 0.7198560514392482
!!! You haven't implemented the Model __repr__ method !!!
R2: 0.32659079486818354
!!! You haven't implemented the Model __repr__ method !!!
R2: 0.6897288976957778
You should be producing the same R2 values shown above, but we can’t tell what model they each correspond to! For output.py to be actually useful, you will need to implement the __repr__ method in Model.
We suggest your string representation be in the following form: the name of the dependent variable followed by a tilde (~) and the regression equation with the constant first. If you format the floats to have six decimal places, the output for Task 2a will now look like this:
City Task 2a
CRIME_TOTALS ~ 575.687669 + 0.678349 * GRAFFITI
R2: 0.1402749161031348
CRIME_TOTALS ~ -22.208880 + 5.375417 * POT_HOLES
R2: 0.6229070858532733
CRIME_TOTALS ~ 227.414583 + 7.711958 * RODENTS
R2: 0.5575360783921093
CRIME_TOTALS ~ 11.553128 + 18.892669 * GARBAGE
R2: 0.7831498392992615
CRIME_TOTALS ~ -65.954319 + 13.447459 * STREET_LIGHTS
R2: 0.7198560514392482
CRIME_TOTALS ~ 297.222082 + 10.324616 * TREE_DEBRIS
R2: 0.32659079486818354
CRIME_TOTALS ~ 308.489056 + 10.338500 * ABANDONED_BUILDINGS
R2: 0.6897288976957778
For example, the first model uses only one variable, GRAFFITI, to predict crime, and has an R2 of only 0.14027. The last model uses only ABANDONED_BUILDINGS, and the higher R2 (0.6897) tells us that ABANDONED_BUILDINGS is likely a better predictor than GRAFFITI.
Take into account that you can also run output.py
with the Life Expectancy dataset:
$ python3 output.py data/lifeexpect
The full expected output of output.py
can be found in the
City and Life Expectancy pages. However, please note that you do not need to
check all this output manually. We have also included automated tests
that will do these checks for you. You can run them using this command:
$ py.test -vk task2
Warning: No hard-coding!
Since we only have two datasets, it can be very easy in some tasks to write code that will pass the tests by hard-coding the expected values in the function (and returning one or the other depending on the dataset that is being used).
If you do this, you will receive ZERO POINTS on your entire test score, regardless of whether other tasks are implemented correctly without hard-coding.
Task 3: Building and selecting bivariate models¶
If you look at the output for Task 2a, none of the predictor variables individually are particularly good predictors for the dependent variable (they all had low R2 values compared to when using all predictors). In this and subsequent tasks, we will construct better models using multiple variables without using all of them and risking over-fitting.
For example, we could predict crime using not only complaints about garbage, but graffiti as well. That is, we may want to find \(\vbeta = (\beta_0, \beta_\texttt{GARBAGE}, \beta_\texttt{GRAFFITI})\) in the equation
\[\hat y_n = \beta_0 + \beta_\texttt{GARBAGE} \, x_{n \, \texttt{GARBAGE}} + \beta_\texttt{GRAFFITI} \, x_{n \, \texttt{GRAFFITI}}\]
where \(\hat y_n\) is a prediction of the number of crimes given the number of complaint calls about garbage (\(x_{n \, \texttt{GARBAGE}}\)) and graffiti (\(x_{n \, \texttt{GRAFFITI}}\)).
For this task, you will test all possible bivariate models (\(K = 2\)) and determine the one with the highest R2 value. We suggest that you use two nested for-loops to iterate over all possible combinations of two predictors, calculate the R2 value for each combination, and keep track of the one with the highest R2 value.
Hint: Given three variables A, B, C, we only need to check (A, B), (A, C), and (B, C). Take into account that a model with the pair (A, B) of variables is the same as the model with (B, A).
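For instance, the looping pattern from the hint can be written so that each unordered pair is generated exactly once (the list of indices below is hypothetical):
# Sketch: generate each unordered pair of predictor indices exactly once.
pred_vars = [0, 1, 2, 3]                # hypothetical predictor variable indices
pairs = []
for i, var1 in enumerate(pred_vars):
    for var2 in pred_vars[i + 1:]:      # starting past var1 avoids (B, A) repeats
        pairs.append([var1, var2])

# pairs is now [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]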
To do this task, you will implement this function in regression.py:
def compute_best_pair(dataset):
    '''
    Find the bivariate model with the best R2 value

    Inputs:
        dataset: (DataSet object) a dataset

    Returns:
        A Model object for the best bivariate model
    '''
Unlike the functions in Task 2, you can expect to write more code in this
function. In particular, you should not include the code to loop over
all possible bivariate models within your Model
class; it should instead
be in this function.
You can test this function by running:
$ py.test -vk task3
You should also look at the output produced by output.py. How does the bivariate model compare to the single-variable models in Task 2? Which pair of variables performs best together? Does this result make sense given the results from Task 2? (You do not need to submit your answers to these questions; they’re just food for thought!)
Task 4: Building models of arbitrary complexity¶
How do we efficiently determine how many and which variables will generate the best model? It turns out that this question is unsolved and is of interest to both computer scientists and statisticians. To find the best model of three variables we could repeat the process in Task 3 with three nested for-loops instead of two. For K variables we would have to use K nested for-loops. Nesting for-loops quickly becomes computationally expensive and infeasible for even the most powerful computers. Keep in mind that models in genetics research easily reach into the thousands of variables. How then do we find optimal models of several variables?
We’ll approach this problem by using heuristics, which will give us an approximate (as opposed to an exact) result. We are going to split this problem into two tasks. In Task 4, you will determine the best K variables to use (for each possible value of K).
How do we determine which K variables will yield the best model for a fixed K? As noted, the naive approach (test all possible combinations as in Task 3) is intractable for large K. A simpler alternative approach is to choose the K variables with the K highest R2 values in the table from Task 2a. This approach does not work well if your predictors are correlated. As you’ve already seen, neither of the top two individual predictors for crime (GARBAGE and STREET_LIGHTS) is included in the best bivariate model (CRIME_TOTALS ~ -36.151629 + 3.300180 * POT_HOLES + 7.129337 * ABANDONED_BUILDINGS).
Instead, you’ll implement a heuristic known as Forward selection, which is an example of a greedy algorithm. Forward selection starts with an empty set of predictor variables and then adds variables one at a time until the set contains K variables. At each step, the algorithm identifies the variable not yet in the model that, when added, yields the model with the best R2 value. That variable is added to the list of predictor variables.
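In outline, the greedy loop might look like the sketch below. This is only one possible arrangement, and it assumes the Model constructor and R2 attribute from the earlier tasks; all_pred_vars stands for the dataset's list of predictor variable indices (how you obtain it depends on your design).
def forward_selection_sketch(dataset, all_pred_vars):
    '''Sketch only: greedy forward selection over all_pred_vars,
    returning one Model per value of K.'''
    selected = []                      # predictor indices chosen so far
    remaining = list(all_pred_vars)    # candidates not yet in the model
    models = []
    while remaining:
        # For each remaining variable, build a candidate model with it added,
        # and keep the variable whose model has the highest training R2.
        best_var = max(remaining,
                       key=lambda var: Model(dataset, selected + [var]).R2)
        selected = selected + [best_var]   # build a NEW list to avoid aliasing bugs
        remaining.remove(best_var)
        models.append(Model(dataset, selected))
    return models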
To do this task, you will implement this function in regression.py:
def forward_selection(dataset):
    '''
    Given a dataset with P predictor variables, uses forward selection to
    select models for every value of K between 1 and P.

    Inputs:
        dataset: (DataSet object) a dataset

    Returns:
        A list (of length P) of Model objects. The first element is the
        model where K=1, the second element is the model where K=2, and so on.
    '''
You can test this function by running this:
$ py.test -vk task4
A word of caution
In this task, you have multiple models, each with their own list of predictor variables. Recall that lists are stored by reference, so a bug in your code might lead to two models pointing to the same list! These sorts of bugs can be very tricky to debug, so remember to be careful.
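Here is a minimal illustration of the kind of aliasing bug this warning is about:
pred_vars = [0, 1]
model_a_vars = pred_vars            # both names refer to the SAME list object
model_b_vars = pred_vars

model_a_vars.append(2)              # intended to change only model A...
print(model_b_vars)                 # prints [0, 1, 2] -- model B changed too!

model_c_vars = list(pred_vars)      # making a copy avoids the problem
model_c_vars.append(3)
print(pred_vars)                    # still [0, 1, 2]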
Task 5: Training vs. test data¶
Please read this section very carefully
Until now, you have evaluated a model using the data that was used to train it. The resulting model may be quite good for that particular dataset, but it may not be particularly good at predicting novel data. This is the problem that we have been referring to as over-fitting.
Up to this point, we have only computed R2 values for the training data that was used to construct the model. That is, after training the model using the training data, we used that same data to compute an R2 value. It is also valid to compute an R2 value for a model when applied to other data (in our case, the data we set aside as testing data). When you do this, you still train the model (that is, you compute \(\vbeta\)) using the training data, but then evaluate an R2 value using the testing data.
Warning
When training a model, you should only ever use the training data, even if you will later evaluate the model using other data. You will receive zero credit for this task if you train a new model using the testing data.
To do this task, you will implement the following function in regression.py. We will apply this function to evaluate the models generated in Task 4.
def validate_model(dataset, model):
    '''
    Given a dataset and a model trained on the training data,
    compute the R2 of applying that model to the testing data.

    Inputs:
        dataset: (DataSet object) a dataset
        model: (Model object) A model that must have been trained
            on the dataset's training data.

    Returns:
        (float) An R2 value
    '''
If your Model
class is well designed, this function should not require
more than one line (or, at most, a few lines) to implement. If that is
not the case, please come to office hours or ask on Piazza so we can
give you some quick feedback on your design.
You can test this function by running this:
$ py.test -vk task5
We can see below how the training versus testing R2 value increases as we add more variables to our model for the City dataset. There are two interesting trends visible in these curves.
The R2 for the test set is consistently lower than the training set R2. Our model performs worse at predicting new data than it did on the training data. This is a common feature when training models with data, and is to be expected (why?).
The improvement in R2 slows as we add more variables. In essence, the first few variables provide a lot of predictive power, but each additional variable explains less and less. For some datasets (not the small ones in this exercise), it’s possible to add so many variables that the R2 of the test set starts to decrease! This is the essence of “over-fitting”: building a model that fits the training set too closely, and does not work well on novel (test) data.
Slight numerical errors
If you are consistently getting answers that are “close” but not correct, you may have inadvertently used the wrong equation. Make sure you are using (7) to compute R2 and the associated variance.
Grading¶
Programming assignments will be graded according to a general rubric. Specifically, we will assign points for completeness, correctness, design, and style. (For more details on the categories, see our PA Rubric page.)
The exact weights for each category will vary from one assignment to another. For this assignment, the weights will be:
Completeness: 50%
Correctness: 15%
Design: 20%
Style: 15%
You must include header comments in all the methods you implement.
Obtaining your test score¶
Like previous assignments, you can obtain your test score by running py.test followed by ../common/grader.py.
Submission¶
You must submit your work through Gradescope (linked from our Canvas
site). In the “Programming Assignment #5” assignment, simply upload
the file regression.py
(do not upload any
other files!). Please note:
You are allowed to make as many submissions as you want before the deadline.
Please make sure you have read and understood our Late Submission Policy.
Your completeness score is determined solely based on the automated tests, but we may adjust your score if you attempt to pass tests by rote (e.g., by writing code that hard-codes the expected output for each possible test input).
Gradescope will report the test score it obtains when running your code. If there is a discrepancy between the score you get when running our grader script, and the score reported by Gradescope, please let us know so we can take a look at it.
Acknowledgments: This assignment has been worked on by many people over the years, including Matthew Rocklin and Gustav Larsson.