Linear Regression

Due: Friday, Nov 11 at 4pm

In this assignment, you will fit linear regression models and implement simple variable selection algorithms. The assignment will give you experience with Numpy and more practice with using functions to support code reuse. You must work alone on this assignment.

At the heart of the assignment is a table of data, where each column is a variable and each row is a sample. As an example, in a health study, each sample might be a person, with variables like height, weight, sex, etc. In your analysis, you will build models that, with varying levels of accuracy, can predict the value of one of the variables as a function of the others.

Predictions are only possible if variables are related somehow. As an example, look at this plot of recorded crimes against logged complaint calls about garbage to 311.

Each point describes a sample, which in this example represents a geographical region of Chicago. Each region is associated with variables, such as the number of crimes or complaint calls during a fixed time frame. Given this plot, if you were asked how many crimes you think were recorded for a region that had 150 complaint calls about garbage, you would follow the general trend and probably say something like 3000 recorded crimes. To formalize this prediction, we need a model for the data that relates a dependent variable (e.g. crimes) to a set of predictor variables (e.g. complaint calls). We will assume a linear dependence as follows

\[y_n = \beta_0 + \beta_1 x_{n1} + \dots + \beta_K x_{nK} + \varepsilon_n,\]

or using vector notation

\[y_n = \mathbf{x}_n^T \beta + \varepsilon_n.\]

We have introduced the following notation:

\(n\)
The sample that we are considering out of \(N\) total.
\(y_n\)
An observation of the dependent variable, e.g. the total number of crimes.
\(x_{nk}\)
An observation of a predictor variable \(k\) for sample \(n,\) e.g. the complaint calls about garbage. We assume a total of \(K\) different predictors.
\(\beta = (\beta_0, \beta_1, \dots, \beta_K)^T\)
A column vector of the regression coefficients, where \(\beta_0\) is the intercept and \(\beta_k\) is the coefficient associated with the \(k\mathrm{th}\) predictor. This vector describes the red line in the figure above. Note that a positive coefficient suggests a positive correlation with the dependent variable, and a negative coefficient suggests a negative correlation.
\(\varepsilon_n\)
Not all sample points lie on the red line, which means the equation \(y_n = \mathbf{x}_n^T \beta\) is rarely satisfied. In our model, this discrepancy is assumed to be caused by random error. We assume that the random error is zero-mean and drawn from a normal distribution of a fixed variance. We want as much as possible to be explained deterministically, which means we will favor values of \(\beta\) that result in small errors.
\(\mathbf{x}_n = (1, x_{n1}, \dots, x_{nK})^T\)
A column vector representation of all the predictors for a given sample. As a convenience, a 1 has been prepended to the vector so as to incorporate the intercept \(\beta_0\) into the vector notation.

When making predictions, what we are really doing is asking what the expected value of the dependent variable is for a given set of predictors. Since we have assumed the errors have zero mean, to make a prediction we simply drop the error term:

\[\hat y = f(\mathbf{x}) = \mathbf{x}^T \beta.\]

This equation can be written for several samples at the same time as

\[\mathbf{\hat y} = \mathbf{X} \beta,\]

where \(\mathbf{\hat y}\) is a column vector of predictions and \(\mathbf{X}\) is an \(N \times (K+1)\) matrix in which each row is one sample vector \(\mathbf{x}_n^T\) (including the leading 1).

Model fitting

There are many possible candidates for \(\beta,\) some that fit the data better than others. Finding the best value of \(\beta\) is referred to as fitting the model. To simplify your job, we have provided code for fitting a linear model. The function linear_regression finds a value of \(\beta\) that minimizes \(\mathbf{\hat y} - \mathbf{y}\) in a least-squares sense. That is, it finds values for \(\beta\) such that the predicted values \(\mathbf{\hat y} = \mathbf{X} \beta\) are as close to the observed values as possible (in a statistically motivated way, using maximum likelihood). This function accepts a two-dimensional numpy array of floats containing the predictor variables (X) and a one-dimensional numpy array of floats (y) containing the dependent variable as input and returns \(\beta.\) Note that X does not correspond exactly with \(\mathbf{X},\) since it should not have the column of ones (we will add that for you).

We have also provided a function, apply_beta(beta, X) for making predictions using the model. It takes beta, the array generated by our linear regression function, and X, a two-dimensional numpy array of floats that contains values for the predictor variables, applies the function described by beta to each row in X, and returns a one-dimensional numpy array of the resulting values.
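For intuition, here is one plausible sketch of how these two functions could be implemented with Numpy least squares. The versions provided in model.py may differ in their details, so use the provided ones rather than writing your own.

import numpy as np

def linear_regression(X, y):
    # Sketch: prepend a column of ones (the intercept) and solve the
    # least-squares problem for beta.
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    beta, _, _, _ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def apply_beta(beta, X):
    # Sketch: prepend the same column of ones and compute one prediction
    # per row of X.
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    return X1 @ beta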

Here are sample calls to linear_regression and apply_beta:

>>> import numpy as np
>>> from model import *
>>> X = np.array([[5,2], [3,2], [6,2.1], [7, 3]]) # predictors
>>> y = np.array([5,2,6,6]) # dependent
>>> beta = linear_regression(X, y)
>>> beta
array([ 1.20104895,  1.41083916, -1.6958042 ]) # f(x) = 1.2 + 1.4*x1 - 1.7*x2
>>> apply_beta(beta, X)
array([ 4.86363636,  2.04195804,  6.1048951 ,  5.98951049]) # predictions

Both of these functions require X to be two-dimensional, even in the case of K = 1. Numpy makes a distinction between a one-dimensional and a two-dimensional array, even if they contain the same information. For instance, if we want to extract the second column of X as a one-dimensional array, we do the following:

>>> X
array([[ 5. ,  2. ],
       [ 3. ,  2. ],
       [ 6. ,  2.1],
       [ 7. ,  3. ]])
>>> X[:,1]
array([ 2. ,  2. ,  2.1,  3. ])
>>> X[:,1].shape
(4,)

However, this shape will not be accepted by linear_regression and apply_beta. To retrieve a column subset of X, you can use a list of integers to index the columns instead. This mechanism also has the benefit of keeping X two-dimensional, even if the list of indices has only one index:

>>> X[:,[1]]
array([[ 2. ],
       [ 2. ],
       [ 2.1],
       [ 3. ]])
>>> X[:,[1]].shape
(4, 1)

In general, you can specify a slice containing a specific subset of columns as a list. For example, let Z be:

>>> Z = np.array([[1, 2, 3, 4], [11, 12, 13, 14], [21, 22, 23, 24]])

Evaluating the expression Z[:, [0,2,3]] will yield a new 2D array with columns 0, 2, and 3 from the array Z:

>>> Z[:, [0, 2, 3]]
array([[ 1,  3,  4],
       [11, 13, 14],
       [21, 23, 24]])

More generally, the column index can be any expression that yields a list:

>>> l = [0, 2, 3]

>>> Z[:, l]
array([[ 1,  3,  4],
       [11, 13, 14],
       [21, 23, 24]])

Data

In this assignment you will perform the tasks below on two datasets. The first is the city data from the example in the introduction:

City: Predicting crime rate from the number of complaint calls to 311.

We have included the output generated by our solution using this data on the city data page to help with testing.

For the second one, you have a choice between the following two datasets:

Stock: Predicting a stock index from a set of stocks.
Election: Predicting election results for counties based on population statistics.

More information about each dataset, including which columns to use, can be found by following the links. Regardless of which dataset you work on, you will get two CSV (comma-separated values) files: training.csv and testing.csv. You will use the first to train models and the second to evaluate how well your models predict outcomes on unseen testing data.

We have provided a function, read_file in model.py that takes the name of a file and returns a pair containing the column labels as a list of strings and the table of data as a two-dimensional numpy array of floats. Here is a sample call:

col_names, data = read_file("data/city/training.csv")

To use the election data replace city with election. To use the stock data, replace city with stock.

Looking at the data sets, you may be tempted to assume that the first N-1 columns of each data set contain the independent variables and that the last column contains the dependent variable. Do not make this assumption. Each of the output_X.py files includes definitions of constants that specify the indices (column numbers) of the independent variables and the index of the dependent variable for the relevant data set.
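As an illustration, selecting columns with these constants might look like the sketch below. The constant names here are placeholders; use the constants actually defined in your output file.

from model import read_file

# Placeholder names: the real constants are defined in your output_X.py file.
PREDICTOR_IDXS = [0, 1, 2, 3, 4, 5, 6]
DEPENDENT_IDX = 7

col_names, data = read_file("data/city/training.csv")
X = data[:, PREDICTOR_IDXS]   # two-dimensional array of predictor columns
y = data[:, DEPENDENT_IDX]    # one-dimensional array of the dependent variable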

Also, do not read the data from any given file more than once.

Submission requirements

Each task will list submission requirements. Code to generate the required output text for the City dataset should be added to output_city.py. When we test your code, we will run the command:

$ python3 output_city.py

We will then manually inspect the output to the terminal. Submissions that do not include the necessary code in output_city.py will lose substantial credit.

For your second dataset, you will repeat this process, updating the filenames accordingly. If you choose the Stock data as your second data set, you should modify the output_stock.py file to generate the required output. If you choose the Election data as your second data set, you should modify the output_election.py file to generate the required output. Leave the output file for the dataset that you did not choose unaltered.

Here are some general rules for the required output:

  • All output must be labeled with the task number and should be printed using print.
  • When a task requires you to print the predictors used in a model, output the column labels (not the column indices).

Getting started

We have seeded your repository with a directory for this assignment. To pick it up, change to your cs121-aut-16-username directory (where the string username should be replaced with your username) and then run the command: git pull upstream master. You should also run git pull to make sure your local copy of your repository is in sync with the server.

The pa5 directory contains the following files:

model.py: Python file that contains code for reading the data from a file, computing linear regressions, and applying them to data. You should also add general-purpose functions to this file. This file should in no way be shaped by your dataset choices.
output_city.py: Python file where you will put your code to compute the required output for the City dataset. This file should rely heavily on the general functions that you wrote in model.py.
output_stock.py/output_election.py:
 Python files for the second dataset. You only need to modify one of these two files.

The pa5 directory contains a data directory which, in turn, contains three directories: city, election, stock. These directories correspond to the three datasets we provide, and each of them contains the following files:

training.csv: Data that will be used to train your models. This data will be used in Tasks 1-4.
testing.csv: Data that will be used to evaluate your models. This data will be used only in Task 4. The exact locations of the CSV files are given under the links specific to your datasets.

As noted above, we will only run your two output files and expect the correct output. This means that you have free rein in how you structure your code in terms of functions, data structures, etc. Think carefully about these things, since they are important aspects that will largely determine the quality of your code. We suggest that you create functions that are as general as possible and break them up into several smaller functions if they get too big. We ask you to do each task on two datasets so that you will appreciate the benefits of general and reusable functions. These functions should be placed in model.py and leveraged from your output files in such a way that as little code as possible is repeated between your output files.

Task 1a: Model evaluation using Coefficient of Determination, R2

Blindly regressing on all available predictors is bad practice and can lead to over-fit models. Over-fitting happens when your model picks up on random structure in the training data that does not represent the underlying model. This situation is made even worse if some of your predictors are highly correlated (see multicollinearity), since it can ruin the interpretation of your coefficients. As an example, consider a study that looks at the effects of smoking and drinking on developing lung cancer. Smoking and drinking are highly correlated, so the regression can arbitrarily pick up either as the explanation for high prevalence of lung cancer. As a result, drinking could be assigned a high coefficient while smoking a low one, leading to the questionable conclusion that drinking causes lung cancer while smoking does not.

This example illustrates the importance of variable selection. In order to compare two different models (subsets of predictors) against each other, we need a measurement of their goodness of fit, i.e. how well they explain the data. A popular choice is the coefficient of determination, R2, which first considers the variance of the dependent variable, \(\mathrm{Var}(y)\):

In this example we revisit the regression of the total number of crimes, y, on the number of calls about garbage to 311, x. To the right, a histogram over y is shown and its variance is calculated. The variance is a measure of how spread out the data is and is estimated from samples as

\[\mathrm{Var}(y) = \frac{1}{N}{\sum_{n=1}^N} (y_n - \bar y)^2,\]

where \(\bar y\) denotes the mean of all y's.

We now subtract the red line from each point. These new points are called the residuals and represent the deviations from the expected values under our model. If the variance of the residuals, \(\mathrm{Var}(y - \hat y),\) is zero, it means that all residuals had to be zero and thus all samples had to lie perfectly on the fitted line. On the other hand, if the variance of the residuals is the same as the variance of y, it means that the model did not explain anything away and we can consider the model useless. These two extremes represent R2 values of 1 and 0, respectively. All R2 values in between these two extremes will be cases where the variance was reduced some, but not entirely. Our example is one such case:

As we have hinted, the coefficient of determination, R2, measures how much the variance was reduced by subtracting the model. More specifically, it calculates this reduction as a percentage of the original variance:

\[R^2 = \frac{\mathrm{Var}(y) - \mathrm{Var}(y - \hat y)}{\mathrm{Var}(y)} = 1 - \frac{\mathrm{Var}(y - \hat y)}{\mathrm{Var}(y)},\]

which finally becomes

\[R^2 = 1 - \frac{\sum_{n=1}^N (y_n - \hat y_n)^2}{\sum_{n=1}^N (y_n - \bar y)^2}. \quad (1)\]

In these steps we omitted the normalization constants N since they cancel each other out. We also did not subtract the mean when calculating the variance of the residuals, since it can be shown mathematically that if the model has been fit by least squares, the sum of the residuals is always zero.

In the example above, this computation comes out to \(R^2 = 1 - \frac{0.73}{2.09} = 0.65.\)
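As a minimal sketch, formula (1) can be computed directly with Numpy (the function name below is our own choice, not something the assignment requires):

import numpy as np

def r_squared(y, yhat):
    # R^2 per equation (1): 1 minus the ratio of the residual sum of
    # squares to the total sum of squares.
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot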

There are two properties of R2 that are good to keep in mind when checking the correctness of your calculations:

  1. 0 ≤ R2 ≤ 1
  2. If model A contains a superset of predictors of model B, then model A cannot have a lower R2 value than B.

These properties are only true if R2 is applied to the same data that was used to train your model parameters. If you calculate R2 on a held-out testing set, the R2 value could decrease with more predictors (due to over-fitting) and R2 is no longer guaranteed to be greater than or equal to zero. The intuition behind R2 would also need to be reconsidered, but since the final formula in (1) remains the same we will omit the details here.

A good model should explain as much of the variance as possible, so we will be favoring higher R2 values. However, since using all predictors gives the highest R2, we must balance this goal with a desire to use few predictors. In general it is important to be cautious and not blindly interpret R2; take a look at these two generated examples:

To the left, we see that the fitted model is close to the one we used to generate the data. However, the variance of the error is still high, so the R2 is rather low. To the right is a model that clearly is not right for the data, but still manages to record a high R2 value. This example also shows that linear regression is not always the right model choice. However, in many cases, a simple transformation of the predictors or the dependent variable can resolve this problem, making the linear regression paradigm still relevant.

Note

In our calculations we used biased estimates of the variance, which means we divided by N both times, so the factors cancel. In practice, it is more common to use unbiased estimates, which divide by N - P - 1 when estimating the variance of the residuals (the numerator) and by N - 1 when estimating the variance of y (the denominator), where P is the number of predictors. The result is called the Adjusted R-squared and is commonly reported by statistical software. This breaks the second property of R2 listed above, which means that the value can decrease when adding additional predictors.
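For reference (you do not need to compute it in this assignment), the Adjusted R-squared is commonly written as

\[R^2_{\mathrm{adj}} = 1 - \frac{(1 - R^2)(N - 1)}{N - P - 1},\]

where \(R^2\) is the quantity from (1).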

Do not use the Numpy var method; compute the variances directly from the formula in (1) so that your results match ours.

Note

In the text, we often refer to “the fitted model” or “the model”. Think about what information is part of “a model”. It may be helpful, especially as you move on to the other tasks, to encapsulate all the information relevant to a model into a data structure, so that any function that works with “the model” only needs a single parameter to receive all the information about it. It may also be helpful to have a function (or method) that, given the data, the predictor variables, and the dependent variable, produces the data structure that encapsulates all the information about a model.
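One possible (not required) way to do this is a small dataclass that bundles the chosen columns, the fitted coefficients, and the training R2, built on the provided linear_regression and apply_beta functions:

from dataclasses import dataclass
import numpy as np
from model import linear_regression, apply_beta

def r_squared(y, yhat):
    # Same computation as equation (1) above.
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2)

@dataclass
class Model:
    pred_idxs: list        # column indices of the predictors used
    dep_idx: int           # column index of the dependent variable
    beta: np.ndarray       # fitted coefficients, including the intercept
    r2: float              # R^2 computed on the training data

def fit_model(data, pred_idxs, dep_idx):
    # Fit a model on the given columns and record its training R^2.
    X = data[:, pred_idxs]
    y = data[:, dep_idx]
    beta = linear_regression(X, y)
    return Model(pred_idxs, dep_idx, beta, r_squared(y, apply_beta(beta, X)))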

Submission requirement

  • Output a table of R2 values for how well each predictor variable separately predicts your dependent variable. Note that when we ask for the R2 value for a model, we mean the value obtained by computing R2 using \(\beta\) from the model on the data that was used to train the model.

Task 1b: Evaluating multiple-variable models using R2 values

From this table, you will see that none of the variables individually are particularly good predictors for the dependent variable. In subsequent tasks, we will construct better models using multiple variables. For example, we could predict crime using not only complaints about garbage, but graffiti as well. In this example, we want to find \(\beta = (\beta_0, \beta_\mathrm{garbage}, \beta_\mathrm{graffiti})\) in the equation

\[f(\mathbf{x}) = \beta_0 + \beta_\mathrm{garbage} x_\mathrm{garbage} + \beta_\mathrm{graffiti} x_\mathrm{graffiti},\]

where \(f(\mathbf{x})\) is a prediction of the number of crimes given the number of complaint calls about garbage \((x_\mathrm{garbage})\) and graffiti \((x_\mathrm{graffiti}).\)

In terms of the R2 value, this new equation only changes how each prediction \(\hat y_n\) is calculated. At this point, you should make sure that your code to calculate a model’s R2 value is general enough to handle multivariate models of any K.

Submission requirement

  • Output the R2 value for a model that uses all of the predictors for your dataset. According to the second property of R2, this value should be the largest R2 value you will ever see for this set of predictors and this data.

Task 2: Building and selecting bivariate models

In Task 1 we found that none of our single-variable models were particularly accurate (they all had low R2 values compared to the model that uses all of the predictors). In this and future tasks, we're going to build more complex models that use multiple variables as predictors, without using all of them and risking over-fitting.

For this task, you will test all possible bivariate models (K = 2) and determine the one with the highest R2 value. We suggest that you use two nested for-loops to iterate over all combinations of two predictors, calculate the R2 value for each combination, and keep track of the one with the highest R2 value.
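Here is a sketch of that search. It assumes the hypothetical fit_model helper and Model structure suggested in the note for Task 1a, plus placeholder names data (the loaded table), pred_idxs (the list of predictor column indices), and dep_idx (the dependent column index); adapt it to your own code organization.

# data, pred_idxs, and dep_idx are placeholders; fit_model is the
# hypothetical helper sketched earlier.
best = None
for i in range(len(pred_idxs)):
    for j in range(i + 1, len(pred_idxs)):
        candidate = fit_model(data, [pred_idxs[i], pred_idxs[j]], dep_idx)
        if best is None or candidate.r2 > best.r2:
            best = candidate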

How does your result compare to the single-variable models in Task 1? Which pair of variables perform best together? Does this result make sense given the results from Task 1?

Submission requirement

  • Output the names of the two predictors that yield the bivariate model with the highest R2 value, along with its R2 value.

Task 3a: Building models of arbitrary complexity

How do we determine how many and which variables will generate the best model? It turns out that this question is unsolved and is of interest to both computer scientists and statisticians. To find the best model of three variables we could repeat the process in Task 2 with three nested for-loops instead of two. For K variables we would have to use K nested for-loops. Nesting for-loops quickly becomes computationally expensive and infeasible for even the most powerful computers. Keep in mind that models in genetics research easily reach into the thousands of variables. How then do we find optimal models of several variables?

We’ll approach this problem by using heuristics, which will give us an approximate (as opposed to an exact) result. We are going to split this task into two pieces: (1) determine the best K variables to use for a fixed K and (2) determine the best value for K.

How do we determine which K variables will yield the best model for a fixed K? As noted, the naive approach of testing all possible combinations, as in Task 2, is intractable for large K. A simpler alternative is to choose the K variables with the K highest R2 values from the table in Task 1, but this approach does not work well if your predictors are correlated. Instead, you will start with a set that contains the variable with the highest R2 value and then repeatedly add variables until the set contains K variables. At each step, we identify the unused variable that, when added to the set of variables in the model, yields the model with the best R2 value, and we add it to the set. This algorithm is called Forward selection and is an example of a greedy algorithm.
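A sketch of forward selection, under the same assumptions as the earlier sketches (fit_model is the hypothetical helper that fits a model on the given columns and records its training R2):

def forward_select(data, pred_idxs, dep_idx, k):
    # Greedily grow the set of chosen predictors: at each step, add the
    # unused predictor whose addition gives the highest training R^2.
    chosen = []
    best = None
    while len(chosen) < k:
        best = None
        for idx in pred_idxs:
            if idx in chosen:
                continue
            candidate = fit_model(data, chosen + [idx], dep_idx)
            if best is None or candidate.r2 > best.r2:
                best = candidate
        chosen = best.pred_idxs
    return best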

Submission requirement

  • Let N be the number of predictor variables in your data set. Output a table that lists K, the predictor variables in the best K-variable model chosen by the forward selection algorithm, and R2 for this model for \(1 \leq K \leq N\).

Task 3b: Selecting the best K

The table from Task 3a will show that at some point increasing K yields diminishing returns. Since we want to avoid over-fitting, we will choose among the different models using a simple approach: stop increasing K once the improvement in R2 between the model with K variables and the model with K-1 variables is strictly less than a specified threshold, and choose the model with K-1 variables.

We suggest creating a function that can be used to find the best model for any given value of this threshold.
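One possible sketch of such a function, built on the forward_select sketch above (it refits from scratch for each K, which is simple but not the most efficient approach):

def select_best_k(data, pred_idxs, dep_idx, threshold):
    # Increase K until the R^2 improvement over the (K-1)-variable model
    # is strictly less than threshold, then return the (K-1)-variable model.
    prev = forward_select(data, pred_idxs, dep_idx, 1)
    for k in range(2, len(pred_idxs) + 1):
        cur = forward_select(data, pred_idxs, dep_idx, k)
        if cur.r2 - prev.r2 < threshold:
            return prev
        prev = cur
    return prev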

Submission requirements

  • Run this function with a threshold of 0.1 and output the returned model’s R2 value and the names of the selected predictors.
  • Run this function with a threshold of 0.01 and output the returned model’s R2 value and the names of the selected predictors.

Task 4: Training vs. test data

Please read this section carefully

Until now, you have evaluated a model using the data that was used to train it. The resulting model may be quite good for that particular dataset, but it may not be particularly good at predicting novel data. This is the problem that we have been referring to as over-fitting. The file testing.csv contains a random selection of samples that were held aside and not included in the training set.

Up to this point, we have only computed R2 values for the training data that was used to construct the model. It is also valid to compute an R2 value for a model when applied to other data. For this task, you will construct models using the training data and then evaluate each model on the testing data. You will receive *zero credit* for this task if you train new models using the testing data. In particular, the models you will choose for evaluation on the held-out testing data in this task will be the models you found in Task 3a.
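As a sketch of the intended workflow (pred_idxs and dep_idx are placeholders for your column-index constants, and r_squared is the function sketched in Task 1a): fit \(\beta\) on the training data only, then compute R2 on the testing data.

from model import read_file, linear_regression, apply_beta

train_names, train = read_file("data/city/training.csv")
test_names, test = read_file("data/city/testing.csv")

# pred_idxs and dep_idx are placeholders for the column-index constants.
beta = linear_regression(train[:, pred_idxs], train[:, dep_idx])  # fit on training data only
yhat = apply_beta(beta, test[:, pred_idxs])                       # predict the testing samples
test_r2 = r_squared(test[:, dep_idx], yhat)                       # evaluate on the testing data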

Submission requirements:

  • Let N be the number of predictor variables in your data set. Output a table that lists K, the predictor variables in the best K-variable model chosen by the forward selection algorithm using the training data, and the R2 value obtained by applying that model to the testing data, for \(1 \leq K \leq N\).

Submission

To submit your assignment, make sure that you have:

  • put your name at the top of your files,
  • registered for the assignment using chisubmit,
  • added, committed, and pushed your code to the git server, and
  • run the chisubmit submission command.
chisubmit student assignment register pa5

git add model.py
git add output_*.py

git commit -m "final version ready for submission"
git push

chisubmit student assignment submit pa5

Acknowledgments: This assignment has been worked on by many people over the years including Matthew Rocklin and Gustav Larsson.