Linear Regression¶
Due: Friday, Nov 11 at 4pm
In this assignment, you will fit linear regression models and implement simple variable selection algorithms. The assignment will give you experience with Numpy and more practice with using functions to support code reuse. You must work alone on this assignment.
At the heart of the assignment is a table of data, where each column is a variable and each row is a sample. As an example, in a health study, each sample might be a person, with variables like height, weight, sex, etc. In your analysis, you will build models that, with varying levels of accuracy, can predict the value of one of the variables as a function of the others.
Predictions are only possible if variables are related somehow. As an example, look at this plot of recorded crimes against logged complaint calls about garbage to 311.
Each point describes a sample, which in this example represents a geographical region of Chicago. Each region is associated with variables, such as the number of crimes or complaint calls during a fixed time frame. Given this plot, if you were asked how many crimes you think were recorded for a region that had 150 complaint calls about garbage, you would follow the general trend and probably say something like 3000 recorded crimes. To formalize this prediction, we need a model for the data that relates a dependent variable (e.g. crimes) to a set of predictor variables (e.g. complaint calls). We will assume a linear dependence as follows

\[y_n = \beta_0 + \beta_1 x_{n1} + \dots + \beta_K x_{nK} + \varepsilon_n\]

or, using vector notation,

\[y_n = \mathbf{x}_n^T \beta + \varepsilon_n\]
We have introduced the following notation:
- \(n\): the sample that we are considering out of \(N\) total.
- \(y_n\): an observation of the dependent variable, e.g. the total number of crimes.
- \(x_{nk}\): an observation of predictor variable \(k\) for sample \(n,\) e.g. the complaint calls about garbage. We assume a total of \(K\) different predictors.
- \(\beta = (\beta_0, \beta_1, \dots, \beta_K)^T\): a column vector of the regression coefficients, where \(\beta_0\) is the intercept and \(\beta_k\) is the coefficient associated with the \(k\mathrm{th}\) predictor. This vector describes the red line in the figure above. Note that a positive coefficient suggests a positive correlation with the dependent variable, and likewise a negative coefficient suggests a negative correlation.
- \(\varepsilon_n\): the error term. Not all sample points lie on the red line, which means the equation \(y_n = \mathbf{x}_n^T \beta\) is rarely satisfied exactly. In our model, this discrepancy is assumed to be caused by random error. We assume that the random error is zero-mean and drawn from a normal distribution with fixed variance. We want as much as possible to be explained deterministically, which means we will favor values of \(\beta\) that result in small errors.
- \(\mathbf{x}_n = (1, x_{n1}, \dots, x_{nK})^T\): a column vector of all the predictors for a given sample. As a convenience, a 1 has been prepended to the vector so as to incorporate the intercept \(\beta_0\) into the vector notation.
When making predictions, what we are really asking is what the expected value of the dependent variable is for a given set of predictors. Since we have assumed the errors have zero mean, to make a prediction we simply drop the error term:

\[\hat y_n = \mathbf{x}_n^T \beta\]

This equation can be written for several samples at the same time as

\[\mathbf{\hat y} = \mathbf{X} \beta\]

where \(\mathbf{\hat y}\) is a column vector of predictions and \(\mathbf{X}\) is a \(N \times (K + 1)\) matrix where each row is one sample's predictor vector \(\mathbf{x}_n^T\) (including the prepended 1).
Model fitting¶
There are many possible candidates for \(\beta,\) some that fit the data better than others. Finding the best value of \(\beta\) is referred to as fitting the model. To simplify your job, we have provided code for fitting a linear model. The function linear_regression finds a value of \(\beta\) that minimizes \(\mathbf{\hat y} - \mathbf{y}\) in a least squares sense. That is, it finds values for \(\beta\) such that the predicted values \(\mathbf{\hat y} = \mathbf{X} \beta\) are as close to the observed values as possible (in a statistically motivated way, using maximum likelihood). This function accepts a two-dimensional numpy array of floats containing the predictor variables (X) and a one-dimensional numpy array of floats (y) containing the dependent variable, and returns \(\beta.\) Note that X does not correspond exactly to \(\mathbf{X},\) since it should not have the column of ones (we will add that for you).
We have also provided a function, apply_beta(beta, X), for making predictions using the model. It takes beta, the array generated by our linear regression function, and X, a two-dimensional numpy array of floats that contains values for the predictor variables, applies the function described by beta to each row in X, and returns a one-dimensional numpy array of the resulting values.
Here are sample calls to linear_regression and apply_beta:
>>> import numpy as np
>>> from model import *
>>> X = np.array([[5,2], [3,2], [6,2.1], [7, 3]]) # predictors
>>> y = np.array([5,2,6,6]) # dependent
>>> beta = linear_regression(X, y)
>>> beta
array([ 1.20104895, 1.41083916, -1.6958042 ]) # f(x) = 1.2 + 1.4*x1 - 1.7*x2
>>> apply_beta(beta, X)
array([ 4.86363636, 2.04195804, 6.1048951 , 5.98951049]) # predictions
Both of these functions require X to be two-dimensional, even in the case of K = 1. Numpy makes a distinction between a one-dimensional and a two-dimensional array, even if they contain the same information. For instance, if we want to extract the second column of X as a one-dimensional array, we do the following:
>>> X
array([[ 5. , 2. ],
[ 3. , 2. ],
[ 6. , 2.1],
[ 7. , 3. ]])
>>> X[:,1]
array([ 2. , 2. , 2.1, 3. ])
>>> X[:,1].shape
(4,)
However, this shape will not be accepted by linear_regression and apply_beta. To retrieve a column subset of X, you can use a list of integers to index the columns instead. This mechanism also has the benefit of keeping X two-dimensional, even if the list of indices has only one index:
>>> X[:,[1]]
array([[ 2. ],
[ 2. ],
[ 2.1],
[ 3. ]])
>>> X[:,[1]].shape
(4, 1)
In general, you can specify a slice containing a specific subset of columns as a list. For example, let Z be:
>>> Z = np.array([[1, 2, 3, 4], [11, 12, 13, 14], [21, 22, 23, 24]])
Evaluating the expression Z[:, [0,2,3]] will yield a new 2D array with columns 0, 2, and 3 from the array Z:
>>> Z[:, [0, 2, 3]]
array([[ 1, 3, 4],
[11, 13, 14],
[21, 23, 24]])
or, more generally, we can use an expression that yields a list as the slice:
>>> l = [0, 2, 3]
>>> Z[:, l]
array([[ 1, 3, 4],
[11, 13, 14],
[21, 23, 24]])
Data¶
In this assignment you will perform the tasks below on two datasets. The first is the city data from the example in the introduction:
- City: Predicting the crime rate from the number of complaint calls to 311.
We have included the output generated by our solution using this data on the city data page to help with testing.
For the second one, you have a choice between the following two datasets:
- Stock: Predicting a stock index from a set of stocks.
- Election: Predicting election results for counties based on population statistics.
More information about each dataset, including which columns to use, can be found by clicking the links. Regardless of which dataset you work on, you will get two CSV (comma-separated values) files: training.csv and testing.csv. You will use the first to train models and the second to evaluate how well your models predict outcomes on unseen testing data.
We have provided a function, read_file, in model.py that takes the name of a file and returns a pair containing the column labels as a list of strings and the table of data as a two-dimensional numpy array of floats. Here is a sample call:
col_names, data = read_file("data/city/training.csv")
To use the election data, replace city with election. To use the stock data, replace city with stock.
Looking at the datasets, you may be tempted to assume that the first N-1 columns of each dataset contain the independent variables and the last column contains the dependent variable. Do not make this assumption. Each of the output_X.py files includes definitions for constants that specify the indices (column numbers) of the independent variables and the index of the dependent variable for the relevant dataset.
Also, do not read the data from any given file more than once.
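As a loose illustration (the constant names below are hypothetical; use the ones actually defined in your output_X.py file), loading the city training data once and splitting it by column indices might look like this:
from model import read_file

# Hypothetical constant names: your output_X.py file defines the real ones.
PREDICTOR_VARS = [0, 1, 2, 3, 4, 5, 6]  # indices of the independent variables
DEPENDENT_VAR = 7                       # index of the dependent variable

col_names, data = read_file("data/city/training.csv")  # read the file once and reuse
X = data[:, PREDICTOR_VARS]  # two-dimensional array of predictors
y = data[:, DEPENDENT_VAR]   # one-dimensional array of the dependent variable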
Submission requirements¶
Each task will list submission requirements. Code to generate the required output text for the City dataset should be added to output_city.py. When we test your code, we will run the command:
$ python3 output_city.py
We will then manually inspect the output to the terminal. Submissions that do not include the necessary code in output_city.py will lose substantial credit.
For your second dataset, you will repeat this process, updating the filenames accordingly. If you choose the Stock data as your second dataset, you should modify the output_stock.py file to generate the required output. If you choose the Election data as your second dataset, you should modify the output_election.py file to generate the required output. Leave the output file for the dataset that you did not choose unaltered.
Here are some general rules for the required output:
- All output must be labeled with the task number and should be printed using print.
- Column labels (not the column indices) should be output when a task requires you to print out the predictors used in a model.
Getting started¶
We have seeded your repository with a directory for this assignment. To pick it up, change to your cs121-aut-16-username directory (where the string username should be replaced with your username) and then run the command: git pull upstream master. You should also run git pull to make sure your local copy of your repository is in sync with the server.
The pa5 directory contains the following files:
- model.py: Python file that contains code for reading the data from a file, computing linear regressions, and applying them to data. You should also add general-purpose functions to this file. This file should in no way be shaped by your dataset choices.
- output_city.py: Python file where you will put your code to compute the required output for the City dataset. This file should rely heavily on the general functions that you wrote in model.py.
- output_stock.py / output_election.py: Python files for the second dataset. You only need to modify one of these two files.
The pa5 directory contains a data directory which, in turn, contains three directories: city, election, and stock. These directories correspond to the three datasets we provide, and each of them contains the following files:
- training.csv: Data that will be used to train your models. This data will be used in Tasks 1-4.
- testing.csv: Data that will be used to evaluate your models. This data will be used only in Task 4.
The exact locations of the CSV files are given under the links specific to your datasets.
As noted, we will only run your two output files and expect the correct output. This means that you have free rein over how you structure your code in terms of functions, data structures, etc. Think carefully about these things, since they are important aspects that will largely determine the quality of your code. We suggest that you create functions that are as general as possible and break them up into several smaller functions if they get too big. We ask you to do each task on two datasets so that you will appreciate the benefits of general and reusable functions. These functions should be placed in model.py and leveraged from your output files in such a way that as little code as possible is repeated between the output files.
Task 1a: Model evaluation using Coefficient of Determination, R2¶
Blindly regressing on all available predictors is bad practice and can lead to over-fit models. Over-fitting happens when your model picks up on random structure in the training data that does not represent the underlying model. This situation is made even worse if some of your predictors are highly correlated (see multicollinearity), since it can ruin the interpretation of your coefficients. As an example, consider a study that looks at the effects of smoking and drinking on developing lung cancer. Smoking and drinking are highly correlated, so the regression can arbitrarily pick up either as the explanation for high prevalence of lung cancer. As a result, drinking could be assigned a high coefficient while smoking a low one, leading to the questionable conclusion that drinking causes lung cancer while smoking does not.
This example illustrates the importance of variable selection. In order to compare two different models (subsets of predictors) against each other, we need a measurement of their goodness of fit, i.e. how well they explain the data. A popular choice is the coefficient of determination, R2, which first considers the variance of the dependent variable, \(\mathrm{Var}(y)\).
In this example we revisit the regression of the total number of crimes, y, on the number of calls about garbage to 311, x. To the right, a histogram of y is shown and its variance is calculated. The variance is a measure of how spread out the data is, and is estimated from the samples as

\[\mathrm{Var}(y) = \frac{1}{N} \sum_{n=1}^{N} (y_n - \bar y)^2\]

where \(\bar y\) denotes the mean of all y‘s.
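As a quick illustration, this biased estimate is straightforward to compute directly with Numpy (see the note at the end of this task about the var method):
>>> y_bar = y.mean()
>>> variance = ((y - y_bar) ** 2).mean()  # divide by N, not N - 1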
We now subtract the red line from each point. These new points are called the residuals and represent the deviations from the expected values under our model. If the variance of the residuals, \(\mathrm{Var}(y - \hat y),\) is zero, it means that all residuals had to be zero and thus all samples had to lie perfectly on the fitted line. On the other hand, if the variance of the residuals is the same as the variance of y, it means that the model did not explain anything away and we can consider the model useless. These two extremes represent R2 values of 1 and 0, respectively. All R2 values in between these two extremes will be cases where the variance was reduced some, but not entirely. Our example is one such case:
As we have hinted, the coefficient of determination, R2, measures how much the variance was reduced by subtracting the model. More specifically, it calculates this reduction as a percentage of the original variance:

\[R^2 = \frac{\mathrm{Var}(y) - \mathrm{Var}(y - \hat y)}{\mathrm{Var}(y)} = 1 - \frac{\mathrm{Var}(y - \hat y)}{\mathrm{Var}(y)}\]

which finally becomes

\[R^2 = 1 - \frac{\sum_{n=1}^{N} (y_n - \hat y_n)^2}{\sum_{n=1}^{N} (y_n - \bar y)^2} \qquad (1)\]

In these steps we omitted the normalization constants N since they cancel each other out. We also did not subtract the mean when calculating the variance of the residuals, since it can be shown mathematically that if the model has been fit by least squares, the sum of the residuals is always zero.
In the example above, this computation comes out to \(R^2 = 1 - \frac{0.73}{2.09} = 0.65.\)
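A minimal sketch of this computation in code, assuming the linear_regression and apply_beta functions from model.py (the name r_squared and its interface are illustrative choices, not requirements):
from model import linear_regression, apply_beta

def r_squared(X, y):
    # Fit beta on (X, y) and evaluate the fit on the same training data.
    beta = linear_regression(X, y)
    yhat = apply_beta(beta, X)
    ss_res = ((y - yhat) ** 2).sum()      # residuals: no mean subtraction, per (1)
    ss_tot = ((y - y.mean()) ** 2).sum()  # total variation around the mean
    return 1 - ss_res / ss_tot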
There are two properties of R2 that are good to keep in mind when checking the correctness of your calculations:
- 0 ≤ R2 ≤ 1
- If model A contains a superset of predictors of model B, then model A cannot have a lower R2 value than B.
These properties are only true if R2 is applied to the same data that was used to train your model parameters. If you calculate R2 on a held-out testing set, the R2 value could decrease with more predictors (due to over-fitting) and R2 is no longer guaranteed to be greater than or equal to zero. The intuition behind R2 would also need to be reconsidered, but since the final formula in (1) remains the same we will omit the details here.
A good model should explain as much of the variance as possible, so we will be favoring higher R2 values. However, since using all predictors gives the highest R2, we must balance this goal with a desire to use few predictors. In general it is important to be cautious and not blindly interpret R2; take a look at these two generated examples:
To the left, we see that the fitted model is close to the one we used to generate the data. However, the variance of the error is still high, so the R2 is rather low. To the right is a model that clearly is not right for the data, but still manages to record a high R2 value. This example also shows that linear regression is not always the right model choice. However, in many cases, a simple transformation of the predictors or the dependent variable can resolve this problem, making the linear regression paradigm still relevant.
Note
In our calculations we used biased estimates of the variance, which means we divided by N both times, so the constants cancel. In practice, it is more common to use unbiased estimates, which results in N - 1 in the numerator and N - P - 1 in the denominator, where P is the number of predictors. The result is called the Adjusted R-squared and is commonly reported by statistical software. This breaks the second property of R2 listed above, which means that the value can decrease when adding additional predictors.
Do not use the Numpy var method to compute the residual term: var always subtracts the mean, while the residual sum in formula (1) does not, and the two only coincide when the residuals sum to zero (true on the training data, but not on the testing data in Task 4). Computing the sums in (1) directly will keep your answers correct. Note also that var with ddof=1 computes the unbiased variance, which is not what we use here.
Note
In the text, we often refer to “the fitted model” or “the model”. Think about what information is part of “a model”. It may be helpful, especially as you move on to the other tasks, to encapsulate all the information relevant to a model into a data structure, so that any function that works with “the model” can receive all of its information through a single parameter. It may also be helpful to have a function (or method) that, given the data, the predictor variables, and the dependent variable, produces the data structure that encapsulates all the information about a model.
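One possible encapsulation, purely as a sketch (the class and field names are illustrative, not required):
from model import linear_regression

class Model:
    def __init__(self, data, pred_cols, dep_col):
        # Keep everything needed to describe, print, and evaluate the model.
        self.pred_cols = pred_cols  # column indices of the predictors
        self.dep_col = dep_col      # column index of the dependent variable
        self.X = data[:, pred_cols]
        self.y = data[:, dep_col]
        self.beta = linear_regression(self.X, self.y)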
Submission requirement
- Output a table of R2 values for how well each predictor variable separately predicts your dependent variable. Note that when we ask for the R2 value for a model, we mean the value obtained by computing R2 using \(\beta\) from the model on the data that was used to train the model.
Task 1b: Evaluating multiple-variable models using R2 values¶
From this table, you will see that none of the variables individually are particularly good predictors for the dependent variable. In subsequent tasks, we will construct better models using multiple variables. For example, we could predict crime using not only complaints about garbage, but graffiti as well. In this example, we want to find \(\beta = (\beta_0, \beta_\mathrm{garbage}, \beta_\mathrm{graffiti})\) in the equation

\[f(\mathbf{x}) = \beta_0 + \beta_\mathrm{garbage} x_\mathrm{garbage} + \beta_\mathrm{graffiti} x_\mathrm{graffiti}\]

where \(f(\mathbf{x})\) is a prediction of the number of crimes given the number of complaint calls about garbage \((x_\mathrm{garbage})\) and graffiti \((x_\mathrm{graffiti}).\)
In terms of the R2 value, this new equation only changes how each prediction \(\hat y_n\) is calculated. At this point, you should make sure that your code to calculate a model’s R2 value is general enough to handle multivariate models of any K.
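Because the predictors are selected with a column list, a general implementation such as the hypothetical r_squared sketch above already handles any K, for example:
>>> r_squared(data[:, [0]], y)     # K = 1 (the column indices here are made up)
>>> r_squared(data[:, [0, 1]], y)  # K = 2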
Submission requirement
- Output the R2 value for a model that uses all of the predictors for your dataset. According to the second property of R2, this value should be the largest R2 value you will ever see for this set of predictors and this data.
Task 2: Building and selecting bivariate models¶
In Task 1 we found that none of our single variable models were particularly accurate (they all had low R2 values compared to when using all predictors). In this and future tasks we’re going to build more complex models that use multiple variables as predictors, without using all of them and risking over-fitting.
For this task, you will test all possible bivariate models (K = 2) and determine the one with the highest R2 value. We suggest that you use two nested for-loops to iterate over all combinations of two predictors, calculate the R2 value for each combination, and keep track of the one with the highest R2 value.
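A sketch of this search, again assuming the hypothetical r_squared helper and PREDICTOR_VARS constant from the earlier snippets:
best_r2, best_pair = -1.0, None
for i in range(len(PREDICTOR_VARS)):
    for j in range(i + 1, len(PREDICTOR_VARS)):  # each pair considered once
        pair = [PREDICTOR_VARS[i], PREDICTOR_VARS[j]]
        r2 = r_squared(data[:, pair], y)
        if r2 > best_r2:
            best_r2, best_pair = r2, pair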
How does your result compare to the single-variable models in Task 1? Which pair of variables perform best together? Does this result make sense given the results from Task 1?
Submission requirement
- Output the names of the two predictors that yield the bivariate model with the highest R2 value, along with its R2 value.
Task 3a: Building models of arbitrary complexity¶
How do we determine how many and which variables will generate the best model? It turns out that this question is unsolved and is of interest to both computer scientists and statisticians. To find the best model of three variables we could repeat the process in Task 2 with three nested for-loops instead of two. For K variables we would have to use K nested for-loops. Nesting for-loops quickly becomes computationally expensive and infeasible for even the most powerful computers. Keep in mind that models in genetics research easily reach into the thousands of variables. How then do we find optimal models of several variables?
We’ll approach this problem by using heuristics, which will give us an approximate (as opposed to an exact) result. We are going to split this task into two pieces: (1) determine the best K variables to use for a fixed K and (2) determine the best value for K.
How do we determine which K variables will yield the best model for a fixed K? As noted, the naive approach of testing all possible combinations, as in Task 2, is intractable for large K. A simpler alternative is to choose the K variables with the K highest R2 values from the table in Task 1, but this approach does not work well if your predictors are correlated. Instead, you will start with a set that contains the variable with the highest R2 value and then repeatedly add variables until the set contains K variables. At each step, we identify the unused variable that, when added to the set of variables in the model, yields the model with the best R2 value, and we add it to the set. This algorithm is called forward selection and is an example of a greedy algorithm.
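Here is a minimal sketch of forward selection, under the same assumptions as the earlier snippets (r_squared is our hypothetical helper, not a provided function):
def forward_select(data, pred_cols, y, k):
    # Greedily grow the set of chosen predictors one variable at a time.
    chosen = []
    while len(chosen) < k:
        best_r2, best_col = -1.0, None
        for col in pred_cols:
            if col in chosen:
                continue
            r2 = r_squared(data[:, chosen + [col]], y)
            if r2 > best_r2:
                best_r2, best_col = r2, col
        chosen.append(best_col)  # the first pass picks the best single variable
    return chosen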
Submission requirement
- Let N be the number of predictor variables in your data set. Output a table that lists K, the predictor variables in the best K-variable model chosen by the forward selection algorithm, and R2 for this model for \(1 \leq K \leq N\).
Task 3b: Selecting the best K¶
The table from Task 3a will show that at some point increasing K yields diminishing returns. Since we want to avoid over-fitting, we will choose among the different models using a simple approach: stop increasing K, once the improvement in R2 between the model with K variables and the model with K-1 variables is strictly less than a specified threshold. Choose the model with K-1 variables.
We suggest creating a function that can be used to find the best model for any given value of this threshold.
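A sketch of such a function, reusing the hypothetical forward_select and r_squared helpers above (how ties and the K = 1 edge case are handled is left to your own design):
def select_by_threshold(data, pred_cols, y, threshold):
    # Increase K until the gain in R^2 drops below the threshold,
    # then return the previous (K - 1 variable) set of predictors.
    cols, prev_r2 = [], None
    for k in range(1, len(pred_cols) + 1):
        cols = forward_select(data, pred_cols, y, k)
        r2 = r_squared(data[:, cols], y)
        if prev_r2 is not None and r2 - prev_r2 < threshold:
            return cols[:-1]  # greedy selection nests, so this is the K - 1 model
        prev_r2 = r2
    return cols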
Submission requirements
- Run this function with a threshold of 0.1 and output the returned model’s R2 value and the names of the selected predictors.
- Run this function with a threshold of 0.01 and output the returned model’s R2 value and the names of the selected predictors.
Task 4: Training vs. test data¶
Please read this section carefully
Until now, you have evaluated a model using the data that was used to train it. The resulting model may be quite good for that particular dataset, but it may not be particularly good at predicting novel data. This is the problem that we have been referring to as over-fitting. The file testing.csv contains a random selection of samples that were held aside and not included in the training set.
Up to this point, we have only computed R2 values for the training data that was used to construct the model. It is also valid to compute an R2 value for a model when applied to other data. For this task, you will construct models using the training data and then evaluate each model on the testing data. You will receive *zero credit* for this task if you train new models using the testing data. In particular, the models you will choose for evaluation on the held-out testing data in this task will be the models you found in Task 3a.
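A sketch of this evaluation, under the same assumptions as before, where X_train/y_train and X_test/y_test are the chosen column subsets loaded from training.csv and testing.csv respectively. Note that beta is fit on the training data only:
from model import linear_regression, apply_beta

def r_squared_on(beta, X_test, y_test):
    # Evaluate a previously fit beta on data it has never seen.
    yhat = apply_beta(beta, X_test)
    ss_res = ((y_test - yhat) ** 2).sum()
    ss_tot = ((y_test - y_test.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

beta = linear_regression(X_train, y_train)  # trained on training data only
r2_test = r_squared_on(beta, X_test, y_test)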
Submission requirements:
- Let N be the number of predictor variables in your dataset. Output a table that lists K, the predictor variables in the best K-variable model chosen by the forward selection algorithm using the training data, and the R2 value obtained by applying that model to the testing data, for \(1 \leq K \leq N\).
Submission¶
To submit your assignment, make sure that you have:
- put your name at the top of your files,
- registered for the assignment using chisubmit,
- added, committed, and pushed your code to the git server, and
- run the chisubmit submission command.
chisubmit student assignment register pa5
git add model.py
git add output_*.py
git commit -m "final version ready for submission"
git push
chisubmit student assignment submit pa5
Acknowledgments: This assignment has been worked on by many people over the years including Matthew Rocklin and Gustav Larsson.