CS151 Labs

Lab 4 will be collected from your Subversion repository on Monday, October 24, at 5pm.

Work in the file lab4/lab4.rkt. It is important that this file has exactly this name and location; we will be looking for your work under this name at collection time.

Today's exercise concerns linear regression. The problem is well-suited to being decomposed into a set of related functions. The lab exercise description does not specify too closely which individual functions you should write; this is to give you practice with program design.

At the beginning of lab your TAs will explain the basic concepts of linear regression, the notation used in this write-up, and how to use a spreadsheet application to generate results against which you can test.


Modeling Datasets

You will need three data definitions in order to do this exercise: datum, dataset and lineq (for "linear equation"). A datum will consist simply of an x and a y. (Why not just use posn? Semantics!) A dataset will consist of a nonempty list of datums. A lineq will consist of a slope and a y-intercept (m and b from the familiar linear equation form y = mx + b).
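The three data definitions described above might be sketched as follows. This is a minimal sketch, assuming an HtDP student language in DrRacket; the field names x, y, m and b follow the text, while example-data is a hypothetical name introduced only for illustration.

```racket
;; A datum is a (make-datum x y), where x and y are nums.
(define-struct datum (x y))

;; A dataset is a nonempty list of datums, that is, either
;; - (cons datum empty), or
;; - (cons datum dataset)

;; A lineq is a (make-lineq m b), where
;; - m is the slope (a num), and
;; - b is the y-intercept (a num)
(define-struct lineq (m b))

;; example-data is a hypothetical sample dataset (illustrative values only);
;; its three points lie exactly on the line y = 2x + 1.
(define example-data
  (list (make-datum 0 1) (make-datum 1 3) (make-datum 2 5)))
```

A dataset is just a list, so no struct is needed for it; the comment records the nonempty constraint that your functions may rely on.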

A linear regression analysis is applied to a set of data; its result is the line that mathematically best fits the data as it would appear on a two-dimensional plot. Operationally, with linear regression analysis, you put a dataset in and get a linear equation out. Your goal is to write a function linreg whose contract is

;; linreg: dataset -> lineq

Linear Regression

In the following discussion, assume that n is the number of datums in the given dataset.

The best-fit slope for the linear model of a dataset is given by
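One standard way to write the least-squares slope, with n as defined above and the sums taken over all datums (x_i, y_i) in the dataset, is sketched here in LaTeX notation. Other algebraically equivalent forms exist, so treat the exact notation as an assumption.

```latex
m = \frac{n \sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}
         {n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}
```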

The y-intercept for the linear model of a dataset is given by
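In the same least-squares notation, where m is the best-fit slope and the sums run over all n datums (x_i, y_i), the intercept can be written as below. As with the slope, this is one standard form among equivalent ones.

```latex
b = \frac{\sum_{i=1}^{n} y_i - m \sum_{i=1}^{n} x_i}{n}
```

As a check by hand: for the three points (0, 1), (1, 3), (2, 5) we have n = 3, Σx = 3, Σy = 9, Σxy = 13 and Σx² = 5, so the least-squares slope is m = (3·13 − 3·9) / (3·5 − 3²) = 12/6 = 2, and b = (9 − 2·3) / 3 = 1. The data lie exactly on y = 2x + 1.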

The two formulas above can be expressed as functions, with the following contracts:

;; slope : dataset -> num

;; intercept : dataset -> num

Your design will inevitably involve helper functions. The helper functions you will need to write in order to get linreg to work will mostly be structurally recursive functions that follow the How to Design Programs template closely.
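To illustrate what a structurally recursive helper over a nonempty list looks like, here is a minimal sketch. The name sum-xs is hypothetical, and the code assumes an HtDP student language and a datum struct with fields x and y; your own decomposition may well differ.

```racket
;; A datum is a (make-datum x y), where x and y are nums.
(define-struct datum (x y))

;; sum-xs : dataset -> num
;; add up the x components of every datum in the dataset
;; (hypothetical helper; the base case is a one-element list,
;; because a dataset is never empty)
(define (sum-xs ds)
  (cond
    [(empty? (rest ds)) (datum-x (first ds))]
    [else (+ (datum-x (first ds))
             (sum-xs (rest ds)))]))

(check-expect (sum-xs (list (make-datum 4 9))) 4)
(check-expect (sum-xs (list (make-datum 0 1) (make-datum 1 3) (make-datum 2 5))) 3)
```

Note that the template for a nonempty list tests (empty? (rest ds)) rather than (empty? ds), since the empty list is not a valid dataset.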

When you have finished linreg, you have finished the lab. Commit as usual.

Hints


Extra Credit: The Linear Correlation Coefficient

A full linear regression analysis includes the computation of a linear correlation coefficient. This coefficient is usually referred to as r and is a measure, roughly speaking, of "how linear the data is." Typically a linear regression analysis provides the value r², which lies in the interval [0,1]. The closer r² is to 1, the stronger the correlation between the data and the line modeling the data. An r² close to 0 means that the analysis has calculated a linear equation that bears no meaningful relationship to the data at hand.

An analysis will be a struct consisting of two parts: a lineq and a num, the latter being the value of r². The optional challenge is to write a full linear regression analysis with the contract

;; full-linreg: dataset -> analysis
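The analysis struct could be sketched as below, assuming an HtDP student language. The field names line and r-squared are hypothetical, chosen only for illustration; the text fixes only the two parts (a lineq and a num).

```racket
;; A lineq is a (make-lineq m b), where m and b are nums.
(define-struct lineq (m b))

;; An analysis is a (make-analysis line r-squared), where
;; - line is a lineq, and
;; - r-squared is a num in the interval [0,1]
;; (field names are illustrative assumptions)
(define-struct analysis (line r-squared))

;; Data lying exactly on y = 2x + 1 would yield this analysis.
(define perfect-fit (make-analysis (make-lineq 2 1) 1))
```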

The linear correlation coefficient r is given by the following:
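A standard form of the correlation coefficient written with means and standard deviations is sketched here in LaTeX notation. It assumes the population (divide-by-n) standard deviation; with a different convention the expression would need adjusting.

```latex
r = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\sigma_x \, \sigma_y}
```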

An overbar indicates the arithmetic mean of the term underneath. A lowercase sigma (σ) indicates the standard deviation (as defined below) of the subscript term.
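Assuming the population form of the standard deviation (dividing by n, the form consistent with expressing r through arithmetic means), the definition for the x values can be written as below; the definition for the y values is analogous.

```latex
\sigma_x = \sqrt{\overline{x^2} - \bar{x}^{\,2}}
         = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}
```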

Note: the formula above gives r; the full analysis returns r².

Do not skip ahead to extra credit. We will award extra credit only if we see a working version of full-linreg and you have also earned full, or very nearly full, credit on the main part of the exercise (linreg).


Acknowledgements

This programming exercise grew out of similar exercises developed in collaboration with faculty (of whom I was one) at the Dwight-Englewood School in Englewood, NJ.