Due: Tuesday, Oct 26, end of your lab session

Lab 5: You're Regressing

Today's exercise concerns linear regression. The problem is well-suited to being decomposed into a set of related functions, as you will discover yourself. The relevant concepts will be discussed briefly at the beginning of lab, along with how to perform linear regression in Microsoft Excel, which you can use to check your results.

Preliminaries

Modeling Datasets

You will need three data definitions in order to do this exercise: datum, dataset and lineq (for "linear equation"). A datum will consist simply of an x and a y. A dataset will consist of a nonempty list of datums. A lineq will consist of a slope and a y-intercept (m and b from the familiar linear equation form y = mx + b).
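As a rough sketch of what these three definitions might look like, assuming the Beginning Student with List Abbreviations language: the field names x, y, m and b follow the description above, while the example constants D1, DS1 and LINE1 are made up for illustration.

```racket
;; A datum is a structure: (make-datum x y)
;; where x and y are numbers.
(define-struct datum (x y))

;; A dataset is a nonempty list of datums; it is one of:
;; - (cons datum empty)
;; - (cons datum dataset)

;; A lineq is a structure: (make-lineq m b)
;; where m (the slope) and b (the y-intercept) are numbers.
(define-struct lineq (m b))

;; Example values (illustrative only)
(define D1 (make-datum 1 3))
(define DS1 (list (make-datum 1 3) (make-datum 2 5)))
(define LINE1 (make-lineq 2 1)) ; the line y = 2x + 1

(check-expect (datum-y D1) 3)
(check-expect (lineq-m LINE1) 2)
```

Because a dataset is never empty, the base case of its definition is a one-element list rather than empty.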

A linear regression analysis is applied to a set of data; its result is the line that mathematically best fits the data as it would appear on a two-dimensional plot. Operationally, with linear regression analysis, you put a dataset in and get a linear equation out. Your goal is to provide a function linreg whose contract is

;; linreg : dataset -> lineq

Linear Regression

In the following discussion, assume that n is the number of datums in the given dataset.

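The best-fit slope for the linear model of a set of data is given by

$$m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$$

The y-intercept for the linear model of a set of data is given by

$$b = \frac{\sum y_i - m \sum x_i}{n}$$

where each sum runs over all n datums (x_i, y_i) in the dataset.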

The two formulas above can each be abstracted as functions, with the following contracts:

;; slope : dataset -> num
;; intercept : dataset -> num


Part 1

A.

Write data definitions for datum, dataset and lineq. Recall that:

B.

Review the mathematical formulas for the best-fit slope and y-intercept. Decide what helper functions you will need in order to implement the final slope and intercept functions. The helper functions you write to get linreg to work will mostly be structurally recursive functions that follow the How to Design Programs template closely. Be sure to accompany every function with a contract, a purpose statement and tests.

[Hint: All your helper functions should take a dataset as input, not a list of x or y values. For example, you will need a function to sum all x values, with the following contract.]
;; sumx : dataset -> num
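A minimal sketch of how sumx might follow the structural template for a nonempty list, assuming a datum is defined with (define-struct datum (x y)) and the language is Beginning Student with List Abbreviations. The parameter name ds is arbitrary.

```racket
(define-struct datum (x y))

;; sumx : dataset -> num
;; to compute the sum of the x values of all datums in ds
(define (sumx ds)
  (cond
    ;; base case: a dataset is nonempty, so the smallest one has one datum
    [(empty? (rest ds)) (datum-x (first ds))]
    ;; recursive case: add the first x to the sum over the rest
    [else (+ (datum-x (first ds)) (sumx (rest ds)))]))

(check-expect (sumx (list (make-datum 4 10))) 4)
(check-expect (sumx (list (make-datum 1 3) (make-datum 2 5) (make-datum 3 7))) 6)
```

The other sums that the formulas call for can be written the same way, changing only what is taken from each datum.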

[Note: Part 1 (data definitions and definitions for all your helper functions) should be finished by the end of your lab session. You should also get started on Part 2. Turn in all your work at the end of the lab session according to these submission instructions.]


Part 2

[Note: For full credit, you must start working on Part 2 of the lab. You are not required to finish by the end of the lab session, but you should submit all your work. You will have a chance to finish Part 2 on this week's homework.]

Write functions slope, intercept and linreg.
;; slope : dataset -> num
;; intercept : dataset -> num
;; linreg : dataset -> lineq

For testing, we will use a small data set of 17 observations of the boiling temperature of water (measured in degrees Fahrenheit; the y values) at different barometric pressures (measured in inches of mercury; the x values).

[Source: Hand, D. J., et al. A Handbook of Small Data Sets. London: Chapman and Hall, 1994.]

(list
  (make-datum 20.79 194.5)
  (make-datum 20.79 194.3)
  (make-datum 22.4 197.9)
  (make-datum 22.67 198.4)
  (make-datum 23.15 199.4)
  (make-datum 23.35 199.9)
  (make-datum 23.89 200.9)
  (make-datum 23.99 201.1)
  (make-datum 24.02 201.4)
  (make-datum 24.01 201.3)
  (make-datum 25.14 203.6)
  (make-datum 26.57 204.6)
  (make-datum 28.49 209.5)
  (make-datum 27.76 208.6)
  (make-datum 29.04 210.7)
  (make-datum 29.88 211.9)
  (make-datum 30.06 212.2))

Test your functions on our sample data set and compare the results to those obtained using Excel. Download the dataset in Excel format from here. You will not need to submit the Excel file. Make sure each of your functions includes a contract, a purpose statement and tests (using check-within).
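One way Part 2 could come together is sketched below. It assumes Beginning Student with List Abbreviations, the standard least-squares formulas written in terms of sums, and field names m and b for lineq. The helper names sumy, sumxy and sumxx and the constants ON-LINE and OFF-LINE are illustrative, not prescribed by the lab. The tests use small datasets whose answers can be worked out by hand.

```racket
(define-struct datum (x y))
(define-struct lineq (m b))

;; sumx : dataset -> num
;; to sum the x values in ds
(define (sumx ds)
  (cond [(empty? (rest ds)) (datum-x (first ds))]
        [else (+ (datum-x (first ds)) (sumx (rest ds)))]))

;; sumy : dataset -> num
;; to sum the y values in ds
(define (sumy ds)
  (cond [(empty? (rest ds)) (datum-y (first ds))]
        [else (+ (datum-y (first ds)) (sumy (rest ds)))]))

;; sumxy : dataset -> num
;; to sum the products x*y over ds
(define (sumxy ds)
  (cond [(empty? (rest ds)) (* (datum-x (first ds)) (datum-y (first ds)))]
        [else (+ (* (datum-x (first ds)) (datum-y (first ds)))
                 (sumxy (rest ds)))]))

;; sumxx : dataset -> num
;; to sum the squares of the x values in ds
(define (sumxx ds)
  (cond [(empty? (rest ds)) (sqr (datum-x (first ds)))]
        [else (+ (sqr (datum-x (first ds))) (sumxx (rest ds)))]))

;; slope : dataset -> num
;; to compute the least-squares slope:
;; (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - sum(x)^2)
;; (undefined when all x values are equal: the denominator is 0)
(define (slope ds)
  (/ (- (* (length ds) (sumxy ds)) (* (sumx ds) (sumy ds)))
     (- (* (length ds) (sumxx ds)) (sqr (sumx ds)))))

;; intercept : dataset -> num
;; to compute the least-squares y-intercept: (sum(y) - m*sum(x)) / n
(define (intercept ds)
  (/ (- (sumy ds) (* (slope ds) (sumx ds)))
     (length ds)))

;; linreg : dataset -> lineq
;; to compute the best-fit line for ds
(define (linreg ds)
  (make-lineq (slope ds) (intercept ds)))

;; three points lying exactly on y = 2x + 1
(define ON-LINE (list (make-datum 1 3) (make-datum 2 5) (make-datum 3 7)))
;; three points that do not lie on one line
(define OFF-LINE (list (make-datum 0 0) (make-datum 1 1) (make-datum 2 1)))

(check-within (slope ON-LINE) 2 0.001)
(check-within (intercept ON-LINE) 1 0.001)
(check-within (linreg ON-LINE) (make-lineq 2 1) 0.001)
(check-within (slope OFF-LINE) 0.5 0.001)
(check-within (intercept OFF-LINE) 0.1667 0.001)
```

The third argument of check-within is the tolerance: the test passes when the computed value is within that distance of the expected one, which is convenient when comparing against values rounded by Excel.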


Part 3

[Note: If you have gotten this far, great work! Part 3 will not count towards your grade for Lab 5, but will be part of this week's homework.]

The Linear Correlation Coefficient

A full linear regression analysis includes the computation of a linear correlation coefficient. This coefficient is usually referred to as r and is a measure, roughly speaking, of "how linear the data is." Typically a linear regression analysis provides the value r², which lies in the interval [0, 1]. The closer r² is to 1, the stronger the correlation between the data and the line modeling the data. An r² close to 0 means that the analysis has calculated a linear equation that bears no meaningful relationship to the data at hand.

An analysis will be a struct consisting of two parts: a lineq and a num, the latter being the value of r². Your goal is to write a full linear regression analysis with the contract

;; full-linreg : dataset -> analysis

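The linear correlation coefficient r is given by the following:

$$r = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\sigma_x \sigma_y}$$

An overbar indicates the arithmetic mean of the term underneath. A lowercase sigma (σ) indicates the standard deviation (as defined in last week's homework) of the subscript term. This form holds as written when σ is the population standard deviation, that is, when the sum of squared deviations is divided by n.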

Please feel free to include your code from last week to compute the standard deviation as part of the computation of r. Use Excel to verify your results for computing r². As always, make sure you include contracts, purpose statements and tests for all the functions you write.

[Note: the formula above gives r; the full analysis returns r².]
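The computation of r and r² might be sketched as follows. This is only an illustration under several assumptions: the language is Beginning Student with List Abbreviations; r is computed as (mean(xy) - mean(x)*mean(y)) / (σx * σy); and the standard deviation is the population form, which may differ from the definition in last week's homework. The names std-x, std-y, corr and r-squared are hypothetical.

```racket
(define-struct datum (x y))

;; sumx : dataset -> num   (sum of the x values)
(define (sumx ds)
  (cond [(empty? (rest ds)) (datum-x (first ds))]
        [else (+ (datum-x (first ds)) (sumx (rest ds)))]))

;; sumy : dataset -> num   (sum of the y values)
(define (sumy ds)
  (cond [(empty? (rest ds)) (datum-y (first ds))]
        [else (+ (datum-y (first ds)) (sumy (rest ds)))]))

;; sumxy : dataset -> num  (sum of the products x*y)
(define (sumxy ds)
  (cond [(empty? (rest ds)) (* (datum-x (first ds)) (datum-y (first ds)))]
        [else (+ (* (datum-x (first ds)) (datum-y (first ds)))
                 (sumxy (rest ds)))]))

;; sumxx : dataset -> num  (sum of the squared x values)
(define (sumxx ds)
  (cond [(empty? (rest ds)) (sqr (datum-x (first ds)))]
        [else (+ (sqr (datum-x (first ds))) (sumxx (rest ds)))]))

;; sumyy : dataset -> num  (sum of the squared y values)
(define (sumyy ds)
  (cond [(empty? (rest ds)) (sqr (datum-y (first ds)))]
        [else (+ (sqr (datum-y (first ds))) (sumyy (rest ds)))]))

;; std-x : dataset -> num
;; population standard deviation of the x values (assumed definition):
;; sqrt(mean(x^2) - mean(x)^2)
(define (std-x ds)
  (sqrt (- (/ (sumxx ds) (length ds))
           (sqr (/ (sumx ds) (length ds))))))

;; std-y : dataset -> num
;; population standard deviation of the y values (assumed definition)
(define (std-y ds)
  (sqrt (- (/ (sumyy ds) (length ds))
           (sqr (/ (sumy ds) (length ds))))))

;; corr : dataset -> num
;; to compute the linear correlation coefficient r
;; (undefined when all x values or all y values are equal)
(define (corr ds)
  (/ (- (/ (sumxy ds) (length ds))
        (* (/ (sumx ds) (length ds)) (/ (sumy ds) (length ds))))
     (* (std-x ds) (std-y ds))))

;; r-squared : dataset -> num
;; to compute r^2, the value stored in an analysis
(define (r-squared ds)
  (sqr (corr ds)))

;; points on y = 2x + 1 are perfectly correlated
(check-within (corr (list (make-datum 1 3) (make-datum 2 5) (make-datum 3 7)))
              1 0.001)
;; a descending line gives r = -1, so r^2 is still 1
(check-within (corr (list (make-datum 1 3) (make-datum 2 1))) -1 0.001)
;; a dataset that is not perfectly linear
(check-within (r-squared (list (make-datum 0 0) (make-datum 1 1) (make-datum 2 1)))
              0.75 0.001)
```

Since sqrt usually produces inexact numbers, these tests use check-within rather than check-expect.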


Hand in your work

To receive full credit, you should complete Part 1 and have made a good start on Part 2. This week's homework will include Part 2 and Part 3 of this lab.
Save all your files and submit all your work (on all 3 parts of the lab) according to the submission instructions.


Material designed by faculty at the Dwight-Englewood School in Englewood, NJ, including Adam Shaw. Modified by Gabriela Turcu.