Linear Regression from first principles

Linear regression models a dependent variable as the best straight line fit to one or more predictor variables. In the simplest scenario we have a single predictor variable; this is called simple linear regression.

As an example, let's take the trees dataset provided by the datasets package in R. The data consist of measurements of girth, height and volume for 31 felled black cherry trees. Suppose we wish to predict the volume of a black cherry tree from its girth. To begin with we can plot the two variables using the following command:

plot(trees$Girth, trees$Volume, xlab="Girth (inches)", ylab="Volume (cubic feet)")

The data certainly look to have a linear relationship. A straight line in 2 dimensions is defined by two parameters, namely the slope and the intercept. Different choices for these parameters produce different candidate fit lines, as shown in the plot below:

The left hand figure above shows linear fits with varying slopes and a fixed intercept. The right hand figure shows linear fits with varying intercepts and a fixed slope. We want to find the combination of parameters for which our line best fits the data.
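Figures like those described above can be sketched with base R graphics. The particular slope and intercept values below are arbitrary illustrative choices, not the ones used in the original plots:

```r
# Side-by-side scatter plots of girth against volume,
# each overlaid with several candidate straight lines.
par(mfrow = c(1, 2))
plot(trees$Girth, trees$Volume,
     xlab = "Girth (inches)", ylab = "Volume (cubic feet)",
     main = "Varying slope, fixed intercept")
for (slope in c(3, 4, 5)) abline(a = -37, b = slope, lty = 2)
plot(trees$Girth, trees$Volume,
     xlab = "Girth (inches)", ylab = "Volume (cubic feet)",
     main = "Varying intercept, fixed slope")
for (intercept in c(-45, -37, -30)) abline(a = intercept, b = 5, lty = 2)
par(mfrow = c(1, 1))
```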

We'll return to the trees example shortly, but suppose that in general we observe \(n\) pairs of data points \((x_1, y_1), (x_2, y_2),\ldots, (x_n, y_n)\). We can define a linear relationship by writing each observation pair as follows:

\(y_i = \beta_1 x_i + \beta_0 + \epsilon_i \qquad i = 1,2,\ldots n \tag{2}\label{2}\)

\(\beta_0\) and \(\beta_1\) represent the intercept and slope respectively of our straight line fit. The \(\epsilon_i\) are called residuals and represent the random variation of the \(y_i\) around the straight line fit. Thought of another way, each \(\epsilon_i\) is the error between the true value of our response variable \(y_i\) and the prediction given by our linear regression.

We wish to find an intercept/slope \((\beta_0, \beta_1)\) combination that minimises the \(\epsilon_i\) for \(i = 1,2,\ldots n\), meaning that our prediction errors are small. The most popular method of achieving this is to compute the sum of the squares of the residuals and choose the values of \(\beta_0\text{ and }\beta_1\) that minimise this quantity. Mathematically we can write the residual sum of squares (RSS) as follows:

\(RSS = \sum_{i=1}^n \epsilon_i^2 \tag{3}\label{3}\)

rearranging \(\eqref{2}\) to make \(\epsilon_i\) the subject:

\(\epsilon_i = y_i - \beta_1 x_i - \beta_0 \tag{4}\label{4}\)

substituting \(\eqref{4}\) into \(\eqref{3}\) we get:

\(RSS = \sum_{i=1}^n (y_i - \beta_1 x_i - \beta_0)^2 \tag{5}\label{5}\)
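To make \(\eqref{5}\) concrete, the RSS can be evaluated for any candidate pair of parameters. In the sketch below the two candidate lines are arbitrary illustrative choices:

```r
# RSS as a function of a candidate intercept (b0) and slope (b1),
# following equation (5).
rss <- function(b0, b1, x, y) sum((y - b1 * x - b0)^2)

# Two arbitrary candidate lines for the trees data:
rss(0, 1, trees$Girth, trees$Volume)    # a poor fit, large RSS
rss(-37, 5, trees$Girth, trees$Volume)  # much closer to the data
```

The better the line fits the data, the smaller the RSS, which is exactly the quantity we now set out to minimise.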

To minimise \(\eqref{5}\) we differentiate \(RSS\) with respect to both \(\beta_0\) and \(\beta_1\), set the resulting derivatives equal to zero and solve the two equations simultaneously.

We can make use of the chain rule which states that for a composite function of the form:

\(F(x) = f(g(x))\)

the derivative is given by

\(F'(x) = f'(g(x)) g'(x) \tag{6}\label{6}\)

Applying \(\eqref{6}\) to \(\eqref{5}\) we get:

\(\frac{\partial RSS}{\partial \beta_0} = -2 \sum_{i=1}^n (y_i - \beta_1 x_i - \beta_0) \label{7}\tag{7}\)

\(\frac{\partial RSS}{\partial \beta_1} = -2 \sum_{i=1}^n x_i (y_i - \beta_1 x_i - \beta_0) \label{8}\tag{8}\)
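As a quick sanity check on \(\eqref{7}\) and \(\eqref{8}\), the analytic derivatives can be compared with central finite-difference approximations of the RSS at an arbitrary test point (here \((\beta_0, \beta_1) = (1, 1)\), an illustrative choice):

```r
x <- trees$Girth
y <- trees$Volume
rss <- function(b0, b1) sum((y - b1 * x - b0)^2)

# Analytic partial derivatives, equations (7) and (8).
d_b0 <- function(b0, b1) -2 * sum(y - b1 * x - b0)
d_b1 <- function(b0, b1) -2 * sum(x * (y - b1 * x - b0))

# Central finite differences of the RSS at the test point (1, 1).
h <- 1e-6
fd_b0 <- (rss(1 + h, 1) - rss(1 - h, 1)) / (2 * h)
fd_b1 <- (rss(1, 1 + h) - rss(1, 1 - h)) / (2 * h)
```

The finite differences agree with the analytic expressions to several decimal places, as they should.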

We set the above equations equal to zero and solve for \(\beta_0\) and \(\beta_1\) to find the minima. Starting with \(\eqref{7}\):

\begin{align*} & 0 = -2 \sum_{i=1}^n (y_i - \beta_1 x_i - \beta_0) \\ & 0 = \sum_{i=1}^n (y_i - \beta_1 x_i - \beta_0) \\ & 0 = \sum_{i=1}^n y_i - \beta_1 \sum_{i=1}^n x_i -n \beta_0 \\ & n \beta_0 = \sum_{i=1}^n y_i - \beta_1 \sum_{i=1}^n x_i \end{align*}

making use of \(\bar{x} = \frac{\sum_{i=1}^n x_i}{n}\) and \(\bar{y} = \frac{\sum_{i=1}^n y_i}{n}\):

\(\beta_0 = \bar{y} - \beta_1 \bar{x} \tag{9}\label{9}\)
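Equation \(\eqref{9}\) says that, whatever slope we pick, the RSS-minimising intercept is \(\bar{y} - \beta_1 \bar{x}\). We can check numerically that for a fixed (arbitrary) slope, perturbing the intercept away from this value only increases the RSS:

```r
x <- trees$Girth
y <- trees$Volume
rss <- function(b0, b1) sum((y - b1 * x - b0)^2)

b1 <- 4                           # an arbitrary fixed slope
b0_opt <- mean(y) - b1 * mean(x)  # intercept from equation (9)

rss(b0_opt, b1)        # the minimum over b0 for this slope
rss(b0_opt + 0.5, b1)  # larger
rss(b0_opt - 0.5, b1)  # larger
```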

Now we repeat the same process for \(\beta_{1}\):

\begin{align*} & 0 = -2 \sum_{i=1}^n x_i (y_i - \beta_1 x_i - \beta_0) \\ & 0 = \sum_{i=1}^n x_i (y_i - \beta_1 x_i - \beta_0) \end{align*}

substituting in the value for \(\beta_0\) obtained in \(\eqref{9}\):

\begin{align*} & 0 = \sum_{i=1}^n x_i (y_i - \beta_1 x_i - \bar{y} + \beta_1 \bar{x}) \\ & 0 = \sum_{i=1}^n x_i (y_i - \bar{y}) + \beta_1 \sum_{i=1}^n x_i (\bar{x} - x_i) \\ \\ & \beta_1 = \frac{-\sum_{i=1}^n x_i(y_i - \bar{y})}{\sum_{i=1}^n x_i(\bar{x} - x_i)} \\ \\ & \beta_1 = \frac{-\sum_{i=1}^n (x_i y_i - x_i \bar{y})}{\sum_{i=1}^n (x_i \bar{x} - x_i^2)} \\ \\ & \beta_1 = \frac{\bar{y} \sum_{i=1}^n x_i - \sum_{i=1}^n x_i y_i}{\bar{x} \sum_{i=1}^n x_i - \sum_{i=1}^n x_i^2} \\ \\ & \beta_1 = \frac{n \bar{x} \bar{y} - \sum_{i=1}^n x_i y_i}{n \bar{x}^2 - \sum_{i=1}^n x_i^2} \end{align*}
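The final expression is equivalent to the more familiar covariance form \(\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}\), obtained by multiplying numerator and denominator by \(-1\) and expanding. A quick numerical check of the equivalence on the trees data:

```r
x <- trees$Girth
y <- trees$Volume
n <- length(x)
x_bar <- mean(x)
y_bar <- mean(y)

# Form from the final line of the derivation above.
b1_derived <- (n * x_bar * y_bar - sum(x * y)) / (n * x_bar^2 - sum(x^2))
# Equivalent covariance form: sample covariance over sample variance.
b1_cov <- cov(x, y) / var(x)
```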

We have now derived the least squares estimates for \(\beta_0\) and \(\beta_1\) in simple linear regression. Armed with these we can generate parameter estimates for the cherry tree regression:

n <- nrow(trees)
x_bar <- mean(trees$Girth)
y_bar <- mean(trees$Volume)
xi_yi_sum <- sum(trees$Girth * trees$Volume)
xi_squared_sum <- sum(trees$Girth ^ 2)

beta_1 <- (n * x_bar * y_bar - xi_yi_sum) / (n * x_bar^2 - xi_squared_sum)
beta_0 <- y_bar - beta_1 * x_bar

beta_1
[1] 5.065856
beta_0
[1] -36.94346

Of course, R has a built-in function, lm, for fitting linear regressions. We can use it to check the estimates we produced above:

> fit <- lm(trees$Volume~trees$Girth)
> fit$coefficients
(Intercept) trees$Girth
-36.943459    5.065856

which match our derived estimates.
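As a final check, minimising \(\eqref{5}\) numerically with R's general purpose optimiser optim arrives at approximately the same estimates, confirming that the closed form solution really is the minimiser of the RSS:

```r
# Minimise the RSS directly over (b0, b1) with Nelder-Mead,
# starting from the arbitrary point (0, 0).
rss <- function(p) sum((trees$Volume - p[2] * trees$Girth - p[1])^2)
fit_num <- optim(c(0, 0), rss)
fit_num$par  # close to the coefficients reported by lm
```

The numerical optimiser only matches the closed form answer to a few decimal places, which is one reason the analytic solution is preferred when it is available.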