Regression analysis is one of the **most frequently used** statistical techniques. It aims to build up an explicit relationship between one *response variable*, often denoted as \(y\), and one or several *explanatory variables*, often denoted as \(x_1,\dots,x_p\).

Alternative nomenclature for the variables:

\(y\) variable | \(x\) variable |
---|---|

Response variable | Explanatory variables |

Dependent variable | Independent variable |

Output variable | Input variable |

Covariates | |

Regressors |

**GOAL**:

- To understand how \(y\) depends on \(x_1,\dots,x_p\) (inference)
- To predict unobserved \(y\) value based on observed \(x_1,\dots,x_p\)

Look at this data set. It contains information about various cars and their related variables (a data frame with 32 observations on 11 (numeric) variables).

- [, 1]
`mpg`

Miles/(US) gallon - [, 2]
`cyl`

Number of cylinders - [, 3]
`disp`

Displacement (cu.in.) - [, 4]
`hp`

Gross horsepower - [, 5]
`drat`

Rear axle ratio - [, 6]
`wt`

Weight (1000 lbs) - [, 7]
`qsec`

1/4 mile time - [, 8]
`vs`

Engine (0 = V-shaped, 1 = straight) - [, 9]
`am`

Transmission (0 = automatic, 1 = manual) - [,10]
`gear`

Number of forward gears - [,11]
`carb`

Number of carburetors

`mtcars`

```
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
```

`summary(mtcars)`

```
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
```

Let’s concentrate on two variables: `mpg`

and `wt`

. Plot them:

```
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
# geom_abline(slope = -6, intercept = 42, linetype = "dashed", col = "red",
# alpha = 0.8) +
# geom_abline(slope = -7, intercept = 39, linetype = "dashed", col = "blue",
# alpha = 0.8) +
theme_bw()
```

Questions of interest:

- What is the general trend of
`mpg`

against`wt`

? - How can we draw the line of best fit through the data points?
- How do we explain the points that do not lie exactly on the line?

For a set of response variables \(\mathbf y = \{y_1,\dots,y_n\}\) and corresponding explanatory variables \(\mathbf x_{k} = \{x_{1k},\dots, x_{nk} \}\) for \(k=1,\dots,p\), the linear regression model is given by

\[\begin{gathered} y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i \\ \epsilon_i \sim N(0,\sigma^2) \text{ (iid)} \\ \text{for } i=1,\dots, n \end{gathered}\]

Using matrix notation, we can write this as

\[\begin{gathered} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \\ \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix} \end{gathered}\]

or simply

\[\mathbf y = \mathbf X\boldsymbol\beta + \boldsymbol\epsilon\]

where \(\mathbf y\) is an \(n \times 1\) vector of responses, \(\mathbf X\) is an \(n \times (p+1)\) matrix of observations (sometimes called the *design matrix*), \(\boldsymbol\beta\) is a \(p \times 1\) vector of coefficients, and \(\boldsymbol\epsilon\) is an \(n \times 1\) vector of errors.

The errors essentially measure the random deviation or random noise that the observation makes from the ‘true’ model.

- \(\text{E}(\epsilon_i) = 0, \forall i\).
- \(\text{Var}(\epsilon_i) = \sigma^2,\forall i\).
- \(\text{Cov}(\epsilon_i,\epsilon_j) = 0, \forall i\neq j\).
- We assume that the errors are normally distributed.
- We assume that the explanatory variables are fixed (non-random).

Note that The model is **linear** in the parameters \(\beta_0,\dots,\beta_p\). This means we cannot have a model with \(\beta_1^2\), or \(\beta_k^q\) or the like. However, it is fine to have the covariates \(x_k^2\), \(x_k^3\) or any transformation of them.

In order to proceed with either inference or prediction, we need to *estimate the model*. This means estimating the unknown values of the parameters (collectively called \(\theta\)) of the model, which are

\[\theta = \{\beta_0,\beta_1,\dots,\beta_p,\sigma^2 \}.\]

There are several methods available for estimating the linear regression model. Among them is the *least squares method*. Essentially, the line that fits the best through all the data points should have the smallest total error.

Let \(\hat{\boldsymbol\beta}\) be an estimate of \(\boldsymbol\beta\). Note that we use hats to represent estimates of coefficients/parameters. Consider the sum of the squared errors

\[ \begin{align} \sum_{i=1}^n \epsilon_i^2 &= \sum_{i=1}^n (y_i - \beta_0 + \beta_1 x_{i1} - \cdots - \beta_p x_{ip})^2 \\ &= \Vert \mathbf y - \mathbf X \boldsymbol\beta \Vert^2 \end{align} \]

By definition, the *least squares estimator (LSE)* for \(\boldsymbol\beta\) minimises the sum of squared errors, i.e.

\[ \hat{\boldsymbol\beta} = \arg\min \Vert \mathbf y - \mathbf X \boldsymbol\beta \Vert^2 \]

To solve for \(\boldsymbol\beta\), consider

\[\begin{align} \Vert \mathbf y - \mathbf X \boldsymbol\beta \Vert^2 &= \Vert \mathbf y - \mathbf X \hat{\boldsymbol\beta} + \mathbf X \hat{\boldsymbol\beta} - \mathbf X \boldsymbol\beta \Vert^2 \\ &= \Vert \mathbf y - \mathbf X \hat{\boldsymbol\beta} \Vert^2 + \Vert \mathbf X (\hat{\boldsymbol\beta} - \boldsymbol\beta) \Vert^2 + 2(\hat{\boldsymbol\beta} - \boldsymbol\beta)^\top\mathbf X^\top (\mathbf y - \mathbf X \hat{\boldsymbol\beta} ) \\ \end{align}\]

Now if we choose \(\boldsymbol\beta\) such that \[\mathbf X^\top (\mathbf y - \mathbf X \hat{\boldsymbol\beta} )\] then we have

\[\begin{align} \Vert \mathbf y - \mathbf X \boldsymbol\beta \Vert^2 &= \Vert \mathbf y - \mathbf X \hat{\boldsymbol\beta} \Vert^2 + \Vert \mathbf X (\hat{\boldsymbol\beta} - \boldsymbol\beta) \Vert^2 \\ &\geq \Vert \mathbf y - \mathbf X \hat{\boldsymbol\beta} \Vert^2 \end{align}\]

for any \(\boldsymbol\beta\), which satisfies the least squares condition. Therefore, the LSE for \(\boldsymbol\beta\) must satisfy

\[\begin{align} \mathbf X^\top (\mathbf y - \mathbf X \hat{\boldsymbol\beta} ) &= 0 \\ \mathbf X^\top\mathbf y - \mathbf X^\top\mathbf X \hat{\boldsymbol\beta} &= 0 \\ \mathbf X^\top\mathbf X \hat{\boldsymbol\beta} &= \mathbf X^\top\mathbf y\\ \hat{\boldsymbol\beta} &= (\mathbf X^\top\mathbf X )^{-1} \mathbf X^\top\mathbf y \end{align}\]

provided \(\mathbf X^\top\mathbf X\) is not singular, i.e. it is invertible.

With the knowledge of the LSE for the coefficients, and assuming that the errors are zero-meaned variables, an estimate of the variance of the errors is given by \[ \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n (y_i-\hat\beta_0-\hat\beta_1x_{i1} - \cdots -\hat\beta_px_{ip})^2 \] but usually we will use another estimator for \(\sigma^2\), as we will see later.

Let \(\hat y_i = \hat\beta_0+\hat\beta_1x_{i1} + \cdots +\hat\beta_px_{ip}\), the **predicted value** for the \(i\)’th observation using estimates \(\hat{\boldsymbol\beta}\). Residuals are defined to be the difference between the observed value and the predicted value,

\[\hat\epsilon_i = y_i - \hat y_i\]

If the model is correct, then the residuals should behave like random noise!

The LSE for \(\boldsymbol\beta\) turns out to be the

**maximum likelihood**estimator for \(\boldsymbol\beta\) too. This is easy to see:The likelihood under the normal linear model is

\[L(\boldsymbol\beta,\sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp \left\{ -\frac{1}{2\sigma^2} \Vert \mathbf y - \mathbf X \boldsymbol\beta \Vert^2 \right\}\]

and since under the LSE, \(\Vert \mathbf y - \mathbf X {\boldsymbol\beta} \Vert^2 \geq \Vert \mathbf y - \mathbf X \hat{\boldsymbol\beta} \Vert^2\), which means that the likelihood is maximised at the LSE as well.

The estimator for \(\hat{\boldsymbol\beta}\) is normally distributed with mean and variance given by \[ \hat{\boldsymbol\beta} \sim \text{N}_p(\boldsymbol\beta, \sigma^2 (\mathbf X^\top \mathbf X)^{-1}) \] which implies that the estimators for the coefficients are

**unbiased**.The unbiased estimator for \(\sigma^2\) is given by the formula \[\hat\sigma^2 = \frac{ \Vert \mathbf y - \mathbf X \hat{ \boldsymbol\beta} \Vert^2}{n-p-1}\] which is distributed according to a \(\chi^2\) distribution with \(n-p-1\) degrees of freedom.

\(\hat{\boldsymbol\beta}\) and \(\hat\sigma^2\) are independent of each other.

The residuals have mean zero \[ \begin{align} \text{E}(\hat{\boldsymbol\epsilon}) &= \text{E}(\mathbf y - \mathbf X \hat{\boldsymbol\beta}) \\ &= \text{E}(\mathbf X \boldsymbol\beta - \mathbf X \hat{\boldsymbol\beta}) \\ &= \mathbf X \boldsymbol\beta -\mathbf X \boldsymbol\beta \\ &= \mathbf 0 \end{align} \]

The coefficients \(\beta_j\) represents the ‘strength or influence’ of the variable \(x_j\) on \(y\). It is the effect on \(y\) of changing \(x_j\) by a single unit, holding the other covariates fixed.

Consider the effect of the coefficient \(\beta_1\). Let \(y^{(0)} =\beta_0+\beta_1x_{1} + \cdots +\beta_px_{p}\), and also let \(y^{(1)} =\beta_0+\beta_1(x_{1}+1) + \cdots +\beta_px_{p}\). Since everything else cancels out, the difference between \(y^{(1)}\) and \(y^{(0)}\) is simply \[ y^{(1)} - y^{(0)} = \beta_1(x_{1}+1) - \beta_1x_{1} = \beta_1 \]

As we can see, \(\beta_1\) represents the change in the response variable when the variable \(x_1\) is increased by **one unit**, and **keeping all other variables fixed**. A positive value for the coefficient imparts a change in the positive direction in the response, and vice versa. Of course, the logic is the same for all of the other variables \(j=2,3,\dots,p\).

Recall that the **standard deviation** for each \(\hat\beta_j\) (from the normal distribution in property 2) is given by \[
\text{SD}(\hat\beta_j) = \sqrt{\sigma^2 v_{jj}}
\] where \(v_jj\) is the \((j+1,j+1)\)th element of the matrix \((\mathbf X^\top\mathbf X)^{-1}\). Notice that we do not know the true value of \(\sigma\) and therefore, the standard deviation of the coefficients as well.

The __ standard error__ for each \(\hat\beta_j\) is given by \[
\text{SE}(\hat\beta_j) = \sqrt{\hat\sigma^2 v_{jj}}
\] where \(\hat\sigma^2\) is the unbiased estimator for \(\sigma^2\) given in property 3 above. Essentially, by replacing the unknown value \(\sigma^2\) by its estimator, we get a “estimate” of the SD which we call the standard error.

Introduce the decomposition \[ \overbrace{\sum_{i=1}^n (y_i - \bar y)^2}^{\text{Total SS}} = \overbrace{\sum_{i=1}^n (\hat y_i - \bar y)^2}^{\text{Regression SS}} + \overbrace{\sum_{i=1}^n (y_i - \hat y_i)^2}^{\text{Residual SS}} \]

The term on the LHS is the *Total Sum of Squares*, which represents the total variation in the data (responses). The first term on the RHS is called the *Regression Sum of Squares*, which represents the variation in the regression model. The second term on the RHS we have seen before, called the *Residual Sum of Squares*, which measures variability between predicted and observed values.

Define the **coefficient of determination**, as \[
R^2 = \frac{\text{Regression SS}}{\text{Total SS}} = 1 - \frac{\text{Residual SS}}{\text{Total SS}}
\] \(R^2\) takes values between 0 and 1. \(100R^2\) is the percentage of the total variation in \(\{y_i \}\) explained by all the regressors. Therefore, the closer \(R^2\) is to 1, the better the model agreement is.

Sometimes, the **adjusted** \(R^2\) value is used instead, because it has nice distributional properties (for hypothesis testing) \[
R^2_{\text{adj}} = 1 - \frac{\text{Residual SS} / (n-p-1)}{\text{Total SS} / (n-1)}
\]

Roughly speaking, both should give about similar values.

For each coefficient \(\beta_j\), where \(j=0,1,\dots,p\), \[ \frac{\hat\beta_j - \beta_j}{\text{SE}(\hat{\beta_j})} \sim t_{n-p-1} \]

We can then use this fact to test the hypothesis \[ \begin{align} \text{H}_0&: \beta_j = b_j \\ \text{H}_1&: \beta_j \neq b_j \end{align} \]

Let \[ T = \frac{\hat\beta_j - b_j}{\text{SE}(\hat{\beta_j})} \] We reject the null hypothesis at the level \(\alpha\) against the alternative hypothesis if \(\vert T \vert > t_{n-p-1}(\alpha/2)\), where \(t_k(\alpha)\) is the top \(\alpha\)-point of the \(t_k\) distribution.

Note we can also test \(\text{H}_0\) against the alternatives

- \(\text{H}_1: \beta_j > b_j\), and we reject the null hypothesis if \(T > t_{n-p-1}(\alpha)\); and
- \(\text{H}_1: \beta_j < b_j\), and we reject the null hypothesis if \(T < -t_{n-p-1}(\alpha)\).

The \((1-\alpha)\) confidence interval for \(\beta_j\) is \[ \hat\beta_j \pm t_{n-p-1}(\alpha/2)\cdot \text{SE}(\hat\beta_j) \]

Remarks:

- When \(n\) is large, then the \(t\)-test becomes approximately equivalent to the \(Z\)-test / Wald test (using the normal distribution).
- We are not usually interested in testing the intercept. In most practical data applications, the intercept is not likely to be zero as it represents the ‘average value’ of the response when all covariates are zero.

Consider the hypothesis \[ \begin{align} \text{H}_0&: \beta_1 = \beta_2 = \cdots = \beta_p = 0 \\ \text{H}_1&: \text{Not all } \beta_1,\dots,\beta_p \text{ are } 0 \end{align} \]

Let \[ \begin{align} T = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2 / p}{\sum_{i=1}^n (y_i - \hat y_i)^2 / (n-p-1)} &= \frac{n-p-1}{p}\cdot \frac{\text{Regression SS}}{\text{Residual SS}} \\ &= \frac{n-p-1}{p}\cdot\frac{R^2}{1-R^2} \end{align} \] We reject the null hypothesis at the \(\alpha\) significance level if \(T > F_{p,n-p-1}(\alpha)\).

This has links to the ANOVA table which you may be familiar with:

Source | d.f. | Sum of squares | Mean SS | F-statistic |
---|---|---|---|---|

Regressors | \(p\) | \(\sum_{i=1}^n (\hat y_i - \bar y)^2\) | \(\frac{\sum_{i=1}^n (\hat y_i - \bar y)^2}{p}\) | \(\frac{p \sum_{i=1}^n (\hat y_i - \bar y)^2 }{(n-p-1)\sum_{i=1}^n (y_i - \hat y_i)^2}\) |

Residual | \(n-p-1\) | \(\sum_{i=1}^n (y_i - \hat y_i)^2\) | \(\frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{n-p-1}\) | |

Total | \(n-1\) | \(\sum_{i=1}^n (y_i - \bar y)^2\) |

Let \(\mathbf H = \mathbf X(\mathbf X^\top\mathbf X)^{-1}\mathbf X^\top\), an \(n\times n\) matrix. This matrix is called the **projection matrix** or the **hat matrix** for linear regression. It has several interesting and useful properties, which we will not go into detail now.

For \(i=1,\dots,n\), define the **standardized residuals** as \[
\hat e_i = \frac{y_i - \hat y_i}{\sqrt{\hat\sigma^2(1-h_{ii})}}
\] where \(h_{ii}\) is the \((i,i)\)th entry of the matrix \(\mathbf H\). The standardized residuals then have mean zero and variance one.

We use the standardized residuals as a means of diagnosing the fit of the linear model, and to test assumptions of our linear model. For instance, we can plot the standardized residuals - in a qq-plot (to test for normality) - against covariates (to test for homo/heteroscedasticity) - against predicted values (to test for homo/heteroscedasticity)

Let’s fit a simple linear regression model to the `mtcars`

data set. In `R`

, the command to fit linear regression model is `lm()`

.

```
mod <- lm(formula = mpg ~ wt, data = mtcars)
summary(mod)
```

```
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
```

Let’s go through the results one by one.

```
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
```

This tells us that the following regression had been fitted: \[ \texttt{mpg} = \beta_0 + \beta_1 \cdot \texttt{wt} + \epsilon \]

```
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
```

This gives us a 5-point summary of the distribution of the residuals \(\hat\epsilon_i\). The intention is to tell us how well the model fits the data, but it will also be easier to diagnose model fit using residual plots.

```
diag.plots <- lindia::gg_diagnose(mod, plot.all = FALSE)
diag.plots <- lapply(diag.plots, function(x) x + theme_bw())
diag.plots[[2]] <- diag.plots[[2]] + geom_smooth(se = FALSE)
diag.plots[[3]] <- diag.plots[[3]] + geom_smooth(se = FALSE)
lindia::plot_all(diag.plots[1:4])
```