Introduction

Regression analysis is one of the most frequently used statistical techniques. It aims to establish an explicit relationship between a response variable, often denoted \(y\), and one or several explanatory variables, often denoted \(x_1,\dots,x_p\).

Alternative nomenclature for the variables:

| \(y\) variable | \(x\) variable |
|---|---|
| Response variable | Explanatory variables |
| Dependent variable | Independent variable |
| Output variable | Input variable |
| | Covariates |
| | Regressors |

GOAL: use the fitted relationship to explain how the response varies with the explanatory variables (inference) and to predict the response for new observations (prediction).

Example

Consider the mtcars data set. It contains information about various cars and their related variables (a data frame with 32 observations on 11 numeric variables).

  • [, 1] mpg Miles/(US) gallon
  • [, 2] cyl Number of cylinders
  • [, 3] disp Displacement (cu.in.)
  • [, 4] hp Gross horsepower
  • [, 5] drat Rear axle ratio
  • [, 6] wt Weight (1000 lbs)
  • [, 7] qsec 1/4 mile time
  • [, 8] vs Engine (0 = V-shaped, 1 = straight)
  • [, 9] am Transmission (0 = automatic, 1 = manual)
  • [,10] gear Number of forward gears
  • [,11] carb Number of carburetors
mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Let’s concentrate on two variables: mpg and wt. Plot them:

library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # geom_abline(slope = -6, intercept = 42, linetype = "dashed", col = "red",
  #             alpha = 0.8) +
  # geom_abline(slope = -7, intercept = 39, linetype = "dashed", col = "blue",
  #             alpha = 0.8) +
  theme_bw()

Questions of interest:

  1. What is the general trend of mpg against wt?
  2. How can we draw the line of best fit through the data points?
  3. How do we explain the points that do not lie exactly on the line?

The linear regression model

For a set of response variables \(\mathbf y = \{y_1,\dots,y_n\}\) and corresponding explanatory variables \(\mathbf x_{k} = \{x_{1k},\dots, x_{nk} \}\) for \(k=1,\dots,p\), the linear regression model is given by

\[\begin{gathered} y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i \\ \epsilon_i \sim N(0,\sigma^2) \text{ (iid)} \\ \text{for } i=1,\dots, n \end{gathered}\]

Using matrix notation, we can write this as

\[\begin{gathered} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \\ \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix} \end{gathered}\]

or simply

\[\mathbf y = \mathbf X\boldsymbol\beta + \boldsymbol\epsilon\]

where \(\mathbf y\) is an \(n \times 1\) vector of responses, \(\mathbf X\) is an \(n \times (p+1)\) matrix of observations (sometimes called the design matrix), \(\boldsymbol\beta\) is a \((p+1) \times 1\) vector of coefficients, and \(\boldsymbol\epsilon\) is an \(n \times 1\) vector of errors.
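For concreteness, here is a quick sketch (using base R's model.matrix() and the mpg ~ wt regression considered later) of what such a design matrix looks like:

# Design matrix for a regression of mpg on wt: a column of ones (intercept)
# followed by the wt column, so X is n x (p + 1) = 32 x 2
X <- model.matrix(~ wt, data = mtcars)
head(X)
dim(X)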

The errors essentially measure the random deviation or random noise that the observation makes from the ‘true’ model.

Assumptions

  • \(\text{E}(\epsilon_i) = 0, \forall i\).
  • \(\text{Var}(\epsilon_i) = \sigma^2,\forall i\).
  • \(\text{Cov}(\epsilon_i,\epsilon_j) = 0, \forall i\neq j\).
  • We assume that the errors are normally distributed.
  • We assume that the explanatory variables are fixed (non-random).

Note that the model is linear in the parameters \(\beta_0,\dots,\beta_p\). This means we cannot have terms like \(\beta_1^2\) or \(\beta_k^q\) in the model. However, it is fine to include covariates such as \(x_k^2\), \(x_k^3\), or any transformation of them, as the example below shows.
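As a brief illustration (this particular model is not analysed further in these notes), a quadratic term in wt still gives a linear model, because the model remains linear in the coefficients:

# Linear in the parameters, even though it is quadratic in the covariate wt
quad_mod <- lm(mpg ~ wt + I(wt^2), data = mtcars)
coef(quad_mod)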

Estimation

In order to proceed with either inference or prediction, we need to estimate the model. This means estimating the unknown values of the parameters (collectively called \(\theta\)) of the model, which are

\[\theta = \{\beta_0,\beta_1,\dots,\beta_p,\sigma^2 \}.\]

There are several methods available for estimating the linear regression model; among them is the method of least squares. Essentially, the line that best fits all the data points should have the smallest total (squared) error.

Let \(\hat{\boldsymbol\beta}\) be an estimate of \(\boldsymbol\beta\). Note that we use hats to represent estimates of coefficients/parameters. Consider the sum of the squared errors

\[ \begin{align} \sum_{i=1}^n \epsilon_i^2 &= \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip})^2 \\ &= \Vert \mathbf y - \mathbf X \boldsymbol\beta \Vert^2 \end{align} \]

By definition, the least squares estimator (LSE) for \(\boldsymbol\beta\) minimises the sum of squared errors, i.e.

\[ \hat{\boldsymbol\beta} = \arg\min \Vert \mathbf y - \mathbf X \boldsymbol\beta \Vert^2 \]

To solve for \(\boldsymbol\beta\), consider

\[\begin{align} \Vert \mathbf y - \mathbf X \boldsymbol\beta \Vert^2 &= \Vert \mathbf y - \mathbf X \hat{\boldsymbol\beta} + \mathbf X \hat{\boldsymbol\beta} - \mathbf X \boldsymbol\beta \Vert^2 \\ &= \Vert \mathbf y - \mathbf X \hat{\boldsymbol\beta} \Vert^2 + \Vert \mathbf X (\hat{\boldsymbol\beta} - \boldsymbol\beta) \Vert^2 + 2(\hat{\boldsymbol\beta} - \boldsymbol\beta)^\top\mathbf X^\top (\mathbf y - \mathbf X \hat{\boldsymbol\beta} ) \\ \end{align}\]

Now if we choose \(\hat{\boldsymbol\beta}\) such that \[\mathbf X^\top (\mathbf y - \mathbf X \hat{\boldsymbol\beta} ) = \mathbf 0,\] then we have

\[\begin{align} \Vert \mathbf y - \mathbf X \boldsymbol\beta \Vert^2 &= \Vert \mathbf y - \mathbf X \hat{\boldsymbol\beta} \Vert^2 + \Vert \mathbf X (\hat{\boldsymbol\beta} - \boldsymbol\beta) \Vert^2 \\ &\geq \Vert \mathbf y - \mathbf X \hat{\boldsymbol\beta} \Vert^2 \end{align}\]

for any \(\boldsymbol\beta\), so this choice of \(\hat{\boldsymbol\beta}\) satisfies the least squares condition. Therefore, the LSE for \(\boldsymbol\beta\) must satisfy

\[\begin{align} \mathbf X^\top (\mathbf y - \mathbf X \hat{\boldsymbol\beta} ) &= 0 \\ \mathbf X^\top\mathbf y - \mathbf X^\top\mathbf X \hat{\boldsymbol\beta} &= 0 \\ \mathbf X^\top\mathbf X \hat{\boldsymbol\beta} &= \mathbf X^\top\mathbf y\\ \hat{\boldsymbol\beta} &= (\mathbf X^\top\mathbf X )^{-1} \mathbf X^\top\mathbf y \end{align}\]

provided \(\mathbf X^\top\mathbf X\) is not singular, i.e. it is invertible.

With the knowledge of the LSE for the coefficients, and assuming that the errors are zero-mean random variables, an estimate of the variance of the errors is given by \[ \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n (y_i-\hat\beta_0-\hat\beta_1x_{i1} - \cdots -\hat\beta_px_{ip})^2 \] but usually we will use another estimator for \(\sigma^2\), as we will see later.
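As a sketch (this is not how lm() computes its estimates internally, but it is numerically equivalent for the mpg ~ wt example used later), the LSE and the divide-by-\(n\) variance estimate can be computed directly:

# Least squares estimate (X'X)^{-1} X'y computed directly
X <- model.matrix(~ wt, data = mtcars)
y <- mtcars$mpg
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
beta_hat                        # matches coef(lm(mpg ~ wt, data = mtcars))

# The divide-by-n variance estimate mentioned above
e <- y - X %*% beta_hat
sum(e^2) / nrow(X)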

Residuals

Let \(\hat y_i = \hat\beta_0+\hat\beta_1x_{i1} + \cdots +\hat\beta_px_{ip}\) be the predicted value for the \(i\)th observation using the estimates \(\hat{\boldsymbol\beta}\). Residuals are defined as the difference between the observed and predicted values,

\[\hat\epsilon_i = y_i - \hat y_i\]

If the model is correct, then the residuals should behave like random noise!

Properties of the estimators

  1. The LSE for \(\boldsymbol\beta\) turns out to be the maximum likelihood estimator for \(\boldsymbol\beta\) too. This is easy to see:

    The likelihood under the normal linear model is

    \[L(\boldsymbol\beta,\sigma^2) = (2\pi\sigma^2)^{-n/2} \exp \left\{ -\frac{1}{2\sigma^2} \Vert \mathbf y - \mathbf X \boldsymbol\beta \Vert^2 \right\}\]

    and since \(\Vert \mathbf y - \mathbf X {\boldsymbol\beta} \Vert^2 \geq \Vert \mathbf y - \mathbf X \hat{\boldsymbol\beta} \Vert^2\) for any \(\boldsymbol\beta\), the likelihood (for any fixed \(\sigma^2\)) is maximised at the LSE as well.

  2. The estimator \(\hat{\boldsymbol\beta}\) is normally distributed with mean and variance given by \[ \hat{\boldsymbol\beta} \sim \text{N}_{p+1}(\boldsymbol\beta, \sigma^2 (\mathbf X^\top \mathbf X)^{-1}) \] which implies that the estimators for the coefficients are unbiased.

  3. The unbiased estimator for \(\sigma^2\) is given by the formula \[\hat\sigma^2 = \frac{ \Vert \mathbf y - \mathbf X \hat{ \boldsymbol\beta} \Vert^2}{n-p-1}\] and \((n-p-1)\hat\sigma^2/\sigma^2\) follows a \(\chi^2\) distribution with \(n-p-1\) degrees of freedom.

  4. \(\hat{\boldsymbol\beta}\) and \(\hat\sigma^2\) are independent of each other.

  5. The residuals have mean zero \[ \begin{align} \text{E}(\hat{\boldsymbol\epsilon}) &= \text{E}(\mathbf y - \mathbf X \hat{\boldsymbol\beta}) \\ &= \text{E}(\mathbf X \boldsymbol\beta - \mathbf X \hat{\boldsymbol\beta}) \\ &= \mathbf X \boldsymbol\beta -\mathbf X \boldsymbol\beta \\ &= \mathbf 0 \end{align} \]
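A quick numerical check of some of these properties, using the same mpg ~ wt regression fitted in the Example below (a sketch, not a proof):

mod <- lm(mpg ~ wt, data = mtcars)
X <- model.matrix(mod)

# Property 3: unbiased estimate of sigma^2, with n - p - 1 = 30 degrees of freedom
sum(residuals(mod)^2) / df.residual(mod)   # equals sigma(mod)^2

# Property 2: estimated variance of beta-hat, sigma^2 (X'X)^{-1}
sigma(mod)^2 * solve(t(X) %*% X)           # matches vcov(mod)

# Property 5: the residuals average to zero (up to numerical error)
mean(residuals(mod))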

Inference

Interpretation of coefficients

The coefficient \(\beta_j\) represents the ‘strength or influence’ of the variable \(x_j\) on \(y\). It is the effect on \(y\) of changing \(x_j\) by a single unit, holding the other covariates fixed.

Consider the effect of the coefficient \(\beta_1\). Let \(y^{(0)} =\beta_0+\beta_1x_{1} + \cdots +\beta_px_{p}\), and also let \(y^{(1)} =\beta_0+\beta_1(x_{1}+1) + \cdots +\beta_px_{p}\). Since everything else cancels out, the difference between \(y^{(1)}\) and \(y^{(0)}\) is simply \[ y^{(1)} - y^{(0)} = \beta_1(x_{1}+1) - \beta_1x_{1} = \beta_1 \]

As we can see, \(\beta_1\) represents the change in the response variable when the variable \(x_1\) is increased by one unit, keeping all other variables fixed. A positive coefficient shifts the response in the positive direction, and vice versa. The same logic applies to all the other variables \(j=2,3,\dots,p\).
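For instance, in the mpg ~ wt regression fitted in the Example below, increasing wt by one unit (1,000 lbs) changes the predicted mpg by exactly \(\hat\beta_1\) (a small sketch):

mod <- lm(mpg ~ wt, data = mtcars)
# Difference in predictions when wt increases from 3 to 4 equals the estimated
# coefficient of wt (here wt is the only covariate, so nothing else is held fixed)
predict(mod, newdata = data.frame(wt = 4)) -
  predict(mod, newdata = data.frame(wt = 3))
coef(mod)["wt"]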

Standard errors of coefficients

Recall that the standard deviation for each \(\hat\beta_j\) (from the normal distribution in property 2) is given by \[ \text{SD}(\hat\beta_j) = \sqrt{\sigma^2 v_{jj}} \] where \(v_{jj}\) is the \((j+1,j+1)\)th element of the matrix \((\mathbf X^\top\mathbf X)^{-1}\). Notice that we do not know the true value of \(\sigma\), and therefore we do not know the standard deviation of the coefficients either.

The standard error for each \(\hat\beta_j\) is given by \[ \text{SE}(\hat\beta_j) = \sqrt{\hat\sigma^2 v_{jj}} \] where \(\hat\sigma^2\) is the unbiased estimator for \(\sigma^2\) given in property 3 above. Essentially, by replacing the unknown value \(\sigma^2\) with its estimator, we get an estimate of the SD, which we call the standard error.
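These standard errors can be reproduced by hand for the mpg ~ wt model fitted in the Example below (a sketch using base R only):

mod <- lm(mpg ~ wt, data = mtcars)
X <- model.matrix(mod)
sigma2_hat <- sum(residuals(mod)^2) / df.residual(mod)   # unbiased estimate of sigma^2
sqrt(sigma2_hat * diag(solve(t(X) %*% X)))   # matches the "Std. Error" column of summary(mod)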

Coefficient of determination

Introduce the decomposition \[ \overbrace{\sum_{i=1}^n (y_i - \bar y)^2}^{\text{Total SS}} = \overbrace{\sum_{i=1}^n (\hat y_i - \bar y)^2}^{\text{Regression SS}} + \overbrace{\sum_{i=1}^n (y_i - \hat y_i)^2}^{\text{Residual SS}} \]

The term on the LHS is the Total Sum of Squares, which represents the total variation in the data (responses). The first term on the RHS is called the Regression Sum of Squares, which represents the variation in the regression model. The second term on the RHS we have seen before, called the Residual Sum of Squares, which measures variability between predicted and observed values.

Define the coefficient of determination as \[ R^2 = \frac{\text{Regression SS}}{\text{Total SS}} = 1 - \frac{\text{Residual SS}}{\text{Total SS}} \] \(R^2\) takes values between 0 and 1. \(100R^2\) is the percentage of the total variation in \(\{y_i \}\) explained by all the regressors. Therefore, the closer \(R^2\) is to 1, the better the model agreement is.

Sometimes, the adjusted \(R^2\) value is used instead, because it has nice distributional properties (for hypothesis testing) \[ R^2_{\text{adj}} = 1 - \frac{\text{Residual SS} / (n-p-1)}{\text{Total SS} / (n-1)} \]

Roughly speaking, the two give similar values.
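Both quantities are easy to compute from the sums of squares; here is a sketch for the mpg ~ wt model fitted in the Example below:

mod <- lm(mpg ~ wt, data = mtcars)
rss <- sum(residuals(mod)^2)                   # Residual SS
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # Total SS
1 - rss / tss                                              # R^2 (about 0.75)
1 - (rss / df.residual(mod)) / (tss / (nrow(mtcars) - 1))  # adjusted R^2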

Tests for single coefficients

For each coefficient \(\beta_j\), where \(j=0,1,\dots,p\), \[ \frac{\hat\beta_j - \beta_j}{\text{SE}(\hat{\beta_j})} \sim t_{n-p-1} \]

We can then use this fact to test the hypothesis \[ \begin{align} \text{H}_0&: \beta_j = b_j \\ \text{H}_1&: \beta_j \neq b_j \end{align} \]

Let \[ T = \frac{\hat\beta_j - b_j}{\text{SE}(\hat{\beta_j})} \] We reject the null hypothesis at the level \(\alpha\) against the alternative hypothesis if \(\vert T \vert > t_{n-p-1}(\alpha/2)\), where \(t_k(\alpha)\) is the top \(\alpha\)-point of the \(t_k\) distribution.

Note we can also test \(\text{H}_0\) against the alternatives

  • \(\text{H}_1: \beta_j > b_j\), and we reject the null hypothesis if \(T > t_{n-p-1}(\alpha)\); and
  • \(\text{H}_1: \beta_j < b_j\), and we reject the null hypothesis if \(T < -t_{n-p-1}(\alpha)\).

The \((1-\alpha)\) confidence interval for \(\beta_j\) is \[ \hat\beta_j \pm t_{n-p-1}(\alpha/2)\cdot \text{SE}(\hat\beta_j) \]
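As a sketch, the t statistic for \(\text{H}_0: \beta_1 = 0\) and the 95% confidence interval for \(\beta_1\) in the mpg ~ wt model (fitted in the Example below) can be computed by hand and compared with R's built-in output:

mod <- lm(mpg ~ wt, data = mtcars)
est <- coef(summary(mod))["wt", "Estimate"]
se  <- coef(summary(mod))["wt", "Std. Error"]
(est - 0) / se                                           # t statistic, as in summary(mod)
est + c(-1, 1) * qt(0.975, df = df.residual(mod)) * se   # matches confint(mod)["wt", ]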

Remarks:

  • When \(n\) is large, then the \(t\)-test becomes approximately equivalent to the \(Z\)-test / Wald test (using the normal distribution).
  • We are not usually interested in testing the intercept. In most practical data applications, the intercept is not likely to be zero as it represents the ‘average value’ of the response when all covariates are zero.

Tests for all zero-regression coefficients

Consider the hypothesis \[ \begin{align} \text{H}_0&: \beta_1 = \beta_2 = \cdots = \beta_p = 0 \\ \text{H}_1&: \text{Not all } \beta_1,\dots,\beta_p \text{ are } 0 \end{align} \]

Let \[ \begin{align} T = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2 / p}{\sum_{i=1}^n (y_i - \hat y_i)^2 / (n-p-1)} &= \frac{n-p-1}{p}\cdot \frac{\text{Regression SS}}{\text{Residual SS}} \\ &= \frac{n-p-1}{p}\cdot\frac{R^2}{1-R^2} \end{align} \] We reject the null hypothesis at the \(\alpha\) significance level if \(T > F_{p,n-p-1}(\alpha)\).

This has links to the ANOVA table which you may be familiar with:

| Source | d.f. | Sum of squares | Mean SS | F-statistic |
|---|---|---|---|---|
| Regressors | \(p\) | \(\sum_{i=1}^n (\hat y_i - \bar y)^2\) | \(\frac{\sum_{i=1}^n (\hat y_i - \bar y)^2}{p}\) | \(\frac{(n-p-1) \sum_{i=1}^n (\hat y_i - \bar y)^2 }{p\sum_{i=1}^n (y_i - \hat y_i)^2}\) |
| Residual | \(n-p-1\) | \(\sum_{i=1}^n (y_i - \hat y_i)^2\) | \(\frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{n-p-1}\) | |
| Total | \(n-1\) | \(\sum_{i=1}^n (y_i - \bar y)^2\) | | |
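For the mpg ~ wt model fitted in the Example below (\(p = 1\), \(n = 32\)), the F statistic can be recovered from \(R^2\), and anova() prints the corresponding table (a sketch):

mod <- lm(mpg ~ wt, data = mtcars)
r2 <- summary(mod)$r.squared
(df.residual(mod) / 1) * r2 / (1 - r2)   # F statistic, about 91.38
anova(mod)                               # ANOVA table for the fitted model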

Standardized residuals

Let \(\mathbf H = \mathbf X(\mathbf X^\top\mathbf X)^{-1}\mathbf X^\top\), an \(n\times n\) matrix. This matrix is called the projection matrix or the hat matrix for linear regression. It has several interesting and useful properties, which we will not go into detail now.

For \(i=1,\dots,n\), define the standardized residuals as \[ \hat e_i = \frac{y_i - \hat y_i}{\sqrt{\hat\sigma^2(1-h_{ii})}} \] where \(h_{ii}\) is the \((i,i)\)th entry of the matrix \(\mathbf H\). The standardized residuals then have mean zero and variance one.
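A sketch of this computation for the mpg ~ wt model (fitted in the Example below), compared with R's rstandard():

mod <- lm(mpg ~ wt, data = mtcars)
X <- model.matrix(mod)
H <- X %*% solve(t(X) %*% X) %*% t(X)     # hat matrix
h <- diag(H)                              # leverages, same as hatvalues(mod)
sigma2_hat <- sum(residuals(mod)^2) / df.residual(mod)
residuals(mod) / sqrt(sigma2_hat * (1 - h))   # matches rstandard(mod)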

We use the standardized residuals as a means of diagnosing the fit of the linear model and testing its assumptions. For instance, we can plot the standardized residuals

  • in a qq-plot (to check normality);
  • against covariates (to check homo/heteroscedasticity);
  • against predicted values (to check homo/heteroscedasticity).

Example

Let’s fit a simple linear regression model to the mtcars data set. In R, the command to fit a linear regression model is lm().

mod <- lm(formula = mpg ~ wt, data = mtcars)
summary(mod)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Let’s go through the results one by one.

Call

## Call:
## lm(formula = mpg ~ wt, data = mtcars)

This tells us that the following regression had been fitted: \[ \texttt{mpg} = \beta_0 + \beta_1 \cdot \texttt{wt} + \epsilon \]

Residuals

## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 

This gives us a five-number summary of the distribution of the residuals \(\hat\epsilon_i\). The intention is to give a sense of how well the model fits the data, but it is usually easier to diagnose model fit using residual plots.

# Obtain the individual diagnostic plots as a list of ggplot objects
diag.plots <- lindia::gg_diagnose(mod, plot.all = FALSE)
# Apply a common theme to every plot, and add smoothers to the second and third
diag.plots <- lapply(diag.plots, function(x) x + theme_bw())
diag.plots[[2]] <- diag.plots[[2]] + geom_smooth(se = FALSE)
diag.plots[[3]] <- diag.plots[[3]] + geom_smooth(se = FALSE)
# Arrange the first four plots in a grid
lindia::plot_all(diag.plots[1:4])