Regression analysis is one of the most frequently used statistical techniques. It aims to build an explicit relationship between a response variable, often denoted \(y\), and one or several explanatory variables, often denoted \(x_1,\dots,x_p\).
Alternative nomenclature for the variables:
\(y\) variable | \(x\) variable |
---|---|
Response variable | Explanatory variables |
Dependent variable | Independent variable |
Output variable | Input variable |
 | Covariates |
 | Regressors |
GOAL:
Look at this data set. It contains information about various cars and their related variables (a data frame with 32 observations on 11 (numeric) variables).
- mpg: Miles/(US) gallon
- cyl: Number of cylinders
- disp: Displacement (cu.in.)
- hp: Gross horsepower
- drat: Rear axle ratio
- wt: Weight (1000 lbs)
- qsec: 1/4 mile time
- vs: Engine (0 = V-shaped, 1 = straight)
- am: Transmission (0 = automatic, 1 = manual)
- gear: Number of forward gears
- carb: Number of carburetors
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
Let’s concentrate on two variables: mpg and wt. Plot them:
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
# geom_abline(slope = -6, intercept = 42, linetype = "dashed", col = "red",
# alpha = 0.8) +
# geom_abline(slope = -7, intercept = 39, linetype = "dashed", col = "blue",
# alpha = 0.8) +
theme_bw()
Questions of interest: How does mpg vary against wt?
For a set of response variables \(\mathbf y = \{y_1,\dots,y_n\}\) and corresponding explanatory variables \(\mathbf x_{k} = \{x_{1k},\dots, x_{nk} \}\) for \(k=1,\dots,p\), the linear regression model is given by
\[\begin{gathered} y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i \\ \epsilon_i \sim N(0,\sigma^2) \text{ (iid)} \\ \text{for } i=1,\dots, n \end{gathered}\]
Using matrix notation, we can write this as
\[\begin{gathered} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \\ \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix} \end{gathered}\]
or simply
\[\mathbf y = \mathbf X\boldsymbol\beta + \boldsymbol\epsilon\]
where \(\mathbf y\) is an \(n \times 1\) vector of responses, \(\mathbf X\) is an \(n \times (p+1)\) matrix of observations (sometimes called the design matrix), \(\boldsymbol\beta\) is a \((p+1) \times 1\) vector of coefficients, and \(\boldsymbol\epsilon\) is an \(n \times 1\) vector of errors.
The errors essentially measure the random deviation or random noise that the observation makes from the ‘true’ model.
Note that the model is linear in the parameters \(\beta_0,\dots,\beta_p\). This means we cannot have a model with terms such as \(\beta_1^2\) or \(\beta_k^q\). However, it is fine to have covariates such as \(x_k^2\), \(x_k^3\) or any other transformation of them.
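For instance, a model that is quadratic in a covariate is still a linear model, because it remains linear in the coefficients. A minimal sketch in R, using the mtcars data above purely for illustration (this quadratic model is not part of the analysis below):
# mpg = beta0 + beta1 * wt + beta2 * wt^2 + error: quadratic in wt,
# but still linear in the parameters beta0, beta1, beta2
quad_mod <- lm(mpg ~ wt + I(wt^2), data = mtcars)
coef(quad_mod)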
In order to proceed with either inference or prediction, we need to estimate the model. This means estimating the unknown values of the parameters (collectively called \(\theta\)) of the model, which are
\[\theta = \{\beta_0,\beta_1,\dots,\beta_p,\sigma^2 \}.\]
There are several methods available for estimating the linear regression model. Among them is the least squares method. Essentially, the line that best fits all the data points is the one with the smallest total (squared) error.
Let \(\hat{\boldsymbol\beta}\) be an estimate of \(\boldsymbol\beta\). Note that we use hats to represent estimates of coefficients/parameters. Consider the sum of the squared errors
\[ \begin{aligned} \sum_{i=1}^n \epsilon_i^2 &= \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip})^2 \\ &= \Vert \mathbf y - \mathbf X \boldsymbol\beta \Vert^2 \end{aligned} \]
By definition, the least squares estimator (LSE) for \(\boldsymbol\beta\) minimises the sum of squared errors, i.e.
\[ \hat{\boldsymbol\beta} = \arg\min \Vert \mathbf y - \mathbf X \boldsymbol\beta \Vert^2 \]
To solve for \(\hat{\boldsymbol\beta}\), consider
\[\begin{aligned} \Vert \mathbf y - \mathbf X \boldsymbol\beta \Vert^2 &= \Vert \mathbf y - \mathbf X \hat{\boldsymbol\beta} + \mathbf X \hat{\boldsymbol\beta} - \mathbf X \boldsymbol\beta \Vert^2 \\ &= \Vert \mathbf y - \mathbf X \hat{\boldsymbol\beta} \Vert^2 + \Vert \mathbf X (\hat{\boldsymbol\beta} - \boldsymbol\beta) \Vert^2 + 2(\hat{\boldsymbol\beta} - \boldsymbol\beta)^\top\mathbf X^\top (\mathbf y - \mathbf X \hat{\boldsymbol\beta} ) \\ \end{aligned}\]
Now if we choose \(\hat{\boldsymbol\beta}\) such that \[\mathbf X^\top (\mathbf y - \mathbf X \hat{\boldsymbol\beta} ) = \mathbf 0\] then we have
\[\begin{aligned} \Vert \mathbf y - \mathbf X \boldsymbol\beta \Vert^2 &= \Vert \mathbf y - \mathbf X \hat{\boldsymbol\beta} \Vert^2 + \Vert \mathbf X (\hat{\boldsymbol\beta} - \boldsymbol\beta) \Vert^2 \\ &\geq \Vert \mathbf y - \mathbf X \hat{\boldsymbol\beta} \Vert^2 \end{aligned}\]
for any \(\boldsymbol\beta\), so this choice of \(\hat{\boldsymbol\beta}\) satisfies the least squares condition. Therefore, the LSE for \(\boldsymbol\beta\) must satisfy
\[\begin{aligned} \mathbf X^\top (\mathbf y - \mathbf X \hat{\boldsymbol\beta} ) &= 0 \\ \mathbf X^\top\mathbf y - \mathbf X^\top\mathbf X \hat{\boldsymbol\beta} &= 0 \\ \mathbf X^\top\mathbf X \hat{\boldsymbol\beta} &= \mathbf X^\top\mathbf y\\ \hat{\boldsymbol\beta} &= (\mathbf X^\top\mathbf X )^{-1} \mathbf X^\top\mathbf y \end{aligned}\]
provided \(\mathbf X^\top\mathbf X\) is not singular, i.e. it is invertible.
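As a quick check of the normal equations, here is a minimal sketch in R that computes \(\hat{\boldsymbol\beta}\) by hand for the mtcars variables and compares it with lm() (the object names X, y and beta_hat are just for illustration):
# Design matrix with an intercept column and the single covariate wt
X <- cbind(1, mtcars$wt)
y <- mtcars$mpg
# Solve the normal equations X'X beta_hat = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
beta_hat
# The same estimates as lm()
coef(lm(mpg ~ wt, data = mtcars))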
With the knowledge of the LSE for the coefficients, and assuming that the errors are zero-meaned variables, an estimate of the variance of the errors is given by \[ \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n (y_i-\hat\beta_0-\hat\beta_1x_{i1} - \cdots -\hat\beta_px_{ip})^2 \] but usually we will use another estimator for \(\sigma^2\), as we will see later.
Let \(\hat y_i = \hat\beta_0+\hat\beta_1x_{i1} + \cdots +\hat\beta_px_{ip}\), the predicted value for the \(i\)’th observation using estimates \(\hat{\boldsymbol\beta}\). Residuals are defined to be the difference between the observed value and the predicted value,
\[\hat\epsilon_i = y_i - \hat y_i\]
If the model is correct, then the residuals should behave like random noise!
The LSE for \(\boldsymbol\beta\) turns out to be the maximum likelihood estimator for \(\boldsymbol\beta\) too. This is easy to see:
The likelihood under the normal linear model is
\[L(\boldsymbol\beta,\sigma^2) = (2\pi\sigma^2)^{-n/2} \exp \left\{ -\frac{1}{2\sigma^2} \Vert \mathbf y - \mathbf X \boldsymbol\beta \Vert^2 \right\}\]
and since \(\Vert \mathbf y - \mathbf X {\boldsymbol\beta} \Vert^2 \geq \Vert \mathbf y - \mathbf X \hat{\boldsymbol\beta} \Vert^2\) for any \(\boldsymbol\beta\), the likelihood (for any fixed \(\sigma^2\)) is maximised at the LSE as well.
The estimator \(\hat{\boldsymbol\beta}\) is normally distributed with mean and variance given by \[ \hat{\boldsymbol\beta} \sim \text{N}_{p+1}(\boldsymbol\beta, \sigma^2 (\mathbf X^\top \mathbf X)^{-1}) \] which implies that the estimators for the coefficients are unbiased.
The unbiased estimator for \(\sigma^2\) is given by the formula \[\hat\sigma^2 = \frac{ \Vert \mathbf y - \mathbf X \hat{ \boldsymbol\beta} \Vert^2}{n-p-1}\] and the scaled quantity \((n-p-1)\hat\sigma^2/\sigma^2\) is distributed according to a \(\chi^2\) distribution with \(n-p-1\) degrees of freedom.
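Continuing the sketch above (reusing the illustrative objects X, y and beta_hat from the previous chunk), the unbiased estimate of \(\sigma^2\) can be computed directly:
n <- nrow(X)
p <- ncol(X) - 1                  # number of covariates, excluding the intercept
resid_vec <- y - X %*% beta_hat   # residuals y - X beta_hat
sigma2_hat <- sum(resid_vec^2) / (n - p - 1)
sqrt(sigma2_hat)                  # matches the residual standard error reported by summary(lm(...))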
\(\hat{\boldsymbol\beta}\) and \(\hat\sigma^2\) are independent of each other.
The residuals have mean zero, since \(\text{E}(\mathbf y) = \mathbf X\boldsymbol\beta\) and \(\text{E}(\hat{\boldsymbol\beta}) = \boldsymbol\beta\): \[ \begin{aligned} \text{E}(\hat{\boldsymbol\epsilon}) &= \text{E}(\mathbf y - \mathbf X \hat{\boldsymbol\beta}) \\ &= \text{E}(\mathbf y) - \mathbf X\,\text{E}(\hat{\boldsymbol\beta}) \\ &= \mathbf X \boldsymbol\beta -\mathbf X \boldsymbol\beta \\ &= \mathbf 0 \end{aligned} \]
The coefficient \(\beta_j\) represents the ‘strength’ or influence of the variable \(x_j\) on \(y\). It is the effect on \(y\) of changing \(x_j\) by a single unit, holding the other covariates fixed.
Consider the effect of the coefficient \(\beta_1\). Let \(y^{(0)} =\beta_0+\beta_1x_{1} + \cdots +\beta_px_{p}\), and also let \(y^{(1)} =\beta_0+\beta_1(x_{1}+1) + \cdots +\beta_px_{p}\). Since everything else cancels out, the difference between \(y^{(1)}\) and \(y^{(0)}\) is simply \[ y^{(1)} - y^{(0)} = \beta_1(x_{1}+1) - \beta_1x_{1} = \beta_1 \]
As we can see, \(\beta_1\) represents the change in the response variable when the variable \(x_1\) is increased by one unit, and keeping all other variables fixed. A positive value for the coefficient imparts a change in the positive direction in the response, and vice versa. Of course, the logic is the same for all of the other variables \(j=2,3,\dots,p\).
Recall that the standard deviation for each \(\hat\beta_j\) (from the normal distribution in property 2) is given by \[ \text{SD}(\hat\beta_j) = \sqrt{\sigma^2 v_{jj}} \] where \(v_{jj}\) is the \((j+1,j+1)\)th element of the matrix \((\mathbf X^\top\mathbf X)^{-1}\). Notice that we do not know the true value of \(\sigma\), and therefore we do not know the standard deviation of the coefficients either.
The standard error for each \(\hat\beta_j\) is given by \[ \text{SE}(\hat\beta_j) = \sqrt{\hat\sigma^2 v_{jj}} \] where \(\hat\sigma^2\) is the unbiased estimator for \(\sigma^2\) given in property 3 above. Essentially, by replacing the unknown value \(\sigma^2\) with its estimator, we get an estimate of the SD, which we call the standard error.
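A sketch of how the standard errors can be recovered from \((\mathbf X^\top\mathbf X)^{-1}\) in R, reusing the illustrative objects X and sigma2_hat from the chunks above:
XtX_inv <- solve(t(X) %*% X)                 # (X'X)^{-1}
se_beta <- sqrt(sigma2_hat * diag(XtX_inv))  # SE(beta_hat_j) = sqrt(sigma2_hat * v_jj)
se_beta                                      # matches the Std. Error column of summary(lm(...))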
Introduce the decomposition
\[ \overbrace{\sum_{i=1}^n (y_i - \bar y)^2}^{\text{Total SS}} = \overbrace{\sum_{i=1}^n (\hat y_i - \bar y)^2}^{\text{Regression SS}} + \overbrace{\sum_{i=1}^n (y_i - \hat y_i)^2}^{\text{Residual SS}} \]
The term on the LHS is the Total Sum of Squares, which represents the total variation in the data (responses). The first term on the RHS is called the Regression Sum of Squares, which represents the variation in the regression model. The second term on the RHS we have seen before, called the Residual Sum of Squares, which measures variability between predicted and observed values.
Define the coefficient of determination, as \[ R^2 = \frac{\text{Regression SS}}{\text{Total SS}} = 1 - \frac{\text{Residual SS}}{\text{Total SS}} \] \(R^2\) takes values between 0 and 1. \(100R^2\) is the percentage of the total variation in \(\{y_i \}\) explained by all the regressors. Therefore, the closer \(R^2\) is to 1, the better the model agreement is.
Sometimes, the adjusted \(R^2\) value is used instead, because it has nice distributional properties (for hypothesis testing) \[ R^2_{\text{adj}} = 1 - \frac{\text{Residual SS} / (n-p-1)}{\text{Total SS} / (n-1)} \]
Roughly speaking, both give similar values.
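A sketch of the sum-of-squares decomposition and both \(R^2\) values in R, reusing the illustrative objects y, X, beta_hat, n and p from the chunks above:
y_hat <- as.vector(X %*% beta_hat)   # fitted values
total_ss <- sum((y - mean(y))^2)
reg_ss   <- sum((y_hat - mean(y))^2)
resid_ss <- sum((y - y_hat)^2)
r2     <- 1 - resid_ss / total_ss
r2_adj <- 1 - (resid_ss / (n - p - 1)) / (total_ss / (n - 1))
c(r2, r2_adj)   # compare with Multiple and Adjusted R-squared in summary(lm(...))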
For each coefficient \(\beta_j\), where \(j=0,1,\dots,p\), \[ \frac{\hat\beta_j - \beta_j}{\text{SE}(\hat{\beta_j})} \sim t_{n-p-1} \]
We can then use this fact to test the hypothesis \[ \begin{aligned} \text{H}_0&: \beta_j = b_j \\ \text{H}_1&: \beta_j \neq b_j \end{aligned} \]
Let \[ T = \frac{\hat\beta_j - b_j}{\text{SE}(\hat{\beta_j})} \] We reject the null hypothesis at the level \(\alpha\) against the alternative hypothesis if \(\vert T \vert > t_{n-p-1}(\alpha/2)\), where \(t_k(\alpha)\) is the top \(\alpha\)-point of the \(t_k\) distribution.
Note that we can also test \(\text{H}_0\) against the one-sided alternatives \(\text{H}_1: \beta_j > b_j\) or \(\text{H}_1: \beta_j < b_j\), with the rejection region adjusted accordingly.
The \((1-\alpha)\) confidence interval for \(\beta_j\) is \[ \hat\beta_j \pm t_{n-p-1}(\alpha/2)\cdot \text{SE}(\hat\beta_j) \]
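A sketch of the t-test for \(\text{H}_0: \beta_1 = 0\) and the corresponding 95% confidence interval computed by hand, reusing the illustrative objects beta_hat, se_beta, n and p from the chunks above:
t_stat <- beta_hat[2] / se_beta[2]                                 # test H0: beta_1 = 0
p_val  <- 2 * pt(abs(t_stat), df = n - p - 1, lower.tail = FALSE)  # two-sided p-value
c(t_stat, p_val)
# 95% confidence interval for beta_1
beta_hat[2] + c(-1, 1) * qt(0.975, df = n - p - 1) * se_beta[2]
# Compare with the built-in version
confint(lm(mpg ~ wt, data = mtcars))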
Remarks:
Consider the hypothesis \[ \begin{aligned} \text{H}_0&: \beta_1 = \beta_2 = \cdots = \beta_p = 0 \\ \text{H}_1&: \text{Not all } \beta_1,\dots,\beta_p \text{ are } 0 \end{aligned} \]
Let \[ \begin{aligned} T = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2 / p}{\sum_{i=1}^n (y_i - \hat y_i)^2 / (n-p-1)} &= \frac{n-p-1}{p}\cdot \frac{\text{Regression SS}}{\text{Residual SS}} \\ &= \frac{n-p-1}{p}\cdot\frac{R^2}{1-R^2} \end{aligned} \] We reject the null hypothesis at the \(\alpha\) significance level if \(T > F_{p,n-p-1}(\alpha)\).
This has links to the ANOVA table which you may be familiar with:
Source | d.f. | Sum of squares | Mean SS | F-statistic |
---|---|---|---|---|
Regressors | \(p\) | \(\sum_{i=1}^n (\hat y_i - \bar y)^2\) | \(\frac{\sum_{i=1}^n (\hat y_i - \bar y)^2}{p}\) | \(\frac{(n-p-1) \sum_{i=1}^n (\hat y_i - \bar y)^2 }{p\sum_{i=1}^n (y_i - \hat y_i)^2}\) |
Residual | \(n-p-1\) | \(\sum_{i=1}^n (y_i - \hat y_i)^2\) | \(\frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{n-p-1}\) | |
Total | \(n-1\) | \(\sum_{i=1}^n (y_i - \bar y)^2\) | | |
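A sketch of the overall F-statistic computed from \(R^2\), reusing the illustrative objects r2, n and p from the chunks above:
F_stat <- (n - p - 1) / p * r2 / (1 - r2)
F_stat   # matches the F-statistic reported by summary(lm(...))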
Let \(\mathbf H = \mathbf X(\mathbf X^\top\mathbf X)^{-1}\mathbf X^\top\), an \(n\times n\) matrix. This matrix is called the projection matrix or the hat matrix for linear regression. It has several interesting and useful properties, which we will not go into in detail here.
For \(i=1,\dots,n\), define the standardized residuals as \[ \hat e_i = \frac{y_i - \hat y_i}{\sqrt{\hat\sigma^2(1-h_{ii})}} \] where \(h_{ii}\) is the \((i,i)\)th entry of the matrix \(\mathbf H\). The standardized residuals then have mean zero and variance one.
We use the standardized residuals as a means of diagnosing the fit of the linear model, and to test the assumptions of our linear model. For instance, we can plot the standardized residuals
- in a qq-plot (to test for normality)
- against covariates (to test for homo/heteroscedasticity)
- against predicted values (to test for homo/heteroscedasticity)
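A sketch of the hat matrix and the standardized residuals, compared against R’s built-in rstandard(), reusing the illustrative objects X, y, y_hat and sigma2_hat from the chunks above:
H <- X %*% solve(t(X) %*% X) %*% t(X)        # hat matrix
h <- diag(H)                                 # leverages h_ii
std_resid <- (y - y_hat) / sqrt(sigma2_hat * (1 - h))
head(cbind(std_resid, rstandard(lm(mpg ~ wt, data = mtcars))))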
Let’s fit a simple linear regression model to the mtcars data set. In R, the command to fit a linear regression model is lm().
mod <- lm(formula = mpg ~ wt, data = mtcars)
summary(mod)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
Let’s go through the results one by one.
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
This tells us that the following regression had been fitted: \[ \texttt{mpg} = \beta_0 + \beta_1 \cdot \texttt{wt} + \epsilon \]
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
This gives us a five-number summary of the distribution of the residuals \(\hat\epsilon_i\). The intention is to convey how well the model fits the data, but it is easier to diagnose model fit using residual plots.
diag.plots <- lindia::gg_diagnose(mod, plot.all = FALSE)
diag.plots <- lapply(diag.plots, function(x) x + theme_bw())
diag.plots[[2]] <- diag.plots[[2]] + geom_smooth(se = FALSE)
diag.plots[[3]] <- diag.plots[[3]] + geom_smooth(se = FALSE)
lindia::plot_all(diag.plots[1:4])
In this case, we see some kind of curved relationship between the residuals and the fitted values (and likewise against wt).
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This piece of information tells us that the model has been estimated to be
\[ \texttt{mpg} = 37.3 - 5.3 \, \texttt{wt} \]
The table also shows individual hypothesis tests of significance (i.e. testing whether each coefficient is zero) for each of the coefficients. We find that both \(\beta_0\) and \(\beta_1\) are statistically significant, with both tests strongly rejecting the null hypotheses that \(\beta_0 = 0\) (\(p\)-value \(< 2\times 10^{-16}\)) and \(\beta_1 = 0\) (\(p\)-value \(= 1.29\times 10^{-10}\)).
What is the interpretation of the coefficient for wt? Firstly, notice that it is negative, which means that there is an inverse relationship between the explanatory variable wt and the response variable mpg. This makes sense, because the heavier a car is, the less fuel efficient it is. For every unit increase in wt (given in 1000 lbs), we see on average a decrease of 5.3 miles per gallon.
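As an illustration, the fitted model can be used to predict mpg for a car of a given weight (the 3,000 lb figure below is just a made-up example):
# Predicted mpg for a hypothetical car weighing 3,000 lbs (wt = 3)
predict(mod, newdata = data.frame(wt = 3))
# With a 95% confidence interval for the mean response
predict(mod, newdata = data.frame(wt = 3), interval = "confidence")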
## Residual standard error: 3.046 on 30 degrees of freedom
This gives the estimate \(\hat\sigma\) of \(\sigma\), otherwise known as the residual standard error. To understand the name, take a look at the formula: it is built from the residuals. As mentioned above, the associated (scaled) estimator \(\hat\sigma^2\) has a \(\chi^2\) distribution with \(n-p-1=30\) degrees of freedom.
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
As the names state, these are the values of \(R^2\) and \(R_{\text{adj}}^2\) respectively.
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
This is the piece of information which relates to testing whether all the coefficients are simultaneously zero. The \(F\)-statistic, \(T=91.38\), is calculated using the formula given above and is compared against the \(F_{1,30}\) distribution. It gives a \(p\)-value of \(1.294\times 10^{-10}\), so the null hypothesis is strongly rejected, and we conclude that at least one of the coefficients is non-zero.
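As a quick sanity check (a sketch only), the reported \(p\)-value can be reproduced from the \(F_{1,30}\) distribution:
pf(91.38, df1 = 1, df2 = 30, lower.tail = FALSE)  # approximately 1.294e-10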
Using the information from the linear regression, one can produce the following table:
knitr::kable(fortify(mod)[, -c(3:5)])
mpg | wt | .fitted | .resid | .stdresid | |
---|---|---|---|---|---|
Mazda RX4 | 21.0 | 2.620 | 23.282611 | -2.2826106 | -0.7661677 |
Mazda RX4 Wag | 21.0 | 2.875 | 21.919770 | -0.9197704 | -0.3074305 |
Datsun 710 | 22.8 | 2.320 | 24.885952 | -2.0859521 | -0.7057525 |
Hornet 4 Drive | 21.4 | 3.215 | 20.102650 | 1.2973499 | 0.4327511 |
Hornet Sportabout | 18.7 | 3.440 | 18.900144 | -0.2001440 | -0.0668188 |
Valiant | 18.1 | 3.460 | 18.793254 | -0.6932545 | -0.2314831 |
Duster 360 | 14.3 | 3.570 | 18.205363 | -3.9053627 | -1.3055222 |
Merc 240D | 24.4 | 3.190 | 20.236262 | 4.1637381 | 1.3888971 |
Merc 230 | 22.8 | 3.150 | 20.450041 | 2.3499593 | 0.7839269 |
Merc 280 | 19.2 | 3.440 | 18.900144 | 0.2998560 | 0.1001080 |
Merc 280C | 17.8 | 3.440 | 18.900144 | -1.1001440 | -0.3672871 |
Merc 450SE | 16.4 | 4.070 | 15.533127 | 0.8668731 | 0.2928865 |
Merc 450SL | 17.3 | 3.730 | 17.350247 | -0.0502472 | -0.0168379 |
Merc 450SLC | 15.2 | 3.780 | 17.083024 | -1.8830236 | -0.6315997 |
Cadillac Fleetwood | 10.4 | 5.250 | 9.226650 | 1.1733496 | 0.4229607 |
Lincoln Continental | 10.4 | 5.424 | 8.296712 | 2.1032876 | 0.7697987 |
Chrysler Imperial | 14.7 | 5.345 | 8.718926 | 5.9810744 | 2.1735331 |
Fiat 128 | 32.4 | 2.200 | 25.527289 | 6.8727113 | 2.3349021 |
Honda Civic | 30.4 | 1.615 | 28.653805 | 1.7461954 | 0.6103569 |
Toyota Corolla | 33.9 | 1.835 | 27.478021 | 6.4219792 | 2.2170827 |
Toyota Corona | 21.5 | 2.465 | 24.111004 | -2.6110037 | -0.8796401 |
Dodge Challenger | 15.5 | 3.520 | 18.472586 | -2.9725862 | -0.9931363 |
AMC Javelin | 15.2 | 3.435 | 18.926866 | -3.7268663 | -1.2441801 |
Camaro Z28 | 13.3 | 3.840 | 16.762355 | -3.4623553 | -1.1627910 |
Pontiac Firebird | 19.2 | 3.845 | 16.735633 | 2.4643670 | 0.8277197 |
Fiat X1-9 | 27.3 | 1.935 | 26.943574 | 0.3564263 | 0.1224441 |
Porsche 914-2 | 26.0 | 2.140 | 25.847957 | 0.1520430 | 0.0517719 |
Lotus Europa | 30.4 | 1.513 | 29.198941 | 1.2010593 | 0.4225427 |
Ford Pantera L | 15.8 | 3.170 | 20.343151 | -4.5431513 | -1.5154971 |
Ferrari Dino | 19.7 | 2.770 | 22.480940 | -2.7809399 | -0.9308693 |
Maserati Bora | 15.0 | 3.570 | 18.205363 | -3.2053627 | -1.0715194 |
Volvo 142E | 21.4 | 2.780 | 22.427495 | -1.0274952 | -0.3438822 |
We can then plot the line of best fit as follows
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
geom_segment(data = fortify(mod),
aes(x = wt, xend = wt, y = mpg, yend = .fitted),
linetype = "dashed") +
theme_bw()
## `geom_smooth()` using formula 'y ~ x'
Consider the following data set, which has observations on four variables
- LVOL: logarithms of weekly sales volume
- PROMP: promotion price
- FEAT: feature advertising
- DISP: display
foods <- as_tibble(read.table("foods.txt", header = TRUE))
foods
## # A tibble: 156 × 4
## LVOL PROMP FEAT DISP
## <dbl> <dbl> <dbl> <dbl>
## 1 14.5 3.52 39.9 21.4
## 2 14.2 3.7 25.8 34.6
## 3 14.3 3.42 23.3 27.4
## 4 14.3 3.55 25.5 25.7
## 5 14.2 3.64 39.2 30.2
## 6 14.0 3.78 13.1 41.7
## 7 14.0 3.86 17.1 37.6
## 8 14.0 3.60 24.5 35.1
## 9 14.1 3.72 15.5 34.2
## 10 14.2 3.52 24.8 29.2
## # … with 146 more rows
Here’s an exploratory plot of all the variables
GGally::ggpairs(foods) + theme_bw()
Here are the results of several regression models fitted to the data set
mod0 <- lm(LVOL ~ 1, foods)
mod1 <- lm(LVOL ~ PROMP, foods)
mod2 <- lm(LVOL ~ PROMP + FEAT, foods)
mod3 <- lm(LVOL ~ PROMP + FEAT + DISP, foods)
mtable("Model 1" = mod1, "Model 2" = mod2, "Model 3" = mod3,
summary.stats = c("sigma", "R-squared", "F", "p", "N",
"Log-likelihood", "Deviance" , "AIC", "BIC"))
##
## Calls:
## Model 1: lm(formula = LVOL ~ PROMP, data = foods)
## Model 2: lm(formula = LVOL ~ PROMP + FEAT, data = foods)
## Model 3: lm(formula = LVOL ~ PROMP + FEAT + DISP, data = foods)
##
## =========================================================
## Model 1 Model 2 Model 3
## ---------------------------------------------------------
## (Intercept) 18.409*** 17.150*** 17.237***
## (0.302) (0.249) (0.249)
## PROMP -1.197*** -0.904*** -0.956***
## (0.087) (0.069) (0.073)
## FEAT 0.010*** 0.010***
## (0.001) (0.001)
## DISP 0.004*
## (0.002)
## ---------------------------------------------------------
## sigma 0.172 0.127 0.125
## R-squared 0.549 0.756 0.763
## F 187.104 236.997 163.425
## p 0.000 0.000 0.000
## N 156 156 156
## Log-likelihood 54.371 102.362 104.752
## Deviance 4.549 2.459 2.385
## AIC -102.743 -196.724 -199.504
## BIC -93.593 -184.525 -184.254
## =========================================================
## Significance: *** = p < 0.001; ** = p < 0.01;
## * = p < 0.05
As a side note,
Consider the ANOVA table for Model 3
knitr::kable(anova(mod3), digits = 3)
Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
---|---|---|---|---|---|
PROMP | 1 | 5.527 | 5.527 | 352.305 | 0.000 |
FEAT | 1 | 2.090 | 2.090 | 133.243 | 0.000 |
DISP | 1 | 0.074 | 0.074 | 4.729 | 0.031 |
Residuals | 152 | 2.385 | 0.016 |
And here are the diagnostic plots
diag.plots <- lindia::gg_diagnose(mod3, plot.all = FALSE)
diag.plots <- lapply(diag.plots, function(x) x + theme_bw())
diag.plots[[2]] <- diag.plots[[2]] + geom_smooth(se = FALSE)
diag.plots[[3]] <- diag.plots[[3]] + geom_smooth(se = FALSE)
diag.plots[[4]] <- diag.plots[[4]] + geom_smooth(se = FALSE)
diag.plots[[5]] <- diag.plots[[5]] + geom_smooth(se = FALSE)
lindia::plot_all(diag.plots[1:6])
QUESTIONS
What is the contribution of DISP to the regression model? In other words, what is the percentage of DISP’s regression sum of squares out of all the regressors’ sum of squares?
Read the article Kobina & Abledu, “Multiple Regression Analysis of the Impact of Senior Secondary School Certificate Examination (SSCE) Scores on the final Cumulative Grade Point Average (CGPA) of Students of Tertiary Institutions in Ghana.” (link here: https://pdfs.semanticscholar.org/68ba/ad469ae2f59f194f6e97f1c3d262fc2c6375.pdf).
In your own words, state the authors’ research question. Are they trying to answer a causal question?
What is their independent variable? What are their dependent variables?
What is the main finding from the paper? In other words, how do the authors answer their research question? What is the main evidence they use to make this claim? Put this in your own words.
When we have data that is non-continuous (e.g. binary data, categorical data, count data, etc.), certain assumptions of the linear model are violated (which ones?). We have to fit a generalised linear model in such cases. Examples include logistic regression, Poisson regression, etc. Find some introductory material on these kinds of regression models.
What happens when we have explanatory variables which are categorical in nature? E.g. Sex, Likert scale responses, political party, etc. Read up on the use of “dummy variables” in regression analysis.
When we have a lot of predictor variables, the model may not adequately fit the data. How do we perform model/variable selection? One way is to do forward/backward selection, otherwise known as a stepwise addition/deletion scheme. Read up on this method.
When we have too many variables, specifically when \(p>n\), the linear regression model becomes mathematically impossible to fit. Why is this?