Regression analysis is one of the most frequently used statistical techniques. It aims to build up an explicit relationship between one response variable, often denoted as \(y\), and one or several explanatory variables, often denoted as \(x_1,\dots,x_p\).
Alternative nomenclature for the variables:
| \(y\) variable | \(x\) variable | 
|---|---|
| Response variable | Explanatory variables | 
| Dependent variable | Independent variable | 
| Output variable | Input variable | 
| | Covariates | 
| | Regressors | 
GOAL:
Look at this data set. It contains information about various cars and their related variables (a data frame with 32 observations on 11 (numeric) variables).
- mpg: Miles/(US) gallon
- cyl: Number of cylinders
- disp: Displacement (cu.in.)
- hp: Gross horsepower
- drat: Rear axle ratio
- wt: Weight (1000 lbs)
- qsec: 1/4 mile time
- vs: Engine (0 = V-shaped, 1 = straight)
- am: Transmission (0 = automatic, 1 = manual)
- gear: Number of forward gears
- carb: Number of carburetors

mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
Let’s concentrate on two variables: mpg and wt. Plot them:
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # geom_abline(slope = -6, intercept = 42, linetype = "dashed", col = "red",
  #             alpha = 0.8) +
  # geom_abline(slope = -7, intercept = 39, linetype = "dashed", col = "blue",
  #             alpha = 0.8) +
  theme_bw()
Questions of interest:

- What is the relationship of mpg against wt?

For a set of responses \(\mathbf y = \{y_1,\dots,y_n\}\) and corresponding explanatory variables \(\mathbf x_{k} = \{x_{1k},\dots, x_{nk} \}\) for \(k=1,\dots,p\), the linear regression model is given by
\[\begin{gathered} y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i \\ \epsilon_i \sim N(0,\sigma^2) \text{ (iid)} \\ \text{for } i=1,\dots, n \end{gathered}\]
Using matrix notation, we can write this as
\[\begin{gathered} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \\ \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix} \end{gathered}\]
or simply
\[\mathbf y = \mathbf X\boldsymbol\beta + \boldsymbol\epsilon\]
where \(\mathbf y\) is an \(n \times 1\) vector of responses, \(\mathbf X\) is an \(n \times (p+1)\) matrix of observations (sometimes called the design matrix), \(\boldsymbol\beta\) is a \((p+1) \times 1\) vector of coefficients, and \(\boldsymbol\epsilon\) is an \(n \times 1\) vector of errors.
The errors essentially measure the random deviation or random noise that the observation makes from the ‘true’ model.
Note that the model is linear in the parameters \(\beta_0,\dots,\beta_p\). This means we cannot have terms like \(\beta_1^2\) or \(\beta_k^q\) in the model. However, it is fine to include the covariates \(x_k^2\), \(x_k^3\), or any transformation of them.
In order to proceed with either inference or prediction, we need to estimate the model. This means estimating the unknown values of the parameters (collectively called \(\theta\)) of the model, which are
\[\theta = \{\beta_0,\beta_1,\dots,\beta_p,\sigma^2 \}.\]
There are several methods available for estimating the linear regression model. Among them is the least squares method. Essentially, the line that fits the best through all the data points should have the smallest total error.
Let \(\hat{\boldsymbol\beta}\) be an estimate of \(\boldsymbol\beta\). Note that we use hats to represent estimates of coefficients/parameters. Consider the sum of the squared errors
\[ \begin{aligned} \sum_{i=1}^n \epsilon_i^2 &= \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip})^2 \\ &= \Vert \mathbf y - \mathbf X \boldsymbol\beta \Vert^2 \end{aligned} \]
By definition, the least squares estimator (LSE) for \(\boldsymbol\beta\) minimises the sum of squared errors, i.e.
\[ \hat{\boldsymbol\beta} = \arg\min \Vert \mathbf y - \mathbf X \boldsymbol\beta \Vert^2 \]
To solve for \(\boldsymbol\beta\), consider
\[\begin{aligned} \Vert \mathbf y - \mathbf X \boldsymbol\beta \Vert^2 &= \Vert \mathbf y - \mathbf X \hat{\boldsymbol\beta} + \mathbf X \hat{\boldsymbol\beta} - \mathbf X \boldsymbol\beta \Vert^2 \\ &= \Vert \mathbf y - \mathbf X \hat{\boldsymbol\beta} \Vert^2 + \Vert \mathbf X (\hat{\boldsymbol\beta} - \boldsymbol\beta) \Vert^2 + 2(\hat{\boldsymbol\beta} - \boldsymbol\beta)^\top\mathbf X^\top (\mathbf y - \mathbf X \hat{\boldsymbol\beta} ) \\ \end{aligned}\]
Now if we choose \(\hat{\boldsymbol\beta}\) such that \[\mathbf X^\top (\mathbf y - \mathbf X \hat{\boldsymbol\beta} ) = \mathbf 0,\] then we have
\[\begin{aligned} \Vert \mathbf y - \mathbf X \boldsymbol\beta \Vert^2 &= \Vert \mathbf y - \mathbf X \hat{\boldsymbol\beta} \Vert^2 + \Vert \mathbf X (\hat{\boldsymbol\beta} - \boldsymbol\beta) \Vert^2 \\ &\geq \Vert \mathbf y - \mathbf X \hat{\boldsymbol\beta} \Vert^2 \end{aligned}\]
for any \(\boldsymbol\beta\), so such a choice of \(\hat{\boldsymbol\beta}\) satisfies the least squares condition. Therefore, the LSE for \(\boldsymbol\beta\) must satisfy
\[\begin{aligned} \mathbf X^\top (\mathbf y - \mathbf X \hat{\boldsymbol\beta} ) &= 0 \\ \mathbf X^\top\mathbf y - \mathbf X^\top\mathbf X \hat{\boldsymbol\beta} &= 0 \\ \mathbf X^\top\mathbf X \hat{\boldsymbol\beta} &= \mathbf X^\top\mathbf y\\ \hat{\boldsymbol\beta} &= (\mathbf X^\top\mathbf X )^{-1} \mathbf X^\top\mathbf y \end{aligned}\]
provided \(\mathbf X^\top\mathbf X\) is not singular, i.e. it is invertible.
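To make this concrete, here is a minimal sketch in R (assuming the mtcars data introduced above, with mpg regressed on wt purely as an illustration) that computes the LSE directly from this formula and compares it with R’s built-in lm():

X <- cbind(1, mtcars$wt)                    # design matrix with an intercept column
y <- mtcars$mpg
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # solves (X'X) beta = X'y
beta_hat
coef(lm(mpg ~ wt, data = mtcars))           # should give the same values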
With the knowledge of the LSE for the coefficients, and assuming that the errors are zero-mean random variables, an estimate of the variance of the errors is given by \[ \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n (y_i-\hat\beta_0-\hat\beta_1x_{i1} - \cdots -\hat\beta_px_{ip})^2 \] but usually we will use another estimator for \(\sigma^2\), as we will see later.
Let \(\hat y_i = \hat\beta_0+\hat\beta_1x_{i1} + \cdots +\hat\beta_px_{ip}\), the predicted value for the \(i\)’th observation using estimates \(\hat{\boldsymbol\beta}\). Residuals are defined to be the difference between the observed value and the predicted value,
\[\hat\epsilon_i = y_i - \hat y_i\]
If the model is correct, then the residuals should behave like random noise!
The LSE for \(\boldsymbol\beta\) turns out to be the maximum likelihood estimator for \(\boldsymbol\beta\) too. This is easy to see:
The likelihood under the normal linear model is
\[L(\boldsymbol\beta,\sigma^2) = (2\pi\sigma^2)^{-n/2} \exp \left\{ -\frac{1}{2\sigma^2} \Vert \mathbf y - \mathbf X \boldsymbol\beta \Vert^2 \right\}\]
and since \(\Vert \mathbf y - \mathbf X {\boldsymbol\beta} \Vert^2 \geq \Vert \mathbf y - \mathbf X \hat{\boldsymbol\beta} \Vert^2\) for any \(\boldsymbol\beta\), the likelihood (for any fixed \(\sigma^2\)) is also maximised at the LSE.
The estimator \(\hat{\boldsymbol\beta}\) is normally distributed with mean and variance given by \[ \hat{\boldsymbol\beta} \sim \text{N}_{p+1}(\boldsymbol\beta, \sigma^2 (\mathbf X^\top \mathbf X)^{-1}) \] which implies that the estimators for the coefficients are unbiased.
The unbiased estimator for \(\sigma^2\) is given by the formula \[\hat\sigma^2 = \frac{ \Vert \mathbf y - \mathbf X \hat{ \boldsymbol\beta} \Vert^2}{n-p-1},\] and \((n-p-1)\hat\sigma^2/\sigma^2\) follows a \(\chi^2\) distribution with \(n-p-1\) degrees of freedom.
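As a sketch (again using the illustrative mpg ~ wt regression on mtcars), this unbiased estimate can be computed directly from the residual sum of squares and checked against the value reported by summary():

fit <- lm(mpg ~ wt, data = mtcars)
n <- nrow(mtcars)
p <- 1                                # one covariate in this illustration
rss <- sum(residuals(fit)^2)          # ||y - X beta_hat||^2
sigma2_hat <- rss / (n - p - 1)
sqrt(sigma2_hat)                      # should match summary(fit)$sigma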
\(\hat{\boldsymbol\beta}\) and \(\hat\sigma^2\) are independent of each other.
The residuals have mean zero \[ \begin{aligned} \text{E}(\hat{\boldsymbol\epsilon}) &= \text{E}(\mathbf y - \mathbf X \hat{\boldsymbol\beta}) \\ &= \text{E}(\mathbf X \boldsymbol\beta - \mathbf X \hat{\boldsymbol\beta}) \\ &= \mathbf X \boldsymbol\beta -\mathbf X \boldsymbol\beta \\ &= \mathbf 0 \end{aligned} \]
The coefficient \(\beta_j\) represents the ‘strength or influence’ of the variable \(x_j\) on \(y\). It is the effect on \(y\) of changing \(x_j\) by a single unit, holding the other covariates fixed.
Consider the effect of the coefficient \(\beta_1\). Let \(y^{(0)} =\beta_0+\beta_1x_{1} + \cdots +\beta_px_{p}\), and also let \(y^{(1)} =\beta_0+\beta_1(x_{1}+1) + \cdots +\beta_px_{p}\). Since everything else cancels out, the difference between \(y^{(1)}\) and \(y^{(0)}\) is simply \[ y^{(1)} - y^{(0)} = \beta_1(x_{1}+1) - \beta_1x_{1} = \beta_1 \]
As we can see, \(\beta_1\) represents the change in the response variable when the variable \(x_1\) is increased by one unit, and keeping all other variables fixed. A positive value for the coefficient imparts a change in the positive direction in the response, and vice versa. Of course, the logic is the same for all of the other variables \(j=2,3,\dots,p\).
Recall that the standard deviation of each \(\hat\beta_j\) (from the normal distribution in property 2) is given by \[ \text{SD}(\hat\beta_j) = \sqrt{\sigma^2 v_{jj}} \] where \(v_{jj}\) is the \((j+1,j+1)\)th element of the matrix \((\mathbf X^\top\mathbf X)^{-1}\). Notice that we do not know the true value of \(\sigma\), and therefore we do not know the standard deviation of the coefficients either.
The standard error for each \(\hat\beta_j\) is given by \[ \text{SE}(\hat\beta_j) = \sqrt{\hat\sigma^2 v_{jj}} \] where \(\hat\sigma^2\) is the unbiased estimator for \(\sigma^2\) given in property 3 above. Essentially, by replacing the unknown value \(\sigma^2\) with its estimator, we get an “estimate” of the SD, which we call the standard error.
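The following sketch (same illustrative mpg ~ wt fit) recovers the standard errors from \(\hat\sigma^2 (\mathbf X^\top\mathbf X)^{-1}\) and compares them with the ones reported by summary():

fit <- lm(mpg ~ wt, data = mtcars)
X <- model.matrix(fit)                              # n x (p+1) design matrix
sigma2_hat <- sum(residuals(fit)^2) / df.residual(fit)
V <- sigma2_hat * solve(t(X) %*% X)                 # estimated covariance of beta_hat
sqrt(diag(V))                                       # standard errors
summary(fit)$coefficients[, "Std. Error"]           # should agree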
Introduce the decomposition
\[ \overbrace{\sum_{i=1}^n (y_i - \bar y)^2}^{\text{Total SS}} = \overbrace{\sum_{i=1}^n (\hat y_i - \bar y)^2}^{\text{Regression SS}} + \overbrace{\sum_{i=1}^n (y_i - \hat y_i)^2}^{\text{Residual SS}} \]
The term on the LHS is the Total Sum of Squares, which represents the total variation in the data (responses). The first term on the RHS is called the Regression Sum of Squares, which represents the variation in the regression model. The second term on the RHS we have seen before, called the Residual Sum of Squares, which measures variability between predicted and observed values.
Define the coefficient of determination, as \[ R^2 = \frac{\text{Regression SS}}{\text{Total SS}} = 1 - \frac{\text{Residual SS}}{\text{Total SS}} \] \(R^2\) takes values between 0 and 1. \(100R^2\) is the percentage of the total variation in \(\{y_i \}\) explained by all the regressors. Therefore, the closer \(R^2\) is to 1, the better the model agreement is.
Sometimes, the adjusted \(R^2\) value is used instead, because it has nice distributional properties (for hypothesis testing) \[ R^2_{\text{adj}} = 1 - \frac{\text{Residual SS} / (n-p-1)}{\text{Total SS} / (n-1)} \]
Roughly speaking, both usually give similar values.
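Here is a short sketch (same illustrative mpg ~ wt fit) computing both quantities from the sums of squares:

fit <- lm(mpg ~ wt, data = mtcars)
y <- mtcars$mpg
tss <- sum((y - mean(y))^2)                                   # Total SS
rss <- sum(residuals(fit)^2)                                  # Residual SS
r2 <- 1 - rss / tss
r2_adj <- 1 - (rss / df.residual(fit)) / (tss / (length(y) - 1))
c(r2, r2_adj)   # compare with summary(fit)$r.squared and summary(fit)$adj.r.squared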
For each coefficient \(\beta_j\), where \(j=0,1,\dots,p\), \[ \frac{\hat\beta_j - \beta_j}{\text{SE}(\hat{\beta_j})} \sim t_{n-p-1} \]
We can then use this fact to test the hypothesis \[ \begin{aligned} \text{H}_0&: \beta_j = b_j \\ \text{H}_1&: \beta_j \neq b_j \end{aligned} \]
Let \[ T = \frac{\hat\beta_j - b_j}{\text{SE}(\hat{\beta_j})} \] We reject the null hypothesis at the level \(\alpha\) against the alternative hypothesis if \(\vert T \vert > t_{n-p-1}(\alpha/2)\), where \(t_k(\alpha)\) is the top \(\alpha\)-point of the \(t_k\) distribution.
Note that we can also test \(\text{H}_0\) against the one-sided alternatives \(\text{H}_1: \beta_j > b_j\) or \(\text{H}_1: \beta_j < b_j\).
The \((1-\alpha)\) confidence interval for \(\beta_j\) is \[ \hat\beta_j \pm t_{n-p-1}(\alpha/2)\cdot \text{SE}(\hat\beta_j) \]
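For illustration (same mpg ~ wt fit, taking \(b_j = 0\) for each coefficient), the t-statistics, p-values and confidence intervals can be computed as follows:

fit <- lm(mpg ~ wt, data = mtcars)
est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]
tval <- (est - 0) / se                                    # T with b_j = 0
pval <- 2 * pt(abs(tval), df.residual(fit), lower.tail = FALSE)
cbind(tval, pval)                                         # matches the summary() table
confint(fit, level = 0.95)                                # 95% confidence intervals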
Remarks:
Consider the hypothesis \[ \begin{aligned} \text{H}_0&: \beta_1 = \beta_2 = \cdots = \beta_p = 0 \\ \text{H}_1&: \text{Not all } \beta_1,\dots,\beta_p \text{ are } 0 \end{aligned} \]
Let \[ \begin{aligned} T = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2 / p}{\sum_{i=1}^n (y_i - \hat y_i)^2 / (n-p-1)} &= \frac{n-p-1}{p}\cdot \frac{\text{Regression SS}}{\text{Residual SS}} \\ &= \frac{n-p-1}{p}\cdot\frac{R^2}{1-R^2} \end{aligned} \] We reject the null hypothesis at the \(\alpha\) significance level if \(T > F_{p,n-p-1}(\alpha)\).
This has links to the ANOVA table which you may be familiar with:
| Source | d.f. | Sum of squares | Mean SS | F-statistic | 
|---|---|---|---|---|
| Regressors | \(p\) | \(\sum_{i=1}^n (\hat y_i - \bar y)^2\) | \(\frac{\sum_{i=1}^n (\hat y_i - \bar y)^2}{p}\) | \(\frac{(n-p-1) \sum_{i=1}^n (\hat y_i - \bar y)^2 }{p \sum_{i=1}^n (y_i - \hat y_i)^2}\) | 
| Residual | \(n-p-1\) | \(\sum_{i=1}^n (y_i - \hat y_i)^2\) | \(\frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{n-p-1}\) | |
| Total | \(n-1\) | \(\sum_{i=1}^n (y_i - \bar y)^2\) | | |
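To connect the formula and the table, here is a sketch (same illustrative mpg ~ wt fit, so \(p=1\)) computing the \(F\)-statistic from the sums of squares, from \(R^2\), and its p-value:

fit <- lm(mpg ~ wt, data = mtcars)
y <- mtcars$mpg; n <- length(y); p <- 1
reg_ss <- sum((fitted(fit) - mean(y))^2)         # Regression SS
res_ss <- sum(residuals(fit)^2)                  # Residual SS
Fstat <- (reg_ss / p) / (res_ss / (n - p - 1))
Fstat                                            # same F value as in anova(fit)
r2 <- summary(fit)$r.squared
(n - p - 1) / p * r2 / (1 - r2)                  # identical, via the R^2 formula
pf(Fstat, p, n - p - 1, lower.tail = FALSE)      # p-value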
Let \(\mathbf H = \mathbf X(\mathbf X^\top\mathbf X)^{-1}\mathbf X^\top\), an \(n\times n\) matrix. This matrix is called the projection matrix or the hat matrix for linear regression. It has several interesting and useful properties, which we will not go into detail now.
For \(i=1,\dots,n\), define the standardized residuals as \[ \hat e_i = \frac{y_i - \hat y_i}{\sqrt{\hat\sigma^2(1-h_{ii})}} \] where \(h_{ii}\) is the \((i,i)\)th entry of the matrix \(\mathbf H\). The standardized residuals then have mean zero and variance one.
We use the standardized residuals as a means of diagnosing the fit of the linear model, and to test assumptions of our linear model. For instance, we can plot the standardized residuals

- in a QQ-plot (to test for normality)
- against covariates (to test for homo/heteroscedasticity)
- against predicted values (to test for homo/heteroscedasticity)
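As a sketch (same illustrative mpg ~ wt fit), the hat matrix and the standardized residuals can be computed by hand and compared with R’s built-in rstandard():

fit <- lm(mpg ~ wt, data = mtcars)
X <- model.matrix(fit)
H <- X %*% solve(t(X) %*% X) %*% t(X)                 # hat matrix
h <- diag(H)                                          # leverages h_ii, same as hatvalues(fit)
sigma2_hat <- sum(residuals(fit)^2) / df.residual(fit)
std_res <- residuals(fit) / sqrt(sigma2_hat * (1 - h))
all.equal(unname(std_res), unname(rstandard(fit)))    # should be TRUE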
Let’s fit a simple linear regression model to the mtcars data set. In R, the command to fit a linear regression model is lm().
mod <- lm(formula = mpg ~ wt, data = mtcars)
summary(mod)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
Let’s go through the results one by one.
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
This tells us that the following regression had been fitted: \[ \texttt{mpg} = \beta_0 + \beta_1 \cdot \texttt{wt} + \epsilon \]
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
This gives us a five-number summary of the distribution of the residuals \(\hat\epsilon_i\). The intention is to tell us how well the model fits the data, but it is usually easier to diagnose model fit using residual plots.
diag.plots <- lindia::gg_diagnose(mod, plot.all = FALSE)
diag.plots <- lapply(diag.plots, function(x) x + theme_bw())
diag.plots[[2]] <- diag.plots[[2]] + geom_smooth(se = FALSE)
diag.plots[[3]] <- diag.plots[[3]] + geom_smooth(se = FALSE)
lindia::plot_all(diag.plots[1:4])
In this case, we see some kind of curved relationship between residuals and fitted values / wt.
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This piece of information tells us that the model has been estimated to be
\[ \texttt{mpg} = 37.3 - 5.3 \, \texttt{wt} \]
The table also shows individual hypothesis tests conducted for each of the coefficients (i.e. testing the null hypothesis that the coefficient is zero). We find that both \(\beta_0\) and \(\beta_1\) are statistically significant, with both tests strongly rejecting the null hypotheses that \(\beta_0 = 0\) (\(p\)-value \(< 2\times 10^{-16}\)) and \(\beta_1=0\) (\(p\)-value \(= 1.29\times 10^{-10}\)).
What is the interpretation of the coefficient for wt? Firstly, notice that it is negative, which means that there is an inverse relationship between the explanatory variable wt and the response variable mpg. This makes sense, because the heavier the car is, the less fuel efficient it is. For every unit increase in wt (i.e. an additional 1000 lbs), we see on average a decrease of about 5.3 miles per gallon.
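One way to see this interpretation directly is to compare predictions one unit apart (the weights 3 and 4, i.e. 3000 lbs and 4000 lbs, are just illustrative values):

pred <- predict(mod, newdata = data.frame(wt = c(3, 4)))
pred
diff(pred)     # equals the wt coefficient, about -5.3 mpg per additional 1000 lbs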
## Residual standard error: 3.046 on 30 degrees of freedom
This gives the estimate \(\hat\sigma\), otherwise known as the residual standard error. To understand why it is called this, take a look at the formula. As mentioned above, \((n-p-1)\hat\sigma^2/\sigma^2\) follows a \(\chi^2\) distribution with \(n-p-1=30\) degrees of freedom.
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
As the names state, these are the values of \(R^2\) and \(R_{\text{adj}}^2\) respectively.
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
This is the piece of information which relates to testing whether all the coefficients (other than the intercept) are simultaneously zero. The \(F\)-statistic, \(T=91.38\), is calculated using the formula given above and compared against the \(F_{1,30}\) distribution. The resulting \(p\)-value of \(1.294\times 10^{-10}\) means the null hypothesis is strongly rejected, so we conclude that at least one of the coefficients is non-zero.
Using the information from the linear regression, one can produce the following table:
knitr::kable(fortify(mod)[, -c(3:5)])

| | mpg | wt | .fitted | .resid | .stdresid | 
|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 2.620 | 23.282611 | -2.2826106 | -0.7661677 | 
| Mazda RX4 Wag | 21.0 | 2.875 | 21.919770 | -0.9197704 | -0.3074305 | 
| Datsun 710 | 22.8 | 2.320 | 24.885952 | -2.0859521 | -0.7057525 | 
| Hornet 4 Drive | 21.4 | 3.215 | 20.102650 | 1.2973499 | 0.4327511 | 
| Hornet Sportabout | 18.7 | 3.440 | 18.900144 | -0.2001440 | -0.0668188 | 
| Valiant | 18.1 | 3.460 | 18.793254 | -0.6932545 | -0.2314831 | 
| Duster 360 | 14.3 | 3.570 | 18.205363 | -3.9053627 | -1.3055222 | 
| Merc 240D | 24.4 | 3.190 | 20.236262 | 4.1637381 | 1.3888971 | 
| Merc 230 | 22.8 | 3.150 | 20.450041 | 2.3499593 | 0.7839269 | 
| Merc 280 | 19.2 | 3.440 | 18.900144 | 0.2998560 | 0.1001080 | 
| Merc 280C | 17.8 | 3.440 | 18.900144 | -1.1001440 | -0.3672871 | 
| Merc 450SE | 16.4 | 4.070 | 15.533127 | 0.8668731 | 0.2928865 | 
| Merc 450SL | 17.3 | 3.730 | 17.350247 | -0.0502472 | -0.0168379 | 
| Merc 450SLC | 15.2 | 3.780 | 17.083024 | -1.8830236 | -0.6315997 | 
| Cadillac Fleetwood | 10.4 | 5.250 | 9.226650 | 1.1733496 | 0.4229607 | 
| Lincoln Continental | 10.4 | 5.424 | 8.296712 | 2.1032876 | 0.7697987 | 
| Chrysler Imperial | 14.7 | 5.345 | 8.718926 | 5.9810744 | 2.1735331 | 
| Fiat 128 | 32.4 | 2.200 | 25.527289 | 6.8727113 | 2.3349021 | 
| Honda Civic | 30.4 | 1.615 | 28.653805 | 1.7461954 | 0.6103569 | 
| Toyota Corolla | 33.9 | 1.835 | 27.478021 | 6.4219792 | 2.2170827 | 
| Toyota Corona | 21.5 | 2.465 | 24.111004 | -2.6110037 | -0.8796401 | 
| Dodge Challenger | 15.5 | 3.520 | 18.472586 | -2.9725862 | -0.9931363 | 
| AMC Javelin | 15.2 | 3.435 | 18.926866 | -3.7268663 | -1.2441801 | 
| Camaro Z28 | 13.3 | 3.840 | 16.762355 | -3.4623553 | -1.1627910 | 
| Pontiac Firebird | 19.2 | 3.845 | 16.735633 | 2.4643670 | 0.8277197 | 
| Fiat X1-9 | 27.3 | 1.935 | 26.943574 | 0.3564263 | 0.1224441 | 
| Porsche 914-2 | 26.0 | 2.140 | 25.847957 | 0.1520430 | 0.0517719 | 
| Lotus Europa | 30.4 | 1.513 | 29.198941 | 1.2010593 | 0.4225427 | 
| Ford Pantera L | 15.8 | 3.170 | 20.343151 | -4.5431513 | -1.5154971 | 
| Ferrari Dino | 19.7 | 2.770 | 22.480940 | -2.7809399 | -0.9308693 | 
| Maserati Bora | 15.0 | 3.570 | 18.205363 | -3.2053627 | -1.0715194 | 
| Volvo 142E | 21.4 | 2.780 | 22.427495 | -1.0274952 | -0.3438822 | 
We can then plot the line of best fit as follows
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_segment(data = fortify(mod), 
               aes(x = wt, xend = wt, y = mpg, yend = .fitted), 
               linetype = "dashed") +
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'
Consider the following data set, which has observations on four variables:

- LVOL: logarithm of weekly sales volume
- PROMP: promotion price
- FEAT: feature advertising
- DISP: display

foods <- as_tibble(read.table("foods.txt", header = TRUE))
foods
## # A tibble: 156 × 4
##     LVOL PROMP  FEAT  DISP
##    <dbl> <dbl> <dbl> <dbl>
##  1  14.5  3.52  39.9  21.4
##  2  14.2  3.7   25.8  34.6
##  3  14.3  3.42  23.3  27.4
##  4  14.3  3.55  25.5  25.7
##  5  14.2  3.64  39.2  30.2
##  6  14.0  3.78  13.1  41.7
##  7  14.0  3.86  17.1  37.6
##  8  14.0  3.60  24.5  35.1
##  9  14.1  3.72  15.5  34.2
## 10  14.2  3.52  24.8  29.2
## # … with 146 more rows
Here’s an exploratory plot of all the variables:
GGally::ggpairs(foods) + theme_bw()
Here are the results of the regression models fitted to the data set:
mod0 <- lm(LVOL ~ 1, foods)
mod1 <- lm(LVOL ~ PROMP, foods)
mod2 <- lm(LVOL ~ PROMP + FEAT, foods)
mod3 <- lm(LVOL ~ PROMP + FEAT + DISP, foods)
mtable("Model 1" = mod1, "Model 2" = mod2, "Model 3" = mod3,
       summary.stats = c("sigma", "R-squared", "F", "p", "N",
                         "Log-likelihood", "Deviance" , "AIC", "BIC"))
## 
## Calls:
## Model 1: lm(formula = LVOL ~ PROMP, data = foods)
## Model 2: lm(formula = LVOL ~ PROMP + FEAT, data = foods)
## Model 3: lm(formula = LVOL ~ PROMP + FEAT + DISP, data = foods)
## 
## =========================================================
##                     Model 1      Model 2      Model 3    
## ---------------------------------------------------------
##   (Intercept)       18.409***    17.150***    17.237***  
##                     (0.302)      (0.249)      (0.249)    
##   PROMP             -1.197***    -0.904***    -0.956***  
##                     (0.087)      (0.069)      (0.073)    
##   FEAT                            0.010***     0.010***  
##                                  (0.001)      (0.001)    
##   DISP                                         0.004*    
##                                               (0.002)    
## ---------------------------------------------------------
##   sigma              0.172        0.127        0.125     
##   R-squared          0.549        0.756        0.763     
##   F                187.104      236.997      163.425     
##   p                  0.000        0.000        0.000     
##   N                156          156          156         
##   Log-likelihood    54.371      102.362      104.752     
##   Deviance           4.549        2.459        2.385     
##   AIC             -102.743     -196.724     -199.504     
##   BIC              -93.593     -184.525     -184.254     
## =========================================================
##   Significance: *** = p < 0.001; ** = p < 0.01;   
##                 * = p < 0.05
As a side note, consider the ANOVA table for Model 3:
knitr::kable(anova(mod3),  digits = 3)

| | Df | Sum Sq | Mean Sq | F value | Pr(>F) | 
|---|---|---|---|---|---|
| PROMP | 1 | 5.527 | 5.527 | 352.305 | 0.000 | 
| FEAT | 1 | 2.090 | 2.090 | 133.243 | 0.000 | 
| DISP | 1 | 0.074 | 0.074 | 4.729 | 0.031 | 
| Residuals | 152 | 2.385 | 0.016 | | | 
And here are the diagnostic plots
diag.plots <- lindia::gg_diagnose(mod3, plot.all = FALSE)
diag.plots <- lapply(diag.plots, function(x) x + theme_bw())
diag.plots[[2]] <- diag.plots[[2]] + geom_smooth(se = FALSE)
diag.plots[[3]] <- diag.plots[[3]] + geom_smooth(se = FALSE)
diag.plots[[4]] <- diag.plots[[4]] + geom_smooth(se = FALSE)
diag.plots[[5]] <- diag.plots[[5]] + geom_smooth(se = FALSE)
lindia::plot_all(diag.plots[1:6])
QUESTIONS
What is the contribution of DISP to the regression model? In other words, what is the percentage of DISP’s regression sum of squares relative to all regressors’ sum of squares?

Read the article Kobina & Abledu, “Multiple Regression Analysis of the Impact of Senior Secondary School Certificate Examination (SSCE) Scores on the final Cumulative Grade Point Average (CGPA) of Students of Tertiary Institutions in Ghana” (link here: https://pdfs.semanticscholar.org/68ba/ad469ae2f59f194f6e97f1c3d262fc2c6375.pdf).
In your own words, state the authors’ research question. Are they trying to answer a causal question?
What is their independent variable? What are their dependent variables?
What is the main finding from the paper? In other words, how do the authors answer their research question? What is the main evidence they use to make this claim? Put this in your own words.
When we have data that is non-continuous (e.g. binary data, categorical data, count data, etc.), certain assumptions of the linear model are violated (which ones?). We have to fit a generalised linear model in such cases. Examples include logistic regression, Poisson regression, etc. Find some introductory material on these kinds of regression models.
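For a flavour of what fitting such a model looks like in R, here is a minimal logistic regression sketch (using the binary am variable in mtcars purely as an illustration, not as part of the exercise):

glm_fit <- glm(am ~ wt, data = mtcars, family = binomial)   # logistic regression
summary(glm_fit)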
What happens when we have explanatory variables which are categorical in nature? E.g. Sex, Likert scale responses, political party, etc. Read up on the use of “dummy variables” in regression analysis.
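As a small illustration (treating the 0/1 transmission variable am in mtcars as a categorical covariate; the choice of variable is only for the example), R creates the dummy variable automatically via factor():

fit_dummy <- lm(mpg ~ wt + factor(am), data = mtcars)
summary(fit_dummy)   # the factor(am)1 coefficient is the dummy effect for manual cars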
When we have a lot of predictor variables, the model may not adequately fit the data. How do we perform model/variable selection? One way is to do forward/backward selection, otherwise known as a stepwise addition/deletion scheme. Read up on this method.
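As a starting point, base R’s step() performs stepwise selection by AIC; here is a sketch using the foods models above (mod0 as the smallest model and mod3 as the largest, purely for illustration):

step(mod0, scope = formula(mod3), direction = "forward")   # forward selection by AIC
step(mod3, direction = "backward")                         # backward elimination by AIC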
When we have too many variables, specifically when \(p>n\), the linear regression model becomes mathematically impossible to fit. Why is this?