1. Describe the null hypotheses to which the \(p\)-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.
Answer. For each coefficient the null hypothesis is \(H_0:\ \beta_i = 0\).

Intercept: \(p<0.0001\) means the null hypothesis is rejected, i.e., sales is not zero when TV, radio, and newspaper are all zero.

TV: \(p<0.0001\) means the null hypothesis is rejected, i.e., a change in TV advertising will affect the value of sales.

radio: \(p<0.0001\) means the null hypothesis is rejected, i.e., a change in radio advertising will affect the value of sales.

newspaper: \(p=0.8599\) means the null hypothesis is not rejected, i.e., there is no evidence of a relationship between sales and newspaper.
5. Consider the fitted values that result from performing linear regression without an intercept. In this setting, the \(i\)th fitted value takes the form \[\begin{equation}\label{eq:1} \hat{y}_{i}=x_{i} \hat{\beta}, \end{equation}\] where \[\begin{equation}\label{eq:2} \hat{\beta}=\left(\sum_{i=1}^{n} x_{i} y_{i}\right) /\left(\sum_{i^{\prime}=1}^{n} x_{i^{\prime}}^{2}\right). \end{equation}\] Show that we can write \[\begin{equation}\label{eq:3} \hat{y}_{i}=\sum_{i^{\prime}=1}^{n} a_{i^{\prime}} y_{i^{\prime}}. \end{equation}\] What is \(a_{i^{\prime}}\)?
Proof. Substitute \eqref{eq:2} into \eqref{eq:1} to get \[\begin{equation} \hat{y}_{i}=x_i\frac{\sum_{i^{\prime}=1}^{n} x_{i^{\prime}} y_{i^{\prime}}}{\sum_{j=1}^{n} x_{j}^{2}} =\sum_{i^{\prime}=1}^{n}\frac{x_ix_{i^{\prime}}}{\sum_{j=1}^n{x_j}^2} y_{i^{\prime}}. \end{equation}\] This has the form of equation \eqref{eq:3}, with \(a_{i^{\prime}} = \displaystyle\frac{x_ix_{i^{\prime}}}{\sum_{j=1}^n{x_j}^2}\).
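As a quick numerical check (a sketch, not part of the exercise; the simulated x and y are made up for illustration), we can verify the weight formula against the no-intercept fit:
# Sketch: verify that the i-th fitted value is a linear combination of the y's
set.seed(1)
x <- rnorm(20)
y <- 2 * x + rnorm(20)
beta.hat <- sum(x * y) / sum(x^2)      # no-intercept least squares slope
i <- 3                                 # check the 3rd fitted value
a <- x[i] * x / sum(x^2)               # the weights a_{i'}
all.equal(sum(a * y), x[i] * beta.hat) # should be TRUE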
8. This question involves the use of simple linear regression on the Auto data set.
Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results.
library(ISLR)
fit.lm <- lm(mpg ~ horsepower, data = Auto)
summary(fit.lm)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
Comment on the output. For example:
Is there a relationship between the predictor and the response?
Yes. Since the \(p\)-value is < 2e-16, the relationship between the predictor and the response is statistically significant.
How strong is the relationship between the predictor and the response?
summary(fit.lm)$r.squared
## [1] 0.6059483
horsepower explains 60.59% of the variance in mpg.
summary(fit.lm)$sigma
## [1] 4.905757
summary(fit.lm)$sigma/mean(Auto$mpg)
## [1] 0.2092371
The residual standard error (RSE) is 4.906; relative to the mean of mpg, this is a percentage error of 20.92%.
Is the relationship between the predictor and the response positive or negative?
coefficients(fit.lm)["horsepower"]
## horsepower
## -0.1578447
It is negative.
What is the predicted mpg associated with a horsepower of 98? What are the associated \(95\%\) confidence and prediction intervals?
predict(fit.lm, data.frame(horsepower = 98),
interval = "confidence", level = 0.95)
## fit lwr upr
## 1 24.46708 23.97308 24.96108
The confidence interval is \(24.467 \pm 0.494\).
predict(fit.lm, data.frame(horsepower = 98),
interval = "prediction", level = 0.95)
## fit lwr upr
## 1 24.46708 14.8094 34.12476
The prediction interval is \(24.467 \pm 9.658\).
The prediction interval is wider. The true \(y\) at \(x_0\) is \(y=x_{0}^{T} \beta+\epsilon\); since \(E(\epsilon)=0\), the predicted value is \(\hat{y}=x_{0}^{T} \hat{\beta}\). The prediction interval accounts for both the variance of \(\hat{\beta}\) and the variance of the irreducible error \(\epsilon\), while the confidence interval accounts only for the variance of \(\hat{\beta}\) and does not take the error term \(\epsilon\) into account.
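As a sketch, both half-widths can be reconstructed by hand from the fitted model, using predict()'s se.fit component and the residual standard error:
# Sketch: rebuild the two interval half-widths for horsepower = 98
x0 <- data.frame(horsepower = 98)
se.fit <- predict(fit.lm, x0, se.fit = TRUE)$se.fit # SE of the fitted mean
sigma <- summary(fit.lm)$sigma                      # residual standard error
t.crit <- qt(0.975, df = fit.lm$df.residual)
t.crit * se.fit                   # confidence half-width, about 0.494
t.crit * sqrt(se.fit^2 + sigma^2) # prediction half-width, about 9.658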
Plot the response and the predictor. Use the abline() function to display the least squares regression line.
# ggplot(Auto, aes(x = horsepower, y = mpg)) + geom_point() +
# geom_abline(intercept = coef(fit.lm)[1], slope = coef(fit.lm)[2],
# col = "blue", size = 1) +
# theme_set(theme_bw())
attach(Auto)
plot(horsepower, mpg); abline(fit.lm, col = "blue", lwd = 2) # 'size' is a ggplot2 argument; base graphics uses 'lwd'
Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
par(mfrow=c(2,2))
plot(fit.lm)
The residual plot shows a distinctive curved pattern in the deviations from the residual = 0 line: there is a non-linear relationship that the model does not explain, and it is left in the residuals.
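As a quick sketch (anticipating the transformations tried in Question 9; fit.lm.q is just a scratch name), adding a quadratic term in horsepower captures much of this curvature:
# Sketch: a quadratic fit addresses the curvature left in the residuals
fit.lm.q <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
summary(fit.lm.q)$r.squared # rises to about 0.688 from 0.606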
9. This question involves the use of multiple linear regression on the Auto data set.
Produce a scatterplot matrix which includes all of the variables in the data set.
library(ISLR)
pairs(Auto, main="Scatterplot-Auto", pch = 21, bg=c(1:nrow(Auto)))
Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
cor(Auto[,-9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:
Auto$origin <- as.factor(Auto$origin)
fit.lm <- lm(mpg ~ ., data = Auto[,-9])
summary(fit.lm)
##
## Call:
## lm(formula = mpg ~ ., data = Auto[, -9])
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0095 -2.0785 -0.0982 1.9856 13.3608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.795e+01 4.677e+00 -3.839 0.000145 ***
## cylinders -4.897e-01 3.212e-01 -1.524 0.128215
## displacement 2.398e-02 7.653e-03 3.133 0.001863 **
## horsepower -1.818e-02 1.371e-02 -1.326 0.185488
## weight -6.710e-03 6.551e-04 -10.243 < 2e-16 ***
## acceleration 7.910e-02 9.822e-02 0.805 0.421101
## year 7.770e-01 5.178e-02 15.005 < 2e-16 ***
## origin2 2.630e+00 5.664e-01 4.643 4.72e-06 ***
## origin3 2.853e+00 5.527e-01 5.162 3.93e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.307 on 383 degrees of freedom
## Multiple R-squared: 0.8242, Adjusted R-squared: 0.8205
## F-statistic: 224.5 on 8 and 383 DF, p-value: < 2.2e-16
Is there a relationship between the predictors and the response?
Yes. The null hypothesis is that all the coefficients are 0, and we use the F-statistic to test it. Since the p-value is < 2.2e-16, we reject the null hypothesis: there is a relationship between the predictors and the response.
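For reference, the same overall F-test can be extracted from the fit programmatically; a small sketch:
# Sketch: pull the F-statistic from summary() and recompute its p-value
f <- summary(fit.lm)$fstatistic
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE) # < 2.2e-16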
Which predictors appear to have a statistically significant relationship to the response?
Judging by the individual p-values: displacement, weight, year, and origin.
What does the coefficient for the year variable suggest?
coef(fit.lm)["year"]
## year
## 0.7770269
An increase of one unit in year is associated with an increase of 0.7770 units in mpg, holding the other predictors fixed.
Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow=c(2,2))
plot(fit.lm)
The residual plot again shows some distinctive patterns in the deviations from the residual = 0 line, so there may be a non-linear relationship. The points in the upper-right corner may be outliers.
In the leverage plot, point 14 has unusually high leverage compared with the other points, but it lies within the Cook's distance contours. (Cases outside the Cook's distance contours are influential: the regression results would change if we excluded them.) So the leverage plot seems acceptable.
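A small sketch backs up this visual reading with base R's hatvalues() and cooks.distance():
# Sketch: confirm the high-leverage point and check its influence numerically
which.max(hatvalues(fit.lm)) # observation 14 has the highest leverage
max(cooks.distance(fit.lm))  # compare against the usual 0.5 / 1 thresholds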
Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
fit.lm <- lm(mpg ~ .*., data = Auto[,-9])
# summary(fit.lm)
Setting the p-value threshold at 0.05, the statistically significant interaction terms (excerpted from the summary output) are as follows.
cylinders:acceleration *
acceleration:year *
acceleration:origin2 ***
acceleration:origin3 *
year:origin2 *
year:origin3 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In fact, they are cylinders:acceleration, acceleration:year, acceleration:origin, and year:origin.
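They can also be pulled out programmatically; a sketch using the coefficient table of the interaction fit:
# Sketch: list the interaction terms with p-values below 0.05
coefs <- summary(fit.lm)$coefficients
inter <- grepl(":", rownames(coefs))
rownames(coefs)[inter & coefs[, "Pr(>|t|)"] < 0.05]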
Try a few different transformations of the variables, such as \(\log (X), \sqrt{X}, X^{2}\). Comment on your findings.
# Auto[,c(-8,-9)] no transformation on origin
R2 <- rep(0,6)
R2.log <- R2
R2.sqrt <- R2
R2.quadr <- R2
for (i in 2:7){
# X
R2[i-1] <- summary(lm(Auto$mpg ~ Auto[,i]))$r.squared
if (min(Auto[,i]) < 0){
X <- Auto[,i] - min(Auto[,i]) # shift so all values are nonnegative before log/sqrt
}else{
X <- Auto[,i]
}
# log(X)
R2.log[i-1] <- summary(lm(Auto$mpg~log(X+1)))$r.squared
# sqrt(X)
R2.sqrt[i-1] <- summary(lm(Auto$mpg~sqrt(X)))$r.squared
# X+X^2
X <- Auto[,i]
R2.quadr[i-1] <- summary(lm(Auto$mpg~X+I(X^2)))$r.squared
}
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
method <- c("none", "log", "sqrt", "quadratic")
p <- data.frame(predictor = colnames(Auto)[2:7],
R2 = round(R2, 3),
R2.log = round(R2.log, 3),
R2.sqrt = round(R2.sqrt, 3),
R2.quadr = round(R2.quadr, 3))
p <- p %>%
mutate(R2.max = pmax(R2, R2.log, R2.sqrt, R2.quadr),
method = method[max.col(p[,2:5])]) %>%
arrange(desc(R2.max))
p
## predictor R2 R2.log R2.sqrt R2.quadr R2.max method
## 1 weight 0.693 0.713 0.706 0.715 0.715 quadratic
## 2 displacement 0.648 0.686 0.675 0.689 0.689 quadratic
## 3 horsepower 0.606 0.668 0.644 0.688 0.688 quadratic
## 4 cylinders 0.605 0.605 0.606 0.607 0.607 quadratic
## 5 year 0.337 0.332 0.335 0.368 0.368 quadratic
## 6 acceleration 0.179 0.190 0.185 0.194 0.194 quadratic
The \(X^2\) (quadratic) transformation seems the most suitable. We then visualize the improvement in \(R^2\); here I refer to https://rpubs.com/lmorgan95/ISLR_CH3_Solutions.
library(tidyr)
colnames(p)[2:5] <- method
mpg_predictors <- p %>%
select(-c(R2.max, method)) %>%
gather(key = "method", value = "R2", -predictor)
mpg_predictors$predictor <- factor(mpg_predictors$predictor,
ordered = T,
levels = p$predictor)
mpg_predictors$method <- factor(mpg_predictors$method,
levels = method)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'Auto':
##
## mpg
ggplot(mpg_predictors, aes(x = R2, y = predictor, col = method, group = predictor)) +
geom_line(col = "grey15") + geom_point(size = 2) +
theme_light() + theme(legend.position = "bottom") +
labs(title = "Best predictors (& transformations) for mpg",
col = "Predictor Transformation",
y = "Predictor")
From the graph, we should test quadratic terms for weight, displacement, horsepower, and year.
fit.lm.2 <- lm(mpg ~ . + I(weight^2) + I(displacement^2) +
I(horsepower^2) + I(year^2), data = Auto[,-9])
summary(fit.lm.2)
##
## Call:
## lm(formula = mpg ~ . + I(weight^2) + I(displacement^2) + I(horsepower^2) +
## I(year^2), data = Auto[, -9])
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4816 -1.5384 0.0735 1.3671 12.0213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.185e+02 6.966e+01 6.008 4.40e-09 ***
## cylinders 5.073e-01 3.191e-01 1.590 0.112692
## displacement -3.328e-02 2.045e-02 -1.627 0.104480
## horsepower -1.781e-01 3.953e-02 -4.506 8.81e-06 ***
## weight -1.114e-02 2.587e-03 -4.306 2.12e-05 ***
## acceleration -1.700e-01 9.652e-02 -1.762 0.078960 .
## year -1.019e+01 1.837e+00 -5.546 5.49e-08 ***
## origin2 1.323e+00 5.304e-01 2.494 0.013068 *
## origin3 1.258e+00 5.129e-01 2.452 0.014637 *
## I(weight^2) 1.182e-06 3.438e-07 3.439 0.000649 ***
## I(displacement^2) 5.839e-05 3.435e-05 1.700 0.089967 .
## I(horsepower^2) 4.388e-04 1.336e-04 3.284 0.001118 **
## I(year^2) 7.210e-02 1.207e-02 5.974 5.35e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.776 on 379 degrees of freedom
## Multiple R-squared: 0.8773, Adjusted R-squared: 0.8735
## F-statistic: 225.9 on 12 and 379 DF, p-value: < 2.2e-16
The adjusted \(R^2\) improves from 0.8205 to 0.8735 after adding the quadratic terms.
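As a final sketch, a nested F-test checks that the quadratic terms help jointly, not only through the adjusted \(R^2\) (fit.lm.main is just a scratch name for the refitted main-effects model, since fit.lm now holds the interaction fit):
# Sketch: nested F-test of the main-effects model against the quadratic model
fit.lm.main <- lm(mpg ~ ., data = Auto[,-9]) # refit the main-effects model from above
anova(fit.lm.main, fit.lm.2)                 # expect a highly significant F given the R^2 jump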