1. Describe the null hypotheses to which the \(p\)-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.
Answer. For each coefficient the null hypothesis is \(H_0:\ \beta_i = 0\).

Intercept: \(p<0.0001\) means the null hypothesis is rejected, i.e., sales is not zero when TV, radio, and newspaper are all zero.

TV: \(p<0.0001\) means the null hypothesis is rejected, i.e., a change in TV advertising will affect the value of sales.

radio: \(p<0.0001\) means the null hypothesis is rejected, i.e., a change in radio advertising will affect the value of sales.

newspaper: \(p=0.8599\) means the null hypothesis is not rejected, i.e., there is no evidence of a relationship between sales and newspaper.
5. Consider the fitted values that result from performing linear regression without an intercept. In this setting, the \(i\)th fitted value takes the form \[\begin{equation}\label{eq:1} \hat{y}_{i}=x_{i} \hat{\beta}, \end{equation}\] where \[\begin{equation}\label{eq:2} \hat{\beta}=\left(\sum_{i=1}^{n} x_{i} y_{i}\right) /\left(\sum_{i^{\prime}=1}^{n} x_{i^{\prime}}^{2}\right). \end{equation}\] Show that we can write \[\begin{equation}\label{eq:3} \hat{y}_{i}=\sum_{i^{\prime}=1}^{n} a_{i^{\prime}} y_{i^{\prime}}. \end{equation}\] What is \(a_{i^{\prime}}\)?
Proof. Substitute \eqref{eq:2} into \eqref{eq:1} to get \[\begin{equation} \hat{y}_{i}=x_i\frac{\sum_{i^{\prime}=1}^{n} x_{i^{\prime}} y_{i^{\prime}}}{\sum_{j=1}^{n} x_{j}^{2}} =\sum_{i^{\prime}=1}^{n}\frac{x_ix_{i^{\prime}}}{\sum_{j=1}^n{x_j}^2} y_{i^{\prime}}. \end{equation}\] This has the form of equation \eqref{eq:3}, with \(a_{i^{\prime}} = \displaystyle\frac{x_ix_{i^{\prime}}}{\sum_{j=1}^n{x_j}^2}\).
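As a quick numerical check (a sketch, not part of the exercise; the simulated x and y are made up for illustration), we can verify the weight formula against the no-intercept fit:
# Sketch: verify that the i-th fitted value is a linear combination of the y's
set.seed(1)
x <- rnorm(20)
y <- 2 * x + rnorm(20)
beta.hat <- sum(x * y) / sum(x^2)      # no-intercept least squares slope
i <- 3                                 # check the 3rd fitted value
a <- x[i] * x / sum(x^2)               # the weights a_{i'}
all.equal(sum(a * y), x[i] * beta.hat) # should be TRUE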
8. This question involves the use of simple linear regression on the Auto data set.
Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results.
library(ISLR)
fit.lm <- lm(mpg ~ horsepower, data = Auto)
summary(fit.lm)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
Comment on the output. For example:
Is there a relationship between the predictor and the response?
Yes. Since the \(p\)-value is < 2e-16, the relationship between the predictor and the response is statistically significant.
How strong is the relationship between the predictor and the response?
summary(fit.lm)$r.squared
## [1] 0.6059483
horsepower explains 60.59% of the variance in mpg.
summary(fit.lm)$sigma
## [1] 4.905757
summary(fit.lm)$sigma/mean(Auto$mpg)
## [1] 0.2092371
The residual standard error (RSE) is 4.906; relative to the mean of mpg, this is a percentage error of 20.92%.
Is the relationship between the predictor and the response positive or negative?
coefficients(fit.lm)["horsepower"]
## horsepower
## -0.1578447
It is negative.
What is the predicted mpg associated with a horsepower of 98? What are the associated \(95\%\) confidence and prediction intervals?
predict(fit.lm, data.frame(horsepower = 98),
interval = "confidence", level = 0.95)
## fit lwr upr
## 1 24.46708 23.97308 24.96108
The confidence interval is \(24.467 \pm 0.494\).
predict(fit.lm, data.frame(horsepower = 98),
interval = "prediction", level = 0.95)
## fit lwr upr
## 1 24.46708 14.8094 34.12476
The prediction interval is \(24.467 \pm 9.658\).
The prediction interval is wider. The true \(y\) at \(x_0\) is \(y=x_{0}^{T} \beta+\epsilon\); since \(E(\epsilon)=0\), the predicted value is \(\hat{y}=x_{0}^{T} \hat{\beta}\). The prediction interval accounts for both the variance of \(\hat{\beta}\) and the variance of the irreducible error \(\epsilon\), while the confidence interval accounts only for the variance of \(\hat{\beta}\) and does not take the error term \(\epsilon\) into account.
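As a sketch, both half-widths can be reconstructed by hand from the fitted model, using predict()'s se.fit component and the residual standard error:
# Sketch: rebuild the two interval half-widths for horsepower = 98
x0 <- data.frame(horsepower = 98)
se.fit <- predict(fit.lm, x0, se.fit = TRUE)$se.fit # SE of the fitted mean
sigma <- summary(fit.lm)$sigma                      # residual standard error
t.crit <- qt(0.975, df = fit.lm$df.residual)
t.crit * se.fit                   # confidence half-width, about 0.494
t.crit * sqrt(se.fit^2 + sigma^2) # prediction half-width, about 9.658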
Plot the response and the predictor. Use the abline() function to display the least squares regression line.
# ggplot(Auto, aes(x = horsepower, y = mpg)) + geom_point() +
# geom_abline(intercept = coef(fit.lm)[1], slope = coef(fit.lm)[2],
# col = "blue", size = 1) +
# theme_set(theme_bw())
attach(Auto)
plot(horsepower, mpg); abline(fit.lm, col = "blue", lwd = 2) # 'size' is a ggplot2 argument; base graphics uses 'lwd'
Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
par(mfrow=c(2,2))
plot(fit.lm)
The residual plot shows a distinctive curved pattern in the deviations from the residual = 0 line: there is a non-linear relationship that the model does not explain, and it is left in the residuals.
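As a quick sketch (anticipating the transformations tried in Question 9; fit.lm.q is just a scratch name), adding a quadratic term in horsepower captures much of this curvature:
# Sketch: a quadratic fit addresses the curvature left in the residuals
fit.lm.q <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
summary(fit.lm.q)$r.squared # rises to about 0.688 from 0.606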
9. This question involves the use of multiple linear regression on the Auto data set.
Produce a scatterplot matrix which includes all of the variables in the data set.
library(ISLR)
pairs(Auto, main="Scatterplot-Auto", pch = 21, bg=c(1:nrow(Auto)))
Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
cor(Auto[,-9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:
Auto$origin <- as.factor(Auto$origin)
fit.lm <- lm(mpg ~ ., data = Auto[,-9])
summary(fit.lm)
##
## Call:
## lm(formula = mpg ~ ., data = Auto[, -9])
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0095 -2.0785 -0.0982 1.9856 13.3608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.795e+01 4.677e+00 -3.839 0.000145 ***
## cylinders -4.897e-01 3.212e-01 -1.524 0.128215
## displacement 2.398e-02 7.653e-03 3.133 0.001863 **
## horsepower -1.818e-02 1.371e-02 -1.326 0.185488
## weight -6.710e-03 6.551e-04 -10.243 < 2e-16 ***
## acceleration 7.910e-02 9.822e-02 0.805 0.421101
## year 7.770e-01 5.178e-02 15.005 < 2e-16 ***
## origin2 2.630e+00 5.664e-01 4.643 4.72e-06 ***
## origin3 2.853e+00 5.527e-01 5.162 3.93e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.307 on 383 degrees of freedom
## Multiple R-squared: 0.8242, Adjusted R-squared: 0.8205
## F-statistic: 224.5 on 8 and 383 DF, p-value: < 2.2e-16
Is there a relationship between the predictors and the response?
Yes. The null hypothesis is that all the coefficients are 0, and we use the F-statistic to test it. Since the p-value is < 2.2e-16, we reject the null hypothesis: there is a relationship between the predictors and the response.
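For reference, the same overall F-test can be extracted from the fit programmatically; a small sketch:
# Sketch: pull the F-statistic from summary() and recompute its p-value
f <- summary(fit.lm)$fstatistic
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE) # < 2.2e-16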
Which predictors appear to have a statistically significant relationship to the response?
Judging by the individual p-values: displacement, weight, year, and origin.
What does the coefficient for the year variable suggest?
coef(fit.lm)["year"]
## year
## 0.7770269
An increase of one unit in year is associated with an increase of 0.7770 units in mpg, holding the other predictors fixed.
Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow=c(2,2))
plot(fit.lm)
The residual plot again shows some distinctive patterns in the deviations from the residual = 0 line, so there may be a non-linear relationship. The points in the upper-right corner may be outliers.
In the leverage plot, point 14 has unusually high leverage compared with the other points, but it lies within the Cook's distance contours. (Cases outside the Cook's distance contours are influential: the regression results would change if we excluded them.) So the leverage plot seems acceptable.
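A small sketch backs up this visual reading with base R's hatvalues() and cooks.distance():
# Sketch: confirm the high-leverage point and check its influence numerically
which.max(hatvalues(fit.lm)) # observation 14 has the highest leverage
max(cooks.distance(fit.lm))  # compare against the usual 0.5 / 1 thresholds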
Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
fit.lm <- lm(mpg ~ .*., data = Auto[,-9])
# summary(fit.lm)
Setting the p-value threshold at 0.05, the statistically significant interaction terms (excerpted from the summary output) are as follows.
cylinders:acceleration *
acceleration:year *
acceleration:origin2 ***
acceleration:origin3 *
year:origin2 *
year:origin3 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In fact, they are cylinders:acceleration, acceleration:year, acceleration:origin, and year:origin.
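They can also be pulled out programmatically; a sketch using the coefficient table of the interaction fit:
# Sketch: list the interaction terms with p-values below 0.05
coefs <- summary(fit.lm)$coefficients
inter <- grepl(":", rownames(coefs))
rownames(coefs)[inter & coefs[, "Pr(>|t|)"] < 0.05]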
Try a few different transformations of the variables, such as \(\log (X), \sqrt{X}, X^{2}\). Comment on your findings.
# Auto[,c(-8,-9)] no transformation on origin
R2 <- rep(0,6)
R2.log <- R2
R2.sqrt <- R2
R2.quadr <- R2
for (i in 2:7){
# X
R2[i-1] <- summary(lm(Auto$mpg ~ Auto[,i]))$r.squared
if (min(Auto[,i]) < 0){
X <- Auto[,i] - min(Auto[,i]) # shift so all values are nonnegative before log/sqrt
}else{
X <- Auto[,i]
}
# log(X)
R2.log[i-1] <- summary(lm(Auto$mpg~log(X+1)))$r.squared
# sqrt(X)
R2.sqrt[i-1] <- summary(lm(Auto$mpg~sqrt(X)))$r.squared
# X+X^2
X <- Auto[,i]
R2.quadr[i-1] <- summary(lm(Auto$mpg~X+I(X^2)))$r.squared
}
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
method <- c("none", "log", "sqrt", "quadratic")
p <- data.frame(predictor = colnames(Auto)[2:7],
R2 = round(R2, 3),
R2.log = round(R2.log, 3),
R2.sqrt = round(R2.sqrt, 3),
R2.quadr = round(R2.quadr, 3))
p <- p %>%
mutate(R2.max = pmax(R2, R2.log, R2.sqrt, R2.quadr),
method = method[max.col(p[,2:5])]) %>%
arrange(desc(R2.max))
p
## predictor R2 R2.log R2.sqrt R2.quadr R2.max method
## 1 weight 0.693 0.713 0.706 0.715 0.715 quadratic
## 2 displacement 0.648 0.686 0.675 0.689 0.689 quadratic
## 3 horsepower 0.606 0.668 0.644 0.688 0.688 quadratic
## 4 cylinders 0.605 0.605 0.606 0.607 0.607 quadratic
## 5 year 0.337 0.332 0.335 0.368 0.368 quadratic
## 6 acceleration 0.179 0.190 0.185 0.194 0.194 quadratic
The \(X^2\) (quadratic) transformation seems the most suitable. We then visualize the improvement in \(R^2\); here I refer to https://rpubs.com/lmorgan95/ISLR_CH3_Solutions.
library(tidyr)
colnames(p)[2:5] <- method
mpg_predictors <- p %>%
select(-c(R2.max, method)) %>%
gather(key = "method", value = "R2", -predictor)
mpg_predictors$predictor <- factor(mpg_predictors$predictor,
ordered = T,
levels = p$predictor)
mpg_predictors$method <- factor(mpg_predictors$method,
levels = method)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'Auto':
##
## mpg
ggplot(mpg_predictors, aes(x = R2, y = predictor, col = method, group = predictor)) +
geom_line(col = "grey15") + geom_point(size = 2) +
theme_light() + theme(legend.position = "bottom") +
labs(title = "Best predictors (& transformations) for mpg",
col = "Predictor Transformation",
y = "Predictor")
From the graph, we should test quadratic terms for weight, displacement, horsepower, and year.
fit.lm.2 <- lm(mpg ~ . + I(weight^2) + I(displacement^2) +
I(horsepower^2) + I(year^2), data = Auto[,-9])
summary(fit.lm.2)
##
## Call:
## lm(formula = mpg ~ . + I(weight^2) + I(displacement^2) + I(horsepower^2) +
## I(year^2), data = Auto[, -9])
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4816 -1.5384 0.0735 1.3671 12.0213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.185e+02 6.966e+01 6.008 4.40e-09 ***
## cylinders 5.073e-01 3.191e-01 1.590 0.112692
## displacement -3.328e-02 2.045e-02 -1.627 0.104480
## horsepower -1.781e-01 3.953e-02 -4.506 8.81e-06 ***
## weight -1.114e-02 2.587e-03 -4.306 2.12e-05 ***
## acceleration -1.700e-01 9.652e-02 -1.762 0.078960 .
## year -1.019e+01 1.837e+00 -5.546 5.49e-08 ***
## origin2 1.323e+00 5.304e-01 2.494 0.013068 *
## origin3 1.258e+00 5.129e-01 2.452 0.014637 *
## I(weight^2) 1.182e-06 3.438e-07 3.439 0.000649 ***
## I(displacement^2) 5.839e-05 3.435e-05 1.700 0.089967 .
## I(horsepower^2) 4.388e-04 1.336e-04 3.284 0.001118 **
## I(year^2) 7.210e-02 1.207e-02 5.974 5.35e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.776 on 379 degrees of freedom
## Multiple R-squared: 0.8773, Adjusted R-squared: 0.8735
## F-statistic: 225.9 on 12 and 379 DF, p-value: < 2.2e-16
The adjusted \(R^2\) improves from 0.8205 to 0.8735 after adding the quadratic terms.
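As a final sketch, a nested F-test checks that the quadratic terms help jointly, not only through the adjusted \(R^2\) (fit.lm.main is just a scratch name for the refitted main-effects model, since fit.lm now holds the interaction fit):
# Sketch: nested F-test of the main-effects model against the quadratic model
fit.lm.main <- lm(mpg ~ ., data = Auto[,-9]) # refit the main-effects model from above
anova(fit.lm.main, fit.lm.2)                 # expect a highly significant F given the R^2 jump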