1. It was mentioned in the chapter that a cubic regression spline with one knot at \(\xi\) can be obtained using a basis of the form \(x, x^{2}, x^{3}\), \((x-\xi)_{+}^{3}\), where \((x-\xi)_{+}^{3}=(x-\xi)^{3}\) if \(x>\xi\) and equals 0 otherwise. We will now show that a function of the form \[ f(x)=\beta_{0}+\beta_{1} x+\beta_{2} x^{2}+\beta_{3} x^{3}+\beta_{4}(x-\xi)_{+}^{3} \] is indeed a cubic regression spline, regardless of the values of \(\beta_{0}, \beta_{1}, \beta_{2}, \beta_{3}, \beta_{4}\).

  1. Find a cubic polynomial \[ f_{1}(x)=a_{1}+b_{1} x+c_{1} x^{2}+d_{1} x^{3} \] such that \(f(x)=f_{1}(x)\) for all \(x \leq \xi\). Express \(a_{1}, b_{1}, c_{1}, d_{1}\) in terms of \(\beta_{0}, \beta_{1}, \beta_{2}, \beta_{3}, \beta_{4}\).

    Answer. For \(x \leq \xi\), \(f(x)=\beta_{0}+\beta_{1} x+\beta_{2} x^{2}+\beta_{3} x^{3}\). So \(a_1=\beta_{0}\), \(b_1=\beta_{1}\), \(c_{1}=\beta_{2}\), \(d_{1}=\beta_{3}\).

  2. Find a cubic polynomial \[ f_{2}(x)=a_{2}+b_{2} x+c_{2} x^{2}+d_{2} x^{3} \] such that \(f(x)=f_{2}(x)\) for all \(x>\xi\). Express \(a_{2}, b_{2}, c_{2}, d_{2}\) in terms of \(\beta_{0}, \beta_{1}, \beta_{2}, \beta_{3}, \beta_{4}\). We have now established that \(f(x)\) is a piecewise polynomial.

    Answer. For \(x>\xi\), \(f(x)=(\beta_{0}-\beta_{4}\xi^3)+(\beta_{1}+3\beta_{4}\xi^2) x+(\beta_{2}-3\beta_{4}\xi) x^{2}+(\beta_{3}+\beta_{4}) x^{3}\). So \(a_2=\beta_{0}-\beta_{4}\xi^3\), \(b_2=\beta_{1}+3\beta_{4}\xi^2\), \(c_{2}=\beta_{2}-3\beta_{4}\xi\), \(d_{2}=\beta_{3}+\beta_{4}\).

  3. Show that \(f_{1}(\xi)=f_{2}(\xi)\). That is, \(f(x)\) is continuous at \(\xi\).

  4. Show that \(f_{1}^{\prime}(\xi)=f_{2}^{\prime}(\xi)\). That is, \(f^{\prime}(x)\) is continuous at \(\xi\).

  5. Show that \(f_{1}^{\prime \prime}(\xi)=f_{2}^{\prime \prime}(\xi)\). That is, \(f^{\prime \prime}(x)\) is continuous at \(\xi\).

    Answer (for 3, 4, and 5). \(f_1(x) = \beta_{0}+\beta_{1} x+\beta_{2} x^{2}+\beta_{3} x^{3}\) and \(f_2(x) = \beta_{0}+\beta_{1} x+\beta_{2} x^{2}+\beta_{3} x^{3}+\beta_{4}(x-\xi)^{3}\). Let \(h(x)=f_2(x)-f_1(x)=\beta_{4}(x-\xi)^{3}\); then \(h'(x)=3\beta_{4}(x-\xi)^{2}\) and \(h''(x)=6\beta_{4}(x-\xi)\), so \(h(\xi)=h'(\xi)=h''(\xi)=0\). Hence \(f_{1}(\xi)=f_{2}(\xi)\), \(f'_{1}(\xi)=f'_{2}(\xi)\), and \(f''_{1}(\xi)=f''_{2}(\xi)\).

Therefore, \(f(x)\) is indeed a cubic spline.
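As a quick numerical check (a minimal sketch; the coefficient values and the knot below are arbitrary choices, used only for illustration), we can code the two cubic pieces derived above and verify that their values and first two derivatives agree at the knot:

# Hypothetical coefficients beta_0, ..., beta_4 and knot xi, chosen only for illustration
b <- c(1, -2, 0.5, 3, -4)
xi <- 2
# Left piece f1 and right piece f2, with the coefficients derived above
f1 <- function(x) b[1] + b[2]*x + b[3]*x^2 + b[4]*x^3
f2 <- function(x) (b[1] - b[5]*xi^3) + (b[2] + 3*b[5]*xi^2)*x +
                  (b[3] - 3*b[5]*xi)*x^2 + (b[4] + b[5])*x^3
# Exact first and second derivatives of each cubic piece
df1  <- function(x) b[2] + 2*b[3]*x + 3*b[4]*x^2
df2  <- function(x) (b[2] + 3*b[5]*xi^2) + 2*(b[3] - 3*b[5]*xi)*x + 3*(b[4] + b[5])*x^2
d2f1 <- function(x) 2*b[3] + 6*b[4]*x
d2f2 <- function(x) 2*(b[3] - 3*b[5]*xi) + 6*(b[4] + b[5])*x
# All three differences are zero at the knot
c(f1(xi) - f2(xi), df1(xi) - df2(xi), d2f1(xi) - d2f2(xi))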

2. Suppose that a curve \(\hat{g}\) is computed to smoothly fit a set of \(n\) points using the following formula: \[ \hat{g}=\arg \min _{g}\left(\sum_{i=1}^{n}\left(y_{i}-g\left(x_{i}\right)\right)^{2}+\lambda \int\left[g^{(m)}(x)\right]^{2} d x\right), \] where \(g^{(m)}\) represents the \(m\)th derivative of \(g\) (and \(g^{(0)}=g\)). Provide example sketches of \(\hat{g}\) in each of the following scenarios.

  1. \(\lambda=\infty, m=0\).

    Answer. \(\hat{g}=0\), since the penalty forces \(g(x)=0\) everywhere.

  2. \(\lambda=\infty, m=1\).

    Answer. \(\hat{g}\) is the constant function \(\bar{y}=\sum_i y_i/n\), since the penalty forces \(g'=0\).

  3. \(\lambda=\infty, m=2\).

    Answer. \(\hat{g}\) is the least-squares line \(\hat{g}(x)=k_1x+k_0\), since the penalty forces \(g''=0\).

  4. \(\lambda=\infty, m=3\).

    Answer. \(\hat{g}\) is the least-squares quadratic \(\hat{g}(x)=k_2x^2+k_1x+k_0\), since the penalty forces \(g'''=0\).

  5. \(\lambda=0, m=3\).

    Answer. With \(\lambda=0\) there is no penalty, so \(\hat{g}\) can be any curve that interpolates the training points exactly, for example the Lagrange interpolating polynomial \(\hat{g}(x)=\sum_{i=1}^n y_i\frac{\prod_{j\neq i}(x-x_j)}{\prod_{j\neq i}(x_i-x_j)}\).
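These limiting fits can be illustrated with a quick sketch on simulated data (the data below are made up purely for illustration). For \(\lambda=\infty\) the penalty forces \(g^{(m)}=0\), so \(\hat{g}\) is the least-squares fit among polynomials of degree \(m-1\) (and \(\hat{g}=0\) for \(m=0\)); for \(\lambda=0\) any curve interpolating the points is a minimizer.

set.seed(1)
x <- sort(runif(20))
y <- sin(2 * pi * x) + rnorm(20, sd = 0.2)
plot(x, y, col = "darkgrey")
abline(h = 0, lty = 3)                               # (1) lambda = Inf, m = 0: g-hat = 0
abline(h = mean(y), col = "blue")                    # (2) m = 1: the mean of the y_i
abline(lm(y ~ x), col = "red")                       # (3) m = 2: least-squares line
lines(x, fitted(lm(y ~ poly(x, 2))), col = "green")  # (4) m = 3: least-squares quadratic
lines(spline(x, y, n = 200), col = "purple")         # (5) lambda = 0: an interpolating curve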

3. Suppose we fit a curve with basis functions \(b_{1}(X)=X\), \(b_{2}(X)=(X-1)^{2} I(X \geq 1)\). (Note that \(I(X \geq 1)\) equals 1 for \(X \geq 1\) and 0 otherwise.) We fit the linear regression model \[ Y=\beta_{0}+\beta_{1} b_{1}(X)+\beta_{2} b_{2}(X)+\epsilon \] and obtain coefficient estimates \(\hat{\beta}_{0}=1, \hat{\beta}_{1}=1, \hat{\beta}_{2}=-2\). Sketch the estimated curve between \(X=-2\) and \(X=2\). Note the intercepts, slopes, and other relevant information.

Answer.

library(scales)   # provides alpha() for the translucent grid colour
x = seq(-2, 2, 0.2)
y = 1 + x - 2 * (x - 1)^2 * I(x >= 1)   # b1(X) = X, b2(X) = (X - 1)^2 I(X >= 1)
plot(x, y, type = "l", lwd = 2)
abline(v = 1, lty = 3, lwd = 2, col = "darkgrey")   # knot at X = 1

grid(lty = 1, col = alpha("lightgray", 0.4))
text(0.5, 1, "y = 1 + x", cex = 0.8)
text(1.5, 1, "y = 1 + x - 2(x - 1)^2", cex = 0.8)
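In other words, for \(X<1\) the fitted curve is the line \(\hat{y}=1+X\) (intercept 1, slope 1); for \(X\geq 1\) it is the quadratic \(\hat{y}=1+X-2(X-1)^2\), which joins the line at \((1,2)\) with the same slope, peaks at \(2.125\) when \(X=1.25\), and falls back to \(1\) at \(X=2\).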

5. Consider two curves, \(\hat{g}_{1}\) and \(\hat{g}_{2}\), defined by \[ \begin{array}{l} \hat{g}_{1}=\arg \min _{g}\left(\sum_{i=1}^{n}\left(y_{i}-g\left(x_{i}\right)\right)^{2}+\lambda \int\left[g^{(3)}(x)\right]^{2} d x\right), \\ \hat{g}_{2}=\arg \min _{g}\left(\sum_{i=1}^{n}\left(y_{i}-g\left(x_{i}\right)\right)^{2}+\lambda \int\left[g^{(4)}(x)\right]^{2} d x\right), \end{array} \] where \(g^{(m)}\) represents the \(m\)th derivative of \(g\).

  1. As \(\lambda \rightarrow \infty\), will \(\hat{g}_{1}\) or \(\hat{g}_{2}\) have the smaller training RSS?

    Answer. \(\hat{g}_{2}\) will have the smaller (or equal) training RSS: as \(\lambda \rightarrow \infty\), \(\hat{g}_{1}\) must satisfy \(g^{(3)}=0\) (a quadratic), whereas \(\hat{g}_{2}\) only needs \(g^{(4)}=0\) (a cubic), which is more flexible.

  2. As \(\lambda \rightarrow \infty\), will \(\hat{g}_{1}\) or \(\hat{g}_{2}\) have the smaller test RSS?

    Answer. \(\hat{g}_{1}\) will likely have the smaller test RSS, because the extra flexibility of \(\hat{g}_{2}\) makes it more prone to overfitting the training data.

  3. For \(\lambda=0\), will \(\hat{g}_{1}\) or \(\hat{g}_{2}\) have the smaller training and test RSS?

    Answer. There is no penalty, and \(\hat{g}_{1}=\hat{g}_{2}\). So they will have the same training and test RSS.
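A quick simulated illustration of part 1 (a sketch with made-up data): as \(\lambda \rightarrow \infty\), \(\hat{g}_{1}\) is constrained to be a quadratic and \(\hat{g}_{2}\) a cubic, and since the cubic fit nests the quadratic, its training RSS can never be larger.

set.seed(1)
x <- runif(50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.3)               # made-up data
g1 <- lm(y ~ poly(x, 2))   # limit of g1-hat: least-squares quadratic (g''' = 0)
g2 <- lm(y ~ poly(x, 3))   # limit of g2-hat: least-squares cubic (g'''' = 0)
c(RSS.g1 = sum(resid(g1)^2), RSS.g2 = sum(resid(g2)^2))  # RSS.g2 <= RSS.g1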

6. In this exercise, you will further analyze the Wage data set considered throughout this chapter.

  1. Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree \(d\) for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

    Answer.

library(ISLR)
library(boot)
set.seed(1)                      # make the cross-validation folds reproducible
deltas <- rep(NA, 10)
for (i in 1:10) {
    fit <- glm(wage ~ poly(age, i), data = Wage)
    deltas[i] <- cv.glm(Wage, fit, K = 10)$delta[2]
}
# Plot the delta values to see which degree gives the lowest CV MSE.
# The first component of delta is the standard k-fold estimate;
# the second is the bias-corrected estimate.
plot(1:10, deltas, xlab = "Degree", ylab = "CV MSE", type = "b")
abline(h = min(deltas) + 0.2 * sd(deltas), col = "red", lty = 2)
# degree 3 is the smallest degree within the tolerance band
points(3, deltas[3], col = "#BC3C29FF", cex = 2, pch = 20)

The CV plot shows that the MSE drops sharply from degree 1 to degree 2 and levels off afterwards, so a low degree such as 3 is a reasonable choice. We can check degrees 1 through 6 against one another using ANOVA.

fit.1 = lm(wage~poly(age, 1), data=Wage)
fit.2 = lm(wage~poly(age, 2), data=Wage)
fit.3 = lm(wage~poly(age, 3), data=Wage)
fit.4 = lm(wage~poly(age, 4), data=Wage)
fit.5 = lm(wage~poly(age, 5), data=Wage)
fit.6 = lm(wage~poly(age, 6), data=Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5, fit.6)
## Analysis of Variance Table
## 
## Model 1: wage ~ poly(age, 1)
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Model 6: wage ~ poly(age, 6)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.6636 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8936  0.001675 ** 
## 4   2995 4771604  1      6070   3.8117  0.050989 .  
## 5   2994 4770322  1      1283   0.8054  0.369565    
## 6   2993 4766389  1      3932   2.4692  0.116201    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA shows that the cubic term is significant (\(p \approx 0.0017\)) while degrees \(d\geq 4\) add no significant improvement, so \(d=3\) is a reasonable choice. We can make a plot of the resulting polynomial fit to the data.

attach(Wage)
plot(age, wage, col = "darkgrey")
x <- seq(min(age),max(age))
pred <- predict(fit.3, newdata = list(age = x))
lines(x, pred, col = "#BC3C29FF", lwd = 2)

  2. Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

    Answer.

set.seed(1)                      # make the cross-validation folds reproducible
deltas <- rep(NA, 10)
for (i in 2:10) {
    Wage$cut.age <- cut(age, i)                  # step function with i intervals
    fit <- glm(wage ~ cut.age, data = Wage)
    deltas[i] <- cv.glm(Wage, fit, K = 10)$delta[2]
}
plot(2:10, deltas[-1], xlab = "Cuts", ylab = "CV MSE", type = "b")
abline(h = min(deltas[-1]) + 0.2 * sd(deltas[-1]), col = "red", lty = 2)
points(8, deltas[8], col = "#BC3C29FF", cex = 2, pch = 20)

Eight cuts seem optimal. We can make a plot of the fit obtained.

plot(age, wage, col = "darkgrey")
fit <- glm(wage ~ cut(age, 8), data = Wage)
pred <- predict(fit, newdata = list(age = x))   # x spans the range of age (defined above)
lines(x, pred, col = "#BC3C29FF", lwd = 2)

7. The Wage data set contains a number of other features not explored in this chapter, such as marital status (maritl), job class (jobclass), and others. Explore the relationships between some of these other predictors and wage, and use non-linear fitting techniques in order to fit flexible models to the data. Create plots of the results obtained, and write a summary of your findings.

Answer.

par(mfrow = c(1, 2))
boxplot(wage ~ maritl, data = Wage, pars = list(xaxt = "n"))   # suppress default x labels
text(1:5, par("usr")[3] - 17, labels = levels(Wage$maritl),
     cex = 0.75, srt = 30, pos = 1, xpd = TRUE)                # rotated category labels
boxplot(wage ~ jobclass, data = Wage, cex.axis = 0.75)

Married people seem to have higher wages, and people with information-sector jobs earn more than those with industrial jobs. We will use a generalized additive model (gam) to fit the data.

library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.20
fit <- gam(wage ~ maritl + jobclass + education + 
               lo(year, span = 0.7) + s(age, df = 5), data = Wage)
par(mfrow = c(2, 3))
plot(fit, se = T, col = "blue")

Apart from the marital-status and job-class effects noted above, wage increases with education and with year, and middle-aged people earn the most.

10. This question relates to the College data set.

  1. Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

    Answer.

library(leaps)
set.seed(1)
train <- sample(nrow(College), nrow(College)*0.5)
College.train <- College[train, ]
College.test <- College[-train, ]

One approach to implement forward stepwise selection is stepAIC.

library(MASS)
# Fit the full model 
full.model <- lm(Outstate ~ ., data = College.train)
# forward stepwise regression model
step.model <- stepAIC(full.model, direction = "forward", trace = FALSE)
summary(step.model)
## 
## Call:
## lm(formula = Outstate ~ Private + Apps + Accept + Enroll + Top10perc + 
##     Top25perc + F.Undergrad + P.Undergrad + Room.Board + Books + 
##     Personal + PhD + Terminal + S.F.Ratio + perc.alumni + Expend + 
##     Grad.Rate, data = College.train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6276  -1367    -36   1411  10011 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.092e+03  1.204e+03  -0.907  0.36485    
## PrivateYes   2.169e+03  3.622e+02   5.989 4.99e-09 ***
## Apps        -3.851e-01  9.594e-02  -4.014 7.22e-05 ***
## Accept       8.983e-01  1.966e-01   4.568 6.73e-06 ***
## Enroll      -7.638e-01  6.033e-01  -1.266  0.20632    
## Top10perc    5.953e+01  1.669e+01   3.566  0.00041 ***
## Top25perc   -2.335e+01  1.245e+01  -1.876  0.06144 .  
## F.Undergrad -3.081e-02  1.046e-01  -0.294  0.76856    
## P.Undergrad -6.099e-02  1.035e-01  -0.589  0.55608    
## Room.Board   1.088e+00  1.256e-01   8.659  < 2e-16 ***
## Books       -9.438e-01  7.502e-01  -1.258  0.20914    
## Personal    -2.996e-01  1.652e-01  -1.813  0.07062 .  
## PhD          2.376e+00  1.269e+01   0.187  0.85154    
## Terminal     3.213e+01  1.366e+01   2.353  0.01915 *  
## S.F.Ratio   -6.772e+01  3.850e+01  -1.759  0.07943 .  
## perc.alumni  4.689e+01  1.121e+01   4.184 3.58e-05 ***
## Expend       1.474e-01  2.993e-02   4.924 1.28e-06 ***
## Grad.Rate    2.494e+01  8.220e+00   3.034  0.00258 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2042 on 370 degrees of freedom
## Multiple R-squared:  0.7794, Adjusted R-squared:  0.7692 
## F-statistic: 76.88 on 17 and 370 DF,  p-value: < 2.2e-16

From the p-values (below 0.01), we can choose 8 variables: PrivateYes, Apps, Accept, Top10perc, Room.Board, perc.alumni, Expend, and Grad.Rate.
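Note that calling stepAIC on the full model with direction = "forward" cannot add any variables, so the fit above is simply the full least-squares model, from which the variables were then screened by p-value. A minimal sketch of genuine forward selection (assuming MASS is loaded and College.train and full.model are defined as above) starts from the null model and supplies a scope:

null.model <- lm(Outstate ~ 1, data = College.train)   # intercept-only starting point
fwd.model <- stepAIC(null.model, direction = "forward",
                     scope = list(lower = ~ 1, upper = formula(full.model)),
                     trace = FALSE)
summary(fwd.model)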

Another approach is regsubsets.

models <- regsubsets(Outstate ~ ., data = College.train, nvmax = 17,
                     method = "forward")
sum.models <- summary(models)
par(mfrow = c(1, 3))
cp <- sum.models$cp; bic <- sum.models$bic; adjr2 <- sum.models$adjr2
plot(cp, xlab = "Number of variables", ylab = "Cp", type = "b")
abline(h = min(cp) + 0.2 * sd(cp), col = "red", lty = 2)
points(7, cp[7], col = "#BC3C29FF", cex = 2, pch = 20)
plot(bic, xlab = "Number of variables", ylab = "BIC", type='b')
abline(h = min(bic) + 0.2 * sd(bic), col = "red", lty = 2)
points(7, bic[7], col = "#BC3C29FF", cex = 2, pch = 20)
plot(adjr2, xlab = "Number of variables", ylab = "Adjusted R2", type = "b")
abline(h = max(adjr2) - 0.2 * sd(adjr2), col = "red", lty = 2)
points(7, adjr2[7], col = "#BC3C29FF", cex = 2, pch = 20)

Cp, BIC, and adjusted \(R^2\) all suggest that 7 variables are enough. They are PrivateYes, Room.Board, Personal, PhD, perc.alumni, Expend, and Grad.Rate.

models <- regsubsets(Outstate ~ ., data = College, method = "forward")
coeffs <- coef(models, id = 7)
names(coeffs)
## [1] "(Intercept)" "PrivateYes"  "Room.Board"  "Personal"    "PhD"        
## [6] "perc.alumni" "Expend"      "Grad.Rate"

  2. Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

    Answer.

gam.fit <- gam(Outstate ~ Private + s(Room.Board, df = 2) + s(Personal, df = 2) + 
                   s(PhD, df = 2) + s(perc.alumni, df = 2) + s(Expend, df = 5) + 
                   s(Grad.Rate, df = 2), data=College.train)
par(mfrow = c(3, 3))
plot(gam.fit, se = T, col = "blue")

The out-of-state tuition of private universities is higher. Tuition also increases with room and board costs, the percentage of faculty with PhDs, the percentage of alumni who donate, and the graduation rate, while it decreases with personal spending. For Expend, tuition rises rapidly at first, then levels off and dips slightly at the highest values.

  3. Evaluate the model obtained on the test set, and explain the results obtained.

    Answer.

gam.pred = predict(gam.fit, College.test)
gam.res = mean((College.test$Outstate - gam.pred)^2)
gam.tot = mean((College.test$Outstate - mean(College.test$Outstate))^2)
test.r2 = 1 - gam.res/gam.tot
test.r2
## [1] 0.7654465

The 7-variable GAM explains about 77% of the variance in out-of-state tuition on the test set.

  4. For which variables, if any, is there evidence of a non-linear relationship with the response?

    Answer.

summary(gam.fit)
## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 2) + s(Personal, 
##     df = 2) + s(PhD, df = 2) + s(perc.alumni, df = 2) + s(Expend, 
##     df = 5) + s(Grad.Rate, df = 2), data = College.train)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -7271.77 -1149.51    13.33  1270.26  7036.29 
## 
## (Dispersion Parameter for gaussian family taken to be 3649123)
## 
##     Null Deviance: 6989966760 on 387 degrees of freedom
## Residual Deviance: 1353824246 on 370.9999 degrees of freedom
## AIC: 6982.392 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                         Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private                  1 1772040424 1772040424 485.607 < 2.2e-16 ***
## s(Room.Board, df = 2)    1 1566237740 1566237740 429.209 < 2.2e-16 ***
## s(Personal, df = 2)      1   78005544   78005544  21.377 5.218e-06 ***
## s(PhD, df = 2)           1  307597547  307597547  84.294 < 2.2e-16 ***
## s(perc.alumni, df = 2)   1  292446035  292446035  80.141 < 2.2e-16 ***
## s(Expend, df = 5)        1  565514955  565514955 154.973 < 2.2e-16 ***
## s(Grad.Rate, df = 2)     1   68872241   68872241  18.874 1.805e-05 ***
## Residuals              371 1353824246    3649123                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                        Npar Df  Npar F     Pr(F)    
## (Intercept)                                         
## Private                                             
## s(Room.Board, df = 2)        1  2.1205    0.1462    
## s(Personal, df = 2)          1  1.3607    0.2442    
## s(PhD, df = 2)               1  1.1926    0.2755    
## s(perc.alumni, df = 2)       1  0.2184    0.6405    
## s(Expend, df = 5)            4 20.8603 1.554e-15 ***
## s(Grad.Rate, df = 2)         1  0.5097    0.4757    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA for nonparametric effects shows a significant non-linear relationship between out-of-state tuition and Expend; there is no strong evidence of non-linearity for the other predictors.