2. It was stated in the text that classifying an observation to the class for which (4.12) is largest is equivalent to classifying an observation to the class for which (4.13) is largest. Prove that this is the case. In other words, under the assumption that the observations in the kth class are drawn from a \(N(\mu_k, \sigma^2)\) distribution, the Bayes’ classifier assigns an observation to the class for which the discriminant function is maximized.

Proof. For (4.12) \[\begin{equation} p_{k}(x)=\frac{\pi_{k} \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{1}{2 \sigma^{2}}\left(x-\mu_{k}\right)^{2}\right)}{\sum_{l=1}^{K} \pi_{l} \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{1}{2 \sigma^{2}}\left(x-\mu_{l}\right)^{2}\right)}, \end{equation}\] the denominator is the same for all \(k\). So maximizing (4.12) is equivalent to maximize the numerator. Take the \(ln\) and get \[\begin{equation}\label{eq:2} -\frac{1}{2 \sigma^{2}}\left(x-\mu_{k}\right)^{2}+ln(\pi_k \frac{1}{\sqrt{2 \pi} \sigma} ). \end{equation}\] We could deliminate the terms that are invariant to \(k\) and it is equivalent to maximize \[\begin{equation} x \cdot \frac{\mu_{k}}{\sigma^{2}}-\frac{\mu_{k}^{2}}{2 \sigma^{2}}+\ln \left(\pi_{k}\right). \end{equation}\] So it is equivalent to maximize (4.13) (multiply by \(\log e\) ) \[\begin{equation} \delta_{k}(x)=x \cdot \frac{\mu_{k}}{\sigma^{2}}-\frac{\mu_{k}^{2}}{2 \sigma^{2}}+\log \left(\pi_{k}\right). \end{equation}\]

3. This problem relates to the QDA model, in which the observations within each class are drawn from a normal distribution with a classspecific mean vector and a class specific covariance matrix. We consider the simple case where \(p=1 ;\) i.e. there is only one feature

Suppose that we have \(K\) classes, and that if an observation belongs to the \(k\) th class then \(X\) comes from a one-dimensional normal distribution, \(X \sim N\left(\mu_{k}, \sigma_{k}^{2}\right) .\) Recall that the density function for the one-dimensional normal distribution is given in (4.11). Prove that in this case, the Bayes’ classifier is not linear. Argue that it is in fact quadratic.

Proof. Equation () should be \[\begin{equation} -\frac{1}{2 \sigma_k^{2}}\left(x-\mu_{k}\right)^{2}+ln(\pi_k \frac{1}{\sqrt{2 \pi} \sigma_k} ). \end{equation}\] The coefficient of quadratic term \(x^2\) is \(-\frac{1}{2 \sigma_k^{2}}\), which changes with \(k\). So the corresponding discriminant function is quadratic.

4. When the number of features \(p\) is large, there tends to be a deterioration in the performance of \(\mathrm{KNN}\) and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made. This phenomenon is known as the curse of dimensionality, and it ties into the fact that non-parametric approaches often perform poorly when \(p\) is large. We will now investigate this curse.

  1. Suppose that we have a set of observations, each with measurements on \(p=1\) feature, \(X\). We assume that \(X\) is uniformly (evenly) distributed on \([0,1]\). Associated with each observation is a response value. Suppose that we wish to predict a test observation’s response using only observations that are within \(10 \%\) of the range of \(X\) closest to that test observation. For instance, in order to predict the response for a test observation with \(X=0.6\), we will use observations in the range \([0.55,0.65] .\) On average, what fraction of the available observations will we use to make the prediction?

    Answer. On average, it is 10% if ignoring \(x\in [0,0.05]\cup [0.95,1]\). (Or take the integration \(2\times \int_{0}^{0.05} (x+0.05)\ dx + \int_{0.05}^{0.95} 0.1\ dx = 0.0975\). For simplicity we would not consider the marginal area in the following.)

  2. Now suppose that we have a set of observations, each with measurements on \(p=2\) features, \(X_{1}\) and \(X_{2}\). We assume that \(\left(X_{1}, X_{2}\right)\) are uniformly distributed on \([0,1] \times[0,1] .\) We wish to predict a test observation’s response using only observations that are within \(10 \%\) of the range of \(X_{1}\) and within \(10 \%\) of the range of \(X_{2}\) closest to that test observation. For instance, in order to predict the response for a test observation with \(X_{1}=0.6\) and \(X_{2}=0.35\), we will use observations in the range \([0.55,0.65]\) for \(X_{1}\) and in the range \([0.3,0.4]\) for \(X_{2} .\) On average, what fraction of the available observations will we use to make the prediction?

    Answer. It is 1%.

  3. Now suppose that we have a set of observations on \(p=100\) features. Again the observations are uniformly distributed on each feature, and again each feature ranges in value from 0 to \(1 .\) We wish to predict a test observation’s response using observations within the \(10 \%\) of each feature’s range that is closest to that test observation. What fraction of the available observations will we use to make the prediction?

    Answer. It is \(0.1^{100}\approx 0\).

  4. Using your answers to parts (a)-(c), argue that a drawback of KNN when \(p\) is large is that there are very few training observations “near” any given test observation.

    Answer. Because the fraction of the available observations decreases exponentially as \(p\) grows linearly.

  5. Now suppose that we wish to make a prediction for a test observation by creating a \(p\) -dimensional hypercube centered around the test observation that contains, on average, \(10 \%\) of the training observations. For \(p=1,2\), and 100, what is the length of each side of the hypercube? Comment on your answer.

    Answer. The length should be \(0.10\), \(0.1^{1/2}=0.32\), \(0.1^{1/100}=0.98\). The length increases dramatically and even approximates to \(1\) in each dimension when \(p=100\). The observations in this hypercube are not “near” the test observation.

5. We now examine the differences between LDA and QDA.

  1. If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?

    Answer. If the Bayes decision boundary is linear, QDA performs better on the training set because it is more flexible with the quadratic term. But LDA performs better on the test set because the Bayes decision boundary is actually linear and QDA overfits.

  2. If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?

    Answer. If the Bayes decision boundary is non-linear, QDA performs better both on the training set and the test set.

  3. In general, as the sample size \(n\) increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?

    Answer. In general, we expect the performance of QDA will improve because the variance is offset when \(n\) is large and QDA is more flexible.

  4. True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer.

    Answer. False. Bayes decision boundary is actually linear and QDA may overfit if the sample size is not large enough. And the test error rate will be inferior.

9. This problem has to do with odds.

  1. On average, what fraction of people with an odds of \(0.37\) of defaulting on their credit card payment will in fact default?

    Answer. Denote the fraction \(p\), then \[\begin{equation} \frac{p}{1-p} = 0.37. \end{equation}\] We get the fraction \(p = 0.27\).

  2. Suppose that an individual has a \(16 \%\) chance of defaulting on her credit card payment. What are the odds that she will default?

    Answer. \(p=0.16\), then the odds \(= \displaystyle\frac{p}{1-p} = 0.19\).

11. In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

  1. Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

    library(ISLR)
    attach(Auto)
    mpg01 <- rep(0, length(mpg))
    mpg01[mpg > median(mpg)] <- 1
    dat <- data.frame(Auto, mpg01)
  2. Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

    # explore correlation
    cor(dat[,-9])
    ##                     mpg  cylinders displacement horsepower     weight
    ## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
    ## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
    ## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
    ## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
    ## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
    ## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
    ## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
    ## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
    ## mpg01         0.8369392 -0.7591939   -0.7534766 -0.6670526 -0.7577566
    ##              acceleration       year     origin      mpg01
    ## mpg             0.4233285  0.5805410  0.5652088  0.8369392
    ## cylinders      -0.5046834 -0.3456474 -0.5689316 -0.7591939
    ## displacement   -0.5438005 -0.3698552 -0.6145351 -0.7534766
    ## horsepower     -0.6891955 -0.4163615 -0.4551715 -0.6670526
    ## weight         -0.4168392 -0.3091199 -0.5850054 -0.7577566
    ## acceleration    1.0000000  0.2903161  0.2127458  0.3468215
    ## year            0.2903161  1.0000000  0.1815277  0.4299042
    ## origin          0.2127458  0.1815277  1.0000000  0.5136984
    ## mpg01           0.3468215  0.4299042  0.5136984  1.0000000

    mpg01 is positively correlated to acceleration, year and origin, and negatively correlated to cylinders, displacement, horsepower and weight.

    # scatterplots
    # library(car)
    # scatterplotMatrix(dat, main='scatterplot-Auto')
    pairs(dat, main="Scatterplot-Auto", pch = 21, bg=c(1:nrow(dat)))

    # boxplots
    library(ggpubr)
    ## Loading required package: ggplot2
    ## 
    ## Attaching package: 'ggplot2'
    ## The following object is masked from 'Auto':
    ## 
    ##     mpg
    # boxplot(cylinders ~ mpg01, data = dat)
    p <- ggboxplot(dat, x = "mpg01", y = "weight",
          color = "mpg01", palette = "jco",
          add = "jitter")
    #  Add p-value
    p + stat_compare_means()

    # Change method
    # p + stat_compare_means(method = "t.test")
  3. Split the data into a training set and a test set.

    # 70% for training and 30% for testing
    percent <- 0.7
    idx <- sort(sample(nrow(dat), nrow(dat)*percent))
    train <- dat[idx,]
    test <- dat[-idx,]
  4. Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

    library(MASS)
    fit.lda <- lda(mpg01 ~ cylinders + weight + displacement + horsepower, 
                   data = dat, subset = idx)
    fit.lda
    ## Call:
    ## lda(mpg01 ~ cylinders + weight + displacement + horsepower, data = dat, 
    ##     subset = idx)
    ## 
    ## Prior probabilities of groups:
    ##         0         1 
    ## 0.4963504 0.5036496 
    ## 
    ## Group means:
    ##   cylinders   weight displacement horsepower
    ## 0  6.786765 3641.022     275.2941  130.96324
    ## 1  4.188406 2314.000     114.5290   78.00725
    ## 
    ## Coefficients of linear discriminants:
    ##                        LD1
    ## cylinders    -0.3974647924
    ## weight       -0.0009670704
    ## displacement -0.0029615583
    ## horsepower    0.0049004106
    pred.lda = predict(fit.lda, test)
    # mean(pred.lda$class != test$mpg01)
    # Or we can use confusion matrix
    conf <- table(list(predicted=pred.lda$class, observed=test$mpg01))
    library(caret)
    ## Loading required package: lattice
    confusionMatrix(conf)
    ## Confusion Matrix and Statistics
    ## 
    ##          observed
    ## predicted  0  1
    ##         0 50  3
    ##         1 10 55
    ##                                        
    ##                Accuracy : 0.8898       
    ##                  95% CI : (0.819, 0.94)
    ##     No Information Rate : 0.5085       
    ##     P-Value [Acc > NIR] : < 2e-16      
    ##                                        
    ##                   Kappa : 0.78         
    ##                                        
    ##  Mcnemar's Test P-Value : 0.09609      
    ##                                        
    ##             Sensitivity : 0.8333       
    ##             Specificity : 0.9483       
    ##          Pos Pred Value : 0.9434       
    ##          Neg Pred Value : 0.8462       
    ##              Prevalence : 0.5085       
    ##          Detection Rate : 0.4237       
    ##    Detection Prevalence : 0.4492       
    ##       Balanced Accuracy : 0.8908       
    ##                                        
    ##        'Positive' Class : 0            
    ## 
    test_error <- as.numeric(1-confusionMatrix(conf)$overall["Accuracy"])
    test_error
    ## [1] 0.1101695

    On testing dataset, the test error is 11.02%.

  5. Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

    fit.qda <- qda(mpg01 ~ cylinders + weight + displacement + horsepower, 
                   data = dat, subset = idx)
    fit.qda
    ## Call:
    ## qda(mpg01 ~ cylinders + weight + displacement + horsepower, data = dat, 
    ##     subset = idx)
    ## 
    ## Prior probabilities of groups:
    ##         0         1 
    ## 0.4963504 0.5036496 
    ## 
    ## Group means:
    ##   cylinders   weight displacement horsepower
    ## 0  6.786765 3641.022     275.2941  130.96324
    ## 1  4.188406 2314.000     114.5290   78.00725
    pred.qda = predict(fit.qda, test)
    mean(pred.qda$class != test$mpg01)
    ## [1] 0.1016949

    On testing dataset, the test error is 10.17%.

  6. Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

    fit.glm <- glm(mpg01 ~ cylinders + weight + displacement + horsepower,
                   data = dat, family = binomial, subset = idx)
    summary(fit.glm)
    ## 
    ## Call:
    ## glm(formula = mpg01 ~ cylinders + weight + displacement + horsepower, 
    ##     family = binomial, data = dat, subset = idx)
    ## 
    ## Deviance Residuals: 
    ##      Min        1Q    Median        3Q       Max  
    ## -2.44120  -0.17870   0.08712   0.31147   3.05303  
    ## 
    ## Coefficients:
    ##                Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept)  11.8103006  2.0819718   5.673 1.41e-08 ***
    ## cylinders     0.1869071  0.3972245   0.471  0.63797    
    ## weight       -0.0020251  0.0008573  -2.362  0.01817 *  
    ## displacement -0.0164493  0.0095899  -1.715  0.08629 .  
    ## horsepower   -0.0443408  0.0172072  -2.577  0.00997 ** 
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 379.83  on 273  degrees of freedom
    ## Residual deviance: 138.27  on 269  degrees of freedom
    ## AIC: 148.27
    ## 
    ## Number of Fisher Scoring iterations: 7
    pred.glm = as.numeric(predict(fit.glm, test) > 0.5)
    mean(pred.glm != test$mpg01)
    ## [1] 0.1101695

    On testing dataset, the test error is 11.02%.

  7. Perform KNN on the training data, with several values of \(K\), in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of \(K\) seems to perform the best on this data set?

    library(class)
    knn.K <- function(k){
        pred.knn <- knn(train[,c("cylinders", "weight", 
                             "displacement", "horsepower")], 
                    test[,c("cylinders", "weight", 
                             "displacement", "horsepower")], 
                    train$mpg01, k)
        mean(pred.knn != test$mpg01)
    }
    for (k in c(1,5,10,20,50,100)){
        # print(k)
        # print(knn.K(k))
        cat("Value K =", k, ": the test error for KNN is", knn.K(k), "\n")
    }
    ## Value K = 1 : the test error for KNN is 0.1694915 
    ## Value K = 5 : the test error for KNN is 0.1101695 
    ## Value K = 10 : the test error for KNN is 0.1016949 
    ## Value K = 20 : the test error for KNN is 0.1101695 
    ## Value K = 50 : the test error for KNN is 0.1101695 
    ## Value K = 100 : the test error for KNN is 0.1186441

    By trying different \(K\) values, the lowest test error is reached when \(K=10\).