1. This problem involves the \(K\)-means clustering algorithm.

  1. Prove (10.12), that is

\[ \frac{1}{\left|C_{k}\right|} \sum_{i, i^{\prime} \in C_{k}} \sum_{j=1}^{p}\left(x_{i j}-x_{i^{\prime} j}\right)^{2}=2 \sum_{i \in C_{k}} \sum_{j=1}^{p}\left(x_{i j}-\bar{x}_{k j}\right)^{2}, \]

where \(\bar{x}_{k j}=\frac{1}{\left|C_{k}\right|} \sum_{i \in C_{k}} x_{i j}\) is the mean for feature \(j\) in cluster \(C_{k}\).

Answer. Noting that \[ \sum_{i'\in C_{k}}(x_{i' j}-\bar x_{kj})=(\sum_{i'\in C_{k}}x_{i' j})-|C_{k}|\cdot\bar x_{kj}=0, \] so \[ \begin{aligned} \frac{1}{\left|C_{k}\right|} \sum_{i, i^{\prime} \in C_{k}} \sum_{j=1}^{p}\left(x_{i j}-x_{i^{\prime} j}\right)^{2} &= \frac{1}{\left|C_{k}\right|} \sum_{i, i^{\prime} \in C_{k}} \sum_{j=1}^{p}\left((x_{i j}-\bar x_{kj})-(x_{i^{\prime}j} -\bar x_{kj})\right)^{2}\\ &=\frac{1}{\left|C_{k}\right|} \sum_{i, i^{\prime} \in C_{k}} \sum_{j=1}^{p}\left((x_{i j}-\bar x_{kj})^2+(x_{i^{\prime}j} -\bar x_{kj})^2-2(x_{i j}-\bar x_{kj})(x_{i^{\prime}j} -\bar x_{kj})\right)\\ &=2\sum_{i\in C_{k}} \sum_{j=1}^{p}\left(x_{i j}-\bar x_{kj}\right)^2-\frac{2}{\left|C_{k}\right|}\sum_{j=1}^{p}\sum_{i\in C_{k}}\left((x_{i j}-\bar x_{kj})\sum_{i'\in C_{k}}(x_{i' j}-\bar x_{kj})\right)\\ &=2\sum_{i\in C_{k}} \sum_{j=1}^{p}\left(x_{i j}-\bar x_{kj}\right)^2. \end{aligned} \]

  1. On the basis of this identity, argue that the \(K\)-means clustering algorithm (Algorithm 10.1) decreases the objective (10.11), that is \[ \underset{C_{1}, \ldots, C_{K}}{\operatorname{minimize}}\left\{\sum_{k=1}^{K} \frac{1}{\left|C_{k}\right|} \sum_{i, i^{\prime} \in C_{k}} \sum_{j=1}^{p}\left(x_{i j}-x_{i^{\prime} j}\right)^{2}\right\}, \] at each iteration.

Answer. Because \(\sum_{j=1}^{p}\left(x_{i j}-\bar{x}_{k j}\right)^{2}\) is the Euclidean distance of the observation \(i\) and the belonged centroid, minimizing the objective is equivalent to minimize the within-cluster variance.

2. Suppose that we have four observations, for which we compute a dissimilarity matrix, given by \[ \left[\begin{array}{cccc} & 0.3 & 0.4 & 0.7 \\ 0.3 & & 0.5 & 0.8 \\ 0.4 & 0.5 & & 0.45 \\ 0.7 & 0.8 & 0.45 & \end{array}\right] \] For instance, the dissimilarity between the first and second observations is \(0.3\), and the dissimilarity between the second and fourth observations is \(0.8\).

  1. On the basis of this dissimilarity matrix, sketch the dendrogram that results from hierarchically clustering these four observations using complete linkage. Be sure to indicate on the plot the height at which each fusion occurs, as well as the observations corresponding to each leaf in the dendrogram.
d <- data.frame(c(0,0.3,0.4,0.7), c(0.3,0,0.5,0.8),
                c(0.4,0.5,0,0.45), c(0.7,0.8,0.45,0))
colnames(d) <- c(1,2,3,4)
d <- as.dist(d)
plot(hclust(d, method = 'complete'), xlab = '')

  1. Repeat (a), this time using single linkage clustering.
plot(hclust(d, method='single'), xlab='')

  1. Suppose that we cut the dendogram obtained in (a) such that two clusters result. Which observations are in each cluster?

Answer. (1,2) and (3,4).

  1. Suppose that we cut the dendogram obtained in (b) such that two clusters result. Which observations are in each cluster?

Answer. (4) and (1,2,3).

  1. It is mentioned in the chapter that at each fusion in the dendrogram, the position of the two clusters being fused can be swapped without changing the meaning of the dendrogram. Draw a dendrogram that is equivalent to the dendrogram in (a), for which two or more of the leaves are repositioned, but for which the meaning of the dendrogram is the same.
plot(hclust(d, method="complete"), labels=c(2,1,4,3), xlab='')

3. In this problem, you will perform \(K\)-means clustering manually, with \(K=2\), on a small example with \(n=6\) observations and \(p=2\) features. The observations are as follows.
  1. Plot the observations.
obs <- data.frame(c(1, 1, 0, 5, 6, 4), c(4, 3, 4, 1, 2, 0))
colnames(obs) <- c("x1","x2")
plot(obs$x1, obs$x2, xlab = "x1", ylab = "x2")

  1. Randomly assign a cluster label to each observation. You can use the sample() command in R to do this. Report the cluster labels for each observation.
set.seed(1)
labels <- sample(2, nrow(obs), replace=T)
labels
## [1] 1 2 1 1 2 1
  1. Compute the centroid for each cluster.
centroid1 <- c(mean(obs[labels==1,1]), mean(obs[labels==1,2]))
centroid1
## [1] 2.50 2.25
centroid2 <- c(mean(obs[labels==2,1]), mean(obs[labels==2,2]))
centroid2
## [1] 3.5 2.5
  1. Assign each observation to the centroid to which it is closest, in terms of Euclidean distance. Report the cluster labels for each observation.
plot(obs$x1, obs$x2, xlab = "x1", ylab = "x2", col=labels*2, pch=20, cex=2)
points(centroid1[1], centroid1[2], col=2, pch=4, cex=1.5)
points(centroid2[1], centroid2[2], col=4, pch=4, cex=1.5)

eucl <- function(a,b){return(sqrt(sum((a-b)^2)))}
labels_new <- apply(obs,1, function(x){which.min(c(eucl(x, centroid1), eucl(x, centroid2)))})
labels_new
## [1] 1 1 1 2 2 2
  1. Repeat (c) and (d) until the answers obtained stop changing.
while (!all(labels_new == labels)) {
    labels <- labels_new
    centroid1 <- c(mean(obs[labels==1,1]), mean(obs[labels==1,2]))
    centroid2 <- c(mean(obs[labels==2,1]), mean(obs[labels==2,2]))
    labels_new <- apply(obs,1, function(x){which.min(c(eucl(x, centroid1), eucl(x, centroid2)))})
}
labels_new
## [1] 1 1 1 2 2 2
  1. In your plot from (a), color the observations according to the cluster labels obtained.
plot(obs$x1, obs$x2, xlab = "x1", ylab = "x2", col=labels_new*2, pch=20, cex=2)
points(centroid1[1], centroid1[2], col=2, pch=4, cex=1.5)
points(centroid2[1], centroid2[2], col=4, pch=4, cex=1.5)

10. In this problem, you will generate simulated data, and then perform PCA and \(K\)-means clustering on the data.

  1. Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables. Hint: There are a number of functions in R that you can use to generate data. One example is the rnorm() function; runif() is another option. Be sure to add a mean shift to the observations in each class so that there are three distinct classes.
set.seed(1)
x = matrix(rnorm(60*50), ncol=50)
x[21:60, 2:3] <- matrix(runif(40*2, min=9, max=11), ncol=2)
x[41:60, 4:5] <- matrix(rnorm(20*2, mean=-10, sd=1), ncol=2)
labels <- c(rep(1,20),rep(2,20),rep(3,20))
  1. Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes. If the three classes appear separated in this plot, then continue on to part (c). If not, then return to part (a) and modify the simulation so that there is greater separation between the three classes. Do not continue to part (c) until the three classes show at least some separation in the first two principal component score vectors.
pca.out = prcomp(x)
# summary(pca.out)
plot(pca.out$x[,1:2], col=labels+1, pch=19) 

  1. Perform \(K\)-means clustering of the observations with \(K=3\). How well do the clusters that you obtained in \(K\)-means clustering compare to the true class labels? Hint: You can use the table() function in R to compare the true class labels to the class labels obtained by clustering. Be careful how you interpret the results: \(K\)-means clustering will arbitrarily number the clusters, so you cannot simply check whether the true class labels and clustering labels are the same.
set.seed(3)
km.out=kmeans(x,3,nstart=15)
table(km.out$cluster, c(rep(3,20), rep(1,20), rep(2,20)))
##    
##      1  2  3
##   1 20  0  0
##   2  0 20  0
##   3  0  0 20
  1. Perform \(K\)-means clustering with \(K=2\). Describe your results.
km.out = kmeans(x, 2, nstart=15)
km.out$cluster
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [39] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

Two of the previous classes merge into one class.

plot(pca.out$x[,1:2], col=km.out$cluster+1, cex=2, pch=1, lwd=2)
points(pca.out$x[,1:2], col=labels+1, pch=19)

  1. Now perform \(K\)-means clustering with \(K=4\), and describe your results.
set.seed(1)
km.out = kmeans(x, 4, nstart=15)
km.out$cluster
##  [1] 1 3 1 3 1 1 3 1 3 1 3 1 3 3 1 1 3 3 3 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [39] 4 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

One of the previous classes is divided into two classes.

plot(pca.out$x[,1:2], col=km.out$cluster*2, cex=2, pch=1, lwd=2)
points(pca.out$x[,1:2], col=labels+1, pch=19)

  1. Now perform \(K\)-means clustering with \(K=3\) on the first two principal component score vectors, rather than on the raw data. That is, perform \(K\)-means clustering on the \(60 \times 2\) matrix of which the first column is the first principal component score vector, and the second column is the second principal component score vector. Comment on the results.
set.seed(2)
km.out = kmeans(pca.out$x[,1:2], 3, nstart=15)
table(km.out$cluster, c(rep(1,20), rep(2,20), rep(3,20)))
##    
##      1  2  3
##   1 20  0  0
##   2  0 20  0
##   3  0  0 20

The cluster results match the classes.

  1. Using the scale() function, perform \(K\)-means clustering with \(K=3\) on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (b)? Explain.
km.out = kmeans(scale(x), 3, nstart=15)
km.out$cluster
##  [1] 2 1 2 1 2 2 1 2 1 2 1 2 1 1 2 1 1 1 1 1 1 1 2 2 1 2 2 2 2 2 1 1 1 2 2 1 1 2
## [39] 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

The results are poorer after scaling. The three clusters are obtained mainly from shifting the mean, so the difference is weakened after scaling.

pca.out2 = prcomp(scale(x))
plot(pca.out2$x[,1:2], col=km.out$cluster+1, cex=2, pch=1, lwd=2)
points(pca.out2$x[,1:2], col=labels+1, pch=19)