1. This problem involves the \(K\)-means clustering algorithm. Prove the identity
\[ \frac{1}{\left|C_{k}\right|} \sum_{i, i^{\prime} \in C_{k}} \sum_{j=1}^{p}\left(x_{i j}-x_{i^{\prime} j}\right)^{2}=2 \sum_{i \in C_{k}} \sum_{j=1}^{p}\left(x_{i j}-\bar{x}_{k j}\right)^{2}, \]
where \(\bar{x}_{k j}=\frac{1}{\left|C_{k}\right|} \sum_{i \in C_{k}} x_{i j}\) is the mean for feature \(j\) in cluster \(C_{k}\).
Answer. First note that \[ \sum_{i'\in C_{k}}(x_{i' j}-\bar x_{kj})=\Big(\sum_{i'\in C_{k}}x_{i' j}\Big)-|C_{k}|\,\bar x_{kj}=0, \] i.e. the deviations from the cluster mean sum to zero for every feature \(j\). Then \[ \begin{aligned} \frac{1}{\left|C_{k}\right|} \sum_{i, i^{\prime} \in C_{k}} \sum_{j=1}^{p}\left(x_{i j}-x_{i^{\prime} j}\right)^{2} &= \frac{1}{\left|C_{k}\right|} \sum_{i, i^{\prime} \in C_{k}} \sum_{j=1}^{p}\left((x_{i j}-\bar x_{kj})-(x_{i^{\prime}j} -\bar x_{kj})\right)^{2}\\ &=\frac{1}{\left|C_{k}\right|} \sum_{i, i^{\prime} \in C_{k}} \sum_{j=1}^{p}\left((x_{i j}-\bar x_{kj})^2+(x_{i^{\prime}j} -\bar x_{kj})^2-2(x_{i j}-\bar x_{kj})(x_{i^{\prime}j} -\bar x_{kj})\right)\\ &=2\sum_{i\in C_{k}} \sum_{j=1}^{p}\left(x_{i j}-\bar x_{kj}\right)^2-\frac{2}{\left|C_{k}\right|}\sum_{j=1}^{p}\sum_{i\in C_{k}}\left((x_{i j}-\bar x_{kj})\sum_{i'\in C_{k}}(x_{i' j}-\bar x_{kj})\right)\\ &=2\sum_{i\in C_{k}} \sum_{j=1}^{p}\left(x_{i j}-\bar x_{kj}\right)^2. \end{aligned} \]
Answer. Because \(\sum_{j=1}^{p}\left(x_{i j}-\bar{x}_{k j}\right)^{2}\) is the squared Euclidean distance between observation \(i\) and the centroid of the cluster it belongs to, minimizing the objective is equivalent to minimizing the within-cluster variance, i.e. the total squared distance from each observation to its cluster centroid.
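Summing the identity over all \(K\) clusters makes this equivalence explicit: \[ \sum_{k=1}^{K}\frac{1}{\left|C_{k}\right|} \sum_{i, i^{\prime} \in C_{k}} \sum_{j=1}^{p}\left(x_{i j}-x_{i^{\prime} j}\right)^{2} \;=\; 2 \sum_{k=1}^{K}\sum_{i \in C_{k}} \sum_{j=1}^{p}\left(x_{i j}-\bar{x}_{k j}\right)^{2}, \] so any reassignment of observations that lowers the right-hand side lowers the \(K\)-means objective on the left by exactly the same amount.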
2. Suppose that we have four observations, for which we compute a dissimilarity matrix, given by \[ \left[\begin{array}{cccc} & 0.3 & 0.4 & 0.7 \\ 0.3 & & 0.5 & 0.8 \\ 0.4 & 0.5 & & 0.45 \\ 0.7 & 0.8 & 0.45 & \end{array}\right] \] For instance, the dissimilarity between the first and second observations is \(0.3\), and the dissimilarity between the second and fourth observations is \(0.8\).
# Dissimilarity matrix from the problem, converted to a "dist" object
d <- data.frame(c(0, 0.3, 0.4, 0.7), c(0.3, 0, 0.5, 0.8),
                c(0.4, 0.5, 0, 0.45), c(0.7, 0.8, 0.45, 0))
colnames(d) <- c(1, 2, 3, 4)
d <- as.dist(d)
# Dendrograms under complete linkage and single linkage
plot(hclust(d, method = "complete"), xlab = "")
plot(hclust(d, method = "single"), xlab = "")
Answer. Cutting the complete-linkage dendrogram to obtain two clusters gives (1, 2) and (3, 4).
Answer. Cutting the single-linkage dendrogram to obtain two clusters gives (4) and (1, 2, 3).
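These cuts can also be verified programmatically with cutree(); this is a small added check, not part of the original answer.
hc.complete <- hclust(d, method = "complete")
hc.single   <- hclust(d, method = "single")
cutree(hc.complete, k = 2)  # observations 1, 2 in one cluster; 3, 4 in the other
cutree(hc.single, k = 2)    # observations 1, 2, 3 in one cluster; 4 on its own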
# An equivalent dendrogram: the leaves within each fusion are swapped by relabelling
plot(hclust(d, method="complete"), labels=c(2,1,4,3), xlab='')
3. In this problem, you will perform \(K\)-means clustering manually, with \(K=2\), on a small example with \(n=6\) observations and \(p=2\) features. The observations are as follows.
obs <- data.frame(c(1, 1, 0, 5, 6, 4), c(4, 3, 4, 1, 2, 0))
colnames(obs) <- c("x1","x2")
plot(obs$x1, obs$x2, xlab = "x1", ylab = "x2")
Randomly assign a cluster label to each observation; you can use the sample() command in R to do this. Report the cluster labels for each observation.
set.seed(1)
labels <- sample(2, nrow(obs), replace=T)
labels
## [1] 1 2 1 1 2 1
# Centroid of each cluster: the mean of its observations for each feature
centroid1 <- c(mean(obs[labels==1,1]), mean(obs[labels==1,2]))
centroid1
## [1] 2.50 2.25
centroid2 <- c(mean(obs[labels==2,1]), mean(obs[labels==2,2]))
centroid2
## [1] 3.5 2.5
plot(obs$x1, obs$x2, xlab = "x1", ylab = "x2", col=labels*2, pch=20, cex=2)
points(centroid1[1], centroid1[2], col=2, pch=4, cex=1.5)
points(centroid2[1], centroid2[2], col=4, pch=4, cex=1.5)
# Euclidean distance between two vectors
eucl <- function(a, b) { return(sqrt(sum((a - b)^2))) }
# Reassign each observation to its nearest centroid
labels_new <- apply(obs, 1, function(x) { which.min(c(eucl(x, centroid1), eucl(x, centroid2))) })
labels_new
## [1] 1 1 1 2 2 2
# Iterate: recompute the centroids and reassign observations until the labels stop changing
while (!all(labels_new == labels)) {
  labels <- labels_new
  centroid1 <- c(mean(obs[labels==1,1]), mean(obs[labels==1,2]))
  centroid2 <- c(mean(obs[labels==2,1]), mean(obs[labels==2,2]))
  labels_new <- apply(obs, 1, function(x) { which.min(c(eucl(x, centroid1), eucl(x, centroid2))) })
}
labels_new
## [1] 1 1 1 2 2 2
plot(obs$x1, obs$x2, xlab = "x1", ylab = "x2", col=labels_new*2, pch=20, cex=2)
points(centroid1[1], centroid1[2], col=2, pch=4, cex=1.5)
points(centroid2[1], centroid2[2], col=4, pch=4, cex=1.5)
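As a sanity check (not required by the problem), the built-in kmeans() function can be run on the same six observations; it should recover the same two-group partition, although the cluster numbers it assigns are arbitrary.
set.seed(1)
km.check <- kmeans(obs, centers = 2, nstart = 20)  # built-in K-means (added check, not part of the original solution)
km.check$cluster  # expected: observations 1-3 in one group and 4-6 in the other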
10. In this problem, you will generate simulated data, and then perform PCA and \(K\)-means clustering on the data.
There are a number of functions in R that you can use to generate data. One example is the rnorm() function; runif() is another option. Be sure to add a mean shift to the observations in each class so that there are three distinct classes.
set.seed(1)
# 60 observations on 50 features, all N(0, 1) noise to start with
x <- matrix(rnorm(60 * 50), ncol = 50)
# Mean shifts: classes 2 and 3 are shifted in features 2-3, class 3 also in features 4-5
x[21:60, 2:3] <- matrix(runif(40 * 2, min = 9, max = 11), ncol = 2)
x[41:60, 4:5] <- matrix(rnorm(20 * 2, mean = -10, sd = 1), ncol = 2)
labels <- c(rep(1, 20), rep(2, 20), rep(3, 20))  # true class labels
pca.out = prcomp(x)
# summary(pca.out)
plot(pca.out$x[,1:2], col=labels+1, pch=19)
You can use the table() function in R to compare the true class labels to the class labels obtained by clustering. Be careful how you interpret the results: \(K\)-means clustering will arbitrarily number the clusters, so you cannot simply check whether the true class labels and clustering labels are the same.
set.seed(3)
km.out <- kmeans(x, 3, nstart = 15)
# The true labels are renumbered (3, 1, 2) so that they line up with the arbitrary
# cluster numbering; perfect agreement then appears on the diagonal of the table
table(km.out$cluster, c(rep(3, 20), rep(1, 20), rep(2, 20)))
##
## 1 2 3
## 1 20 0 0
## 2 0 20 0
## 3 0 0 20
km.out = kmeans(x, 2, nstart=15)
km.out$cluster
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [39] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
With \(K=2\), two of the three true classes are merged into a single cluster.
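A cross-tabulation against the true labels (an added check, not part of the original solution) makes the merge explicit:
table(km.out$cluster, labels)  # given the assignment printed above, one cluster holds class 1 and the other absorbs classes 2 and 3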
plot(pca.out$x[,1:2], col=km.out$cluster+1, cex=2, pch=1, lwd=2)
points(pca.out$x[,1:2], col=labels+1, pch=19)
set.seed(1)
km.out = kmeans(x, 4, nstart=15)
km.out$cluster
## [1] 1 3 1 3 1 1 3 1 3 1 3 1 3 3 1 1 3 3 3 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [39] 4 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
With \(K=4\), one of the true classes is split across two clusters.
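The same cross-tabulation (again an added check) confirms the split:
table(km.out$cluster, labels)  # given the assignment printed above, class 1 is spread over two clusters while classes 2 and 3 each map to a single cluster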
plot(pca.out$x[,1:2], col=km.out$cluster*2, cex=2, pch=1, lwd=2)
points(pca.out$x[,1:2], col=labels+1, pch=19)
set.seed(2)
km.out = kmeans(pca.out$x[,1:2], 3, nstart=15)
table(km.out$cluster, c(rep(1,20), rep(2,20), rep(3,20)))
##
## 1 2 3
## 1 20 0 0
## 2 0 20 0
## 3 0 0 20
Clustering on the first two principal component score vectors recovers the three true classes exactly.
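This is not surprising: the class mean shifts account for most of the variance in the data, so they dominate the leading components. As a quick added check (not part of the original answer), the cumulative proportion of variance explained by the first two components can be inspected:
summary(pca.out)$importance[3, 1:2]  # row 3 of the importance table is the cumulative proportion of variance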
Using the scale() function, perform \(K\)-means clustering with \(K=3\) on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (b)? Explain.
km.out = kmeans(scale(x), 3, nstart=15)
km.out$cluster
## [1] 2 1 2 1 2 2 1 2 1 2 1 2 1 1 2 1 1 1 1 1 1 1 2 2 1 2 2 2 2 2 1 1 1 2 2 1 1 2
## [39] 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
The results are worse after scaling. The three classes were created mainly by mean shifts in a few variables; those shifted variables have large standard deviations, so scaling every variable to standard deviation one shrinks the between-class separation relative to the many pure-noise variables.
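One way to see this (an added illustration, not part of the original answer) is to compare the standard deviations of the shifted columns with those of the noise columns before scaling:
# Columns 2-5 carry the class mean shifts, so their standard deviations are far larger
# than those of the noise columns; dividing each column by its standard deviation
# therefore shrinks the class separation much more than it shrinks the noise
round(apply(x, 2, sd)[1:6], 2)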
pca.out2 = prcomp(scale(x))
plot(pca.out2$x[,1:2], col=km.out$cluster+1, cex=2, pch=1, lwd=2)
points(pca.out2$x[,1:2], col=labels+1, pch=19)