When you are clustering, what you are actually trying to do is find groups of objects that are similar to one another and different from the objects in other groups. In other words, you want to minimize the intra-cluster distance and maximize the inter-cluster distance.
Clustering algorithms are rather subjective, because whoever designs the algorithm chooses the features and the measures that define (dis)similarity. For many algorithms, such as K-means, it is also up to the designer to set the number of clusters to end up with. In this blog post I would like to elaborate on a way of determining the optimal number of clusters: the gap statistic. I use Tibshirani's original paper to explain the concept.
Here are some concepts you need to understand the process.
This is how we calculate $D_r$, the sum of pairwise distances $d_{ii'}$ between all the objects $i$ and $i'$ in cluster $C_r$:

$$D_r = \sum_{i, i' \in C_r} d_{ii'}$$

For simplicity, we calculate $d_{ii'}$ as the squared Euclidean distance between $i$ and $i'$.
The pooled within-cluster sum of squares around the cluster means, $W_k$, has the following components: for every cluster, you divide the sum of pairwise distances $D_r$ by two times the number of objects $n_r$ in that cluster (the power of the Minkowski distance is set to 2 by Tibshirani):

$$W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r$$
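As a sketch, here is how $D_r$ and $W_k$ could be computed by hand in R for a given cluster assignment (the toy data, the assignment, and the `D_r`/`W_k` function names are my own, not from the paper):

```r
# Toy data: 6 points in 2 dimensions, assigned to 2 clusters
x  <- matrix(c(1, 1,   1.5, 2,   1, 2,
               8, 8,   8.5, 9,   8, 9), ncol = 2, byrow = TRUE)
cl <- c(1, 1, 1, 2, 2, 2)

# D_r: sum of pairwise squared Euclidean distances between all
# objects in cluster r (each pair is counted twice, as i,i' and i',i)
D_r <- function(x, cl, r) {
  xr <- x[cl == r, , drop = FALSE]
  sum(as.matrix(dist(xr))^2)
}

# W_k: pooled within-cluster sum of squares, summing D_r / (2 * n_r)
W_k <- function(x, cl) {
  sum(sapply(unique(cl), function(r) D_r(x, cl, r) / (2 * sum(cl == r))))
}

W_k(x, cl)
```

For squared Euclidean distance, dividing $D_r$ by $2 n_r$ makes this identical to the sum of squared distances of each point to its cluster mean.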
Finally, we come to the gap statistic: the expected $\log W_k$ under a reference distribution minus the observed $\log W_k$, for $k$ clusters:

$$\mathrm{Gap}_n(k) = E_n^*\{\log W_k\} - \log W_k$$
So, what is this expected $\log W_k$? The paper sets forth two methods. In essence, both start from generating a uniform reference distribution out of the observed values of the data set you're working with: either each feature is sampled uniformly over its observed range, or the sampling is done uniformly from a box aligned with the principal components of the data.
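The first of the two reference methods, sampling each feature uniformly over its observed range, can be sketched as follows (a minimal illustration; `reference_sample` is my own name, not from the paper or the cluster package):

```r
# Draw one reference data set: each column is sampled uniformly
# between the minimum and maximum of the corresponding observed column
reference_sample <- function(x) {
  apply(x, 2, function(col) runif(nrow(x), min(col), max(col)))
}

set.seed(1)
x   <- matrix(rnorm(100), ncol = 2)
ref <- reference_sample(x)
```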
The paper outlines three steps to get to the optimal k.
First, (1) cluster your data a couple of times, varying k.
Next, (2) for each k, generate B data sets from the reference distribution, for example by bootstrapping it. Calculate the gap statistic for each k by subtracting the observed $\log W_k$ from the mean of the $\log W_k^*$ values you got from the B reference data sets.
There are multiple ways to (3) select the optimal number of clusters, but this is the method that Tibshirani proposes: compute the standard deviation $s_{k+1}$ of the $\log W_{k+1}^*$ values you got from bootstrapping the reference distribution and choose the smallest k such that $\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}$.
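The three steps above can be sketched end to end. This is a bare-bones version under my own simplifications (uniform-over-range reference, k-means, and the hypothetical names `log_Wk` and `gap_stat`); the clusGap function used further below does all of this, and more, for you:

```r
# log(W_k) for a k-means clustering of x into k clusters
log_Wk <- function(x, k) {
  cl <- kmeans(x, centers = k, nstart = 10, iter.max = 50)$cluster
  log(sum(sapply(unique(cl), function(r) {
    xr <- x[cl == r, , drop = FALSE]
    sum(as.matrix(dist(xr))^2) / (2 * nrow(xr))  # D_r / (2 * n_r)
  })))
}

# Gap(k) = mean over B reference sets of log(W*_k), minus observed log(W_k)
gap_stat <- function(x, K.max = 5, B = 50) {
  sapply(1:K.max, function(k) {
    ref_logW <- replicate(B, {
      ref <- apply(x, 2, function(col) runif(nrow(x), min(col), max(col)))
      log_Wk(ref, k)
    })
    mean(ref_logW) - log_Wk(x, k)
  })
}

# Two well-separated clusters: the gap should peak at k = 2
set.seed(42)
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 5, sd = 0.3), ncol = 2))
gap <- gap_stat(x, K.max = 4, B = 20)
```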
In the following example, you can clearly see there are two clusters within the data. The gap statistic is at its maximum at k = 2, clearly picking up what we can see in the data with our bare eyes.
In the Tibshirani paper, five scenarios are tested. The gap statistic, using both reference distribution methods, is matched up against several other methods. As you can see from the following table, the gap statistic picks the optimal number of clusters in most scenarios.
Gap analysis in R
First, I load the cluster package and create some dummy data. As you can see, the dummy data consists of three clusters of normally distributed data over two dimensions. I didn't make it so that there's a ridiculous amount of space between the clusters: some objects that are 3 standard deviations away from their mean could end up in another cluster.
library(cluster)
set.seed(1)
a <- c(rnorm(25, 6, 1), rnorm(25, 6, 1), rnorm(25, 3, 1))
b <- c(rnorm(25, 4, 1), rnorm(25, 8, 1), rnorm(25, 4, 1))
m <- matrix(c(a, b), ncol = 2, nrow = 75)
plot(m, pch = 16)
I cluster the data from 1 to 5 clusters using K-means, with 500 bootstrapped samples from the reference distribution and with d.power=2 (infra).
c <- clusGap(m, FUNcluster = kmeans, B = 500, K.max = 5, d.power = 2)
plot(c)
print(c, method = 'Tibs2001SEmax')
The output tells me that, according to Tibshirani’s 2001 method, I should be using k=3 to cluster my data.
The documentation of the clusGap function provides us a warning.
Note that this chooses k = 1 when all standard deviations are larger than the differences f(k+1) – f(k).
I have been playing around with more dummy data examples and different seeds, and Tibshirani's method seems to fall back to k = 1 quite often. In those cases, I usually resort to the 'globalmax' or 'firstmax' method, which select the global maximum of the gap values or their first local maximum, respectively.
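The selection rules can be compared directly with maxSE, the helper that clusGap's print method uses internally. With some made-up gap values and standard errors (my own illustration, not output from a real run), you can see how large SEs at small k drive Tibshirani's rule to k = 1 while 'globalmax' ignores them:

```r
library(cluster)

# Hypothetical gap values and bootstrap standard errors for k = 1..5
gap <- c(0.20, 0.45, 0.60, 0.58, 0.55)
se  <- c(0.30, 0.28, 0.05, 0.05, 0.05)

# Tibshirani's rule: smallest k with gap[k] >= gap[k+1] - se[k+1];
# here the large se[2] already lets k = 1 pass the test
maxSE(gap, se, method = "Tibs2001SEmax")
# 'globalmax' ignores the SEs and simply takes which.max(gap), i.e. k = 3
maxSE(gap, se, method = "globalmax")
```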