K-means clustering is quick and dirty and generally provides some interesting results. However, R’s built-in kmeans() function lacks some features, such as a predict() method that assigns unseen data to the fitted centroids. That’s where flexclust comes in.
Flexclust is a package designed around K-centroids cluster analysis. Its most important function, kcca(), takes its name from that acronym.
First, let’s load the packages.
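Judging by the functions used below, that means flexclust for kcca() and, as an assumption on my part, dummies for dummy.data.frame(). A minimal sketch:

```r
# Assumed package list, inferred from the functions used in this post
library(flexclust)  # kcca(), kccaFamily() and the predict() method
library(dummies)    # dummy.data.frame() (assumption: the text does not name the package)
```

If dummies isn’t available for your R version, any other dummy-encoding helper will do for the next step.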
Let’s say you have a data frame (dt) that contains numeric data and factors. K-means works on distances between numbers, so you’re gonna want to convert all factors to binary (0/1) columns.
dt <- dummy.data.frame(dt, dummy.classes='factor')
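If you’d rather not depend on an extra package for this step, base R’s model.matrix() can do a similar job. This is a sketch of an alternative, not what dummy.data.frame() does internally; the toy data frame and its column names are made up for illustration.

```r
# Toy data frame (hypothetical columns): one numeric, one factor
dt_toy <- data.frame(
  age    = c(23, 35, 41, 52),
  colour = factor(c("red", "blue", "red", "green"))
)

# "- 1" drops the intercept, so the single factor gets one 0/1 column per level
mx_toy <- model.matrix(~ . - 1, data = dt_toy)

colnames(mx_toy)
```

Keep in mind that with more than one factor, model.matrix() only gives the first factor a column for every level; the others drop their first level, so check the column names before clustering.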
Next, we convert the data frame to a matrix. There are multiple ways to do this. To make sure that all variables are treated as equally important, I also center and scale the data (and so should you).
mx <- data.matrix(dt)
mx_scaled <- scale(mx)
Then I train the model with five clusters and store it in a kModel variable.
kModel <- kcca(mx_scaled, 5, family = kccaFamily('kmeans'))
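To see what the fitted object holds, here is a small sketch on made-up data (toy and toyModel are hypothetical names, standing in for mx_scaled and kModel). It assumes the centroids sit in the centers slot of the kcca object and that clusters() returns the training assignments.

```r
library(flexclust)

set.seed(42)  # kcca() starts from random centroids
# Synthetic stand-in for mx_scaled: two well-separated groups in 2 dimensions
toy <- scale(rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 6), ncol = 2)
))

toyModel <- kcca(toy, 2, family = kccaFamily("kmeans"))

toyModel@centers           # centroid matrix: one row per cluster
table(clusters(toyModel))  # number of training points per cluster
```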
Now suppose new data comes in as a second data frame (dt2) with the same columns as the converted dt. We need to scale it with the same parameters as the old data. The scale() function returns a matrix that carries two attributes, scaled:center and scaled:scale, and you can pass these back to scale() when you scale your new data.
mx2 <- data.matrix(dt2)
mx2_scaled <- scale(mx2, attr(mx_scaled, "scaled:center"), attr(mx_scaled, "scaled:scale"))
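If you want to convince yourself that this really reuses the old parameters, here is a small check on made-up numbers (train, new and the other names are just for illustration). It compares the scale() call with the same calculation done by hand.

```r
# Toy "old" and "new" matrices (made-up numbers)
train <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2)
new   <- matrix(c(5, 6, 50, 60), ncol = 2)

train_scaled <- scale(train)
ctr <- attr(train_scaled, "scaled:center")  # column means of the old data
sdv <- attr(train_scaled, "scaled:scale")   # column standard deviations of the old data

new_scaled <- scale(new, center = ctr, scale = sdv)

# Same result by hand: subtract the old mean, then divide by the old sd
by_hand <- sweep(sweep(new, 2, ctr, "-"), 2, sdv, "/")
all.equal(by_hand, new_scaled, check.attributes = FALSE)
```

Calling scale() on the new matrix without these arguments would center it on its own means instead, and the points would no longer be comparable with the stored centroids.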
Finally, you can use the predict() function to use the centroids from your first data set to cluster your new data.
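As a sketch, and assuming the objects are named as in the steps above, the call would look like predict(kModel, newdata = mx2_scaled). The self-contained example below runs the whole flow on made-up data in place of dt and dt2 (all toy_ names are hypothetical).

```r
library(flexclust)

set.seed(1)  # kcca() starts from random centroids
# Made-up stand-ins for the old and the new data: two groups each
toy_old <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
                 matrix(rnorm(60, mean = 8), ncol = 2))
toy_new <- rbind(matrix(rnorm(10, mean = 0), ncol = 2),
                 matrix(rnorm(10, mean = 8), ncol = 2))

# Scale the old data, then reuse its center and scale for the new data
toy_old_scaled <- scale(toy_old)
toy_new_scaled <- scale(toy_new,
                        center = attr(toy_old_scaled, "scaled:center"),
                        scale  = attr(toy_old_scaled, "scaled:scale"))

toyModel <- kcca(toy_old_scaled, 2, family = kccaFamily("kmeans"))

# One cluster number per row of the new data, based on the stored centroids
toy_pred <- predict(toyModel, newdata = toy_new_scaled)
table(toy_pred)
```

The cluster numbers themselves are arbitrary labels, so compare them with the centroids rather than expecting cluster 1 to mean the same thing across separate runs.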
By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction to R, and less to the data science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!