Skip to main content

K-means clustering

This is a simple and widely used representation learning algorithm. It is an un-supervised learning method; there is no labels or target variables.It groups the dataset based on the pattern in the data.It is used in various recommendation engines (Spotify helping you find new songs based on the type of songs you listen, YouTube recommends you the next video to watch, Amazon suggest the product you might be interested to buy), customer segmentation, image segmentation etc.

First we randomly choose desired number (k) of centers. We connect all other points to their respective nearest centers. Then we calculate the centroid of the clusters. We find the nearest points with respect to those centroids, and define new centroids onces there is no change in centroid positions we stop searching and a solution is reached.

We have input data: x1,x2,,xnRDx_1, x_2, \dots, x_n \in \R^D. We have to find kk cluster centers at positions μ1,μ2,,μkRD\mu_1, \mu_2, \dots, \mu_k \in \R^D for which:

cost(μ1,μ2,,μk)=i=1nminjxiμi2\text{cost}(\mu_1, \mu_2, \dots, \mu_k) = \sum\limits_{i=1}^n \min\limits_j||x_i -\mu_i||^2

Unfortunately, this problem is NP hard, and depending on the initial choice of centers, it would lead to different end results (it will converge to a local minima). There are ways to mitigate this problem. One such method is k-means++, where once a cluster center is chosen, it is preferred to have next center that is farthest from previously chosen centers.

Notebooks

Resources