Understanding K-means Clustering

Naman Bansal
5 min read · Aug 11, 2021

K-means clustering, one of the simplest and most popular unsupervised machine learning algorithms, is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (referred to as clusters), where ‘K’ represents the number of clusters and each data point belongs to exactly one group.

What are Clusters?

A cluster refers to a collection of data points aggregated together because of certain similarities.

What is Unsupervised Learning (UL)?

Unsupervised learning uses machine learning algorithms to analyze and cluster unlabeled datasets. These algorithms discover hidden patterns or groupings in the data without human intervention and use them to label and group the data into distinct categories.

How does the K-means algorithm work?

src: https://www.youtube.com/watch?v=EItlUEPCIzM&ab_channel=codebasics

Step-1: Decide the number of clusters ‘K’.

Step-2: Randomly select K points as the initial centroids.

Step-3: Assign each data point to its closest centroid, forming K clusters.

Step-4: Recalculate the centroid of each new cluster, then reassign the data points to their closest new centroids.

Step-5: Repeat Step 4 until every data point settles at its closest centroid and no further reassignment occurs.

In the end, you will be left with K distinct non-overlapping clusters.
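The five steps above can be sketched directly in NumPy. This is an illustrative toy implementation on made-up 2-D data, not the scikit-learn version used later in the article:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Toy K-means following the five steps above."""
    rng = np.random.default_rng(seed)
    # Step 2: pick K distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster
        # (keep the old centroid if a cluster ends up empty)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Made-up 2-D data: three well-separated blobs of 30 points each
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(loc=c, scale=0.3, size=(30, 2))
                    for c in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(X, k=3)
```

On well-separated data like this, the loop usually converges in only a handful of iterations.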

K-means Implementation

Let’s walk through the K-means algorithm step by step using Python.

Step 1: Import libraries
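The original import cell isn’t shown; a typical set of imports for this walkthrough would be:

```python
import pandas as pd                 # data loading and manipulation
import matplotlib.pyplot as plt    # plotting
from sklearn.cluster import KMeans  # scikit-learn's K-means implementation
```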

Step 2: Import dataset

We’ll be using the Mall Customers dataset from Kaggle.

(we’ll just be using ‘Annual Income (k$)’ and ‘Spending Score (1–100)’ for this example)
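The loading cell isn’t shown; in practice it would be a `pd.read_csv` call on the downloaded Kaggle file. The inline rows below are stand-ins so the snippet runs on its own (the filename and column spellings are assumptions — check them against your copy of the CSV):

```python
import io
import pandas as pd

# In practice: df = pd.read_csv('Mall_Customers.csv')  # filename assumed
# A few stand-in rows so this snippet is self-contained:
csv = io.StringIO(
    "CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)\n"
    "1,Male,19,15,39\n"
    "2,Male,21,15,81\n"
    "3,Female,20,16,6\n"
)
df = pd.read_csv(csv)

# Keep only the two features used in this example
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]
```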

Step 3: Visualizing dataset
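A plotting cell along these lines would reproduce the scatter plot. `make_blobs` stands in for the mall data here so the snippet is self-contained; in practice, plot the two columns from the CSV:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; call plt.show() when working interactively
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Stand-in for the two mall-customer columns: five blobs in the plane
X, _ = make_blobs(n_samples=200, centers=5, random_state=42)

plt.scatter(X[:, 0], X[:, 1], s=20)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Customers')
plt.savefig('customers.png')
```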

As we can see, the data is separated into approximately 5 clusters.

But in many cases the clusters in the data might not be so obvious.

Therefore, in such cases we use the Elbow method to determine the ideal value of K by comparing the within-cluster sum of squared errors for all the candidate values of K.

What is Elbow Method?

This method uses the WCSS value. WCSS stands for Within-Cluster Sum of Squares, and it measures the total variation within the clusters. The formula for WCSS (for 3 clusters) is given below:

WCSS = ∑Pi in Cluster1 distance(Pi, C1)² + ∑Pi in Cluster2 distance(Pi, C2)² + ∑Pi in Cluster3 distance(Pi, C3)²

To find the optimal value of clusters, the elbow method follows the below steps:

  • It runs K-means clustering on the given dataset for different values of K (typically ranging from 1 to 10).
  • For each value of K, it calculates the WCSS value.
  • It plots a curve of the calculated WCSS values against the number of clusters K.
  • The point where the curve bends sharply, so that the plot looks like an arm, is taken as the best value of K.

Step 4: Using Elbow method to determine value of K
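A sketch of the elbow computation, using scikit-learn’s `inertia_` attribute (its name for WCSS). `make_blobs` again stands in for the mall data:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; call plt.show() when working interactively
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in for the two mall-customer columns
X, _ = make_blobs(n_samples=200, centers=5, random_state=42)

# Compute WCSS (inertia_) for K = 1..10
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)

# WCSS shrinks as K grows; the elbow is where the drop levels off
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('WCSS')
plt.savefig('elbow.png')
```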

We can see the sudden change in slope right at the 5-cluster mark, creating an elbow-like structure.

As we predicted, the ideal value of K for this dataset is 5.

Step 5: Using Scikit-learn

Create a KMeans model with 5 clusters, then fit our dataset to it.

As we can see, the model has assigned each data point to a cluster. Let’s add these labels to our dataset for better visualization.
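With scikit-learn this step looks roughly as follows (again on stand-in data generated with `make_blobs`):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in for the two mall-customer columns
X, _ = make_blobs(n_samples=200, centers=5, random_state=42)
df = pd.DataFrame(X, columns=['Annual Income (k$)', 'Spending Score (1-100)'])

# Create a K-means model with 5 clusters and fit it to the data
km = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = km.fit_predict(df)

# Attach each point's cluster label to the dataset
df['Cluster'] = labels
```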

Step 6: Visualize

Let’s visualize the result by plotting each cluster in a different color.
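One way to sketch that plot (on the same stand-in data):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; call plt.show() when working interactively
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in for the two mall-customer columns
X, _ = make_blobs(n_samples=200, centers=5, random_state=42)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# One color per cluster
for j in range(5):
    plt.scatter(X[labels == j, 0], X[labels == j, 1], s=20, label=f'Cluster {j}')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.savefig('clusters.png')
```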

As we can observe, the data has been divided into 5 distinct clusters.

Step 7: Finding the centroid

We can access the final centroids through the fitted model’s cluster_centers_ attribute.

Let’s add them to our visualization by marking each centroid on the graph with a different color.
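A sketch of this final step, reading `cluster_centers_` from the fitted model and overlaying the centroids on the scatter plot (stand-in data as before):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; call plt.show() when working interactively
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in for the two mall-customer columns
X, _ = make_blobs(n_samples=200, centers=5, random_state=42)
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# cluster_centers_ holds one row per centroid, one column per feature
centers = km.cluster_centers_

plt.scatter(X[:, 0], X[:, 1], c=km.labels_, s=20)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='*', s=200,
            label='Centroids')
plt.legend()
plt.savefig('centroids.png')
```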

Wrapping up

K-means clustering is an extensively used technique for cluster analysis. It is popular in many industrial use cases because of its versatility and its ability to quickly assign new data to existing clusters.

However, its performance is usually not as competitive as that of more sophisticated clustering techniques, because slight variations in the data can lead to high variance.
