Intuition Behind Clustering

  • Let's take the example of a bank that has 1 million customers and sends me a message saying, "You can increase your credit limit from 1 lakh to 3 lakh." Now a question arises:
    • Does the bank send this message to all of its 1 million customers? No.
      • Each customer has key performance metrics like credit score, age, etc. Based on these, the bank's AI can send a tailored message to each customer.
      • This is possible only by grouping a specific set of customers together. This is the whole concept of clustering.

K-Means Clustering

  • The K-means clustering algorithm computes centroids and repeats the process until the centroids stop changing, i.e. until it converges to the optimal centroids.
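
As a quick illustration, here is a minimal usage sketch with scikit-learn's KMeans; the make_blobs toy dataset is just an assumed stand-in for real data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: 300 points drawn around 3 centers
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.cluster_centers_)  # the final centroids after convergence
print(km.labels_[:10])      # the cluster assigned to the first 10 points
```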

Stepwise Working of K-Means Clustering

(Figure: flow of the K-means clustering steps)

  • Step 1: Choose the value of K.
    • Choosing the value of K is the first step of the K-means clustering algorithm and the most difficult one. We can use WCSS (the elbow method), but it does not always help.
    • Sometimes we have to derive the number of clusters from the needs of the business model. For example, on the Daraz website customers can be grouped by season, offer sales, and many other factors, so we have to form clusters accordingly.

(Figure: data points grouped into three clusters)

  • Step 2: Choose the initial centroids.
    • Once the value of K is decided, we have to choose the initial centroids. This is done by randomly selecting K points from the data as the initial central points.
  • Step 3: Assign data points to the centroid of a specific cluster.

    • We have 3 clusters in the figure above, so we have to choose 3 centroids. Data points are assigned to a cluster by measuring the distance between the data point and each centroid and picking the nearest one.
    • We can use Manhattan or Euclidean distance to calculate the distance between a data point and a centroid.
  • Step 4: Update the centroids.

    • Once the data points are assigned, we have to update each centroid. This is done by calculating the mean of the data points assigned to it.
    • The centroids are updated continuously until they no longer change, and this is the point at which the algorithm has converged (see the sketch after this list).
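
A minimal from-scratch sketch of Steps 2 to 4, assuming NumPy and randomly generated toy points; the kmeans helper below is illustrative, not a production implementation.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Converged: the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy usage on random 2-D points
X = np.random.default_rng(0).normal(size=(150, 2))
centroids, labels = kmeans(X, k=3)
print(centroids)
```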

Elbow Method to Find the Optimal K

  • To find the optimal value of K we use the Within-Cluster Sum of Squares (WCSS) method, better known as the elbow method.
  • In this method, our clusters should satisfy two criteria:
    • The distance between the clusters should be maximum.
    • The distance of the points to the centroid within a cluster should be minimum.
  • We vary the number of clusters (K) from 1 to 9 and, for each value of K, calculate the WCSS (Within-Cluster Sum of Squares).
  • WCSS is the sum of squared distances between each point and the centroid of its cluster. It is calculated using the following formula:

    WCSS = Σ_{k=1}^{K} Σ_{x ∈ C_k} ||x − μ_k||²,  where μ_k is the centroid of cluster C_k

  • When we plot the WCSS against the K value, the plot looks like an elbow.

  • As the number of clusters increases, the WCSS value starts to decrease. The WCSS value is largest when K = 1. When we analyze the graph, we can see that it changes rapidly at a certain point, creating an elbow shape.
  • From this point onwards, the graph moves almost parallel to the X-axis.
  • The K value corresponding to this point is the optimal K value, i.e. the optimal number of clusters. In our case 3 is the optimal K (a code sketch follows the figure below).

(Figure: elbow plot of WCSS against K)
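
A sketch of the elbow method described above, assuming scikit-learn, matplotlib, and synthetic make_blobs data; note that scikit-learn exposes the WCSS of a fitted model as inertia_.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

wcss = []
for k in range(1, 10):                                # vary K from 1 to 9
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)                          # inertia_ is the WCSS for this K

plt.plot(range(1, 10), wcss, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```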

Interview Questions Related to K-Means Clustering

  • 1 What is the difference between K-means and K-means++?

    • In the k-means algorithm, centroid initialization is done purely at random. In the k-means++ algorithm, the initial centroids are chosen so that they are spread far apart from each other, which makes the algorithm converge faster and more reliably.
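
One rough way to see the difference, assuming scikit-learn and synthetic make_blobs data: fit KMeans once with init="random" and once with init="k-means++" (n_init=1 so only a single initialization is used) and compare the number of iterations.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for init in ("random", "k-means++"):
    km = KMeans(n_clusters=4, init=init, n_init=1, random_state=0).fit(X)
    print(f"{init:10s} iterations: {km.n_iter_}, WCSS: {km.inertia_:.1f}")
```
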
  • 2 How to choose the value of K?

    • We can use the elbow method to find the optimal K value.
    • The elbow method does not always help, so sometimes we have to form clusters based on our business needs, by consulting a domain expert.
  • 3 What are the assumptions of the K-means algorithm?

    • It assumes that the data within each cluster is roughly normally distributed around its centroid, i.e. clusters are roughly spherical and of similar size.
    • It assumes all the features are numerical.
  • 4 Will the k-means algorithm work for categorical data?

    • It will run on encoded categorical data, but the model will not perform well on such data, since it assumes all features are numerical and distance-based (see the sketch below).
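
A small illustrative sketch, assuming pandas and scikit-learn; the city/plan table below is made up, and the categories are one-hot encoded so k-means can at least run on them.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up categorical table
df = pd.DataFrame({"city": ["KTM", "PKR", "KTM", "BRT", "PKR", "BRT"],
                   "plan": ["gold", "silver", "gold", "silver", "gold", "silver"]})

X = pd.get_dummies(df)                         # one-hot encode: categories become 0/1 columns
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                              # clusters found on the encoded data
```
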
  • 5 What is the accuracy of your k-means clustering?

    • We cannot compute accuracy for an unsupervised learning algorithm. We can validate our clusters by using Dunn's Index or the Silhouette Score (see the sketch below).
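
A sketch of validating clusters with the Silhouette Score, assuming scikit-learn and synthetic make_blobs data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Closer to +1 means well-separated clusters; near 0 means overlapping clusters
print("Silhouette score:", silhouette_score(X, labels))
```
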
  • 6 What are the pros and cons of K-means clustering?

    • It is not good for data with many categorical features, but it works well with numerical data.
    • It is easy to use, deploy, and understand.
    • It is resource intensive when we have a large amount of data.
  • 7 Why is the elbow curve shaped like an elbow and not something else?

    • The elbow curve is the curve between the WCSS value and the K value.
    • For K = 1 the WCSS value is much higher than for two, three, or more clusters. After some point it becomes low and then starts to move almost parallel to the X-axis, which produces the elbow shape.
  • 8 Why is scaling important in the k-means algorithm?

    • Scaling is performed to bring all the features onto the same scale so that features with large ranges do not dominate the distance calculation. Generally the values are brought down to between 0 and 1 (see the sketch below).
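
A small sketch, assuming scikit-learn and made-up age/income values, showing min-max scaling to the 0 to 1 range before clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Made-up features on very different scales: age (years) and income (rupees)
X = np.array([[25, 20_000], [30, 1_000_000], [45, 40_000], [50, 1_200_000]], dtype=float)

X_scaled = MinMaxScaler().fit_transform(X)     # every feature is brought into the [0, 1] range
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
print(labels)                                  # clusters now reflect both features, not just income
```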