2.2 KMeans

1. Introduction

K-means clustering is one of the most popular unsupervised machine learning algorithms. It partitions observations into $K$ clusters based on similarity, so that each observation belongs to the cluster with the nearest mean (centroid), which serves as the cluster's prototype.

2. Components

  1. Algorithm Overview

    • Step 1: Initialization

      • Choose initial centroids (cluster centers) randomly from the data points.
    • Step 2: Assignment

      • Assign each data point to the nearest centroid based on a distance metric (usually Euclidean distance): $S_i = \{ x : \|x - \mu_i\|^2 \le \|x - \mu_j\|^2 \ \forall j \}$, where $S_i$ is the set of data points assigned to cluster $i$, and $\mu_i$ is the mean (centroid) of cluster $i$.
    • Step 3: Update Centroids

      • Recalculate the centroids as the mean of all data points assigned to each cluster: $\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x$
    • Step 4: Iteration

      • Repeat steps 2 and 3 until convergence criteria are met (e.g., centroids do not change significantly or a maximum number of iterations is reached).
  2. Mathematical Formulation

    • Objective Function: K-means minimizes the sum of squared distances between data points and their respective centroids: $J = \sum_{i=1}^{K} \sum_{x \in S_i} \|x - \mu_i\|^2$, where $\mu_i$ is the centroid of cluster $i$, and $S_i$ denotes the set of data points assigned to cluster $i$.

    • Distance Metric: Typically, the Euclidean distance is used to measure the similarity between data points and centroids: $d(x, \mu_i) = \sqrt{\sum_{m=1}^{n} (x_m - \mu_{i,m})^2}$, where $n$ is the number of dimensions (features) in the data.
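
To make the four steps concrete, here is a minimal NumPy sketch of the algorithm (Lloyd's iteration). The function name `kmeans`, the tolerance, and the toy data are illustrative choices, not part of any particular library.

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
    """Plain K-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids with k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Step 4: stop once the centroids no longer move significantly
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels

# Toy usage: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centroids, labels = kmeans(X, k=2)
```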

3. Key Concepts

  • Centroid: A representative point of a cluster, calculated as the mean of all points in that cluster.

  • Cluster Assignment: Determining which cluster each data point belongs to based on the closest centroid.

  • Initialization: The choice of initial centroids can impact convergence and the quality of clustering results. Common strategies include random selection, k-means++, or reusing centroids from a previous run (a short comparison sketch follows this list).

  • Convergence: K-means iteratively minimizes the objective function until a stable configuration (minimal change in centroids) is achieved.
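
As a quick illustration of the Initialization point, the sketch below compares a single run of random initialization with k-means++ via scikit-learn's `init` parameter; the synthetic data and the single-run setting (`n_init=1`) are illustrative choices, and results will vary with the random seed.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)

for init in ("random", "k-means++"):
    km = KMeans(n_clusters=5, init=init, n_init=1, random_state=0).fit(X)
    print(init, km.inertia_)  # k-means++ usually reaches a lower SSE from a single run
```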

4. Advantages and Limitations

  • Advantages:

    • Simple and easy to implement.
    • Efficient for large datasets.
    • Generally works well with spherical clusters.
  • Limitations:

    • Requires a predefined number of clusters $K$, which may not always be known a priori.
    • Sensitive to initialization and can converge to local optima.
    • Not suitable for non-linear or irregularly shaped clusters.

5. Applications

  • Market Segmentation: Grouping customers based on purchasing behavior.
  • Image Segmentation: Partitioning an image into distinct regions for analysis.
  • Document Clustering: Organizing documents based on content similarity.
  • Anomaly Detection: Identifying outliers based on deviation from cluster centroids.

6. Variants and Extensions

  • K-medoids: Uses medoids (actual data points) instead of centroids, robust to outliers.
  • Fuzzy C-means: Assigns membership probabilities to data points instead of hard assignments.
  • Kernel K-means: Extends K-means to non-linear feature spaces using kernel methods.

7. Important Traits

7.1 Choosing the Best K

Choosing the optimal number of clusters $K$ in K-means clustering is crucial for producing meaningful results. Since K-means requires $K$ as an input, selecting the best value is not always straightforward. Here are several techniques that can help in determining the optimal $K$:

1. Elbow Method

The Elbow Method is one of the most commonly used approaches for determining the best $K$. It works by plotting the sum of squared errors (SSE), also called inertia, as a function of $K$.

Steps:

  1. Run K-means for different values of $K$ (e.g., $K = 1$ to $10$).
  2. For each $K$, compute the sum of squared errors (SSE), which is the sum of the squared distances between each point and the centroid of the cluster it belongs to.
  3. Plot the SSE against $K$.

How to interpret:

  • The SSE decreases as $K$ increases because the more clusters you have, the lower the distortion (each point is closer to a centroid).
  • Look for the “elbow” point in the plot where the SSE starts decreasing more slowly (i.e., a noticeable bend in the curve). This is typically where the optimal $K$ lies.
  • The elbow represents the point where adding more clusters doesn’t significantly improve the model’s performance.

Example:

$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \|x - \mu_i\|^2$, where $x$ is a data point and $\mu_i$ is the centroid of cluster $C_i$.
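
As a sketch of the Elbow Method, assuming scikit-learn and matplotlib are installed (the `make_blobs` data is purely illustrative), the SSE of a fitted model is available as its `inertia_` attribute:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("K")
plt.ylabel("SSE (inertia)")
plt.title("Elbow Method")
plt.show()
```

With well-separated blobs like these, the bend typically appears at the number of generated centers (here 4).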

2. Silhouette Score

The Silhouette Score measures how similar a data point is to its own cluster compared to other clusters. It provides a score ranging from $-1$ to $+1$, where:

  • A score near $+1$ means that points are well clustered and distinct from other clusters.
  • A score near $0$ means that the points lie on the boundary of two clusters.
  • Negative values indicate that points are assigned to the wrong clusters.

Steps:

  1. Compute the silhouette score for different values of $K$.
  2. Plot the silhouette score against $K$.

How to interpret:

  • Choose the $K$ that maximizes the silhouette score.
  • High silhouette scores indicate that the clusters are dense and well separated.

Example:

The silhouette score for a data point $i$ is given by $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$, where:

  • $a(i)$ is the average intra-cluster distance (the average distance from $i$ to the other points in its own cluster).
  • $b(i)$ is the average nearest-cluster distance (the average distance from $i$ to the points in the nearest cluster it is not a part of).
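
A sketch of the silhouette approach using scikit-learn's `silhouette_score`, again on illustrative synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# The silhouette score needs at least 2 clusters, so start at K = 2
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels))  # pick the K with the highest score
```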

3. Gap Statistic

The Gap Statistic compares the performance of K-means clustering on the data to the performance of K-means on a dataset that has no cluster structure (random uniform distribution). It quantifies how far the within-cluster dispersion deviates from that of a reference distribution.

Steps:

  1. For each $K$, compute the within-cluster dispersion $W_K$ on the original data.
  2. Generate $B$ random reference datasets (drawn uniformly over the range of the data) and compute the within-cluster dispersion $W_{Kb}^{*}$ for each reference dataset.
  3. The gap statistic is the difference between the averaged logarithm of the reference dispersions and the logarithm of the original dispersion: $\mathrm{Gap}(K) = \frac{1}{B} \sum_{b=1}^{B} \log(W_{Kb}^{*}) - \log(W_K)$, where $W_{Kb}^{*}$ is the dispersion of the $b$-th reference dataset and $W_K$ is the dispersion of the original data.

How to interpret:

  • Choose the $K$ that maximizes the gap statistic.
  • A large gap indicates that the clustering found in the original data is significantly better than what would be obtained on random data.
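
The gap statistic is not built into scikit-learn, so the sketch below rolls its own. It uses KMeans inertia as the dispersion $W_K$ and uniform samples over the data's bounding box as the reference distribution; both are common simplifications of the original procedure, and the function name `gap_statistic` is just an illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_refs=10, seed=0):
    """Gap(K) = mean_b log(W*_Kb) - log(W_K), with KMeans inertia as the dispersion."""
    rng = np.random.default_rng(seed)
    # Dispersion of the real data
    w_k = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
    # Dispersion of uniform reference datasets drawn over the data's bounding box
    mins, maxs = X.min(axis=0), X.max(axis=0)
    log_w_refs = []
    for _ in range(n_refs):
        X_ref = rng.uniform(mins, maxs, size=X.shape)
        w_ref = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_ref).inertia_
        log_w_refs.append(np.log(w_ref))
    return np.mean(log_w_refs) - np.log(w_k)

# Evaluate several K and keep the one with the largest gap
# (a fuller treatment also applies the one-standard-error rule).
```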

4. Davies-Bouldin Index

The Davies-Bouldin Index (DBI) measures the average similarity ratio between each cluster and its most similar cluster. A lower DBI indicates better clustering.

Steps:

  1. For each $K$, compute the Davies-Bouldin Index.
  2. The index is given by $DB = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)}$, where:
    • $\sigma_i$ is the average distance between each point in cluster $i$ and its centroid $c_i$.
    • $d(c_i, c_j)$ is the distance between the centroids of clusters $i$ and $j$.

How to interpret:

  • A lower Davies-Bouldin Index indicates better clustering.
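
scikit-learn provides `davies_bouldin_score` directly, so a sketch of this criterion is short (synthetic data for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, davies_bouldin_score(X, labels))  # lower is better
```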

5. Within-Cluster Sum of Squares (WCSS)

This is similar to the Elbow Method but more focused on the internal variance of clusters. WCSS measures the compactness of the clusters by computing the sum of squared distances between each point and the centroid of its assigned cluster.

Steps:

  1. For each $K$, compute the WCSS: $\mathrm{WCSS} = \sum_{i=1}^{K} \sum_{x \in C_i} \|x - \mu_i\|^2$, where $C_i$ is cluster $i$ and $\mu_i$ is the centroid of cluster $i$.

  2. Plot WCSS as a function of $K$.

How to interpret:

  • Look for the point where increasing $K$ no longer significantly reduces WCSS, similar to the Elbow Method.
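
To make the formula concrete, the sketch below computes WCSS by hand from the fitted labels and centroids and compares it with the model's `inertia_` attribute (illustrative synthetic data; the two values agree up to floating-point error):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# WCSS: sum over clusters of squared distances from each point to its centroid
wcss = sum(
    np.sum((X[km.labels_ == i] - c) ** 2)
    for i, c in enumerate(km.cluster_centers_)
)
print(wcss, km.inertia_)
```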

6. Cross-Validation (for semi-supervised problems)

In some cases labels are available but K-means is still desired for exploratory analysis. In that setting, validation metrics such as the silhouette score or the adjusted Rand index can be used (e.g., under cross-validation) to validate the clusters.

How to interpret:

  • Use label-based metrics (e.g., the adjusted Rand index) to choose the $K$ that best fits the available labels.
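
A sketch of label-based validation with scikit-learn's `adjusted_rand_score`, using the generating labels from `make_blobs` as a stand-in for whatever labels are actually available:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, y_true = make_blobs(n_samples=500, centers=4, random_state=42)

for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, adjusted_rand_score(y_true, labels))  # pick the K that best matches the labels
```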

Summary of Techniques:

| Technique | Purpose | Interpretation |
| --- | --- | --- |
| Elbow Method | Find the point where SSE starts to level off (the “elbow”). | Choose $K$ at the elbow of the curve. |
| Silhouette Score | Measure cluster separation and cohesion. | Choose the $K$ that maximizes the silhouette score. |
| Gap Statistic | Compare clustering to random reference data. | Choose the $K$ that maximizes the gap statistic. |
| Davies-Bouldin Index | Measure intra-cluster compactness and inter-cluster separation. | Choose the $K$ with the lowest DBI. |
| Within-Cluster Sum of Squares (WCSS) | Measure compactness of clusters. | Choose the $K$ where WCSS shows diminishing returns. |