2.2 KMeans
1. Introduction
K-means clustering is one of the most popular unsupervised machine learning algorithms used for clustering data into groups or clusters based on similarities in the data points. It aims to partition the data into $K$ clusters such that each data point belongs to the cluster with the nearest centroid (mean).
2. Components
- Algorithm Overview
  - Step 1: Initialization
    - Choose $K$ initial centroids (cluster centers) randomly from the data points.
  - Step 2: Assignment
    - Assign each data point to the nearest centroid based on a distance metric (usually Euclidean distance):
      $$C_i = \{\, x_j : \| x_j - \mu_i \|^2 \le \| x_j - \mu_l \|^2 \ \text{for all } l \,\}$$
      where $C_i$ is the set of data points assigned to cluster $i$, and $\mu_i$ is the mean (centroid) of cluster $i$.
  - Step 3: Update Centroids
    - Recalculate the centroids as the mean of all data points assigned to each cluster:
      $$\mu_i = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j$$
  - Step 4: Iteration
    - Repeat steps 2 and 3 until convergence criteria are met (e.g., centroids do not change significantly or a maximum number of iterations is reached). (A minimal implementation sketch of these steps is given after the mathematical formulation below.)
- Mathematical Formulation
  - Objective Function: K-means minimizes the sum of squared distances between data points and their respective centroids:
    $$J = \sum_{i=1}^{K} \sum_{x_j \in C_i} \| x_j - \mu_i \|^2$$
    where $\mu_i$ is the centroid of cluster $i$, and $C_i$ denotes the set of data points assigned to cluster $i$.
  - Distance Metric: Typically, the Euclidean distance is used to measure the similarity between data points and centroids:
    $$d(x, \mu) = \sqrt{\sum_{m=1}^{n} (x_m - \mu_m)^2}$$
    where $n$ is the number of dimensions (features) in the data.
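The following is a minimal NumPy sketch of the four steps and the objective above, for illustration only; the function name `kmeans` and its parameters are my own choices, and in practice `sklearn.cluster.KMeans` would normally be used instead.

```python
import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    """Illustrative K-means: random init, assign, update, repeat until stable."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Step 4: stop once the centroids no longer move appreciably.
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    # Objective J: sum of squared distances from points to their assigned centroid.
    sse = float(np.sum((X - centroids[labels]) ** 2))
    return labels, centroids, sse
```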
3. Key Concepts
- Centroid: A representative point of a cluster, calculated as the mean of all points in that cluster.
- Cluster Assignment: Determining which cluster each data point belongs to based on the closest centroid.
- Initialization: The choice of initial centroids can impact convergence and the quality of clustering results. Common strategies include random selection, k-means++, or using results from a previous run.
- Convergence: K-means iteratively minimizes the objective function until a stable configuration (minimal change in centroids) is achieved.
4. Advantages and Limitations
- Advantages:
  - Simple and easy to implement.
  - Efficient for large datasets.
  - Generally works well with spherical clusters.
- Limitations:
  - Requires a predefined $K$, which may not always be known a priori.
  - Sensitive to initialization and can converge to local optima.
  - Not suitable for non-linear or irregularly shaped clusters.
5. Applications
- Market Segmentation: Grouping customers based on purchasing behavior.
- Image Segmentation: Partitioning an image into distinct regions for analysis.
- Document Clustering: Organizing documents based on content similarity.
- Anomaly Detection: Identifying outliers based on deviation from cluster centroids.
6. Variants and Extensions
- K-medoids: Uses medoids (actual data points) instead of centroids, making it more robust to outliers.
- Fuzzy C-means: Assigns membership probabilities to data points instead of hard assignments.
- Kernel K-means: Extends K-means to non-linear feature spaces using kernel methods.
7. Important Traits
7.1 Choosing the Best K
Choosing the optimal number of clusters $K$ is a key practical decision in K-means. The following techniques are commonly used to guide that choice.
1. Elbow Method
The Elbow Method is one of the most commonly used approaches for determining the best $K$.
Steps:
- Run K-means for different values of $K$ (e.g., $K = 1$ to $10$).
- For each $K$, compute the sum of squared errors (SSE), which is the sum of the squared distances between each point and the centroid of the cluster it belongs to.
- Plot the SSE against $K$.
How to interpret:
- The SSE decreases as $K$ increases because the more clusters you have, the lower the distortion (each point is closer to a centroid).
- Look for the "elbow" point in the plot where the SSE starts decreasing more slowly (i.e., a noticeable bend in the curve). This is typically where the optimal $K$ lies.
- The elbow represents the point where adding more clusters doesn't significantly improve the model's performance.
Example:
For each $K$, the SSE is computed as
$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \| x - \mu_i \|^2,$$
where $C_i$ is cluster $i$ and $\mu_i$ is its centroid.
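A short sketch of the elbow procedure, assuming scikit-learn and matplotlib are available and `X` is a NumPy feature matrix (the range 1–10 is only an illustrative choice):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

k_values = range(1, 11)          # candidate values of K (illustrative range)
sse = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)      # inertia_ is the SSE/WCSS for this K

plt.plot(k_values, sse, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("SSE")
plt.title("Elbow Method")
plt.show()
```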
2. Silhouette Score
The Silhouette Score measures how similar a data point is to its own cluster compared to other clusters. It provides a score ranging from $-1$ to $+1$:
- A score close to $+1$ means that points are well clustered and distinct from other clusters.
- A score close to $0$ means that the points lie on the boundary of two clusters.
- Negative values indicate that points are assigned to the wrong clusters.
Steps:
- Compute the silhouette score for different values of $K$.
- Plot the silhouette score against $K$.
How to interpret:
- Choose the $K$ that maximizes the silhouette score.
- High silhouette scores indicate that the clusters are dense and well separated.
Example:
The silhouette score for a data point $i$ is given by:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
where $a(i)$ is the average intra-cluster distance (average distance to the other points within its cluster) and $b(i)$ is the average nearest-cluster distance (average distance to the points of the nearest cluster the point is not a part of).
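A sketch using scikit-learn's `silhouette_score`, again assuming `X` is a NumPy feature matrix; the silhouette is undefined for $K = 1$, so the loop starts at 2:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

scores = {}
for k in range(2, 11):                       # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean s(i) over all points

best_k = max(scores, key=scores.get)         # K with the highest average silhouette
```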
3. Gap Statistic
The Gap Statistic compares the performance of K-means clustering on the data to the performance of K-means on a dataset that has no cluster structure (random uniform distribution). It quantifies how far the within-cluster dispersion deviates from that of a reference distribution.
Steps:
- For each $K$, compute the within-cluster dispersion $W_K$ on the original data.
- Generate random reference datasets and compute the within-cluster dispersion for each reference dataset.
- The gap statistic is defined as the difference between the logarithms of the dispersions of the reference data and the original data:
  $$\mathrm{Gap}(K) = \mathbb{E}\big[\log W_K^{\mathrm{ref}}\big] - \log W_K$$
  where $W_K^{\mathrm{ref}}$ is the dispersion of the reference datasets.
How to interpret:
- Choose the $K$ that maximizes the gap statistic.
- A large gap indicates that the clustering found in the original data is significantly better than that in the random data.
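scikit-learn has no built-in gap statistic, so the following is only a rough sketch of the idea: `gap_statistic` is a hypothetical helper, dispersion is taken to be the K-means inertia, and reference data are drawn uniformly over the bounding box of `X` (assumed to be a NumPy array).

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_refs=10, seed=0):
    """Hypothetical helper: Gap(K) = mean(log W_ref) - log W_data."""
    rng = np.random.default_rng(seed)
    # Within-cluster dispersion of the real data (K-means inertia).
    log_w = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)
    # Dispersion of reference datasets drawn uniformly over X's bounding box.
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref_log_w = [
        np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
               .fit(rng.uniform(lo, hi, size=X.shape)).inertia_)
        for _ in range(n_refs)
    ]
    return np.mean(ref_log_w) - log_w
```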
4. Davies-Bouldin Index
The Davies-Bouldin Index (DBI) measures the average similarity ratio between each cluster and its most similar cluster. A lower DBI indicates better clustering.
Steps:
- For each $K$, compute the Davies-Bouldin Index.
- The index is given by:
  $$\mathrm{DBI} = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{s_i + s_j}{d_{ij}}$$
  where $s_i$ is the average distance between each point in cluster $i$ and the centroid of $i$, and $d_{ij}$ is the distance between the centroids of clusters $i$ and $j$.
How to interpret:
- A lower Davies-Bouldin Index indicates better clustering.
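scikit-learn provides `davies_bouldin_score`, so a sketch (with `X` again assumed to be the feature matrix) is short:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

dbi = {}
for k in range(2, 11):                       # the index needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    dbi[k] = davies_bouldin_score(X, labels)

best_k = min(dbi, key=dbi.get)               # lower Davies-Bouldin is better
```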
5. Within-Cluster Sum of Squares (WCSS)
This is similar to the Elbow Method but more focused on the internal variance of clusters. WCSS measures the compactness of the clusters by computing the sum of squared distances between each point and the centroid of its assigned cluster.
Steps:
- For each $K$, compute the WCSS:
  $$\mathrm{WCSS} = \sum_{i=1}^{K} \sum_{x \in C_i} \| x - \mu_i \|^2$$
  where $C_i$ is cluster $i$ and $\mu_i$ is the centroid of cluster $i$.
- Plot WCSS as a function of $K$.
How to interpret:
- Look for a point where increasing $K$ no longer significantly reduces WCSS, similar to the Elbow Method.
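WCSS is the same quantity scikit-learn exposes as `inertia_`; the sketch below just evaluates the formula directly, assuming `labels` and `centroids` are NumPy arrays taken from a fitted model (e.g. `model.labels_` and `model.cluster_centers_`).

```python
import numpy as np

def wcss(X, labels, centroids):
    """Sum of squared distances from each point to the centroid of its cluster."""
    return float(np.sum((X - centroids[labels]) ** 2))
```

For a fitted scikit-learn model, `wcss(X, model.labels_, model.cluster_centers_)` should match `model.inertia_` up to floating-point error.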
6. Cross-Validation (for semi-supervised problems)
In some cases labels are available but K-means is still desired for exploratory analysis; validation metrics such as the silhouette score or the adjusted Rand index can then be used to validate the clusters against those labels.
How to interpret:
- Use cross-validation metrics (e.g., adjusted Rand index) to choose $K$ based on the best fit for the available labels.
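A sketch of label-based validation with scikit-learn's `adjusted_rand_score`, assuming `X` is the feature matrix and `y_true` holds the available labels (both names are placeholders):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

ari = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    ari[k] = adjusted_rand_score(y_true, labels)  # agreement with known labels

best_k = max(ari, key=ari.get)                    # K that best matches the labels
```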
Summary of Techniques:
| Technique | Purpose | Interpretation |
|---|---|---|
| Elbow Method | Find the point where SSE starts to level off ("elbow"). | Choose $K$ at the elbow of the SSE curve. |
| Silhouette Score | Measure cluster separation and cohesion. | Choose the $K$ that maximizes the silhouette score. |
| Gap Statistic | Compare clustering to random reference data. | Choose the $K$ that maximizes the gap statistic. |
| Davies-Bouldin Index | Measure intra-cluster compactness and inter-cluster separation. | Choose the $K$ that minimizes the index. |
| Within-Cluster Sum of Squares (WCSS) | Measure compactness of clusters. | Choose the $K$ beyond which WCSS no longer drops significantly. |