2.1 Clustering Overview
Clustering in machine learning is the task of grouping a set of objects so that objects in the same group (cluster) are more similar to each other than to those in other groups. It is an unsupervised learning approach: the algorithm finds patterns or structure in the data without prior labels.
1. Techniques of Clustering
- K-Means Clustering
  - Overview: K-means clustering aims to partition observations into k clusters, where each observation belongs to the cluster with the nearest mean (centroid).
  - Algorithm:
    1. Initialization: Choose k initial centroids randomly.
    2. Assignment: Assign each data point to the nearest centroid.
    3. Update Centroids: Recalculate each centroid as the mean of all points assigned to it.
    4. Repeat: Iterate steps 2 and 3 until convergence (no change, or only minimal change, in the centroids).
  - Pros: Simple, efficient for large datasets, easy to interpret clusters.
  - Cons: Requires the number of clusters k to be specified in advance, sensitive to initial centroids, may converge to local optima.
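The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the `kmeans` helper, the synthetic blob data, the seed, and the empty-cluster guard are all illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means sketch: initialize, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1 - Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2 - Assignment: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3 - Update: each centroid becomes the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # guard against empty clusters (illustrative choice)
                new_centroids[j] = members.mean(axis=0)
        # Step 4 - Repeat until convergence: stop when centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated 2-D blobs (synthetic data, for illustration only)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
labels, centroids = kmeans(X, k=2)
```

Library implementations such as scikit-learn's `KMeans` additionally use smarter initialization (k-means++) and multiple restarts, which mitigates the sensitivity to initial centroids noted above.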
- Hierarchical Clustering
  - Overview: Hierarchical clustering builds a hierarchy of clusters either from the bottom up (agglomerative) or from the top down (divisive).
  - Agglomerative Algorithm:
    - Start: Each data point is its own cluster.
    - Merge: Iteratively merge the closest pair of clusters until all points belong to one cluster.
    - Hierarchy: The merging process is represented as a dendrogram.
  - Divisive Algorithm: Starts with one cluster containing all data points and recursively splits it into smaller clusters.
  - Pros: No need to specify the number of clusters in advance, provides a full hierarchy of clusters.
  - Cons: Computationally intensive for large datasets, dendrogram interpretation can be subjective.
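The agglomerative procedure can be sketched as below. This is an illustrative toy version assuming single-linkage (minimum point-to-point) distance; the `agglomerative` helper and the toy data are not part of the source, and a real implementation would also record merge distances to draw the dendrogram.

```python
import numpy as np

def agglomerative(X, n_clusters):
    """Minimal single-linkage agglomerative sketch: start with singleton
    clusters and repeatedly merge the closest pair until n_clusters remain."""
    clusters = [[i] for i in range(len(X))]   # Start: each point is a cluster
    while len(clusters) > n_clusters:
        best = None
        # Merge: find the closest pair of clusters (single linkage)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

# Two tight pairs of points (toy data)
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
result = agglomerative(X, 2)  # merges the two tight pairs: [[0, 1], [2, 3]]
```

The nested loop over cluster pairs makes this naive version roughly cubic in the number of points, which illustrates the "computationally intensive for large datasets" drawback listed above.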
- Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
  - Overview: DBSCAN groups together points that are closely packed (dense regions) and marks points that lie alone in low-density regions as outliers (noise).
  - Algorithm:
    - Core Points: Points with at least MinPts points within a radius ε.
    - Border Points: Points within ε of a core point but with fewer than MinPts neighbors.
    - Noise Points: Points that are neither core nor border points.
  - Pros: Can discover clusters of arbitrary shapes, robust to noise and outliers.
  - Cons: Requires tuning of ε and MinPts, sensitive to density variations.
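The core/border/noise definitions above translate into a short sketch. The `dbscan` helper, the parameter values, and the toy data are illustrative assumptions; here MinPts counts the point itself, a convention that varies between implementations.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: grow clusters from core points,
    label border points with their cluster, and leave noise as -1."""
    n = len(X)
    labels = np.full(n, -1)
    # Precompute each point's eps-neighborhood (including the point itself)
    nbrs = [np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)
            for i in range(n)]
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(nbrs[i]) < min_pts:
            continue  # already assigned, or not a core point
        # Grow a new cluster outward from core point i
        labels[i] = cluster
        frontier = list(nbrs[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster               # core or border point
                if len(nbrs[j]) >= min_pts:       # j is core: expand further
                    frontier.extend(nbrs[j])
        cluster += 1
    return labels

# Two dense groups plus one isolated outlier (toy data)
X = np.array([[0, 0], [0, 0.1], [0.1, 0],
              [5, 5], [5, 5.1], [5.1, 5],
              [10, 0]], dtype=float)
labels = dbscan(X, eps=0.3, min_pts=3)  # outlier at (10, 0) stays labeled -1
```

Note that no cluster count is given: the two clusters emerge from the density structure, and the isolated point is reported as noise rather than forced into a cluster.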
- Gaussian Mixture Models (GMM)
  - Overview: GMM assumes that the data is generated from a mixture of several Gaussian distributions, each associated with a cluster.
  - Algorithm:
    - Initialization: Choose initial parameters for the Gaussian distributions.
    - Expectation-Maximization (EM) Algorithm: Iteratively update the parameters:
      - E-step: Calculate the probability of each point belonging to each cluster (using the current parameters).
      - M-step: Update the parameters (means, covariances, and mixture coefficients) based on the expected cluster assignments.
  - Pros: Can model complex cluster shapes, provides probabilistic cluster assignments.
  - Cons: Sensitive to initialization, slower convergence compared to K-means.
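The EM loop can be sketched for the 1-D case under simplifying assumptions: scalar variances and means initialized at the endpoints of the data range. The `gmm_em` helper and the synthetic mixture data are illustrative, not part of the source.

```python
import numpy as np

def gmm_em(x, k=2, n_iters=50):
    """Minimal 1-D Gaussian mixture fitted with EM."""
    # Initialization: means spread across the data range (illustrative choice),
    # unit variances, uniform mixture weights
    mu = np.linspace(x.min(), x.max(), k)
    var = np.ones(k)
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        # E-step: responsibility (probability) of each component for each point
        dens = (pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
                / np.sqrt(2 * np.pi * var))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances from responsibilities
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return mu, var, pi

# Synthetic mixture of two Gaussians centred at 0 and 10
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(10, 1, 200)])
mu, var, pi = gmm_em(x, k=2)
```

Unlike K-means' hard assignments, the responsibilities `resp` give each point a probability under every component, which is the "probabilistic cluster assignments" advantage noted above.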
2. Progression of Techniques
- Early Techniques:
- 1960s-1970s: Basic clustering algorithms like K-means and hierarchical clustering were developed, primarily focusing on partitioning data into non-overlapping clusters.
- Advancements in the 1980s-1990s:
- Density-Based Methods: DBSCAN introduced the concept of density-based clustering, which could handle noise and outliers effectively.
- Hierarchical Clustering: Algorithms for hierarchical clustering improved, providing better visualization of the cluster hierarchy.
- Modern Techniques (2000s onwards):
- Probabilistic Models: GMM and Bayesian methods brought probabilistic interpretations to clustering, allowing uncertainty modeling.
- Deep Learning Approaches: Embedding-based clustering, using autoencoders and deep neural networks to learn data representations that are then clustered.
- Current Trends:
- Big Data Clustering: Scalable algorithms for large datasets, such as parallel K-means and streaming clustering algorithms.
- Hybrid Approaches: Combining different clustering techniques or integrating clustering with other machine learning tasks (e.g., semi-supervised learning).