2.3 Hierarchical Clustering
1. Introduction
Hierarchical clustering is a method of cluster analysis that builds a hierarchy of clusters. It aims to group similar objects into clusters based on their proximity or similarity, creating a tree-like structure known as a dendrogram. This dendrogram can be cut at different levels to form clusters at varying granularities, making hierarchical clustering versatile for exploratory data analysis.
2. Components
- Algorithm Overview
  - Agglomerative Clustering:
    - Start with each data point as a separate cluster.
    - Iteratively merge the closest pair of clusters until all data points belong to a single cluster.
    - Construct a dendrogram to visualize the merging process.
  - Divisive Clustering:
    - Start with all data points in one cluster.
    - Recursively split clusters into smaller clusters until each data point forms its own cluster.
    - Construct a dendrogram showing the splitting process.
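The agglomerative steps above can be sketched in a few lines of plain Python. This is a minimal single-linkage version; the sample points and the choice to stop at two clusters are arbitrary, for illustration only:

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def agglomerative(points, k):
    """Merge the closest pair of clusters until only k remain (single linkage)."""
    # Start with each data point as its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest single-linkage distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i.
        clusters[i] += clusters.pop(j)
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(agglomerative(points, 2))  # [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
```

The nested search over cluster pairs makes this O(n³) overall, which is why the limitations section below flags agglomerative methods as expensive on large datasets.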
- Mathematical Formulation
  - Distance Metric: Hierarchical clustering uses a distance measure (e.g., Euclidean distance, Manhattan distance) to determine the similarity between data points or clusters.
  - Linkage Criteria: Defines how the distance between clusters is computed during merging or splitting:
    - Single Linkage: Minimum distance between points in different clusters: d(A, B) = min { d(a, b) : a ∈ A, b ∈ B }
    - Complete Linkage: Maximum distance between points in different clusters: d(A, B) = max { d(a, b) : a ∈ A, b ∈ B }
    - Average Linkage: Average distance between all pairs of points from different clusters: d(A, B) = (1 / |A||B|) Σ_{a ∈ A} Σ_{b ∈ B} d(a, b)
  - Dendrogram Construction: A dendrogram visually represents the hierarchical clustering process, showing the order of merges or splits and the distances at which they occur.
3. Key Concepts
- Cluster Fusion: Agglomerative clustering merges clusters based on similarity until a single cluster is formed.
- Cluster Splitting: Divisive clustering recursively splits clusters into smaller clusters or individual data points.
- Cutting the Dendrogram: Determining the number of clusters by cutting the dendrogram at a desired similarity level or height.
- Cluster Similarity Measures: The choice of distance metric and linkage criterion influences the shape and compactness of the resulting clusters.
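Cutting the dendrogram can be phrased directly in terms of the recorded merge heights: with n points, every merge whose distance is at or below the cut height reduces the cluster count by one. A minimal sketch (the merge heights here are made up for illustration):

```python
def clusters_at_cut(n_points, merge_heights, cut):
    """Number of clusters remaining when the dendrogram is cut at a given height.

    merge_heights: the distances at which successive merges happened,
    as recorded while building the dendrogram (one entry per merge).
    """
    merges_below = sum(1 for h in merge_heights if h <= cut)
    return n_points - merges_below

heights = [0.5, 0.7, 2.1, 4.0]  # hypothetical merge distances for 5 points
print(clusters_at_cut(5, heights, 1.0))  # cutting below the big jump leaves 3 clusters
```

A common heuristic is to cut just below the largest jump in merge height, since a large jump means two well-separated clusters were forced together.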
4. Advantages and Limitations
- Advantages:
  - No need to specify the number of clusters beforehand.
  - Provides a hierarchy of clusters, aiding in exploratory data analysis and interpretation.
  - Handles non-convex clusters and varying cluster sizes.
- Limitations:
  - Computationally intensive for large datasets, especially with agglomerative methods.
  - Dendrogram interpretation can be subjective, requiring domain knowledge.
  - Sensitive to the choice of distance metric and linkage criterion.
5. Applications
- Biology: Clustering genes based on expression patterns.
- Marketing: Segmenting customers based on purchasing behavior.
- Image Processing: Grouping similar regions in image segmentation.
- Anomaly Detection: Identifying unusual patterns or outliers in data.
6. Variants and Extensions
- Ward’s Method: Minimizes the variance within each cluster during merging.
- Centroid-based Hierarchical Clustering: Uses centroids of clusters instead of individual data points.
- Hierarchical Density-Based Clustering: Integrates density-based methods for robust clustering.
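Ward's merge cost has a simple closed form: merging clusters A and B increases the total within-cluster sum of squares by |A||B| / (|A| + |B|) times the squared distance between their centroids. A sketch of that computation (pure Python, with illustrative data):

```python
from math import dist  # Euclidean distance (Python 3.8+)

def centroid(cluster):
    # Coordinate-wise mean of the points in the cluster.
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def ward_cost(A, B):
    """Increase in within-cluster sum of squares if A and B are merged."""
    d2 = dist(centroid(A), centroid(B)) ** 2
    return len(A) * len(B) / (len(A) + len(B)) * d2

A = [(0, 0), (0, 2)]  # centroid (0, 1)
B = [(4, 1)]          # centroid (4, 1)
print(ward_cost(A, B))  # 2*1/3 * 16 ≈ 10.67
```

At each step, Ward's method merges the pair with the smallest such cost, which is why it tends to produce compact, roughly equal-sized clusters.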