
2.3 Hierarchical Clustering

1. Introduction

Hierarchical clustering is a method of cluster analysis that builds a hierarchy of clusters. It groups objects by their pairwise proximity or similarity, producing a tree-like structure known as a dendrogram. Cutting the dendrogram at different levels yields clusterings of varying granularity, which makes hierarchical clustering a versatile tool for exploratory data analysis.

2. Components

  1. Algorithm Overview

    • Agglomerative Clustering:

      • Start with each data point as a separate cluster.
      • Iteratively merge the closest pair of clusters until all data points belong to a single cluster.
      • Construct a dendrogram to visualize the merging process (see the code sketch after this overview).
    • Divisive Clustering:

      • Start with all data points in one cluster.
      • Recursively split clusters into smaller ones until each data point forms its own cluster.
      • Construct a dendrogram showing the splitting process.
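
To make the agglomerative procedure concrete, here is a minimal sketch using SciPy's scipy.cluster.hierarchy module; the toy 2-D points are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical 2-D points forming three loose groups.
X = np.array([[1.0, 1.0], [1.5, 1.2],
              [5.0, 5.0], [5.2, 4.8],
              [9.0, 9.0], [8.8, 9.3]])

# Agglomerative clustering: each row of Z records one merge step --
# the two clusters joined, the distance between them, and the size
# of the newly formed cluster.
Z = linkage(X, method="average", metric="euclidean")

# Visualize the merge order and merge heights as a dendrogram.
dendrogram(Z, labels=[f"p{i}" for i in range(len(X))])
plt.ylabel("merge distance")
plt.show()
```
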
  2. Mathematical Formulation

    • Distance Metric: Hierarchical clustering uses a distance measure (e.g., Euclidean distance, Manhattan distance) to determine the similarity between data points or clusters.

    • Linkage Criteria: Defines how the distance between two clusters A and B is computed during merging or splitting (the sketch after this list evaluates each criterion by hand):

      • Single Linkage: Minimum distance between points in different clusters: $d(A, B) = \min_{a \in A,\, b \in B} d(a, b)$
      • Complete Linkage: Maximum distance between points in different clusters: $d(A, B) = \max_{a \in A,\, b \in B} d(a, b)$
      • Average Linkage: Average distance between all pairs of points from different clusters: $d(A, B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a, b)$
    • Dendrogram Construction: A dendrogram visually represents the hierarchical clustering process, showing the order and distances at which clusters are merged or split.
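
The three linkage criteria are easy to verify by hand. A small sketch, using two made-up clusters A and B and SciPy's pairwise-distance helper:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters of 2-D points.
A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 0.0], [4.0, 1.0]])

# All pairwise Euclidean distances between points of A and points of B.
D = cdist(A, B)

print("single  :", D.min())   # min over cross-cluster pairs
print("complete:", D.max())   # max over cross-cluster pairs
print("average :", D.mean())  # mean over all |A|*|B| pairs
```
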

3. Key Concepts

  • Cluster Fusion: Agglomerative clustering merges clusters based on similarity until a single cluster is formed.

  • Cluster Splitting: Divisive clustering recursively splits clusters into smaller clusters or individual data points.

  • Cutting the Dendrogram: Determining the number of clusters by cutting the dendrogram at a chosen height or similarity level (see the sketch after this list).

  • Cluster Similarity Measures: Choosing an appropriate distance metric and linkage criterion influences the resulting clusters’ shape and compactness.
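
As a sketch of cutting the dendrogram in code (reusing the hypothetical toy data from the earlier sketch), SciPy's fcluster turns a linkage matrix into flat cluster labels, either at a fixed height or for a target cluster count:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.5, 1.2],
              [5.0, 5.0], [5.2, 4.8],
              [9.0, 9.0], [8.8, 9.3]])
Z = linkage(X, method="average")

# Cut at a fixed height: all merges above distance 3.0 are undone,
# which leaves three clusters for this toy data.
print(fcluster(Z, t=3.0, criterion="distance"))

# Or request a target number of flat clusters directly.
print(fcluster(Z, t=2, criterion="maxclust"))
```
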

4. Advantages and Limitations

  • Advantages:

    • No need to specify the number of clusters beforehand.
    • Provides a hierarchy of clusters, aiding in exploratory data analysis and interpretation.
    • Can capture non-convex clusters and clusters of varying size, depending on the linkage criterion.
  • Limitations:

    • Computationally intensive for large datasets: standard agglomerative algorithms require O(n²) memory and up to O(n³) time.
    • Dendrogram interpretation can be subjective and typically requires domain knowledge.
    • Results are sensitive to the choice of distance metric and linkage criterion.

5. Applications

  • Biology: Clustering genes based on expression patterns.
  • Marketing: Segmenting customers based on purchasing behavior.
  • Image Processing: Grouping similar regions in image segmentation.
  • Anomaly Detection: Identifying unusual patterns or outliers in data.

6. Variants and Extensions

  • Ward’s Method: At each merge, joins the pair of clusters whose union produces the smallest increase in total within-cluster variance (see the sketch after this list).
  • Centroid-based Hierarchical Clustering: Uses centroids of clusters instead of individual data points.
  • Hierarchical Density-Based Clustering: Combines hierarchical and density-based methods (e.g., HDBSCAN) for more robust clustering.
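
As a closing sketch, Ward's method is available directly in scikit-learn's AgglomerativeClustering; the data and parameter choices below are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical toy points, as in the earlier sketches.
X = np.array([[1.0, 1.0], [1.5, 1.2],
              [5.0, 5.0], [5.2, 4.8],
              [9.0, 9.0], [8.8, 9.3]])

# linkage="ward" joins, at each step, the pair of clusters whose
# merge increases the total within-cluster variance the least.
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X)
print(labels)  # one flat cluster label per input point
```
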