
1.1 Fundamentals

1. Elbow Method

The elbow method is a heuristic technique used to determine the optimal number of clusters (k) in k-means clustering.

How it works:

  • Run k-means clustering for different values of k (e.g., k = 1 to 10).
  • For each k, calculate the within-cluster sum of squares (WCSS), also called inertia: the total squared distance between each point and the centroid of its cluster (see the formula after this list).
  • Plot k on the x-axis and WCSS on the y-axis.
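
In symbols (standard notation, with clusters $C_1, \dots, C_k$ and centroids $\mu_1, \dots, \mu_k$):

$$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{\mathbf{x} \in C_j} \lVert \mathbf{x} - \mu_j \rVert^2$$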

Interpretation:

  • As k increases, WCSS always decreases, because each point is assigned to a nearer centroid.
  • The “elbow point” is where the rate of WCSS reduction sharply slows — like the bend in an arm.
  • This elbow is considered the optimal k because it balances compact clusters with model simplicity.
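
A minimal sketch in Python using scikit-learn's KMeans, whose inertia_ attribute is exactly the WCSS; the blob data and the k = 1 to 10 range are illustrative choices:

```python
# Sketch of the elbow method; the synthetic blobs and the k range
# are illustrative assumptions, not part of the original text.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

ks = range(1, 11)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)  # inertia_ is the WCSS for this k

plt.plot(ks, wcss, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow method")
plt.show()
```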

2. k-means with Outliers?

k-means is sensitive to outliers due to its use of Euclidean distance:

  • Effect of outliers:

    • Outliers can pull centroids toward themselves, distorting cluster assignments.
    • Can lead to poor cluster quality, especially if outliers lie far from actual data clusters.
  • Why this happens:

    • k-means minimizes squared distances, which exaggerates the impact of points far from the centroid.

Mitigations:

  • Use k-medoids or DBSCAN, which are more robust to outliers.
  • Preprocess data with outlier detection and removal.
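
A small sketch (synthetic data, illustrative numbers) showing how a single far-away point drags a k-means centroid, plus a simple z-score removal step as one possible preprocessing mitigation:

```python
# Demonstrates the centroid pull from one outlier, then filters outliers
# by z-score. All data and thresholds here are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(100, 2))  # one tight cluster near (0, 0)
X_out = np.vstack([X, [[50.0, 50.0]]])             # add a single far outlier

c_clean = KMeans(n_clusters=1, n_init=10).fit(X).cluster_centers_[0]
c_dirty = KMeans(n_clusters=1, n_init=10).fit(X_out).cluster_centers_[0]
print("centroid without outlier:", c_clean)  # near (0, 0)
print("centroid with outlier:   ", c_dirty)  # pulled toward (50, 50)

# Mitigation: keep only points whose features all lie within
# 3 standard deviations of the mean.
z = np.abs((X_out - X_out.mean(axis=0)) / X_out.std(axis=0))
X_filtered = X_out[(z < 3).all(axis=1)]
```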

3. Kernel PCA for Non-linear Dimensionality Reduction

Kernel PCA (Principal Component Analysis) extends PCA to non-linear relationships by using the kernel trick.

Key ideas:

  • Standard PCA projects data onto the directions of maximum variance, which assumes linear structure in the data.

  • Kernel PCA:

    • First maps data into a higher-dimensional feature space via a non-linear kernel function (e.g., RBF, polynomial).
    • Performs PCA in this new space without computing the coordinates explicitly — this is the kernel trick.

Steps:

  1. Choose a kernel function $\kappa(\mathbf{x}, \mathbf{y})$.
  2. Compute the kernel matrix $K$, where $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$.
  3. Center $K$, then compute its eigenvalues and eigenvectors.
  4. Project the data onto the top eigenvectors (principal components) of the centered $K$.
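
For step 3, centering can be done directly on the kernel matrix via the standard identity (with $\mathbf{1}_n$ denoting the $n \times n$ matrix whose entries are all $1/n$):

$$K' = K - \mathbf{1}_n K - K \mathbf{1}_n + \mathbf{1}_n K \mathbf{1}_n$$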

Use case: Ideal when data lies on a non-linear manifold, e.g., a spiral or concentric circles.
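
A minimal sketch using scikit-learn's KernelPCA on the concentric-circles case just mentioned; the gamma value is an illustrative choice:

```python
# Kernel PCA with an RBF kernel on concentric circles; the dataset
# parameters and gamma=10 are illustrative assumptions.
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_lin = PCA(n_components=2).fit_transform(X)  # linear PCA: circles stay entangled
X_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)
# Along the first kernel principal component the inner and outer circles
# become approximately linearly separable, which linear PCA cannot achieve here.
```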

4. Compare and Contrast PCA and t-SNE

| Feature | PCA | t-SNE |
| --- | --- | --- |
| Type | Linear | Non-linear |
| Goal | Maximize global variance | Preserve local structure (similarity) |
| Distance metric | Euclidean | Conditional probabilities (based on similarity) |
| Interpretability | High (components are linear combinations of features) | Low (axes have no clear meaning) |
| Scalability | Fast, scalable to large datasets | Slow, computationally expensive |
| Output | Can be any number of dimensions | Typically 2 or 3 dimensions |
| Use case | Preprocessing, compression, noise reduction | Data visualization (e.g., clusters in high-dimensional data) |

Summary:

  • Use PCA when your goal is compression or noise reduction.
  • Use t-SNE when your goal is visualizing clusters or manifolds in high-dimensional datasets.
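
A short sketch embedding the same dataset with both methods (scikit-learn's digits set is an illustrative choice; perplexity is spelled out only to flag the main t-SNE knob, 30.0 being the library default):

```python
# Contrast PCA and t-SNE as 2-D embeddings of the same data.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)   # fast, linear, deterministic
X_tsne = TSNE(n_components=2, perplexity=30.0,
              random_state=0).fit_transform(X)  # slow, non-linear, stochastic
# X_pca preserves the directions of greatest global variance; X_tsne tends
# to pull the ten digit classes into tighter local clusters, at the cost
# of distorting global geometry.
```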