1.1 Fundamentals
1. Elbow Method
The elbow method is a heuristic technique used to determine the optimal number of clusters (k) in k-means clustering.
How it works:
- Run k-means clustering for different values of k (e.g., k = 1 to 10).
- For each k, calculate the within-cluster sum of squares (WCSS), also called inertia: the total squared distance between each point and the centroid of its cluster.
- Plot k on the x-axis and WCSS on the y-axis.
Interpretation:
- As k increases, WCSS decreases because clusters become smaller.
- The "elbow point" is where the rate of WCSS reduction slows sharply, like the bend in an arm.
- This elbow is taken as the optimal k because it balances compact clusters with model simplicity.
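The procedure is easy to sketch with scikit-learn. A minimal example, assuming synthetic blob data and a candidate range of k = 1 to 10 (both illustrative choices, not part of the method itself):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data: three well-separated Gaussian blobs.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit k-means for each candidate k and record WCSS (sklearn calls it inertia_).
ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
        for k in ks]

# Plot k vs. WCSS; the bend ("elbow") suggests the optimal k (here, 3).
plt.plot(ks, wcss, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("WCSS (inertia)")
plt.show()
```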
2. k-means with Outliers?
Effect of outliers:
- Outliers can pull centroids toward themselves, distorting cluster assignments.
- Can lead to poor cluster quality, especially if outliers lie far from actual data clusters.
Why this happens:
- k-means minimizes squared distances, which exaggerates the impact of points far from the centroid.
Mitigations:
- Use k-medoids or DBSCAN, which are more robust to outliers.
- Preprocess the data with outlier detection and removal.
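A small sketch of the distortion, using made-up data: two tight clusters plus one extreme point. Because the squared-distance objective is dominated by the far point, k-means typically spends a whole centroid on it:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two tight clusters centered at (0, 0) and (5, 5).
clean = np.vstack([rng.normal(0, 0.3, (50, 2)),
                   rng.normal(5, 0.3, (50, 2))])
# The same data with a single extreme outlier appended.
dirty = np.vstack([clean, [[100.0, 100.0]]])

for name, X in [("without outlier", clean), ("with outlier", dirty)]:
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(name, "->", km.cluster_centers_.round(2).tolist())
# Without the outlier, the centroids sit near (0, 0) and (5, 5); with it,
# one centroid chases the lone far point and the two real clusters merge.
```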
3. Kernel PCA for Non-linear Dimensionality Reduction
Kernel PCA (Principal Component Analysis) extends PCA to non-linear relationships by using the kernel trick.
Key ideas:
- Standard PCA projects data onto the directions of maximum variance, which assumes linear structure.
- Kernel PCA:
- First maps data into a higher-dimensional feature space via a non-linear kernel function (e.g., RBF, polynomial).
- Performs PCA in this new space without computing the coordinates explicitly — this is the kernel trick.
Steps:
- Choose a kernel function k(x, y).
- Compute the kernel matrix K, where K_ij = k(x_i, x_j).
- Center K, then compute its eigenvalues and eigenvectors.
- Project the data onto the top eigenvectors (principal components) of K.
Use case: Ideal when data lies on a non-linear manifold, e.g., a spiral or concentric circles.
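A minimal sketch of the concentric-circles case, using scikit-learn's KernelPCA with an RBF kernel (gamma = 10 is a hand-tuned value for this toy data, not a general default):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: non-linear structure that linear PCA cannot separate.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Kernel PCA with an RBF kernel; the kernel trick performs PCA in the implicit
# high-dimensional feature space without ever computing its coordinates.
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# Plain PCA for comparison: only a rotation here, so the circles stay entangled.
X_pca = PCA(n_components=2).fit_transform(X)
```

In the kernel-PCA projection the two circles become nearly linearly separable along the first component, which is exactly what the implicit non-linear map buys you.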
4. Compare and Contrast PCA and t-SNE
| Feature | PCA | t-SNE |
| --- | --- | --- |
| Type | Linear | Non-linear |
| Goal | Maximize global variance | Preserve local structure (similarity) |
| Distance metric | Euclidean | Conditional probabilities (based on similarity) |
| Interpretability | High (components are linear combinations of features) | Low (axes have no clear meaning) |
| Scalability | Fast, scalable to large datasets | Slow, computationally expensive |
| Output | Any number of dimensions | Typically 2 or 3 dimensions |
| Use case | Preprocessing, compression, noise reduction | Data visualization (e.g., clusters in high-dimensional data) |
Summary:
- Use PCA when your goal is compression or noise reduction.
- Use t-SNE when your goal is visualizing clusters or manifolds in high-dimensional datasets.
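A sketch contrasting the two on the same data. The digits dataset and the perplexity value are illustrative choices; note that t-SNE is stochastic, so random_state matters for reproducibility:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)  # 1797 samples, 64 features

# PCA: linear, fast, deterministic; useful for compression/preprocessing.
X_pca = PCA(n_components=2).fit_transform(X)

# t-SNE: non-linear, slow, stochastic; useful for visualizing local structure.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

print(X_pca.shape, X_tsne.shape)  # (1797, 2) (1797, 2)
```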