3.1 DR Overview

1. Introduction

Dimensionality reduction is a family of machine learning techniques for reducing the number of features or variables in a dataset while retaining as much of the relevant information as possible. It is crucial for simplifying models, reducing computational cost, mitigating the curse of dimensionality, and improving visualization. Dimensionality reduction methods can be broadly categorized into feature selection and feature extraction.

  • Feature Selection: Involves selecting a subset of the original features based on criteria such as relevance to the target or variance, without creating new features.
  • Feature Extraction: Involves transforming the original features into a new, typically lower-dimensional, set of features (a brief example contrasting the two appears below).
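
A minimal scikit-learn sketch contrasting the two approaches; the Iris dataset, the ANOVA F-test scorer, and k = 2 are illustrative choices, not part of the text above:

```python
# Feature selection vs. feature extraction (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)           # 150 samples, 4 features

# Feature selection: keep 2 of the original 4 features, chosen by an ANOVA F-test.
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Feature extraction: build 2 new features as linear combinations of all 4.
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_extracted.shape)  # (150, 2) (150, 2)
```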

2. Techniques

2.1 Principal Component Analysis (PCA)

  • Overview: PCA is a widely used linear technique for feature extraction. It transforms the original features into a new set of orthogonal features called principal components, which capture the maximum variance in the data.

  • Mathematical Formulation:

    • Center the data by subtracting the mean of each feature (features are often also scaled to unit variance).
    • Compute the covariance matrix C of the centered data.
    • Perform eigenvalue decomposition on C to obtain eigenvalues and eigenvectors.
    • The principal components are the eigenvectors corresponding to the largest eigenvalues.
    • Project the data onto the subspace spanned by the top k principal components (a NumPy sketch of these steps appears below).
  • Advantages:

    • Reduces dimensionality while retaining significant variance.
    • Helps in visualizing high-dimensional data.
  • Limitations:

    • Assumes linear relationships between features.
    • Sensitive to scaling of features.
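
A minimal NumPy sketch of the steps above, assuming the data matrix X has samples as rows; the synthetic data and the choice k = 2 are only for illustration:

```python
import numpy as np

def pca(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)                 # center each feature
    C = np.cov(X_centered, rowvar=False)            # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)            # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]               # sort eigenvalues, descending
    components = eigvecs[:, order[:k]]              # top-k eigenvectors
    explained = eigvals[order[:k]] / eigvals.sum()  # fraction of variance retained
    return X_centered @ components, explained

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                      # synthetic data for illustration
X_reduced, explained = pca(X, k=2)
print(X_reduced.shape, explained)                   # (200, 2) and the variance ratios
```

In practice, library implementations such as scikit-learn's PCA class (which computes the decomposition via SVD rather than an explicit eigendecomposition of the covariance matrix) give equivalent results.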

2.2 Linear Discriminant Analysis (LDA)

  • Overview: LDA is a supervised dimensionality reduction technique that seeks to find a linear combination of features that maximizes the separation between multiple classes.

  • Mathematical Formulation:

    • Compute the within-class scatter matrix S_W and the between-class scatter matrix S_B.
    • Solve the eigenvalue problem for S_W^-1 S_B (equivalently, the generalized eigenvalue problem S_B w = λ S_W w) to obtain the projection matrix.
    • Project the data onto the subspace defined by the top k eigenvectors; for C classes, at most C − 1 discriminant directions exist (a usage sketch follows below).
  • Advantages:

    • Enhances class separability.
    • Suitable for classification tasks.
  • Limitations:

    • Assumes normally distributed data with equal covariance among classes.
    • May not perform well if class distributions are highly overlapping.
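
A minimal scikit-learn sketch; the Iris dataset is only a stand-in for any labelled numeric data, and with three classes LDA can produce at most two components:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Supervised: fitting uses the class labels y, unlike PCA.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print(X_lda.shape)                      # (150, 2)
print(lda.explained_variance_ratio_)    # share of between-class variance per component
```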

2.3 t-Distributed Stochastic Neighbor Embedding (t-SNE)

  • Overview: t-SNE is a non-linear technique that emphasizes preserving local structures in the data, making it particularly useful for visualizing high-dimensional data in 2D or 3D.

  • Mathematical Formulation:

    • Compute pairwise similarities in the high-dimensional space using Gaussian kernels, with each point's bandwidth set by the perplexity parameter.
    • Compute pairwise similarities in the low-dimensional space using a Student’s t-distribution, whose heavy tails reduce crowding in the embedding.
    • Minimize the Kullback-Leibler divergence between the two distributions to find the low-dimensional representation (see the sketch below).
  • Advantages:

    • Effective at visualizing complex data structures and clusters.
    • Preserves local relationships well.
  • Limitations:

    • Computationally intensive.
    • Does not reliably preserve global structure; distances between clusters in the embedding are not meaningful.
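
A minimal scikit-learn sketch producing a 2-D embedding for visualization; the digits dataset and the perplexity value are illustrative choices:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)     # 1797 samples, 64 features

# perplexity roughly controls the size of the neighborhood each point attends to.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_embedded = tsne.fit_transform(X)

print(X_embedded.shape)                 # (1797, 2): suitable for a scatter plot colored by y
```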

2.4 Uniform Manifold Approximation and Projection (UMAP)

  • Overview: UMAP is a non-linear dimensionality reduction technique that preserves both local and global structures by modeling data as a topological manifold.

  • Mathematical Formulation:

    • Build a weighted k-nearest-neighbor graph in the high-dimensional space, with edge weights given by a locally adaptive exponential kernel (interpreted as a fuzzy simplicial set).
    • Define similarities in the low-dimensional space with a smooth, heavy-tailed curve, analogous to the t-distribution used by t-SNE.
    • Optimize the low-dimensional layout with a cross-entropy objective so that it matches the high-dimensional graph (a usage sketch follows below).
  • Advantages:

    • Balances local and global structure preservation.
    • Generally faster and scalable compared to t-SNE.
  • Limitations:

    • Results are sensitive to parameter choices such as n_neighbors and min_dist.
    • Less intuitive than linear methods.
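
A minimal sketch, assuming the separate umap-learn package is installed; the parameter values shown are the library defaults, not recommendations:

```python
from sklearn.datasets import load_digits
import umap

X, y = load_digits(return_X_y=True)

# n_neighbors trades off local vs. global structure; min_dist controls cluster tightness.
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
X_embedded = reducer.fit_transform(X)

print(X_embedded.shape)   # (1797, 2)
```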

2.5 Independent Component Analysis (ICA)

  • Overview: ICA is a technique used to separate a multivariate signal into additive, statistically independent components. It is often used in signal processing.

  • Mathematical Formulation:

    • Model the observed data as a linear mixture X = AS of statistically independent source signals.
    • Use measures of non-Gaussianity or dependence (e.g., kurtosis, negentropy, or mutual information) to identify the independent components.
    • Estimate the unmixing matrix and recover the source signals (a FastICA sketch follows below).
  • Advantages:

    • Effective for signal separation tasks.
    • Does not assume Gaussian data; in fact, it relies on the sources being non-Gaussian.
  • Limitations:

    • Assumes the number of components is known.
    • Sensitive to noise and outliers.
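
A minimal scikit-learn FastICA sketch of the classic blind-source-separation setup; the two synthetic signals and the mixing matrix A are hypothetical:

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)

# Two independent, non-Gaussian source signals.
s1 = np.sin(2 * t)                           # sinusoid
s2 = np.sign(np.sin(3 * t))                  # square wave
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5],                    # hypothetical mixing matrix
              [0.5, 2.0]])
X = S @ A.T                                  # observed mixed signals

ica = FastICA(n_components=2, random_state=0)
S_estimated = ica.fit_transform(X)           # recovered sources (up to order/scale/sign)
print(S_estimated.shape, ica.mixing_.shape)  # (2000, 2) (2, 2)
```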

2.6 Autoencoders

  • Overview: Autoencoders are neural network-based models that learn to encode data into a lower-dimensional representation and then reconstruct the original data. They can capture complex non-linear relationships.

  • Mathematical Formulation:

    • Train a neural network with an encoder that maps data to a latent space and a decoder that reconstructs the original data from the latent space.
    • Minimize the reconstruction error using a loss function such as mean squared error (a minimal training sketch follows below).
  • Advantages:

    • Capable of capturing non-linear relationships.
    • Flexible architecture that can be tailored to specific tasks.
  • Limitations:

    • Requires substantial computational resources.
    • Hyperparameter tuning and network architecture selection can be challenging.
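
A minimal PyTorch sketch, assuming PyTorch is installed; the layer sizes, latent dimension, and synthetic data are arbitrary illustrative choices:

```python
import torch
from torch import nn

# Hypothetical sizes: 64-dimensional inputs compressed to a 2-dimensional latent code.
input_dim, latent_dim = 64, 2

class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(            # maps data to the latent space
            nn.Linear(input_dim, 32), nn.ReLU(),
            nn.Linear(32, latent_dim),
        )
        self.decoder = nn.Sequential(            # reconstructs data from the latent code
            nn.Linear(latent_dim, 32), nn.ReLU(),
            nn.Linear(32, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                           # reconstruction error

X = torch.randn(512, input_dim)                  # synthetic data for illustration
for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)                  # compare reconstruction to input
    loss.backward()
    optimizer.step()

codes = model.encoder(X).detach()                # low-dimensional representation
print(codes.shape)                               # torch.Size([512, 2])
```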

3. Progression of Techniques

  1. Linear Methods (PCA, LDA):

    • Initial Approach: Start with linear methods like PCA and LDA to reduce dimensionality, as these are often simpler to implement and understand. They work well when the relationships between features are linear.
  2. Non-Linear Methods (t-SNE, UMAP):

    • Advanced Approach: Move to non-linear methods like t-SNE and UMAP for complex data where linear methods fail to capture the underlying structure. These methods are particularly useful for visualization and capturing non-linear relationships.
  3. Advanced Techniques (ICA, Autoencoders):

    • Specialized Approach: Use ICA for specific applications such as signal separation, and autoencoders for capturing complex non-linear patterns and reconstructing data. These methods provide flexibility but come with higher computational costs and complexity.