1.3 Problems

1. Multicollinearity

In linear regression, the correlation of features (also known as multicollinearity) can be problematic for several reasons:

  1. Unstable Coefficient Estimates: When features are highly correlated, the coefficients of the linear regression model can become very sensitive to small changes in the data. This means that the estimated coefficients can vary widely, making them unreliable and difficult to interpret.

  2. Reduced Interpretability: In the presence of multicollinearity, it becomes challenging to determine the individual effect of each feature on the target variable. Since the features are correlated, the model may assign importance to one feature over another arbitrarily, leading to misleading conclusions about the relationship between the features and the target.

  3. Increased Variance of Coefficient Estimates: Multicollinearity inflates the variance of the coefficient estimates, which can lead to overfitting. Overfitting occurs when the model captures noise in the training data rather than the underlying relationship, resulting in poor generalization to new data. (A short simulation after this list illustrates the inflated variance.)

  4. Difficulty in Feature Selection: When features are correlated, it becomes difficult to identify which features are truly important for predicting the target variable. This can complicate the process of feature selection and model simplification.

  5. Numerical Instability: Highly correlated features can lead to numerical instability in the computation of the regression coefficients. This is particularly problematic when using methods like ordinary least squares (OLS), which involve matrix inversion. If the feature matrix is nearly singular due to multicollinearity, the inversion process can become unstable, leading to large errors in the coefficient estimates.
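
To make points 1 and 3 concrete, here is a minimal simulation sketch. The data-generating process, coefficient values, and noise scales are assumed purely for illustration; it fits OLS many times on freshly sampled data and compares the spread of the estimated coefficient on x₁ when x₂ is independent versus nearly a copy of x₁.

```python
# Minimal sketch (assumed setup): how feature correlation inflates the
# variance of OLS coefficient estimates.
import numpy as np

rng = np.random.default_rng(0)

def beta1_spread(correlated, n_sims=500, n=50):
    """Fit OLS on freshly sampled data n_sims times; return the std of the beta_1 estimates."""
    estimates = []
    for _ in range(n_sims):
        x1 = rng.normal(size=n)
        if correlated:
            x2 = x1 + rng.normal(scale=0.05, size=n)  # x2 is nearly a copy of x1
        else:
            x2 = rng.normal(size=n)                   # x2 is independent of x1
        y = 3 * x1 + 2 * x2 + rng.normal(size=n)      # same true coefficients in both cases
        X = np.column_stack([np.ones(n), x1, x2])     # columns: intercept, x1, x2
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares
        estimates.append(beta[1])                     # estimated coefficient on x1
    return np.std(estimates)

print("std of estimated beta_1, independent features:", beta1_spread(False))
print("std of estimated beta_1, correlated features: ", beta1_spread(True))
```

With the nearly duplicated feature, the spread of the estimate is typically far larger, even though the true coefficients are identical in both settings.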

1.1 Addressing Multicollinearity

To mitigate the issues caused by multicollinearity, you can consider the following approaches:

  1. Remove Correlated Features: Identify and remove one or more of the correlated features. This can be done by examining the correlation matrix or using techniques like Variance Inflation Factor (VIF) to detect multicollinearity (see the sketch after this list).

  2. Principal Component Analysis (PCA): Use PCA to transform the correlated features into a set of uncorrelated principal components. These components can then be used as inputs to the linear regression model.

  3. Ridge Regression: Apply ridge regression, which introduces a penalty term to the loss function to shrink the coefficients. This can help reduce the impact of multicollinearity and stabilize the coefficient estimates.

  4. Lasso Regression: Use lasso regression, which also introduces a penalty term but has the additional benefit of performing feature selection by shrinking some coefficients to zero.

  5. Elastic Net: Combine the benefits of ridge and lasso regression by using elastic net, which includes both L1 and L2 penalty terms.
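
As a brief sketch of how two of these steps could look in practice (statsmodels and scikit-learn are assumed, and the data below is made up for illustration): the Variance Inflation Factor flags the collinear pair, and lasso tends to drive one of the two nearly identical coefficients to, or close to, zero.

```python
# Hedged sketch (assumed data): detect multicollinearity with VIF, then fit a
# lasso model whose L1 penalty can zero out one of the redundant features.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)        # highly correlated with x1
y = 3 * x1 + 2 * x2 + rng.normal(size=n)

# 1. Variance Inflation Factor: values far above ~5-10 signal multicollinearity.
X_with_const = sm.add_constant(np.column_stack([x1, x2]))  # intercept, x1, x2
for i, name in [(1, "x1"), (2, "x2")]:
    print(f"VIF({name}) = {variance_inflation_factor(X_with_const, i):.1f}")

# 4. Lasso: with near-duplicate features, one coefficient is often pushed to zero.
lasso = Lasso(alpha=0.1, max_iter=10_000).fit(np.column_stack([x1, x2]), y)
print("lasso coefficients:", lasso.coef_)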

1.2 Example

The following example illustrates how unstable coefficient estimates arise in linear regression when features are highly correlated.

Example Scenario:

Suppose you are trying to predict a person’s income (target variable) based on two features:

  1. Years of Education (Feature x₁)
  2. Years of Postgraduate Education (Feature x₂)

These two features are highly correlated because postgraduate education is a subset of total education. For example:

  • If someone has 16 years of education, they might have 4 years of postgraduate education.
  • If someone has 18 years of education, they might have 6 years of postgraduate education.

Data

Here’s a small dataset:

Years of Education (x₁)    Years of Postgraduate Education (x₂)    Income (y)
16                         4                                        70,000
18                         6                                        90,000
20                         8                                        110,000
22                         10                                       130,000

Problem

When you fit a linear regression model:

y = β₀ + β₁x₁ + β₂x₂ + ε

the model tries to estimate the coefficients β₁ and β₂. However, because x₁ and x₂ are highly correlated, the model struggles to distinguish the individual contribution of each feature to the target y.
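
To make this concrete, here is a minimal sketch (scikit-learn assumed) that fits OLS on the four rows of the table above. Note that in this toy data x₂ is exactly x₁ − 12, so the two features are perfectly collinear and the fitted split between them is essentially arbitrary.

```python
# Minimal sketch: fit OLS on the toy income data from the table above.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[16, 4], [18, 6], [20, 8], [22, 10]], dtype=float)  # columns: x1, x2
y = np.array([70_000, 90_000, 110_000, 130_000], dtype=float)

ols = LinearRegression().fit(X, y)
# Because x2 = x1 - 12 in this data, many (beta_1, beta_2) pairs fit equally
# well; the solver returns just one of them.
print("beta_1, beta_2:", ols.coef_)
print("intercept:", ols.intercept_)
```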

Why Coefficients Become Unstable

  1. Interchangeable Features:

    • The model might assign most of the importance to x₁ (Years of Education) and little to x₂ (Years of Postgraduate Education), or vice versa.
    • For example:
      • One model might estimate a large β₁ and a near-zero β₂.
      • Another model (trained on slightly different data) might estimate a near-zero β₁ and a large β₂.

    Both sets of coefficients could produce similar predictions for y, but the estimates themselves are unstable and unreliable.

  2. Small Changes in Data:

    • If you add or remove a data point, the coefficients might change drastically. For example:
      • Suppose you add a single new observation (one more person’s x₁, x₂, and income y).
      • The new model might now estimate very different values of β₁ and β₂.

    The coefficients have shifted significantly, even though the underlying relationship between the features and the target hasn’t changed much (see the sketch after this list).

  3. Interpretation Issues:

    • If β₁ and β₂ keep changing, you can’t confidently say how much each feature contributes to the target. For example:
      • Is it the total years of education (x₁) that matters more, or the postgraduate years (x₂)?
      • The model’s answer might change depending on the data, making it unreliable for drawing conclusions.
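
The sketch below illustrates the "small changes in data" effect on the toy dataset (scikit-learn assumed; the added person’s numbers are hypothetical, chosen only so that the new row breaks the exact x₂ = x₁ − 12 pattern in the table). It refits the model and prints the coefficients before and after.

```python
# Sketch: one extra (hypothetical) observation changes the split between
# beta_1 and beta_2 dramatically when the features are collinear.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[16, 4], [18, 6], [20, 8], [22, 10]], dtype=float)
y = np.array([70_000, 90_000, 110_000, 130_000], dtype=float)
before = LinearRegression().fit(X, y)

# Hypothetical new person: 16 years of education, 2 of them postgraduate.
X_new = np.vstack([X, [16, 2]])
y_new = np.append(y, 75_000)
after = LinearRegression().fit(X_new, y_new)

print("coefficients before:", before.coef_)
print("coefficients after: ", after.coef_)  # the split between the two features swings sharply
```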

Mathematical Intuition

In linear regression, the coefficients are estimated by solving the normal equations:

β̂ = (XᵀX)⁻¹Xᵀy

where X is the feature matrix. When features are highly correlated, the matrix XᵀX becomes nearly singular (its determinant is close to zero), making its inverse unstable. This instability propagates to the coefficient estimates β̂, causing them to vary widely.
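
You can check this intuition directly on the toy data with the short NumPy sketch below: the determinant of XᵀX is essentially zero and its condition number is enormous, which is exactly the near-singularity described above.

```python
# Sketch: X^T X for the toy data is (numerically) singular.
import numpy as np

# Columns: intercept, x1, x2 (from the table above).
X = np.array([[1, 16, 4],
              [1, 18, 6],
              [1, 20, 8],
              [1, 22, 10]], dtype=float)
XtX = X.T @ X

print("det(X^T X): ", np.linalg.det(XtX))   # essentially zero
print("cond(X^T X):", np.linalg.cond(XtX))  # extremely large (or inf)
```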

Solution

To address this issue, you could:

  1. Remove one of the correlated features (e.g., keep only x₁ and drop x₂).
  2. Combine the features (e.g., create a new feature like “Years of Undergraduate Education” = x₁ − x₂).
  3. Use regularization techniques like Ridge Regression to stabilize the coefficients (a short sketch follows below).
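
For instance, a ridge fit on the same toy data (a hedged sketch, scikit-learn assumed, with an arbitrarily chosen penalty strength) keeps both features, but its coefficients move far less than the OLS ones when the same hypothetical extra observation from the earlier sketch is added.

```python
# Hedged sketch: ridge regression (option 3) stabilizes the coefficients on
# the toy data; alpha=10.0 is an arbitrary illustrative choice.
import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[16, 4], [18, 6], [20, 8], [22, 10]], dtype=float)
y = np.array([70_000, 90_000, 110_000, 130_000], dtype=float)
ridge_before = Ridge(alpha=10.0).fit(X, y)

# Same hypothetical extra observation as in the earlier sketch.
X_new = np.vstack([X, [16, 2]])
y_new = np.append(y, 75_000)
ridge_after = Ridge(alpha=10.0).fit(X_new, y_new)

print("ridge coefficients before:", ridge_before.coef_)
print("ridge coefficients after: ", ridge_after.coef_)  # much smaller shift than plain OLS
```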