1.3 Problems
1. Multicollinearity
In linear regression, the correlation of features (also known as multicollinearity) can be problematic for several reasons:
- Unstable Coefficient Estimates: When features are highly correlated, the coefficients of the linear regression model become very sensitive to small changes in the data. The estimated coefficients can vary widely, making them unreliable and difficult to interpret.
- Reduced Interpretability: In the presence of multicollinearity, it becomes challenging to determine the individual effect of each feature on the target variable. Since the features are correlated, the model may assign importance to one feature over another arbitrarily, leading to misleading conclusions about the relationship between the features and the target.
- Increased Variance of Coefficient Estimates: Multicollinearity inflates the variance of the coefficient estimates, which can lead to overfitting. Overfitting occurs when the model captures noise in the training data rather than the underlying relationship, resulting in poor generalization to new data.
- Difficulty in Feature Selection: When features are correlated, it becomes difficult to identify which features are truly important for predicting the target variable. This complicates feature selection and model simplification.
- Numerical Instability: Highly correlated features can lead to numerical instability in the computation of the regression coefficients. This is particularly problematic for methods like ordinary least squares (OLS), which involve matrix inversion. If the feature matrix is nearly singular due to multicollinearity, the inversion becomes unstable, leading to large errors in the coefficient estimates (see the sketch after this list).
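As a quick illustration of the last point, here is a minimal sketch (synthetic data, not from the original text) showing how correlation between two features inflates the condition number of $X^\top X$, which is what makes the OLS solve numerically fragile:

```python
# Minimal sketch: correlated features blow up the condition number of X^T X.
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Two uncorrelated features
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X_uncorr = np.column_stack([x1, x2])

# Two highly correlated features: x2 is x1 plus a little noise
x2_corr = x1 + rng.normal(scale=0.01, size=n)
X_corr = np.column_stack([x1, x2_corr])

print("cond(X^T X), uncorrelated:", np.linalg.cond(X_uncorr.T @ X_uncorr))
print("cond(X^T X), correlated:  ", np.linalg.cond(X_corr.T @ X_corr))
# The correlated case yields a condition number several orders of magnitude
# larger, so small perturbations in the data translate into large swings in
# the estimated coefficients.
```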
1.1 Addressing Multicollinearity
To mitigate the issues caused by multicollinearity, you can consider the following approaches:
- Remove Correlated Features: Identify and remove one or more of the correlated features. This can be done by examining the correlation matrix or by using the Variance Inflation Factor (VIF) to detect multicollinearity (see the sketch after this list).
- Principal Component Analysis (PCA): Use PCA to transform the correlated features into a set of uncorrelated principal components, which can then be used as inputs to the linear regression model.
- Ridge Regression: Apply ridge regression, which adds an L2 penalty term to the loss function to shrink the coefficients. This reduces the impact of multicollinearity and stabilizes the coefficient estimates.
- Lasso Regression: Use lasso regression, which adds an L1 penalty term and has the additional benefit of performing feature selection by shrinking some coefficients exactly to zero.
- Elastic Net: Combine the benefits of ridge and lasso regression with elastic net, which includes both L1 and L2 penalty terms.
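The sketch below (synthetic data; assumes scikit-learn and statsmodels are available) illustrates two of these remedies: flagging multicollinearity with VIF and comparing how much OLS and ridge coefficients move under bootstrap resampling:

```python
# Minimal sketch: detect multicollinearity with VIF, then stabilize with ridge.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + 2 * x2 + rng.normal(scale=0.5, size=n)

# 1. Detect: VIF values well above ~10 signal serious multicollinearity.
for i in range(X.shape[1]):
    print(f"VIF for feature {i}: {variance_inflation_factor(X, i):.1f}")

# 2. Mitigate: spread of the coefficients over bootstrap resamples.
def coef_spread(model):
    coefs = []
    for _ in range(200):
        idx = rng.integers(0, n, size=n)
        coefs.append(model.fit(X[idx], y[idx]).coef_.copy())
    return np.std(coefs, axis=0)

print("OLS coefficient std over resamples:  ", coef_spread(LinearRegression()))
print("Ridge coefficient std over resamples:", coef_spread(Ridge(alpha=1.0)))
# The ridge penalty shrinks the coefficients, so their bootstrap standard
# deviations come out much smaller than those of plain OLS.
```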
1.2 Example
The following example illustrates how unstable coefficient estimates arise in linear regression when features are highly correlated.
Example Scenario:
Suppose you are trying to predict a person’s income (target variable) based on two features:
- Years of Education (feature $x_1$)
- Years of Postgraduate Education (feature $x_2$)
These two features are highly correlated because postgraduate education is a subset of total education. For example:
- If someone has 16 years of education, they might have 4 years of postgraduate education.
- If someone has 18 years of education, they might have 6 years of postgraduate education.
Data
Here’s a small dataset:
| Years of Education ($x_1$) | Years of Postgraduate Education ($x_2$) | Income ($y$) |
|---|---|---|
| 16 | 4 | 70,000 |
| 18 | 6 | 90,000 |
| 20 | 8 | 110,000 |
| 22 | 10 | 130,000 |
Problem
When you fit a linear regression model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon,$$

the model tries to estimate the coefficients $\beta_0$, $\beta_1$, and $\beta_2$.
Why Coefficients Become Unstable
- Interchangeable Features:
  - Because $x_1$ and $x_2$ carry almost the same information, the model might assign most of the importance to $x_1$ (Years of Education) and little to $x_2$ (Years of Postgraduate Education), or vice versa.
  - For example, one model might estimate a large $\beta_1$ and a near-zero $\beta_2$, while another model (trained on slightly different data) might estimate a near-zero $\beta_1$ and a large $\beta_2$.
  - Both sets of coefficients could produce similar predictions for $y$, but the estimates themselves are unstable and unreliable.
- Small Changes in Data:
  - If you add or remove a data point, the coefficients might change drastically. For example, adding a single new person whose education and income deviate only slightly from the pattern above can cause the model to estimate very different values of $\beta_1$ and $\beta_2$ (a numerical sketch follows this list).
  - The coefficients shift significantly even though the underlying relationship between the features and the target hasn't changed much.
- Interpretation Issues:
  - If $\beta_1$ and $\beta_2$ keep changing, you can't confidently say how much each feature contributes to the target. For example, is it the total years of education ($x_1$) that matters more, or the postgraduate years ($x_2$)?
  - The model's answer might change depending on the data, making it unreliable for drawing conclusions.
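To see this concretely, here is a minimal sketch using the four rows from the table above; the fifth data point is hypothetical (the original example does not give its values), chosen only to break the exact $x_2 = x_1 - 12$ pattern slightly:

```python
# Minimal sketch of the education example: the coefficient split changes
# drastically when one slightly off-pattern point is added.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[16, 4], [18, 6], [20, 8], [22, 10]], dtype=float)
y = np.array([70_000, 90_000, 110_000, 130_000], dtype=float)

model = LinearRegression().fit(X, y)
print("Original coefficients:", model.coef_)

# Add one (hypothetical) person whose postgraduate years deviate slightly
# from the x2 = x1 - 12 pattern.
X_new = np.vstack([X, [19, 6.5]])
y_new = np.append(y, 100_000)

model_new = LinearRegression().fit(X_new, y_new)
print("Coefficients after adding one point:", model_new.coef_)
# The two runs can attribute the income trend to the two features very
# differently, even though predictions on the original four rows barely change.
```

On the original four rows the two fits make nearly identical predictions, but they split the credit between $x_1$ and $x_2$ very differently, which is precisely the instability described above.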
Mathematical Intuition
In linear regression, the coefficients are estimated by solving the normal equations:

$$\hat{\beta} = (X^\top X)^{-1} X^\top y,$$

where $X$ is the feature matrix and $y$ is the vector of targets. When the features are highly correlated, the columns of $X$ are nearly linearly dependent, so $X^\top X$ is close to singular. Inverting a near-singular matrix amplifies small perturbations in the data into large changes in $\hat{\beta}$, which is exactly the instability seen above.
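For the example data this can be seen directly (a worked check, not part of the original text): every row satisfies $x_2 = x_1 - 12$, so the columns of the design matrix are linearly dependent and $X^\top X$ is exactly singular:

$$
X =
\begin{pmatrix}
1 & 16 & 4 \\
1 & 18 & 6 \\
1 & 20 & 8 \\
1 & 22 & 10
\end{pmatrix},
\qquad
\underbrace{\begin{pmatrix} 4 \\ 6 \\ 8 \\ 10 \end{pmatrix}}_{x_2}
=
\underbrace{\begin{pmatrix} 16 \\ 18 \\ 20 \\ 22 \end{pmatrix}}_{x_1}
- 12
\begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}
\;\Longrightarrow\;
\det(X^\top X) = 0 .
$$

Consequently the normal equations have infinitely many solutions: any pair with $\beta_1 + \beta_2 = 10{,}000$ (together with a matching intercept) fits the four points exactly, which is why the estimated split between the two coefficients is arbitrary.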
Solution
To address this issue, you could:
- Remove one of the correlated features (e.g., keep only $x_1$ and drop $x_2$).
- Combine the features (e.g., create a new feature like "Years of Undergraduate Education" $= x_1 - x_2$).
- Use regularization techniques like Ridge Regression to stabilize the coefficients (a minimal sketch of the first fix follows).