
1.1 Fundamentals

1. Definition

1.1 What is a Regression Problem?

A regression problem is a type of supervised machine learning problem where the goal is to predict a continuous numerical value (output/target variable) based on one or more input features. In other words, regression models aim to model the relationship between input variables and a continuous output variable.

Examples of Regression Problems:
  • Predicting house prices based on features like size, location, and number of bedrooms.
  • Estimating the temperature based on weather data like humidity, wind speed, and pressure.
  • Forecasting stock prices based on historical data.
Key Characteristics:
  • The target variable is continuous (e.g., real numbers like price, temperature, or income).
  • The goal is to find a relationship between the input features and the target variable.

1.2 What is Linear Regression?

  • Linear regression is one of the simplest and most commonly used techniques for solving regression problems.
  • Model hypothesis: it assumes a linear relationship between the input $x$ and the output $y$, and fits a linear equation to the observed data.
  • Simple linear regression: one feature, one target.
  • Multiple linear regression: multiple features, one target.
  • Multivariate linear regression: multiple features, multiple targets.

1.3 What is a Linear Model?

A linear model is a broader term that refers to any model that assumes a linear relationship between the input features and the target variable.

The model tries to fit a straight line (in 2D) or a hyperplane (in higher dimensions) that best predicts the target variable based on the input features.

Equation of Linear Regression

For a single feature $x$, the equation of a linear regression model is:

$$y = \beta_0 + \beta_1 x + \epsilon$$

  • $y$: Target variable (output).
  • $x$: Input feature.
  • $\beta_0$: Intercept (value of $y$ when $x = 0$).
  • $\beta_1$: Slope (change in $y$ for a unit change in $x$).
  • $\epsilon$: Error term (accounts for noise or unexplained variability).

For multiple features $x_1, x_2, \dots, x_n$, the equation becomes:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon$$

Goal of Linear Regression

The goal is to find the coefficients ($\beta_0, \beta_1, \dots, \beta_n$) that minimize the difference between the predicted values ($\hat{y}$) and the actual values ($y$). This is typically done using a method like Ordinary Least Squares (OLS), which minimizes the sum of squared errors.
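To make the OLS objective concrete, here is a minimal numpy sketch that recovers the coefficients from synthetic data via the normal equation (the data-generating values 2 and 3 are made up for illustration):

```python
import numpy as np

# Toy data: y = 2 + 3*x plus noise (made up for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=100)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# OLS via the normal equation: beta = (X^T X)^{-1} X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)
print("intercept, slope:", beta)          # should be close to (2, 3)

# Sum of squared errors that OLS minimizes
sse = np.sum((y - X @ beta) ** 2)
print("SSE:", sse)
```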

Linear regression is a specific type of linear model, but there are other linear models as well, such as:

  • Ridge Regression: Adds L2 regularization to linear regression to prevent overfitting.
  • Lasso Regression: Adds L1 regularization to linear regression, which can also perform feature selection.
  • Elastic Net: Combines L1 and L2 regularization.

Example of a Linear Model

Suppose you want to predict a person’s salary based on their years of experience and education level. A linear model might look like this:

$$\text{Salary} = \beta_0 + \beta_1 \cdot \text{YearsExperience} + \beta_2 \cdot \text{EducationLevel}$$

Here:

  • $\beta_0$ is the base salary (intercept).
  • $\beta_1$ is the increase in salary for each additional year of experience.
  • $\beta_2$ is the increase in salary for each additional level of education.

2. Basics of Gradient Descent

2.1 Gradient

  • The gradient points in the direction in which the function value increases most steeply; to reduce the loss, we therefore adjust each parameter in the direction opposite to its gradient.

2.2 Gradient descent

  • Gradient descent is an optimization algorithm used to minimize the cost function in machine learning.
  • The essence of gradient descent is to iteratively move towards the minimum of the cost function by updating the model’s parameters in the opposite direction of the gradient (the slope) of the cost function.

2.3 How does it work

  • Step 1: Initialization. Initialize the parameters randomly or based on a heuristic.
  • Step 2: Compute the gradient. Compute the gradient of the loss function with respect to each parameter, either manually or via automatic differentiation.
  • Step 3: Update the parameters. The step size is controlled by the learning rate (the magnitude of the gradient matters as well; vanishing and exploding gradients are discussed later in the unit).
  • Step 4: Repeat steps 2 and 3 until convergence (a minimal sketch of this loop follows below).
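A minimal numpy sketch of these four steps for simple linear regression with a mean squared error loss (the synthetic data, learning rate, and iteration count are illustrative assumptions):

```python
import numpy as np

# Synthetic data for y = 2 + 3*x (illustrative)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(0, 1, size=100)

# Step 1: initialization
w, b = 0.0, 0.0
lr = 0.01            # learning rate controls the step size
n = len(y)

for epoch in range(1000):            # Step 4: repeat
    y_pred = w * X[:, 0] + b
    error = y_pred - y
    # Step 2: gradients of MSE = (1/n) * sum(error^2) w.r.t. w and b
    grad_w = (2.0 / n) * np.dot(error, X[:, 0])
    grad_b = (2.0 / n) * error.sum()
    # Step 3: move against the gradient
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w = {w:.2f}, b = {b:.2f}")   # should approach 3 and 2
```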

2.4 Finding the Optimum

  • The scale of the learning rate matters: too large a learning rate makes the updates overshoot or diverge; too small a rate makes convergence slow.
  • Global or local minima: For convex problems, like linear regression, gradient descent is guaranteed to converge to the global minimum. For non-convex problems, like training deep neural networks, gradient descent may converge to a local minimum. In practice, however, even local minima can provide useful solutions, especially in deep neural networks, as gradient descent tends to converge to flat minima rather than sharp minima in high-dimensional parameter space.

3. Feature Normalization

Feature normalization or standardization is an important preprocessing step in linear regression (and many other machine learning algorithms) because it ensures that all input features are on a similar scale. This has several benefits, particularly when using optimization algorithms like gradient descent to train the model. Let’s break this down in detail:

3.1 What is Feature Normalization/Standardization?

  1. Normalization (Min-Max Scaling):

    • Rescales features to a range of $[0, 1]$ or $[-1, 1]$.
    • Formula: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$
  2. Standardization (Z-score Scaling):

    • Rescales features to have a mean of 0 and a standard deviation of 1.
    • Formula: $x' = \frac{x - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation. (A minimal code sketch of both transforms follows this list.)
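The sketch below applies both formulas directly with numpy and checks them against the scikit-learn scalers; the toy feature matrix is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])        # toy feature matrix (illustrative)

# Min-max normalization: x' = (x - x_min) / (x_max - x_min)
x_min, x_max = X.min(axis=0), X.max(axis=0)
X_norm = (X - x_min) / (x_max - x_min)

# Standardization: x' = (x - mu) / sigma
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / sigma

# The same transforms via scikit-learn
assert np.allclose(X_norm, MinMaxScaler().fit_transform(X))
assert np.allclose(X_std, StandardScaler().fit_transform(X))
```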

3.2 Why is Normalization/Standardization Important in Linear Regression?

  1. Improves Convergence of Gradient Descent:

    • Gradient descent performs better when features are on a similar scale. If features have vastly different scales (e.g., one feature ranges from 0 to 1 and another from 0 to 1000), the loss function becomes elongated and asymmetrical. This causes gradient descent to take longer to converge because it oscillates inefficiently toward the minimum.
    • Normalized/standardized features create a more symmetrical and well-conditioned loss surface, allowing gradient descent to converge faster.
  2. Prevents Dominance of Large-Scale Features:

    • Features with larger scales can dominate the model because their coefficients will have a larger impact on the predictions. This can lead to biased results where the model over-weights features simply because of their scale, rather than their actual importance.
    • Normalization/standardization ensures that all features contribute equally to the model.
  3. Improves Numerical Stability:

    • Algorithms like gradient descent involve computations like matrix multiplications and inversions. If features are on vastly different scales, these computations can become numerically unstable, leading to errors or slow convergence.
  4. Helps Regularization Work Effectively:

    • Regularization techniques like Ridge or Lasso regression penalize large coefficients. If features are not scaled, the regularization penalty may disproportionately affect features with larger scales, leading to suboptimal results.

3.3 How Does Normalization/Standardization Affect Gradient Descent?

  1. Smoother Loss Surface:

    • When features are on similar scales, the loss function becomes more spherical (symmetric) rather than elongated. This allows gradient descent to take more direct steps toward the minimum, improving convergence speed.
  2. Consistent Step Sizes:

    • Without normalization, features with larger scales will have larger gradients, causing gradient descent to take larger steps in those directions. This can lead to oscillations and slow convergence.
    • With normalization, gradients are more balanced, and gradient descent can take consistent step sizes in all directions.
  3. Better Conditioning of the Loss Surface:

    • Unscaled features can create a loss surface with sharp curvature in some directions and very shallow curvature in others (poor conditioning), which slows gradient descent down. For linear regression the loss is convex, so there are no spurious local minima, but normalization still produces a smoother, better-conditioned loss surface and faster, more reliable convergence.

3.4 Example: Why Normalization Matters

Suppose we have two features:

  • $x_1$: Age (ranges from 0 to 100)
  • $x_2$: Income (ranges from 0 to 1,000,000)

The linear regression model is:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$$

Problem Without Normalization:

  • The gradient for $\beta_2$ (Income) will be much larger than the gradient for $\beta_1$ (Age) because $x_2$ has a much larger scale.
  • This causes gradient descent to take very large steps in the direction of $\beta_2$ and very small steps in the direction of $\beta_1$, leading to inefficient convergence.

Solution With Normalization:

  • Normalize $x_1$ and $x_2$ to the same scale (e.g., $[0, 1]$ or mean 0, standard deviation 1).
  • Now the gradients for $\beta_1$ and $\beta_2$ will be on a similar scale, allowing gradient descent to converge more efficiently.

3.5 When is Normalization/Standardization Not Necessary?

  • If all features are already on a similar scale (e.g., pixel values in images, which range from 0 to 255).
  • If you are using algorithms that are not sensitive to feature scales, such as decision trees or random forests.

4. Life Cycle of ML Model Development

The life cycle of machine learning (ML) model development is a structured process that guides the creation, deployment, and maintenance of ML models. It involves several stages, from understanding the problem to deploying the model and monitoring its performance in production. Below is a detailed breakdown of the typical life cycle:

1. Problem Definition

  • Objective: Clearly define the problem you want to solve and the goals of the ML model.
  • Key Activities:
    • Understand the business problem or use case.
    • Define the target variable (e.g., what you want to predict or classify).
    • Identify the success metrics (e.g., accuracy, precision, recall, RMSE).
    • Determine the constraints (e.g., latency, interpretability, scalability).
  • Output: A well-defined problem statement and project plan.

2. Data Collection

  • Objective: Gather the data required to train and evaluate the ML model.
  • Key Activities:
    • Identify data sources (e.g., databases, APIs, sensors, third-party data).
    • Collect raw data (structured, unstructured, or semi-structured).
    • Ensure data quality (e.g., handle missing values, remove duplicates).
  • Output: A dataset ready for preprocessing.

3. Data Preprocessing

  • Objective: Clean and prepare the data for modeling.
  • Key Activities:
    • Handle missing values (e.g., imputation or removal).
    • Remove outliers or noisy data.
    • Normalize/standardize features (if necessary).
    • Encode categorical variables (e.g., one-hot encoding, label encoding).
    • Split the data into training, validation, and test sets (a minimal sketch of this step follows this list).
  • Output: A clean, preprocessed dataset.
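A minimal sketch of the splitting and scaling activities above, assuming a generic feature matrix X and target y; the 60/20/20 split ratio and random seed are illustrative choices, not prescribed here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data (replace with the real preprocessed dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)

# 60/20/20 train/validation/test split (ratios are an assumption)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

# Fit the scaler on the training set only, then apply it everywhere
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = (scaler.transform(s) for s in (X_train, X_val, X_test))
```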

4. Exploratory Data Analysis (EDA)

  • Objective: Understand the data and uncover patterns, relationships, and insights.
  • Key Activities:
    • Perform statistical analysis (e.g., mean, median, standard deviation).
    • Visualize data distributions (e.g., histograms, box plots).
    • Identify correlations between features and the target variable.
    • Detect multicollinearity or feature redundancy.
  • Output: Insights that guide feature engineering and model selection.

5. Feature Engineering

  • Objective: Create meaningful features that improve model performance.
  • Key Activities:
    • Create new features from existing data (e.g., ratios, aggregates).
    • Perform feature selection to remove irrelevant or redundant features.
    • Transform features (e.g., log transformation, polynomial features).
    • Use domain knowledge to engineer relevant features.
  • Output: A refined set of features for modeling.

6. Model Selection

  • Objective: Choose the best algorithm(s) for the problem.
  • Key Activities:
    • Select candidate models based on the problem type (e.g., regression, classification, clustering).
    • Experiment with simple models (e.g., linear regression, decision trees) and more complex models (e.g., random forests, neural networks).
    • Use cross-validation to evaluate model performance.
  • Output: A shortlist of promising models.

7. Model Training

  • Objective: Train the selected models on the training data.
  • Key Activities:
    • Split the data into training and validation sets.
    • Train the models using the training set.
    • Tune hyperparameters using techniques like grid search or random search (see the sketch after this list).
    • Use techniques like k-fold cross-validation to ensure robustness.
  • Output: Trained models with optimized hyperparameters.
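A minimal sketch of grid search with k-fold cross-validation, here tuning the regularization strength of Ridge regression; the parameter grid and synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Illustrative data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 2.0, 0.0, 0.5, -1.0]) + rng.normal(0, 0.5, size=200)

param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}   # assumed grid
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# 5-fold cross-validation over the grid, scored by (negative) MSE
search = GridSearchCV(Ridge(), param_grid, cv=cv, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```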

8. Model Evaluation

  • Objective: Assess the performance of the trained models.
  • Key Activities:
    • Evaluate models on the validation/test set using appropriate metrics (e.g., accuracy, F1-score, RMSE, AUC-ROC).
    • Compare model performance and select the best one.
    • Perform error analysis to understand where the model fails.
  • Output: A final model with documented performance metrics.

9. Model Deployment

  • Objective: Deploy the model into a production environment.
  • Key Activities:
    • Package the model (e.g., using Docker, Flask, or FastAPI).
    • Integrate the model into the existing system (e.g., APIs, web apps).
    • Set up monitoring for model performance and data drift.
    • Ensure scalability and reliability.
  • Output: A deployed model that is accessible to end-users.

10. Model Monitoring and Maintenance

  • Objective: Ensure the model continues to perform well in production.
  • Key Activities:
    • Monitor model performance (e.g., accuracy, latency).
    • Detect data drift or concept drift (changes in data distribution or relationships).
    • Retrain the model periodically with new data.
    • Update the model as needed to maintain performance.
  • Output: A maintained and up-to-date model.

11. Model Retraining

  • Objective: Keep the model relevant as new data becomes available.
  • Key Activities:
    • Collect new data and preprocess it.
    • Retrain the model using the updated dataset.
    • Evaluate the retrained model and compare it to the previous version.
    • Deploy the updated model if performance improves.
  • Output: A retrained and improved model.

12. Documentation and Reporting

  • Objective: Document the entire process and results for future reference.
  • Key Activities:
    • Document the problem statement, data sources, and preprocessing steps.
    • Record model selection, training, and evaluation details.
    • Report deployment and monitoring processes.
    • Share insights and lessons learned.
  • Output: Comprehensive documentation for stakeholders and future projects.

5. Common Issues

Implementing linear regression on real-world datasets often comes with several challenges due to the complexity and imperfections of real-world data. Below are some common issues and their solutions:

1. Multicollinearity

  • Issue: When two or more features are highly correlated, it becomes difficult to determine the individual effect of each feature on the target variable. This can lead to unstable and unreliable coefficient estimates.
  • Solution:
    • Identify correlated features using a correlation matrix.
    • Use the Variance Inflation Factor (VIF) to detect multicollinearity; a VIF > 5 or 10 indicates high multicollinearity (a VIF sketch follows this list).
    • Remove one of the correlated features or combine them into a single feature.
    • Use regularization techniques like Ridge Regression or Lasso Regression to penalize large coefficients.
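A minimal sketch of a VIF check with statsmodels; the column names and data are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical features; x2 is almost a copy of x1, so it should show a high VIF
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(0, 0.05, size=200),   # highly correlated with x1
    "x3": rng.normal(size=200),
})

X = sm.add_constant(df)   # VIF is usually computed with an intercept column
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)   # VIF > 5-10 for x1/x2 signals multicollinearity
```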

2. Non-Linearity

  • Issue: Linear regression assumes a linear relationship between the features and the target variable. If the relationship is non-linear, the model will perform poorly.
  • Solution:
    • Transform features using polynomial or logarithmic transformations.
    • Use non-linear models like decision trees, random forests, or neural networks if the relationship is highly non-linear.

3. Outliers

  • Issue: Outliers can disproportionately influence the model, leading to biased coefficient estimates and poor performance.
  • Solution:
    • Detect outliers using visualization techniques (e.g., box plots, scatter plots) or statistical methods (e.g., Z-scores, IQR).
    • Remove or cap outliers if they are due to data entry errors.
    • Use robust regression techniques like RANSAC (Random Sample Consensus) that are less sensitive to outliers (see the sketch after this list).
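A minimal sketch comparing ordinary least squares with scikit-learn's RANSACRegressor on synthetic data containing a few injected outliers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

# Synthetic line y = 1 + 2x with a handful of gross outliers
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(0, 0.5, size=100)
y[:5] += 50                       # inject outliers

ols = LinearRegression().fit(X, y)
ransac = RANSACRegressor(random_state=0).fit(X, y)

print("OLS slope:   ", ols.coef_[0])               # pulled toward the outliers
print("RANSAC slope:", ransac.estimator_.coef_[0])  # close to 2
print("inliers kept:", ransac.inlier_mask_.sum(), "of", len(y))
```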

4. Missing Data

  • Issue: Missing data can reduce the amount of usable data and introduce bias if not handled properly.
  • Solution:
    • Remove rows with missing data if the dataset is large enough.
    • Impute missing values using techniques like mean/median imputation, k-nearest neighbors (KNN) imputation, or advanced methods like MICE (Multiple Imputation by Chained Equations); see the sketch after this list.
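A minimal sketch of median and KNN imputation with scikit-learn on a toy array with missing entries.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0,    2.0],
              [np.nan, 3.0],
              [4.0,    np.nan],
              [5.0,    6.0]])      # toy data with missing values

# Fill each missing entry with the column median
X_median = SimpleImputer(strategy="median").fit_transform(X)

# Fill each missing entry from the 2 nearest complete rows
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(X_median)
print(X_knn)
```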

5. Heteroscedasticity

  • Issue: Heteroscedasticity occurs when the variance of the residuals is not constant across all levels of the predicted values. This violates the assumption of homoscedasticity in linear regression.
  • Solution:
    • Transform the target variable (e.g., log transformation).
    • Use weighted least squares (WLS) regression to account for varying variance.

6. Feature Scaling

  • Issue: Features with different scales can cause gradient descent to converge slowly and make the model sensitive to the scale of the features.
  • Solution:
    • Normalize or standardize features to bring them to a similar scale.
    • Use techniques like Min-Max Scaling or Standardization (mean = 0, standard deviation = 1).

7. Overfitting

  • Issue: Overfitting occurs when the model learns the noise in the training data, leading to poor generalization on unseen data.
  • Solution:
    • Use regularization techniques like Ridge Regression (L2 penalty) or Lasso Regression (L1 penalty) to penalize large coefficients.
    • Perform cross-validation to evaluate the model’s performance on unseen data.
    • Simplify the model by reducing the number of features (feature selection).

8. Underfitting

  • Issue: Underfitting occurs when the model is too simple to capture the underlying patterns in the data, leading to poor performance on both training and test data.
  • Solution:
    • Add more relevant features or create new features through feature engineering.
    • Use a more complex model (e.g., polynomial regression) if the relationship is non-linear.

9. Categorical Variables

  • Issue: Linear regression cannot directly handle categorical variables. They need to be encoded into numerical values.
  • Solution:
    • Use one-hot encoding or label encoding to convert categorical variables into numerical format.
    • Be cautious with one-hot encoding to avoid the dummy variable trap (drop one category to avoid perfect multicollinearity); see the sketch after this list.
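A minimal sketch with pandas; drop_first=True drops one dummy per variable to avoid the dummy variable trap (the column names are hypothetical).

```python
import pandas as pd

df = pd.DataFrame({
    "size": [70, 85, 120],
    "city": ["Sydney", "Melbourne", "Sydney"],   # hypothetical categorical feature
})

# One-hot encode 'city'; drop_first=True removes one dummy column so the
# remaining dummies are not perfectly collinear with the intercept.
encoded = pd.get_dummies(df, columns=["city"], drop_first=True)
print(encoded)
```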

10. Autocorrelation

  • Issue: In time-series data, residuals may be correlated with each other (autocorrelation), violating the assumption of independence in linear regression.
  • Solution:
    • Use the Durbin-Watson test to detect autocorrelation (see the sketch after this list).
    • Consider using time-series models like ARIMA or include lagged variables in the regression model.
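A minimal sketch of the Durbin-Watson test on regression residuals with statsmodels, using synthetic data with AR(1) noise; values near 2 suggest little autocorrelation, values well below 2 suggest positive autocorrelation.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Synthetic regression with autocorrelated (AR(1)) noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
noise = np.zeros(200)
for t in range(1, 200):
    noise[t] = 0.8 * noise[t - 1] + rng.normal(0, 1)
y = 1.0 + 2.0 * x + noise

model = sm.OLS(y, sm.add_constant(x)).fit()
print("Durbin-Watson:", durbin_watson(model.resid))   # well below 2 here
```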

11. High-Dimensional Data

  • Issue: When the number of features is large compared to the number of observations, the model may overfit or become computationally expensive.
  • Solution:
    • Perform dimensionality reduction using techniques like PCA (Principal Component Analysis) or feature selection methods.
    • Use regularization techniques like Lasso Regression to shrink less important features to zero.

12. Interpretability

  • Issue: Linear regression models are generally interpretable, but adding too many features or transformations can make the model harder to interpret.
  • Solution:
    • Limit the number of features and avoid overly complex transformations.
    • Use tools like SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) to explain model predictions.

13. Data Leakage

  • Issue: Data leakage occurs when information from the test set is inadvertently used during training, leading to overly optimistic performance estimates.
  • Solution:
    • Ensure proper separation of training and test data.
    • Fit preprocessing steps (e.g., scaling, imputation) on the training set only, then apply the fitted transforms to the test set; see the pipeline sketch after this list.
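A minimal sketch of leakage-safe preprocessing: wrapping the scaler and the model in a scikit-learn Pipeline ensures the scaler is fit only on the training folds inside cross-validation (the data and model choice are illustrative).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(0, 0.3, size=300)

# The scaler is re-fit inside each training fold, never on held-out data
pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge(alpha=1.0))])
scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_mean_squared_error")
print(scores.mean())
```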

14. Non-Normal Distribution of Residuals

  • Issue: Linear regression assumes that residuals are normally distributed. If this assumption is violated, the model’s predictions may be unreliable.
  • Solution:
    • Transform the target variable (e.g., log transformation).
    • Use non-linear models if the relationship is inherently non-linear.

15. Computational Complexity

  • Issue: For very large datasets, training a linear regression model can become computationally expensive.
  • Solution:
    • Use stochastic gradient descent (SGD) or mini-batch gradient descent for optimization (see the sketch after this list).
    • Leverage distributed computing frameworks like Apache Spark for large-scale data.
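A minimal sketch using scikit-learn's SGDRegressor, which fits a linear model with stochastic gradient descent; the dataset size and hyperparameters are illustrative, and scaling first matters for SGD.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 10))            # "large" synthetic dataset
y = X @ rng.normal(size=10) + rng.normal(0, 0.1, size=100_000)

# SGD updates the coefficients from one sample at a time,
# so memory use stays low even for very large datasets.
model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-3))
model.fit(X, y)
print(model.score(X, y))
```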

Summary Table

| Issue | Solution |
| --- | --- |
| Multicollinearity | Remove correlated features, use regularization (Ridge/Lasso). |
| Non-Linearity | Transform features, use non-linear models. |
| Outliers | Detect and remove/cap outliers, use robust regression. |
| Missing Data | Impute missing values or remove rows. |
| Heteroscedasticity | Transform target variable, use weighted least squares. |
| Feature Scaling | Normalize/standardize features. |
| Overfitting | Use regularization, cross-validation, or feature selection. |
| Underfitting | Add more features, use a more complex model. |
| Categorical Variables | Use one-hot encoding or label encoding. |
| Autocorrelation | Use time-series models or include lagged variables. |
| High-Dimensional Data | Perform dimensionality reduction or feature selection. |
| Interpretability | Limit features, use explainability tools (SHAP, LIME). |
| Data Leakage | Ensure proper separation of training and test data. |
| Non-Normal Residuals | Transform target variable, use non-linear models. |
| Computational Complexity | Use SGD/mini-batch gradient descent, leverage distributed computing. |

6. Simple Linear Regression or Complex Model ?

Deciding between a simple linear model (e.g., linear regression) and a more complex model (e.g., random forests, neural networks) is a critical step in machine learning. The choice depends on several factors, including the nature of the problem, the data, and the constraints of the project. Below are the key factors to consider and their implications:

1. Problem Complexity

  • Simple Linear Model:
    • Use when the relationship between the features and the target variable is linear or approximately linear.
    • Suitable for problems where interpretability is more important than predictive power.
  • Complex Model:
    • Use when the relationship is non-linear or involves complex interactions between features.
    • Suitable for problems where predictive accuracy is prioritized over interpretability.

Why: A simple linear model will underfit if the true relationship is non-linear, while a complex model may overfit if the relationship is simple.

2. Dataset Size

  • Simple Linear Model:
    • Works well with small datasets because it has fewer parameters to estimate and is less prone to overfitting.
  • Complex Model:
    • Requires large datasets to generalize well. Complex models have many parameters and can overfit small datasets.

Why: Complex models need sufficient data to learn the underlying patterns without memorizing noise.

3. Interpretability

  • Simple Linear Model:
    • Highly interpretable. The coefficients directly indicate the relationship between features and the target.
    • Preferred in domains like healthcare, finance, or policy-making, where understanding the model’s decisions is crucial.
  • Complex Model:
    • Less interpretable. For example, neural networks are often considered “black boxes.”
    • Use when interpretability is not a priority, and the focus is on maximizing predictive performance.

Why: Interpretability is critical in regulated industries or when the model’s decisions need to be explained to stakeholders.

4. Computational Resources

  • Simple Linear Model:
    • Requires minimal computational resources for training and inference.
    • Suitable for environments with limited computational power (e.g., edge devices).
  • Complex Model:
    • Requires significant computational resources (e.g., GPUs/TPUs) and time for training, especially for large datasets.
    • Suitable for environments with access to high-performance computing.

Why: Resource constraints can limit the feasibility of using complex models.

5. Training Time

  • Simple Linear Model:
    • Trains quickly, even on large datasets.
  • Complex Model:
    • Training can be time-consuming, especially for deep learning models or large datasets.

Why: If the project has tight deadlines, a simple model may be preferred.

6. Performance Requirements

  • Simple Linear Model:
    • May not achieve state-of-the-art performance on complex tasks (e.g., image recognition, natural language processing).
  • Complex Model:
    • Can achieve higher accuracy and better performance on complex tasks.

Why: If the problem requires high predictive accuracy, a complex model may be necessary.

7. Overfitting Risk

  • Simple Linear Model:
    • Less prone to overfitting, especially with small datasets.
  • Complex Model:
    • More prone to overfitting, particularly if the dataset is small or noisy.

Why: Overfitting can lead to poor generalization on unseen data.

8. Feature Engineering

  • Simple Linear Model:
    • Requires careful feature engineering to capture non-linear relationships (e.g., polynomial features, interaction terms).
  • Complex Model:
    • Can automatically learn complex patterns and interactions from raw data, reducing the need for manual feature engineering.

Why: Complex models can save time and effort in feature engineering but may require more data.

9. Domain Knowledge

  • Simple Linear Model:
    • Works well when domain knowledge can guide feature selection and engineering.
  • Complex Model:
    • Useful when domain knowledge is limited, and the model needs to learn patterns directly from data.

Why: Domain knowledge can help simplify the problem and reduce the need for complex models.

10. Maintenance and Scalability

  • Simple Linear Model:
    • Easier to maintain, debug, and update.
    • Scales well with increasing data size.
  • Complex Model:
    • Requires more effort to maintain, debug, and update.
    • May face scalability issues if not properly optimized.

Why: Maintenance and scalability are critical for long-term deployment.

Decision Framework

| Factor | Simple Linear Model | Complex Model |
| --- | --- | --- |
| Problem Complexity | Linear relationships | Non-linear or complex relationships |
| Dataset Size | Small to medium | Large |
| Interpretability | High | Low |
| Computational Resources | Limited | Abundant |
| Training Time | Fast | Slow |
| Performance Requirements | Moderate accuracy required | High accuracy required |
| Overfitting Risk | Low | High |
| Feature Engineering | Requires manual effort | Automatically learns features |
| Domain Knowledge | Available and useful | Limited or unavailable |
| Maintenance and Scalability | Easy to maintain and scale | Requires more effort to maintain and scale |

Practical Example

Scenario 1: Predicting House Prices

  • Dataset: 10,000 rows with features like size, location, and number of bedrooms.
  • Considerations:
    • The relationship between features and price is likely non-linear.
    • Interpretability is important for stakeholders.
  • Decision:
    • Start with a simple linear model for interpretability.
    • If performance is insufficient, try a moderately complex model like random forests or gradient boosting.

Scenario 2: Image Classification

  • Dataset: 100,000 images of cats and dogs.
  • Considerations:
    • The relationship between pixel values and the target is highly non-linear.
    • Interpretability is less critical than accuracy.
  • Decision:
    • Use a complex model like a convolutional neural network (CNN).

7. Regularization

Regularization plays a crucial role in linear regression models by addressing issues like overfitting and multicollinearity. It introduces a penalty term to the loss function to constrain the magnitude of the coefficients, leading to simpler and more generalizable models. The two most common types of regularization are L1 (Lasso) and L2 (Ridge) regularization. Let’s explore their roles, differences, and applications in detail.

1. What is Regularization?

Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function. The penalty term discourages the model from assigning excessively large values to the coefficients, which can lead to overfitting.

Loss Function Without Regularization:

$$J(\beta) = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2$$

where:

  • $J(\beta)$: Loss function (here, Mean Squared Error).
  • $\beta$: Model coefficients.
  • $m$: Number of training examples.
  • $y_i$: Actual value.
  • $\hat{y}_i$: Predicted value.

Loss Function With Regularization:

$$J_{\text{reg}}(\beta) = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2 + \lambda \cdot R(\beta)$$

where:

  • $\lambda$: Regularization parameter (controls the strength of regularization).
  • $R(\beta)$: Regularization term (depends on the type of regularization).

2. L2 Regularization (Ridge Regression)

L2 regularization adds a penalty equal to the sum of the squared magnitudes of the coefficients.

Regularization Term:

$$R(\beta) = \sum_{j=1}^{n} \beta_j^2$$

where $\beta_j$ are the coefficients.

Loss Function for Ridge Regression:

$$J_{\text{ridge}}(\beta) = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{n} \beta_j^2$$

Key Characteristics:

  • Shrinks Coefficients: Reduces the magnitude of coefficients but does not set them to zero.
  • Handles Multicollinearity: Works well when features are correlated by distributing the effect among correlated features.
  • Improves Generalization: Prevents overfitting by discouraging large coefficients.

When to Use:

  • When you want to retain all features but reduce their impact.
  • When multicollinearity is present in the data.

3. L1 Regularization (Lasso Regression)

L1 regularization adds a penalty equal to the sum of the absolute values of the coefficients.

Regularization Term:

$$R(\beta) = \sum_{j=1}^{n} |\beta_j|$$

Loss Function for Lasso Regression:

$$J_{\text{lasso}}(\beta) = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{n} |\beta_j|$$

Key Characteristics:

  • Sparsity: Can shrink some coefficients to exactly zero, effectively performing feature selection.
  • Feature Selection: Useful when you suspect that only a subset of features is relevant.
  • Handles Multicollinearity: Like Ridge, but tends to select one feature from a group of correlated features.

When to Use:

  • When you want to perform feature selection and reduce the number of features.
  • When you suspect that many features are irrelevant.

4. Key Differences Between L1 and L2 Regularization

| Aspect | L1 (Lasso) | L2 (Ridge) |
| --- | --- | --- |
| Penalty Term | Sum of absolute values of coefficients. | Sum of squared values of coefficients. |
| Effect on Coefficients | Can shrink coefficients to zero. | Shrinks coefficients but does not set to zero. |
| Feature Selection | Yes (sparse solutions). | No (retains all features). |
| Handling Multicollinearity | Selects one feature from correlated group. | Distributes effect among correlated features. |
| Use Case | Feature selection, high-dimensional data. | Generalization, multicollinearity. |
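A minimal sketch contrasting the two penalties on synthetic data where only two of ten features matter: Lasso drives the irrelevant coefficients exactly to zero, while Ridge only shrinks them (the alpha values are illustrative).

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_coef = np.zeros(10)
true_coef[:2] = [3.0, -2.0]                 # only the first two features matter
y = X @ true_coef + rng.normal(0, 0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge coefs:", np.round(ridge.coef_, 2))   # small but nonzero everywhere
print("Lasso coefs:", np.round(lasso.coef_, 2))   # exact zeros on irrelevant features
```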

5. Choosing the Regularization Parameter ($\lambda$)

The regularization parameter $\lambda$ controls the strength of regularization:

  • Small $\lambda$: Weak regularization (the model behaves like ordinary linear regression).
  • Large $\lambda$: Strong regularization (coefficients are heavily penalized).

How to Choose $\lambda$:

  • Use cross-validation to find the value of $\lambda$ that minimizes the validation error.
  • Techniques like grid search or random search can be used to tune $\lambda$ (see the sketch after this list).
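A minimal sketch using scikit-learn's cross-validated estimators (RidgeCV, LassoCV), where the regularization parameter is called alpha; the candidate grid and data are assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5, 0.0, 0.0, 1.5]) + rng.normal(0, 0.3, size=200)

alphas = np.logspace(-3, 3, 20)               # assumed candidate grid
ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)
lasso = LassoCV(alphas=alphas, cv=5).fit(X, y)

print("best Ridge alpha:", ridge.alpha_)
print("best Lasso alpha:", lasso.alpha_)
```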

6. Elastic Net: Combining L1 and L2 Regularization

Elastic Net is a hybrid approach that combines L1 and L2 regularization. It is useful when:

  • There are many correlated features.
  • You want to balance feature selection (L1) and coefficient shrinkage (L2).

Regularization Term:

$$R(\beta) = \alpha \sum_{j=1}^{n} |\beta_j| + (1 - \alpha) \sum_{j=1}^{n} \beta_j^2$$

where $\alpha$ controls the mix between the L1 and L2 penalties.

Loss Function for Elastic Net:

$$J_{\text{EN}}(\beta) = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2 + \lambda \left[ \alpha \sum_{j=1}^{n} |\beta_j| + (1 - \alpha) \sum_{j=1}^{n} \beta_j^2 \right]$$
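
A minimal scikit-learn sketch; in ElasticNet, l1_ratio plays the role of the mixing parameter $\alpha$ above and alpha plays the role of $\lambda$ (the particular values and data are illustrative).

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = X @ np.array([2.0, 2.0, 0.0, 0.0, -1.0, 0.0]) + rng.normal(0, 0.3, size=200)

# alpha ~ overall regularization strength, l1_ratio ~ mix between L1 and L2
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.round(model.coef_, 2))
```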

7. Practical Example

Scenario: Predicting House Prices

  • Features: Size, number of bedrooms, location, age, etc.
  • Issue: Some features may be irrelevant or highly correlated.

Solution:

  • Use Lasso Regression if you suspect that only a subset of features is relevant (e.g., location may not matter in some cases).
  • Use Ridge Regression if you want to retain all features but reduce their impact (e.g., size and number of bedrooms are both important but correlated).
  • Use Elastic Net if you want a balance between feature selection and coefficient shrinkage.