1.1 Fundamentals

1. Data Preprocessing Techniques

Data preprocessing is a crucial step in any data analysis or machine learning workflow, especially when dealing with tabular data. It transforms raw data into a form that models can consume and that supports reliable, reproducible results. Here are key data preprocessing steps for tabular data:

1.1 Data Cleaning

  • Handling Missing Values:
    • Remove rows/columns with too many missing values.
    • Impute missing values using mean, median, mode, or predictive models (e.g., k-NN imputation).
  • Handling Duplicates:
    • Identify and remove duplicate rows to prevent bias.
  • Correcting Errors:
    • Identify and fix implausible values, inconsistent formatting, and incorrect entries.
  • Dealing with Inconsistent Data:
    • Standardize formats (e.g., date/time formats, categorical values).
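
The cleaning steps above can be sketched with pandas. This is a minimal illustration under stated assumptions: the table, its column names, and the choice of median and mode imputation are all hypothetical, not a prescribed recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 40.0, 31.0],
    "city": ["Paris", "paris ", "Berlin", "Berlin", None],
    "income": [30000.0, 42000.0, np.nan, np.nan, 52000.0],
})

# Standardize an inconsistently formatted categorical column.
df["city"] = df["city"].str.strip().str.title()

# Remove exact duplicate rows (the two "Berlin" rows are identical).
df = df.drop_duplicates().reset_index(drop=True)

# Impute: median for numeric columns, mode for the categorical column.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```

Standardizing formats before removing duplicates matters here: rows that differ only in casing or stray whitespace would otherwise not be recognized as duplicates.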

1.2 Data Transformation

  • Scaling & Normalization:
    • Standardization (Z-score normalization): Converts data to have a mean of 0 and a standard deviation of 1.
    • Min-Max Scaling: Scales values between a fixed range (e.g., [0,1]).
    • Robust Scaling: Uses median and interquartile range to scale data (useful for outlier-resistant scaling).
  • Encoding Categorical Variables:
    • One-hot encoding: Converts categorical variables into binary columns.
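    • Label encoding: Assigns a unique integer to each category. The integers imply an order, so this suits ordinal categories or tree-based models better than linear models.
    • Target encoding: Replaces each category with the mean of the target variable for that category. Compute it on the training data only to avoid target leakage.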
  • Feature Engineering:
    • Creating new features based on domain knowledge.
    • Combining or splitting existing features (e.g., extracting year from a date column).
    • Applying transformations (e.g., log, polynomial features).
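
As a rough sketch of the scaling formulas and one-hot encoding described above, the following uses NumPy and pandas directly. The toy feature values and the `color` column are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # toy feature with one outlier

# Standardization: subtract the mean, divide by the standard deviation.
z = (x - x.mean()) / x.std()

# Min-max scaling to the range [0, 1].
mm = (x - x.min()) / (x.max() - x.min())

# Robust scaling: subtract the median, divide by the interquartile range.
q1, med, q3 = np.percentile(x, [25, 50, 75])
robust = (x - med) / (q3 - q1)

# One-hot encoding of a hypothetical categorical column.
colors = pd.DataFrame({"color": ["red", "green", "red", "blue"]})
one_hot = pd.get_dummies(colors, columns=["color"])

print(z.round(2), mm.round(2), robust.round(2), sep="\n")
print(one_hot)
```

With the outlier present, min-max scaling squeezes the four small values close to 0, while robust scaling keeps them spread out because the median and IQR are barely affected by the extreme value.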

1.3 Data Reduction

  • Feature Selection:
    • Removing irrelevant or highly correlated features to reduce dimensionality.
    • Using statistical methods (e.g., chi-square test, mutual information).
    • Using model-based techniques (e.g., feature importance from decision trees).
  • Dimensionality Reduction:
    • Principal Component Analysis (PCA) for reducing the number of features while preserving variance.
    • t-SNE and UMAP for visualization and exploratory analysis.
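
To make the PCA idea concrete, here is a minimal sketch that computes principal components with NumPy's SVD instead of a library PCA class. The synthetic data (three features, one nearly redundant) is an assumption chosen so that two components capture almost all the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples, 3 features; the third is almost a copy of the first.
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=200)])

# PCA: center the data, then take the SVD of the centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained_ratio = S**2 / np.sum(S**2)  # share of variance per component
X_reduced = Xc @ Vt[:2].T              # project onto the top 2 components

print(explained_ratio.round(4))
print(X_reduced.shape)
```

Because the first and third features are highly correlated, the third component explains almost no variance, which is the same redundancy that correlation-based feature selection would remove.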

1.4 Data Integration

  • Merging Datasets:
    • Combining data from multiple sources (SQL joins, pandas merge()).
  • Resolving Inconsistencies:
    • Ensuring uniform data formats across merged datasets.
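
A small sketch of merging with pandas `merge()` follows. The two tables, their column names, and the mismatch in key types are hypothetical; the point is that formats are reconciled before the join.

```python
import pandas as pd

# Hypothetical tables from two sources; the key is formatted inconsistently.
customers = pd.DataFrame({"customer_id": ["001", "002", "003"],
                          "name": ["Ana", "Ben", "Chloe"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3, 4],
                       "amount": [20.0, 35.5, 12.0, 99.0]})

# Resolve the inconsistency first: bring both keys to the same integer type.
customers["customer_id"] = customers["customer_id"].astype(int)

# Left join keeps every customer; indicator=True records where each row matched.
merged = customers.merge(orders, on="customer_id", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

print(merged)
print(unmatched["name"].tolist())
```

The `_merge` column is a quick way to audit a join: rows marked `left_only` are customers with no orders, which is worth checking before treating the resulting missing values as ordinary gaps.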

1.5 Data Splitting

  • Train-Test Split:
    • Splitting the dataset into training and testing sets (e.g., 80-20 or 70-30).
  • Stratified Sampling:
    • Ensuring class distribution is maintained (important for imbalanced datasets).
  • Cross-Validation:
    • Using k-fold or stratified k-fold to obtain performance estimates that depend less on any single split.
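
The splitting strategies above might look as follows with scikit-learn's `train_test_split` and `StratifiedKFold`. The 90/10 class imbalance, the 80-20 ratio, and the fold count are assumptions chosen for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Toy imbalanced labels: 90 negatives, 10 positives.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# 80-20 split that preserves the 10% positive rate in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stratified 5-fold cross-validation: every fold keeps the class ratio.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_positive_counts = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]

print(int(y_train.sum()), int(y_test.sum()))
print(fold_positive_counts)
```

Without `stratify=y`, a random 20-sample test set could easily contain zero or one positive example, which would make metrics such as recall unstable.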

1.6 Handling Imbalanced Data

  • Resampling Techniques:
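    • Oversampling the minority class (e.g., SMOTE - Synthetic Minority Over-sampling Technique).
    • Undersampling the majority class to balance classes.
    • Resampling should be applied to the training split only, so that evaluation data keeps the original class distribution.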
  • Using Weighted Models:
    • Assigning class weights to the loss function in machine learning algorithms.
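
The sketch below illustrates two of these ideas with NumPy only. Note the assumptions: it uses plain random oversampling (duplicating minority rows), not SMOTE, and the weight formula is the common "balanced" heuristic n_samples / (n_classes × class_count).

```python
import numpy as np

y = np.array([0] * 90 + [1] * 10)  # toy labels: 90 majority, 10 minority
classes, counts = np.unique(y, return_counts=True)

# "Balanced" class weights: n_samples / (n_classes * count_of_class).
weights = len(y) / (len(classes) * counts)

# Random oversampling: draw extra minority indices with replacement
# until both classes have the same number of rows.
rng = np.random.default_rng(0)
minority_idx = np.flatnonzero(y == 1)
extra = rng.choice(minority_idx, size=counts[0] - counts[1], replace=True)
balanced_idx = np.concatenate([np.arange(len(y)), extra])
y_balanced = y[balanced_idx]

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
print(np.bincount(y_balanced))
```

Class weights and oversampling pursue the same goal by different means: the first makes each minority example count more in the loss, the second shows the model more copies of it.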

1.7 Outlier Detection & Treatment

  • Statistical Methods:
    • Using z-scores, IQR (Interquartile Range), or box plots.
  • Model-based Methods:
    • Isolation Forest, DBSCAN clustering.
  • Capping/Truncation:
    • Replacing values beyond a chosen threshold with the threshold itself (e.g., capping at the 99th percentile).
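
As an illustration of the IQR rule and capping, consider this minimal NumPy sketch. The data are made up, and capping at the IQR fences (rather than at a percentile) is an assumption made because the sample is tiny.

```python
import numpy as np

x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])  # toy data, one outlier

# IQR rule: flag points more than 1.5 * IQR beyond the quartiles.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (x < lower) | (x > upper)

# Capping: clip values to the fences instead of dropping the rows.
x_capped = np.clip(x, lower, upper)

print(lower, upper)
print(is_outlier)
print(x_capped)
```

Capping keeps the row in the dataset, which is often preferable to deletion when the other columns of that row are still informative.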

1.8 Data Formatting & Type Conversion

  • Converting Data Types:
    • Ensuring numerical data is in the correct type (int/float).
    • Parsing dates properly.
  • Ensuring Consistent Column Naming:
    • Standardizing column names for ease of processing.
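
A short pandas sketch of type conversion and column renaming is shown here. The raw table, its column names, and the `%Y-%m-%d` date format are hypothetical; `errors="coerce"` turns unparseable entries into missing values instead of raising.

```python
import pandas as pd

# Hypothetical raw export where every column arrived as text.
raw = pd.DataFrame({
    "Order ID": ["1", "2", "3"],
    "Unit Price": ["9.99", "n/a", "12.50"],
    "Order Date": ["2024-01-05", "2024-02-10", "not a date"],
})

# Standardize column names: trimmed, lowercase, underscores instead of spaces.
raw.columns = raw.columns.str.strip().str.lower().str.replace(" ", "_")

# Convert each column to its proper type.
raw["order_id"] = raw["order_id"].astype(int)
raw["unit_price"] = pd.to_numeric(raw["unit_price"], errors="coerce")
raw["order_date"] = pd.to_datetime(raw["order_date"], format="%Y-%m-%d", errors="coerce")

print(raw.dtypes)
print(raw)
```

Coerced values show up as NaN or NaT, so type conversion is usually followed by the missing-value handling described in 1.1.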

2. Bootstrapping

Bootstrapping is generally considered a resampling technique rather than a core data preprocessing technique. However, it can be used during preprocessing in certain scenarios. Here’s how bootstrapping relates to data preprocessing:

2.1 What is Bootstrapping?

Bootstrapping is a statistical resampling method that involves repeatedly sampling data with replacement to generate multiple datasets. It is commonly used for estimating confidence intervals, reducing variance, and improving model robustness.
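
A minimal sketch of a bootstrap confidence interval for a mean, using the percentile method. The simulated sample, the number of resamples (2000), and the 95% level are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=50.0, scale=10.0, size=100)  # toy observed data

# Resample with replacement many times and record the statistic each time.
n_boot = 2000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means[i] = resample.mean()

# Percentile method: the middle 95% of the bootstrap distribution.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"sample mean: {sample.mean():.2f}")
print(f"95% bootstrap CI: [{ci_low:.2f}, {ci_high:.2f}]")
```

The same loop works for statistics that have no simple closed-form standard error, such as a median or a model's validation score.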

2.2 When is Bootstrapping Used in Data Preprocessing?

While bootstrapping is not a standard preprocessing step like missing value imputation or scaling, it can be useful in the following cases:

  1. Handling Small Datasets

    • If the dataset is small, bootstrapping can generate many resampled training sets from it. This adds no new information (every resample reuses the original observations), but it helps quantify how much a model or statistic varies from sample to sample.
  2. Reducing Overfitting

    • In ensemble learning (e.g., Bagging, Random Forests), bootstrapping is used to create multiple training datasets, helping reduce overfitting.
  3. Dealing with Imbalanced Data

    • Bootstrapping can be used for oversampling the minority class in imbalanced classification problems (though SMOTE is often preferred).
  4. Improving Model Evaluation

    • Instead of a single train-test split, bootstrapping can create many training sets, with the observations left out of each resample (the out-of-bag samples) serving as validation data, leading to more robust performance estimates.
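
To illustrate how a bootstrap resample naturally leaves some observations out for validation, the sketch below measures the out-of-bag fraction. The dataset size (1000) and the number of rounds (50) are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
indices = np.arange(n)

oob_fractions = []
for _ in range(50):
    # Training set: n indices drawn with replacement.
    train_idx = rng.choice(indices, size=n, replace=True)
    # Out-of-bag set: every index that was never drawn.
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[train_idx] = False
    oob_fractions.append(oob_mask.mean())

mean_oob = float(np.mean(oob_fractions))
print(f"average out-of-bag fraction: {mean_oob:.3f}")
```

On average roughly 37% of the rows are left out of each resample, since (1 - 1/n)^n approaches 1/e for large n. Those rows can act as a validation set for the model trained on that resample, which is the idea behind out-of-bag error in Random Forests.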

3. Characteristics of Image Data

Image data is fundamentally different from tabular data in both its structure and the way it needs to be processed for machine learning tasks. Here are some key characteristics of image data:

3.1 Structured as Grids of Pixels

  • Images are represented as 2D or 3D grids of pixels (depending on the number of color channels).

    • For grayscale images: The data is a 2D array where each element is a pixel value.
    • For color images: A 3D array, with dimensions (height × width × channels), where channels represent color (e.g., RGB = 3 channels).
  • Resolution: The size of the image (height and width) is a crucial characteristic. Larger images contain more information but require more computation.

3.2 High Dimensionality

  • Each pixel can be a feature, and with high-resolution images, you can have hundreds of thousands or even millions of features (one for each pixel).
    • For example, a 224×224 RGB image has 224 × 224 × 3 = 150,528 features.
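
The array shapes and feature counts above can be checked with a few lines of NumPy. The zero-filled arrays are placeholders standing in for real images.

```python
import numpy as np

# Placeholder images filled with zeros; real pixel data would come from a file.
gray = np.zeros((28, 28), dtype=np.uint8)       # height x width
rgb = np.zeros((224, 224, 3), dtype=np.uint8)   # height x width x channels

n_features_gray = gray.size  # 28 * 28
n_features_rgb = rgb.size    # 224 * 224 * 3

print(gray.shape, n_features_gray)
print(rgb.shape, n_features_rgb)
```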

3.3 Local Spatial Correlations

  • Pixels that are spatially close tend to be related: neighboring pixels usually carry similar information, which tasks like object detection and image recognition rely on.
    • This property matters because spatial relationships carry meaning (e.g., pixels next to a face pixel in a photo are likely to be part of the same face).

3.4 Hierarchical Structure

  • In complex images, features at lower levels (edges, textures, colors) combine to form higher-level structures (e.g., shapes, objects).
    • For example, a convolutional neural network typically learns edges in its early layers, shapes and parts in its middle layers, and whole objects in its deeper layers.

3.5 Large Size

  • Image datasets, especially high-resolution ones, tend to be much larger than tabular datasets and can require significant memory and processing power.

Difference Between Image Data and Tabular Data

| Aspect | Image Data | Tabular Data |
| --- | --- | --- |
| Representation | 2D or 3D grids of pixels (images are spatial) | 1D arrays (rows of data) with each feature as a column |
| Data Type | Pixels with numerical values (e.g., RGB values) | Numerical or categorical data (e.g., age, income) |
| Dimensionality | High-dimensional (e.g., 224×224×3) | Lower-dimensional (often fewer features) |
| Correlation | Local spatial correlations between pixels | Potential global relationships between features |
| Data Processing | Requires preprocessing techniques like scaling, normalization, or augmentation | Usually requires cleaning and encoding (for categorical data) |
| Learning Models | Convolutional Neural Networks (CNNs) are most effective | Typically uses fully connected networks (e.g., MLPs) or decision trees |

3.6 How MLP Deals with Image Data

A Multilayer Perceptron (MLP) is a type of fully connected neural network that works well for tabular data but can be applied to image data with some limitations. Let’s explore how it works on image data:

1. Flattening the Image

  • Images need to be flattened into one long vector for the MLP to process them. For instance:

    • A 28x28 grayscale image becomes a vector of 784 values.
    • A 224x224 RGB image becomes a vector of 224 * 224 * 3 = 150,528 values.
  • Loss of spatial structure: Flattening discards the 2D layout of the image. This is a drawback because an MLP, unlike a convolutional neural network (CNN), has no built-in way to capture local relationships between pixels.
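
The following NumPy sketch shows what flattening does to neighboring pixels. The 28×28 array filled with consecutive integers and the chosen pixel position are assumptions that make positions easy to track.

```python
import numpy as np

# Toy 28x28 "image" whose pixel values equal their row-major position.
image = np.arange(28 * 28).reshape(28, 28)
flat = image.reshape(-1)  # the vector an MLP would receive

row, col = 10, 5
idx = row * 28 + col        # position of pixel (row, col) in the flat vector
right_neighbor = idx + 1    # pixel (row, col + 1): still adjacent
below_neighbor = idx + 28   # pixel (row + 1, col): now 28 positions away

print(flat.shape)
print(flat[idx], flat[right_neighbor], flat[below_neighbor])
```

Horizontally adjacent pixels stay next to each other, but vertically adjacent pixels end up a full row width apart. The MLP receives no hint that those two inputs were neighbors in the original image.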

2. Feedforward Layers

  • In an MLP, each neuron in a layer is connected to every neuron in the previous layer. So, after flattening the image, MLP will learn a representation based on these dense connections.
  • Nonlinear activations (like ReLU) are applied after each layer to introduce non-linearity, which helps in learning complex patterns.
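
A forward pass through such dense layers can be sketched in NumPy as two matrix multiplications with a ReLU in between. The random inputs and weights, the batch size of 4, and the 128-unit hidden layer are assumptions; no training is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((4, 784))  # batch of 4 flattened 28x28 images (random stand-ins)

# Randomly initialized parameters; a real model would learn these.
W1 = rng.normal(scale=0.05, size=(784, 128))
b1 = np.zeros(128)
W2 = rng.normal(scale=0.05, size=(128, 10))
b2 = np.zeros(10)

# Hidden layer: every unit sees every input pixel, then ReLU is applied.
hidden = np.maximum(0.0, x @ W1 + b1)

# Output layer followed by a numerically stable softmax.
logits = hidden @ W2 + b2
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(hidden.shape, probs.shape)
print(probs.sum(axis=1))
```

Without the ReLU, the two layers would collapse into a single linear map, which is why the nonlinearity is needed to learn complex patterns.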

3. No Spatial Awareness

  • Since MLPs have no convolutional or pooling layers, they have no built-in notion of which pixels are neighbors.
    • Local features like edges, textures, and shapes (important in image data) are hard to learn efficiently, and a pattern learned at one position does not transfer to other positions.
    • Each hidden unit is connected to every pixel, so MLPs learn patterns over the entire image rather than reusable local pixel patterns.

4. Performance and Efficiency

  • MLPs can be quite inefficient for image data:
    • The first dense layer needs one weight per input pixel per neuron, so the parameter count grows quickly with image resolution, making MLPs computationally expensive and prone to overfitting.
    • As a result, MLPs are typically outperformed by Convolutional Neural Networks (CNNs) for image-based tasks.
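
A quick parameter count makes the inefficiency concrete. The helper functions and layer sizes below (128 dense units, 32 filters of size 3×3) are illustrative assumptions, not values prescribed by any particular architecture.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units


def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    """Weights plus biases of one 2D convolutional layer."""
    return kernel_h * kernel_w * in_channels * n_filters + n_filters


# Two-layer MLP on flattened 28x28 grayscale images (784 -> 128 -> 10).
mnist_mlp = dense_params(28 * 28, 128) + dense_params(128, 10)

# A single 128-unit dense layer on a flattened 224x224 RGB image.
rgb_dense = dense_params(224 * 224 * 3, 128)

# A single convolutional layer with 32 filters of size 3x3 on the same image.
rgb_conv = conv_params(3, 3, 3, 32)

print(f"{mnist_mlp:,}")  # 101,770
print(f"{rgb_dense:,}")  # 19,267,712
print(f"{rgb_conv:,}")   # 896
```

The convolutional layer's parameter count does not depend on the image resolution at all, because the same small filters are reused at every position.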

Example

from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.datasets import mnist

# Load MNIST: 28x28 grayscale digit images with integer labels 0-9
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Normalize pixel values from [0, 255] to [0, 1]
x_train = x_train / 255.0
x_test = x_test / 255.0

# Flatten each 28x28 image into a 784-element vector
x_train = x_train.reshape(-1, 28 * 28)
x_test = x_test.reshape(-1, 28 * 28)

# MLP with one hidden layer
model = Sequential([
    Input(shape=(28 * 28,)),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax'),  # one output per digit class
])

# Labels are integers, so use sparse categorical cross-entropy
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)

# Evaluate on the held-out test set
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test accuracy: {test_acc:.4f}")

3.7 Why CNNs Are Better for Images

  • CNNs (Convolutional Neural Networks) are specifically designed to preserve spatial relationships by applying convolutions (filters) across the image.
  • CNNs learn local patterns (edges, textures) and progressively combine them into higher-level features (like faces or objects), which makes them much better suited for image data than MLPs.
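
To show what "applying a filter across the image" means, here is a minimal NumPy sketch of a single 2D convolution (strictly a cross-correlation, as implemented in most deep learning libraries). The function `conv2d_valid`, the 6×6 toy image, and the hand-written vertical-edge filter are all illustrative assumptions; in a CNN the filter values are learned.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output value depends only on a small local patch.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# Toy 6x6 image: dark left half, bright right half (a vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-written vertical-edge filter.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

response = conv2d_valid(image, kernel)
print(response)
```

The response is nonzero only in the columns where the window straddles the edge, and the same nine weights detect that edge at every row. This weight sharing is what lets CNNs learn local patterns with far fewer parameters than a dense layer.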