
3.1 Fundamentals

1. Definition

A Decision Tree is a supervised learning algorithm used for classification and regression tasks. It models decisions as a tree-like structure, where:

  • Internal nodes represent feature-based decisions (splits).
  • Branches represent outcomes of these decisions.
  • Leaf nodes represent final predictions (class labels or numerical values).

1.1 What Problems Do Decision Trees Solve?

Related videos: "Decision and Classification Trees, Impurity Score | Gini Index", "Regression Trees", "Feature Selection and Missing Data".

Classification Problems

  • Predicts discrete class labels.
  • Example: Spam detection (Email = Spam or Not Spam).

Regression Problems

  • Predicts continuous values.
  • Example: House price prediction (based on square footage, location, etc.).
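
As a concrete illustration of the regression case, here is a minimal sketch using scikit-learn's DecisionTreeRegressor; the feature values and prices below are made up purely for illustration.

```python
# Minimal sketch: a regression tree predicting house price from two made-up features.
# Assumes scikit-learn is installed; the data below is illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1200, 2], [1500, 3], [800, 1], [2000, 4], [950, 2]])  # [sqft, bedrooms]
y = np.array([250_000, 320_000, 180_000, 450_000, 210_000])          # price

reg = DecisionTreeRegressor(max_depth=2, random_state=0)
reg.fit(X, y)
print(reg.predict([[1100, 2]]))  # predicted price for an unseen house
```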

Decision-Making & Rule Extraction

  • Used in expert systems and business decision-making.
  • Example: Loan approval system (deciding based on income, credit score, etc.).

1.2 Why Use a Decision Tree?

Interpretable & Explainable (easy to visualize).
Handles Both Numerical & Categorical Data.
Non-Parametric (no assumption about data distribution).
Feature Selection Built-In (chooses important features automatically).
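
A minimal sketch of the interpretability point, assuming scikit-learn is available: a fitted tree can be printed as human-readable if/else rules.

```python
# Minimal sketch of interpretability: fit a small classifier and print the learned rules.
# Assumes scikit-learn; the iris dataset is used just for illustration.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The tree is human-readable: each line is an if/else split on a feature threshold.
print(export_text(clf, feature_names=load_iris().feature_names))
```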

2. Building a DT

Building a decision tree involves a series of steps where the data is split according to certain criteria to create a tree structure that predicts the output variable. The process can be applied to both classification and regression tasks, with the main difference being the metric used to decide on the best split. Here’s a general overview of how a decision tree is built:

  1. Start at the root:
    • Begin with the entire dataset at the root.
    • Define the splitting criterion: entropy or Gini index for classification, variance reduction for regression.
    • Determine the feature and value to split on: for each feature, evaluate candidate thresholds or categories to find the best split (see the split-search sketch after this list).
  2. Recursively split each subset:
    • Apply the same splitting process recursively to each child node.
    • Stopping criteria: the recursion stops at a node when a stopping criterion is met, such as: the node is pure, the maximum depth is reached, the node has fewer samples than the minimum, etc.
  3. Tree pruning: reduce the size of the tree to prevent overfitting.
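
As referenced in step 1, here is a from-scratch sketch of the greedy split search, assuming Gini impurity and binary splits; the helper names gini and best_split are my own, not from any particular library.

```python
# From-scratch sketch of the greedy split search described above (not an optimized
# implementation). Uses Gini impurity for a binary classification target.
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (feature_index, threshold) with the lowest weighted child impurity."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (j, t, score)
    return best[:2]

X = np.array([[2.0, 1.0], [3.0, 1.5], [10.0, 2.0], [11.0, 2.5]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))  # expect a split on feature 0 that separates the two classes
```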

3. DT vs kNN Comparison

(This type of question is likely to appear in a test. The answer to this type of question is open-ended. The idea is to test general understanding of ML algorithms, so that you have a rough idea of which algorithm to use in an application. A basic template for answering these questions is to give a brief summary of how each algorithm works, and then list several typical characteristics of the algorithm. Many things can be discussed, but the focus should be on what is unique or different in comparison with the other algorithm.)

Typically, we could discuss data requirements, model complexity and capacity, training and testing cost, hyperparameters, decision boundaries, noise and outlier sensitivity, or any other specialty of the algorithms.

  • Model:
    • DT: a tree structure built recursively by splitting the data based on feature values in a greedy manner.
    • kNN: a non-parametric, instance-based learning algorithm. kNN makes predictions based on the local similarity between samples.
  • Data:
    • Decision Tree (DT): handles categorical features, does not require normalization, and is less sensitive to outliers, but can be sensitive to changes in the training data (high variance).
    • kNN: works on numerical features only, is sensitive to feature scale, and is sensitive to irrelevant or redundant features (due to the distance calculation).
  • Training and Testing:
    • DT: Computation is mainly in the training phase. Inference or prediction is fast, requiring only a traversal of the tree down to a leaf node.
    • kNN: Training is virtually non-existent, just storing the data. Prediction can be slow for large datasets since it requires computing the distance to all training instances and identifying the k nearest neighbors for each prediction (see the timing sketch after the table below).
  • Hyperparameter tuning:
    • DT: many hyperparameters to tune, related to tree growth and complexity (e.g., maximum depth, split criterion, minimum samples per leaf).
    • kNN: just the number of neighbors k and the distance metric (e.g., Euclidean, Manhattan).
  • Decision boundary (complexity):
    • DT: Each node or split separates the feature space into two or more sub-regions. A tree of unbounded depth can form an arbitrarily complex decision boundary.
    • kNN: The smoothness of the boundary largely depends on the value of k. Large k leads to a smooth decision boundary (low variance, high bias), while small k leads to a complex boundary (high variance, low bias).
Feature | Decision Tree Classifier | k-Nearest Neighbors (kNN) Classifier
Concept | Learns a hierarchy of rules to make decisions | Stores training data and classifies based on neighbors
Training Phase | Builds a tree structure by recursively splitting data | No explicit training; just stores data
Prediction Speed | Fast (traverses the tree) | Slow (computes distances for all points)
Memory Usage | Low (stores tree structure) | High (stores all training data)
Handling of Nonlinear Boundaries | Limited but can approximate | Very flexible, handles complex boundaries
Robustness to Noise | Can be sensitive (pruning helps) | Sensitive to outliers (can use weighted kNN)
Interpretability | High (tree structure is human-readable) | Low (decisions depend on many points)
Scalability | Scales well with pruning | Scales poorly for large datasets due to distance computation
Hyperparameters | Tree depth, impurity criteria, pruning | Number of neighbors (k), distance metric
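
A rough, hedged illustration of the training/prediction trade-off above, assuming scikit-learn; absolute timings depend on the machine and data, so only the qualitative pattern matters.

```python
# Rough illustration of the training/prediction cost difference described above.
# Assumes scikit-learn; timings vary by machine, so treat them as qualitative.
import time
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)

for name, model in [("DT", DecisionTreeClassifier(random_state=0)),
                    ("kNN", KNeighborsClassifier(n_neighbors=5))]:
    t0 = time.perf_counter(); model.fit(X, y); fit_t = time.perf_counter() - t0
    t0 = time.perf_counter(); model.predict(X); pred_t = time.perf_counter() - t0
    print(f"{name}: fit {fit_t:.3f}s, predict {pred_t:.3f}s")
# Typically the tree is slower to fit but much faster to predict than kNN.
```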

4. Pruning to Address Overfitting in Decision Trees

How to Prune

Pruning reduces the complexity of a decision tree by removing nodes that do not contribute significantly to classification accuracy. This helps prevent overfitting, where the model memorizes noise instead of learning general patterns.

4.1 Pre-Pruning (Early Stopping)

  • Stops the tree from growing beyond a certain point.
    • Pre-pruning involves stopping the tree from growing too deep or too complex during the construction phase. Rather than allowing the tree to fully develop until all leaves are pure (contain instances of a single class) or until all possible splits are performed, pre-pruning sets conditions that halt the growth of the tree earlier.
  • Common criteria (illustrated in the sketch after this list):
    • Maximum tree depth
    • Minimum samples per leaf
    • Minimum impurity decrease
  • Advantages:
    • Prevents over-complex trees from the start.
    • Reduces the size of the tree, making it faster to build and more interpretable.
  • Disadvantages:
    • Might stop too early and miss important patterns.
    • It may lead to underfitting if the stopping criteria are too restrictive, as the tree might not grow deep enough to capture important patterns in the data.
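
A sketch of how the pre-pruning criteria above map onto scikit-learn hyperparameters; the specific values are illustrative, not tuned.

```python
# Sketch of pre-pruning in scikit-learn: the growth-limiting hyperparameters map
# directly to the criteria listed above. Values here are illustrative, not tuned.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

pre_pruned = DecisionTreeClassifier(
    max_depth=4,                # maximum tree depth
    min_samples_leaf=10,        # minimum samples per leaf
    min_impurity_decrease=1e-3, # minimum impurity decrease to accept a split
    random_state=0,
)
print(cross_val_score(pre_pruned, X, y, cv=5).mean())
```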

4.2 Post-Pruning (Reduced-Error Pruning / Cost Complexity Pruning)

  • Grows the full tree and then prunes it by removing nodes that do not improve generalization.
    • Post-pruning, in contrast, allows the tree to grow to its full size and complexity and then prunes it back. This is done by removing branches that contribute little to the tree's ability to classify new instances.
  • Common methods:
    • Reduced-error pruning: Removes nodes and checks if accuracy improves on a validation set.
    • Cost complexity pruning (CCP): Uses a penalty term to balance tree complexity and accuracy.
  • Advantage:
    • Post-pruning is generally more effective at reducing overfitting compared to pre-pruning because it starts with a fully grown tree, ensuring that no potentially useful branches are stopped prematurely. It often results in better performance on unseen data.
  • Disadvantage:
    • Computationally expensive, since it requires building the full tree first and then evaluating various pruning options.
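
A sketch of post-pruning via scikit-learn's cost complexity pruning path; selecting the alpha by validation-set accuracy is one simple strategy, not the only one.

```python
# Sketch of post-pruning via cost complexity pruning (CCP) in scikit-learn:
# grow the full tree, then evaluate candidate ccp_alpha values on held-out data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Candidate alphas come from the fully grown (unpruned) tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# Pick the alpha whose pruned tree scores best on the validation split.
best_alpha = max(
    path.ccp_alphas,
    key=lambda a: DecisionTreeClassifier(ccp_alpha=a, random_state=0)
                  .fit(X_tr, y_tr).score(X_val, y_val),
)
print("chosen alpha:", best_alpha)
```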

5. Choose between Decision Tree, Random Forest, and GBDT

Choosing between a single Decision Tree (DT), Random Forest (RF), and Gradient Boosted Decision Trees (GBDT) depends on the problem characteristics, data size, and computational constraints.

5.1 Consider Model Complexity and Interpretability

Model | Complexity | Interpretability
Decision Tree | Low | High (easy to visualize)
Random Forest | Medium | Low (many trees make interpretation harder)
GBDT | High | Low (boosting is less interpretable)

Use DT if interpretability is crucial (e.g., medical diagnosis).
Use RF/GBDT if accuracy is more important than interpretability.

5.2 Evaluate Overfitting and Generalization

Model | Overfitting Risk | Generalization
DT | High (prone to overfitting) | Poor for complex data
RF | Low (averaging reduces overfitting) | Good
GBDT | Medium (but can overfit with too many trees) | Very Good

Use DT for small datasets with simple decision boundaries.
Use RF when reducing overfitting is a priority.
Use GBDT when fine-tuning is needed to maximize accuracy.

5.3 Handling of Large Datasets

Model | Training Speed | Prediction Speed | Suitable for Large Data?
DT | Fast | Very fast | Yes
RF | Slower (many trees) | Fast | Yes
GBDT | Slowest (sequential training) | Fast | Limited

Use DT when speed is critical.
Use RF when you have a large dataset and need parallelizable training.
Use GBDT for smaller datasets where maximizing accuracy is key.

5.4 Sensitivity to Noisy Data

Model | Sensitivity to Noise
DT | High
RF | Low
GBDT | Medium (but can be tuned)

Use RF if your dataset has noise or irrelevant features.
Use GBDT if feature engineering is well done.
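
To make the comparison concrete, here is a hedged sketch that cross-validates the three model families on one dataset with scikit-learn; the resulting scores are dataset-specific and only illustrate the workflow.

```python
# Quick comparison of the three model families on one dataset.
# Accuracy numbers are dataset-specific; this only illustrates the workflow.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "GBDT": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```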

6. Bias and Variance Trade-off of Models

The bias-variance tradeoff is a fundamental concept in machine learning that describes the tradeoff between the error introduced by the model’s assumptions about the target function (bias) and the error introduced by the model’s sensitivity to fluctuations in the training dataset (variance).

  • KNN: KNN can have low bias and high variance, especially with a lower value of k (the number of nearest neighbors). With a lower k, the model closely follows the training data, leading to high sensitivity (variance) but capturing the underlying data patterns well (low bias). Increasing k increases the bias but reduces the variance as the model becomes smoother and less sensitive to the training data.

  • SVM: SVMs can have relatively low bias and moderate variance, depending on the choice of the kernel and regularization parameter (soft-margin SVM). The regularization parameter controls the tradeoff between achieving a low bias by allowing more complexity (with a risk of higher variance) and keeping the model simpler with higher bias but lower variance. The kernel choice (linear, polynomial, RBF, etc.) also affects this balance by determining the flexibility of the decision boundary.

  • Decision trees: Unpruned decision trees typically exhibit low bias and high variance. They can capture complex data structures by fitting closely to the training data, but this makes them sensitive to noise in the training data. Pruning and setting restrictions on tree growth can help increase bias slightly but significantly reduce variance, leading to a more generalizable model.

  • Random Forest: Random Forest, an ensemble of decision trees, is designed to maintain the low bias of decision trees while reducing variance through bagging (bootstrap aggregating). By averaging multiple deep trees trained on different subsets of the data, Random Forest achieves lower variance than individual decision trees without a substantial increase in bias.

  • GBDT: focuses on sequentially reducing the residual errors of previous models, leading to low bias. The method can have high variance if the number of boosting rounds is too large without adequate regularization, as each new model overfits the residual errors of the previous ones. Proper regularization and setting a stopping criterion are essential to balance the bias-variance tradeoff.

  • Neural Networks: especially deep neural networks, have the flexibility to model complex non-linear relationships, resulting in low bias. However, without proper regularization (e.g., dropout, weight decay), they can have high variance due to their capacity to overfit the training data. The architecture design and regularization techniques are key to managing the bias-variance tradeoff.

Algorithm | Bias | Variance | Notes
k-Nearest Neighbors (kNN), small k | Low | High | Flexible but sensitive to noise
k-Nearest Neighbors (kNN), large k | High | Low | More generalized but may underfit
Support Vector Machines (SVM), linear kernel | High | Low | Good for simple, linearly separable data
Support Vector Machines (SVM), RBF kernel | Low | High | Powerful but prone to overfitting
Decision Trees (shallow) | High | Low | Simple but may underfit
Decision Trees (deep/unpruned) | Low | High | Overfits easily
Random Forest | Low | Medium | Averaging reduces variance
Gradient Boosted Trees (GBDT) | Low | High | Can overfit but performs well with tuning
Neural Networks (small model, few layers) | High | Low | May underfit complex data
Neural Networks (deep model, many layers) | Low | High | Requires regularization to prevent overfitting
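
One way to see the decision-tree rows of this table empirically (a sketch assuming scikit-learn): sweep max_depth and compare training accuracy with cross-validated accuracy.

```python
# Sketch of the bias-variance pattern in the table: shallow trees underfit (high bias);
# deep trees fit training data almost perfectly but generalize worse (high variance).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
depths = np.array([1, 2, 4, 8, 16])
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5,
)
for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"depth={d:2d}  train={tr:.3f}  cv={va:.3f}")
```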

7. What is CART?

CART (Classification and Regression Trees) is a decision tree algorithm used for classification and regression problems. It was introduced by Breiman et al. (1984).

Unlike earlier decision tree methods that used information gain (entropy), CART:

  • Uses Gini impurity for classification.
  • Uses Mean Squared Error (MSE) for regression.
  • Produces binary trees (each split has exactly two branches).

7.1 How CART Works

  1. Find the Best Split
    • For classification: Uses Gini impurity.
    • For regression: Uses MSE to minimize variance.
  2. Split the Data
    • Recursively splits the dataset until a stopping condition is met (e.g., max depth, min samples per leaf).
  3. Pruning (Optional)
    • Reduces overfitting by removing weak branches.
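
As the comparison table in the next subsection notes, scikit-learn's trees are CART-style; here is a small sketch of how the criteria above map to the criterion argument and how to confirm that every split is binary.

```python
# Sketch: scikit-learn's trees are CART-style, so the criteria above map to the
# `criterion` argument, and every internal node has exactly two children.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X, y = load_iris(return_X_y=True)

# Classification tree with Gini impurity (CART's classification criterion).
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)

# Regression tree with MSE, called "squared_error" in scikit-learn.
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3,
                            random_state=0).fit(X, y.astype(float))

# CART produces binary trees: every internal node has a left and a right child.
tree = clf.tree_
internal = tree.children_left != -1          # leaves are marked with -1
print("all splits binary:", bool((tree.children_right[internal] != -1).all()))
print("regression tree leaves:", reg.get_n_leaves())
```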

7.2 CART vs. Other Decision Tree Methods

Algorithm | Splitting Criterion | Tree Type
ID3 (Iterative Dichotomiser 3) | Information Gain (Entropy) | Multi-way (not always binary)
C4.5 (successor of ID3) | Information Gain Ratio | Multi-way
CART (used in sklearn) | Gini Impurity (classification), MSE (regression) | Binary tree

Why use CART?
✅ Gini impurity is cheaper to compute than entropy-based criteria (no logarithms needed).
✅ Used in popular ML models (Random Forest, GBDT).
✅ Handles both classification & regression tasks.