3.4 Gini Index Splitting
1. Definitions
1.1 What is Gini Impurity?
Gini impurity is a measure of how often a randomly chosen element would be incorrectly classified if it were labeled according to the class distribution in a subset. It is used in decision trees (like CART) as an alternative to entropy.
$$\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2$$
where $p_i$ is the proportion of samples belonging to class $i$ and $C$ is the number of classes.
- Lower Gini (closer to 0) → More pure (one class dominates).
- Higher Gini (closer to 0.5 for binary classification) → More impure (mixed classes).
Example Calculation
Suppose a dataset has 80% “Yes” and 20% “No” labels:
$$\text{Gini} = 1 - (0.8^2 + 0.2^2) = 1 - (0.64 + 0.04) = 0.32$$
A lower Gini value indicates a better split in a decision tree.
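As a quick sanity check, the arithmetic above can be reproduced in a couple of lines of Python (this snippet is illustrative and not part of the original example):

```python
# Gini impurity for an 80% "Yes" / 20% "No" node
p_yes, p_no = 0.8, 0.2
gini = 1 - (p_yes**2 + p_no**2)
print(round(gini, 2))  # 0.32
```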
1.2 Entropy vs. Gini: Which to Use?
| Criterion | Entropy | Gini Impurity |
|---|---|---|
| Formula | $-\sum_{i=1}^{C} p_i \log_2 p_i$ | $1 - \sum_{i=1}^{C} p_i^2$ |
| Range | $[0, \log_2 C]$ (binary: $[0, 1]$) | $[0, 1 - 1/C]$ (binary: $[0, 0.5]$) |
| Speed | Slower (log computation) | Faster (no log function) |
| Tendency | Prefers balanced splits | Tends to isolate the dominant class |
In practice, Gini impurity is used more often in CART decision trees (e.g., scikit-learn) because it is computationally faster.
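The two criteria can be compared directly on the same class distributions. The sketch below is my own illustration (the function names are not from the original text); it shows that both measures peak for an even split and fall toward 0 as one class dominates:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of a class distribution, in bits."""
    probs = np.asarray(probs)
    probs = probs[probs > 0]  # avoid log2(0)
    return -np.sum(probs * np.log2(probs))

def gini(probs):
    """Gini impurity of a class distribution."""
    probs = np.asarray(probs)
    return 1 - np.sum(probs**2)

# Compare both measures as the majority-class probability grows
for p in [0.5, 0.8, 0.99]:
    dist = [p, 1 - p]
    print(f"p={p}: entropy={entropy(dist):.3f}, gini={gini(dist):.3f}")
```

Both columns shrink together as `p` approaches 1, which is why the two criteria usually choose very similar splits in practice.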
1.3 Why Are Probabilities Squared in Gini?
In the Gini impurity formula, the probabilities are squared to penalize larger class proportions more heavily. The intuition behind squaring the probabilities is as follows:
Gini Impurity Formula:
$$\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2$$
where $p_i$ is the probability of class $i$.
Why are probabilities squared?
- Emphasizes dominant classes:
  - Squaring the probabilities ensures that dominant classes (those with higher probability) have a greater effect on the impurity score.
  - For example, if a dataset has 90% of one class and 10% of another, the dominant class (90%) contributes far more to the sum of squares and so largely determines the Gini value.
- Minimizes impurity:
  - The squared terms make the Gini value fall sharply as one class comes to dominate a node, driving the algorithm to prefer splits that better separate the data.
- Encourages homogeneous nodes:
  - A pure node (where all instances belong to one class) has a Gini value of 0, because $p_i = 1$ for the dominant class and the sum of squared probabilities is 1.
Example:
For a binary classification problem:
- If a node has 90% of class A and 10% of class B, then:
  $$\text{Gini} = 1 - (0.9^2 + 0.1^2) = 1 - (0.81 + 0.01) = 0.18$$
  This low Gini value indicates that the node is relatively pure because one class dominates.
- If the node has 50% of class A and 50% of class B:
  $$\text{Gini} = 1 - (0.5^2 + 0.5^2) = 1 - 0.5 = 0.5$$
  This high Gini value means the node is impure (evenly split between the classes).
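The same pattern can be seen by sweeping the majority-class probability from 0 to 1. This short loop is my own illustration of the binary Gini curve (it is not part of the original text):

```python
# Binary Gini impurity 1 - p^2 - (1 - p)^2 as the class balance varies
for p in [0.0, 0.1, 0.25, 0.5, 0.9, 1.0]:
    g = 1 - p**2 - (1 - p)**2
    print(f"p={p:.2f} -> gini={g:.2f}")
```

The curve is 0 at both pure extremes (`p=0` and `p=1`) and peaks at 0.5 for an even split, matching the 0.18 and 0.5 values computed above.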
2. Toy Example
```python
import numpy as np

# Function to calculate Gini impurity
def gini_impurity(labels):
    # Calculate the probabilities of each class
    class_counts = np.bincount(labels)
    probabilities = class_counts / len(labels)
    # Compute Gini impurity
    gini = 1 - np.sum(probabilities**2)
    return gini

# Toy dataset: Class 1 = "Play", Class 0 = "Don't Play"
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 1])

# Calculate the Gini impurity for the whole dataset
gini = gini_impurity(labels)
print(f"Gini Impurity for the dataset: {gini:.4f}")
```
- Labels: The dataset has 6 instances of class 0 (“Don’t Play”) and 4 instances of class 1 (“Play”).
- Gini Impurity formula:
  $$\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2 = 1 - (0.6^2 + 0.4^2) = 1 - 0.52 = 0.48$$
  where $p_i$ is the probability of each class.
Output:
Gini Impurity for the dataset: 0.4800
Interpretation
- A Gini impurity of 0.48 indicates that the dataset is fairly mixed between the two classes.
- A lower Gini value (closer to 0) would indicate more purity (one class dominates).