3.4 Gini Index Splitting
1. Definitions
1.1 What is Gini Impurity?
Gini impurity is a measure of how often a randomly chosen element would be incorrectly classified if it were labeled according to the class distribution in a subset. It is used in decision trees (like CART) as an alternative to entropy.
$$\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2$$
where $p_i$ is the proportion of samples belonging to class $i$ and $C$ is the number of classes.
- Lower Gini (closer to 0) → More pure (one class dominates).
- Higher Gini (closer to 0.5 for binary classification) → More impure (mixed classes).
Example Calculation
Suppose a dataset has 80% “Yes” and 20% “No” labels:
$$\text{Gini} = 1 - (0.8^2 + 0.2^2) = 1 - (0.64 + 0.04) = 0.32$$
A lower Gini value indicates a better split in a decision tree.
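As a quick sanity check, the arithmetic above can be reproduced in a couple of lines of Python (this snippet is illustrative and not part of the original example):

```python
# Gini impurity for an 80% "Yes" / 20% "No" node
p_yes, p_no = 0.8, 0.2
gini = 1 - (p_yes**2 + p_no**2)
print(round(gini, 2))  # 0.32
```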
1.2 Entropy vs. Gini: Which to Use?
| Criterion | Entropy | Gini Impurity |
|---|---|---|
| Formula | $-\sum_{i=1}^{C} p_i \log_2 p_i$ | $1 - \sum_{i=1}^{C} p_i^2$ |
| Range | $[0, \log_2 C]$ (binary: $[0, 1]$) | $[0, 1 - 1/C]$ (binary: $[0, 0.5]$) |
| Speed | Slower (log computation) | Faster (no log function) |
| Tendency | Prefers balanced splits | Tends to isolate the dominant class |
In practice, Gini impurity is used more often in CART decision trees (e.g., scikit-learn) because it is computationally faster.
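The two criteria can be compared directly on the same class distributions. The sketch below is my own illustration (the function names are not from the original text); it shows that both measures peak for an even split and fall toward 0 as one class dominates:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of a class distribution, in bits."""
    probs = np.asarray(probs)
    probs = probs[probs > 0]  # avoid log2(0)
    return -np.sum(probs * np.log2(probs))

def gini(probs):
    """Gini impurity of a class distribution."""
    probs = np.asarray(probs)
    return 1 - np.sum(probs**2)

# Compare both measures as the majority-class probability grows
for p in [0.5, 0.8, 0.99]:
    dist = [p, 1 - p]
    print(f"p={p}: entropy={entropy(dist):.3f}, gini={gini(dist):.3f}")
```

Both columns shrink together as `p` approaches 1, which is why the two criteria usually choose very similar splits in practice.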
1.3 Why Are Probabilities Squared in Gini?
In the Gini impurity formula, the probabilities are squared to penalize larger class proportions more heavily. The intuition behind squaring the probabilities is as follows:
Gini Impurity Formula:
$$\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2$$
where $p_i$ is the probability of class $i$.
Why are probabilities squared?
- Emphasizes dominant classes:
  - Squaring the probabilities ensures that dominant classes (those with higher probability) have a greater effect on the impurity score.
  - For example, if a dataset has 90% of one class and 10% of another, the dominant class (90%) contributes far more to the sum of squares and so largely determines the Gini value.
- Minimizes impurity:
  - The squared terms make the Gini value fall sharply as one class comes to dominate a node, driving the algorithm to prefer splits that better separate the data.
- Encourages homogeneous nodes:
  - A pure node (where all instances belong to one class) has a Gini value of 0, because $p_i = 1$ for the dominant class and the sum of squared probabilities is 1.
Example:
For a binary classification problem:
- If a node has 90% of class A and 10% of class B, then:
  $$\text{Gini} = 1 - (0.9^2 + 0.1^2) = 1 - (0.81 + 0.01) = 0.18$$
  This low Gini value indicates that the node is relatively pure because one class dominates.
- If the node has 50% of class A and 50% of class B:
  $$\text{Gini} = 1 - (0.5^2 + 0.5^2) = 1 - 0.5 = 0.5$$
  This high Gini value means the node is impure (evenly split between the classes).
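The same pattern can be seen by sweeping the majority-class probability from 0 to 1. This short loop is my own illustration of the binary Gini curve (it is not part of the original text):

```python
# Binary Gini impurity 1 - p^2 - (1 - p)^2 as the class balance varies
for p in [0.0, 0.1, 0.25, 0.5, 0.9, 1.0]:
    g = 1 - p**2 - (1 - p)**2
    print(f"p={p:.2f} -> gini={g:.2f}")
```

The curve is 0 at both pure extremes (`p=0` and `p=1`) and peaks at 0.5 for an even split, matching the 0.18 and 0.5 values computed above.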
2. Toy Example
```python
import numpy as np

# Function to calculate Gini impurity
def gini_impurity(labels):
    # Calculate the probabilities of each class
    class_counts = np.bincount(labels)
    probabilities = class_counts / len(labels)
    # Compute Gini impurity
    gini = 1 - np.sum(probabilities**2)
    return gini

# Toy dataset: Class 1 = "Play", Class 0 = "Don't Play"
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 1])

# Calculate the Gini impurity for the whole dataset
gini = gini_impurity(labels)
print(f"Gini Impurity for the dataset: {gini:.4f}")
```
- Labels: The dataset has 6 instances of class 0 (“Don’t Play”) and 4 instances of class 1 (“Play”).
- Gini Impurity formula:
  $$\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2 = 1 - (0.6^2 + 0.4^2) = 1 - 0.52 = 0.48$$
  where $p_i$ is the probability of each class.
Output:
Gini Impurity for the dataset: 0.4800
Interpretation
- A Gini impurity of 0.48 indicates that the dataset is fairly mixed between the two classes.
- A lower Gini value (closer to 0) would indicate more purity (one class dominates).