3.4 Gini Index Splitting

1. Definitions

1.1 What Is Gini Impurity?

Gini impurity is a measure of how often a randomly chosen element would be incorrectly classified if it were labeled according to the class distribution in a subset. It is used in decision trees (like CART) as an alternative to entropy.

$\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2$

where $p_i$ is the probability of class $i$ and $C$ is the number of classes.

  • Lower Gini (closer to 0) → more pure (one class dominates).
  • Higher Gini (closer to 0.5 in the binary case) → more impure (classes are mixed).

Example Calculation

Suppose a dataset has 80% “Yes” and 20% “No” labels:

$\text{Gini} = 1 - (0.8^2 + 0.2^2) = 1 - (0.64 + 0.04) = 0.32$

A lower Gini value indicates a better split in a decision tree.
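
This arithmetic can be sanity-checked in a couple of lines of Python (a minimal sketch; the variable names are just for illustration):

# Quick check of the 80% / 20% example
p_yes, p_no = 0.8, 0.2
gini = 1 - (p_yes**2 + p_no**2)
print(f"{gini:.2f}")  # 0.32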

1.2 Entropy vs. Gini: Which to Use?

Criterion | Entropy | Gini Impurity
Formula | $H = -\sum_i p_i \log_2 p_i$ | $\text{Gini} = 1 - \sum_i p_i^2$
Range | 0 to 1 (binary) | 0 to 0.5 (binary)
Speed | Slower (log computation) | Faster (no log function)
Tendency | Prefers balanced splits | Prefers the dominant class

In practice, Gini impurity is used more often in CART decision trees (e.g., scikit-learn) because it is computationally faster.
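
As a side-by-side illustration of the two criteria, the sketch below computes entropy and Gini for the same 80/20 distribution (a minimal example; log base 2 is the usual convention for entropy):

import numpy as np

def entropy(probs):
    # H = -sum(p * log2(p)), skipping zero-probability classes
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

def gini(probs):
    # Gini = 1 - sum(p^2)
    return 1 - np.sum(probs**2)

probs = np.array([0.8, 0.2])
print(f"Entropy: {entropy(probs):.4f}")  # 0.7219
print(f"Gini:    {gini(probs):.4f}")     # 0.3200

Both criteria agree on which nodes are pure; Gini simply avoids the logarithm, which is why it is the cheaper default.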

1.3 Why Are Probabilities Squared in Gini?

In the Gini impurity formula, the probabilities are squared to penalize larger class proportions more heavily. The intuition behind squaring the probabilities is as follows:

Gini Impurity Formula:

$\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2$

Where:

  • $p_i$ is the probability of class $i$.

Why are probabilities squared?

  1. Emphasizes dominant classes:

    • Squaring the probabilities ensures that dominant classes (those with higher probability) have a greater effect on the impurity score.
    • For example, if a dataset has 90% of one class and 10% of another, the dominant class (90%) will have a larger impact on the Gini value.
  2. Minimizes impurity:

    • The squared term makes it so that when the class distribution is more mixed (e.g., the classes are evenly represented), the Gini impurity value increases, driving the algorithm to prefer splits that better separate the data.
  3. Encourages homogeneous nodes:

    • A pure node (where all instances belong to one class) has a Gini value of 0, because $p_i = 1$ for the single class present and the sum of squared probabilities is 1.

Example:

For a binary classification problem:

  • If a node has 90% of class A and 10% of class B, then:

    $\text{Gini} = 1 - (0.9^2 + 0.1^2) = 1 - (0.81 + 0.01) = 0.18$

  • This low Gini value indicates that the node is relatively pure because one class dominates.

If the node has 50% of class A and 50% of class B:

    $\text{Gini} = 1 - (0.5^2 + 0.5^2) = 1 - 0.5 = 0.5$

  • This high Gini value means the node is impure (evenly split between the classes).
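
To make this concrete, the short sketch below (an added illustration, not part of the original example) sweeps the proportion p of class A in a binary node and prints $\text{Gini} = 1 - (p^2 + (1-p)^2)$:

import numpy as np

# Binary Gini as a function of the proportion p of class A
for p in np.linspace(0.0, 1.0, 11):
    g = 1 - (p**2 + (1 - p)**2)
    print(f"p = {p:.1f} -> Gini = {g:.2f}")

The printout confirms the intuition above: Gini is 0 when p is 0 or 1 and reaches its maximum of 0.5 at p = 0.5.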

2. Toy Example

import numpy as np

# Function to calculate Gini impurity
def gini_impurity(labels):
    # Calculate the probabilities of each class
    class_counts = np.bincount(labels)
    probabilities = class_counts / len(labels)
    # Compute Gini impurity: 1 - sum(p^2)
    gini = 1 - np.sum(probabilities**2)
    return gini

# Toy dataset: [1, 1, 1, 0, 0, 0, 1, 0, 0, 1]
# Class 1 = "Play", Class 0 = "Don't Play"
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 1])

# Calculate the Gini impurity for the whole dataset
gini = gini_impurity(labels)
print(f"Gini Impurity for the dataset: {gini:.4f}")
  • Labels: The dataset has 6 instances of class 0 (“Don’t Play”) and 4 instances of class 1 (“Play”).
  • Gini Impurity formula: $\text{Gini} = 1 - \sum_i p_i^2$, where $p_i$ is the probability of each class.

Output:

Gini Impurity for the dataset: 0.4800

Interpretation

  • A Gini impurity of 0.48 indicates that the dataset is fairly mixed between the two classes.
  • A lower Gini value (closer to 0) would indicate more purity (one class dominates).
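
Since this section is about Gini index splitting, a natural extension is to score a candidate split by the weighted average Gini of the child nodes. The sketch below is a hypothetical illustration: the left/right partition of the toy labels is made up to show the mechanics, not derived from any feature in the original example:

import numpy as np

def gini_impurity(labels):
    # Same helper as in the toy example above
    probabilities = np.bincount(labels) / len(labels)
    return 1 - np.sum(probabilities**2)

# Hypothetical split of the toy labels into two child nodes
left = np.array([1, 1, 1, 1, 0])        # mostly "Play"
right = np.array([0, 0, 0, 0, 0])       # all "Don't Play"
parent = np.concatenate([left, right])  # same 4/6 mix as the toy dataset

# Weighted average Gini of the children
n = len(parent)
weighted = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

print(f"Parent Gini:   {gini_impurity(parent):.4f}")  # 0.4800
print(f"Weighted Gini: {weighted:.4f}")               # 0.1600
print(f"Gini decrease: {gini_impurity(parent) - weighted:.4f}")  # 0.3200

In CART, the candidate split with the largest decrease in weighted Gini (equivalently, the lowest weighted child Gini) is the one chosen at each node.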