# Decision Trees
## Entropy and Information Gain
The entropy of a dataset \(S\) with classes \(C\) is:
\[H(S) = -\sum_{c \in C} p_c \log_2(p_c)\]
where \(p_c\) is the proportion of examples belonging to class \(c\). Entropy is maximised when classes are equally distributed and zero when all examples belong to a single class.
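As a quick sanity check, the definition can be evaluated directly; `entropy` below is a throwaway helper written for this note, not a library function:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum over classes of p_c * log2(p_c)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(['+', '-', '+', '-']))     # evenly split classes: 1.0
print(entropy(['+'] * 6 + ['-'] * 4))    # 6+/4- split: ~0.971
```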
The information gain of splitting dataset \(S\) on attribute \(A\) is:
\[\text{IG}(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)\]
where \(S_v\) is the subset of \(S\) for which attribute \(A\) has value \(v\). The attribute with the highest information gain is chosen at each node.
The ID3 algorithm greedily selects the attribute maximising information gain at each step. Its successor C4.5 uses gain ratio instead (normalising by split information) to avoid bias toward attributes with many values.
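The greedy loop can be sketched as a short recursive function. This is a minimal illustration for categorical attributes; the function names and the nested-dict tree encoding are choices made here, not a standard API, and real implementations add pruning and missing-value handling:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, a):
    """IG of splitting on attribute index `a`."""
    n, gain = len(labels), entropy(labels)
    for v in set(r[a] for r in rows):
        sub = [labels[i] for i, r in enumerate(rows) if r[a] == v]
        gain -= len(sub) / n * entropy(sub)
    return gain

def id3(rows, labels, attrs):
    """Grow a tree as nested dicts keyed by (attribute, value); leaves are labels."""
    if len(set(labels)) == 1:          # pure node
        return labels[0]
    if not attrs:                      # attributes exhausted: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {}
    for v in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        tree[(best, v)] = id3([rows[i] for i in idx],
                              [labels[i] for i in idx],
                              [a for a in attrs if a != best])
    return tree

# toy two-attribute dataset: the root split lands on the higher-gain attribute
rows = [(1, 1), (1, 1), (1, 1), (1, 1), (1, 0),
        (1, 0), (0, 1), (0, 0), (0, 0), (0, 0)]
labels = ['+', '-', '-', '-', '+', '+', '-', '+', '+', '+']
print(id3(rows, labels, [0, 1]))
```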
### Worked Example: Boolean Attributes
Consider the following dataset with two binary attributes \(A\), \(B\) and a binary class label:
| # | \(A\) | \(B\) | Class |
|---|---|---|---|
| 1 | 1 | 1 | + |
| 2 | 1 | 1 | \(-\) |
| 3 | 1 | 1 | \(-\) |
| 4 | 1 | 1 | \(-\) |
| 5 | 1 | 0 | + |
| 6 | 1 | 0 | + |
| 7 | 0 | 1 | \(-\) |
| 8 | 0 | 0 | + |
| 9 | 0 | 0 | + |
| 10 | 0 | 0 | + |
(a) Entropy of \(S\).
There are 6 positive and 4 negative examples, so:
\[H(S) = -\frac{6}{10}\log_2\frac{6}{10} - \frac{4}{10}\log_2\frac{4}{10} \approx 0.971\]
(b) Information Gain of \(A\).
Splitting on \(A\):
- \(A = 1\): 6 examples (3+, 3\(-\)) \(\Rightarrow H(S_{A=1}) = 1.0\)
- \(A = 0\): 4 examples (3+, 1\(-\)) \(\Rightarrow H(S_{A=0}) \approx 0.811\)
\[\text{IG}(S, A) = 0.971 - \frac{6}{10}(1.0) - \frac{4}{10}(0.811) \approx 0.046\]
(c) Information Gain of \(B\).
Splitting on \(B\):
- \(B = 1\): 5 examples (1+, 4\(-\)) \(\Rightarrow H(S_{B=1}) \approx 0.722\)
- \(B = 0\): 5 examples (5+, 0\(-\)) \(\Rightarrow H(S_{B=0}) = 0\)
\[\text{IG}(S, B) = 0.971 - \frac{5}{10}(0.722) - \frac{5}{10}(0) \approx 0.610\]
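Parts (a) to (c) can be verified numerically. `H` and `ig` below are throwaway helpers written for this check:

```python
from collections import Counter
from math import log2

def H(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# the ten training examples as (A, B, class) triples
data = [(1, 1, '+'), (1, 1, '-'), (1, 1, '-'), (1, 1, '-'), (1, 0, '+'),
        (1, 0, '+'), (0, 1, '-'), (0, 0, '+'), (0, 0, '+'), (0, 0, '+')]
labels = [c for _, _, c in data]

def ig(attr_idx):
    """Information gain of splitting on column `attr_idx`."""
    gain = H(labels)
    for v in (0, 1):
        sub = [r[2] for r in data if r[attr_idx] == v]
        gain -= len(sub) / len(data) * H(sub)
    return gain

print(f"H(S)  = {H(labels):.3f}")   # 0.971
print(f"IG(A) = {ig(0):.3f}")       # 0.046 (0.047 if rounded intermediates are used)
print(f"IG(B) = {ig(1):.3f}")       # 0.610
```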
(d) Best attribute and tree comparison.
\(B\) has much higher information gain (\(0.610\) vs \(0.046\)). Building the complete trees reveals the practical impact: splitting on \(B\) first produces a pure leaf immediately (\(B = 0\): all 5 examples are +), classifying half the dataset in a single query. In contrast, splitting on \(A\) first always requires a second split on \(B\): every example traverses two nodes.
Average queries per example: \(2.0\) with \(A\) at the root, but only \(1.5\) with \(B\) at the root.
(e) Ensembles.
A single decision tree can overfit, especially on noisy data. Random Forests address this by training an ensemble of trees, each on a bootstrap sample of the data with a random subset of features considered at each split. The final prediction is the majority vote across all trees.
### Worked Example: 2-of-3 Boolean Function
Consider the Boolean function “exactly 2 of 3 variables are true”:
| \(X\) | \(Y\) | \(Z\) | Class |
|---|---|---|---|
| 0 | 0 | 0 | F |
| 0 | 0 | 1 | F |
| 0 | 1 | 0 | F |
| 0 | 1 | 1 | T |
| 1 | 0 | 0 | F |
| 1 | 0 | 1 | T |
| 1 | 1 | 0 | T |
| 1 | 1 | 1 | F |
(a) Information Gain.
There are 3 true and 5 false examples:
\[H(S) = -\frac{3}{8}\log_2\frac{3}{8} - \frac{5}{8}\log_2\frac{5}{8} \approx 0.954\]
Splitting on \(X\):
- \(X = 0\): 4 examples (1T, 3F) \(\Rightarrow H(S_{X=0}) \approx 0.811\)
- \(X = 1\): 4 examples (2T, 2F) \(\Rightarrow H(S_{X=1}) = 1.0\)
\[\text{IG}(S, X) = 0.954 - \frac{4}{8}(0.811) - \frac{4}{8}(1.0) \approx 0.049\]
The 2-of-3 function is symmetric: permuting \(X\), \(Y\), and \(Z\) leaves it unchanged, so \(\text{IG}(S, Y) = \text{IG}(S, Z) \approx 0.049\) as well. All three attributes are equally informative, and the algorithm must break the tie arbitrarily.
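The three-way tie can be confirmed by computing every gain from the truth table (a short sketch using only the standard library):

```python
from collections import Counter
from math import log2

def H(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# truth table of "exactly 2 of 3 are true": rows are (X, Y, Z) assignments
rows = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
labels = ['T' if sum(r) == 2 else 'F' for r in rows]

gains = {}
for i, name in enumerate('XYZ'):
    g = H(labels)
    for v in (0, 1):
        sub = [l for r, l in zip(rows, labels) if r[i] == v]
        g -= len(sub) / len(rows) * H(sub)
    gains[name] = g

print({k: round(v, 3) for k, v in gains.items()})   # all three equal: 0.049
```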
(b) Complete decision tree.
Choosing \(X\) at the root (breaking the tie), then recursively splitting:
- \(X = 0\): subset \(\{(0,0,0,\text{F}),\ (0,0,1,\text{F}),\ (0,1,0,\text{F}),\ (0,1,1,\text{T})\}\). Split on \(Y\):
  - \(Y = 0\): all F \(\Rightarrow\) leaf F
  - \(Y = 1\): \(\{(0,1,0,\text{F}),\ (0,1,1,\text{T})\}\). Split on \(Z\):
    - \(Z = 0 \Rightarrow\) F, \(\quad Z = 1 \Rightarrow\) T
- \(X = 1\): subset \(\{(1,0,0,\text{F}),\ (1,0,1,\text{T}),\ (1,1,0,\text{T}),\ (1,1,1,\text{F})\}\). Split on \(Y\):
  - \(Y = 0\): \(\{(1,0,0,\text{F}),\ (1,0,1,\text{T})\}\). Split on \(Z\):
    - \(Z = 0 \Rightarrow\) F, \(\quad Z = 1 \Rightarrow\) T
  - \(Y = 1\): \(\{(1,1,0,\text{T}),\ (1,1,1,\text{F})\}\). Split on \(Z\):
    - \(Z = 0 \Rightarrow\) T, \(\quad Z = 1 \Rightarrow\) F
The 2-of-3 function requires every path from root to leaf to query all three attributes. No single attribute provides a strong signal on its own (all have the same low information gain). Functions with this XOR-like structure are inherently hard for greedy tree learners — the tree cannot simplify the problem early.
### Worked Example: Gain Ratio
Consider the following dataset for predicting a gem’s homeworld (binary class). There are 4 attributes and 12 examples:
| species | rebel | age | ability | homeworld |
|---|---|---|---|---|
| pearl | yes | 6000 | regeneration | no |
| bismuth | yes | 8000 | regeneration | no |
| pearl | no | 6000 | weapon-summoning | no |
| garnet | yes | 5000 | regeneration | no |
| amethyst | no | 6000 | shapeshifting | no |
| amethyst | yes | 5000 | shapeshifting | no |
| garnet | yes | 6000 | weapon-summoning | no |
| diamond | no | 6000 | regeneration | yes |
| diamond | no | 8000 | regeneration | yes |
| amethyst | no | 5000 | shapeshifting | yes |
| pearl | no | 8000 | shapeshifting | yes |
| jasper | no | 6000 | weapon-summoning | yes |
(a) Which attributes are most predictive of the class? You might guess that “age” and “ability” are not very predictive — but you might be surprised that “species” has higher information gain than “rebel”. Why?
(b) Suppose for attribute “species” the Information Gain is \(0.52\) and Split Information is \(2.46\), whereas for “rebel” the Information Gain is \(0.48\) and Split Information is \(0.98\). Which attribute would be selected under the Gain Ratio criterion? Is this a better choice?
```python
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

# --- dataset ---
species = ['pearl','bismuth','pearl','garnet','amethyst','amethyst',
           'garnet','diamond','diamond','amethyst','pearl','jasper']
rebel = ['yes','yes','no','yes','no','yes','yes','no','no','no','no','no']
age = ['6000','8000','6000','5000','6000','5000',
       '6000','6000','8000','5000','8000','6000']
ability = ['regeneration','regeneration','weapon-summoning','regeneration',
           'shapeshifting','shapeshifting','weapon-summoning','regeneration',
           'regeneration','shapeshifting','shapeshifting','weapon-summoning']
homeworld = ['no','no','no','no','no','no','no','yes','yes','yes','yes','yes']

def H(labels):
    """Entropy."""
    n = len(labels)
    if n == 0:
        return 0
    counts = Counter(labels)
    return -sum((c/n)*np.log2(c/n) for c in counts.values())

def IG(attr, labels):
    """Information gain."""
    n, h = len(labels), H(labels)
    for v in set(attr):
        sub = [labels[i] for i in range(n) if attr[i] == v]
        h -= (len(sub)/n) * H(sub)
    return h

def SI(attr):
    """Split information."""
    n = len(attr)
    counts = Counter(attr)
    return -sum((c/n)*np.log2(c/n) for c in counts.values())

attrs = {'species': species, 'rebel': rebel, 'age': age, 'ability': ability}
print(f"H(S) = {H(homeworld):.4f}\n")
print(f"{'Attribute':18s} {'IG':>7s} {'SplitInfo':>10s} {'GainRatio':>10s}")
print("-" * 48)
for name, attr in attrs.items():
    ig, si = IG(attr, homeworld), SI(attr)
    print(f"{name:18s} {ig:7.4f} {si:10.4f} {ig/si:10.4f}")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4.5))
names = list(attrs.keys())
igs = [IG(a, homeworld) for a in attrs.values()]
cols = ['#4C72B0', '#DD8452', '#55A868', '#C44E52']

# left panel: IG for all attributes
bars = ax1.bar(names, igs, color=cols, edgecolor='k', lw=.5)
ax1.set_ylabel('Information Gain')
ax1.set_title('Information Gain by Attribute')
for b, v in zip(bars, igs):
    ax1.text(b.get_x() + b.get_width()/2, v + .008,
             f'{v:.3f}', ha='center', fontsize=9)
ax1.set_ylim(0, max(igs) * 1.25)

# right panel: IG vs Gain Ratio for species & rebel
x, w = np.arange(2), 0.30
ig_pair = [IG(species, homeworld), IG(rebel, homeworld)]
gr_pair = [ig_pair[0] / SI(species), ig_pair[1] / SI(rebel)]
b1 = ax2.bar(x - w/2, ig_pair, w, label='Info Gain',
             color='#4C72B0', edgecolor='k', lw=.5)
b2 = ax2.bar(x + w/2, gr_pair, w, label='Gain Ratio',
             color='#DD8452', edgecolor='k', lw=.5)
ax2.set_xticks(x)
ax2.set_xticklabels(['species', 'rebel'])
ax2.set_title('Information Gain vs Gain Ratio')
ax2.legend(fontsize=9)
for b in list(b1) + list(b2):
    ax2.text(b.get_x() + b.get_width()/2, b.get_height() + .008,
             f'{b.get_height():.2f}', ha='center', fontsize=9)
plt.tight_layout()
plt.show()
```
Gain Ratio corrects for a bias in Information Gain that favours attributes with many values by dividing by the Split Information — the entropy of the partition itself:
\[\text{GainRatio}(S, A) = \frac{\text{IG}(S, A)}{H_A(S)}, \qquad H_A(S) = -\sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|}\log_2 \frac{|S_v|}{|S|}\]
Here “species” splits the data into 6 small groups (Split Info \(= 2.46\)), inflating its IG to \(0.52\) but yielding Gain Ratio \(\approx 0.21\). Meanwhile “rebel” splits into just 2 groups (Split Info \(= 0.98\)) with IG \(= 0.48\) and Gain Ratio \(\approx 0.49\). Under Gain Ratio, “rebel” is correctly selected — “species” achieved its high IG simply by having many distinct values. Note that other corrections (e.g. normalised IG) have been proposed with different trade-offs; there is no universally best splitting criterion.
## Ensembles and Random Forests
A single decision tree is prone to overfitting: it can memorise noise in the training data, leading to poor generalisation. Ensemble methods address this by combining multiple trees.
Bagging (Bootstrap Aggregating): train each tree on a random bootstrap sample (sampling with replacement) of the training data. Average the predictions (regression) or take a majority vote (classification).
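The bootstrap-and-vote procedure can be sketched in a few lines. This is illustrative only: the 1-nearest-neighbour base learner is a hypothetical stand-in chosen to keep the example short, and in practice each ensemble member would be a decision tree grown on its bootstrap sample:

```python
import random
from collections import Counter

def bootstrap(data):
    """Draw len(data) examples uniformly with replacement."""
    return [random.choice(data) for _ in data]

def bagged_predict(models, x):
    """Majority vote over the ensemble's predictions."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

# Stand-in base learner (hypothetical): 1-nearest-neighbour on one feature.
def train_1nn(sample):
    return lambda x: min(sample, key=lambda p: abs(p[0] - x))[1]

random.seed(0)
data = [(v, 'low' if v < 5 else 'high') for v in range(10)]
models = [train_1nn(bootstrap(data)) for _ in range(25)]
print(bagged_predict(models, 2.0), bagged_predict(models, 8.0))
```

Each model sees a slightly different resample of the data, so individual errors tend to cancel in the vote; that cancellation is the variance reduction bagging is after.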
Random Forests: bagging with an additional twist — at each split, only a random subset of \(m \approx \sqrt{p}\) features (out of \(p\) total) is considered. This decorrelates the trees, reducing variance further.
Boosting (AdaBoost, Gradient Boosting): train trees sequentially, with each new tree focusing on examples the previous ensemble got wrong. Boosting reduces both bias and variance but is more sensitive to noisy data.