# Decision Trees
## Entropy and Information Gain
The entropy of a dataset \(S\) with classes \(C\) is:
\[H(S) = -\sum_{c \in C} p_c \log_2(p_c)\]
where \(p_c\) is the proportion of examples belonging to class \(c\). Entropy is maximised when classes are equally distributed and zero when all examples belong to a single class.
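As a quick sanity check, the definition can be evaluated directly; `entropy` below is a throwaway helper written for this note, not a library function:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum over classes of p_c * log2(p_c)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(['+', '-', '+', '-']))     # evenly split classes: 1.0
print(entropy(['+'] * 6 + ['-'] * 4))    # 6+/4- split: ~0.971
```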
The information gain of splitting dataset \(S\) on attribute \(A\) is:
\[\text{IG}(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)\]
where \(S_v\) is the subset of \(S\) for which attribute \(A\) has value \(v\). The attribute with the highest information gain is chosen at each node.
The ID3 algorithm greedily selects the attribute maximising information gain at each step. Its successor C4.5 uses gain ratio instead (normalising by split information) to avoid bias toward attributes with many values.
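The greedy loop can be sketched as a short recursive function. This is a minimal illustration for categorical attributes; the function names and the nested-dict tree encoding are choices made here, not a standard API, and real implementations add pruning and missing-value handling:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, a):
    """IG of splitting on attribute index `a`."""
    n, gain = len(labels), entropy(labels)
    for v in set(r[a] for r in rows):
        sub = [labels[i] for i, r in enumerate(rows) if r[a] == v]
        gain -= len(sub) / n * entropy(sub)
    return gain

def id3(rows, labels, attrs):
    """Grow a tree as nested dicts keyed by (attribute, value); leaves are labels."""
    if len(set(labels)) == 1:          # pure node
        return labels[0]
    if not attrs:                      # attributes exhausted: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {}
    for v in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        tree[(best, v)] = id3([rows[i] for i in idx],
                              [labels[i] for i in idx],
                              [a for a in attrs if a != best])
    return tree

# toy two-attribute dataset: the root split lands on the higher-gain attribute
rows = [(1, 1), (1, 1), (1, 1), (1, 1), (1, 0),
        (1, 0), (0, 1), (0, 0), (0, 0), (0, 0)]
labels = ['+', '-', '-', '-', '+', '+', '-', '+', '+', '+']
print(id3(rows, labels, [0, 1]))
```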
### Worked Example: Boolean Attributes
Consider the following dataset with two binary attributes \(A\), \(B\) and a binary class label:
| # | \(A\) | \(B\) | Class |
|---|---|---|---|
| 1 | 1 | 1 | + |
| 2 | 1 | 1 | \(-\) |
| 3 | 1 | 1 | \(-\) |
| 4 | 1 | 1 | \(-\) |
| 5 | 1 | 0 | + |
| 6 | 1 | 0 | + |
| 7 | 0 | 1 | \(-\) |
| 8 | 0 | 0 | + |
| 9 | 0 | 0 | + |
| 10 | 0 | 0 | + |
(a) Entropy of \(S\).
There are 6 positive and 4 negative examples, so:
\[H(S) = -\frac{6}{10}\log_2\frac{6}{10} - \frac{4}{10}\log_2\frac{4}{10} \approx 0.971\]
(b) Information Gain of \(A\).
Splitting on \(A\):
- \(A = 1\): 6 examples (3+, 3\(-\)) \(\Rightarrow H(S_{A=1}) = 1.0\)
- \(A = 0\): 4 examples (3+, 1\(-\)) \(\Rightarrow H(S_{A=0}) \approx 0.811\)
\[\text{IG}(S, A) = 0.971 - \frac{6}{10}(1.0) - \frac{4}{10}(0.811) \approx 0.046\]
(c) Information Gain of \(B\).
Splitting on \(B\):
- \(B = 1\): 5 examples (1+, 4\(-\)) \(\Rightarrow H(S_{B=1}) \approx 0.722\)
- \(B = 0\): 5 examples (5+, 0\(-\)) \(\Rightarrow H(S_{B=0}) = 0\)
\[\text{IG}(S, B) = 0.971 - \frac{5}{10}(0.722) - \frac{5}{10}(0) \approx 0.610\]
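Parts (a) to (c) can be verified numerically. `H` and `ig` below are throwaway helpers written for this check:

```python
from collections import Counter
from math import log2

def H(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# the ten training examples as (A, B, class) triples
data = [(1, 1, '+'), (1, 1, '-'), (1, 1, '-'), (1, 1, '-'), (1, 0, '+'),
        (1, 0, '+'), (0, 1, '-'), (0, 0, '+'), (0, 0, '+'), (0, 0, '+')]
labels = [c for _, _, c in data]

def ig(attr_idx):
    """Information gain of splitting on column `attr_idx`."""
    gain = H(labels)
    for v in (0, 1):
        sub = [r[2] for r in data if r[attr_idx] == v]
        gain -= len(sub) / len(data) * H(sub)
    return gain

print(f"H(S)  = {H(labels):.3f}")   # 0.971
print(f"IG(A) = {ig(0):.3f}")       # 0.046 (0.047 if rounded intermediates are used)
print(f"IG(B) = {ig(1):.3f}")       # 0.610
```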
(d) Best attribute and tree comparison.
\(B\) has much higher information gain (\(0.610\) vs \(0.046\)). Building the complete trees reveals the practical impact: splitting on \(B\) first produces a pure leaf immediately (\(B = 0\): all 5 examples are +), classifying half the dataset in a single query. In contrast, splitting on \(A\) first always requires a second split on \(B\): every example traverses two nodes.
Average queries per example: \(2.0\) with \(A\) at the root, but only \(1.5\) with \(B\) at the root.
(e) Ensembles.
A single decision tree can overfit, especially on noisy data. Random Forests address this by training an ensemble of trees, each on a bootstrap sample of the data with a random subset of features considered at each split. The final prediction is the majority vote across all trees.
### Worked Example: 2-of-3 Boolean Function
Consider the Boolean function “exactly 2 of 3 variables are true”:
| \(X\) | \(Y\) | \(Z\) | Class |
|---|---|---|---|
| 0 | 0 | 0 | F |
| 0 | 0 | 1 | F |
| 0 | 1 | 0 | F |
| 0 | 1 | 1 | T |
| 1 | 0 | 0 | F |
| 1 | 0 | 1 | T |
| 1 | 1 | 0 | T |
| 1 | 1 | 1 | F |
(a) Information Gain.
There are 3 true and 5 false examples:
\[H(S) = -\frac{3}{8}\log_2\frac{3}{8} - \frac{5}{8}\log_2\frac{5}{8} \approx 0.954\]
Splitting on \(X\):
- \(X = 0\): 4 examples (1T, 3F) \(\Rightarrow H(S_{X=0}) \approx 0.811\)
- \(X = 1\): 4 examples (2T, 2F) \(\Rightarrow H(S_{X=1}) = 1.0\)
\[\text{IG}(S, X) = 0.954 - \frac{4}{8}(0.811) - \frac{4}{8}(1.0) \approx 0.049\]
The 2-of-3 function is symmetric: permuting \(X\), \(Y\), and \(Z\) leaves it unchanged, so \(\text{IG}(S, Y) = \text{IG}(S, Z) \approx 0.049\) as well. All three attributes are equally informative, and the algorithm must break the tie arbitrarily.
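The three-way tie can be confirmed by computing every gain from the truth table (a short sketch using only the standard library):

```python
from collections import Counter
from math import log2

def H(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# truth table of "exactly 2 of 3 are true": rows are (X, Y, Z) assignments
rows = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
labels = ['T' if sum(r) == 2 else 'F' for r in rows]

gains = {}
for i, name in enumerate('XYZ'):
    g = H(labels)
    for v in (0, 1):
        sub = [l for r, l in zip(rows, labels) if r[i] == v]
        g -= len(sub) / len(rows) * H(sub)
    gains[name] = g

print({k: round(v, 3) for k, v in gains.items()})   # all three equal: 0.049
```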
(b) Complete decision tree.
Choosing \(X\) at the root (breaking the tie), then recursively splitting:
- \(X = 0\): subset \(\{(0,0,0,\text{F}),\ (0,0,1,\text{F}),\ (0,1,0,\text{F}),\ (0,1,1,\text{T})\}\). Split on \(Y\):
  - \(Y = 0\): all F \(\Rightarrow\) leaf F
  - \(Y = 1\): \(\{(0,1,0,\text{F}),\ (0,1,1,\text{T})\}\). Split on \(Z\):
    - \(Z = 0 \Rightarrow\) F, \(\quad Z = 1 \Rightarrow\) T
- \(X = 1\): subset \(\{(1,0,0,\text{F}),\ (1,0,1,\text{T}),\ (1,1,0,\text{T}),\ (1,1,1,\text{F})\}\). Split on \(Y\):
  - \(Y = 0\): \(\{(1,0,0,\text{F}),\ (1,0,1,\text{T})\}\). Split on \(Z\):
    - \(Z = 0 \Rightarrow\) F, \(\quad Z = 1 \Rightarrow\) T
  - \(Y = 1\): \(\{(1,1,0,\text{T}),\ (1,1,1,\text{F})\}\). Split on \(Z\):
    - \(Z = 0 \Rightarrow\) T, \(\quad Z = 1 \Rightarrow\) F
The 2-of-3 function requires every path from root to leaf to query all three attributes. No single attribute provides a strong signal on its own (all have the same low information gain). Functions with this XOR-like structure are inherently hard for greedy tree learners — the tree cannot simplify the problem early.
### Worked Example: Gain Ratio
Consider the following dataset for predicting a gem’s homeworld (binary class). There are 4 attributes and 12 examples:
| species | rebel | age | ability | homeworld |
|---|---|---|---|---|
| pearl | yes | 6000 | regeneration | no |
| bismuth | yes | 8000 | regeneration | no |
| pearl | no | 6000 | weapon-summoning | no |
| garnet | yes | 5000 | regeneration | no |
| amethyst | no | 6000 | shapeshifting | no |
| amethyst | yes | 5000 | shapeshifting | no |
| garnet | yes | 6000 | weapon-summoning | no |
| diamond | no | 6000 | regeneration | yes |
| diamond | no | 8000 | regeneration | yes |
| amethyst | no | 5000 | shapeshifting | yes |
| pearl | no | 8000 | shapeshifting | yes |
| jasper | no | 6000 | weapon-summoning | yes |
(a) Which attributes are most predictive of the class? You might guess that “age” and “ability” are not very predictive — but you might be surprised that “species” has higher information gain than “rebel”. Why?
(b) Suppose for attribute “species” the Information Gain is \(0.52\) and Split Information is \(2.46\), whereas for “rebel” the Information Gain is \(0.48\) and Split Information is \(0.98\). Which attribute would be selected under the Gain Ratio criterion? Is this a better choice?
```python
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

# --- dataset ---
species = ['pearl','bismuth','pearl','garnet','amethyst','amethyst',
           'garnet','diamond','diamond','amethyst','pearl','jasper']
rebel = ['yes','yes','no','yes','no','yes','yes','no','no','no','no','no']
age = ['6000','8000','6000','5000','6000','5000',
       '6000','6000','8000','5000','8000','6000']
ability = ['regeneration','regeneration','weapon-summoning','regeneration',
           'shapeshifting','shapeshifting','weapon-summoning','regeneration',
           'regeneration','shapeshifting','shapeshifting','weapon-summoning']
homeworld = ['no','no','no','no','no','no','no','yes','yes','yes','yes','yes']

def H(labels):
    """Entropy."""
    n = len(labels)
    if n == 0:
        return 0
    counts = Counter(labels)
    return -sum((c/n)*np.log2(c/n) for c in counts.values())

def IG(attr, labels):
    """Information gain."""
    n, h = len(labels), H(labels)
    for v in set(attr):
        sub = [labels[i] for i in range(n) if attr[i] == v]
        h -= (len(sub)/n) * H(sub)
    return h

def SI(attr):
    """Split information."""
    n = len(attr)
    counts = Counter(attr)
    return -sum((c/n)*np.log2(c/n) for c in counts.values())

attrs = {'species': species, 'rebel': rebel, 'age': age, 'ability': ability}
print(f"H(S) = {H(homeworld):.4f}\n")
print(f"{'Attribute':18s} {'IG':>7s} {'SplitInfo':>10s} {'GainRatio':>10s}")
print("-" * 48)
for name, attr in attrs.items():
    ig, si = IG(attr, homeworld), SI(attr)
    print(f"{name:18s} {ig:7.4f} {si:10.4f} {ig/si:10.4f}")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4.5))
names = list(attrs.keys())
igs = [IG(a, homeworld) for a in attrs.values()]
cols = ['#4C72B0', '#DD8452', '#55A868', '#C44E52']

# left panel: IG for all attributes
bars = ax1.bar(names, igs, color=cols, edgecolor='k', lw=.5)
ax1.set_ylabel('Information Gain')
ax1.set_title('Information Gain by Attribute')
for b, v in zip(bars, igs):
    ax1.text(b.get_x() + b.get_width()/2, v + .008,
             f'{v:.3f}', ha='center', fontsize=9)
ax1.set_ylim(0, max(igs) * 1.25)

# right panel: IG vs Gain Ratio for species & rebel
x, w = np.arange(2), 0.30
ig_pair = [IG(species, homeworld), IG(rebel, homeworld)]
gr_pair = [ig_pair[0] / SI(species), ig_pair[1] / SI(rebel)]
b1 = ax2.bar(x - w/2, ig_pair, w, label='Info Gain',
             color='#4C72B0', edgecolor='k', lw=.5)
b2 = ax2.bar(x + w/2, gr_pair, w, label='Gain Ratio',
             color='#DD8452', edgecolor='k', lw=.5)
ax2.set_xticks(x)
ax2.set_xticklabels(['species', 'rebel'])
ax2.set_title('Information Gain vs Gain Ratio')
ax2.legend(fontsize=9)
for b in list(b1) + list(b2):
    ax2.text(b.get_x() + b.get_width()/2, b.get_height() + .008,
             f'{b.get_height():.2f}', ha='center', fontsize=9)
plt.tight_layout()
plt.show()
```
Gain Ratio corrects for a bias in Information Gain that favours attributes with many values by dividing by the Split Information — the entropy of the partition itself:
\[\text{GainRatio}(S, A) = \frac{\text{IG}(S, A)}{H_A(S)}, \qquad H_A(S) = -\sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|}\log_2 \frac{|S_v|}{|S|}\]
Here “species” splits the data into 6 small groups (Split Info \(= 2.46\)), inflating its IG to \(0.52\) but yielding Gain Ratio \(\approx 0.21\). Meanwhile “rebel” splits into just 2 groups (Split Info \(= 0.98\)) with IG \(= 0.48\) and Gain Ratio \(\approx 0.49\). Under Gain Ratio, “rebel” is correctly selected — “species” achieved its high IG simply by having many distinct values. Note that other corrections (e.g. normalised IG) have been proposed with different trade-offs; there is no universally best splitting criterion.
## Ensembles and Random Forests
A single decision tree is prone to overfitting: it can memorise noise in the training data, leading to poor generalisation. Ensemble methods address this by combining multiple trees.
Bagging (Bootstrap Aggregating): train each tree on a random bootstrap sample (sampling with replacement) of the training data. Average the predictions (regression) or take a majority vote (classification).
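The bootstrap-and-vote procedure can be sketched in a few lines. This is illustrative only: the 1-nearest-neighbour base learner is a hypothetical stand-in chosen to keep the example short, and in practice each ensemble member would be a decision tree grown on its bootstrap sample:

```python
import random
from collections import Counter

def bootstrap(data):
    """Draw len(data) examples uniformly with replacement."""
    return [random.choice(data) for _ in data]

def bagged_predict(models, x):
    """Majority vote over the ensemble's predictions."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

# Stand-in base learner (hypothetical): 1-nearest-neighbour on one feature.
def train_1nn(sample):
    return lambda x: min(sample, key=lambda p: abs(p[0] - x))[1]

random.seed(0)
data = [(v, 'low' if v < 5 else 'high') for v in range(10)]
models = [train_1nn(bootstrap(data)) for _ in range(25)]
print(bagged_predict(models, 2.0), bagged_predict(models, 8.0))
```

Each model sees a slightly different resample of the data, so individual errors tend to cancel in the vote; that cancellation is the variance reduction bagging is after.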
Random Forests: bagging with an additional twist — at each split, only a random subset of \(m \approx \sqrt{p}\) features (out of \(p\) total) is considered. This decorrelates the trees, reducing variance further.
Boosting (AdaBoost, Gradient Boosting): train trees sequentially, with each new tree focusing on examples the previous ensemble got wrong. Boosting reduces both bias and variance but is more sensitive to noisy data.