Theory

Decision making at each step of a decision tree is a process of reducing information uncertainty.

Entropy

Entropy is a common way to measure uncertainty.

Information Entropy

H(X) = -\sum\limits_{i} p_{i} \log_2 p_{i}
  • p_{i} is the probability of class i, estimated as the fraction of samples of class i in the set
  • Entropy comes from information theory
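A minimal sketch, assuming the probabilities p_{i} are estimated from the label frequencies of a node and NumPy is available:

```python
import numpy as np

def entropy(labels):
    """H(X) = -sum_i p_i * log2(p_i), with p_i estimated as class frequencies."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy(["yes", "yes", "no", "no", "no"]))  # ~0.971 bits
```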

Conditional Entropy

H(Y|X) = \sum\limits_{i=1}^n p_{i} H(Y|X=x_{i}), \quad p_{i} = P(X=x_{i})

Information Gain

Calculating Method 1

Introducing information about a feature I reduces the entropy of U.
To obtain the maximum information gain, we choose the feature I that reduces the information entropy of U the most.

g(U, I) = H(U) - H(U|I)
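A small sketch of this calculation, assuming U is a column of labels, I is a candidate feature column, and probabilities are again estimated from frequencies:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature_values):
    """g(U, I) = H(U) - H(U|I); the conditional entropy H(U|I) is a
    frequency-weighted sum of the entropies of each feature-value subset."""
    labels = np.asarray(labels)
    feature_values = np.asarray(feature_values)
    h_conditional = sum(
        np.mean(feature_values == v) * entropy(labels[feature_values == v])
        for v in np.unique(feature_values)
    )
    return entropy(labels) - h_conditional
```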

Calculating Method 2

The information gain I(C;F) of the class variable C with possible values \{c_{1}, c_{2}, \dots, c_{m}\} with respect to the feature variable F with possible values \{f_1, f_2, \dots, f_d\} is defined by:

I(C;F) = \sum\limits_{i=1}^m \sum\limits_{j=1}^d P(C=c_i,F=f_j)\log_2 \frac{P(C=c_i,F=f_j)}{P(C=c_i)P(F=f_j)}

These probabilities are estimated from frequencies in the training data.

  • P(C = c_i) is the probability of class C having value c_i.
  • P(F = f_j) is the probability of feature F having value f_j.
  • P(C = c_i, F = f_j) is the joint probability of C = c_i and F = f_j.
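A sketch of this computation from raw class and feature columns, with all probabilities estimated from frequencies as stated above:

```python
import numpy as np

def mutual_information(c, f):
    """I(C;F) = sum_ij P(c_i, f_j) * log2(P(c_i, f_j) / (P(c_i) * P(f_j)))."""
    c, f = np.asarray(c), np.asarray(f)
    mi = 0.0
    for ci in np.unique(c):
        for fj in np.unique(f):
            p_cf = np.mean((c == ci) & (f == fj))   # joint probability
            if p_cf == 0:
                continue                            # 0 * log(0) treated as 0
            mi += p_cf * np.log2(p_cf / (np.mean(c == ci) * np.mean(f == fj)))
    return mi
```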

ID3 & C4.5

ID3

The core idea of the ID3 (Iterative Dichotomiser 3) algorithm is to build a decision tree recursively, using information gain as the criterion for selecting the splitting feature at each node.
The process can be summarised as follows:

  • Start from the root node and calculate the information gain for all candidate splits
  • Select the feature with the highest information gain as the splitting feature, and create a child node for each of its values
  • Apply the above steps recursively to the child nodes to build the decision tree, until no remaining feature offers more than a negligible information gain
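A minimal recursive sketch of this procedure, assuming samples are dicts of feature name to value, the `information_gain` helper sketched earlier is in scope, and the stopping rules are kept deliberately simple:

```python
from collections import Counter

def id3(samples, labels, features):
    """Returns a class label for a leaf, or (feature, {value: subtree}) for a node."""
    if len(set(labels)) == 1:                 # pure node: stop
        return labels[0]
    if not features:                          # no features left: majority class
        return Counter(labels).most_common(1)[0][0]
    # Pick the feature with the highest information gain.
    best = max(features, key=lambda f: information_gain(labels, [s[f] for s in samples]))
    node = (best, {})
    for value in set(s[best] for s in samples):
        idx = [i for i, s in enumerate(samples) if s[best] == value]
        node[1][value] = id3([samples[i] for i in idx],
                             [labels[i] for i in idx],
                             [f for f in features if f != best])
    return node
```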

C4.5

The C4.5 algorithm uses the information gain ratio as the criterion for feature selection.
The feature with the highest information gain ratio is selected to split the current sample set.

g_R(D,A)=\frac{g(D,A)}{H_A(D)}
  • g_R(D,A) : The information gain ratio of feature A with respect to the training set D
  • g(D,A) : Its information gain
  • H_A(D) : The information entropy of the training set D with respect to the values of feature A
  • H_A(D) = -\sum\limits_{i=1}^n\frac{|D_i|}{|D|}\log_2\frac{|D_i|}{|D|}, where D_i is the subset of D in which feature A takes its i-th value
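A sketch of the gain ratio, again assuming the `information_gain` helper from the earlier sketch; the split entropy H_A(D) is computed from the feature's own value frequencies:

```python
import numpy as np

def gain_ratio(labels, feature_values):
    """g_R(D, A) = g(D, A) / H_A(D)."""
    feature_values = np.asarray(feature_values)
    _, counts = np.unique(feature_values, return_counts=True)
    p = counts / counts.sum()
    h_a = -np.sum(p * np.log2(p))      # H_A(D): entropy of D w.r.t. the values of A
    if h_a == 0:                       # feature takes a single value: no useful split
        return 0.0
    return information_gain(labels, feature_values) / h_a
```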

Pruning

Pruning of the decision tree is often achieved by minimizing the loss function or cost function of the decision tree as a whole.

  • |T| is the number of leaf nodes of tree T
  • t is a leaf node of tree T with N_t sample points, of which N_{tk} belong to class k, k = 1, 2, ..., K
  • H_t(T) is the empirical entropy on leaf node t, and α ≥ 0 is a parameter

Then the loss function of the decision tree can be defined as

C_\alpha (T) = \sum\limits_{t=1}^{|T|}N_tH_t(T)+\alpha|T|

Empirical entropy

H_t(T)=-\sum\limits_{k} \frac{N_{tk}}{N_t} \log \frac{N_{tk}}{N_t}

Hence

C(T)=\sum\limits_{t=1}^{|T|}N_tH_t(T)=-\sum\limits_{t=1}^{|T|} \sum\limits_{k=1}^{K}N_{tk}\log \frac{N_{tk}}{N_t}

which can be rewritten as

C_\alpha(T)=C(T)+\alpha|T|

A larger α drives the selection of a simpler model (a tree with fewer leaf nodes).
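A toy sketch of this loss, where each leaf t is represented by its per-class counts N_{tk} (the counts below are hypothetical; natural log is used as in the formulas above):

```python
import numpy as np

def tree_loss(leaf_class_counts, alpha):
    """C_alpha(T) = sum_t N_t * H_t(T) + alpha * |T|."""
    cost = 0.0
    for counts in leaf_class_counts:
        counts = np.asarray(counts, dtype=float)
        n_t = counts.sum()
        p = counts[counts > 0] / n_t
        cost += -np.sum(counts[counts > 0] * np.log(p))   # N_t * H_t(T)
    return cost + alpha * len(leaf_class_counts)

# A pruning step is kept if it does not increase C_alpha(T):
print(tree_loss([[8, 2], [1, 9], [5, 5]], alpha=1.0))   # three leaves
print(tree_loss([[8, 2], [6, 14]], alpha=1.0))          # after merging two leaves
```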

CART

Classification And Regression Tree

Definition

The Classification and Regression Tree is a decision tree that can be used for both classification and regression.

  1. At each internal node, the left branch takes the value of “yes” and the right branch takes the value of “no”.
  2. In this way, the decision tree is constructed by recursively dichotomising each feature, dividing the entire feature space into a finite number of cells.

Steps

  • Generate a decision tree based on the training data set, and the generated tree should be as large as possible.
  • Use the test set to prune the generated tree and select the optimal subtree.

Common Measures of Impurity

  1. Entropy: measures uncertainty

    Entropy=-\sum\limits_{j}p_j\log_2p_j
  2. Gini Index: minimizes the probability of misclassification

    Gini=1-\sum\limits_j{p_j}^2
  3. Classification Error

    Classification\ Error=1-\max\limits_j p_j
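A quick comparison of the three measures on a single node's labels, with class probabilities estimated from frequencies:

```python
import numpy as np

def impurities(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return {
        "entropy": -np.sum(p * np.log2(p)),
        "gini": 1.0 - np.sum(p ** 2),
        "classification_error": 1.0 - p.max(),
    }

print(impurities(["a", "a", "b", "b", "b", "c"]))
```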

Generation

Input: training set D, stopping condition
Output: CART classification decision tree
Starting from the root node, the following operations are performed recursively for each node to construct a binary decision tree.

  1. Let the training set at the current node be D, and calculate the Gini impurity of the existing features for this dataset. For each feature A and each of its possible values a, split D into two parts D_1, D_2 according to whether a sample satisfies A = a, and calculate the Gini impurity of the split A = a.
  2. Among all possible features A and all their possible cut-off points a, select the feature and cut-off point with the smallest Gini impurity as the splitting criterion, split the dataset into two parts, and assign them to the two child nodes (see the sketch after this list).
  3. Call steps 1 and 2 recursively for both child nodes until the stopping condition is satisfied.
  4. Generate a CART decision tree.
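A sketch of steps 1 and 2, assuming categorical features stored column-wise in a list of rows and the split question "does the sample satisfy A = a?":

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_binary_split(rows, labels, feature_names):
    """Return (weighted Gini, feature, value) for the best split D -> D1, D2."""
    labels = np.asarray(labels)
    best = None
    for j, name in enumerate(feature_names):
        column = np.asarray([row[j] for row in rows])
        for a in np.unique(column):
            mask = column == a                    # D1: samples with A == a
            if mask.all():
                continue                          # split would leave D2 empty
            score = mask.mean() * gini(labels[mask]) + (~mask).mean() * gini(labels[~mask])
            if best is None or score < best[0]:
                best = (score, name, a)
    return best
```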

Pruning

  1. Keep pruning from the bottom of the previously generated decision tree T_0 up to its root node, forming a subtree sequence \{T_0, T_1, T_2, \cdots, T_n\}.

  2. This subtree sequence is then evaluated by cross-validation, and the optimal subtree is selected from it.

Steps

Input : CART classification decision tree
Output : The optimal decision tree

  1. k = 0, T = T_0
  2. \alpha = +\infty
  3. For tree T, compute C(T_t) and |T_t| bottom-up for each internal node t

    • T_t : The subtree with t as the root node
    • C(T_t) : The error of subtree T_t on the training set
    • C(t) : The training error of node t if its subtree is collapsed into a single leaf
    • |T_t| : The number of leaf nodes of subtree T_t
g(t)=\frac{C(t)-C(T_t)}{|T_t|-1},\quad\alpha=\min(\alpha,g(t))
  4. Prune the internal node t for which g(t) = \alpha, decide the class of the resulting leaf node by majority voting, and obtain the tree T.

  5. k = k + 1, \alpha_k = \alpha, T_k = T
  6. If T_k is not a tree consisting of the root node and two leaf nodes, return to step 3; otherwise set T_k = T_n.

  7. Use cross-validation to select the optimal decision tree T_\alpha from the subtree sequence T_0, T_1, T_2, \cdots, T_n.
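A toy computation of g(t) and the pruning threshold α; the node errors and leaf counts below are hypothetical:

```python
def g(c_leaf, c_subtree, n_leaves):
    """g(t) = (C(t) - C(T_t)) / (|T_t| - 1): the increase in training error per
    removed leaf when the subtree rooted at t is collapsed into a single leaf."""
    return (c_leaf - c_subtree) / (n_leaves - 1)

# Hypothetical internal nodes: (error if t becomes a leaf, error of subtree T_t, |T_t|).
candidates = {"t1": (10.0, 4.0, 4), "t2": (6.0, 5.0, 2)}
alpha, node = min((g(*v), name) for name, v in candidates.items())
print(alpha, node)   # the node with the smallest g(t) is pruned first
```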

Ensemble Learning

Ensemble learning improves overall generalisation by combining multiple models.

Bagging

Train a series of independent models of the same class in parallel.
The output of each model is then combined according to a strategy and the final result is output.

Bootstrap Samples

A number of sub-training sets are randomly drawn from the original data, each containing a subset of the samples; for each sub-training set, a random subset of the feature dimensions is also selected as the model input.

Aggregate Outputs

The outputs of the individual models are then aggregated; for regression this is typically the average of the M model outputs (for classification, majority voting is used instead):

y=\frac{1}{M}\sum\limits_{m=1}^{M}f_m(x)
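A sketch of both steps (bootstrap sampling plus output averaging), assuming NumPy arrays X, y and scikit-learn's DecisionTreeRegressor as the base model f_m:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_predict(X, y, X_new, M=10, seed=0):
    """Fit M trees on bootstrap samples and return y = (1/M) * sum_m f_m(x)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    predictions = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)          # bootstrap sample (with replacement)
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        predictions.append(tree.predict(X_new))
    return np.mean(predictions, axis=0)           # aggregate by averaging
```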

Boosting

Train a series of similar models in a serial fashion.
Each later model is used to correct the output of the earlier models.
All the models are then combined to produce the final output.

Stacking

Train a series of independent models of different classes in parallel.
The output of each model is used as input to train a new model, which is used to output the final prediction.

Random Forest

Essentially, it is a Bagging ensemble learning model based on decision trees.

Steps

  1. Randomly sample the original dataset to obtain multiple training subsets
  2. Train a separate decision tree model on each training subset
  3. Combine the multiple decision tree models obtained from training to produce the final output
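For instance, with scikit-learn's RandomForestClassifier, which performs the bootstrap sampling and per-split random feature selection internally (the Iris data here is just a placeholder):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each split considering a random subset of sqrt(n_features) features.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # accuracy of the combined output
```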