Decision Tree
Theory
The decision made at each step of a decision tree reduces the uncertainty of the information.
Entropy
A common way to measure uncertainty
Information Entropy
H(U) = -Σ_{i=1}^{n} p_i log p_i
- p_i is the probability of class i
- p_i is estimated as the fraction of class i in the set
Entropy comes from information theory
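As a concrete illustration (a minimal sketch; the function name is mine, not from the notes), the entropy of a set of class labels can be computed directly from the frequency estimates of p_i:

```python
import math
from collections import Counter

def entropy(labels):
    """H(U) = -sum_i p_i * log2(p_i), with p_i estimated as the
    fraction of class i in the label list."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["yes", "yes", "yes"]))       # a pure set: 0.0
print(entropy(["yes", "no", "yes", "no"]))  # a 50/50 split: 1.0 bit
```

A pure node carries no uncertainty, while an even split is maximally uncertain, which is why entropy works as an impurity measure for splits.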
Conditional Entropy
H(U|I) = Σ_i P(I = i) H(U | I = i)
Information Gain
Calculating Method 1
When information about a feature 𝐼 is introduced, the entropy of 𝑈 becomes smaller.
We choose the 𝐼 that reduces the information entropy of 𝑈 the most, which gives the maximum information gain:
g(U, I) = H(U) − H(U | I)
Calculating Method 2
The information gain I(C; F) of the class variable C with possible values {c_1, ..., c_m} with respect to the feature variable F with possible values {f_1, ..., f_n} is defined by:
I(C; F) = Σ_i Σ_j P(C = c_i, F = f_j) log [ P(C = c_i, F = f_j) / (P(C = c_i) P(F = f_j)) ]
- P(C = c_i) is the probability of class C having value c_i.
- P(F = f_j) is the probability of feature F having value f_j.
- P(C = c_i, F = f_j) is the joint probability of C = c_i and F = f_j.
These probabilities are estimated from frequencies in the training data.
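Estimating I(C; F) from frequencies can be sketched as follows (a minimal illustration; the helper name is mine):

```python
import math
from collections import Counter

def information_gain(classes, feature):
    """I(C;F) = sum over (c, f) of P(c, f) * log2(P(c, f) / (P(c) * P(f))),
    with every probability estimated from frequencies in the paired samples."""
    n = len(classes)
    p_c = Counter(classes)
    p_f = Counter(feature)
    p_cf = Counter(zip(classes, feature))
    return sum(
        (n_cf / n) * math.log2((n_cf / n) / ((p_c[c] / n) * (p_f[f] / n)))
        for (c, f), n_cf in p_cf.items()
    )

# Feature perfectly predicts the class, so the gain equals H(C) = 1 bit:
print(information_gain(["yes", "yes", "no", "no"], ["a", "a", "b", "b"]))  # 1.0
```

An independent feature would give a gain of 0: knowing F tells us nothing about C.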
ID3 & C4.5
ID3
The core idea of the ID3 (Iterative Dichotomiser 3) algorithm is to recursively build a decision tree, using information gain as the criterion for feature selection at each node.
The process can be summarised as follows:
- Starting from the root node, calculate the information gain for every candidate split
- Select the feature with the highest information gain as the split feature, and create one child node for each value of that feature
- Apply the above method recursively to the child nodes until every remaining feature offers little or no information gain
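The feature-selection step above can be sketched as follows (a minimal illustration under my own naming, not a full ID3 implementation):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_feature(rows, labels):
    """ID3 split selection: return the feature index with the highest
    information gain g(D, A) = H(D) - H(D|A)."""
    base = entropy(labels)
    n = len(labels)
    def gain(i):
        cond = 0.0
        for value in set(row[i] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[i] == value]
            cond += len(subset) / n * entropy(subset)
        return base - cond
    return max(range(len(rows[0])), key=gain)

# Toy data: feature 0 is noise, feature 1 determines the label exactly.
rows = [("x", "a"), ("y", "a"), ("x", "b"), ("y", "b")]
labels = ["yes", "yes", "no", "no"]
print(best_feature(rows, labels))  # 1
```

ID3 would then partition the data on feature 1 and recurse on each child.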
C4.5
The C4.5 algorithm uses the information gain ratio as the criterion for feature selection.
The feature with the highest information gain ratio is selected as the root node of the current sample set.
The gain ratio is defined as g_R(D, A) = g(D, A) / H_A(D), where:
- g_R(D, A): the information gain ratio of feature 𝐴 with respect to the training set 𝐷
- g(D, A): the information gain of feature 𝐴
- H_A(D) = -Σ_i (|D_i| / |D|) log (|D_i| / |D|): the information entropy of the training set 𝐷 with respect to feature 𝐴, where D_i is the subset of 𝐷 in which 𝐴 takes its i-th value
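A short sketch (helper names are my own) of why the gain ratio penalizes many-valued features such as IDs, which plain information gain rewards:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """C4.5 criterion g_R(D, A) = g(D, A) / H_A(D): information gain divided
    by the entropy of the feature's own value distribution."""
    n = len(labels)
    cond = 0.0
    for v in set(values):
        sub = [lab for val, lab in zip(values, labels) if val == v]
        cond += len(sub) / n * entropy(sub)
    gain = entropy(labels) - cond
    split_info = entropy(values)  # H_A(D)
    return gain / split_info if split_info else 0.0

labels = ["y", "y", "n", "n"]
print(gain_ratio(["a", "a", "b", "b"], labels))  # 1.0: clean binary split
print(gain_ratio(["1", "2", "3", "4"], labels))  # 0.5: ID-like feature penalized
```

Both features have the same raw information gain (1 bit), but the ID-like feature's split information is larger, so its ratio is halved.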
Pruning
Pruning of the decision tree is often achieved by minimizing the loss function or cost function of the decision tree as a whole.
- |𝑇| is the number of leaf nodes of tree 𝑇
- 𝑡 is a leaf node of tree 𝑇 which has N_t sample points, of which the category 𝑘 has N_tk of them, k = 1, 2, ..., K
- H_t(T) is the empirical entropy on the leaf node 𝑡, and 𝛼 ≥ 0 is a parameter
Then the loss function of the decision tree can be defined as
C_α(T) = Σ_{t=1}^{|T|} N_t H_t(T) + α|T|
Empirical entropy
H_t(T) = -Σ_k (N_tk / N_t) log (N_tk / N_t)
Hence
C(T) = Σ_{t=1}^{|T|} N_t H_t(T) = -Σ_t Σ_k N_tk log (N_tk / N_t)
which can be rewritten as
C_α(T) = C(T) + α|T|
The larger 𝛼 is, the more the criterion favors a simpler model with fewer leaves.
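The loss C_α(T) = Σ_t N_t H_t(T) + α|T| can be evaluated directly from the leaf contents; this small sketch (my own naming, toy numbers) shows how a larger α flips the preference toward the pruned tree:

```python
import math
from collections import Counter

def empirical_entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def tree_loss(leaves, alpha):
    """C_alpha(T) = sum_t N_t * H_t(T) + alpha * |T|; each element of
    `leaves` is the list of training labels that fall into one leaf."""
    fit = sum(len(labels) * empirical_entropy(labels) for labels in leaves)
    return fit + alpha * len(leaves)

deep = [["y", "y"], ["n", "n"]]   # two pure leaves: C(T) = 0, |T| = 2
pruned = [["y", "y", "n", "n"]]   # one mixed leaf:  C(T) = 4, |T| = 1
print(tree_loss(deep, alpha=0.5), tree_loss(pruned, alpha=0.5))  # 1.0 4.5
print(tree_loss(deep, alpha=5.0), tree_loss(pruned, alpha=5.0))  # 10.0 9.0
```

With α = 0.5 the deeper tree has the lower loss; with α = 5.0 the pruned tree wins, as the complexity penalty dominates.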
CART
Classification And Regression Tree
Definition
The Classification and Regression Tree is a decision tree that can be used for both classification and regression.
- The left branch takes the value of “yes” and the right branch takes the value of “no”.
- In this way, the decision tree is constructed by recursively dichotomising each feature, dividing the entire feature space into a finite number of cells.
Steps
- Generate a decision tree based on the training data set, and the generated tree should be as large as possible.
- Use a validation set to prune the generated tree and select the optimal subtree.
Common Measures of Impurity
- Entropy (measures uncertainty): H = -Σ_i p_i log p_i
- Gini index (minimizes the probability of misclassification): Gini = 1 - Σ_i p_i²
- Classification error: E = 1 - max_i p_i
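The three impurity measures can be compared side by side on a single node (a minimal sketch; the function name is mine):

```python
import math
from collections import Counter

def impurities(labels):
    """Entropy, Gini index, and classification error of one node."""
    n = len(labels)
    probs = [c / n for c in Counter(labels).values()]
    return {
        "entropy": sum(-p * math.log2(p) for p in probs),
        "gini": 1.0 - sum(p * p for p in probs),
        "classification_error": 1.0 - max(probs),
    }

# All three measures peak at a 50/50 split and vanish for a pure node:
print(impurities(["y", "y", "n", "n"]))  # entropy 1.0, gini 0.5, error 0.5
print(impurities(["y", "y", "y", "y"]))  # all 0.0
```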
Generation
Input : training set 𝐷 and a stopping condition
Output: CART classification decision tree
Starting from the root node, the following operations are performed recursively for each node and a binary decision tree is constructed.
- Let the training set be 𝐷 and calculate the Gini impurity of the existing features for that dataset: for each feature 𝐴 and each of its possible values 𝑎, divide 𝐷 into two parts according to whether each sample satisfies 𝐴 = 𝑎, then calculate the Gini impurity of the split 𝐴 = 𝑎.
- Among all possible features 𝐴 and all their possible cut-off points 𝑎, select the pair with the smallest Gini impurity as the division criterion, divide the dataset into two parts, and assign them to the two child nodes;
- Call steps 1 and 2 recursively on both child nodes until the stopping condition is satisfied
- Generate a CART decision tree.
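The split-search step above can be sketched as follows (a minimal illustration with categorical features and my own helper names, not a full CART implementation):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_gini_split(rows, labels):
    """Try every (feature, value) pair A = a and return the binary split
    with the smallest weighted Gini impurity of the two parts."""
    n = len(labels)
    best = None
    for i in range(len(rows[0])):
        for a in set(row[i] for row in rows):
            left = [lab for row, lab in zip(rows, labels) if row[i] == a]
            right = [lab for row, lab in zip(rows, labels) if row[i] != a]
            score = len(left) / n * gini(left)
            if right:
                score += len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, i, a)
    return best  # (weighted Gini impurity, feature index, split value)

rows = [("x", "a"), ("y", "a"), ("x", "b"), ("y", "b")]
labels = ["yes", "yes", "no", "no"]
score, feature, value = best_gini_split(rows, labels)
print(score, feature)  # 0.0 1 -- feature 1 separates the classes perfectly
```

CART would recurse on each of the two parts until the stopping condition holds.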
Pruning
Keep pruning from the bottom of the previously generated decision tree T_0 upward until the root node is reached, forming a sequence of subtrees {T_0, T_1, ..., T_n}.
This sequence of subtrees is then tested by cross-validation, from which the optimal subtree is selected.
Steps
Input : CART classification decision tree
Output : The optimal decision tree
For tree 𝑇, compute bottom-up, for each internal node 𝑡,
g(t) = (C(t) - C(T_t)) / (|T_t| - 1)
- T_t: the subtree with 𝑡 as the root node
- C(T_t): the error of subtree T_t on the training set
- C(t): the error on the training set if T_t is pruned to the single node 𝑡
- |T_t|: the number of leaf nodes of subtree T_t
Prune the internal node 𝑡 with the smallest g(t), decide the class of the resulting leaf node by majority voting, and obtain the new tree T_k.
If T_k is not a tree consisting of the root node and two leaf nodes, repeat the computation and pruning steps on T_k; otherwise set T_n = T_k.
Use cross-validation to select the optimal subtree T_α from the sequence of subtrees {T_0, T_1, ..., T_n}.
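The pruning criterion g(t) = (C(t) − C(T_t)) / (|T_t| − 1) measures the increase in training error per leaf removed; a tiny sketch with made-up numbers (the function name is mine):

```python
def prune_criterion(C_t, C_Tt, num_leaves):
    """g(t) = (C(t) - C(T_t)) / (|T_t| - 1): the increase in training error
    per leaf removed when the subtree T_t is collapsed into the node t."""
    return (C_t - C_Tt) / (num_leaves - 1)

# Collapsing a 5-leaf subtree with training error 2 into a node with error 10:
print(prune_criterion(C_t=10, C_Tt=2, num_leaves=5))  # 2.0
```

The node with the smallest g(t) is the cheapest to prune, so it is removed first when building the subtree sequence.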
Ensemble Learning
Improves overall generalisation by combining multiple models together.
Bagging
Train a series of independent models in parallel, typically models of the same type fitted to different bootstrap samples of the data.
The outputs of the individual models are then combined according to a strategy (e.g. voting or averaging) to produce the final result.
Bootstrap Samples
Draw a number of sub-training sets from the original data by random sampling with replacement; for each sub-training set, a random subset of the feature dimensions may also be selected as model inputs.
Aggregate Outputs
Combine the outputs of the models trained on the sub-training sets, e.g. by majority voting for classification or averaging for regression.
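The two Bagging steps can be sketched as follows (a minimal illustration; the helper names and the trivial constant "models" are mine, used only to show the voting):

```python
import random
from collections import Counter

def bootstrap_sample(data, labels, rng):
    """Draw len(data) points with replacement (one bootstrap sample)."""
    idx = [rng.randrange(len(data)) for _ in range(len(data))]
    return [data[i] for i in idx], [labels[i] for i in idx]

def bagging_predict(models, x):
    """Aggregate the models' outputs on x by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
sub_x, sub_y = bootstrap_sample([1, 2, 3, 4], ["a", "b", "a", "b"], rng)
print(len(sub_x), len(sub_y))  # 4 4: each sample is the size of the original set

models = [lambda x: "yes", lambda x: "yes", lambda x: "no"]
print(bagging_predict(models, x=None))  # yes (2 votes to 1)
```

Because each bootstrap sample differs, the models disagree in different places, and the vote averages out their individual errors.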
Boosting
Train a series of similar models in a serial fashion.
Each later model corrects the errors made by the earlier models.
All the models are then combined to produce the final output.
Stacking
Train a series of independent models of different types in parallel.
The output of each model is used as input to train a new model, which is used to output the final prediction.
Random Forest
Essentially, Random Forest is a Bagging ensemble learning model whose base models are decision trees.
Steps
- Randomly sample the original dataset to obtain multiple training subsets
- Train a separate decision tree model on each training subset
- Combine the resulting decision tree models, e.g. by majority voting, to obtain the final output
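The three steps above can be sketched end to end. This is a deliberately minimal version under my own naming: the base learners are one-level trees (decision stumps) rather than full CART trees, and the sqrt(d) feature-subset size is a common heuristic, not something the notes specify:

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def train_stump(rows, labels, features):
    """One-level tree: choose the (feature, value) split among the given
    feature subset that makes the fewest training errors."""
    best = None
    for i in features:
        for a in set(row[i] for row in rows):
            left = [lab for row, lab in zip(rows, labels) if row[i] == a]
            right = [lab for row, lab in zip(rows, labels) if row[i] != a]
            pred_l = majority(left) if left else majority(labels)
            pred_r = majority(right) if right else majority(labels)
            errors = sum(lab != pred_l for lab in left) + sum(lab != pred_r for lab in right)
            if best is None or errors < best[0]:
                best = (errors, i, a, pred_l, pred_r)
    _, i, a, pred_l, pred_r = best
    return lambda row: pred_l if row[i] == a else pred_r

def random_forest(rows, labels, n_trees=25, seed=0):
    """Bagging of stumps with random feature subsets (a minimal sketch)."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    k = max(1, int(d ** 0.5))  # features considered per tree (common default)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feats = rng.sample(range(d), k)             # random feature subset
        trees.append(train_stump([rows[i] for i in idx],
                                 [labels[i] for i in idx], feats))
    return lambda row: majority([t(row) for t in trees])

rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = ["no", "no", "yes", "yes"]  # the label equals feature 0
predict = random_forest(rows, labels)
print([predict(r) for r in rows])
```

Randomizing both the samples and the feature subsets decorrelates the trees, which is what makes the vote of the forest more stable than any single tree.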