Decision Tree
Decision Tree is a basic method for classification and regression. Learning a decision tree involves feature selection, tree generation, and pruning; this note covers three algorithms: ID3, C4.5, and CART.
Select Features
Define Entropy as
$$H(p) = H(X) = -\sum_{i=1}^{n} p_i \log p_i,$$
where $X$ is a discrete random variable with probabilities $p_i = P(X = x_i)$, and $0 \le H(p) \le \log n$. The larger the uncertainty is, the larger the entropy $H(X)$ will be.
Define Conditional Entropy as
$$H(Y \mid X) = \sum_{i=1}^{n} p_i H(Y \mid X = x_i),$$
where $X$ and $p_i$ are defined as above.
Information Gain is defined as
$$g(D, A) = H(D) - H(D \mid A),$$
which measures how much uncertainty is reduced by introducing feature $A$ for classification. We should keep the features that reduce the most uncertainty.
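As a quick illustration (not part of the original note), here is a minimal Python sketch of these three quantities; the toy `age`/label data is made up and logs are taken in base 2.

```python
# Minimal sketch (not from the original note): empirical entropy H(D),
# conditional entropy H(D|A), and information gain g(D, A) in base-2 logs.
from collections import Counter
from math import log2

def entropy(labels):
    """Empirical entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature_values, labels):
    """H(D|A): entropy within each feature value, weighted by its frequency."""
    n = len(labels)
    groups = {}
    for a, y in zip(feature_values, labels):
        groups.setdefault(a, []).append(y)
    return sum(len(ys) / n * entropy(ys) for ys in groups.values())

def information_gain(feature_values, labels):
    """g(D, A) = H(D) - H(D|A)."""
    return entropy(labels) - conditional_entropy(feature_values, labels)

# Toy example: does the (made-up) feature "age" reduce label uncertainty?
age = ["young", "young", "middle", "old", "old", "old"]
y   = ["no",    "no",    "yes",    "yes", "yes", "no"]
print(information_gain(age, y))
```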
Generating Decision Tree
ID3
Input: training data set $D$, feature set $A$, threshold $\epsilon$;
Output: decision tree $T$.
- If all instances in $D$ belong to the same class $C_k$, return a single-node tree $T$ labeled $C_k$;
- If $A = \emptyset$, return a single-node tree $T$ labeled with the class $C_k$ that has the most instances in $D$;
- Otherwise, calculate the information gain of each feature $A_i$ and select the feature $A_g$ with the largest information gain;
- If the information gain of $A_g$ is less than $\epsilon$, return a single-node tree $T$ labeled with the class $C_k$ that has the most instances in $D$;
- Else, partition $D$ into subsets $D_i$ according to the values $A_g = a_i$, and construct the sub-trees recursively from each $D_i$ with the feature set $A - \{A_g\}$.
This approach tends to over-fit.
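A minimal recursive sketch of the procedure above (my illustration, not the original author's code): samples are dicts of categorical features, ties are broken arbitrarily, and `information_gain` is reused from the earlier snippet.

```python
# Minimal ID3 sketch. A returned tree is either a class label (leaf) or a
# nested dict {feature: {value: subtree}}.
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def id3(samples, labels, features, epsilon=1e-3):
    # All instances share one class: return a leaf with that class.
    if len(set(labels)) == 1:
        return labels[0]
    # No features left: return a leaf with the majority class.
    if not features:
        return majority(labels)
    # Select the feature A_g with the largest information gain.
    gains = {f: information_gain([s[f] for s in samples], labels) for f in features}
    best = max(gains, key=gains.get)
    # Gain below the threshold epsilon: stop splitting.
    if gains[best] < epsilon:
        return majority(labels)
    # Partition D by the values of A_g and recurse on A - {A_g}.
    tree = {best: {}}
    remaining = [f for f in features if f != best]
    for value in set(s[best] for s in samples):
        idx = [i for i, s in enumerate(samples) if s[best] == value]
        tree[best][value] = id3([samples[i] for i in idx],
                                [labels[i] for i in idx],
                                remaining, epsilon)
    return tree
```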
C4.5
C4.5 is similar to ID3 but selects features by the information gain ratio
$$g_R(D, A) = \frac{g(D, A)}{H_A(D)},$$
where $H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log \frac{|D_i|}{|D|}$ is the entropy of $D$ with respect to the values of feature $A$, instead of the raw information gain.
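A one-function sketch of this criterion, reusing `entropy` and `information_gain` from the snippets above; the zero guard is my addition.

```python
def gain_ratio(feature_values, labels):
    """g_R(D, A) = g(D, A) / H_A(D), with H_A(D) the entropy of the
    feature's own value distribution."""
    split_info = entropy(feature_values)  # H_A(D)
    if split_info == 0.0:                 # feature takes a single value
        return 0.0
    return information_gain(feature_values, labels) / split_info
```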
Pruning Decision Tree
Cost function
Let $T$ denote a decision tree with $|T|$ leaves. Leaf $t$ has $N_t$ samples, of which $N_{tk}$ belong to class $k$. Define the loss function as
$$C_\alpha(T) = -\sum_{t=1}^{|T|} \sum_{k=1}^{K} N_{tk} \log \frac{N_{tk}}{N_t} + \alpha |T|.$$
The first term is the total empirical entropy of the leaves, and $\alpha |T|$ penalizes the size of the tree.
Pruning algorithm
- Calculate the empirical entropy of every node;
- If collapsing a subtree into a single leaf makes the cost $C_\alpha(T)$ smaller, prune it; repeat until no such subtree remains.
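A minimal sketch of the cost and the pruning check (not from the original note): leaves are lists of class labels, `entropy` is reused from the earlier snippet, and the sibling leaves below are made-up toy data.

```python
def cost(leaves, alpha):
    """C_alpha(T) = sum_t N_t * H(leaf_t) + alpha * |T|."""
    empirical = sum(len(labels) * entropy(labels) for labels in leaves)
    return empirical + alpha * len(leaves)

# Pruning check: collapse two sibling leaves into their parent only if the
# cost does not increase.
left, right = ["yes", "yes", "no"], ["no", "no"]
before = cost([left, right], alpha=1.0)
after = cost([left + right], alpha=1.0)
print("prune" if after <= before else "keep")
```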
 
CART Algorithm
CART (Classification And Regression Tree) is a widely used approach for learning decision trees. It can be used for both classification and regression.
Generate regression tree
Input: Training set D 
Output: Regression tree f(x)
Scan for the best splitting variable $j$ and splitting point $s$ by solving
$$\min_{j,s}\left[\min_{c_1}\sum_{x_i \in R_1(j,s)}(y_i - c_1)^2 + \min_{c_2}\sum_{x_i \in R_2(j,s)}(y_i - c_2)^2\right],$$
where $R_1$ and $R_2$ are the regions split by $(j, s)$ and $c_m = E(y_i \mid x_i \in R_m)$.
Go back to step 1 for $R_1$ and $R_2$, until a terminating condition is satisfied.
Once the input space is split into $M$ regions $R_1, R_2, \dots, R_M$, generate the regression tree
$$f(x) = \sum_{m=1}^{M} \hat{c}_m I(x \in R_m).$$
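A minimal sketch (my illustration) of the least-squares split search in step 1 for a single numeric feature, using an exhaustive scan over candidate split points.

```python
def best_split(xs, ys):
    """Return (s, loss): the split point minimizing the two-sided squared error."""
    def sse(values):
        if not values:
            return 0.0
        c = sum(values) / len(values)      # optimal constant c_m = mean of the region
        return sum((v - c) ** 2 for v in values)
    best_s, best_loss = None, float("inf")
    for s in sorted(set(xs))[:-1]:         # candidate split points
        left = [y for x, y in zip(xs, ys) if x <= s]
        right = [y for x, y in zip(xs, ys) if x > s]
        loss = sse(left) + sse(right)
        if loss < best_loss:
            best_s, best_loss = s, loss
    return best_s, best_loss

print(best_split([1, 2, 3, 10, 11, 12], [1.1, 0.9, 1.0, 5.2, 4.8, 5.0]))
```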
 
Gini Index
The Gini index of a probability distribution is defined as
$$\operatorname{Gini}(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2.$$
For a binary classification problem this reduces to $\operatorname{Gini}(p) = 2p(1 - p)$.
The Gini index of a sample set $D$ is
$$\operatorname{Gini}(D) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^2.$$
Split $D$ into $D_1 = \{(x, y) \in D \mid A(x) = a\}$ and $D_2 = D - D_1$. Then, under the condition of feature $A$, the Gini index is defined as
$$\operatorname{Gini}(D, A) = \frac{|D_1|}{|D|}\operatorname{Gini}(D_1) + \frac{|D_2|}{|D|}\operatorname{Gini}(D_2).$$
The higher the Gini index is, the more unpredictable the samples are.
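A minimal sketch (not from the original note) of the Gini index of a label set and of $\operatorname{Gini}(D, A)$ for the binary split $D_1 = \{A(x) = a\}$, $D_2 = \{A(x) \ne a\}$.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(feature_values, labels, a):
    """Gini(D, A) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2)."""
    n = len(labels)
    d1 = [y for v, y in zip(feature_values, labels) if v == a]
    d2 = [y for v, y in zip(feature_values, labels) if v != a]
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)
```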
Generate classification tree
Input: Training set D 
Output: CART classification decision tree
- For every feature $A$ and every possible value $a$, split $D$ with $A = a$ and calculate the Gini index $\operatorname{Gini}(D, A)$.
- Split $D$ with the pair $(A, a)$ that produces the smallest Gini index.
- Apply steps 1 and 2 recursively to the subsets $D_1$ and $D_2$.
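A minimal sketch (my illustration) of steps 1 and 2: enumerate every (feature, value) pair and keep the split with the smallest $\operatorname{Gini}(D, A)$. It reuses `gini_split` from the previous snippet; the samples are made up.

```python
def best_gini_split(samples, labels, features):
    best = None  # (Gini value, feature, value)
    for f in features:
        for a in set(s[f] for s in samples):
            g = gini_split([s[f] for s in samples], labels, a)
            if best is None or g < best[0]:
                best = (g, f, a)
    return best

samples = [{"age": "young", "job": "no"}, {"age": "young", "job": "yes"},
           {"age": "old", "job": "no"}, {"age": "old", "job": "yes"}]
labels = ["no", "yes", "no", "yes"]
print(best_gini_split(samples, labels, ["age", "job"]))  # "job" separates the classes perfectly
```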
 
CART Pruning
TODO: I don't fully understand how this algorithm is derived.
Input: decision tree $T_0$ generated by CART
Output: optimal subtree $T_\alpha$
- Set $k = 0$, $T = T_0$.
- Set $\alpha = +\infty$.
- For each internal node $t$, going from the bottom up, calculate $C(T_t)$, $|T_t|$,
$$g(t) = \frac{C(t) - C(T_t)}{|T_t| - 1},$$
and
$$\alpha = \min(\alpha, g(t)),$$
where $T_t$ is the subtree rooted at $t$, $C(T_t)$ is its prediction error on the training data, and $|T_t|$ is its number of leaves. Intuitively, $g(t)$ is the increase in error per leaf removed when $T_t$ is collapsed into the single node $t$, so the subtree with the smallest $g(t)$ is the first one that stops being worth its extra leaves as $\alpha$ grows.
- Visit the internal nodes $t$ from the top down; if $g(t) = \alpha$, prune $T_t$ and label the resulting leaf with its majority class, obtaining the tree $T$.
- Set $k = k + 1$, $\alpha_k = \alpha$, and $T_k = T$.
- If $T$ is not a single-node tree, go back to step 2 and repeat on the pruned tree.
- Finally, select the optimal subtree $T_\alpha$ from the sequence $T_0, T_1, \dots, T_k$ by cross-validation.
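To make the role of $g(t)$ concrete, here is a toy sketch (my illustration, with made-up error counts) of one round of weakest-link pruning.

```python
# Each entry stands for an internal node t: `err_as_leaf` is C(t), the error
# if t were collapsed to a leaf; `err_subtree` is C(T_t); `leaves` is |T_t|.
def g(node):
    """g(t) = (C(t) - C(T_t)) / (|T_t| - 1)."""
    return (node["err_as_leaf"] - node["err_subtree"]) / (node["leaves"] - 1)

# Two hypothetical internal nodes; alpha = min g(t) picks the weakest link.
nodes = {
    "t1": {"err_as_leaf": 10, "err_subtree": 4, "leaves": 4},  # g = 2.0
    "t2": {"err_as_leaf": 6, "err_subtree": 5, "leaves": 3},   # g = 0.5
}
alpha = min(g(n) for n in nodes.values())
to_prune = [name for name, n in nodes.items() if g(n) == alpha]
print(alpha, to_prune)  # t2 is collapsed first: cheapest error increase per removed leaf
```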