Decision Tree
Decision Tree is a basic method for classification and regression. Learning a decision tree consists of three steps: feature selection, tree generation, and tree pruning.
Selecting Features
Define entropy as
$$H(p) = H(X) = -\sum_{i=1}^{n} p_i \log p_i,$$
where $X$ is a discrete random variable with distribution $p_i = P(X = x_i)$, and $0 \le H(p) \le \log n$. The larger the uncertainty is, the larger the entropy $H(X)$ will be.
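For concreteness, here is a minimal Python sketch of the empirical entropy of a label sequence (function and variable names are illustrative, not from the original note):

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical entropy H(p) = -sum_i p_i * log2(p_i) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# A fair coin has maximal entropy log2(2) = 1 bit.
print(entropy(["head", "tail"]))          # 1.0
print(entropy(["head", "head", "head"]))  # -0.0, i.e. no uncertainty
```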
Define conditional entropy as
$$H(Y \mid X) = \sum_{i=1}^{n} p_i H(Y \mid X = x_i),$$
where $X$ and $p_i$ are defined as above.
Information gain is defined as
$$g(D, A) = H(D) - H(D \mid A),$$
which measures how much uncertainty is reduced by introducing feature $A$ for classification. We should keep the features that reduce the most uncertainty.
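A hedged sketch of conditional entropy and information gain on a toy dataset (the entropy helper is repeated so the snippet is self-contained; all names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature_values, labels):
    """H(Y|X) = sum_i P(X = x_i) * H(Y | X = x_i)."""
    n = len(labels)
    by_value = {}
    for x, y in zip(feature_values, labels):
        by_value.setdefault(x, []).append(y)
    return sum(len(ys) / n * entropy(ys) for ys in by_value.values())

def information_gain(feature_values, labels):
    """g(D, A) = H(D) - H(D|A): uncertainty removed by knowing feature A."""
    return entropy(labels) - conditional_entropy(feature_values, labels)

# Toy data: "outlook" fully determines the label, so its gain equals H(play).
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
play    = ["no",    "no",    "yes",  "yes",  "yes",      "yes"]
print(information_gain(outlook, play))  # about 0.918 = H(play)
```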
Generating Decision Tree
ID3
Input: Training data set $D$, feature set $A$, threshold $\epsilon$;
Output: Decision tree $T$.
- If all instances in $D$ belong to the same class $C_k$, return a single-node tree $T$ with label $C_k$;
- If $A = \emptyset$, return a single-node tree $T$ labeled with the class $C_k$ that has the most instances in $D$;
- Calculate the information gain of each feature $A_i$, and select the feature $A_g$ with the largest information gain;
- If the information gain of $A_g$ is less than $\epsilon$, return a single-node tree $T$ labeled with the class $C_k$ that has the most instances in $D$;
- Otherwise, partition $D$ into subsets $D_i$ according to the values $A_g = a_i$, and recursively construct sub-trees from each $D_i$ with feature set $A - \{A_g\}$.
This approach tends to over-fit.
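A minimal recursive sketch of the steps above, assuming categorical features; the dict-based tree representation and all names are assumptions for illustration, not the canonical ID3 implementation:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feat):
    """g(D, A) = H(D) - H(D|A) for categorical feature `feat`."""
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[feat], []).append(y)
    cond = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(labels) - cond

def id3(rows, labels, features, eps=1e-3):
    """Return a nested dict tree: {feature: {value: subtree_or_label}}."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:          # all instances share one class
        return labels[0]
    if not features:                   # A is empty: majority vote
        return majority
    gains = {f: info_gain(rows, labels, f) for f in features}
    best = max(gains, key=gains.get)
    if gains[best] < eps:              # gain below threshold: stop splitting
        return majority
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = [r for r, _ in sub], [y for _, y in sub]
        tree[best][value] = id3(sub_rows, sub_labels,
                                [f for f in features if f != best], eps)
    return tree

# Tiny example: rows are dicts of categorical features.
rows = [{"outlook": "sunny", "windy": "no"},
        {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rain",  "windy": "no"},
        {"outlook": "rain",  "windy": "yes"}]
labels = ["yes", "no", "yes", "no"]
print(id3(rows, labels, ["outlook", "windy"]))  # splits on "windy"
```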
C4.5
C4.5 is similar to ID3 but selects features with the information gain ratio
$$g_R(D, A) = \frac{g(D, A)}{H_A(D)}, \qquad H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|},$$
instead of the information gain, where $H_A(D)$ is the entropy of $D$ with respect to the values of feature $A$.
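A short sketch of the gain ratio, assuming $H_A(D)$ (the entropy of the feature's own value distribution) as the denominator; names are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """g_R(D, A) = g(D, A) / H_A(D), penalising many-valued features."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature_values, labels):
        groups.setdefault(x, []).append(y)
    cond = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    gain = entropy(labels) - cond
    split_info = entropy(feature_values)   # H_A(D) over the feature's values
    return gain / split_info if split_info > 0 else 0.0
```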
Pruning Decision Tree
Cost function
Let $T$ denote a decision tree with $|T|$ leaves. Leaf $t$ has $N_t$ samples, of which $N_{tk}$ belong to class $k$. Define the loss function as
$$C_\alpha(T) = -\sum_{t=1}^{|T|} \sum_{k=1}^{K} N_{tk} \log \frac{N_{tk}}{N_t} + \alpha |T|.$$
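As an illustration, a small sketch that evaluates $C_\alpha(T)$ from the class counts at each leaf; the list-of-dicts leaf representation is an assumption for illustration:

```python
import math

def cost(leaf_class_counts, alpha):
    """C_alpha(T) = sum_t N_t * H_t(T) + alpha * |T|,
    where H_t is the empirical entropy of leaf t."""
    total = 0.0
    for counts in leaf_class_counts:   # one dict {class: count} per leaf
        n_t = sum(counts.values())
        total += -sum(c * math.log2(c / n_t) for c in counts.values() if c > 0)
    return total + alpha * len(leaf_class_counts)

# Two leaves: one pure, one mixed; a larger alpha favours smaller trees.
print(cost([{"yes": 3}, {"yes": 1, "no": 1}], alpha=0.5))  # 0 + 2 + 0.5*2 = 3.0
```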
Pruning algorithm
- Calculate the empirical entropy of every node;
- If collapsing a group of leaves back into their parent node makes the cost $C_\alpha(T)$ smaller, prune them.
CART Algorithm
Classification And Regression Tree (CART) is a widely used approach for learning decision trees. It can be used for both classification and regression.
Generate regression tree
Input: Training set D
Output: Regression tree f(x)
- Scan for the best splitting variable $j$ and splitting point $s$ by solving
$$\min_{j, s} \left[ \min_{c_1} \sum_{x_i \in R_1} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2} (y_i - c_2)^2 \right],$$
where $R_1$ and $R_2$ are the regions split by $(j, s)$ and $c_m = E(y_i \mid x_i \in R_m)$; a minimal sketch of this search is given after these steps.
- Go back to step 1 for $R_1$ and $R_2$, until a terminating condition is satisfied.
- Split the input space into $M$ regions $R_1, R_2, \dots, R_M$ and generate the decision tree
$$f(x) = \sum_{m=1}^{M} \hat{c}_m I(x \in R_m).$$
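A hedged numpy sketch of the split search in the first step: it scans every pair of variable $j$ and threshold $s$ and keeps the one minimising the summed squared error of the two regions (names are illustrative):

```python
import numpy as np

def best_split(X, y):
    """Return (j, s) minimising the sum of squared errors of the two regions."""
    best = (None, None, np.inf)
    n_samples, n_features = X.shape
    for j in range(n_features):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # c_m = mean of y in region R_m minimises the squared error there
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best[2]:
                best = (j, s, err)
    return best[0], best[1]

# Toy 1-D data with an obvious breakpoint between x = 3 and x = 4.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8])
print(best_split(X, y))  # (0, 3.0): split on feature 0 at x <= 3
```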
Gini Index
The Gini index of a probability distribution is defined as
$$\mathrm{Gini}(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2.$$
For a two-class distribution with $P(X = 1) = p$, this reduces to $\mathrm{Gini}(p) = 2p(1 - p)$.
The Gini index of a sample set $D$ is
$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^2.$$
Split $D$ into $D_1 = \{(x, y) \in D \mid A(x) = a\}$ and $D_2 = D - D_1$. Then, conditioned on feature $A$, the Gini index is defined as
$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|} \mathrm{Gini}(D_1) + \frac{|D_2|}{|D|} \mathrm{Gini}(D_2).$$
The higher the Gini index is, the more unpredictable the samples are.
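For reference, a small sketch of the Gini index of a label set and of the binary split on $A = a$ (helper names are assumptions):

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_k (|C_k| / |D|)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_given_feature(feature_values, labels, a):
    """Gini(D, A) for the binary split D1 = {A = a}, D2 = {A != a}."""
    n = len(labels)
    d1 = [y for x, y in zip(feature_values, labels) if x == a]
    d2 = [y for x, y in zip(feature_values, labels) if x != a]
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

labels = ["yes", "yes", "yes", "no"]
print(gini(labels))  # 1 - (0.75^2 + 0.25^2) = 0.375
```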
Generate classification tree
Input: Training set D
Output: CART classification decision tree
- For every feature $A$ and every possible value $a$, split $D$ with $A = a$ and calculate the Gini index $\mathrm{Gini}(D, A)$.
- Split $D$ with the pair $(A, a)$ that produces the smallest Gini index.
- Repeat steps 1 and 2 for the subsets $D_1$ and $D_2$ until a stopping condition is met; a sketch of this split search is given below.
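A minimal sketch of steps 1 and 2: scan every (feature, value) pair and keep the binary split with the smallest $\mathrm{Gini}(D, A)$; the data layout and names are assumptions for illustration:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def best_gini_split(rows, labels, features):
    """Return the (feature, value) pair whose binary split minimises Gini(D, A)."""
    n = len(labels)
    best = (None, None, float("inf"))
    for feat in features:
        for value in set(row[feat] for row in rows):
            d1 = [y for r, y in zip(rows, labels) if r[feat] == value]
            d2 = [y for r, y in zip(rows, labels) if r[feat] != value]
            score = len(d1) / n * gini(d1) + len(d2) / n * gini(d2)
            if score < best[2]:
                best = (feat, value, score)
    return best[0], best[1]

rows = [{"outlook": "sunny", "windy": "no"},
        {"outlook": "rain",  "windy": "yes"},
        {"outlook": "rain",  "windy": "no"},
        {"outlook": "sunny", "windy": "yes"}]
labels = ["yes", "no", "yes", "no"]
print(best_gini_split(rows, labels, ["outlook", "windy"]))
# ('windy', 'no') or ('windy', 'yes'): either split has Gini 0
```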
CART Pruning
TODO: I don't yet understand how this algorithm is derived.
Input: Decision tree $T_0$ generated by CART
Output: Optimal decision tree $T_\alpha$
- Set $k = 0$, $T = T_0$.
- Set $\alpha = +\infty$.
- For each internal node $t$, going bottom-up, calculate $C(T_t)$, $|T_t|$,
$$g(t) = \frac{C(t) - C(T_t)}{|T_t| - 1}$$
and
$$\alpha = \min(\alpha, g(t)),$$
where $T_t$ is the sub-tree with root $t$, $C(T_t)$ is its prediction error on the training data, and $|T_t|$ is its number of leaves. Intuitively, $g(t)$ is the increase in error per leaf removed when $T_t$ is collapsed into the single node $t$, so the node with the smallest $g(t)$ is the first one worth pruning as $\alpha$ grows.
- Visit the internal nodes $t$ from the top down; if $g(t) = \alpha$, prune $T_t$ and label the new leaf by majority vote, obtaining tree $T$.
- Set $k = k + 1$, $\alpha_k = \alpha$, and $T_k = T$.
- If $T$ is not a single-node tree, go back to step 2.
- Finally, select the optimal sub-tree $T_\alpha$ from the sequence $T_0, T_1, \dots, T_n$ by cross-validation.
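To make $g(t)$ concrete, a tiny sketch computing it for one internal node, given the node's own error and the error of the sub-tree rooted there (all inputs are assumed to be precomputed; names are illustrative):

```python
def g(node_error, subtree_error, subtree_leaves):
    """g(t) = (C(t) - C(T_t)) / (|T_t| - 1): error increase per leaf removed."""
    return (node_error - subtree_error) / (subtree_leaves - 1)

# A sub-tree with 3 leaves has error 0.10; collapsing it into one leaf raises
# the error to 0.40, so this sub-tree is worth keeping only while alpha < 0.15.
print(g(node_error=0.40, subtree_error=0.10, subtree_leaves=3))  # 0.15
```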