Decision Tree
Decision Tree is a basic method for classification and regression. Learning a decision tree involves feature selection, tree generation, and pruning; this note covers three algorithms: ID3, C4.5, and CART.
Select Features
Define Entropy as
$$H(p) = H(X) = -\sum_{i=1}^{n} p_i \log p_i,$$
where $X$ is a discrete random variable with probabilities $p_i = P(X = x_i)$, and $0 \le H(p) \le \log n$. The larger the uncertainty is, the larger the entropy $H(X)$ will be.
Define Conditional Entropy as
$$H(Y \mid X) = \sum_{i=1}^{n} p_i H(Y \mid X = x_i),$$
where $X$ and $p_i$ are defined as above.
Information Gain is defined as
$$g(D, A) = H(D) - H(D \mid A),$$
which measures how much uncertainty is reduced by introducing feature $A$ for classification. We should keep the features that reduce the most uncertainty.
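As a quick illustration (not part of the original note), here is a minimal Python sketch of these three quantities; the toy `age`/label data is made up and logs are taken in base 2.

```python
# Minimal sketch (not from the original note): empirical entropy H(D),
# conditional entropy H(D|A), and information gain g(D, A) in base-2 logs.
from collections import Counter
from math import log2

def entropy(labels):
    """Empirical entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature_values, labels):
    """H(D|A): entropy within each feature value, weighted by its frequency."""
    n = len(labels)
    groups = {}
    for a, y in zip(feature_values, labels):
        groups.setdefault(a, []).append(y)
    return sum(len(ys) / n * entropy(ys) for ys in groups.values())

def information_gain(feature_values, labels):
    """g(D, A) = H(D) - H(D|A)."""
    return entropy(labels) - conditional_entropy(feature_values, labels)

# Toy example: does the (made-up) feature "age" reduce label uncertainty?
age = ["young", "young", "middle", "old", "old", "old"]
y   = ["no",    "no",    "yes",    "yes", "yes", "no"]
print(information_gain(age, y))
```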
Generating Decision Tree
ID3
Input: training data set $D$, feature set $A$, threshold $\epsilon$;
Output: decision tree $T$.
- If all instances in $D$ belong to the same class $C_k$, return a single-node tree $T$ labeled $C_k$;
- If $A = \emptyset$, return a single-node tree $T$ labeled with the class $C_k$ that has the most instances in $D$;
- Otherwise, calculate the information gain of each feature $A_i$ and select the feature $A_g$ with the largest information gain;
- If the information gain of $A_g$ is less than $\epsilon$, return a single-node tree $T$ labeled with the class $C_k$ that has the most instances in $D$;
- Else, partition $D$ into subsets $D_i$ according to the values $A_g = a_i$, and construct the sub-trees recursively from each $D_i$ with the feature set $A - \{A_g\}$.
This approach tends to over-fit.
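A minimal recursive sketch of the procedure above (my illustration, not the original author's code): samples are dicts of categorical features, ties are broken arbitrarily, and `information_gain` is reused from the earlier snippet.

```python
# Minimal ID3 sketch. A returned tree is either a class label (leaf) or a
# nested dict {feature: {value: subtree}}.
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def id3(samples, labels, features, epsilon=1e-3):
    # All instances share one class: return a leaf with that class.
    if len(set(labels)) == 1:
        return labels[0]
    # No features left: return a leaf with the majority class.
    if not features:
        return majority(labels)
    # Select the feature A_g with the largest information gain.
    gains = {f: information_gain([s[f] for s in samples], labels) for f in features}
    best = max(gains, key=gains.get)
    # Gain below the threshold epsilon: stop splitting.
    if gains[best] < epsilon:
        return majority(labels)
    # Partition D by the values of A_g and recurse on A - {A_g}.
    tree = {best: {}}
    remaining = [f for f in features if f != best]
    for value in set(s[best] for s in samples):
        idx = [i for i, s in enumerate(samples) if s[best] == value]
        tree[best][value] = id3([samples[i] for i in idx],
                                [labels[i] for i in idx],
                                remaining, epsilon)
    return tree
```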
C4.5
C4.5 is similar to ID3 but selects features by the information gain ratio
$$g_R(D, A) = \frac{g(D, A)}{H_A(D)},$$
where $H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log \frac{|D_i|}{|D|}$ is the entropy of $D$ with respect to the values of feature $A$, instead of the raw information gain.
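A one-function sketch of this criterion, reusing `entropy` and `information_gain` from the snippets above; the zero guard is my addition.

```python
def gain_ratio(feature_values, labels):
    """g_R(D, A) = g(D, A) / H_A(D), with H_A(D) the entropy of the
    feature's own value distribution."""
    split_info = entropy(feature_values)  # H_A(D)
    if split_info == 0.0:                 # feature takes a single value
        return 0.0
    return information_gain(feature_values, labels) / split_info
```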
Pruning Decision Tree
Cost function
Let $T$ denote a decision tree with $|T|$ leaves. Leaf $t$ has $N_t$ samples, of which $N_{tk}$ belong to class $k$. Define the loss function as
$$C_\alpha(T) = -\sum_{t=1}^{|T|} \sum_{k=1}^{K} N_{tk} \log \frac{N_{tk}}{N_t} + \alpha |T|.$$
The first term is the total empirical entropy of the leaves, and $\alpha |T|$ penalizes the size of the tree.
Pruning algorithm
- Calculate the empirical entropy of every node;
- If collapsing a subtree into a single leaf makes the cost $C_\alpha(T)$ smaller, prune it; repeat until no such subtree remains.
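A minimal sketch of the cost and the pruning check (not from the original note): leaves are lists of class labels, `entropy` is reused from the earlier snippet, and the sibling leaves below are made-up toy data.

```python
def cost(leaves, alpha):
    """C_alpha(T) = sum_t N_t * H(leaf_t) + alpha * |T|."""
    empirical = sum(len(labels) * entropy(labels) for labels in leaves)
    return empirical + alpha * len(leaves)

# Pruning check: collapse two sibling leaves into their parent only if the
# cost does not increase.
left, right = ["yes", "yes", "no"], ["no", "no"]
before = cost([left, right], alpha=1.0)
after = cost([left + right], alpha=1.0)
print("prune" if after <= before else "keep")
```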
 
CART Algorithm
CART (Classification And Regression Tree) is a widely used approach for learning decision trees. It can be used for both classification and regression.
Generate regression tree
Input: Training set D 
Output: Regression tree f(x)
Scan for the best splitting variable $j$ and splitting point $s$ by solving
$$\min_{j,s}\left[\min_{c_1}\sum_{x_i \in R_1(j,s)}(y_i - c_1)^2 + \min_{c_2}\sum_{x_i \in R_2(j,s)}(y_i - c_2)^2\right],$$
where $R_1$ and $R_2$ are the regions split by $(j, s)$ and $c_m = E(y_i \mid x_i \in R_m)$.
Go back to step 1 for $R_1$ and $R_2$, until a terminating condition is satisfied.
Once the input space is split into $M$ regions $R_1, R_2, \dots, R_M$, generate the regression tree
$$f(x) = \sum_{m=1}^{M} \hat{c}_m I(x \in R_m).$$
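A minimal sketch (my illustration) of the least-squares split search in step 1 for a single numeric feature, using an exhaustive scan over candidate split points.

```python
def best_split(xs, ys):
    """Return (s, loss): the split point minimizing the two-sided squared error."""
    def sse(values):
        if not values:
            return 0.0
        c = sum(values) / len(values)      # optimal constant c_m = mean of the region
        return sum((v - c) ** 2 for v in values)
    best_s, best_loss = None, float("inf")
    for s in sorted(set(xs))[:-1]:         # candidate split points
        left = [y for x, y in zip(xs, ys) if x <= s]
        right = [y for x, y in zip(xs, ys) if x > s]
        loss = sse(left) + sse(right)
        if loss < best_loss:
            best_s, best_loss = s, loss
    return best_s, best_loss

print(best_split([1, 2, 3, 10, 11, 12], [1.1, 0.9, 1.0, 5.2, 4.8, 5.0]))
```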
 
Gini Index
The Gini index of a probability distribution is defined as
$$\operatorname{Gini}(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2.$$
For a binary classification problem this reduces to $\operatorname{Gini}(p) = 2p(1 - p)$.
The Gini index of a sample set $D$ is
$$\operatorname{Gini}(D) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^2.$$
Split $D$ into $D_1 = \{(x, y) \in D \mid A(x) = a\}$ and $D_2 = D - D_1$. Then, under the condition of feature $A$, the Gini index is defined as
$$\operatorname{Gini}(D, A) = \frac{|D_1|}{|D|}\operatorname{Gini}(D_1) + \frac{|D_2|}{|D|}\operatorname{Gini}(D_2).$$
The higher the Gini index is, the more unpredictable the samples are.
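A minimal sketch (not from the original note) of the Gini index of a label set and of $\operatorname{Gini}(D, A)$ for the binary split $D_1 = \{A(x) = a\}$, $D_2 = \{A(x) \ne a\}$.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(feature_values, labels, a):
    """Gini(D, A) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2)."""
    n = len(labels)
    d1 = [y for v, y in zip(feature_values, labels) if v == a]
    d2 = [y for v, y in zip(feature_values, labels) if v != a]
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)
```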
Generate classification tree
Input: Training set D 
Output: CART classification decision tree
- For every feature $A$ and every possible value $a$, split $D$ with $A = a$ and calculate the Gini index $\operatorname{Gini}(D, A)$.
- Split $D$ with the pair $(A, a)$ that produces the smallest Gini index.
- Apply steps 1 and 2 recursively to the subsets $D_1$ and $D_2$.
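A minimal sketch (my illustration) of steps 1 and 2: enumerate every (feature, value) pair and keep the split with the smallest $\operatorname{Gini}(D, A)$. It reuses `gini_split` from the previous snippet; the samples are made up.

```python
def best_gini_split(samples, labels, features):
    best = None  # (Gini value, feature, value)
    for f in features:
        for a in set(s[f] for s in samples):
            g = gini_split([s[f] for s in samples], labels, a)
            if best is None or g < best[0]:
                best = (g, f, a)
    return best

samples = [{"age": "young", "job": "no"}, {"age": "young", "job": "yes"},
           {"age": "old", "job": "no"}, {"age": "old", "job": "yes"}]
labels = ["no", "yes", "no", "yes"]
print(best_gini_split(samples, labels, ["age", "job"]))  # "job" separates the classes perfectly
```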
 
CART Pruning
TODO: I don't fully understand how this algorithm is derived.
Input: decision tree $T_0$ generated by CART
Output: optimal subtree $T_\alpha$
- Set $k = 0$, $T = T_0$.
- Set $\alpha = +\infty$.
- For each internal node $t$, going from the bottom up, calculate $C(T_t)$, $|T_t|$,
$$g(t) = \frac{C(t) - C(T_t)}{|T_t| - 1},$$
and
$$\alpha = \min(\alpha, g(t)),$$
where $T_t$ is the subtree rooted at $t$, $C(T_t)$ is its prediction error on the training data, and $|T_t|$ is its number of leaves. Intuitively, $g(t)$ is the increase in error per leaf removed when $T_t$ is collapsed into the single node $t$, so the subtree with the smallest $g(t)$ is the first one that stops being worth its extra leaves as $\alpha$ grows.
- Visit the internal nodes $t$ from the top down; if $g(t) = \alpha$, prune $T_t$ and label the resulting leaf with its majority class, obtaining the tree $T$.
- Set $k = k + 1$, $\alpha_k = \alpha$, and $T_k = T$.
- If $T$ is not a single-node tree, go back to step 2 and repeat on the pruned tree.
- Finally, select the optimal subtree $T_\alpha$ from the sequence $T_0, T_1, \dots, T_k$ by cross-validation.
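To make the role of $g(t)$ concrete, here is a toy sketch (my illustration, with made-up error counts) of one round of weakest-link pruning.

```python
# Each entry stands for an internal node t: `err_as_leaf` is C(t), the error
# if t were collapsed to a leaf; `err_subtree` is C(T_t); `leaves` is |T_t|.
def g(node):
    """g(t) = (C(t) - C(T_t)) / (|T_t| - 1)."""
    return (node["err_as_leaf"] - node["err_subtree"]) / (node["leaves"] - 1)

# Two hypothetical internal nodes; alpha = min g(t) picks the weakest link.
nodes = {
    "t1": {"err_as_leaf": 10, "err_subtree": 4, "leaves": 4},  # g = 2.0
    "t2": {"err_as_leaf": 6, "err_subtree": 5, "leaves": 3},   # g = 0.5
}
alpha = min(g(n) for n in nodes.values())
to_prune = [name for name, n in nodes.items() if g(n) == alpha]
print(alpha, to_prune)  # t2 is collapsed first: cheapest error increase per removed leaf
```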