# Machine Learning Review (8): The CART Decision Tree Algorithm

## 2 Classification Problems

$$Gini(X) = \sum\limits_{l = 1}^L {\frac{{|{X_l}|}}{{|X|}}\left(1 - \frac{{|{X_l}|}}{{|X|}}\right)} = 1 - \sum\limits_{l = 1}^L {\left( {\frac{{|{X_l}|}}{{|X|}}} \right)^2}$$

$$Gini(X,A) = \frac{{|{X_1}|}}{{|X|}}Gini({X_1}) + \frac{{|{X_2}|}}{{|X|}}Gini({X_2})$$
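As a quick sketch (plain Python; the function names are my own, not from any library), the two formulas above can be written directly over label lists:

```python
from collections import Counter

def gini(labels):
    """Gini(X) = 1 - sum_l (|X_l| / |X|)^2 over the class counts."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left, right):
    """Gini(X, A) = |X1|/|X| * Gini(X1) + |X2|/|X| * Gini(X2)
    for a binary split of X into X1 (left) and X2 (right)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

For a two-class node, `gini` reduces to $2p(1-p)$, which is the form used in the worked examples below.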

$$Gini(X,{A_1}) = \frac{5}{{10}} \times \left\{ {2 \times \frac{4}{5} \times \frac{1}{5}} \right\} + \frac{5}{{10}} \times \left\{ {2 \times \frac{3}{5} \times \frac{2}{5}} \right\} = 0.4$$

$$Gini(X,{A_2}) = Gini(X,{A_1}) = 0.4$$

$$Gini(X,{B_1}) = \frac{3}{{10}} \times \left\{ {2 \times \frac{2}{3} \times \frac{1}{3}} \right\} + \frac{7}{{10}} \times \left\{ {2 \times \frac{5}{7} \times \frac{2}{7}} \right\} = 0.42$$

$$Gini(X,{B_2}) = \frac{4}{{10}} \times \left\{ {2 \times \frac{4}{4} \times \frac{0}{4}} \right\} + \frac{6}{{10}} \times \left\{ {2 \times \frac{3}{6} \times \frac{3}{6}} \right\} = 0.3$$

$$Gini(X,{B_3}) = \frac{3}{{10}} \times \left\{ {2 \times \frac{1}{3} \times \frac{2}{3}} \right\} + \frac{7}{{10}} \times \left\{ {2 \times \frac{6}{7} \times \frac{1}{7}} \right\} = 0.46$$

$$Gini(X,{C_1}) = \frac{3}{{10}} \times \left\{ {2 \times \frac{0}{3} \times \frac{3}{3}} \right\} + \frac{7}{{10}} \times \left\{ {2 \times \frac{4}{7} \times \frac{3}{7}} \right\} = 0.34$$

$$Gini(X,{C_2}) = \frac{3}{{10}} \times \left\{ {2 \times \frac{1}{3} \times \frac{2}{3}} \right\} + \frac{7}{{10}} \times \left\{ {2 \times \frac{5}{7} \times \frac{2}{7}} \right\} = 0.42$$

$$Gini(X,{C_3}) = \frac{3}{{10}} \times \left\{ {2 \times \frac{1}{3} \times \frac{2}{3}} \right\} + \frac{7}{{10}} \times \left\{ {2 \times \frac{6}{7} \times \frac{1}{7}} \right\} = 0.46$$
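Each side of these splits contributes its weight times $2p(1-p)$, so the values can be recomputed mechanically. A sketch (the `(pos, neg)` counts are read off the fractions in the formulas above, not from the original data table):

```python
def weighted_gini(groups):
    """Gini(X, .) of a binary split given (pos, neg) counts per side;
    each side contributes (side size / total) * 2p(1-p)."""
    total = sum(pos + neg for pos, neg in groups)
    score = 0.0
    for pos, neg in groups:
        n = pos + neg
        p = pos / n
        score += n / total * 2 * p * (1 - p)
    return score

print(round(weighted_gini([(4, 1), (3, 2)]), 2))  # Gini(X, A1) -> 0.4
print(round(weighted_gini([(2, 1), (5, 2)]), 2))  # Gini(X, B1) -> 0.42
print(round(weighted_gini([(4, 0), (3, 3)]), 2))  # Gini(X, B2) -> 0.3
print(round(weighted_gini([(0, 3), (4, 3)]), 2))  # Gini(X, C1) -> 0.34
```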

## 3 Regression Problems

$${X_1} = \{ x|{x_A} \leqslant a\}$$

$${X_2} = \{ x|{x_A} > a\}$$

$$Loss_{A,a} = \sum\limits_{x \in {X_1}} {(y - {c_1})^2} + \sum\limits_{x \in {X_2}} {(y - {c_2})^2}$$

$$\min\limits_{{c_1}} \sum\limits_{x \in {X_1}} {(y - {c_1})^2} + \min\limits_{{c_2}} \sum\limits_{x \in {X_2}} {(y - {c_2})^2}$$

$${c_i} = ave(y|x \in {X_i})$$

$$f(x)={c_i} = ave(y|x \in {X_i})$$
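That the optimal constant in each region is the mean follows by setting the derivative of the squared loss to zero:

$$\frac{\partial }{{\partial {c_i}}}\sum\limits_{x \in {X_i}} {(y - {c_i})^2} = - 2\sum\limits_{x \in {X_i}} {(y - {c_i})} = 0\quad \Longrightarrow \quad {c_i} = \frac{1}{{|{X_i}|}}\sum\limits_{x \in {X_i}} y$$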

（1）For the current data set $X$, for every feature $A$ and every candidate value $a$ of $A$, compute the loss $Loss_{A,a}$ obtained when $a$ is used as the split point, and record its minimum;

（2）Compare all the losses $Loss_{A,a}$ and take the pair achieving the smallest one: the corresponding feature $A$ is the optimal splitting feature and $a$ the optimal split point; partition the data set into the left and right subtrees accordingly;

（3）Repeat steps (1) and (2) on the data sets of the left and right subtrees until the data in each node satisfies the specified stopping condition; the decision tree is then complete.
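The steps above can be sketched in plain Python with NumPy (function and field names are my own, not from any particular library):

```python
import numpy as np

def best_split(X, y):
    """Steps (1)-(2): exhaustively search every feature A and every
    threshold a for the split minimizing
    Loss_{A,a} = sum((y - c1)^2) + sum((y - c2)^2),
    where c1, c2 are the means of y on each side."""
    best = (None, None, np.inf)  # (feature, threshold, loss)
    for A in range(X.shape[1]):
        for a in np.unique(X[:, A]):
            left, right = X[:, A] <= a, X[:, A] > a
            if not left.any() or not right.any():
                continue
            loss = ((y[left] - y[left].mean()) ** 2).sum() + \
                   ((y[right] - y[right].mean()) ** 2).sum()
            if loss < best[2]:
                best = (A, a, loss)
    return best

def build_tree(X, y, min_samples=2):
    """Step (3): recurse until a node is too small (or pure) to split."""
    if len(y) < min_samples or np.ptp(y) == 0:
        return {"value": float(y.mean())}  # leaf: predict the mean
    A, a, _ = best_split(X, y)
    if A is None:
        return {"value": float(y.mean())}
    left, right = X[:, A] <= a, X[:, A] > a
    return {"feature": A, "threshold": float(a),
            "left": build_tree(X[left], y[left], min_samples),
            "right": build_tree(X[right], y[right], min_samples)}
```

On a toy one-feature data set such as `X = [[1], [2], [3], [4]]`, `y = [1, 1, 5, 5]`, the search picks the threshold `2`, splitting the targets into two pure groups with zero loss.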

## 4 Tree Pruning

（1）Let $k=0$, $T=T_0$, $\alpha = + \infty$;

（2）Going top-down over the internal nodes $t$, compute $C({T_t})$, $|{T_t}|$, and

$$g(t) = \frac{{C(t) - C({T_t})}}{{|{T_t}| - 1}}$$

$$\alpha = \min (\alpha ,g(t))$$

（3）Going top-down, visit each internal node $t$; if $g(t)=\alpha$, prune at $t$, decide the output of the resulting leaf node $t$ by majority vote, and obtain the tree $T$;

（4）Let $k=k+1$, ${\alpha _k} = \alpha$, ${T_k} = T$;

（5）If $T$ does not consist of the root node alone, return to step (2);

（6）Use cross-validation to select the optimal subtree ${T_\alpha }$ from the subtree sequence ${T_0},{T_1}, \cdots ,{T_k}$.
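The quantity $g(t)$ can be read as the increase in error per leaf removed when the subtree at $t$ is collapsed. A minimal sketch (the node layout and error numbers below are hypothetical, purely for illustration):

```python
def num_leaves(node):
    """|T_t|: the number of leaf nodes in the subtree rooted at `node`."""
    if "children" not in node:
        return 1
    return sum(num_leaves(child) for child in node["children"])

def subtree_error(node):
    """C(T_t): total error over the leaves of the subtree."""
    if "children" not in node:
        return node["err"]
    return sum(subtree_error(child) for child in node["children"])

def g(node):
    """g(t) = (C(t) - C(T_t)) / (|T_t| - 1), where node["err"] is C(t),
    the error the node would incur if collapsed into a single leaf."""
    return (node["err"] - subtree_error(node)) / (num_leaves(node) - 1)

# Hypothetical node: collapsing it raises the error from 0+1+0=1 to 4
# while removing 3 - 1 = 2 leaves, so g(t) = (4 - 1) / 2 = 1.5.
t = {"err": 4, "children": [{"err": 0}, {"err": 1}, {"err": 0}]}
print(g(t))  # 1.5
```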

The nodes $t_2$ and $t_3$ attain the smallest $g(t)$; between them we choose the node that removes fewer nodes when pruned, namely $t_3$, and set ${\alpha _2} = \frac{1}{8}$. The pruned decision tree is shown below: