@Macux 2018-03-14T12:06:52.000000Z 字数 28127 阅读 2167

Gradient Boosting Decison Tree Tuning

Algorithm

0、GBDT 的个人理解

0.1 GBDT 宏观把握

GBDT 延续着提升树的套路：加法模型 & 前向分步算法求解

GBDT 为了提升一般性（在不同的cost function）情形下均可求解，做了如下调整：

第(t+1)棵树拟合的是第t棵树的“残差”。

为了方便计算在不同 cost function 下的残差，Freidman 大大提出用 cost function 的负梯度作为残差的近似值。

通过以上调整，GBDT 就能在任何 cost function 的情况下，基于加法模型和前向分步算法进行求解。

在不同 cost function 下，残差的计算有多么难，贴个图体会下：

整体的算法流程

0.2 GBDT 一些重要的点

GBDT 如何用于分类？
1. 首先明确一点，GBDT 无论用于分类还是回归一直都是使用的CART 回归树。GBDT 无论用于分类还是回归一直都是使用的CART 回归树。不会因为我们所选择的任务是分类任务就选用分类树，这里面的核心是因为 GBDT 每轮的训练是在上一轮的训练的残差基础之上进行训练的。这里的残差就是当前模型的负梯度值。这就要求每轮迭代的时候，弱分类器的输出的结果相减是有意义的。
2. 如果选用的弱分类器是分类树，类别相减是没有意义的。上一轮输出的是样本 x 属于 A类，本一轮训练输出的是样本 x 属于 B类。 A 和 B 很多时候甚至都没有比较的意义，A 类 - B类是没有意义的。
3. 言归正传，GBDT 如何解决分类问题？
  - 针对样本 X 每个可能的类都训练一个分类回归树。举例说明，目前样本有三类，也就是 K = 3。如果样本 x 属于第二类，那么针对该样本 x 的分类结果用一个三维向量 [0,1,0] 来表示。0表示样本不属于该类，1表示样本属于该类。由于样本已经属于第二类了，所以第二类对应的向量维度为1，其他位置为0。
  - 针对样本有三类的情况，实质上是在每轮的训练的时候是同时训练三颗树。第一颗树针对样本x的第一类，输入为 $（x,0）$ 。第二颗树输入针对样本x 的第二类，输入为 $（x,1）$ 。第三颗树针对样本x 的第三类，输入为 $（x,0）$
  - 每颗树的训练过程其实就是CATR TREE 的生成过程。假设三颗树对 x 的类别的预测值分别为 $f_{1}(x),f_{2}(x),f_{3}(x)$ ，那么在此类训练中，我们仿照多分类的逻辑回归，使用softmax 来产生概率，则属于类别 1 的概率：
    $p_{1}=exp(f_{1}{(x)})/\sum_{k= 1}^{3}exp(f_{k}{(x)})$
  - 此时，可以求出针对类别 1、2、3 各自残差：
    $f_{11}(x) = 0-f_{1}(x)$
    $f_{22}(x)= 1-f2(x)$
    $f_{33}(x)= 0-f_{3}(x)$
  - 然后开始第二轮训练。针对第一类输入的是 $（x,f_{11}(x)）$ , 针对第二类输入的是 $（x,f_{22}(x)）$ , 针对第三类输入的是 $（x,f_{33}(x)）$ 。继续训练出三颗树。一直迭代M轮。每轮构建3颗树。
  - 所以当K = 3，模型应该有三个式子：
    $F_{1M}{(x)}=\sum_{m=1}^{M}{\hat{C_{1m}}I(x\epsilon R_{1m})}$
    $F_{2M}{(x)}=\sum_{m=1}^{M}{\hat{C_{2m}}I(x\epsilon R_{2m})}$
    $F_{3M}{(x)}=\sum_{m=1}^{M}{\hat{C_{3m}}I(x\epsilon R_{3m})}$
  - 当训练完毕以后，新来一个样本 $x_i$ 需要预测该样本的类别的时候，便可以有这三个式子产生三个值， $f_{1}(x_i),f_{2}(x_i),f_{3}(x_i)$ 。样本属于某个类别c的概率为： $p_{c}=exp(f_{c}{(x_i)})/\sum_{k= 1}^{3}exp(f_{k}{(x_i)})$
GBDT 如何构建特征 ?
1. 其实说 GBDT 能够构建特征并非很准确， GBDT 本身是不能产生特征的，但是我们可以利用 GBDT 去产生特征的组合。
3. 如图所示，使用 GBDT 生成了两棵树，两颗树一共有五个叶子节点。将样本 X 输入到两颗树当中去，样本 X 落在了第一棵树的第二个叶子节点，第二颗树的第一个叶子节点。于是便可以依次构建一个五纬的特征向量，每一个纬度代表了一个叶子节点，样本落在这个叶子节点上面的话那么值为1，没有落在该叶子节点的话，那么值为 0。
4. 于是对于该样本，可以得到一个向量[0,1,0,1,0] 作为该样本的组合特征，和原来的特征一起输入到逻辑回归当中进行训练。实验证明这样会得到比较显著的效果提升。

1、Classification And Regression Trees, CART

1.1 Learning From Tianqi Chen's PPT

Regression Tree is not just for regression!

Regression tree ensemble defines how you make the prediction score, it can be used for Classification, Regression, Ranking....

It all depends on how you define the objective function!

So far we have learned:
（1）Using Square loss Will results in common gradient boosted machine. -- For Regression Issue
（2）Using Logistic loss Will results in LogitBoost. -- For Classification Issue.

1.2 Regression Decision Tree 整体把握

回归树的每一个节点都会得一个预测值，分枝时穷举每一个 $feature$ 的每个阈值找最好的分割点。
衡量最好的标准是最小化平方误差。

在用 GBDT 做特征工程时，为什么要使用 Regression Decision Tree：
（1）、 大牛解释：树的每条路径，是通过最小化 MSE 等方法分割出来的、有区分性的路径。
（2）、个人理解：树的路径代表了 MSE 最小的路径（ $feature$ ） $\Leftrightarrow$ 对 Label 贡献最强 & 最直接的路径。

1.3 Regression Decision Tree 数学原理

一棵回归树对应着输入空间（即特征空间）的一个划分以及在划分单元上的输出值。于是回归树的模型，可表示为：

$f(x)=\sum_{m=1}^Mc_mI(x\in R_m)$

使用平方误差 $\sum_{x_i\in R_m}(y_i-f(x_i))^2$ 作为 Loss Function，用 平方误差最小 准则求解每个单元上的最优输出值。

易知，单元 $R_m$ 上的 $c_m$ 的最优值 $\hat c_m$ ，是 $R_m$ 上的所有输入实例 $x_i$ 对应的输出 $y_i$ 的均值，即

$\hat c_m=ave(y_i|x_i\in R_m)$
PS：这只是当 $Loss$ $Function$ 是平方误差时的形式。一般情况是以枚举的方式来确定最佳的分裂点。

采用启发式方法划分输入空间。选择第 $j$ 个变量 $x^{(j)}$ 和它的取值 $s$ ，作为切分变量（splitting variable）和切分点（splitting point），并定义两个区域：

$R_1(j,s)=\{x|x^{(j)}\le s\}$ ， $R_2(j,s)=\{x|x^{(j)}\gt s\}$

为寻找最优的切分变量 $j$ 和最优切分点 $s$ ，求解：

$min_{j,s}[min_{c_1}\sum_{x_i\in R_1(j,s)}(y_i-c_1)^2 + min_{c_2}\sum_{x_i\in R_2(j,s)}(y_i-c_2)^2]$

为什么上式是减去固定的 $c_1$ 和 $c_2$ ？

输入空间被划分为 $M$ 个单元（ $R_1,R_2,...,R_M$ ），每个单元 $R_m$ 上只有一个固定的输出值 $c_m$ 。

1.4 Regression Decision Tree 过程总结

采用贪心策略来生成决策树的每个节点：
（1）、从深度为 0 的树开始，对每个叶节点枚举所有的可用特征。
（2）、针对每个特征，把属于该节点的训练样本根据该特征值升序排列，通过线性扫描的方式来决定该特征的最佳分裂点，并记录该特征的最大收益（采用最佳分裂点时的收益）。
（3）、选择收益最大的特征作为分裂特征，用该特征的最佳分裂点作为分裂位置，把该节点生长出左右两个新的叶节点，并为每个新节点关联对应的样本集。
（4）、回到第（1）步，递归执行到满足特定条件为止。

在上述算法的第（2）步，样本排序的时间复杂度为 $O(n \log n)$ 。

假设公用 $k$ 个特征，那么生成一颗深度为 $d$ 的树的时间复杂度为 $O(d·k·n\log n)$ 。

具体实现可以进一步优化计算复杂度，比如可以缓存每个特征的排序结果等。

2、Machine Learning 关键元素

2.1 模型 & 参数

模型，是由一个参数向量决定的函数。

参数，从数据中学习得到的知识。

在监督学习过程中，模型就是所要学习的条件概率分布或者决策函数，它决定了在给定特征向量 $x$ 时，如何预测出目标 $y$ 。比如：
（1）线性回归模型：定义 $x_i \in R^d$ 为训练集中的第 $i$ 个训练样本。

$\hat{y}_i = \sum_j{w_j x_{ij}}$ 表示模型的预测值。

（2）逻辑回归模型：定义 $x_i \in R^d$ 为训练集中的第 $i$ 个训练样本。

$\hat{y}_i = 1/(1+exp(-\hat{y}_i))$ 表示模型预测为正例的概率；

2.2 目标函数

目标函数的形式，通常如下：

                   $Obj(\Theta)=L(\Theta)+\Omega(\Theta)$
$L(\Theta)$ 为 Loss Function ，用来衡量模型拟合训练数据的好坏程度；
$\Omega(\Theta)$ 为    Regularizer  ，用来衡量学习到的模型的复杂度。

常用的损失函数有：

（1）平方损失      ： $l(y_i, \hat{y}_i)=(y_i - \hat{y}_i)^2$
（2）Logistic Loss： $l(y_i, \hat{y}_i)=y_i ln(1+e^{y_i}) + (1-y_i)ln(1+e^{\hat{y}_i})$

常用的正则项：

（1） $L1$ 范数： $\Omega(w)=\lambda \Vert w \Vert_1$
（2） $L2$ 范数： $\Omega(w)=\lambda \Vert w \Vert_2$

Logistic Regression 是指使用 Logistic Loss 和 L2 范数或 L1 范数正则项的线性模型。

Minimize your error（Loss Function） while Regularizing your parameters（Regularizer）.

（1）、损失函数最小化表明模型能够较好的拟合训练数据，一般也预示着模型能够较好地拟合真实数据（Ground Truth）。尽量使模型走出欠拟合的状态。
（2）、最小化正则项，意味着鼓励算法学习到较简单的模型，简单模型一般在测试样本上的预测结果比较稳定、方差较小（奥坎姆剃刀原则）。尽量使模型避免过拟合。

3、加法模型

考虑加法模型：

$f_M(x)=\sum_{m=1}^MT(x;\theta_m)$
$T$ 表示决策树， $\theta_m$ 表示决策树参数， $M$ 表示决策树的个数。

上述加法模型的目标函数定义为：

$Obj=\sum_{i=1}^n l(y_i, \hat{y}_i) + \sum_{k=1}^K \Omega(f_k)$
其中 $\Omega$ 表示决策树的复杂度，那么该如何定义树的复杂度呢？比如，可以考虑树的节点数量、树的深度或者叶子节点所对应的分数的 L2 范数等等。

如何来学习加法模型呢？ -- 前向分步算法（forward stagewise algorithm）。

（0）、前向分步算法的整体把握：利用迭代的思想（联想一下 EM algorithm），每次学习一棵树，第(t+1)棵树学习的是第t棵树的残差。学习的过程就是拟合度不断提高的过程，就是 cost function 不断减少的过程，否则还迭代学习个毛线球啊。通过这种方式，达到优化的效果：每一次迭代都能使 cost function 减小。
（1）、因为学习的是加法模型，如果能够从前往后，每一步只学习一个基函数及其系数，逐步逼近优化目标函数，这一学习过程称之为 Boosting。
（2）、具体地，我们从一个常量预测开始，每次学习一个基函数及其参数值，过程如下：
$\begin{split} \hat{y}_i^0 &= 0 \\ \hat{y}_i^1 &= f_1(x_i) = \hat{y}_i^0 + f_1(x_i) \\ \hat{y}_i^2 &= f_1(x_i) + f_2(x_i) = \hat{y}_i^1 + f_2(x_i) \\ & \cdots \\ \hat{y}_i^t &= \sum_{k=1}^t f_k(x_i) = \hat{y}_i^{t-1} + f_t(x_i) \\ \end{split}$
（3）、第 $t$ 步，模型对 $x_i$ 的预测为： $\hat{y}_i^t= \hat{y}_i^{t-1} + f_t(x_i)$ ， $f_t(x_i)$ 为第 $t$ 轮需要学习的决策树。
（4）、这个时候目标函数可以写为：
$\begin{split} Obj^{(t)} &= \sum_{i=1}^nl(y_i, \hat{y}_i^t) + \sum_{i=i}^t \Omega(f_i) \\ &= \sum_{i=1}^n l\left(y_i, \hat{y}_i^{t-1} + f_t(x_i) \right) + \Omega(f_t) + constant \end{split}\tag{1}$

联想问题

Boosting 和 Bagging 的区别

4、GBDT 算法解析

4.1 算法逻辑解析

迭代的目标的是上一个模型拟合数据的残差（redisual）：
（1）、每一次迭代都是使用的同样的方法（Loss Function 等其它超参设置），生成新的 CART 决策树（更多的是回归树！）
（2）、每一次拟合的目标数据集总是在变化：总是拟合上一轮迭代的残差。即只有 $Tree_1$ 是根据 $y_i$ 来拟合树， $Tree_2$ ~ $Tree_k$ 是拟合前一颗决策树的 残差表。

4.2 梯度提升解析

对一般损失函数而言，每一步优化可能不太容易，利用最速下降法的近似方法，其关键是利用损失函数的负梯度在当前模型中的值：

$\begin{split} -[{\partial L(y,f(x_i))\over\partial f(x_i)}]_{f(x)=f_{m-1}(x)}\end{split}$

作为回归问题提升树算法中的残差的近似值，拟合一个回归树。

梯度提升，可以理解为：每一轮新迭代产生的 $Tree_{i+1}$ 对 $Tree_{i}$ 拟合中产生的残差，给予更多的关注。

通过 Loss Function 的负梯度来拟合，就找到一种通用的拟合损失误差的方法。

近似值，可以理解为：负梯度是残差常规表达式的特例。
为什么拟合前一棵树的残差，有助于提高整个模型的拟合度？
先看一张图

GBDT 使用的是 加法模型 $\Leftrightarrow$ 预测值等于所有树值得累加。
比如上图中，A 的预测值 = (树 1 左节点值 15) + (树 2 左节点 -1) = 14。

拟合前一棵树的残差，其实是对上一轮迭代预测结果的修正。
这种“修正”体现在：将上一轮迭代的预测值 $+$ Loss Function 的梯度在当前模型的值。易见，这种修正是行之有效的。
PS：
喂给树模型的数据集，要么特征是 numerical 要么是经过 one-hot 处理的字符串特征。
否则就会报错：
ValueError: could not convert string to float
GBDT分为两种

残差版本 　

在学习的过程中，首先学习一颗回归树，然后将“真实值-预测值”得到残差，再把残差作为一个学习目标，学习下一棵回归树。

依次类推，直到残差小于某个接近0的阀值或回归树数目达到某一阀值。其核心思想是每轮通过拟合残差来降低损失函数。

总的来说，第一棵树是正常的，之后所有的树的决策全是由残差来决定，之后的每一棵回归树都在学习前N-1棵树的残差

Gradient版本

Gradient版本把GBDT说成一个梯度迭代树，使用梯度下降法求解。

Gradient版本认为每一棵回归树在学习前N-1棵树的梯度下降值。

相同之处

都是迭代回归树，都是累加每颗树结果作为最终结果；

每棵树都在学习前N-1棵树尚存的不足，从总体流程和输入输出上两者是没有区别的。

不同之处（敲黑板）

在于每步迭代时，是否使用Gradient作为求解方法。

前者不用Gradient而是用残差—-残差是全局最优值，Gradient是局部最优方向 * 步长。

即前者每一步都在试图让结果变成最好，后者则每步试图让结果更好一点。

看起来残差版本的GBDT更科学一点（有绝对最优方向），为什么舍近求远去估计一个局部最优方向呢？原因在于灵活性：

残差版本的GBDT最大问题是，由于它依赖残差，cost function一般固定为反映残差的均方差，因此很难处理纯回归问题之外的问题。

Gradient版本求解方法是梯度下降，只要可求导的cost function都可以使用。

4.3 关于 GBDT 的一些很棒的提问（出处）

怎样设置单棵树的停止生长条件？

答：
(1). 节点分裂时的最小样本数；
(2). 最大深度；
(3). 最多叶子节点数；
(4). loss 满足约束条件；
从算法原理上来看，第 (4) 条的约束最重。
从调参方便度来看，第 (2) 是最方便调参的。
从参数的重要度看，第 (1) & (3) 的权重 $>$ 第 (2) 的权重。

如何评估特征的权重大小？

答：
(1). 通过计算每个特征在训练集下的信息增益，最后计算每个特征信息增益与所有特征信息增益之和的比例为权重值。
(2). 借鉴投票机制。用相同的 GBDT 参数对每个特征训练出一个模型，然后在该模型下计算每个特征正确分类的个数，最后计算每个特征正确分类的个数与所有正确分类个数之和的比例为权重值。

如何人工去干预某列特征的权重？

答：
根据业务逻辑进行人工干预。
很懵逼！ GBDT 并没有列抽样的功能设定，XGBoost就有！
问题来了，XGBoost 如何控制列抽样时，不同特征的权重？？

当增加样本数量时，训练时长是线性增加吗？

答：
不是。因为生成单棵决策树时， $Loss$ $Function$ 极小值 $C_{mj}$ 与样本数量 $N$ 不是线性相关。 $arg \min_c\sum_{i=1}^NL(y_i,c)$

当增加树的棵树时，训练时长是线性增加吗？

答：
不是。因为每棵树的生成的时间复杂度不一样。
因为每棵树用到的特征数、样本数和深度不一样，所以生成树的时间复杂度会不一样。

当增加一个棵树叶子节点数目时，训练时长是线性增加吗？

答：
不是。叶子节点数和每棵树的生成的时间复杂度不成正比。

每个节点上都保存什么信息？

答：

中间节点，保存的是当前分裂后收益( $Score$ )，即上图的 $Gain$

叶子节点，保存的是不分裂的的收益( $Score$ )，即上图的 $the\,score\,of\,if\,we\,do\,not\,split$

如何防止过拟合？

答：

增加样本（data bias or small data的缘故），移除噪声。

减少特征，保留重要的特征（可以用PCA等对特征进行降维）。

对样本进行采样。建树的时候，不是把所有的样本都作为输入，而是选择一个子集。

对特征进行采样。类似样本采样一样, 每次建树的时候，只对部分的特征进行切分。
但采样有个很严重的问题，就是不可复现，也就是每次的训练结果不一样，不知道是调了参数导致效果好，还是采样导致效果好，还是其他的?（如何实现？？？）

在迭代中，动态调整学习率。（如何实现？？？）

训练模型的同时学习学习率/步长。（如何实现？？？）

对回归树进行剪枝。

减少回归树的个数, 尤其那些作用不大的回归树（在验证集上进行测试）。

GBDT 在训练和预测的时候都用到了步长，这两个步长一样么？都有什么用，如果不一样，为什么？怎么设步长的大小？（太小？太大？）在预测时，设太大对排序结果有什么影响？跟 shrinking里面的步长一样么？

这两个步长一样么？
答：
训练跟预测时，两个步长是一样的，也就是预测时的步长为训练时的步长，从训练的过程可以得知（更新当前迭代模型的时候）。

都有什么用，如果不一样，为什么？
答：
它的作用就是使得每次更新模型的时候，使得 loss 能够平稳地沿着负梯度的方向下降，不至于发生震荡。类似下图：

怎么设步长的大小?
答：
有两种方法，一种就是按策略来决定步长，另一种就是在训练模型的同时，学习步长。
A. 策略：
a 每个树步长恒定且相等，一般设较小的值；
b 开始的时候给步长设一个较小值，随着迭代次数动态改变，或者说衰减。
B. 学习：
因为在训练第 k 棵树的时候，前 k-1 棵树时已知的，而且求梯度的时候是利用前k-1棵树来获得。所以这个时候，就可以把步长当作一个变量来学习。（如何实现？？？）

（太小？太大？）在预测时，对排序结果有什么影响？
答：
如果步长过大，在训练的时候容易发生震荡，使得模型学不好，或者完全没有学好，从而导致模型精度不好。
而步长过小，导致训练时间过长，即迭代次数较大，从而生成较多的树，使得模型变得复杂，容易造成过拟合以及增加计算量。
不过步长较小的话，使训练比较稳定，总能找到一个稳定的局部最优解。
个人觉得过大过小的话，模型的预测值都会偏离真实情况（可能比较严重），从而导致模型精度不好。

跟 shrinking 里面的步长一样么？
答：
这里的步长跟 shrinking 里面的步长是一致的。

4.4 GBDT 和 Random Forest 的区别

回顾一下 Random Forest 的算法原理

（1）Random Forest 是 Bagging 的一个扩展变体。
（2）Random Forest 在以 CART 决策树为基学习器构建 Bagging 集成的基础上，在决策树的训练过程中引入了随机属性选择：“随机地选择训练样本” & “随机地选择特征集合”。
（3）随机地选择训练样本：从原始训练集 $S$ 中，以有放回抽样的方式，取样 $N$ 次，形成一个训练集 $S(i)$ ，作为当前 CART 决策树的训练集。
（4）随机地选择特征集合：对于每一个节点，从特征集合中（假设有 $D$ 个特征）随机选择 $K$ 个特征，运用这 $K$ 个变量来决定最佳的分裂点（最优的切分变量 & 最优切分点）。在当前决策树的生成过程中， $K$ 的值是保持不变的。
（5）这里的参数 $K$ 控制了随机性的引入程度。当 $K = D$ 时，则基决策树的构建与传统决策树相同，若 $K = 1$ ，则是随机选择一个属性用于划分。一般情况下，推荐值 $K=log_2D$ 或 $K=\sqrt{D}$ 。
（6）整体理解：Random Forest 是从 $D$ 个 $feature$ 中选择 $K$ 个让每一棵决策树进行学习，就像是把每一棵决策树分别培养成了精通于某一个窄领域的专家。因此在随机森林中有很多个不同领域的专家，对一个新的问题（新的输入数据）可以从不同的角度去看待，最终由各位专家投票得到结果。

GBDT 和 Random Forest 的区别

对于决策树的集成学习，有两个重要的策略：Bagging & Boosting
（1）Bagging 的玩法是：每个分类器都随机从原样本中做有放回的抽样，然后分别在这些抽样后的样本上训练基分类器，最后把这些分类器组合起来，对新问题的输出以投票决定。其代表算法是 Random Forest。
（2）Boosting 的玩法是：通过迭代地训练一系列的分类器，每个分类器采用的样本分布都和上一轮的学习结果有关。其代表算法是AdaBoost，GBDT。

GBDT 和 Random Forest 在本质上是属于不同集成策略的集成学习（ensemble learning）算法

为什么 XGBoost/GBDT 在调参时为什么树的深度很少就能达到很高的精度？

（1）Boosting 主要关注降低偏差，因此 Boosting 能基于泛化性能相当弱的学习器构建出很强的集成；而 Bagging 主要关注降低方差，因此它在不剪枝的决策树、神经网络等学习器上效用更为明显。
（2）对于 Boosting 来说，每一步我们都会在上一轮的基础上更加拟合原数据，所以可以保证偏差（bias），所以对于每个基分类器来说，问题就在于如何选择 variance 更小的分类器，即更简单的分类器，所以 XGBoost/GBDT 选择了深度很浅的决策树。
（3）GBDT 生成的 N 棵树是有联系的，后面的树学习的是前面树的残差（前面学习的不好的地方），因此即使树的深度比较少，也可以有很好的效果。而 Random Forest 的每棵树是独立的。

4.5 GBDT 和 XGBoost 的区别

回顾一下 XGBoost 的算法原理

整体目标：
$L\left( \phi \right) = \sum\limits_i {l\left( {{{y_i},{\hat y}_i}} \right)} + \sum\limits_k {\Omega \left( {{f_k}} \right)}$
其中 $L\left( \cdot \right)$ 为目标函数， $l(\cdot)$ 是 $Loss$ $Function$ ，通常是凸函数，用于刻画预测值 ${{{\hat y}_i}}$ 和真实值 ${{y_i}}$ 的差异，第二项 $\Omega \left( \cdot \right)$ 为模型的正则化项，用于降低模型的复杂度，避免模型过拟合。

$L\left( \cdot \right)$ 为函数空间上的表达，现将其转换为下面这张 $gradient$ $boosting$ 的方式，记 ${\hat y}^{(t)}_i$ 为第 $i$ 个样本第 $t$ 轮迭代：
${L^{\left( t \right)}} = \sum\limits_{i = 1}^n {l\left( {{y_i},{{\hat y}_i}^{\left( {t - 1} \right)} + {f_t}\left( {{{\bf{x}}_i}} \right)} \right)} + \Omega \left( {{f_t}} \right)$
对该函数在 ${\hat y}^{(t)}_i$ 位置进行二阶泰勒展开，可以加速优化过程。具体过程如下：
（1）、 $l(y_i, x)$ 在 ${\hat y}^{(t-1)}_i$ 处进行二阶展开得到：
$l({y_i},x) = l\left( {{y_i},{{\hat y}^{\left( {t - 1} \right)}}} \right) + {{\partial l\left( {{y_i},{{\hat y}^{\left( {t - 1} \right)}}} \right)} \over {\partial {{\hat y}^{\left( {t - 1} \right)}}}}\left( {x - {{\hat y}^{\left( {t - 1} \right)}}} \right) + {{{\partial ^2}l\left( {{y_i},{{\hat y}^{\left( {t - 1} \right)}}} \right)} \over {\partial {{\hat y}^{\left( {t - 1} \right)}}}}{\left( {x - {{\hat y}^{\left( {t - 1} \right)}}} \right)^2}$
（2）、令 $x={\hat y}^{(t-1)}+f_t({\bf x}_i)$
（3）、记一阶导为 ${g_i} = {{\partial l\left( {{y_i},{{\hat y}^{\left( {t - 1} \right)}}} \right)} \over {\partial {{\hat y}^{\left( {t - 1} \right)}}}}$ ，二阶导为 ${h_i} = {{{\partial ^2}l\left( {{y_i},{{\hat y}^{\left( {t - 1} \right)}}} \right)} \over {\partial {{\hat y}^{\left( {t - 1} \right)}}}}$ 。
（4）、从而， $l(y_i, {\hat y}^{(t-1)}+f_t({\bf x}_i))$ 的二阶泰勒展开：
$l\left( {{y_i},{{\hat y}^{\left( {t - 1} \right)}}} \right) + {g_i}{f_t}\left( {{{\bf{x}}_i}} \right) + {1 \over 2}{h_i}f_t^2\left( {{{\bf{x}}_i}} \right)$
（5）、带入目标函数可得：
${L^{\left( t \right)}} \simeq \sum\limits_{i = 1}^n {\left[ {l\left( {{y_i},{{\hat y}^{\left( {t - 1} \right)}}} \right) + {g_i}{f_t}\left( {{{\bf{x}}_i}} \right) + {1 \over 2}{h_i}f_t^2\left( {{{\bf{x}}_i}} \right)} \right]} + \Omega \left( {{f_t}} \right)$

由于函数中的常量在函数最小化的过程中不起作用，删除常数项（第一项）可得：
${{\tilde L}^{\left( t \right)}} = \sum\limits_{i = 1}^n {\left[ {{g_i}{f_t}\left( {{{\bf{x}}_i}} \right) + {1 \over 2}{h_i}f_t^2\left( {{{\bf{x}}_i}} \right)} \right]} + \Omega \left( {{f_t}} \right)\tag 1$

假设决策树模型的复杂度由生成的树的叶子节点数量和叶子节点对应的值向量的 $L2$ 范数决定，即 $\Omega(f_t)=\gamma T + \frac12 \lambda \sum_{j=1}^T w_j^2$

那么 $\Omega$ 展开原目标函数改写为：
$\eqalign{ & {{\tilde L}^{\left( t \right)}} = \sum\limits_{i = 1}^n {\left[ {{g_i}{f_t}\left( {{{\bf{x}}_i}} \right) + {1 \over 2}{h_i}f_t^2\left( {{{\bf{x}}_i}} \right)} \right]} + \gamma T + {1 \over 2}\lambda \sum\limits_{j = 1}^T {w_j^2} \cr & \;\;\;\;\;{\rm{ = }}\sum\limits_{j = 1}^T {\left[ {\left( {\sum\limits_{i \in {I_j}} {{g_i}} } \right){w_j} + {1 \over 2}\left( {\sum\limits_{i \in {I_j}} {{h_i}} + \lambda } \right)w_j^2} \right]} + \gamma T \cr} \tag 2$

假设一颗决策树的叶子节点个数为 $T$ $\Leftrightarrow$ 该决策树是由所有叶子节点对应的值组成的向量 $w \in R^T$ ，以及一个把特征向量映射到叶子节点索引（Index）的函数 $q:R^d \to \{1,2,\cdots,T\}$ 组成。因此，决策树可以定义为： $f_t(x)=w_{q(x)}$ 。

定义集合 $I_j=\{ i \vert q(x_i)=j \}$ 为所有被划分到叶子节点的训练样本的集合。

对于固定的树结构 $q(x)$ ，对 $w_j$ 导得解析解 $w_j^* = {{\sum\limits_{i \in {I_j}} {{g_i}} } \over {\sum\limits_{i \in {I_j}} {{h_i} + \lambda } }}$

代入到 (1) 式中，可得：
$Obj = -\frac12 \sum_{j=1}^T \frac{G_j^2}{H_j+\lambda} + \gamma T \tag 3$

公式(3)可以作为分裂节点的打分，形式上很像 CART 树纯度打分的计算，区别在于它是从目标函数中推导而得。图中显示目标函数值的计算：

通常情况下，我们采用贪心策略来生成决策树的每个节点。
（1）从深度为 0 的树开始，对每个叶节点枚举所有的可用特征；
（2）针对每个特征，把属于该节点的训练样本根据该特征值升序排列（样本排序的时间复杂度为： $O(n \log n)$ ），通过线性扫描的方式来决定该特征的最佳分裂点，并记录该特征的最大收益（采用最佳分裂点时的收益），具体过程如下：
I). 枚举每个叶结点上的特征潜在的分裂点。
II). 对每个潜在的分裂点，计算如果以这个分裂点对叶结点进行分割以后，分割前和分割后的 $Loss$ $Function$ 的变化情况。GBDT遇到负损失，马上停止分裂。xgboost会一直分裂到最大深度，再回过头来剪枝。
（3）选择收益最大的特征作为分裂特征，用该特征的最佳分裂点作为分裂位置，把该节点生长出左右两个新的叶节点，并为每个新节点关联对应的样本集；
（4）回到第 1 步，递归执行到满足特定条件为止。

数据集有 $K$ 个特征，那么生成一颗深度为 $d$ 的树的时间复杂度为： $O(dKn\log n)$

如何计算每次分裂的收益？
假设当前节点记为 $C$ ，分裂之后左孩子节点记为 $L$ ，右孩子节点记为 $R$ 。
则该分裂获得的收益为：当前节点的目标函数值 $-$ 左右两个孩子节点的目标函数值之和
$\Leftrightarrow$ $Gain=Obj_C-Obj_L-Obj_R$ 。具体地，根据等式(3)可得：
$Gain=\frac12 \left[ \frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma \tag 4$
其中， $-\gamma$ 项表示因为增加了树的复杂性（该分裂增加了一个叶子节点）带来的惩罚。等式(4)还可以用来计算输入特征的相对重要程度。

GBDT 和 XGBoost 的区别

传统GBDT的每颗树学习的是 $Loss$ $Function$ 在上一轮预测值的梯度，而 XGBoost 是直接学习残差。

传统 GBDT 以 CART 作为基分类器，XGBoost 还支持线性分类器。

传统 GBDT 在优化时只用到一阶导数信息，XGBoost 则对代价函数进行了二阶泰勒展开，同时用到了一阶和二阶导数，优化速度更快。PS：XGBoost 支持自定义 $Loss$ $Function$ ，只要函数可一阶和二阶求导。

XGBoost 在 $Loss$ $Function$ 里加入了正则项，用于控制模型的复杂度，防止过拟合，这也是 XGBoost 优于传统 GBDT 的一个特性。

Shrinkage（缩减），相当于学习速率（XGBoost 中的 eta）。XGBoost 在进行完一次迭代后，会将叶子节点的权重乘上该系数，目的是削弱每棵树的影响，让后面有更大的学习空间。实际应用中，一般把 eta 设置得小一点，然后迭代次数设置得大一点。（补充：传统 GBDT 的实现也有学习速率）（个人认为，GBDT 的 learning_rate 和 XGBoost 的 eta 并没有什么区别。）

列抽样（column subsampling）。XGBoost 借鉴了随机森林的做法，支持列抽样，不仅能降低过拟合，还能减少计算，这也是 XGBoost 异于传统 GBDT 的一个特性。（由参数 colsample_bytree 控制）

对缺失值的处理。对于特征的值有缺失的样本，XGBoost 可以自动学习出它的分裂方向。（待补充，弄懂是什么意思？？？）

XGBoost 工具支持并行。XGBoost 的并行不是 tree 粒度的并行 $\Leftrightarrow$ XGBoost 也是一次迭代完才能进行下一次迭代的 $\Leftrightarrow$ 第 t 次迭代的代价函数里包含了前面 t-1 次迭代的预测值。XGBoost 的并行是在特征粒度上的。我们知道，决策树的学习最耗时的一个步骤就是对特征的值进行排序（因为要确定最佳分割点），XGBoost 在训练之前，预先对数据进行了排序，然后保存为 block 结构，后面的迭代中重复地使用这个结构，大大减小计算量。这个 block 结构也使得并行成为了可能，在进行节点的分裂时，需要计算每个特征的增益，最终选增益最大的那个特征去做分裂，那么各个特征的增益计算就可以开多线程进行。

可并行的近似直方图算法。树节点在进行分裂时，我们需要计算每个特征的每个分割点对应的增益，即用贪心法枚举所有可能的分割点。当数据无法一次载入内存或者在分布式情况下，贪心算法效率就会变得很低，所以 XGBoost 还提出了一种可并行的近似直方图算法，用于高效地生成候选的分割点。（待补充，弄懂是什么意思？？？）

The XGBoost Advantage compared to GBDT

Regularization

(1)、Standard GBM implementation has no regularization like XGBoost, therefore it also helps to reduce overfitting.
(2)、In fact, XGBoost is also known as ' $regularized$ $boosting$ ' technique.
(3)、XGBoost 自带正则项来约束模型的复杂度。

Parallel Processing

(1)、XGBoost implements parallel processing and is blazingly faster as compared to GBM.
(2)、XGBoost also supports implementation on Hadoop or Spark.
(3)、XGBoost 支持特征粒度上的并行计算，前有详述。

High Flexibility

(1)、XGBoost allow users to define XGBoost allow users to define custom optimization objectives and evaluation criteria.（XGBoost 允许玩家自定义优化目标函数和评价标准（指标/函数））
(2)、This adds a whole new dimension to the model and there is no limit to what we can do.

Handling Missing Values

(1)、XGBoost has an in-built routine to handle missing values.
(2)、User is required to supply a different value than other observations and pass that as a parameter. XGBoost tries different things as it encounters a missing value on each node and learns which path to take for missing values in future.
(3)、对于特征的值有缺失的样本，XGBoost 可以自动学习出它的分裂方向。
(4)、玩家必须提供一个与现有取值都不相同的特殊值来代替缺失值，默认是 -999。如果每个节点都有缺失值，XGBoost 会采取特殊的处理，尝试学习哪个路径是更好的选择

Tree Pruning

(1)、A GBM would stop splitting a node when it encounters a negative loss in the split. Thus it is more of a greedy algorithm.（GBM模型的停止条件是当节点继续分列时不能减少损失，这是一种贪心的策略。）
(2)、XGBoost on the other hand make splits upto the max_depth specified and then start pruning the tree backwards and remove splits beyond which there is no positive gain.（XGBoost 通过指定参数 max_depth 限制深度，停止分裂的条件是分裂不能带来正增益。）
(3)、Another advantage is that sometimes a split of negative loss say -2 may be followed by a split of positive loss +10. GBM would stop as it encounters -2. But XGBoost will go deeper and it will see a combined effect of +8 of the split and keep both.
(4)、GBDT 只要遇到负的分裂收益，则停止分裂，不管是否达到预设的最大深度。而 XGBoost 会先按照预设的 max_depth 进行分裂，and then start pruning the tree backwards and remove splits beyond which there is no positive gain.
(5)、如果某一个分裂的收益是 $-2$ ，紧接着的分裂收益是 $+10$ 。GBDT 在遇到 '-2' 的收益时会马上停止分裂。而 XGBoost 会进行权衡：这两个分裂加起来的总收益是 $+8$ 。

Built-in Cross-Validation

(1)、XGBoost allows user to run a cross-validation at each iteration of the boosting process and thus it is easy to get the exact optimum number of boosting iterations in a single run.（XGBoost 允许使用者在每次迭代都进行一次交叉验证。因此，很容易在一次实验中直接得到最佳的迭代次数）
(2)、This is unlike GBM where we have to run a grid-search and only a limited values can be tested.（而 GBM 只能进行参数网格搜索来测试有限的取值，进而确定最佳迭代次数）

Continue on Existing Model

(1)、User can start training an XGBoost model from its last iteration of previous run. This can be of significant advantage in certain specific applications.
(2)、GBM implementation of sklearn also has this feature by using 'warm_start '.

5、Parameter Tuning Of GBDT & XGBoost

5.1 GBDT's Parameters Tuning

Let's look at the general structure of a decision tree:
Begin to tune Tree-Specific Parameters：These affect each individual tree in the model.
1. min_samples_split
  
  (1)、Defines the minimum number of samples (or observations) which are required in a node to be considered for splitting.（内部节点再划分所需最小样本数，如果某节点的样本数少于min_samples_split，则不会继续再尝试选择最优特征来进行划分。）
  (2)、Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.（如果样本量不大，不需要管这个值。如果样本量数量级非常大，则推荐增大这个值。）
  (3)、Too high values can lead to under-fitting hence, it should be tuned using CV.
  (4)、This should be $0.5\%-1\%$ of total values. Since this is imbalanced class problem, we’ll take a small value from the range.
2. min_samples_leaf
  
  (1)、Defines the minimum samples (or observations) required in a terminal node or leaf.（叶子节点最少样本数，这个值限制了叶子节点最少的样本数，如果某叶子节点数目小于样本数，则会和兄弟节点一起被剪枝，默认是1。整数 $\rightarrow$ 最少的样本数，小数 $\rightarrow$ 最少样本数占样本总数的百分比。如果样本量不大，不需要管这个值。如果样本量数量级非常大，则推荐增大这个值。）
  (2)、Used to control over-fitting similar to min_samples_split.
  (3)、Generally lower values should be chosen for imbalanced class problems because the regions in which the minority class will be in majority will be very small.（一般来说，当样本的类别不均衡时，参数应该取较低的值。因为如果某一个类的样本量本身很少，要找到一个区域是这个少数类占据大多数，是很少见的。）
3. min_weight_fraction_leaf
  
  (1)、Similar to min_samples_leaf but defined as a fraction of the total number of observations instead of an integer.
  (2)、Only one of #2 and #3 should be defined.（min_samples_leaf 和 min_weight_fraction_leaf 设置其中一个足矣。）
  (3)、Can be selected based on intuition. This is just used for preventing overfitting and again a small value because of imbalanced classes.
4. max_depth
  
  (1)、The maximum depth of a tree.
  (2)、Used to control over-fitting as higher depth will allow model to learn relations very specific to a particular sample.
  (3)、Should be tuned using CV.
  (4)、A good rule of thumb is that choosing (5-8) based on the number of observations and predictors always works. if dataset has 87K rows and 49 columns we will take 8.
5. max_leaf_nodes
  
  (1)、The maximum number of terminal nodes or leaves in a tree.
  (2)、Can be defined in place of max_depth. Since binary trees are created, a depth of $n$ would produce a maximum of $2^n$ leaves.
  (3)、If this is defined, GBM will ignore max_depth.
6. max_features
  
  (1)、The number of features to consider while searching for a best split. These will be randomly selected.（划分时考虑的最大特征数，如果样本特征数不多，比如小于50，我们用默认的"None"就可以了）
  (2)、As a thumb-rule, square root of the total number of features works great but we should check upto $30\%-40\%$ of the total number of features.
  (3)、Higher values can lead to over-fitting but depends on case to case.
  (4)、Its a general thumb-rule to start with square root. $\Leftrightarrow$ 设置为"sqrt"或者"auto"，划分时最多考虑 $\sqrt{N}$ 个特征。
Lets consider another set of parameters for managing boosting.
1. learning_rate
  
  (1)、This determines the impact of each tree on the final outcome (step 2.4). GBM works by starting with an initial estimate which is updated using the output of each tree. The learning parameter controls the magnitude of this change in the estimates.（较小的 learning_rate 可以削弱每棵树的影响，让后面有更大的学习空间。参见 GBDT 与 XGBoost 的区别）
  (2)、Lower values are generally preferred as they make the model robust to the specific characteristics of tree and thus allowing it to generalize well.
  (3)、Lower values would require higher number of trees to model all the relations and will be computationally expensive.（较小的 learning_rate 意味着我们需要更多的弱学习器的迭代次数。）
  (4)、较小的 $\nu$ 意味着需要更多的弱学习器的迭代次数。步长（learning_rate） 和 迭代最大次数（n_estimators）一起来决定算法的拟合效果。可以从一个小一点的 $\nu$ 开始调参，默认是 1。
2. n_estimators
  
  (1)、The number of sequential trees to be modeled.（弱学习器的最大迭代次数，或者说最大的弱学习器的个数。）
  (2)、Though GBM is fairly robust at higher number of trees but it can still overfit at a point. Hence, this should be tuned using CV for a particular learning rate.
3. subsample
  
  (1)、The fraction of observations to be selected for each tree. Selection is done by random sampling.（这里的子采样和随机森林不一样，随机森林使用的是放回抽样，而这里是不放回抽样。）
  (2)、Values slightly less than 1 make the model robust by reducing the variance.
  (3)、Typical values 0.8 generally work fine but can be fine-tuned further.
  (4)、如果取值为 1，则全部样本都使用，即使用子采样。（default）
Apart from these, there are certain miscellaneous parameters which affect overall functionality.
1. loss
  (1)、It refers to the loss function to be minimized in each split.
  (2)、It can have various values for classification and regression case. Generally the default values work fine. Other values should be chosen only if you understand their impact on the model.
  (3)、分类模型和回归模型的 $Loss$ $Function$ 是不一样的。
  - 对于分类模型，有对数似然损失函数（deviance）和指数损失函数（exponential）两者输入选择。默认是deviance。对数似然损失函数非常类似于平方损失，是因为在使用梯度下降来求最优解的时候，它的迭代式子与平方损失求导后的式子非常相似。
  - 对于回归模型，有均方差（ls）, 绝对损失（lad）, Huber损失（huber）和分位数损失（quantile）。默认是 ls。
    （1）、如果数据的噪音点不多，用默认的均方差 ls 比较好。如果是噪音点较多，则推荐用抗噪音的损失函数 huber。
    （2）、如果我们需要对训练集进行分段预测的时候，则采用 quantile 。
    （3）、对于正态分布数据，huber 损失的效果近似于平方差损失；
    （4）、对于长尾数据，huber 损失的效果近似于绝对值损失；
    （5）、中等程度拖尾的数据，huber 的表现是最好的 & 无论合适，huber 对 Outlier 的容忍程度是最高的；
  - alpha：这个参数只有 GradientBoostingRegressor 才有，当我们使用 huber 和 quantile 时，需要指定分位数的值。默认是0.9，如果噪音点较多，可以适当降低这个分位数的值。
2. init
  
  (1)、This affects initialization of the output.
  (2)、This can be used if we have made another model whose outcome is to be used as the initial estimates for GBM.
3. random_state
  
  (1)、The random number seed so that same random numbers are generated every time.
  (2)、This is important for parameter tuning. If we don’t fix the random number, then we’ll have different outcomes for subsequent runs on the same parameters and it becomes difficult to compare models.
  (3)、It can potentially result in overfitting to a particular random sample selected. We can try running models for different random samples, which is computationally expensive and generally not used.
  (4)、控制子采样的随机 seed，同样的随机 seed 可以保证同样的采样结果。
4. verbose
  
  The type of output to be printed when the model fits. The different values can be：
  (1)、 $0$ ：no output generated (default)
  (2)、 $1$ ：output generated for trees in certain intervals
  (3)、 $>1$ ：output generated for all trees
5. warm_start
  
  (1)、This parameter has an interesting application and can help a lot if used judicially.
  (2)、Using this, we can fit additional trees on previous fits of a model. It can save a lot of time and you should explore this option for advanced applications.
Author have simplified it for you in an excel file which you can download from his GitHub repository.

5.2 XGBoost's Parameters Tuning

General Parameters：Guide the overall functioning
1. booster [default=gbtree]
  
  Select the type of model to run at each iteration. It has 2 options:
  (1)、gbtree: tree-based models
  (2)、gblinear: linear models
2. silent [default=0]
  
  (1)、Silent mode is activated is set to 1, i.e. no running messages will be printed.
  (2)、It’s generally good to keep it 0 as the messages might help in understanding the model.
  (3)、取 0 (default) 时表示打印出运行时信息，取 1 时表示以缄默方式运行，不打印运行时信息。
3. nthread [default to maximum number of threads available if not set]
  
  (1)、This is used for parallel processing and number of cores in the system should be entered
  (2)、If you wish to run on all cores, value should not be entered and algorithm will detect automatically
  (3)、XGBoost 运行时的线程数。缺省值是当前系统可以获得的最大线程数。
4. num_pbuffer [set automatically by XGBoost, no need to be set by user]
  
  (1)、size of prediction buffer, normally set to number of training instances. The buffers are used to save the prediction results of last boosting step.
5. num_feature [set automatically by XGBoost, no need to be set by user]
  
  (1)、boosting 过程中用到的特征维数，设置为特征个数。XGBoost 会自动设置，不需要手工设置
Booster Parameters：Guide the individual booster (tree/regression) at each step
Though there are 2 types of boosters, I'll consider only tree booster here because it always outperforms the linear booster and thus the later is rarely used.
1. eta [default=0.3]
  
  (1)、Analogous to learning_rate in GBM.
  (2)、Makes the model more robust by shrinking the weights on each step.
  (3)、Typical final values to be used: $0.01-0.2$ .
  (4)、XGBoost 在进行完一次迭代后，会将叶子节点的权重乘上该系数，目的是削弱每棵树的影响，让后面有更大的学习空间。
2. min_child_weight [default=1]
  
  (1)、Defines the minimum sum of weights of all observations required in a child.
  (2)、This is similar to min_child_leaf in GBM but not exactly. This refers to min “sum of weights” of observations while GBM has min “number of observations”.
  (3)、Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
  (4)、Too high values can lead to under-fitting hence, it should be tuned using CV.
  (5)、子节点中最小的样本权重和。如果一个叶子节点的样本权重和小于 min_child_weight 则分裂过程结束。
  (6)、在线性回归模型中，这个参数是指建立每个模型所需要的最小样本数。该参数越大算法越 conservative。
3. max_depth [default=6]
  
  (1)、The maximum depth of a tree, same as GBM.
  (2)、Used to control over-fitting as higher depth will allow model to learn relations very specific to a particular sample.
  (3)、Should be tuned using CV.
4. max_leaf_nodes
  
  (1)、The maximum number of terminal nodes or leaves in a tree.
  (2)、Can be defined in place of max_depth. Since binary trees are created, a depth of $n$ would produce a maximum of $2^n$ leaves.
  (3)、If this is defined, GBM will ignore max_depth.
5. gamma [default=0]
  
  (1)、A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split.
  (2)、Makes the algorithm conservative. The values can vary depending on the loss function and should be tuned.
  (3)、值越大，会使算法越 conservative。
6. max_delta_step [default=0]
  
  (1)、In maximum delta step we allow each tree’s weight estimation to be. If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help making the update step more conservative.
  (2)、Usually this parameter is not needed, but it might help in logistic regression when class is extremely imbalanced.
  (3)、This is generally not used but you can explore further if you wish.
7. subsample [default=1]
  
  (1). Same as the subsample of GBM. Denotes the fraction of observations to be randomly samples for each tree.
  (2). Lower values make the algorithm more conservative and prevents overfitting but too small values might lead to under-fitting.
  (3). 用于训练模型的子样本占整个样本集合的比例。
8. colsample_bytree [default=1]
  
  (1). Similar to max_features in GBM. Denotes the fraction of columns to be randomly samples for each tree.
  (2). 在建立树时对特征采样的比例。
  (3)。 （如何设置特征的权重，实现人工干预？？？）
9. colsample_bylevel [default=1]
  
  (1). Denotes the subsample ratio of columns for each split, in each level.
  (2). There is no need to use this often because subsample and colsample_bytree will do the job for you.
10. lambda [default=1]
  
  (1). $L2$ regularization term on weights (analogous to Ridge regression)
  (2). This used to handle the regularization part of XGBoost. Though many data scientists don’t use it often, it should be explored to reduce overfitting.
  (3). lambda 越大，正则惩罚越强。
11. alpha [default=0]
  
  (1). $L1$ regularization term on weight (analogous to Lasso regression)
  (2). Can be used in case of very high dimensionality so that the algorithm runs faster when implemented.
  (3). alpha 越大，正则惩罚越强。
12. scale_pos_weight [default=1]
  
  (1). A value greater than 0 should be used in case of high class imbalance as it helps in faster convergence.
Learning Task Parameters：Guide the optimization performed
These parameters are used to define the optimization objective the metric to be calculated at each step.
1. objective [default=reg:linear]
  
  This defines the loss function to be minimized. Mostly used values are:
  (1)、binary:logistic – logistic regression for binary classification, returns predicted probability (not class)
  (2)、multi:softmax – multiclass classification using the softmax objective, returns predicted class (not probabilities). Also need to set an additional num_class (number of classes) parameter defining the number of unique classes
  (3)、multi:softprob – same as softmax, but returns predicted probability of each data point belonging to each class.
2. eval_metric [ default according to objective ]
  
  (1)、The metric to be used for validation data.
  (2)、The default values are rmse for regression and error for classification.
  (3)、Typical values are：
  
  $rmse$ – root mean square error
  $mae$ – mean absolute error
  $logloss$ – negative log-likelihood
  $error$ – Binary classification error rate (0.5 threshold)
  $merror$ – Multiclass classification error rate
  $mlogloss$ – Multiclass logloss
  $auc$ - Area under the curve
3. seed [default=0]
  
  (1)、The random number seed.
  (2)、Can be used for generating reproducible results and also for parameter tuning.

Gradient Boosting Decison Tree Tuning

0、GBDT 的个人理解

0.1 GBDT 宏观把握

0.2 GBDT 一些重要的点

1、Classification And Regression Trees, CART

1.1 Learning From Tianqi Chen's PPT

1.2 Regression Decision Tree 整体把握

1.3 Regression Decision Tree 数学原理

1.4 Regression Decision Tree 过程总结

2、Machine Learning 关键元素

2.1 模型 & 参数

2.2 目标函数

3、加法模型

4、GBDT 算法解析

4.1 算法逻辑 解析

4.2 梯度提升 解析

4.3 关于 GBDT 的一些很棒的提问（出处）

4.4 GBDT 和 Random Forest 的区别

4.5 GBDT 和 XGBoost 的区别

5、Parameter Tuning Of GBDT & XGBoost

5.1 GBDT's Parameters Tuning

5.2 XGBoost's Parameters Tuning

内容目录

4.1 算法逻辑解析

4.2 梯度提升解析