@lyc102 2017-04-10T00:49:40.000000Z

Machine Learning (Zhou Zhihua), Chapter 2: Model Evaluation and Selection

machine_learning

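Empirical Error and Overfitting

error rate: $E=a/m$, where $a$ of the $m$ samples are misclassified
accuracy: $1-a/m$

The difference between a learner's predicted output and the true output is called the error.

training error or empirical error: the error on the training set
generalization error: the error on new samples

We want a learner with small generalization error, but all we can actually do is try to minimize the empirical error. To perform well on new samples, the learner should extract from the training samples the "general rules" that hold for all potential samples.

overfitting: peculiarities of the training samples are mistaken for general properties
underfitting: the general properties have not been learned

Overfitting is a key obstacle in machine learning. It cannot be avoided completely; it can only be mitigated. The deeper reason: learning problems are typically NP-hard or harder, while a practical learning algorithm must run in polynomial time. If overfitting could be avoided completely, minimizing the empirical error would yield the optimal solution, which would constructively prove P = NP. As long as we believe P $\neq$ NP, overfitting is unavoidable.

The model selection problem. The ideal approach is to pick the model with the smallest generalization error, but the generalization error cannot be obtained directly, and minimizing the training error alone leads to overfitting.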
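To make the gap between training error and generalization error concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the 1-D data generator with 20% label noise, the memorizing 1-nearest-neighbour learner, and the fixed threshold rule standing in for a simpler model. The memorizer reaches zero training error, yet its error on fresh samples is far from zero.

```python
import random

def make_data(n, rng, noise=0.2):
    """Draw n points x in [0, 1); the clean label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

def one_nn(train):
    """Memorizing learner: predict the label of the nearest training point."""
    def predict(x):
        return min(train, key=lambda p: abs(p[0] - x))[1]
    return predict

def threshold_rule(x):
    """Fixed rule that captures only the general pattern (hypothetical simple model)."""
    return int(x > 0.5)

def error_rate(predict, data):
    """Fraction of samples in `data` that `predict` gets wrong."""
    wrong = sum(1 for x, y in data if predict(x) != y)
    return wrong / len(data)

rng = random.Random(0)
train = make_data(200, rng)
test = make_data(1000, rng)   # fresh samples approximate the generalization error

memorizer = one_nn(train)
train_err_1nn = error_rate(memorizer, train)
test_err_1nn = error_rate(memorizer, test)
test_err_rule = error_rate(threshold_rule, test)

print("1-NN train error:", train_err_1nn)
print("1-NN test error: ", test_err_1nn)
print("rule test error: ", test_err_rule)
```

The memorizer fits the label noise in the training set, which is exactly the "peculiarities mistaken for general properties" failure described above.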

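Evaluation Methods

Use the "testing error", i.e. the model's error on a test set, as an approximation of the generalization error. The test set should be as disjoint from the training set as possible.

Process the data set $D$ appropriately to produce a training set $S$ and a test set $T$.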

Hold-out

$D = S\cup T, \; S\cap T = \emptyset$ .

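The split into training and test sets should preserve the data distribution as far as possible, so that the split itself does not introduce extra bias. Sampling that preserves the class proportions is called stratified sampling.

A single hold-out split gives an estimate that is not stable or reliable enough. The usual practice is to repeat the experiment over several random splits and average the results.

If $S$ is too large, $T$ is small and the test error is an unreliable estimate.
If $T$ is too large, $S$ is small and the trained model differs too much from the one trained on $D$.

Common practice: use roughly $2/3 \sim 4/5$ of the samples for training and the rest for testing.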
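A stratified hold-out split can be sketched as follows. The function name `stratified_holdout`, its parameters, and the toy label list are hypothetical, chosen only for illustration; the idea is simply to shuffle and cut each class separately so that $S$ and $T$ keep the class proportions of $D$.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx) with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                        # random split within each class
        cut = round(train_fraction * len(idx))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)

# Toy data set: 70 samples of class 0 and 30 of class 1.
labels = [0] * 70 + [1] * 30
S, T = stratified_holdout(labels, train_fraction=0.7)

positives_in_S = sum(labels[i] for i in S)
positives_in_T = sum(labels[i] for i in T)
print(len(S), len(T), positives_in_S, positives_in_T)
```

Calling the function with several different seeds and averaging the resulting test errors gives the repeated hold-out estimate mentioned above.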

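Cross-validation

Partition $D$ into $k$ subsets of similar size, $D = D_1\cup D_2 \cup \ldots \cup D_k, \; D_i\cap D_j = \emptyset \, (i\neq j)$. Each subset in turn serves as the test set while the union of the other $k-1$ subsets serves as the training set, which gives $k$ rounds of training and testing. Because the result depends on the partition, the random partition is itself repeated $p$ times, and the final estimate is the mean of these $p$ runs of $k$-fold cross-validation.

Special case: $k=m$, where $m$ is the number of samples in $D$, is called leave-one-out (LOO). When $m$ is large the computational cost is prohibitive.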
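The partitioning step might look like the sketch below. The helper names `k_fold_splits` and `repeated_cv`, and the `evaluate` callback, are assumptions for illustration; `evaluate` stands for whatever trains a model on the training indices and returns its error on the test indices.

```python
import random

def k_fold_splits(m, k, seed=0):
    """Yield (train_idx, test_idx) for each of k folds over indices 0..m-1."""
    indices = list(range(m))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # fold sizes differ by at most 1
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def repeated_cv(m, k, p, evaluate):
    """Average evaluate(train_idx, test_idx) over p random k-fold partitions."""
    scores = [evaluate(tr, te)
              for rep in range(p)               # a different random partition per repetition
              for tr, te in k_fold_splits(m, k, seed=rep)]
    return sum(scores) / len(scores)

splits = list(k_fold_splits(m=10, k=3))
loo = list(k_fold_splits(m=5, k=5))             # k = m is leave-one-out

# Dummy evaluate that just returns the test-fold size, to show the call shape.
mean_fold_size = repeated_cv(m=10, k=3, p=2, evaluate=lambda tr, te: len(te))
```

With $k=m$ every test fold holds a single sample, which is why LOO needs $m$ training runs.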

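Bootstrapping

Bootstrap sampling: draw $m$ samples from $D$ with replacement to form the sampled set $D'$.

Some samples appear in $D'$ more than once and some never appear. The probability that a given sample is never picked in $m$ draws is $\left (1-\frac{1}{m}\right )^m$, and

$\lim_{m\to \infty} \left (1-\frac{1}{m}\right )^m = \frac{1}{e} \approx 0.368.$

So slightly more than $1/3$ of the samples do not appear in the sampled set. Use $D'$ as the training set and the remaining samples $D\setminus D'$ as the test set.

Bootstrapping is useful when the data set is small and hard to split effectively. However, it changes the distribution of the original data set, which introduces estimation bias.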
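A quick simulation illustrates the $1/e$ figure. This is only a sketch; the function name `bootstrap_split` and the choice $m = 10000$ are assumptions for illustration.

```python
import math
import random

def bootstrap_split(m, rng):
    """Sample m indices with replacement; the never-sampled indices form the test set."""
    train_idx = [rng.randrange(m) for _ in range(m)]
    chosen = set(train_idx)
    test_idx = [i for i in range(m) if i not in chosen]
    return train_idx, test_idx

rng = random.Random(0)
m = 10000
train_idx, test_idx = bootstrap_split(m, rng)

oob_fraction = len(test_idx) / m        # fraction of samples never drawn
exact = (1 - 1 / m) ** m                # probability for finite m
print(oob_fraction, exact, 1 / math.e)
```

The training set still has $m$ entries, the same size as $D$, which is the practical attraction of the method on small data sets.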

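Parameter Tuning and the Final Model

parameter tuning

The parameter space is too large, so tuning is a lot of work. In many applications, how well the parameters are tuned has a decisive effect on the performance of the final model.

If the error function is smooth in the parameters, an optimization algorithm can be used to search for the best parameters.

The data set used to evaluate candidates during model evaluation and selection is called the validation set. Unlike the test set, it is part of the training data.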
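As a rough sketch of tuning on a validation set, the example below picks $k$ for a 1-D $k$-nearest-neighbour classifier. The data generator, the candidate list, and the 200/100 split are all assumptions made for illustration. The final step, retraining on all training data and touching the test set only once, is a common convention and is not spelled out in the notes above.

```python
import random

def make_data(n, rng, noise=0.2):
    """Draw n points x in [0, 1); the clean label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

def knn_predict(train, x, k):
    """Majority vote among the k nearest training points (k odd, so no ties)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(y for _, y in nearest)
    return int(2 * votes > k)

def error_rate(train, data, k):
    """Error of the k-NN model built on `train`, measured on `data`."""
    return sum(knn_predict(train, x, k) != y for x, y in data) / len(data)

rng = random.Random(1)
data = make_data(300, rng)                    # training data
test = make_data(300, rng)                    # test set, never used for tuning
fit, validation = data[:200], data[200:]      # validation set is carved out of the training data

candidates = [1, 3, 5, 9, 15, 25]
val_errors = {k: error_rate(fit, validation, k) for k in candidates}
best_k = min(candidates, key=val_errors.get)

# Retrain on all training data with the chosen k and report the test error once.
test_error = error_rate(data, test, best_k)
print(best_k, val_errors[best_k], test_error)
```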

Performance Measures

The most common measure for regression tasks is the mean squared error:

$E(f; D) = \frac{1}{m}\sum_{i=1}^m (f(\boldsymbol x_i)-y_i)^2.$

More generally, for a data distribution $\mathcal D$ with p.d.f. $p(\cdot)$,

$E(f; \mathcal D) = \int_{\boldsymbol x\sim \mathcal D} (f(\boldsymbol x)-y)^2p(\boldsymbol x) {\rm d}\boldsymbol x.$

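Error Rate and Accuracy

For a data distribution $\mathcal D$ with p.d.f. $p(\cdot)$, the error rate is

$E(f; \mathcal D) = \int \chi(f(\boldsymbol x)\neq y)p(\boldsymbol x) {\rm d}\boldsymbol x,$

where $\chi(\cdot)$ is the indicator function. The accuracy is

${\rm acc}(f; \mathcal D) = 1 - E(f; \mathcal D).$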
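On a finite sample the integrals become averages. The following is a minimal sketch of the three measures; the function names and the toy inputs are illustrative only.

```python
def mean_squared_error(predictions, targets):
    """Empirical MSE: average of squared differences."""
    m = len(targets)
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / m

def error_rate(predictions, labels):
    """Fraction of samples where the prediction differs from the label."""
    m = len(labels)
    return sum(p != y for p, y in zip(predictions, labels)) / m

def accuracy(predictions, labels):
    return 1.0 - error_rate(predictions, labels)

mse = mean_squared_error([2.5, 0.0, 2.0], [3.0, 0.0, 1.0])   # (0.25 + 0 + 1) / 3
err = error_rate([1, 0, 1, 1], [1, 1, 1, 0])                  # 2 of 4 wrong
acc = accuracy([1, 0, 1, 1], [1, 1, 1, 0])
print(mse, err, acc)
```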

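Precision, Recall, and F1

Taking information retrieval as an example:

precision: the fraction of the retrieved items that the user is actually interested in
recall: the fraction of the items the user is interested in that were retrieved

Precision and recall are conflicting measures: pushing one up usually pulls the other down. The trade-off is shown by the P-R curve. If one learner's curve is completely enclosed by another's, the latter is better; if the curves cross, it is hard to say in general. A more reasonable criterion is the area under the P-R curve: the larger the better.

The break-even point (BEP) is the point where precision equals recall. Learners can be compared by the value of their BEP.

F1 is the harmonic mean of precision $P$ and recall $R$:

$\frac{1}{F1} = \frac{1}{2}\left (\frac{1}{P} + \frac{1}{R}\right ).$

The general form of the F1 measure is $F_{\beta}$, a weighted harmonic mean:

$\frac{1}{F_{\beta}} = \frac{1}{1+\beta^2}\left (\frac{1}{P} + \frac{\beta^2}{R}\right ).$

Here $\beta>1$ gives recall more weight and $\beta<1$ gives precision more weight; $\beta=1$ recovers F1.

With results from several training/test runs, one can either compute precision and recall for each run and then average them (macro-averaging), or first average the confusion-matrix counts over the runs and then compute precision and recall (micro-averaging).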
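In terms of confusion-matrix counts, $P = TP/(TP+FP)$ and $R = TP/(TP+FN)$. The sketch below computes $F_\beta$ and contrasts the two averaging orders; the function names and the two hypothetical runs `(TP, FP, FN)` are made up for illustration.

```python
def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN); 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean: 1/F = (1/P + beta^2/R) / (1 + beta^2)."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def macro_micro(counts):
    """counts: list of (TP, FP, FN), one tuple per train/test run."""
    prs = [precision_recall(*c) for c in counts]
    macro_p = sum(p for p, _ in prs) / len(prs)
    macro_r = sum(r for _, r in prs) / len(prs)
    # Summing the counts gives the same P and R as averaging them (the 1/n cancels).
    totals = [sum(col) for col in zip(*counts)]
    micro_p, micro_r = precision_recall(*totals)
    return (macro_p, macro_r), (micro_p, micro_r)

runs = [(8, 2, 2), (1, 0, 9)]          # hypothetical (TP, FP, FN) from two runs
(macro_p, macro_r), (micro_p, micro_r) = macro_micro(runs)

f1 = f_beta(0.5, 0.5)                  # harmonic mean of equal values
f2 = f_beta(1.0, 0.5, beta=2.0)        # beta > 1 weights recall more heavily
print(macro_p, micro_p, f1, f2)
```

The second run has perfect precision on a single retrieved item, which pulls the macro-averaged precision well above the micro-averaged one.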

ROC and AUC

Cost-sensitive Error Rate and Cost Curves

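Comparison Tests

Comparing performance in machine learning is more complicated than one might expect, for three reasons:
1. We want to compare generalization performance, but what we measure is performance on the test set, and the two differ;
2. Performance on the test set depends on how the test set was chosen;
3. Many learning algorithms are themselves randomized.

Statistical hypothesis tests address this. If learner A does better than B on the test set, a test tells us whether A's generalization performance is statistically better than B's, and how confident we can be in that conclusion.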

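Hypothesis Testing

Hypothesis: $\epsilon \leq \epsilon_0$

$\epsilon$ : the generalization error, which we want to control but do not know.
$\epsilon_0$ : the hypothesized upper bound of $\epsilon$
$\hat\epsilon$ : the test error

Turn the deterministic inequality $\epsilon \leq \epsilon_0$ into a statement that holds with a certain probability, e.g. $P\{\epsilon \leq \epsilon_0\}\geq 1- \alpha$ or $P\{\epsilon > \epsilon_0\}\leq \alpha$ .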

Consider a classification problem with $m$ samples, so that under the hypothesis the number of misclassified samples is bounded above by $m\epsilon_0$. Let $X$ be the random variable representing the number of misclassified examples. Then $X\sim {\rm Bino}(m, \epsilon)$. Given a small number $\alpha\in (0,1)$, we consider the inequality

$P\{X(\epsilon) \leq m \epsilon_0\}\geq 1 - \alpha,$

or equivalently the tail bound

$P\{X(\epsilon) > m \epsilon_0\}\leq \alpha.$

Here $m$ is fixed, and $\epsilon_0\in(0,1)$ and $\alpha$ are given. Only $\epsilon$ is a variable. We write $X(\epsilon)$ to emphasize the dependence on $\epsilon$. When $\epsilon\ll 1$, e.g. $\epsilon=0$, the inequality obviously holds. As we increase $\epsilon$, the tail (beyond the fixed point $m \epsilon_0$) goes to $1$ and violates the inequality. Mathematically, $P\{X(0) > m \epsilon_0\} = 0$, $P\{X(1) > m \epsilon_0\} = 1$, and $p(\epsilon) = P\{X(\epsilon) > m \epsilon_0\}$ is a continuous increasing function of $\epsilon$. Thus, given $\alpha$, there is a maximum $\bar \epsilon$, namely $\bar \epsilon=p^{-1}(\alpha)$, such that $p(\bar \epsilon)\leq \alpha.$

draw a picture here.

Then we compare the test error $\hat \epsilon$ with $\bar \epsilon$ :

if $\hat \epsilon\leq \bar \epsilon$ , we can conclude that:
with confidence $1-\alpha$ , $\epsilon \leq \epsilon_0$ .
if $\hat \epsilon > \bar \epsilon$ , we can conclude that:
at significance level $\alpha$ , $\epsilon > \epsilon_0$ .
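The critical value $\bar\epsilon$ has no closed form, but because the binomial tail is increasing in $\epsilon$ it can be found by bisection. The sketch below follows the definition used in these notes, with $p(\epsilon)$ taken as the probability of more than $m\epsilon_0$ errors; the function names and the numbers $m=100$, $m\epsilon_0 = 30$, $\alpha = 0.05$ are assumptions for illustration. The allowed error count $m\epsilon_0$ is passed as an integer `k0` to avoid floating-point rounding.

```python
import math

def tail_prob(eps, m, k0):
    """p(eps) = P{X > k0} for X ~ Binomial(m, eps), where k0 = m * eps0."""
    return sum(math.comb(m, i) * eps ** i * (1 - eps) ** (m - i)
               for i in range(k0 + 1, m + 1))

def critical_error(m, k0, alpha, tol=1e-10):
    """Largest eps with tail_prob(eps) <= alpha, by bisection (tail_prob increases with eps)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if tail_prob(mid, m, k0) <= alpha:
            lo = mid          # inequality still holds, move right
        else:
            hi = mid          # inequality violated, move left
    return lo

m, k0, alpha = 100, 30, 0.05          # eps0 = k0 / m = 0.3
eps_bar = critical_error(m, k0, alpha)

test_error = 0.20                      # hypothetical observed test error
accept = test_error <= eps_bar         # compare the test error with eps_bar
print(eps_bar, accept)
```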

Student t-distribution and t-test
From Wikipedia. Let $X_1, \ldots, X_n$ be i.i.d. random variables $\sim N(\mu,\sigma^2)$ with mean $\mu$ and variance $\sigma^2$ . Let

$\bar X = \frac{1}{n}\sum_{i=1}^n X_i$

be the sample mean and let

$S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2$

be the sample variance. Then

$\sqrt{n} (\bar X - \mu)/\sigma \sim N(0,1),$

and replacing the true standard deviation $\sigma$ by the sample standard deviation $S$ gives

$\sqrt{n} (\bar X - \mu)/S \sim t(n-1),$

the t-distribution with $n-1$ degrees of freedom. In a test, the true mean $\mu$ is further replaced by its hypothesized value, so the statistic can be computed without knowing the true mean or the true variance.

The t-distribution is symmetric and bell-shaped, like the normal distribution, but has heavier tails, meaning that it is more prone to producing values that fall far from its mean. Student's t-distribution with zero mean and $\nu$ degrees of freedom has the probability density function

$f(x) = C(\nu) \left (1+ \frac{x^2}{\nu} \right )^{-\frac{\nu+1}{2}},$

where $C(\nu)$ is a normalizing constant. As $|x|\to \infty$ , it decays to zero at a polynomial rate, slower than the exponential decay of the normal density. So it has heavier tails.

figure here

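Application to cross-validation. Suppose we have $k$ test error rates $\hat \epsilon_1, \hat \epsilon_2, \ldots, \hat \epsilon_k$ . We can form the random variable

$\tau_t = \frac{\sqrt{k}(\mu -\epsilon_0)}{\sigma},$

where $\mu$ is the sample mean and $\sigma$ is the sample standard deviation of the $\hat \epsilon_i$ . Under the hypothesis that the generalization error is $\epsilon_0$ , $\tau_t$ follows the t-distribution with $k-1$ degrees of freedom.

Given an $\alpha$ , we find an interval $[t_{-\alpha/2},t_{\alpha/2}]$ , called the confidence interval, such that

$P\{\tau_t \in [t_{-\alpha/2},t_{\alpha/2}]\}\geq 1 - \alpha.$

If $\tau_t$ , the scaled difference between the sample mean $\mu$ and $\epsilon_0$ , falls in this interval, we can accept with confidence $1-\alpha$ that the generalization error rate is $\epsilon_0$ .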
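A small numeric sketch of this test follows. The five error rates are hypothetical, and the critical value 2.776 is the two-sided 5% point of the t-distribution with 4 degrees of freedom taken from a standard t table; a statistics library could supply it for other $k$ and $\alpha$.

```python
import math
from statistics import mean, stdev

def t_statistic(errors, eps0):
    """tau_t = sqrt(k) * (mean - eps0) / s, with s the sample standard deviation (k - 1 denominator)."""
    k = len(errors)
    return math.sqrt(k) * (mean(errors) - eps0) / stdev(errors)

errors = [0.12, 0.15, 0.11, 0.14, 0.13]   # hypothetical test error rates from k = 5 runs
T_CRIT = 2.776                             # t_{alpha/2} for alpha = 0.05 and k - 1 = 4 (table value)

tau_a = t_statistic(errors, eps0=0.12)
tau_b = t_statistic(errors, eps0=0.10)

accept_a = abs(tau_a) <= T_CRIT            # inside the interval: eps0 = 0.12 is not rejected
accept_b = abs(tau_b) <= T_CRIT            # outside the interval: eps0 = 0.10 is rejected
print(tau_a, accept_a, tau_b, accept_b)
```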

Cross-validated t-test

McNemar Test

Friedman Test and Nemenyi Post-hoc Test

Bias and Variance

Bias-variance decomposition

$E(f;D) = {\rm bias}^2 + {\rm var} + \varepsilon^2.$

generalization error $E(f;D) = \mathbb E_D[(f(\boldsymbol x; D) - y_D)^2]$
bias ${\rm bias}^2 = (\bar f - y)^2$
variance ${\rm var}(\boldsymbol x) = \mathbb E_D[(f(\boldsymbol x; D) - \bar f)^2]$
expected prediction $\bar f = \mathbb E_D[f(\boldsymbol x; D)]$
true label $y$
label in the data set $y_D$
noise $y_D - y$ , with $\varepsilon^2 = \mathbb E_D[(y_D - y)^2]$ ; assume $\mathbb E_D[y - y_D] = 0$ .

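For a fixed test point $\boldsymbol x$ , the training set $D$ is random, so $f(\boldsymbol x; D)$ and $y_D$ are random variables, while $\bar f$ and $y$ are numbers.

The bias-variance decomposition can be proved easily using the orthogonality relations $(y-y_D, \cdot)_D = 0$ and $(f-\bar f, \cdot)_D = 0$ , where $(u, v)_D = \mathbb E_D[uv]$ : both $y-y_D$ and $f-\bar f$ have zero mean, so they are orthogonal to the numbers $\bar f$ and $y$ , and the noise is assumed independent of $f$ .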
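For reference, the proof can be written out in a few lines. This is a sketch that assumes, in addition to the zero-mean noise, that the noise $y - y_D$ is independent of $f(\boldsymbol x; D)$; $f$ abbreviates $f(\boldsymbol x; D)$.

```latex
\begin{align*}
E(f;D) &= \mathbb E_D\big[(f - y_D)^2\big]
        = \mathbb E_D\big[\big((f - \bar f) + (\bar f - y) + (y - y_D)\big)^2\big] \\
       &= \mathbb E_D\big[(f - \bar f)^2\big] + (\bar f - y)^2 + \mathbb E_D\big[(y - y_D)^2\big] \\
       &\quad + 2(\bar f - y)\,\mathbb E_D[f - \bar f]
              + 2(\bar f - y)\,\mathbb E_D[y - y_D]
              + 2\,\mathbb E_D\big[(f - \bar f)(y - y_D)\big] \\
       &= {\rm var} + {\rm bias}^2 + \varepsilon^2 .
\end{align*}
```

The first cross term vanishes because $\bar f = \mathbb E_D[f]$, the second because the noise has zero mean, and the third because the noise is independent of $f$ and has zero mean.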

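bias: characterizes the fitting ability of the learning algorithm itself
variance: measures the change in learning performance caused by changes in a training set of the same size, i.e. the effect of data perturbation
noise: characterizes the difficulty of the learning problem itself; it is the lower bound on the generalization error that any learning algorithm can reach

bias-variance dilemma. Bias and variance are in conflict. Early in training bias dominates the generalization error; later variance dominates.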
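The decomposition can also be checked numerically. The Monte Carlo sketch below uses a deliberately biased toy learner at a single fixed test point: it predicts a shrunken average of $n$ noisy training labels. The learner, the constants, and the variable names are assumptions chosen so that the three terms are easy to see; they are not taken from the book.

```python
import random

def avg(values):
    """Arithmetic mean of a sequence."""
    values = list(values)
    return sum(values) / len(values)

rng = random.Random(0)
y_true, sigma, n, shrink, trials = 2.0, 1.0, 5, 0.5, 20000

preds, noisy_labels = [], []
for _ in range(trials):
    train_labels = [y_true + rng.gauss(0, sigma) for _ in range(n)]   # one random training set D
    preds.append(shrink * avg(train_labels))                          # f(x; D), biased toward 0
    noisy_labels.append(y_true + rng.gauss(0, sigma))                 # y_D, noise independent of f

f_bar = avg(preds)                                                    # estimate of the expected prediction
bias2 = (f_bar - y_true) ** 2
var = avg((f - f_bar) ** 2 for f in preds)
noise = avg((yd - y_true) ** 2 for yd in noisy_labels)
total = avg((f - yd) ** 2 for f, yd in zip(preds, noisy_labels))      # estimate of E(f; D)

print(total, bias2 + var + noise)
```

With these constants the theoretical values are ${\rm bias}^2 = 1$, ${\rm var} = 0.05$ and $\varepsilon^2 = 1$; the two printed numbers agree only up to sampling error.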