@Emptyset 2015-07-07T02:04:53.000000Z 字数 3885 阅读 1675

Machine Learning Notes - A Mathematical Approach 1 : A Review of Probability

机器学习

[0] "What" and "Why"

早在上初中的时候就在R.Penrose的《皇帝新脑》里看到过历史上关于强AI与弱AI的哲学辩论（希尔勒的中文屋子），也曾被人工智能带来的憧憬深深吸引。转眼十年过去了，自认为也学了一些数学，可因为种种原因一直没能认真对待它在人工智能的应用（或许有那么一段时间受到哈代的“数学无用论”的影响，至今依旧如同强迫症地坚信美好的东西一定是“无用的”）。不止一次听人跟我说：你是学数学的，去看一下机器学习吧，应该对你来说不会太难。然而当我认真翻看机器学习教材时却开始被其中大量的统计学概念折磨——这才意识到统计学并不是数学的一个分支，而是基于概率论等数学理论的一个独立的应用学科，两者的语言模式似乎还是有差异的；而机器学习则是把统计学中的许多内容当作基础理论，基于统计学来进行算法研究的一门学科。惭愧的是，由于并不是学的应用数学，我在本科数学学习阶段并没有接触到统计学，只是学了一些概率论，遂决定从一个数学的角度来整理一份机器学习的笔记，并取名为：Machine Learning Notes - A Mathematical Approach。

All of Statistics: A Concise Course in Statistical Interface一书的前言里提到，在统计学与机器学习中，有些用词定义是不同的，罗列如下（单词之间用"|"分隔，前者是统计学中用词，后者是CS中的）——

estimation | learning : using data to estimate an unknown quantity

classification | supervised learning: predicting a discrete $Y$ from $X$

clustering | unsupervised learning: putting data into groups

data | training sample: $(X_1,Y_1),...,(X_n,Y_n)$

covariates | features: the $X_i$ 's

classifier | hypothesis: a map from covariates to outcomes

hypothesis | - : a subset of a parameter space $\Theta$

confidence interval | - : interval that contains an unknown quantity with given frequency

directed acyclic graph | Bayes net: multivariate distribution with given conditional independence relations

Bayesian interface | Bayesian interface: statistical methods for using data to update beliefs

frequentist interface | - : statistical methods with guaranteed frequency behavior

large deviation bounds | PAC learning: uniform bounds on probability of errors

从motivation看概率论与统计学居然是两个完全相反的过程——
The basic problem that we study in probability is: Given a data generating process, what are the properties of the outcomes?
The basic problem of statistical inference is the inverse of probability: Given the outcomes, what can we say about the process that generated the data?

1 A Review of Probability

……可无论我走了多远，每每回望时它都会告诉我，less is more，这个世界就是这样，任何复杂的解释都只是源于人对宇宙心存欲望——如同薛定谔方程的哥本哈根诠释与多重宇宙诠释，如同概率的频率极限诠释与贝叶斯诠释——没有观测者看到过彗星撞地球，这样的实验也从未进行或重复过。所以概率本应只是一个测度，多余的全是信仰罢了。
……大数定律则教育人类说：去尝试吧，它或许不是真相，但你越努力，你偏离真相的可能性就越低。
——题记

为了给Probably Approximately Correct(PAC) learning model提供一个靠谱的数学介绍，这一部分笔记主要参考了All of Statistics: A Concise Course in Statistical Inference一书的1到4章，以证明完Hoeffding's Inequality为止。同时也参考了Foundations of Machine Learning的Appendix C,D，以及Understanding Machine Learning的Appendix B,这两本书的附录都给出了这几个重要不等式的数学证明。

Definition 1.1 **: A function $\mathbb{P}$ that assigns a real number $\mathbb{P}(A)$ to each event $A$ is a **probability distribution or a probability measure if it satisfies the following three axioms:
Axiom 1: $\mathbb{P}(A)\ge 0$ for every $A$
Axiom 2: $P(\Omega)=1$
Axiom 3: If $A_1,A_2,...$ are disjoint then

$P (⋃ i = 1 \infty A i) = \sum i = 1 \infty P (A i)$ $\mathbb{P}\left(\bigcup\limits_{i=1}^\infty{A_i}\right)=\sum\limits_{i=1}^\infty\mathbb{P}(A_i)$

【Note: 概率论札记 - 1 - 一个无法成为事件的状态空间子集】

对于上面这个定义，可以有许多种不同的统计学阐释。两种主流的解释是frequencies和degrees of beliefs。前者认为 $\mathbb{P}(A)$ 是随着实验次数增加以后A发生的频率最终会趋向的一个数值极限，他们被称为"Frequentist"；后者则认为 $\mathbb{P}(A)$ 是衡量观测者相信事件 $A$ 会发生的程度，这个派别被人称为"Bayesian"（贝叶斯派）。显然贝叶斯派对概率的阐释更加广泛一些，基于贝叶斯诠释，我就可以说“在接下去100年内，地球遭到彗星撞击的可概率是x%”这样的话。然而用传统的“频率极限”的解释则不行，毕竟我们从来没有做过彗星撞地球的实验（类似于“尝试撞了十万次成功撞上了两次”）。

利用公理与 $\mathbb{P}$ 的定义，我们可以轻易推导出许多性质（证明略）——

$\mathbb{P}(\emptyset) = 0$

$A\subset B \Rightarrow \mathbb{P}(A)\le \mathbb{P}(B)$

$0\le\mathbb{P}(A)\le 1$

$\mathbb{P}(A^c)= 1-\mathbb{P}(A)$

$A\cap B =\emptyset \Rightarrow \mathbb{P}(A\cup B)=\mathbb{P}(A)+\mathbb{P}(B)$

Lemma 1.2 For any events $A$ and $B$ ,

P (A \cup B) = P (A) + P (B) - P (A B)

$\mathbb{P}(A\cup B) = \mathbb{P}(A)+\mathbb{P}(B)-\mathbb{P}(AB)$

**Proof (draft): ** $A\cup B=(AB^c)\cup(AB)\cup(A^cB)$ , making repeated use of the fact that $\mathbb{P}$ is additive for disjoint events.

Theorem 1.3 (Continuity of Probabilities). If $A_n\rightarrow A$ then

P (A n) \to P (A)

$\mathbb{P}(A_n)\rightarrow P(A)$
as

n→∞ $n\rightarrow\infty$

Note: 这里All of Statistics一书给的证明严格意义上是错误的，它先构造了一个monotone的 $A_n$ （显然是loss of generality），构造了一个 $B_n$ 是disjoint的，然后引用了Axiom 3，类似于Lemma 1.2的证法。（看到这样不负责任的proof表示不开心，就不写下来了）事实上引入 $\sigma-algebra$ 以及probability measure的定义后，这个定理可以被轻松证明。请参考 Notes on Probability Essentials - Axioms of Probability 中Theorem 4的证明。

Machine Learning Notes - A Mathematical Approach 1 : A Review of Probability

[0] "What" and "Why"

1 A Review of Probability

内容目录