@wjcper2008
2017-10-10
Transfer learning
It might help to give slightly more of an overview of MMD.
In general, MMD is defined by the idea of representing distances between distributions as distances between mean embeddings of features.
That is, say we have distributions $P$ and $Q$ over a set $\mathcal{X}$. The MMD is defined by a feature map $\varphi : \mathcal{X} \to \mathcal{H}$, where $\mathcal{H}$ is what's called a reproducing kernel Hilbert space. In general, the MMD is

$$\mathrm{MMD}(P, Q) = \left\lVert \mathbb{E}_{X \sim P}[\varphi(X)] - \mathbb{E}_{Y \sim Q}[\varphi(Y)] \right\rVert_{\mathcal{H}}.$$
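To make the definition concrete, here is a minimal NumPy sketch (my own illustration, not part of the original post): given finite samples from $P$ and $Q$ and an explicit feature map, the empirical MMD is just the distance between the two sample means in feature space. The function name `mmd_explicit` is hypothetical.

```python
import numpy as np

def mmd_explicit(X, Y, phi):
    """Empirical MMD for an explicit feature map phi.

    X: (n, d) array of samples from P;  Y: (m, d) array of samples from Q.
    phi maps a (k, d) array of points to a (k, q) array of features.
    Returns the norm of the difference between the feature means.
    """
    return np.linalg.norm(phi(X).mean(axis=0) - phi(Y).mean(axis=0))
```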
As one example (the linear case), we might have $\mathcal{X} = \mathcal{H} = \mathbb{R}^d$ and $\varphi(x) = x$. In that case:

$$\mathrm{MMD}(P, Q) = \left\lVert \mathbb{E}_{X \sim P}[X] - \mathbb{E}_{Y \sim Q}[Y] \right\rVert_{\mathbb{R}^d} = \lVert \mu_P - \mu_Q \rVert_{\mathbb{R}^d},$$

so the MMD is just the distance between the means of the two distributions: matching distributions in this sense matches their means, though they may still differ in variance or in other ways.
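Continuing the sketch above with the identity feature map (the sample sizes and Gaussian distributions here are chosen arbitrarily for illustration), the estimate matches the distance between the sample means:

```python
rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(10_000, 2))     # samples from P, mean (0, 0)
Y = rng.normal(loc=1.0, scale=1.0, size=(10_000, 2))     # samples from Q, mean (1, 1)

identity = lambda Z: Z                                   # phi(x) = x
print(mmd_explicit(X, Y, identity))                      # ~ ||mu_P - mu_Q|| = sqrt(2)
print(np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0)))   # the same number
```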
In the projection case, we have $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{H} = \mathbb{R}^p$, with $\varphi(x) = A^\top x$, where $A$ is a $d \times p$ matrix. So we have

$$\mathrm{MMD}(P, Q) = \left\lVert \mathbb{E}[A^\top X] - \mathbb{E}[A^\top Y] \right\rVert = \left\lVert A^\top \mathbb{E}[X] - A^\top \mathbb{E}[Y] \right\rVert = \left\lVert A^\top (\mu_P - \mu_Q) \right\rVert_{\mathbb{R}^p},$$

the distance between two projections of the means.
Dimension reduction loses information: if $p < d$, or the mapping $x \mapsto A^\top x$ otherwise isn't invertible, then this MMD is weaker than the previous one, in that it fails to distinguish some distributions that the previous one does, as the sketch below shows.
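Continuing the same sketch, a hypothetical projection matrix $A$ illustrates both the formula above and the loss of information: a projection orthogonal to $\mu_P - \mu_Q$ reports an MMD of zero even though the distributions differ.

```python
A = np.array([[1.0], [0.0]])                 # d = 2, p = 1: keeps only the first coordinate
print(mmd_explicit(X, Y, lambda Z: Z @ A))   # ~ |mu_P[0] - mu_Q[0]| = 1

# A direction orthogonal to mu_P - mu_Q = (-1, -1) sees no difference at all:
B = np.array([[1.0], [-1.0]]) / np.sqrt(2)
print(mmd_explicit(X, Y, lambda Z: Z @ B))   # ~ 0, although P != Q
```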
You can also construct stronger distances. For example, if $\mathcal{X} = \mathbb{R}$ and you use $\varphi(x) = (x, x^2)$, then the MMD becomes

$$\sqrt{\left( \mathbb{E}[X] - \mathbb{E}[Y] \right)^2 + \left( \mathbb{E}[X^2] - \mathbb{E}[Y^2] \right)^2},$$

and can distinguish not only distributions with different means but also distributions with different variances.
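A quick numerical check of this claim, continuing the sketch (the distributions are again arbitrary): two distributions with equal means but different variances are invisible to $\varphi(x) = x$ but not to $\varphi(x) = (x, x^2)$.

```python
X1 = rng.normal(0.0, 1.0, size=(100_000, 1))   # P: mean 0, variance 1
Y1 = rng.normal(0.0, 2.0, size=(100_000, 1))   # Q: mean 0, variance 4

quad = lambda Z: np.hstack([Z, Z ** 2])        # phi(x) = (x, x^2)
print(mmd_explicit(X1, Y1, identity))          # ~ 0: the means match
print(mmd_explicit(X1, Y1, quad))              # ~ |E[X^2] - E[Y^2]| = 3
```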
And you can get much stronger than that: if $\varphi$ maps to a general reproducing kernel Hilbert space, then you can apply the kernel trick to compute the MMD, and it turns out that many kernels, including the Gaussian kernel, lead to an MMD that is zero if and only if the distributions are identical.
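The kernel trick works because, with $k(x, y) = \langle \varphi(x), \varphi(y) \rangle_{\mathcal{H}}$, the squared MMD expands into kernel evaluations:

$$\mathrm{MMD}^2(P, Q) = \mathbb{E}[k(X, X')] - 2\,\mathbb{E}[k(X, Y)] + \mathbb{E}[k(Y, Y')].$$

Below is a sketch of the (biased) empirical estimator with a Gaussian kernel, reusing the samples from above; the function name and the bandwidth $\sigma = 1$ are arbitrary illustrative choices.

```python
def mmd2_rbf_biased(X, Y, sigma=1.0):
    """Biased empirical MMD^2 with a Gaussian (RBF) kernel, via the kernel trick.

    Averages k over all pairs, including i == j (hence 'biased').
    """
    def k(U, V):
        # Pairwise squared distances, then the Gaussian kernel.
        sq_dists = ((U[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dists / (2 * sigma ** 2))
    return k(X, X).mean() - 2 * k(X, Y).mean() + k(Y, Y).mean()

print(mmd2_rbf_biased(X1[:2000], Y1[:2000]))       # > 0: P != Q (different variances)
print(mmd2_rbf_biased(X1[:2000], X1[2000:4000]))   # ~ 0: two samples from the same P
```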