@songying 2018-10-26T11:53:49.000000Z

word2vec Parameter Learning Explained

word-embedding

http://shomy.top/2017/07/28/word2vec-all/

Papers worth consulting

Efficient estimation of word representations in vector space
Distributed representations of words and phrases and their compositionality.

Abstract

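This paper takes an in-depth look at how the parameters of word2vec are trained, covering the CBOW and Skip-gram models as well as optimization techniques such as hierarchical softmax and negative sampling. It also provides an intuitive explanation and a detailed derivation of the gradient descent updates.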

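1. CBOW: Continuous Bag-of-Words Model

1.1 One-word context (simple model)

We first assume that the context is a single word and that the target is also a single word, which makes the setup resemble a bigram model.

Note that the input layer has length V, the vocabulary size, and the output layer also has length V, because the softmax produces one probability per word in the vocabulary.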

vocabulary size : V

hidden layer size : N

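Units in adjacent layers are fully connected.

The input is a one-hot encoded vector $\{x_1, x_2, \dots, x_V\}$, in which only $x_k$ is 1 and all other entries are 0.

The weight matrix between the input layer and the hidden layer is $W_{V \times N}$; each row is the N-dimensional vector $v_w$ of the corresponding word.

The weight matrix between the hidden layer and the output layer is $W'_{N \times V}$.

Given a context word with $x_k = 1$ and all other entries 0, we have: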

$h = W^T x = W^T_{(k,\cdot)} := v_{w_I}^T$

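h: the k-th row of W, written as a column vector

$w_I$: the input word

$v_{w_I}$: the vector representation of the input word $w_I$

Note: the hidden layer has no activation function, so its output is simply a linear combination of the inputs; with a one-hot input it is a copy of one row of W.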
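
The row-selection effect of $h = W^T x$ can be checked with a minimal sketch. The toy sizes, the values in `W`, and the helper names `one_hot` and `hidden_layer` are assumptions made for illustration; plain Python lists are used to avoid any dependency.

```python
# Toy sizes (assumed for illustration): V = 4 words, N = 3 hidden units.
V, N = 4, 3

# Input-to-hidden weights W (V x N); row k is the input vector of word k.
W = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def one_hot(k, size):
    """Return a one-hot list of length `size` with a 1 at index k."""
    return [1.0 if i == k else 0.0 for i in range(size)]

def hidden_layer(W, x):
    """Compute h = W^T x; there is no activation function."""
    n_hidden = len(W[0])
    return [sum(W[i][j] * x[i] for i in range(len(W))) for j in range(n_hidden)]

k = 2
h = hidden_layer(W, one_hot(k, V))
# h holds the same numbers as row k of W.
```

Because the product only ever selects one row, an implementation can skip the multiplication and index `W[k]` directly.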

From the hidden layer to the output layer, the weight matrix $W'$ is an $N \times V$ matrix, so we have:

$u_j = {v'_{w_j}}^T h$, where $v'_{w_j}$ is the j-th column of $W'$

The output layer is a softmax layer, so we have:

$p(w_j|w_I) = y_j = \frac{\exp(u_j)}{\sum_{j'=1}^V \exp(u_{j'})}$

$y_j$: the output of the j-th unit of the softmax output layer, a probability between 0 and 1.

Combining the three equations above, we obtain:

$p(w_j|w_I) = \frac{\exp({v'_{w_j}}^T v_{w_I})}{\sum_{j'=1}^V \exp({v'_{w_{j'}}}^T v_{w_I})}$

Note that $v_w$ and $v_w'$ are two different representations of the word w: $v_w$ comes from $W$ and $v_w'$ comes from $W'$. In practice, the former is generally used as the word vector.
$v_w$: input vector
$v_w'$: output vector
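
The full forward pass of the one-word context model can be sketched as follows. The matrix values and the function names `softmax` and `one_word_forward` are assumptions made for illustration. Subtracting the maximum score before exponentiating is a common numerical-stability trick that leaves the result unchanged; it is not part of the derivation above.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def one_word_forward(W, W_out, k):
    """Return p(w_j | w_I) for every j, where w_I is the word with index k.

    W     : V x N input-to-hidden matrix (rows are input vectors v_w)
    W_out : N x V hidden-to-output matrix (columns are output vectors v'_w)
    """
    h = W[k]  # h = W^T x reduces to row k for a one-hot x
    V = len(W_out[0])
    # u_j = v'_{w_j}^T h, the dot product of column j of W_out with h
    u = [sum(W_out[i][j] * h[i] for i in range(len(h))) for j in range(V)]
    return softmax(u)

# Toy values (assumed): V = 3 words, N = 2 hidden units.
W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
W_out = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]
y = one_word_forward(W, W_out, k=1)
```

Here `y[j]` plays the role of $y_j$: the entries are positive and sum to 1, and the largest entry belongs to the word whose output vector has the largest dot product with $v_{w_I}$.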

1.2 Multi-word context

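This section covers the case where the context contains multiple words. The model structure is shown in the figure below:

The input consists of C words, $x_{1k}, \cdots, x_{Ck}$, each represented as a one-hot vector.

$W_{V \times N}$: the input-to-hidden matrix, shared by all C input words

The computation of h changes in the input-to-hidden step. In the one-word context model, the k-th row of W is taken directly as h; here, we take from W the word vectors of the C input words and average them.

$h = \frac{1}{C}W^T(x_1 + x_2 + \cdots + x_C) = \frac{1}{C}(v_{w_1} + \cdots + v_{w_C})^T$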
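
As a minimal sketch of this averaging step, the hidden vector can be built by summing the selected rows of W and dividing by C. The values in `W`, the context indices, and the name `cbow_hidden` are assumptions made for illustration.

```python
def cbow_hidden(W, context_ids):
    """Average the input vectors (rows of W) of the C context words."""
    C = len(context_ids)
    N = len(W[0])
    return [sum(W[k][j] for k in context_ids) / C for j in range(N)]

# Toy values (assumed): V = 4 words, N = 2 hidden units.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

h = cbow_hidden(W, [0, 2, 3])   # context of C = 3 words
h_single = cbow_hidden(W, [1])  # C = 1 falls back to the one-word case
```

With a single context word the average is just that word's row, so the one-word context model is the special case C = 1.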

The hidden-to-output step is the same as in the one-word context model, so we have:

$p(w_j|w_{I,1}, \cdots, w_{I,C}) = y_j = \frac{\exp(u_j)}{\sum_{j'=1}^V \exp(u_{j'})}$

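2. Skip-Gram Model

This model predicts the context from a single word. The model is shown in the figure below:

Structurally, Skip-Gram is the reverse of CBOW and closely resembles the one-word context model.

We use $v_{w_I}$ to denote the vector of the input word. From the input layer to the hidden layer the computation is the same as in the one-word context model, so the hidden state is:

$h = W^T_{(k,\cdot)} = v_{w_I}^T$

From the hidden layer to the output layer, we need to output C words, so the output consists of C distributions $y_1, \cdots, y_C$, each computed separately:

$P(w_{c,j} = w_{O,c}|w_I) = y_{c,j} = \frac{\exp(u_{c,j})}{\sum_{j'=1}^V \exp(u_{j'})}$

The symbols correspond to the figure above: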

$w_{c,j}$: the j-th word on the c-th panel of the output layer

$w_{O,c}$: the actual c-th word in the output context words

$w_I$: the only input word

$y_{c,j}$: the output of the j-th unit on the c-th panel of the output layer

$u_{c,j}$: the net input of the j-th unit on the c-th panel of the output layer

Because the softmax panels of the output layer share the same matrix $W'$, we have:

$u_{c,j} = u_j = {v'_{w_j}}^T \cdot h, \text{ for } c = 1, 2, \cdots, C$

$v'_{w_j}$: the output vector of the j-th word in the vocabulary, i.e. the j-th column of the matrix $W'$
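
Since every panel uses the same $W'$ and the same h, all C panels end up with the same distribution; only the target word differs per panel. The sketch below illustrates this. The matrix values, the context indices, and the names `softmax` and `skipgram_forward` are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def skipgram_forward(W, W_out, k, C):
    """Return the C output distributions for the input word with index k.

    W     : V x N matrix, rows are input vectors v_w
    W_out : N x V matrix, columns are output vectors v'_w
    """
    h = W[k]  # hidden state: row k of W
    V = len(W_out[0])
    u = [sum(W_out[i][j] * h[i] for i in range(len(h))) for j in range(V)]
    y = softmax(u)                      # computed once ...
    return [list(y) for _ in range(C)]  # ... and shared by all C panels

# Toy values (assumed): V = 3 words, N = 2 hidden units.
W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
W_out = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]
panels = skipgram_forward(W, W_out, k=0, C=2)

# Hypothetical actual context words w_{O,1}, w_{O,2}, given as vocabulary indices.
context_ids = [1, 2]
probs = [panels[c][j] for c, j in enumerate(context_ids)]  # y_{c,j} for each target
```

In practice the distribution only needs to be computed once; the panels are materialized here only to mirror the notation $y_{c,j}$.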

3. Optimizing computational efficiency

The optimizations mainly target the hidden-to-output matrix $W'$. The two methods used are Hierarchical Softmax and Negative Sampling.
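
This note does not give the formulas for either method. As a rough sketch of why negative sampling is cheaper, the code below evaluates the standard negative-sampling loss, $E = -\log \sigma({v'_{w_O}}^T h) - \sum_{w_j \in \text{neg}} \log \sigma(-{v'_{w_j}}^T h)$, which touches only one positive and K negative output vectors instead of all V. Several things here are assumptions made for illustration: the output vectors are stored as rows (the transpose of $W'$), the negative indices are picked by hand rather than drawn from a noise distribution, and the names `sigmoid` and `negative_sampling_loss` are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_loss(h, out_vectors, target_id, negative_ids):
    """Loss for one training pair using 1 positive and K negative words.

    out_vectors : list of V output vectors v'_w (assumed stored as rows)
    """
    loss = -math.log(sigmoid(dot(out_vectors[target_id], h)))
    for j in negative_ids:
        loss -= math.log(sigmoid(-dot(out_vectors[j], h)))
    return loss

# Toy values (assumed): V = 4 words, N = 2 hidden units.
out_vectors = [[0.2, 0.0], [-0.1, 0.3], [0.4, -0.2], [0.1, 0.1]]
h = [0.3, 0.4]
loss = negative_sampling_loss(h, out_vectors, target_id=1, negative_ids=[0, 3])

# With h = 0 every sigmoid equals 0.5, so the loss is (1 + K) * log 2.
loss_zero = negative_sampling_loss([0.0, 0.0], out_vectors, 1, [0, 3])
```

Only the rows indexed by `target_id` and `negative_ids` are read, so the cost per training pair grows with K rather than with the vocabulary size V.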