@pluto-the-lost · 2019-06-28

NNLM & Word2vec

deep-learning representation-learning NLP pre-trained-model


(NNLM paper) Y. Bengio, R. Ducharme, P. Vincent. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155, 2003.

(word2vec paper) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. ICLR Workshop, 2013.

Representation of Words


NNLM

[Figure: NNLM architecture]

1. word indices extract D-dimensional vectors from a shared projection matrix
2. the N context words' vectors are projected to a hidden layer with H hidden units and a tanh non-linear activation
3. the output layer computes a softmax probability for each word in the vocabulary
4. the model predicts a word $w_t$ using its previous N words (a minimal sketch of this forward pass follows the list)
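
Below is a minimal numpy sketch of this forward pass with toy sizes; the variable names and dimensions are illustrative choices, not taken from the paper (the paper's optional direct input-to-output connections are omitted).

```python
import numpy as np

# Toy sizes, for illustration only (compare with the scale estimates below).
V, D, H, N = 10000, 100, 50, 4   # vocabulary size, embedding dim, hidden units, context length

rng = np.random.default_rng(0)
C = rng.normal(0, 0.1, (V, D))          # shared projection (embedding) matrix
W_h = rng.normal(0, 0.1, (N * D, H))    # projection-to-hidden weights
b_h = np.zeros(H)
W_o = rng.normal(0, 0.1, (H, V))        # hidden-to-output weights
b_o = np.zeros(V)

def nnlm_forward(context_ids):
    """Return a probability distribution over the next word, given N previous word indices."""
    x = C[context_ids].reshape(-1)        # 1. look up and concatenate N word vectors -> (N*D,)
    h = np.tanh(x @ W_h + b_h)            # 2. tanh hidden layer -> (H,)
    logits = h @ W_o + b_o                # 3. one score per vocabulary word -> (V,)
    exp = np.exp(logits - logits.max())   #    softmax over the whole vocabulary
    return exp / exp.sum()

p = nnlm_forward(np.array([12, 7, 256, 3]))   # 4. predict w_t from its N previous words
print(p.shape, p.sum())                       # (10000,) and the probabilities sum to 1
```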

Objective function:

$$L = \frac{1}{T}\sum_{t} \log P(w_t \mid w_{t-1}, \dots, w_{t-N}) + R(\theta)$$

where $P(w_t \mid w_{t-1}, \dots, w_{t-N})$ is the probability that, given the previous $N$ words, the model predicts the correct word $w_t$, and $R(\theta)$ is a regularization term.

Computational complexity per training example:

$$Q = N \times D + N \times D \times H + H \times V$$

Here we should have a rough estimate of the scale of each number:

N: 5 ~ 10
D: 500 ~ 2000
H: 500 ~ 1000
V: usually millions

It looks like the dominant term of $Q$ should be $H \times V$; however, a hierarchical softmax method can reduce $V$ to ideally $\log_2(V)$. Then the dominant term becomes $N \times D \times H$.
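
Plugging in rough numbers from the ranges above (an illustrative choice, not from the papers) shows where the cost concentrates:

```python
import math

# Illustrative values picked from the ranges above.
N, D, H, V = 10, 1000, 1000, 1_000_000

projection = N * D                # 10_000
hidden     = N * D * H            # 10_000_000
output     = H * V                # 1_000_000_000 -> the full-softmax output layer dominates
output_hs  = H * math.log2(V)     # ~20_000 with hierarchical softmax; now the hidden term dominates
print(projection, hidden, output, round(output_hs))
```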

This method is slow because of the non-linear hidden layer. In 2013, Mikolov et al. showed that the hidden layer is not necessary and proposed another model, called Word2vec.

Word2vec

[Figure: CBOW and skip-gram model architectures]

The method has two distinct models: CBOW and skip-gram. They share the common idea that no extra hidden layer is needed; the word vectors themselves are used to make predictions directly.
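
A minimal sketch of the "no hidden layer" idea for CBOW, assuming two embedding matrices (input/context vectors and output vectors) and toy sizes; the names are my own, not from the paper:

```python
import numpy as np

V, D = 10000, 100                        # vocabulary size, embedding dimension (toy values)
rng = np.random.default_rng(0)
W_in = rng.normal(0, 0.1, (V, D))        # input (context) word vectors
W_out = rng.normal(0, 0.1, (V, D))       # output word vectors

def cbow_predict(context_ids):
    """Average the context word vectors and score them against every output vector directly."""
    h = W_in[context_ids].mean(axis=0)   # no tanh hidden layer, just an average -> (D,)
    logits = W_out @ h                   # one score per vocabulary word -> (V,)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()               # softmax probabilities over the vocabulary

p = cbow_predict(np.array([5, 42, 7, 99]))   # predict the center word from its context
print(p.argmax(), p.shape)
```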

Objective function

The objective is mostly the same as NNLM. Note one difference: NNLM predicts a word using its previous words, while the CBOW model predicts a word using both previous and subsequent words.

continuous bag-of-words (CBOW)
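
CBOW predicts the center word from the surrounding context. A sketch of its objective under standard notation (a symmetric window of size $c$ is my assumption, not stated in the original notes):

$$J_{\text{CBOW}} = \frac{1}{T}\sum_{t=1}^{T} \log p\!\left(w_t \mid w_{t-c}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+c}\right)$$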

skip-gram
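
Skip-gram reverses the direction and predicts the context words from the center word. A sketch of its objective under the same assumed notation:

$$J_{\text{skip-gram}} = \frac{1}{T}\sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \neq 0} \log p\!\left(w_{t+j} \mid w_t\right)$$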


tricks

hierarchical softmax

Use a Huffman tree to assign short binary codes to frequent words. The tree is constructed as below (a minimal code sketch follows the list):
[Figure: Huffman tree construction]

1. each node in the candidate pool is regarded as a tree (with only a root)
2. each tree has a score (for a word, its frequency); the total weighted code length is minimized in the final tree
3. until only one tree remains:
    3.1 merge the two trees in the candidate pool with the smallest scores
    3.2 add the newly merged tree to the candidate pool, setting its score to the sum of the two merged trees' scores
    3.3 remove the two merged trees from the candidate pool
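
A minimal Python sketch of this construction, using word frequency as the score; the function names are illustrative:

```python
import heapq
import itertools

def build_huffman(freqs):
    """freqs: dict word -> frequency. Returns dict word -> binary code (as a string)."""
    counter = itertools.count()               # tie-breaker so heapq never compares trees
    # 1. every word starts as a single-node tree whose score is its frequency
    heap = [(f, next(counter), w) for w, f in freqs.items()]
    heapq.heapify(heap)
    codes = {w: "" for w in freqs}
    while len(heap) > 1:                      # 3. repeat until only one tree remains
        f1, _, t1 = heapq.heappop(heap)       # 3.1 pop the two trees with the smallest scores
        f2, _, t2 = heapq.heappop(heap)
        for w in _leaves(t1):                 # left subtree gets bit 0 prepended
            codes[w] = "0" + codes[w]
        for w in _leaves(t2):                 # right subtree gets bit 1 prepended
            codes[w] = "1" + codes[w]
        # 3.2 push the merged tree back with the summed score (3.3: its children are now out of the pool)
        heapq.heappush(heap, (f1 + f2, next(counter), (t1, t2)))
    return codes

def _leaves(tree):
    """Yield the words at the leaves of a (possibly nested) tuple tree."""
    if isinstance(tree, tuple):
        for sub in tree:
            yield from _leaves(sub)
    else:
        yield tree

print(build_huffman({"the": 50, "cat": 10, "sat": 8, "on": 20, "mat": 3}))
# frequent words such as "the" end up with short codes
```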

negative sampling

phrase vector
