@pluto-the-lost
2019-06-28T14:52:48.000000Z
deep-learning
representation-learning
NLP
pre-trained-model
1. Word indices each extract a D-dimensional vector from a shared projection matrix.
2. The N context words' vectors are projected to a hidden layer with H hidden units and a tanh non-linear activation.
3. The output layer computes a softmax probability for each word in the vocabulary.
4. The model predicts a word $w_t$ using its previous N words (see the sketch after this list).
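As a concrete illustration, here is a minimal NumPy sketch of that forward pass; the sizes, the parameter names (`C`, `W_h`, `W_o`), and the random initialization are illustrative assumptions, not details from the original model:

```python
import numpy as np

V, D, H, N = 10000, 500, 500, 5   # vocabulary, embedding dim, hidden units, context size

rng = np.random.default_rng(0)
C = rng.normal(scale=0.01, size=(V, D))        # shared projection matrix (word vectors)
W_h = rng.normal(scale=0.01, size=(N * D, H))  # projection -> hidden
W_o = rng.normal(scale=0.01, size=(H, V))      # hidden -> output (the expensive part)

def nnlm_forward(context_ids):
    """P(w_t | previous N words) for one context of N word indices."""
    x = C[context_ids].reshape(-1)              # 1. look up and concatenate N vectors -> N*D
    h = np.tanh(x @ W_h)                        # 2. non-linear hidden layer with H units
    logits = h @ W_o                            # 3. scores for every word in the vocabulary
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()                  # softmax over V words

p = nnlm_forward(np.array([3, 17, 42, 7, 99])) # previous N=5 word indices
print(p.shape, p.sum())                         # (10000,) ~1.0
```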
Objective function (maximize the average log-likelihood of each word given its previous N words):

$$\frac{1}{T}\sum_{t=1}^{T}\log P(w_t \mid w_{t-N}, \dots, w_{t-1})$$
Computational complexity (per training example):

$$Q = N \times D + N \times D \times H + H \times V$$
Here is a rough estimate of the typical scale of each quantity:
N: 5 ~ 10
D: 500 ~ 2000
H: 500 ~ 1000
V: usually millions
It looks like the dominant term should be $H \times V$; however, a hierarchical softmax can reduce it to ideally $H \times \log_2(V)$. The dominant term then becomes $N \times D \times H$.
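A quick back-of-the-envelope check with values in the ranges above (the specific numbers are only illustrative):

```python
import math

N, D, H, V = 10, 1000, 1000, 1_000_000          # illustrative values in the ranges above

full_softmax = N*D + N*D*H + H*V                # H*V term dominates
hier_softmax = N*D + N*D*H + H*math.log2(V)     # V replaced by ~log2(V)

print(f"{full_softmax:,.0f}")                   # ~1.01e9, dominated by H*V = 1e9
print(f"{hier_softmax:,.0f}")                   # ~1.0e7, now dominated by N*D*H = 1e7
```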
This method is slow because of the non-linear hidden layer. In 2013, Mikolov et al. showed that the hidden layer is not necessary and proposed another model called Word2vec.
Word2vec comes in two distinct models: CBOW and skip-gram. They share the common idea that no extra hidden layer is needed; the word vectors are used to make predictions directly.
Objective function
The objective is mostly the same as NNLM's. Note one difference: NNLM predicts a word using its previous words, while the CBOW model predicts a word using both previous and subsequent words.
bag-of-words: all N context word vectors are projected into the same position (they are averaged), so the order of the words does not influence the projection.
continuous bag-of-words: unlike a standard bag-of-words model, the context is represented by continuous distributed word vectors.
Computational complexity (per training example): $Q = N \times D + D \times \log_2(V)$ with hierarchical softmax.
The CBOW model works better on smaller-scale data.
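To make the "no hidden layer" idea concrete, here is a minimal NumPy sketch of a CBOW-style prediction; it averages the context vectors and scores words with a plain full softmax rather than the hierarchical softmax discussed below, and all names and sizes are illustrative assumptions:

```python
import numpy as np

V, D, N = 10000, 300, 8                         # illustrative sizes
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.01, size=(V, D))      # input word vectors (projection matrix)
W_out = rng.normal(scale=0.01, size=(D, V))     # output word vectors

def cbow_forward(context_ids):
    """P(center word | surrounding words): no hidden layer, just an averaged projection."""
    h = W_in[context_ids].mean(axis=0)          # average the N context vectors -> (D,)
    logits = h @ W_out                          # score every vocabulary word directly
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# 4 previous + 4 subsequent word indices around the word to predict
p = cbow_forward(np.array([5, 12, 7, 3, 88, 41, 9, 2]))
print(p.argmax(), p.shape)                      # most probable center word index, (10000,)
```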
Use a Huffman tree to assign short binary codes to frequent words. The tree is constructed as follows (a runnable sketch follows this list):
1. Each node in the candidate pool is regarded as a tree (with only a root).
2. Each tree has a score (for a word, its frequency); the construction minimizes the total score-weighted code length of the final tree.
3. Until only one tree remains in the candidate pool:
3.1 Merge the two trees in the candidate pool with the smallest scores.
3.2 Add the newly merged tree to the candidate pool, setting its score to the sum of the two merged trees' scores.
3.3 Remove the two merged trees from the candidate pool.
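A minimal Python sketch of this construction, using `heapq` as the candidate pool; the word frequencies are made up for illustration:

```python
import heapq
import itertools

def build_huffman_codes(freqs):
    """Build binary codes from word frequencies; frequent words get shorter codes."""
    counter = itertools.count()                  # tie-breaker so the heap never compares dicts
    # Candidate pool: each word starts as a single-root tree, scored by its frequency.
    pool = [(f, next(counter), {"word": w}) for w, f in freqs.items()]
    heapq.heapify(pool)
    while len(pool) > 1:                         # until only one tree remains
        f1, _, left = heapq.heappop(pool)        # merge the two smallest-score trees
        f2, _, right = heapq.heappop(pool)
        heapq.heappush(pool, (f1 + f2, next(counter), {"left": left, "right": right}))
    codes = {}
    def walk(node, prefix):                      # read binary codes off the final tree
        if "word" in node:
            codes[node["word"]] = prefix or "0"  # single-word corner case
        else:
            walk(node["left"], prefix + "0")
            walk(node["right"], prefix + "1")
    walk(pool[0][2], "")
    return codes

print(build_huffman_codes({"the": 100, "cat": 10, "sat": 8, "hippopotamus": 1}))
# {'the': '1', 'cat': '01', 'sat': '001', 'hippopotamus': '000'}
```

As expected, the most frequent word ("the") ends up closest to the root and gets the shortest code, which is what makes the hierarchical softmax traversal cheap on average.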