
Global Word Vectors

representation-learning NLP machine-learning matrix-decomposition


These methods, including LSA, PLSA, LDA and GloVe, are based on statistics and are closely related to topic models. They are not specifically designed for word representation; however, LSA, PLSA and GloVe do obtain word vectors during training.

**Matrix decomposition**

**Latent semantic analysis (LSA)**:

Also called latent semantic indexing (LSI). LSA applies SVD to a word-document count matrix and keeps only the top singular dimensions; the resulting low-rank factors serve as word and document representations.
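A minimal sketch of this idea in Python (the toy corpus is made up, and scikit-learn's `CountVectorizer` and `TruncatedSVD` are used only for convenience; they are not part of the original post):

```python
# Build a word-document count matrix and truncate its SVD,
# keeping the low-rank factors as document and word vectors.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# rows = documents, columns = words (term frequencies)
counts = CountVectorizer().fit_transform(docs)   # shape: (n_docs, n_words)

# rank-k SVD of the word-document matrix
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(counts)          # one row per document
word_vectors = svd.components_.T                 # one row per word

print(doc_vectors.shape, word_vectors.shape)
```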

PLSA, LDA

These two methods bring the notion of a "topic" into their models, so they are also called "topic models". Basically, they assume that each document has one or more topics, and that different topics yield different probabilities of word appearance.

Since they do not directly produce a word representation, we will not discuss them thoroughly; more complete treatments are available elsewhere.

PLSA assumes that each document $d$ has a topic $z$, which is unobservable but can be described by the distribution $p(z \mid d)$. Different topics correspond to different probability distributions over words, $p(w \mid z)$. The problem to be solved is

$$\max_{\theta} \sum_{d} \sum_{w} n(d, w) \, \log p(w \mid d) \;=\; \max_{\theta} \sum_{d} \sum_{w} n(d, w) \, \log \sum_{z} p(w \mid z) \, p(z \mid d)$$

where $n(d, w)$ is the count of word $w$ in document $d$ and $\theta$ denotes the parameters to be estimated, including those in $p(z \mid d)$ and in $p(w \mid z)$. Since $z$ is a latent variable: if $z$ were known, the term $\log \sum_{z} p(w \mid z) \, p(z \mid d)$ would reduce to $\log p(w \mid z) \, p(z \mid d)$, and MLE would be straightforward. The EM algorithm can therefore be applied to solve this problem.
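A toy numpy sketch of those EM updates (the counts matrix `N` and all sizes are made up for illustration; this is not the post's code):

```python
# Alternate the E-step p(z|d,w) and the M-step updates of p(w|z), p(z|d).
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_words, n_topics = 6, 20, 3
N = rng.integers(0, 5, size=(n_docs, n_words))       # n(d, w) word counts

# random initialisation of the two parameter tables
p_z_d = rng.random((n_docs, n_topics)); p_z_d /= p_z_d.sum(1, keepdims=True)
p_w_z = rng.random((n_topics, n_words)); p_w_z /= p_w_z.sum(1, keepdims=True)

for _ in range(50):
    # E-step: posterior p(z | d, w) for every (d, w) pair
    post = p_z_d[:, :, None] * p_w_z[None, :, :]     # shape (d, z, w)
    post /= post.sum(axis=1, keepdims=True) + 1e-12

    # M-step: re-estimate p(w|z) and p(z|d) from expected counts
    expected = N[:, None, :] * post                  # n(d,w) * p(z|d,w)
    p_w_z = expected.sum(axis=0)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
    p_z_d = expected.sum(axis=2)
    p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12

print(p_z_d.round(2))                                # per-document topic mixtures
```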

LDA makes a further assumption: every word in a document is sampled under its own topic. In other words, there is a latent topic assignment matrix $Z \in \{1, \dots, K\}^{M \times N}$, where $M$ is the number of documents and $N$ is the number of words in each document (without loss of generality, let all documents have the same length). The probability model for LDA is

$$p(W, Z, \theta, \varphi \mid \alpha, \beta) \;=\; \prod_{i=1}^{K} p(\varphi_i \mid \beta) \, \prod_{j=1}^{M} p(\theta_j \mid \alpha) \, \prod_{t=1}^{N} p(Z_{j,t} \mid \theta_j) \, p(W_{j,t} \mid \varphi_{Z_{j,t}})$$
The meaning of the symbols in the equation:
- $W$: the corpus matrix; $W_{j,t}$ is the identity of word $t$ in document $j$
- $Z$: the latent topic assignment matrix; $Z_{j,t}$ is the topic assigned to word $t$ in document $j$
- $\theta_j$: posterior probability for document $j$ to have each topic, used as the document representation
- $\varphi_i$: posterior probability for topic $i$ to generate each word, used as the word representation
- $\alpha$: prior probability of topics, parameters of a Dirichlet distribution
- $\beta$: prior probability of words, parameters of another Dirichlet distribution

Collapsed Gibbs sampling is used to solve this problem by sampling $Z$. Once $Z$ is known, it is easy to calculate $\theta$ and $\varphi$ from the resulting topic-assignment counts.
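A compact sketch of such a collapsed Gibbs sampler (the toy corpus, vocabulary size and hyperparameters are assumptions for illustration):

```python
# Resample the topic assignment Z one word at a time, then read
# theta (doc-topic) and phi (topic-word) off the accumulated counts.
import numpy as np

rng = np.random.default_rng(0)
docs = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 4, 4, 1]]    # word ids per document
V, K, alpha, beta = 5, 2, 0.1, 0.01                  # vocab size, topics, priors

n_dk = np.zeros((len(docs), K))                      # doc-topic counts
n_kw = np.zeros((K, V))                              # topic-word counts
n_k = np.zeros(K)                                    # words per topic
Z = [[rng.integers(K) for _ in doc] for doc in docs] # random initial assignment

for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = Z[d][i]
        n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

for _ in range(200):                                 # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = Z[d][i]                              # remove current assignment
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            # full conditional p(Z_{d,i} = k | Z_-di, W), up to a constant
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k = rng.choice(K, p=p / p.sum())         # sample a new topic
            Z[d][i] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

theta = (n_dk + alpha) / (n_dk + alpha).sum(1, keepdims=True)  # p(topic | doc)
phi = (n_kw + beta) / (n_kw + beta).sum(1, keepdims=True)      # p(word | topic)
print(theta.round(2)); print(phi.round(2))
```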

Global Vectors for Word Representation (GloVe)

The work was published in 2014 by Pennington, Socher and Manning at Stanford. Briefly, it decomposes a weighted log word-word co-occurrence matrix. More specifically, it minimizes the cost function

$$J \;=\; \sum_{i, j = 1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
where the symbols are (a short numerical sketch of this objective follows the list):
- $X$: the word-word co-occurrence matrix; $X_{ij}$ counts how often word $j$ appears in the context of word $i$
- $w_i$: the distributed representation of word $i$, learned during training
- $\tilde{w}_j$: the distributed representation of context word $j$, also called the "separate context word vectors"
- $b_i$, $\tilde{b}_j$: biases specific to word $i$ or context word $j$
- $f(X_{ij})$: a weighting function that down-weights rare co-occurrences and caps very frequent ones
- $V$: the vocabulary size
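A small numpy sketch that simply evaluates this cost on random toy data (the shapes and the random co-occurrence matrix are assumptions; only the weighting function follows the form given in the paper):

```python
# Evaluate the GloVe objective J on toy data; an optimizer would then
# minimize it with respect to W, W_tilde, b and b_tilde.
import numpy as np

rng = np.random.default_rng(0)
V, dim = 50, 10
X = rng.poisson(1.0, size=(V, V)).astype(float)     # toy co-occurrence counts X_ij

W = 0.1 * rng.standard_normal((V, dim))             # word vectors w_i
W_tilde = 0.1 * rng.standard_normal((V, dim))       # context vectors w~_j
b, b_tilde = np.zeros(V), np.zeros(V)               # biases b_i, b~_j

def weight(x, x_max=100.0, alpha=0.75):
    """f(X_ij): (x / x_max)^alpha below x_max, 1 above it."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

mask = X > 0                                        # sum only over nonzero X_ij

def glove_cost():
    pred = W @ W_tilde.T + b[:, None] + b_tilde[None, :]
    err = pred - np.log(np.where(mask, X, 1.0))     # log X_ij where defined
    return np.sum(weight(X) * mask * err ** 2)

print(glove_cost())
```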

The derivation of this model starts from an assumption: a "probe" word $k$ can serve as a bridge for comparing two words $i$ and $j$, through the ratio of their co-occurrence probabilities $P_{ik} / P_{jk}$.
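A condensed sketch of that derivation, following the steps in the GloVe paper:

$$
\begin{aligned}
F(w_i, w_j, \tilde{w}_k) &= \frac{P_{ik}}{P_{jk}}, \qquad P_{ik} = \frac{X_{ik}}{X_i}, \quad X_i = \sum_m X_{im} \\
F\big((w_i - w_j)^{\top} \tilde{w}_k\big) &= \frac{F(w_i^{\top} \tilde{w}_k)}{F(w_j^{\top} \tilde{w}_k)} \;\Rightarrow\; F = \exp \\
w_i^{\top} \tilde{w}_k &= \log P_{ik} = \log X_{ik} - \log X_i \\
w_i^{\top} \tilde{w}_k + b_i + \tilde{b}_k &= \log X_{ik}
\end{aligned}
$$

Minimizing the weighted squared difference between the two sides of the last line over all word pairs gives the cost $J$ above.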

This is how the authors constructed the cost function, and these are the assumptions they made during its derivation.
