This series of methods, including LSA, PLSA, LDA, and GloVe, is based on statistical models called topic models. They are not specifically designed for word representation; however, LSA, PLSA, and GloVe do obtain word vectors during training.
Technically, unlike Word2vec-like methods that "learn" word representations from a training corpus, topic models obtain them by decomposing a matrix. There are mainly two kinds of matrices to be decomposed:
- Word-document matrix: rows are words (terms), columns are documents in the corpus, and each entry counts how many times the word appears in the document. LSA uses this strategy.
- Word-word matrix: rows and columns both represent terms, and each entry counts the number of times a given word occurs in the context of another given word. HAL utilizes matrices like this.
Basically, any matrix decomposition method can be applied to either of these two kinds of matrices, so we mainly focus on the decomposition methods; a small sketch of building both matrices is given below.
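As a concrete illustration, here is a minimal sketch of building both kinds of matrices from a tiny made-up corpus (the corpus, the window size of 2, and the variable names are chosen here purely for illustration):

```python
import numpy as np
from collections import Counter

# Toy corpus: three tiny "documents" (made-up example).
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are animals".split(),
]
vocab = sorted({w for d in docs for w in d})
idx = {w: i for i, w in enumerate(vocab)}

# Word-document matrix: rows = words, columns = documents,
# entry (i, j) = number of times word i appears in document j.
word_doc = np.zeros((len(vocab), len(docs)), dtype=int)
for j, d in enumerate(docs):
    for w, c in Counter(d).items():
        word_doc[idx[w], j] = c

# Word-word co-occurrence matrix: entry (i, j) = number of times
# word j occurs within a +/-2 word window around word i.
window = 2
word_word = np.zeros((len(vocab), len(vocab)), dtype=int)
for d in docs:
    for pos, w in enumerate(d):
        for ctx in d[max(0, pos - window):pos] + d[pos + 1:pos + 1 + window]:
            word_word[idx[w], idx[ctx]] += 1

print(word_doc.shape, word_word.shape)  # (vocab size, 3) and (vocab size, vocab size)
```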
Also called latent semantic indexing (LSI). LSA uses SVD to decompose a word-document matrix.
Why word-document matrix:
The original goal of LSA was to support querying documents (information retrieval) while taking synonyms into consideration
Why do matrix decomposition:
if we just use bag-of-words to represent a document, sentences with the same meaning but different wording will not be matched
the assumption is that words with similar meanings are more likely to appear in the same document
SVD can denoise the documents and uncover their latent semantic structure
Why LSA also gives a representation of words:
when doing SVD, a word-document matrix $X \in \mathbb{R}^{m \times n}$ ($m$ words, $n$ documents) is decomposed into three distinct matrices:
$$X \approx U_k \Sigma_k V_k^{T}$$
where $k$ is a hyperparameter, the number of singular values kept
if we combine $\Sigma_k$ and $V_k^{T}$, we get a $k \times n$ matrix $\Sigma_k V_k^{T}$, which is a continuous representation of the documents, where each column is a document
meanwhile, $U_k$ is a continuous representation of words, where each row is a word
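A minimal sketch of this step, reusing the `word_doc` matrix from the earlier snippet and numpy's SVD (the choice of $k = 2$ is arbitrary):

```python
import numpy as np

# word_doc: word-document count matrix from the previous snippet
# (rows = words, columns = documents).
k = 2  # number of latent dimensions, a hyperparameter

U, s, Vt = np.linalg.svd(word_doc.astype(float), full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

word_vectors = U_k        # each row is a k-dimensional word representation
doc_vectors = S_k @ Vt_k  # each column is a k-dimensional document representation

# Truncating to k singular values denoises X and keeps the latent structure.
print(word_vectors.shape, doc_vectors.shape)
```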
PLSA, LDA
These two methods bring "topics" into their models, so they are also called "topic models". Basically, they assume that each document has one or more topics, and that different topics yield different probabilities of word appearance.
Since they do not directly produce a word representation, we will not discuss them thoroughly. You can study them here.
PLSA assumes that each document $d$ has a topic $z$, which is unobservable but can be described by the distribution $p(z|d)$. Different topics correspond to different probability distributions over words, $p(w|z)$. The problem to be solved is
$$\max_{\theta}\; \mathcal{L}(\theta) = \sum_{d}\sum_{w} n(d,w)\, \log \sum_{z} p(w|z)\, p(z|d)$$
where $\theta$ denotes the parameters to be estimated, including the parameters in $p(z|d)$ and in $p(w|z)$, and $n(d,w)$ is the count of word $w$ in document $d$. Considering that $z$ is a latent variable: if $z$ were known, the sum $\sum_{z} p(w|z)\, p(z|d)$ inside the log would reduce to a single term $p(w|z)\, p(z|d)$, and MLE would be easy under this situation. The EM algorithm can therefore be applied to solve this problem.
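Here is a bare-bones sketch of the EM iterations for PLSA on a document-word count matrix; it is an illustration of the formulation above, not the original implementation, and the topic count, initialization, and iteration count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def plsa(n_dw, n_topics=2, n_iter=50):
    """EM for PLSA. n_dw: document-word count matrix of shape (D, W)."""
    D, W = n_dw.shape
    # Random initialization of p(z|d) and p(w|z), normalized to distributions.
    p_z_d = rng.random((D, n_topics)); p_z_d /= p_z_d.sum(1, keepdims=True)
    p_w_z = rng.random((n_topics, W)); p_w_z /= p_w_z.sum(1, keepdims=True)

    for _ in range(n_iter):
        # E-step: posterior of the latent topic, p(z|d,w) proportional to p(z|d) * p(w|z).
        post = p_z_d[:, :, None] * p_w_z[None, :, :]     # shape (D, Z, W)
        post /= post.sum(axis=1, keepdims=True) + 1e-12

        # M-step: re-estimate p(w|z) and p(z|d) from expected counts n(d,w) * p(z|d,w).
        expected = n_dw[:, None, :] * post               # shape (D, Z, W)
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12

    return p_z_d, p_w_z   # document-topic and topic-word distributions

# Usage, reusing the word-document matrix built earlier (transposed to D x W):
# p_z_d, p_w_z = plsa(word_doc.T.astype(float))
```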
LDA makes another assumption: each word in a document is sampled under its own topic. In other words, there is a latent topic assignment matrix $Z \in \{1,\dots,K\}^{M \times N}$, where $M$ is the number of documents and $N$ is the number of words in each document (without loss of generality, let them have the same length). The probability model for LDA is
$$p(W, Z, \theta, \varphi \mid \alpha, \beta) = \prod_{k=1}^{K} p(\varphi_k \mid \beta)\, \prod_{d=1}^{M} p(\theta_d \mid \alpha) \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \varphi_{z_{d,n}})$$
the meaning of symbols in the equation:
- $W$: the corpus matrix; $w_{d,n}$ is the word identity for word $n$ in document $d$
- $Z$: the latent topic assignment matrix; $z_{d,n}$ is the topic identity for word $n$ in document $d$
- $\theta$: posterior probability $\theta_{d,k}$ for document $d$ to have topic $k$; the document representation
- $\varphi$: posterior probability $\varphi_{k,w}$ for topic $k$ to generate word $w$; the word representation
- $\alpha$: prior probability of topics, parameters of a Dirichlet distribution
- $\beta$: prior probability of words, parameters of another Dirichlet distribution
Collapsed Gibbs sampling is used to solve this problem by sampling $Z$. Once $Z$ is determined, it is easy to calculate $\theta$ and $\varphi$, as sketched below.
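A condensed sketch of collapsed Gibbs sampling for LDA (the topic count, hyperparameter values, and iteration count are illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def lda_gibbs(docs, V, K=2, alpha=0.1, beta=0.01, n_iter=200):
    """docs: list of documents, each a list of word ids in [0, V)."""
    D = len(docs)
    n_dk = np.zeros((D, K))        # document-topic counts
    n_kw = np.zeros((K, V))        # topic-word counts
    n_k = np.zeros(K)              # total words assigned to each topic
    z = [np.zeros(len(d), dtype=int) for d in docs]

    # Random initialization of the topic assignment matrix Z.
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = rng.integers(K)
            z[d][i] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current assignment from the count tables.
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # Collapsed conditional p(z_{d,i} = k | Z_rest, W).
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    # Once Z is fixed, theta and phi are simple smoothed count ratios.
    theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)  # document repr.
    phi = (n_kw + beta) / (n_kw + beta).sum(axis=1, keepdims=True)      # per-topic word repr.
    return theta, phi
```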
The work was published in 2014 by researchers at Stanford. Briefly, it decomposes a weighted log word-word co-occurrence matrix. More specifically, it minimizes the cost function
$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{T} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
where the symbols are:
- $X_{ij}$: an entry of the word-word co-occurrence matrix, the number of times word $j$ occurs in the context of word $i$
- $w_i$: the distributed representation of word $i$, learned during training
- $\tilde{w}_j$: the distributed representation of context word $j$, also called a "separate context word vector"
- $b_i$, $\tilde{b}_j$: biases specific to word $i$ and context word $j$
The derivation of this model starts from an assumption: a "probe" word $k$ can serve as a bridge to compare two words $i$ and $j$.
There are two words of interest, say $i$ and $j$, and a probe word $k$. If $k$ is closer to $i$ than to $j$, it will appear more frequently in $i$'s context than in $j$'s context, and we will see $\frac{P_{ik}}{P_{jk}} > 1$, where $P_{ik} = p(k \mid i) = X_{ik} / X_i$ and $X_i = \sum_j X_{ij}$.
Similarly, if $k$ is more similar to $j$, $\frac{P_{ik}}{P_{jk}} < 1$. If $k$ is equally similar (or dissimilar) to both, $\frac{P_{ik}}{P_{jk}} \approx 1$.
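Before formalizing this, a tiny numeric illustration with invented co-occurrence counts (inspired by the ice/steam example in the paper, but the numbers here are made up):

```python
import numpy as np

words = ["ice", "steam", "solid", "gas"]
# Hypothetical co-occurrence counts X[i][j] (rows: target word, columns: probe word).
X = np.array([
    [0,  2, 30,  3],   # ice
    [2,  0,  2, 25],   # steam
    [30, 2,  0,  1],   # solid
    [3, 25,  1,  0],   # gas
], dtype=float)

P = X / X.sum(axis=1, keepdims=True)      # P[i, k] = P(k | i)
i, j = words.index("ice"), words.index("steam")
for probe in ["solid", "gas"]:
    k = words.index(probe)
    print(probe, P[i, k] / P[j, k])       # >> 1 for "solid", << 1 for "gas"
```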
Our word embedding should keep this feature:
$$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$
The embedding space should be linear, so the similarity is a function of the difference:
$$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$
Also for linearity, the argument of $F$ should be a scalar:
$$F\big((w_i - w_j)^{T} \tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}$$
There are still many possible $F$; we want it to be a homomorphism between $(\mathbb{R}, +)$ and $(\mathbb{R}_{>0}, \times)$, which means:
$$F\big((w_i - w_j)^{T} \tilde{w}_k\big) = \frac{F(w_i^{T} \tilde{w}_k)}{F(w_j^{T} \tilde{w}_k)} = \frac{P_{ik}}{P_{jk}}$$
Obviously, $F = \exp$ is a good solution, which gives
$$w_i^{T} \tilde{w}_k = \log P_{ik} = \log X_{ik} - \log X_i$$
Since $\log X_i$ is not relevant to $k$, we can regard it as a constant bias $b_i$, and add another constant $\tilde{b}_k$ correlated only with $k$ to restore the symmetry:
$$w_i^{T} \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}$$
This is really close. However, the authors thought that not all co-occurrence pairs are equally important: rare pairs are more likely to bring noise and carry less information, yet they occupy 75%~90% of the co-occurrence matrix. A weight $f(X_{ij})$ is therefore brought in, and this is the final cost function:
$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{T} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
where the weighting function $f(x)$ should satisfy three conditions:
- $f(0) = 0$, and $f$ should vanish fast enough that $\lim_{x \to 0} f(x)\log^2 x$ is finite
- $f(x)$ should be non-decreasing, so that rare co-occurrences are not overweighted
- $f(x)$ should be relatively small when $x$ is large, so that frequent co-occurrences are not overweighted
After trying many functions, the authors decided to use:
$$f(x) = \begin{cases} \left( x / x_{\max} \right)^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$
with $x_{\max} = 100$ and $\alpha = 3/4$ in the paper.
This is how the authors constructed the cost function and what assumptions they made during its derivation.
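To make the final pieces concrete, here is a small numpy sketch of the weighting function and the weighted least-squares cost; $x_{\max} = 100$ and $\alpha = 3/4$ follow the paper, while the toy co-occurrence matrix and random vectors are placeholders, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function: grows as (x / x_max)**alpha, capped at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_cost(X, W, W_tilde, b, b_tilde):
    """Weighted least-squares cost over all nonzero co-occurrence pairs."""
    i, j = np.nonzero(X)
    diff = (W[i] * W_tilde[j]).sum(1) + b[i] + b_tilde[j] - np.log(X[i, j])
    return np.sum(f(X[i, j]) * diff ** 2)

# Toy usage with random parameters (illustration only).
V, d = 4, 5
X = rng.integers(0, 50, size=(V, V)).astype(float)
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_tilde = np.zeros(V), np.zeros(V)
print(glove_cost(X, W, W_tilde, b, b_tilde))
```

In practice this cost would be minimized with a gradient-based optimizer over the nonzero entries of $X$, which is what makes the weighting function's zero value at $x = 0$ convenient.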