@kpatrick
2019-08-01 16:38
nlp
notes
- Punctuation and special-character handling
- Word-frequency counting
- Build the dictionaries: word -> idx & idx -> word (a sketch of these steps follows)
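A minimal preprocessing sketch covering the three steps above; the function name `preprocess`, the regex, and the `min_count` rare-word cutoff are illustrative choices, not taken from the original post:

```python
import re
from collections import Counter

def preprocess(text, min_count=5):
    # Punctuation / special characters: replace with spaces, lowercase.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    words = text.split()

    # Word-frequency statistics; optionally drop very rare words.
    counts = Counter(words)
    words = [w for w in words if counts[w] >= min_count]

    # Dictionaries: word -> idx (most frequent first) and idx -> word.
    vocab = sorted(set(words), key=counts.get, reverse=True)
    word_to_idx = {w: i for i, w in enumerate(vocab)}
    idx_to_word = {i: w for w, i in word_to_idx.items()}
    return words, word_to_idx, idx_to_word
```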
Subsampling
Words that occur often, such as "the" and "is", are not very useful for providing context to nearby words. If we remove all of them, we are effectively removing any information they provide. A better approach, as proposed by Mikolov et al., is to remove only some of these words, to remove some of the noise from our data. For each word $w_i$ in the training set, we discard it with probability:

$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

where $t$ is a threshold parameter and $f(w_i)$ is the frequency of the word $w_i$.
During sampling, some of the frequent words are discarded with this probability, so that the resulting training set is more evenly distributed and is not dominated by words such as "the".
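A sketch of this subsampling step, assuming the corpus has already been converted to word indices (`int_words`); `t = 1e-5` is the threshold value commonly used with this formula:

```python
import random
from collections import Counter

def subsample(int_words, t=1e-5):
    total = len(int_words)
    counts = Counter(int_words)

    # P(w_i) = 1 - sqrt(t / f(w_i)), where f(w_i) = count(w_i) / total.
    p_drop = {w: 1 - (t * total / c) ** 0.5 for w, c in counts.items()}

    # Keep each occurrence of w with probability 1 - P(w); frequent words
    # like "the" are dropped often, rare words almost never.
    return [w for w in int_words if random.random() > p_drop[w]]
```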
Word2Vec
Negative Sampling
The reference article constructs the training set in the Skip-Gram way. When the vocabulary is large, every softmax has to compute a very large vector; with negative sampling, each step trains only 1 + k classifiers: 1 for the positive sample and k for the negative samples.
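A sketch of how Skip-Gram context targets can be drawn for one center word; the helper name `get_targets` and the default `window_size` are assumptions, and the random window shrinkage follows the original word2vec code, so that nearer context words are sampled more often:

```python
import random

def get_targets(words, idx, window_size=5):
    # Pick an effective radius R in [1, window_size] at random.
    r = random.randint(1, window_size)
    start = max(0, idx - r)
    # Context words on both sides of the center word words[idx];
    # each (center, context) pair becomes one training example.
    return words[start:idx] + words[idx + 1:idx + r + 1]
```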
Loss Function
tf.nn.sampled_softmax_loss computes the softmax over a sampled subset of the classes, reducing the amount of computation.
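A TF1-style usage sketch; the sizes `n_vocab`, `n_embed`, `n_sampled` and the variable names are placeholders, not from the post (in TF2 the same calls live under `tf.compat.v1`):

```python
import tensorflow as tf

n_vocab, n_embed, n_sampled = 50000, 300, 100  # assumed sizes

inputs = tf.placeholder(tf.int32, [None], name="inputs")
labels = tf.placeholder(tf.int64, [None, 1], name="labels")

# Input embeddings looked up for the center words.
embedding = tf.Variable(tf.random_uniform((n_vocab, n_embed), -1.0, 1.0))
embed = tf.nn.embedding_lookup(embedding, inputs)

# Output-layer weights and biases of the softmax classifier.
softmax_w = tf.Variable(tf.truncated_normal((n_vocab, n_embed), stddev=0.1))
softmax_b = tf.Variable(tf.zeros(n_vocab))

# The softmax is evaluated over n_sampled sampled classes plus the
# true class, instead of over all n_vocab classes.
loss = tf.nn.sampled_softmax_loss(
    weights=softmax_w, biases=softmax_b,
    labels=labels, inputs=embed,
    num_sampled=n_sampled, num_classes=n_vocab)
cost = tf.reduce_mean(loss)
optimizer = tf.train.AdamOptimizer().minimize(cost)
```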