@kpatrick 2019-08-01T16:38:25.000000Z

How to train custom Word Embeddings

nlp notes

Reading material: How to train custom Word Embeddings using GPU on AWS


1. Build Corpus


2. Preprocess

  1. Handle punctuation and special characters

  2. Count word frequencies

  3. Build dictionaries: word -> idx & idx -> word

  4. Subsampling

    Words that occur often, such as "the", "is", etc., are not very useful for providing context to nearby words, yet if we removed all of them we would also throw away whatever information they do provide. A better approach, proposed by Mikolov et al., is to discard some of these words to remove some of the noise from our data. For each word $w_i$ in the training set, we discard it with probability

    $$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

    where $t$ is a threshold parameter and $f(w_i)$ is the frequency of the word $w_i$.

    During sampling, frequent words are discarded with a certain probability so that the resulting training set is more balanced and is not dominated by words such as "the" and "is"; this is the subsampling (downsampling) step. A sketch of the whole preprocessing stage follows below.
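A minimal Python sketch of the preprocessing steps above (word-frequency counts, the word -> idx / idx -> word dictionaries, and Mikolov-style subsampling). The threshold t = 1e-5 and all function and variable names are illustrative assumptions, not taken from the reference article:

```python
import random
from collections import Counter

def preprocess(tokens, t=1e-5):
    """tokens: the corpus as a flat list of already-cleaned word strings."""
    # 2. Count word frequencies
    counts = Counter(tokens)
    total = len(tokens)

    # 3. Build dictionaries: word -> idx & idx -> word (most frequent word gets idx 0)
    idx_to_word = [w for w, _ in counts.most_common()]
    word_to_idx = {w: i for i, w in enumerate(idx_to_word)}

    # 4. Subsampling: discard word w with probability P(w) = 1 - sqrt(t / f(w))
    freq = {w: c / total for w, c in counts.items()}
    p_drop = {w: 1.0 - (t / freq[w]) ** 0.5 for w in counts}

    kept_ids = [word_to_idx[w] for w in tokens if random.random() > p_drop[w]]
    return kept_ids, word_to_idx, idx_to_word
```

`kept_ids` is the subsampled corpus encoded as indices, ready for building Skip-Gram training pairs; rare words get a negative drop probability and are always kept.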


3. Training Methods

  1. Word2Vec

    Training word embeddings with TensorFlow

    • CBoW
    • Skip-Gram: with negative sampling, only 1 + k logistic-regression classifiers are trained per step, cutting the cost of the full softmax
    • GloVe: Global Vectors for Word Representation

    (figures: GloVe; network structure)

  2. Negative Sampling

    Subsampling and negative sampling in word2vec

    The reference article builds the training set with the Skip-Gram approach. When the dictionary contains many words, every softmax step has to compute over a very large output vector; with negative sampling, each step only trains 1 + k classifiers: 1 positive sample and k negative samples (a sketch combining this with the sampled-softmax loss follows after this list).

    Negative sampling
    Skip-Gram

  3. Loss Function

    Candidate sampling in the deep learning library TensorFlow (TF)
    TensorFlow basics: loss functions explained

    tf.nn.sampled_softmax_loss computes the softmax over a sampled subset of classes, reducing the amount of computation (see the sketch below).
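A minimal TensorFlow sketch of the Skip-Gram training loss using sampled softmax, combining items 2 and 3 above. The sizes (VOCAB_SIZE, EMBED_DIM, NUM_SAMPLED) and all names are assumptions for illustration, not values from the reference article:

```python
import tensorflow as tf

# Hypothetical sizes; the real values depend on the corpus.
VOCAB_SIZE = 50000   # number of words in the dictionary
EMBED_DIM = 300      # embedding dimension
NUM_SAMPLED = 64     # k negative classes sampled per training example

# Center-word embedding matrix and output-layer ("context") weights/biases.
embeddings = tf.Variable(tf.random.uniform([VOCAB_SIZE, EMBED_DIM], -1.0, 1.0))
out_weights = tf.Variable(tf.random.truncated_normal([VOCAB_SIZE, EMBED_DIM], stddev=0.05))
out_biases = tf.Variable(tf.zeros([VOCAB_SIZE]))

def skip_gram_loss(center_ids, context_ids):
    """center_ids: int tensor [batch]; context_ids: int64 tensor [batch, 1]."""
    center_vecs = tf.nn.embedding_lookup(embeddings, center_ids)
    # Sampled softmax: each example uses its 1 true context class plus
    # NUM_SAMPLED sampled negative classes instead of the full VOCAB_SIZE softmax.
    return tf.reduce_mean(
        tf.nn.sampled_softmax_loss(
            weights=out_weights,
            biases=out_biases,
            labels=context_ids,
            inputs=center_vecs,
            num_sampled=NUM_SAMPLED,
            num_classes=VOCAB_SIZE))
```

Sampled softmax stands in here for the "1 + k classifiers" idea; word2vec's original negative sampling corresponds more closely to tf.nn.nce_loss, which takes the same arguments. Either way, the sampled loss is a training-time approximation only, and evaluation should use the full softmax over the whole output layer.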

