@songying
2018-06-22T15:47:55.000000Z
word-embedding
off-the-shelf (OTS): ready-made, already pre-trained
out-of-vocabulary (OOV) tokens: words not seen before (outside the training vocabulary)
We systematically explore several options for these choices, and provide recommendations to researchers working in this area.
Currently, many RC (reading comprehension) models use the following techniques (a sketch follows this list):
- Tokens in the document and question are represented using word vectors obtained from a lookup table (either initialized randomly, or from a pre-trained source such as GloVe (Pennington et al., 2014)).
- A sequence model such as LSTM (Hochreiter and Schmidhuber, 1997), augmented with an attention mechanism (Bahdanau et al., 2014), updates these vectors to produce contextual representations.
- An output layer uses these contextual representations to locate the answer in the document.
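A minimal sketch (in PyTorch) of this three-step pipeline, loosely in the style of the Stanford AR's bilinear attention; the module names, dimensions, and the softmax output layer below are illustrative assumptions, not the paper's implementation:

```python
# Sketch of the three-step RC pipeline described above:
# (1) word-vector lookup, (2) contextual encoding with a GRU + attention,
# (3) an output layer scoring document positions against the question.
import torch
import torch.nn as nn


class MiniAttentiveReader(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        # (1) word-vector lookup table (could be initialized from GloVe)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # (2) sequence encoders for document and question
        self.doc_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.q_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # bilinear attention between the question summary and document states
        self.bilinear = nn.Linear(2 * hidden_dim, 2 * hidden_dim, bias=False)

    def forward(self, doc_ids, q_ids):
        d = self.embed(doc_ids)                              # (B, Ld, E)
        q = self.embed(q_ids)                                # (B, Lq, E)
        d_states, _ = self.doc_rnn(d)                        # (B, Ld, 2H)
        _, q_last = self.q_rnn(q)                            # (2, B, H)
        q_vec = torch.cat([q_last[0], q_last[1]], dim=-1)    # (B, 2H)
        # (3) attention scores over document positions; the answer is located
        # at the positions (or candidate entities) with the highest scores
        scores = torch.bmm(d_states, self.bilinear(q_vec).unsqueeze(-1)).squeeze(-1)
        return torch.softmax(scores, dim=-1)                 # (B, Ld)
```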
This paper explores the impact of these choices on final performance.
In this paper, two models are compared: the Stanford Attentive Reader (AR) (Chen et al., 2016) and the Gated Attention (GA) Reader (Dhingra et al., 2016), using the Who-Did-What dataset (Onishi et al., 2016).
Pre-trained word embeddings
Based on our findings, we recommend the use of certain pre-trained GloVe vectors for initialization. These consistently outperform other off-the-shelf embeddings such as word2vec (Mikolov et al., 2013), as well as those pre-trained on the target corpus itself and, perhaps surprisingly, those trained on a large corpus from the same domain as the target dataset.
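A minimal sketch of initializing such a lookup table from pre-trained GloVe vectors; the file name `glove.6B.100d.txt`, the dimensionality, and the random range for missing words are assumptions for illustration:

```python
# Initialize an embedding matrix from pre-trained GloVe vectors.
import numpy as np

def build_embedding_matrix(vocab, glove_path="glove.6B.100d.txt", dim=100):
    # Each GloVe line is "<word> <v1> ... <vdim>"
    glove = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            glove[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    # Rows for words covered by GloVe come from GloVe; the rest stay random.
    matrix = np.random.uniform(-0.1, 0.1, (len(vocab), dim)).astype(np.float32)
    for word, idx in vocab.items():
        if word in glove:
            matrix[idx] = glove[word]
    return matrix
```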
How should out-of-vocabulary (OOV) tokens be handled? (A sketch of both strategies follows the list below.)
- A common approach (e.g. (Chen et al., 2016; Shen et al., 2016)) is to replace infrequent words during training with a special token UNK, and use this token to model the OOV words at the test phase.
- A superior strategy is to assign each OOV token either a pre-trained, if available, or a random but unique vector at test time.
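A minimal sketch of both OOV strategies; the function names and the `glove` dictionary (word → vector, e.g. loaded as in the sketch above) are illustrative assumptions:

```python
# (a) map rare training words and all unseen test words to a single UNK vector;
# (b) at test time, give each OOV token its own vector -- pre-trained if
#     available, otherwise a random but unique vector.
import numpy as np

def lookup_unk(word, vocab, matrix, unk_idx):
    # Strategy (a): every OOV word shares the UNK embedding
    return matrix[vocab.get(word, unk_idx)]

def lookup_unique(word, vocab, matrix, glove, oov_cache, dim=100):
    # Strategy (b): pre-trained vector if available, else a cached random one
    if word in vocab:
        return matrix[vocab[word]]
    if word in glove:
        return glove[word]
    if word not in oov_cache:
        oov_cache[word] = np.random.uniform(-0.1, 0.1, dim).astype(np.float32)
    return oov_cache[word]
```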
Datasets used in this paper:
- Who-Did-What (WDW) (Onishi et al., 2016)
- the Children’s Book Test (CBT) (Hill et al., 2015)
Models used in this paper:
- Stanford AR
- the high-performing GA Reader.
Stanford AR
GA Reader
The most popular choices are GloVe and word2vec.
One key difference: both GloVe and word2vec ship with pre-trained word vectors, but while the GloVe package provides embeddings of varying sizes (50-300), word2vec only provides embeddings of size 300.
We also trained several other word embeddings ourselves (a training sketch follows the lists below):
In total, for the WDW dataset we used two corpora:
- one large (OTS)
- one small (WDW)
For the CBT dataset, we used:
- one large (BT)
- one small (CBT).
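A sketch of what training embeddings on the target corpus itself could look like, using gensim's word2vec; the corpus file name and all hyperparameters below are assumptions rather than the paper's settings:

```python
# Train word2vec-style vectors on the target corpus (e.g. the WDW or CBT text).
from gensim.models import Word2Vec

def train_corpus_vectors(corpus_path="wdw_train.txt", dim=100):
    # One tokenized sentence per line
    sentences = [line.split() for line in open(corpus_path, encoding="utf-8")]
    # gensim >= 4.0 uses `vector_size` (older versions used `size`)
    model = Word2Vec(sentences, vector_size=dim, window=5, min_count=5, workers=4)
    return model.wv  # keyed vectors: model.wv["word"] -> np.ndarray
```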
- For the WDW dataset: hidden units d = 128, RNN cell: GRU, dropout p = 0.3.
- For the CBT-NE dataset: hidden units d = 128, RNN cell: GRU, dropout p = 0.4.
The Stanford AR has only one layer, while the GA Reader has 3 layers (a configuration sketch follows at the end of these notes).
For all experiments, the word-embedding size was:
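A small configuration sketch collecting these hyperparameters; it flattens both models into a single stacked GRU purely for illustration, and the embedding size is a placeholder since the notes leave that value blank:

```python
import torch.nn as nn

# Values copied from the notes above, except embed_dim (the notes do not
# record the embedding size, so 128 is only a placeholder).
CONFIG = {
    "WDW":    {"hidden_dim": 128, "dropout": 0.3},
    "CBT-NE": {"hidden_dim": 128, "dropout": 0.4},
}
LAYERS = {"Stanford AR": 1, "GA Reader": 3}

def make_encoder(dataset="WDW", model="GA Reader", embed_dim=128):
    cfg = CONFIG[dataset]
    num_layers = LAYERS[model]
    # PyTorch only applies inter-layer dropout when num_layers > 1
    return nn.GRU(embed_dim, cfg["hidden_dim"], num_layers=num_layers,
                  dropout=cfg["dropout"] if num_layers > 1 else 0.0,
                  batch_first=True, bidirectional=True)
```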