@songying 2018-07-19

QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension

squad-model


Abstract

Current QA models rely on RNNs and attention. Because of the sequential nature of RNNs, these models are slow in both training and inference. We propose a new architecture, QANet, which does not need RNNs: its encoder consists of convolution and self-attention.

Introduction

Currently, most successful models use two techniques:
1. a recurrent model to process sequential inputs
2. an attention component to cope with long term interactions
The drawback of these models is that they are slow in both training and inference, especially on long text, which is caused by the sequential nature of RNNs. This makes experiments take too long, so researchers cannot iterate quickly, and it also prevents the models from being used on larger datasets. At the same time, it keeps these models out of real-time applications.

A well-known model of this kind is BiDAF.

In this paper, to make the model fast, we remove RNNs. In the encoders, we use convolutional layers and self-attention as building blocks to encode the query and the context separately. We then learn the interactions between context and question through standard attention. The resulting representation is encoded again with our recurrency-free encoder before finally decoding to the probability of each position being the start or end of the answer span.
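To make the "convolution + self-attention" encoder concrete, below is a minimal PyTorch-style sketch of one such recurrency-free encoder block. The layer counts, kernel size, pre-LayerNorm residual layout, and the omission of positional encodings are my simplifications, not the exact QANet configuration.

```python
import torch
import torch.nn as nn

class ConvSelfAttentionBlock(nn.Module):
    """One recurrency-free encoder block: convolutions, then self-attention, then feed-forward."""
    def __init__(self, d_model: int = 128, n_heads: int = 8,
                 kernel_size: int = 7, n_convs: int = 4):
        super().__init__()
        self.conv_norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_convs)])
        self.convs = nn.ModuleList([
            nn.Sequential(
                # depthwise convolution: one filter per channel, captures local structure
                nn.Conv1d(d_model, d_model, kernel_size,
                          padding=kernel_size // 2, groups=d_model),
                # pointwise convolution mixes channels
                nn.Conv1d(d_model, d_model, kernel_size=1),
                nn.ReLU(),
            ) for _ in range(n_convs)
        ])
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); positional encodings omitted in this sketch
        for norm, conv in zip(self.conv_norms, self.convs):
            y = norm(x).transpose(1, 2)        # Conv1d expects (batch, d_model, seq_len)
            x = x + conv(y).transpose(1, 2)    # residual connection around each conv
        y = self.attn_norm(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]  # global self-attention
        return x + self.ffn(self.ffn_norm(x))  # position-wise feed-forward
```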

The main motivations behind our model design are as follows:
1. convolution captures the local structure of the text, while the self-attention learns the global interaction between each pair of words.
2. The additional context-query attention is a standard module to construct the query-aware context vector for each position in the context paragraph, which is used in the subsequent modeling layers.
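Point 2 above refers to the context-query attention popularized by models such as BiDAF. Below is a hedged sketch of a trilinear-similarity version; the parameter names, shapes, and output concatenation are my assumptions rather than the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextQueryAttention(nn.Module):
    """Builds query-aware context vectors from context C and query Q representations."""
    def __init__(self, d_model: int):
        super().__init__()
        # trilinear similarity: S_ij = w · [c_i ; q_j ; c_i * q_j]
        self.w_c = nn.Linear(d_model, 1, bias=False)
        self.w_q = nn.Linear(d_model, 1, bias=False)
        self.w_cq = nn.Parameter(torch.empty(1, 1, d_model))
        nn.init.xavier_uniform_(self.w_cq)

    def forward(self, C: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
        # C: (batch, n, d) context representations; Q: (batch, m, d) query representations
        S = (self.w_c(C)                                        # (batch, n, 1)
             + self.w_q(Q).transpose(1, 2)                      # (batch, 1, m)
             + torch.bmm(C * self.w_cq, Q.transpose(1, 2)))     # (batch, n, m)
        A = torch.bmm(F.softmax(S, dim=2), Q)                   # context-to-query attention
        B = torch.bmm(torch.bmm(F.softmax(S, dim=2),
                                F.softmax(S, dim=1).transpose(1, 2)), C)  # query-to-context
        # query-aware context vectors, fed into the subsequent modeling layers
        return torch.cat([C, A, C * A, C * B], dim=2)           # (batch, n, 4d)
```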

The contributions of this paper are as follows:

  1. We propose an efficient reading comprehension model built exclusively upon convolutions and self-attention. To the best of our knowledge, we are the first to do so. This combination maintains good accuracy, while achieving up to 13x speedup in training and 9x per training iteration, compared to the RNN counterparts. The speedup gain makes our model the most promising candidate for scaling up to larger datasets.
  2. To improve our result on SQuAD, we propose a novel data augmentation technique to enrich the training data by paraphrasing. It allows the model to achieve accuracy that is better than the state of the art.
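Contribution 2 refers to paraphrasing via round-trip ("back") translation through a pivot language. The sketch below only illustrates the idea: the translate interface is hypothetical, and the answer re-alignment heuristic is a simplification of the paper's procedure, not its exact recipe.

```python
from typing import Callable, Optional

def paraphrase_example(paragraph: str,
                       answer: str,
                       translate: Callable[[str, str, str], str],
                       pivot: str = "fr") -> Optional[dict]:
    """Augment one (paragraph, answer) pair by back-translation through a pivot language."""
    # round-trip translation en -> pivot -> en yields a paraphrased paragraph
    pivot_text = translate(paragraph, "en", pivot)      # hypothetical NMT interface
    paraphrase = translate(pivot_text, pivot, "en")

    # crude answer re-alignment (assumption): keep the augmented example only if the
    # original answer string still appears verbatim in the paraphrased paragraph
    if answer in paraphrase:
        return {"paragraph": paraphrase,
                "answer": answer,
                "answer_start": paraphrase.index(answer)}
    return None  # discard examples whose answer span was lost by paraphrasing
```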

2. The Model

2.1 Problem Formulation

Given a context paragraph with n words C = {c_1, c_2, \cdots, c_n} and a query with m words Q = {q_1, q_2, \cdots, q_m}, the task is to output a span S = {c_i, c_{i+1}, \cdots, c_{i+j}} from the original paragraph. In the following, x is used to denote both an original word and its embedded vector.

2.2 Model Overview

The high-level structure of our model is similar to existing models and consists of: an embedding layer, an embedding encoder layer, a context-query attention layer, a model encoder layer, and an output layer.
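For orientation, here is a hedged sketch of how these five layers could be wired together, reusing the ConvSelfAttentionBlock and ContextQueryAttention sketches from earlier in this note. The embedding details (word + character embeddings, highway network), the stacking and sharing of model encoders, and the output heads are simplified assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# NOTE: assumes ConvSelfAttentionBlock and ContextQueryAttention from the sketches above.

class QANetSketch(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)        # 1. embedding layer (simplified)
        self.embed_encoder = ConvSelfAttentionBlock(d_model)  # 2. embedding encoder layer
        self.cq_attention = ContextQueryAttention(d_model)    # 3. context-query attention layer
        self.resize = nn.Linear(4 * d_model, d_model)
        self.model_encoder = ConvSelfAttentionBlock(d_model)  # 4. model encoder layer (stacked in the paper)
        self.start_head = nn.Linear(d_model, 1)                # 5. output layer
        self.end_head = nn.Linear(d_model, 1)

    def forward(self, context_ids: torch.Tensor, query_ids: torch.Tensor):
        c = self.embed_encoder(self.embed(context_ids))  # (batch, n, d)
        q = self.embed_encoder(self.embed(query_ids))    # (batch, m, d), encoder weights shared
        x = self.resize(self.cq_attention(c, q))          # query-aware context, (batch, n, d)
        m = self.model_encoder(x)
        p_start = F.softmax(self.start_head(m).squeeze(-1), dim=-1)  # start-position distribution
        p_end = F.softmax(self.end_head(m).squeeze(-1), dim=-1)      # end-position distribution
        return p_start, p_end
```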

The major difference between our model and other models is:
1. For both the embedding and modeling encoders, we only use convolutional and self-attention mechanisms, discarding the RNNs used by most existing reading comprehension models. Our model is therefore faster, as it can process the input tokens in parallel.

Our model consists of the following five layers:
