@songying 2018-10-18T12:35:43.000000Z

Effective Approaches to Attention-based Neural Machine Translation

Attention

Abstract

This paper introduces two effective attention mechanisms: global and local.

Introduction

This paper introduces two new attention-based models:

- a global approach, in which all source words are attended to
- a local one, in which only a subset of source words is considered at a time

Neural Machine Translation

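NMT translates by directly modeling the conditional probability $p(y|x)$, where the source sentence is $x = \{x_1, \cdots, x_n\}$ and the target sentence is $y = \{y_1, \cdots, y_m\}$. A basic NMT system has two components: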
- an encoder which computes a representation s for each source sentence
- a decoder which generates one target word at a time

$\log p(y|x) = \sum_{j=1}^m \log p(y_j|y_{<j}, s)$
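As a small worked illustration of this decomposition, the sketch below sums per-step log-probabilities. The per-step probabilities are made-up numbers standing in for the decoder outputs $p(y_j|y_{<j}, s)$, and the function name `sentence_log_prob` is hypothetical.

```python
import math

def sentence_log_prob(step_probs):
    """Sum log p(y_j | y_<j, s) over the m target positions."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical decoder probabilities for a 3-word target sentence.
step_probs = [0.5, 0.25, 0.8]
log_p = sentence_log_prob(step_probs)

# Exponentiating recovers the product of the per-step probabilities.
print(log_p, math.exp(log_p))
```

Working in log space turns the product of probabilities into a sum, which avoids numerical underflow on long target sentences.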

The paper uses a stacking LSTM architecture.

Attention-based models

Global Attention

Global attention: the context vector $c_t$ takes all hidden states of the encoder into account.

$a_t$ : the alignment vector; its size equals the number of time steps on the source side.

$a_t(s) = \mathrm{align}(h_t, \overline{h_s}) = \frac{\exp(\mathrm{score}(h_t, \overline{h_s}))}{\sum_{s'} \exp(\mathrm{score}(h_t, \overline{h_{s'}}))}$
$h_t$ : the current target hidden state, i.e. the decoder hidden state at time step $t$.
$\overline{h_s}$ : a source hidden state, i.e. an encoder hidden state.
score is a function that measures how well $h_t$ and $\overline{h_s}$ match.

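Given $a_t$, the context vector $c_t$ is computed as the weighted average of the source hidden states: $c_t = \sum_s a_t(s) \overline{h_s}$.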
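The global attention step can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the notes above leave score unspecified, so the sketch assumes a dot-product score, and the function name, shapes and random inputs are hypothetical.

```python
import numpy as np

def global_attention(h_t, source_states):
    """h_t: (d,) target hidden state; source_states: (S, d) encoder states."""
    # Assumption: dot-product score, score(h_t, h_s) = h_t . h_s
    scores = source_states @ h_t                  # shape (S,)
    # Softmax over all source positions (shifted by the max for stability).
    exp_scores = np.exp(scores - scores.max())
    a_t = exp_scores / exp_scores.sum()           # alignment vector, shape (S,)
    # Context vector: weighted average of the source hidden states.
    c_t = a_t @ source_states                     # shape (d,)
    return a_t, c_t

rng = np.random.default_rng(0)
source_states = rng.normal(size=(5, 4))   # S = 5 source positions, d = 4
h_t = rng.normal(size=4)
a_t, c_t = global_attention(h_t, source_states)
print(a_t.shape, c_t.shape)   # (5,) (4,)
```

The alignment vector has one entry per source position and sums to 1, so $c_t$ is a convex combination of the encoder states.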

Local Attention

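Global attention has an obvious drawback: every encoder hidden state takes part in the computation at every decoding step. This is expensive, especially when the source is long (a paragraph or a whole document). Local attention was proposed to improve efficiency.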

Local attention sits between soft attention and hard attention; it is a trade-off between the two.

Idea: local attention selectively focuses on a small window of context and is differentiable.

The model first generates an aligned position $p_t$ for each target word at time $t$.
The context vector $c_t$ is then derived as a weighted average over the set of source hidden states within the window $[p_t - D, p_t + D]$; $D$ is chosen empirically.
The alignment vector $a_t$ now has a fixed length: $a_t \in \mathbb{R}^{2D+1}$.

The paper considers two variants of the model, which differ in how $p_t$ is obtained:

Monotonic alignment (local-m): simply set $p_t = t$.
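A minimal sketch of the monotonic variant is given below. It is an illustration under stated assumptions rather than the paper's implementation: positions are 0-based, the score is assumed to be a dot product, the window is clipped at the sentence boundaries, and `t` is assumed to be a valid source index. All names are hypothetical.

```python
import numpy as np

def local_m_attention(h_t, source_states, t, D):
    """Attend only to source positions in [p_t - D, p_t + D] with p_t = t."""
    S = source_states.shape[0]
    p_t = t                                    # monotonic alignment: p_t = t
    # Assumption: clip the window to valid source positions 0 .. S-1.
    lo, hi = max(0, p_t - D), min(S - 1, p_t + D)
    window = source_states[lo:hi + 1]          # at most 2D + 1 rows
    scores = window @ h_t                      # assumed dot-product score
    weights = np.exp(scores - scores.max())
    a_t = weights / weights.sum()              # softmax over the window only
    c_t = a_t @ window                         # context vector, shape (d,)
    return (lo, hi), a_t, c_t

rng = np.random.default_rng(0)
source_states = rng.normal(size=(20, 4))   # S = 20 source positions, d = 4
h_t = rng.normal(size=4)
bounds, a_t, c_t = local_m_attention(h_t, source_states, t=7, D=2)
print(bounds, a_t.shape)   # (5, 9) (5,)
```

With `D=2` only five of the twenty source states are scored, which is where the saving over global attention comes from.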

Predictive alignment (local-p):

$p_t = S \cdot \mathrm{sigmoid}(v_p^\top \tanh(W_p h_t))$
$W_p$, $v_p$ : model parameters learned to predict the position.
$S$ : the source sentence length.
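The position prediction can be sketched as below. The function name, the dimensions and the random parameters are hypothetical; in a real model $W_p$ and $v_p$ would be learned.

```python
import numpy as np

def predict_position(h_t, W_p, v_p, S):
    """p_t = S * sigmoid(v_p^T tanh(W_p h_t)), a real number between 0 and S."""
    z = v_p @ np.tanh(W_p @ h_t)       # scalar pre-activation
    return S / (1.0 + np.exp(-z))      # S * sigmoid(z)

rng = np.random.default_rng(0)
d, k, S = 4, 3, 12                     # hypothetical hidden size, projection size, source length
W_p = rng.normal(size=(k, d))
v_p = rng.normal(size=k)
h_t = rng.normal(size=d)
p_t = predict_position(h_t, W_p, v_p, S)
print(p_t)
```

Because the sigmoid output lies between 0 and 1, $p_t$ always falls inside the source sentence, and since it is a smooth function of $h_t$ the position can be trained with backpropagation.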

$a_t(s) = \mathrm{align}(h_t, \overline{h_s}) \exp\left(-\frac{(s-p_t)^2}{2\sigma^2}\right)$

Here the align function is the same as in global attention. The standard deviation is set empirically to

$\sigma = \frac{D}{2}$
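The Gaussian-weighted alignment can be sketched as follows. This is an illustration under assumptions: the score is a dot product, the softmax inside align is taken over the window, positions are 0-based, and the window is clipped at the sentence boundaries. The names and inputs are hypothetical.

```python
import numpy as np

def local_p_alignment(h_t, source_states, p_t, D):
    """Alignment weights and context vector for a real-valued position p_t."""
    S = source_states.shape[0]
    sigma = D / 2.0                                   # sigma = D / 2
    # Integer positions s inside [p_t - D, p_t + D], clipped to 0 .. S-1 (assumption).
    lo = max(0, int(np.ceil(p_t - D)))
    hi = min(S - 1, int(np.floor(p_t + D)))
    positions = np.arange(lo, hi + 1)
    window = source_states[positions]
    # align(h_t, h_s): softmax of assumed dot-product scores over the window.
    scores = window @ h_t
    align = np.exp(scores - scores.max())
    align /= align.sum()
    # Gaussian centered at p_t favors positions near the predicted alignment point.
    gauss = np.exp(-((positions - p_t) ** 2) / (2.0 * sigma ** 2))
    a_t = align * gauss
    c_t = a_t @ window
    return positions, a_t, c_t

rng = np.random.default_rng(0)
source_states = rng.normal(size=(20, 4))   # S = 20 source positions, d = 4
h_t = rng.normal(size=4)
positions, a_t, c_t = local_p_alignment(h_t, source_states, p_t=7.3, D=2)
print(positions, a_t)
```

Following the formula as written, the weights are not renormalized after the Gaussian factor is applied, so they generally sum to less than 1.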

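Finally, the context vector $c_t$ and the target hidden state $h_t$ are combined into an attentional hidden state $\tilde{h_t}$, which is used to predict the next word:

$\tilde{h_t} = \tanh(W_c[c_t;h_t]) \\ p(y_t|y_{<t}, x) = \mathrm{softmax}(W_s \tilde{h_t})$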
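These two equations can be sketched in NumPy as below. The function name, the hidden size, the vocabulary size and the random weights are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def predict_distribution(c_t, h_t, W_c, W_s):
    """Return the attentional hidden state and the next-word distribution."""
    # Attentional hidden state: h~_t = tanh(W_c [c_t; h_t])
    h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))
    # Output distribution: softmax(W_s h~_t) over the target vocabulary.
    logits = W_s @ h_tilde
    exp_logits = np.exp(logits - logits.max())
    return h_tilde, exp_logits / exp_logits.sum()

rng = np.random.default_rng(0)
d, vocab = 4, 6                         # hypothetical hidden size and vocabulary size
c_t, h_t = rng.normal(size=d), rng.normal(size=d)
W_c = rng.normal(size=(d, 2 * d))       # maps the concatenation [c_t; h_t] back to size d
W_s = rng.normal(size=(vocab, d))
h_tilde, probs = predict_distribution(c_t, h_t, W_c, W_s)
print(h_tilde.shape, probs.shape)   # (4,) (6,)
```

The same output layer applies to both global and local attention; only the way $c_t$ is computed differs.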

Comparison

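In summary, global attention and local attention each have strengths and weaknesses. In practice global attention is used more widely, because local attention has to predict an aligned position $p_t$, which raises two problems:
1. When the source sentence is not very long, the computation is not noticeably smaller than with global attention.
2. The prediction of $p_t$ is not very accurate, which directly affects the accuracy of the local attention computed from it.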
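The first point can be made concrete by counting how many source positions are scored per decoding step: $S$ for global attention and at most $2D+1$ for local attention. The sketch below uses illustrative sentence lengths and an arbitrary $D = 10$; it ignores the extra cost of predicting $p_t$.

```python
def scored_positions(S, D):
    """Return (global, local) counts of source positions scored per decoding step."""
    return S, min(S, 2 * D + 1)

for S in (15, 50, 500):
    print(S, scored_positions(S, D=10))
# 15 (15, 15)
# 50 (50, 21)
# 500 (500, 21)
```

For a 15-word source both variants score the same number of positions, so the saving only shows up once $S$ is well above $2D+1$.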