@songying 2018-08-01T02:15:43.000000Z

Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification

Text classification

Abstract

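Task: relation classification
This paper proposes Attention-Based Bidirectional Long Short-Term Memory Networks (Att-BLSTM) to capture the most important information in a sentence. The model achieves strong results on the SemEval-2010 relation classification task.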

Introduction

The paper's contribution is a BLSTM with an attention mechanism, which can automatically focus on the words that have a decisive effect on classification, to capture the most important semantic information in a sentence without using extra knowledge or NLP systems.

Section 2 reviews related work on relation classification.
Section 3 presents the Att-BLSTM model in detail.
Section 4 describes the experimental setup and the experimental results.
Section 5 gives the conclusion.

Model

As shown in the figure above, the model consists of the following five components:
1. Input layer: input sentence to this model;
2. Embedding layer: map each word into a low dimension vector;
3. LSTM layer: utilize BLSTM to get high level features from step (2);
4. Attention layer: produce a weight vector, and merge word-level features from each time step into a sentence-level feature vector by multiplying by the weight vector;
5. Output layer: the sentence-level feature vector is finally used for relation classification.

Word Embedding

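Given a sentence $S = \{x_1, x_2, \cdots, x_T\}$ of $T$ words, each word $x_i$ is converted into a word vector $e_i$.

Each word is looked up in the embedding matrix $W^{wrd} \in R^{d^w \times |V|}$, where $V$ is a fixed vocabulary, $d^w$ is the dimension of the word vectors, and the matrix $W^{wrd}$ is a parameter to be learned.

A word is converted into its word vector with a matrix-vector product, where $v^i$ is a one-hot vector of size $|V|$: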

$e_i = W^{wrd} v^i$

This gives the word-vector representation of the sentence: $emb_s = \{e_1, e_2, \cdots, e_T\}$
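The lookup above can be sketched in NumPy. This is only an illustration: the vocabulary size, embedding dimension, word indices, and the names `W_wrd` and `embed` are assumptions made for the example, and the matrix is random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_w = 6, 4                       # hypothetical |V| and d^w
W_wrd = rng.normal(size=(d_w, vocab_size))   # embedding matrix, shape (d^w, |V|)

def embed(word_ids):
    """Return emb_s as a (d^w, T) matrix with one column e_i per word."""
    one_hot = np.eye(vocab_size)[:, word_ids]   # (|V|, T); column i is v^i
    return W_wrd @ one_hot                      # e_i = W^{wrd} v^i for every i

sentence = [2, 0, 5]                         # hypothetical word indices
emb_s = embed(sentence)
```

Multiplying by a one-hot vector selects one column of `W_wrd`, so in practice the same result is usually obtained by indexing, `W_wrd[:, sentence]`.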

Bidirectional Network

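The LSTM variant used in this paper is the one from "Speech recognition with deep recurrent neural networks".

For details, see LSTM.

This paper uses a bidirectional LSTM and combines the forward and backward outputs for the $i$-th word by element-wise sum ($\oplus$):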

$h_i = [ \overrightarrow{h_i} \oplus \overleftarrow{h_i} ]$
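A minimal sketch of the bidirectional combination, reading $\oplus$ as an element-wise sum. To keep it short, a plain tanh recurrent cell stands in for the LSTM cell, so this is not the paper's recurrence; `run_rnn`, `bidirectional`, and all sizes are hypothetical names and values chosen for the example.

```python
import numpy as np

def run_rnn(X, W_x, W_h):
    """Run a plain tanh recurrent cell (a stand-in for an LSTM) over the columns of X."""
    d_h, T = W_h.shape[0], X.shape[1]
    h = np.zeros(d_h)
    out = np.zeros((d_h, T))
    for t in range(T):
        h = np.tanh(W_x @ X[:, t] + W_h @ h)
        out[:, t] = h
    return out

def bidirectional(X, fwd, bwd):
    """Combine forward and backward passes by element-wise sum."""
    h_fwd = run_rnn(X, *fwd)
    # Process the sentence right to left, then flip back so columns align by word.
    h_bwd = run_rnn(X[:, ::-1], *bwd)[:, ::-1]
    return h_fwd + h_bwd

rng = np.random.default_rng(0)
d_in, d_h, T = 4, 4, 5                       # hypothetical sizes
X = rng.normal(size=(d_in, T))
fwd = (rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)))
bwd = (rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)))
H = bidirectional(X, fwd, bwd)               # shape (d_h, T)
```

Summing, rather than concatenating, keeps each $h_i$ at the size of a single direction's hidden state.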

Attention

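This section describes the attention mechanism proposed in the paper.

Let $H = [h_1, h_2, \cdots, h_T]$ denote the matrix of LSTM layer outputs, where $T$ is the sentence length. The sentence representation $\gamma$ is then a weighted sum of these output vectors: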

$\begin{aligned} M &= \tanh(H) \\ \alpha &= \mathrm{softmax}(w^T M) \\ \gamma &= H \alpha^T \end{aligned}$

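Here $H \in R^{d^w \times T}$, $d^w$ is the dimension of the word vectors, $w$ is a parameter vector to be trained, and $w$, $\alpha$, $\gamma$ have dimensions $d^w$, $T$, $d^w$ respectively.

The final sentence representation used for classification is: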

$h^* = \tanh(\gamma)$
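The attention step can be written out in a few lines of NumPy. This is a sketch under assumptions: `H` and `w` are random placeholders rather than trained values, the sizes are hypothetical, and `alpha` is stored as a 1-D array, so the weighted sum is `H @ alpha` with no explicit transpose.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def attention(H, w):
    """H: (d_w, T) BLSTM outputs; w: (d_w,) trained attention vector."""
    M = np.tanh(H)               # (d_w, T)
    alpha = softmax(w @ M)       # (T,), one weight per time step
    gamma = H @ alpha            # (d_w,), weighted sum of the columns of H
    h_star = np.tanh(gamma)      # final sentence representation
    return h_star, alpha

rng = np.random.default_rng(0)
d_w, T = 4, 5                    # hypothetical sizes
H = rng.normal(size=(d_w, T))
w = rng.normal(size=d_w)
h_star, alpha = attention(H, w)
```

When `w` is all zeros, every time step gets weight `1/T` and the weighted sum reduces to the mean of the columns of `H`; training `w` is what lets the model weight some words more than others.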

Classifying

This layer uses a softmax classifier to obtain the final prediction:

$\begin{aligned} \hat{p}(y|S) &= \mathrm{softmax}(W^{(S)} h^* + b^{(S)}) \\ \hat{y} &= \arg\max_y \hat{p}(y|S) \end{aligned}$
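As a rough illustration of the output layer, the sketch below applies the softmax and argmax to a given sentence vector. The weights, bias, input vector, and the function name `classify` are made-up values for the example, not parameters from the paper.

```python
import numpy as np

def classify(h_star, W_s, b_s):
    """Return class probabilities p(y|S) and the predicted class index."""
    logits = W_s @ h_star + b_s
    e = np.exp(logits - logits.max())    # subtract the max for numerical stability
    p = e / e.sum()
    return p, int(np.argmax(p))

# Hypothetical example with 3 classes and a 3-dimensional sentence vector.
W_s = np.eye(3)
b_s = np.zeros(3)
h_star = np.array([0.1, 0.9, -0.3])
p, y_hat = classify(h_star, W_s, b_s)
```

Because softmax is monotonic, the predicted class is the one with the largest logit; the probabilities themselves are needed for the training loss.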

The loss function is the negative log-likelihood:

$J(\theta) = - \frac{1}{m} \sum_{i=1}^m t_i \log(y_i) + \lambda \|\theta\|^2_F$

Here $t \in R^m$ is the one-hot ground truth, $y \in R^m$ is the probability estimated for each class by the softmax ($m$ is the number of target classes), and $\lambda$ is the L2 regularization hyperparameter.

The paper combines L2 regularization with dropout to reduce overfitting.
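The loss can be evaluated directly from the formula as written above. The sketch below is for a single example; the target, the predicted probabilities, the parameter list, and the value of `lam` are illustrative assumptions.

```python
import numpy as np

def nll_loss(t, y, params, lam):
    """Negative log-likelihood plus an L2 penalty, following the formula as written.

    t: one-hot target of length m; y: predicted probabilities of length m;
    params: list of parameter arrays; lam: L2 regularization strength.
    """
    m = t.shape[0]
    data_term = -np.sum(t * np.log(y)) / m
    reg_term = lam * sum(np.sum(p ** 2) for p in params)
    return data_term + reg_term

# Hypothetical values for illustration only.
t = np.array([0.0, 1.0, 0.0])
y = np.array([0.2, 0.5, 0.3])
params = [np.array([[1.0, 2.0], [3.0, 4.0]])]
loss = nll_loss(t, y, params, lam=0.01)
```

Since `t` is one-hot, only the log-probability of the correct class contributes to the data term.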

Regularization

We employ dropout on the embedding layer, LSTM layer and the penultimate layer.
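For readers unfamiliar with dropout, here is a minimal sketch. It uses the common "inverted dropout" formulation, which rescales the kept units at training time; the source does not say which variant the paper uses, and the rate shown is a placeholder rather than the paper's setting.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return x
    keep = 1.0 - rate
    mask = rng.random(x.shape) < keep    # True for the units that are kept
    return x * mask / keep

rng = np.random.default_rng(0)
x = np.ones((4, 5))                      # stand-in for a layer's activations
dropped = dropout(x, rate=0.5, rng=rng)  # hypothetical rate
unchanged = dropout(x, rate=0.5, rng=rng, training=False)
```

At test time the function returns its input unchanged, which is why the kept units are scaled up by `1 / keep` during training.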

We additionally constrain the L2-norms of the weight vectors by rescaling $w$ to have $∥w∥ = s$ whenever $∥w∥ > s$ after a gradient descent step, as shown in equation 15 of the paper.
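The rescaling rule just described fits in a few lines. In this sketch the function name `constrain_norm` and the example vectors are hypothetical; the function would be applied to each weight vector after a gradient step.

```python
import numpy as np

def constrain_norm(w, s):
    """Rescale w so that its L2 norm equals s whenever the norm exceeds s."""
    norm = np.linalg.norm(w)
    if norm > s:
        return w * (s / norm)
    return w

w_big = constrain_norm(np.array([3.0, 4.0]), s=1.0)     # norm 5, rescaled to norm 1
w_small = constrain_norm(np.array([0.3, 0.4]), s=1.0)   # norm 0.5, left unchanged
```

The direction of the vector is preserved; only its length is capped at `s`.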

Experiments

Dataset: SemEval-2010 Task 8

If you set up the experiments yourself, read this section of the paper carefully.