@songying 2018-06-16

TriviaQA

Dataset


TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Abstract

The University of Washington released the TriviaQA dataset in 2017. It contains over 650K question-answer-evidence triples and has the following characteristics:
1. The questions are relatively complex and varied.
2. There is considerable syntactic and lexical variability between the questions and the answers/evidence.
3. Answering often requires reasoning jointly over multiple sentences.
Compared with SQuAD, TriviaQA focuses much more on reasoning, and experiments show that models that perform well on SQuAD do not achieve satisfactory results on TriviaQA. Personally, I consider it one of the most challenging reading comprehension datasets.

The paper also provides two baseline algorithms: a feature-based classifier and a state-of-the-art neural network.

Introduction

The difficulties of reading comprehension (RC) lie in:

  1. the questions can be complex, e.g. have highly compositional semantics
  2. finding the correct answer can require complex reasoning, e.g. combining facts from multiple sentences or background knowledge
  3. individual facts can be difficult to recover from text

An example question-answer pair with excerpts from its evidence documents is shown in Figure 1 of the paper.

Contributions

  1. We collect over 650K question-answer-evidence triples, with questions originating from trivia enthusiasts independent of the evidence documents. A high percentage of the questions are challenging, with substantial syntactic and lexical variability and often requiring multi-sentence reasoning. The dataset and code are available at http://nlp.cs.washington.edu/triviaqa/, offering resources for training new reading-comprehension models.
  2. We present a manual analysis quantifying the quality of the dataset and the challenges involved in solving the task.
  3. We present experiments with two baseline methods, demonstrating that the TriviaQA tasks are not easily solved and are worthy of future study.
  4. In addition to the automatically gathered large-scale (but noisy) dataset, we present a clean, human-annotated subset of 1975 question-document-answer triples whose documents are certified to contain all facts required to answer the questions.

2. Overview

Problem Formulation

  • q: the question
  • a: the answer
  • D: the evidence documents
    Here we assume that a appears as a substring of D, and that D is a set of documents rather than a single document (a small code sketch of this representation follows below).
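To make the formulation concrete, here is a minimal sketch of how one example might be represented. The field names (`question`, `answer`, `evidence_docs`) and the sample question are my own illustration, not the official TriviaQA schema:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TriviaQAExample:
    """One question-answer-evidence triple (illustrative field names, not the official schema)."""
    question: str             # q: the trivia question
    answer: str               # a: the answer string
    evidence_docs: List[str]  # D: a set of documents, not a single document

    def answer_in_evidence(self) -> bool:
        # Distant-supervision assumption: a appears as a substring of D.
        return any(self.answer.lower() in doc.lower() for doc in self.evidence_docs)


example = TriviaQAExample(
    question="Which US state is nicknamed the Evergreen State?",
    answer="Washington",
    evidence_docs=["Washington, nicknamed the Evergreen State, joined the Union in 1889."],
)
assert example.answer_in_evidence()
```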

Data and Distant Supervision

The evidence documents are gathered from Wikipedia articles and Web search results. Because they are collected automatically, there is no guarantee that a given document actually supports the answer, so the answer string provides only distant supervision.
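Since the gold answer is known only as a string rather than as a position in the text, one common way to turn it into training signal (a sketch of my own, not the authors' code) is to mark every occurrence of the answer string in each evidence document as a candidate answer span:

```python
import re
from typing import List, Tuple


def distant_answer_spans(document: str, answer: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of every case-insensitive
    occurrence of the answer string in the document.

    These matches are noisy labels: finding the string does not guarantee
    that the surrounding context supports the answer, which is why this is
    distant rather than full supervision.
    """
    pattern = re.compile(re.escape(answer), re.IGNORECASE)
    return [(m.start(), m.end()) for m in pattern.finditer(document)]


doc = "Harper Lee wrote To Kill a Mockingbird. Lee was born in Alabama."
print(distant_answer_spans(doc, "Harper Lee"))  # -> [(0, 10)]
```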

Dataset Collection

  1. First we gathered question-answer pairs from 14 trivia and quiz-league websites. We removed questions with fewer than four tokens, since these were generally either too simple or too vague.
  2. We then collected textual evidence to answer questions using two sources: documents from Web search results and Wikipedia articles for entities in the question.
  3. Finally, to support learning from distant supervision, we further filtered the evidence documents to exclude those missing the correct answer string and formed evidence document sets as described in Section 2. This left us with 95K question-answer pairs organized into
    (1) 650K training examples for the Web search results, each containing a single (combined) evidence document, and
    (2) 78K examples for the Wikipedia reading comprehension domain, containing on average 1.8 evidence documents per example.
    (A rough sketch of this filtering in code follows below.)
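The sketch below paraphrases the two filters described above: drop questions with fewer than four tokens, then keep only evidence documents that actually contain the answer string. The input format and helper name are hypothetical, for illustration only:

```python
from typing import Dict, List


def filter_examples(raw_examples: List[Dict]) -> List[Dict]:
    """Apply the two filters described in the collection steps above.

    Each raw example is assumed (for illustration only) to look like:
        {"question": str, "answer": str, "evidence_docs": [str, ...]}
    """
    kept = []
    for ex in raw_examples:
        # Step 1: drop questions with fewer than four tokens.
        if len(ex["question"].split()) < 4:
            continue
        # Step 3: keep only evidence documents that contain the answer string.
        docs = [d for d in ex["evidence_docs"]
                if ex["answer"].lower() in d.lower()]
        if docs:  # discard examples with no usable evidence left
            kept.append({**ex, "evidence_docs": docs})
    return kept
```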

Dataset Analysis
