@songying 2018-10-20T15:28:46.000000Z

The CLOTH Dataset


3. The CLOTH Dataset

A reading comprehension dataset built from middle school and high school exams.

3.1 Data Collection and Statistics

7131 passages, 99433 questions

We process the data as follows to ensure its validity:

  1. We remove questions with an inconsistent format, such as questions with more than four options.
  2. We filter out all questions whose validity relies on external information such as pictures or tables.
  3. We find that half of the total passages are duplicates, and we delete those passages.
  4. On one of the websites, the answers are stored as images. We use two OCR programs to extract the answers from the images, and we discard a question when the results from the two programs differ.
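
The four cleaning steps above can be sketched as a small filter. This is only an illustration, not the authors' actual pipeline: the `Question` and `Passage` classes, the `needs_external_info` flag, the `ocr_answers` pair, and the whitespace/case normalization used to detect duplicates are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Question:
    options: List[str]
    # Hypothetical flag: True if the question depends on a picture or table.
    needs_external_info: bool = False
    # Hypothetical field: answers read by two OCR programs, if the source
    # website stored the answer as an image; None otherwise.
    ocr_answers: Optional[Tuple[str, str]] = None


@dataclass
class Passage:
    text: str
    questions: List[Question] = field(default_factory=list)


def keep_question(q: Question) -> bool:
    """Return True if the question survives steps 1, 2 and 4."""
    # Step 1: inconsistent format (assumes the expected format is four options).
    if len(q.options) != 4:
        return False
    # Step 2: validity relies on external information.
    if q.needs_external_info:
        return False
    # Step 4: the two OCR results disagree.
    if q.ocr_answers is not None and q.ocr_answers[0] != q.ocr_answers[1]:
        return False
    return True


def clean(passages: List[Passage]) -> List[Passage]:
    """Drop duplicate passages (step 3) and invalid questions."""
    seen = set()
    cleaned = []
    for p in passages:
        # Assumed duplicate key: lowercased text with collapsed whitespace.
        key = " ".join(p.text.lower().split())
        if key in seen:
            continue
        seen.add(key)
        kept = [q for q in p.questions if keep_question(q)]
        cleaned.append(Passage(p.text, kept))
    return cleaned
```

Calling `clean(passages)` keeps the first copy of each duplicated passage and returns new `Passage` objects, so the raw data is left unmodified.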

Because the questions differ in difficulty, we split the dataset into two parts: CLOTH-M (middle school) and CLOTH-H (high school).

3.2 Question Type Analysis

The questions test three abilities: grammar, vocabulary, and reasoning. We divide them into the following four types:

  1. Grammar: The question is about grammar usage, involving tense, preposition usage, active/passive voices, subjunctive mood and so on.
  2. Short-term-reasoning: The question is about content words and can be answered based on the information within the same sentence. Note that the content words can evaluate knowledge of both vocabulary and reasoning.
  3. Matching/paraphrasing: The question is answered by copying/paraphrasing a word in the context.
  4. Long-term-reasoning: The answer must be inferred from synthesizing information distributed across multiple sentences.
