Evaluation Dataset

CNN / DailyMail

Data description:

The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. The original version was created for machine reading and comprehension and abstractive question answering. The current version supports both extractive and abstractive summarization.

Dataset structure:

Amount of source data:

Train (287,113), validation (13,368), test (11,490)

Data detail:

KEYS        EXPLAIN
id          a string of the url where the article was retrieved from
article     a string containing the body of the news article
highlights  a string containing the highlight of the article as written by the article author

Sample of source dataset:

{'id': '0054d6d30dbcad772e20b22771153a2a9cbeaf62',
 'article': '(CNN) -- An American woman died aboard a cruise ship that docked at Rio de Janeiro on Tuesday, the same ship on which 86 passengers previously fell ill, according to the state-run Brazilian news agency, Agencia Brasil. The American tourist died aboard the MS Veendam, owned by cruise operator Holland America. Federal Police told Agencia Brasil that forensic doctors were investigating her death. The ship's doctors told police that the woman was elderly and suffered from diabetes and hypertension, according the agency. The other passengers came down with diarrhea prior to her death during an earlier part of the trip, the ship's doctors said. The Veendam left New York 36 days ago for a South America tour.',
 'highlights': 'The elderly woman suffered from diabetes and hypertension, ship's doctors say .\nPreviously, 86 passengers had fallen ill on the ship, Agencia Brasil says .'
}
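
A minimal loading sketch in Python, assuming the corpus is the one hosted on the Hugging Face Hub under the id "cnn_dailymail" with the "3.0.0" configuration (the hosting and config name are assumptions, not stated above):

from datasets import load_dataset

# Load the test split of the summarization version of the corpus.
dataset = load_dataset("cnn_dailymail", "3.0.0", split="test")

example = dataset[0]
print(example["id"])          # url-derived identifier
print(example["highlights"])  # reference summary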

Citation information:

@inproceedings{see-etal-2017-get,
               title = "Get To The Point: Summarization with Pointer-Generator Networks",
               author = "See, Abigail  and Liu, Peter J.  and Manning, Christopher D.",
               booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
               month = jul,
               year = "2017",
               address = "Vancouver, Canada",
               publisher = "Association for Computational Linguistics",
               url = "https://www.aclweb.org/anthology/P17-1099",
               doi = "10.18653/v1/P17-1099",
               pages = "1073--1083",
               abstract = "Neural sequence-to-sequence models have provided a viable new approach for abstractive text summarization (meaning they are not restricted to simply selecting and rearranging passages from the original text). However, these models have two shortcomings: they are liable to reproduce factual details inaccurately, and they tend to repeat themselves. In this work we propose a novel architecture that augments the standard sequence-to-sequence attentional model in two orthogonal ways. First, we use a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator. Second, we use coverage to keep track of what has been summarized, which discourages repetition. We apply our model to the CNN / Daily Mail summarization task, outperforming the current abstractive state-of-the-art by at least 2 ROUGE points.",
}
 
@inproceedings{DBLP:conf/nips/HermannKGEKSB15,
               author={Karl Moritz Hermann and Tomás Kociský and Edward Grefenstette and Lasse Espeholt and Will Kay and Mustafa Suleyman and Phil Blunsom},
               title={Teaching Machines to Read and Comprehend},
               year={2015},
               cdate={1420070400000},
               pages={1693-1701},
               url={http://papers.nips.cc/paper/5945-teaching-machines-to-read-and-comprehend},
               booktitle={NIPS},
               crossref={conf/nips/2015}
}

Licensing information:

apache-2.0

GSM8K

Data description:

GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.

  • These problems take between 2 and 8 steps to solve.
  • Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+, −, ×, ÷) to reach the final answer.
  • Solutions are provided in natural language, as opposed to pure math expressions.

Dataset structure:

Amount of source data:

Data      Train   Validation
main      7473    1319
socratic  7473    1319

Data detail:

KEYS      EXPLAIN
question  the question string for a grade school math problem
answer    the full solution string to the question, containing multiple steps of reasoning with calculator annotations and the final numeric solution

Sample of source dataset:

main

Each instance contains a string for the grade-school level math question and a string for the corresponding answer with multiple steps of reasoning and calculator annotations.

{
    'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
    'answer': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72',
}

socratic

Each instance contains a string for a grade-school level math question and a string for the corresponding answer with multiple steps of reasoning, calculator annotations, and Socratic sub-questions.

{
    'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
    'answer': 'How many clips did Natalia sell in May? ** Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nHow many clips did Natalia sell altogether in April and May? ** Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72',
}
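
In both configurations the final numeric answer follows the "####" marker, and each intermediate calculation is wrapped in a <<expression=result>> calculator annotation. A minimal parsing sketch (the helper names are illustrative, not part of the dataset):

import re

def final_answer(answer: str) -> str:
    # The gold numeric answer is everything after the "####" marker.
    return answer.split("####")[-1].strip()

def calculator_annotations(answer: str):
    # Each annotation has the form <<expression=result>>.
    return re.findall(r"<<([^=>]+)=([^>]+)>>", answer)

answer = "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n#### 72"
print(final_answer(answer))            # '72'
print(calculator_annotations(answer))  # [('48/2', '24')]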

Citation information:

@article{cobbe2021gsm8k,
         title={Training Verifiers to Solve Math Word Problems},
         author={Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John},
         journal={arXiv preprint arXiv:2110.14168},
         year={2021}
}

Licensing information:

MIT License

LAMBADA

Data description:

The LAMBADA dataset evaluates the capabilities of computational models for text understanding by means of a word prediction task. It is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse.

Dataset structure:

Amount of source data:

Train (2662 novels), dev (4869 passages), test (5153 passages)

Data detail:

KEYS      EXPLAIN
category  the sub-category of books from which the passage was extracted (only available for the training split)
text      the text (concatenation of context, target sentence, and target word); the word to be guessed is the last one

Sample of source dataset:

{"category": "Mystery",
 "text": "bob could have been called in at this point , but he was n't miffed at his exclusion at all . he was relieved at not being brought into this initial discussion with central command . `` let 's go make some grub , '' said bob as he turned to danny . danny did n't keep his stoic expression , but with a look of irritation got up and left the room with bob",
}
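
Since the word to be guessed is always the last whitespace-delimited token of "text", a passage splits into context and target with a single call. A minimal sketch (illustrative, not an official loader):

text = "danny did n't keep his stoic expression , but with a look of irritation got up and left the room with bob"
context, target = text.rsplit(" ", 1)
print(target)   # 'bob'
print(context)  # everything a model sees before predicting the target word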

Citation information:

@InProceedings{paperno-EtAl:2016:P16-1,
               author    = {Paperno, Denis  and  Kruszewski, Germ\'{a}n  and  Lazaridou, Angeliki  and  Pham, Ngoc Quan  and  Bernardi, Raffaella  and  Pezzelle, Sandro  and  Baroni, Marco  and  Boleda, Gemma  and  Fernandez, Raquel},
               title     = {The {LAMBADA} dataset: Word prediction requiring a broad discourse context},
               booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
               month     = {August},
               year      = {2016},
               address   = {Berlin, Germany},
               publisher = {Association for Computational Linguistics},
               pages     = {1525--1534},
               url       = {http://www.aclweb.org/anthology/P16-1144}
}

Licensing information:

cc-by-4.0

MATH

Data description:

The Mathematics Aptitude Test of Heuristics (MATH) dataset consists of problems from mathematics competitions, including the AMC 10, AMC 12, AIME, and more. Each problem in MATH has a full step-by-step solution.

Dataset structure:

Amount of source data:

Train (7500), test (5000)

Data detail:

KEYS      EXPLAIN
problem   the competition math problem
solution  the step-by-step solution
level     the problem's difficulty level, from 'Level 1' to 'Level 5': a subject's easiest problems for humans are assigned 'Level 1' and its hardest 'Level 5'
type      the subject of the problem: Algebra, Counting & Probability, Geometry, Intermediate Algebra, Number Theory, Prealgebra, or Precalculus

Sample of source dataset:

{'problem': 'A board game spinner is divided into three parts labeled $A$, $B$  and $C$. The probability of the spinner landing on $A$ is $\\frac{1}{3}$ and the probability of the spinner landing on $B$ is $\\frac{5}{12}$.  What is the probability of the spinner landing on $C$? Express your answer as a common fraction.',
 'level': 'Level 1',
 'type': 'Counting & Probability',
 'solution': 'The spinner is guaranteed to land on exactly one of the three regions, so we know that the sum of the probabilities of it landing in each region will be 1. If we let the probability of it landing in region $C$ be $x$, we then have the equation $1 = \\frac{5}{12}+\\frac{1}{3}+x$, from which we have $x=\\boxed{\\frac{1}{4}}$.'
}
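
The final answer in each solution is wrapped in \boxed{...}, so evaluation code typically extracts that span with brace matching. A minimal sketch (the helper name is illustrative, not part of the dataset):

def extract_boxed(solution: str) -> str:
    # Take the last \boxed{...} and walk forward until its braces balance.
    start = solution.rindex("\\boxed{") + len("\\boxed{")
    depth, end = 1, start
    while depth > 0:
        if solution[end] == "{":
            depth += 1
        elif solution[end] == "}":
            depth -= 1
        end += 1
    return solution[start:end - 1]

solution = "... from which we have $x=\\boxed{\\frac{1}{4}}$."
print(extract_boxed(solution))  # '\frac{1}{4}'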

Citation information:

@article{hendrycksmath2021,
         title={Measuring Mathematical Problem Solving With the MATH Dataset},
         author={Dan Hendrycks
         and Collin Burns
         and Saurav Kadavath
         and Akul Arora
         and Steven Basart
         and Eric Tang
         and Dawn Song
         and Jacob Steinhardt},
         journal={arXiv preprint arXiv:2103.03874},
         year={2021}
}

Licensing information:

MIT License

MS MARCO

Data description:

The initial release was a question answering dataset featuring 100,000 real Bing questions and a human generated answer. It later grew to include a 1,000,000-question dataset, a natural language generation dataset, a passage ranking dataset, a keyphrase extraction dataset, a crawling dataset, and a conversational search dataset. The data comes in three tasks/forms: the original QnA dataset (v1.1), Question Answering (v2.1), and Natural Language Generation (v2.1).

Dataset structure:

Amount of source data:

Data   Train    Validation   Test
v1.1   82326    10047        9650
v2.1   808731   101093       101092

Data detail:

KEYS               EXPLAIN
answers            a list of string features
passages           a dictionary feature containing: is_selected (an integer feature), passage_text (a string feature), url (a string feature)
query              a string feature
query_id           an integer feature
query_type         a string feature
wellFormedAnswers  a list of string features

Sample of source dataset:

{
    "answers": ["Capillaries."],
    "passages": {
        "is_selected": [ 1, 0, 0, 0, 0, 0 ],
        "passage_text": [ "Gas exchange is the delivery of oxygen from the lungs to the bloodstream, and the elimination of carbon dioxide from the bloodstream to the lungs. It occurs in the lungs between the alveoli and a network of tiny blood vessels called capillaries, which are located in the walls of the alveoli. The walls of the alveoli actually share a membrane with the capillaries in which oxygen and carbon dioxide move freely between the respiratory system and the bloodstream.", "Arterioles branch into capillaries, the smallest of all blood vessels. Capillaries are the sites of nutrient and waste exchange between the blood and body cells. Capillaries are microscopic vessels that join the arterial system with the venous system.", "Arterioles are the smallest arteries and regulate blood flow into capillary beds through vasoconstriction and vasodilation. Capillaries are the smallest vessels and allow for exchange of substances between the blood and interstitial fluid. Continuous capillaries are most common and allow passage of fluids and small solutes. Fenestrated capillaries are more permeable to fluids and solutes than continuous capillaries.", "Tweet. The smallest blood vessels in the human body are capillaries. They are responsible for the absorption of oxygen into the blood stream and for removing the deoxygenated red blood cells for return to the heart and lungs for reoxygenation.", "2. Capillaries-these are the sites of gas exchange between the tissues. 3. Veins-these return oxygen poor blood to the heart, except for the vein that carries blood from the lungs. On the right is a diagram showing how the three connect. Notice the artery and vein are much larger than the capillaries.", "Gas exchange occurs in the capillaries which are the smallest blood vessels in the body. Each artery that comes from the heart is surrounded by capillaries so that they can ta … ke it to the various parts of the body." ], 
"url": [ "https://www.nlm.nih.gov/medlineplus/ency/anatomyvideos/000059.htm", "http://www.biosbcc.net/doohan/sample/htm/vessels.htm", "http://classes.midlandstech.edu/carterp/Courses/bio211/chap19/chap19.html", "http://www.wereyouwondering.com/what-is-the-smallest-blood-vessel-in-the-human-body/", "http://peer.tamu.edu/curriculum_modules/OrganSystems/module_4/whatweknow_circulation.htm", "http://www.answers.com/Q/What_are_the_smallest_blood_vessels_where_exchange_occurs_called" ]
    },
    "query": "The smallest blood vessels in your body,where gas exchange occurs are called",
    "query_id": 19,726,
    "query_type": "description",
    "wellFormedAnswers": []
}
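
The is_selected flags mark which retrieved passages the annotator actually used to write the answer, so pairing them with passage_text recovers the supporting evidence. A minimal sketch over the structure shown above (the record literal below is abbreviated for space):

record = {
    "answers": ["Capillaries."],
    "passages": {
        "is_selected": [1, 0],
        "passage_text": [
            "Gas exchange is the delivery of oxygen from the lungs to the bloodstream ...",
            "Arterioles branch into capillaries, the smallest of all blood vessels. ...",
        ],
    },
}

# Keep only the passages the annotator marked as supporting the answer.
selected = [
    text
    for flag, text in zip(record["passages"]["is_selected"],
                          record["passages"]["passage_text"])
    if flag == 1
]
print(selected)  # the single selected evidence passage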

Citation information:

@article{DBLP:journals/corr/NguyenRSGTMD16,
         author    = {Tri Nguyen and
                      Mir Rosenberg and
                      Xia Song and
                      Jianfeng Gao and
                      Saurabh Tiwary and
                      Rangan Majumder and
                      Li Deng},
         title     = {{MS} {MARCO:} {A} Human Generated MAchine Reading COmprehension Dataset},
         journal   = {CoRR},
         volume    = {abs/1611.09268},
         year      = {2016},
         url       = {http://arxiv.org/abs/1611.09268},
         archivePrefix = {arXiv},
         eprint    = {1611.09268},
         timestamp = {Mon, 13 Aug 2018 16:49:03 +0200},
         biburl    = {https://dblp.org/rec/journals/corr/NguyenRSGTMD16.bib},
         bibsource = {dblp computer science bibliography, https://dblp.org}
}

Licensing information:

None

Natural_QA

Data description:

The Natural Questions corpus is a question answering dataset. Each example consists of a google.com query and a corresponding Wikipedia page. Each Wikipedia page has a passage (or long answer) annotated on the page that answers the question, and one or more short spans from the annotated passage containing the actual answer. Both the long and the short answer annotations can, however, be empty. If both are empty, there is no answer on the page at all. If the long answer annotation is non-empty but the short answer annotation is empty, the annotated passage answers the question but no explicit short answer could be found.

Dataset structure:

Amount of source data:

Train (307373), development (7830), test (7842)

Data detail:

KEYS                    EXPLAIN
id                      a string feature
document                a dictionary feature containing: title (a string feature), url (a string feature), html (a string feature), and tokens (a dictionary feature including: token (a string feature), is_html (a bool feature), start_byte (an integer feature), end_byte (an integer feature))
question                a dictionary feature containing: text (a string feature), tokens (a list of string features)
long_answer_candidates  a dictionary feature containing: start_token (an integer feature), end_token (an integer feature), start_byte (an integer feature), end_byte (an integer feature), top_level (a bool feature)
annotations             a dictionary feature containing:
                        id (a string feature),
                        long_answers (a dictionary feature containing: start_token (an integer feature), end_token (an integer feature), start_byte (an integer feature), end_byte (an integer feature), candidate_index (an integer feature)),
                        short_answers (a dictionary feature containing: start_token (an integer feature), end_token (an integer feature), start_byte (an integer feature), end_byte (an integer feature), text (a string feature)),
                        yes_no_answer (a classification label, with possible values including NO (0), YES (1))

Sample of source dataset:

{
  "id": "797803103760793766",
  "document": {
    "title": "Google",
    "url": "http://www.wikipedia.org/Google",
    "html": "<html><body><h1>Google Inc.</h1><p>Google was founded in 1998 By:<ul><li>Larry</li><li>Sergey</li></ul></p></body></html>",
    "tokens":[
      {"token": "<h1>", "start_byte": 12, "end_byte": 16, "is_html": True},
      {"token": "Google", "start_byte": 16, "end_byte": 22, "is_html": False},
      {"token": "inc", "start_byte": 23, "end_byte": 26, "is_html": False},
      {"token": ".", "start_byte": 26, "end_byte": 27, "is_html": False},
      {"token": "</h1>", "start_byte": 27, "end_byte": 32, "is_html": True},
      {"token": "<p>", "start_byte": 32, "end_byte": 35, "is_html": True},
      {"token": "Google", "start_byte": 35, "end_byte": 41, "is_html": False},
      {"token": "was", "start_byte": 42, "end_byte": 45, "is_html": False},
      {"token": "founded", "start_byte": 46, "end_byte": 53, "is_html": False},
      {"token": "in", "start_byte": 54, "end_byte": 56, "is_html": False},
      {"token": "1998", "start_byte": 57, "end_byte": 61, "is_html": False},
      {"token": "by", "start_byte": 62, "end_byte": 64, "is_html": False},
      {"token": ":", "start_byte": 64, "end_byte": 65, "is_html": False},
      {"token": "<ul>", "start_byte": 65, "end_byte": 69, "is_html": True},
      {"token": "<li>", "start_byte": 69, "end_byte": 73, "is_html": True},
      {"token": "Larry", "start_byte": 73, "end_byte": 78, "is_html": False},
      {"token": "</li>", "start_byte": 78, "end_byte": 83, "is_html": True},
      {"token": "<li>", "start_byte": 83, "end_byte": 87, "is_html": True},
      {"token": "Sergey", "start_byte": 87, "end_byte": 92, "is_html": False},
      {"token": "</li>", "start_byte": 92, "end_byte": 97, "is_html": True},
      {"token": "</ul>", "start_byte": 97, "end_byte": 102, "is_html": True},
      {"token": "</p>", "start_byte": 102, "end_byte": 106, "is_html": True}
    ],
  },
  "question" :{
    "text": "who founded google",
    "tokens": ["who", "founded", "google"]
  },
  "long_answer_candidates": [
    {"start_byte": 32, "end_byte": 106, "start_token": 5, "end_token": 22, "top_level": True},
    {"start_byte": 65, "end_byte": 102, "start_token": 13, "end_token": 21, "top_level": False},
    {"start_byte": 69, "end_byte": 83, "start_token": 14, "end_token": 17, "top_level": False},
    {"start_byte": 83, "end_byte": 92, "start_token": 17, "end_token": 20 , "top_level": False}
  ],
  "annotations": [{
    "id": "6782080525527814293",
    "long_answer": {"start_byte": 32, "end_byte": 106, "start_token": 5, "end_token": 22, "candidate_index": 0},
    "short_answers": [
      {"start_byte": 73, "end_byte": 78, "start_token": 15, "end_token": 16, "text": "Larry"},
      {"start_byte": 87, "end_byte": 92, "start_token": 18, "end_token": 19, "text": "Sergey"}
    ],
    "yes_no_answer": -1
  }]
}
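
Answer spans are given both as byte offsets into the raw html and as token indices, so a long answer can be rendered by slicing the token list and dropping the html tokens. A minimal sketch over the structure shown in the sample above (the helper name is illustrative):

def render_span(tokens, start_token, end_token):
    # Join the non-HTML tokens inside the half-open span [start_token, end_token).
    return " ".join(
        t["token"] for t in tokens[start_token:end_token] if not t["is_html"]
    )

# For the sample above, the annotated long answer spans tokens 5..22:
# render_span(example["document"]["tokens"], 5, 22)
# -> 'Google was founded in 1998 by : Larry Sergey'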

Citation information:

@article{47761,
         title   = {Natural Questions: a Benchmark for Question Answering Research},
         author  = {Tom Kwiatkowski and Jennimaria Palomaki and Olivia Redfield and Michael Collins and Ankur Parikh and Chris Alberti and Danielle Epstein and Illia Polosukhin and Matthew Kelcey and Jacob Devlin and Kenton Lee and Kristina N. Toutanova and Llion Jones and Ming-Wei Chang and Andrew Dai and Jakob Uszkoreit and Quoc Le and Slav Petrov},
         year    = {2019},
         journal = {Transactions of the Association for Computational Linguistics}
}

Licensing information:

cc-by-sa-3.0

TriviaQA

Data description:

TriviaQA is a reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, 6 per question on average, that provide high quality distant supervision for answering the questions.

Dataset structure:

Amount of source data:

Data                  Train    Validation   Test
rc                    138384   18669        17210
rc.nocontext          138384   18669        17210
unfiltered            87622    11313        10832
unfiltered.nocontext  87622    11313        10832

Data detail:

KEYS             EXPLAIN
question         a string feature
question_id      a string feature
question_source  a string feature
entity_pages     a dictionary feature containing: doc_source (a string feature), filename (a string feature), title (a string feature), wiki_context (a string feature)
search_results   a dictionary feature containing: description (a string feature), filename (a string feature), rank (an integer feature), title (a string feature), url (a string feature), search_context (a string feature)
answer           a dictionary feature containing: aliases (a list of string features), normalized_aliases (a list of string features), matched_wiki_entity_name (a string feature), normalized_matched_wiki_entity_name (a string feature), normalized_value (a string feature), type (a string feature), value (a string feature)

Sample of source dataset:

{
    "question": "Where in England was Dame Judi Dench born?",
    "question_id": "tc_3",
    "question_source": "http://www.triviacountry.com/",
    "entity_pages":  
         {"doc_source": ["TagMe", "TagMe"], 
          "filename": ["England.txt", "Judi_Dench.txt"], 
          "title": ["England", "Judi(...TRUNCATED)},
    "search_results": 
         {"description": ["Judi Dench, Actress: Skyfall. Judi Dench was born in York, ... Judi Dench was born (...TRUNCATED)},
    "answer": 
         {"aliases": ["Park Grove (1895)", "York UA", "Yorkish", "UN/LOCODE:GBYRK", "York, UK", "Eoforwic", "Park Gr(...TRUNCATED)}
}
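
Predictions are usually scored by normalizing them and testing membership in answer["normalized_aliases"]. A minimal sketch with a simplified stand-in for the official normalization (lowercasing, stripping punctuation and articles):

import re
import string

def normalize(text: str) -> str:
    # Lowercase, drop punctuation, remove articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def is_correct(prediction: str, answer: dict) -> bool:
    return normalize(prediction) in answer["normalized_aliases"]

answer = {"normalized_aliases": ["york", "york uk", "city of york"]}
print(is_correct("York, UK", answer))  # True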

Citation information:

@article{2017arXivtriviaqa,
       author = {{Joshi}, Mandar and {Choi}, Eunsol and {Weld},
                 Daniel and {Zettlemoyer}, Luke},
        title = "{triviaqa: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension}",
      journal = {arXiv e-prints},
         year = 2017,
          eid = {arXiv:1705.03551},
        pages = {arXiv:1705.03551},
archivePrefix = {arXiv},
       eprint = {1705.03551},
}

Licensing information:

None

XSum

Data description:

XSum is a dataset for evaluation of abstractive single-document summarization systems. The goal is to create a short, one-sentence news summary answering the question “What is the article about?”. The dataset consists of 226,711 news articles, each accompanied by a one-sentence summary. The articles are collected from BBC articles (2010 to 2017) and cover a wide variety of domains (e.g., News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment and Arts).

Dataset structure:

Amount of source data:

Train (204045), validation (11332), test (11334)

Data detail:

KEYS      EXPLAIN
document  a string feature
summary   a string feature
id        a string feature

Sample of source dataset:

{
    "document": "Authorities said the incident took place on Sao Joao beach in Caparica, south-west of Lisbon. The National Maritime Authority said a middle-aged man and a yound girl died after they were unable to avoid the plane. [6 sentences with 139 words are abbreviated from here.] Other reports said the victims had been sunbathing when the plane made its emergency landing. [Another 4 sentences with 67 words are abbreciated from here.] Video footage from the scene carried by local broadcasters showed a small recreational plane parked on the sand, apprarently intact and surrounded by beachgoers and emergency workers. [Last 2 sentences with 19 words are abbreviated.]",
    "id": "29750031",
    "summary": "A man and a child have been killed after a light aircraft made an emergency landing on a beach in Portugal."
}
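
XSum systems are typically scored with ROUGE against the reference summary. A minimal sketch using the rouge_score package (the candidate summary below is a made-up placeholder, not model output):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = ("A man and a child have been killed after a light aircraft "
             "made an emergency landing on a beach in Portugal.")
candidate = "Two people died when a small plane made an emergency landing on a Portuguese beach."
# RougeScorer.score takes (target, prediction) in that order.
print(scorer.score(reference, candidate))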

Citation information:

@article{Narayan2018DontGM,
  title={Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization},
  author={Shashi Narayan and Shay B. Cohen and Mirella Lapata},
  journal={ArXiv},
  year={2018},
  volume={abs/1808.08745}
}

Licensing information:

None