Evaluation Datasets
CNN / DailyMail
Data description:
The CNN / DailyMail dataset is an English-language dataset containing over 300,000 unique news articles written by journalists at CNN and the Daily Mail. The original version of the dataset was created for machine reading comprehension and abstractive question answering; the current version supports both extractive and abstractive summarization.
Dataset composition and specifications:
Source data size:
Training set (287113), validation set (13368), test set (11490)
Data fields:
KEYS | EXPLAIN |
---|---|
id | a string retrieved from the URL of the news article |
article | a string containing the body of the news article |
highlights | a string containing the article highlights as written by the article's author |
Source dataset example:
{'id': '0054d6d30dbcad772e20b22771153a2a9cbeaf62',
'article': '(CNN) -- An American woman died aboard a cruise ship that docked at Rio de Janeiro on Tuesday, the same ship on which 86 passengers previously fell ill, according to the state-run Brazilian news agency, Agencia Brasil. The American tourist died aboard the MS Veendam, owned by cruise operator Holland America. Federal Police told Agencia Brasil that forensic doctors were investigating her death. The ship's doctors told police that the woman was elderly and suffered from diabetes and hypertension, according the agency. The other passengers came down with diarrhea prior to her death during an earlier part of the trip, the ship's doctors said. The Veendam left New York 36 days ago for a South America tour.',
'highlights': 'The elderly woman suffered from diabetes and hypertension, ship's doctors say .\nPreviously, 86 passengers had fallen ill on the ship, Agencia Brasil says .'}
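The `highlights` string packs several one-sentence bullets into a single field, separated by newline characters. A minimal sketch of splitting a record's highlights into individual reference sentences (the helper name is illustrative, not part of the dataset):

```python
def split_highlights(highlights: str) -> list[str]:
    """Split a CNN/DailyMail 'highlights' string into its one-sentence
    bullets, which are separated by newline characters."""
    return [h.strip() for h in highlights.split("\n") if h.strip()]

# The 'highlights' value from the sample record above.
sample = ("The elderly woman suffered from diabetes and hypertension, "
          "ship's doctors say .\n"
          "Previously, 86 passengers had fallen ill on the ship, "
          "Agencia Brasil says .")
bullets = split_highlights(sample)
```

Splitting on newlines like this is how the highlights are typically turned into multi-sentence reference summaries for ROUGE evaluation.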
Paper citations:
@inproceedings{see-etal-2017-get,
title = "Get To The Point: Summarization with Pointer-Generator Networks",
author = "See, Abigail and Liu, Peter J. and Manning, Christopher D.",
booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2017",
address = "Vancouver, Canada",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/P17-1099",
doi = "10.18653/v1/P17-1099",
pages = "1073--1083",
abstract = "Neural sequence-to-sequence models have provided a viable new approach for abstractive text summarization (meaning they are not restricted to simply selecting and rearranging passages from the original text). However, these models have two shortcomings: they are liable to reproduce factual details inaccurately, and they tend to repeat themselves. In this work we propose a novel architecture that augments the standard sequence-to-sequence attentional model in two orthogonal ways. First, we use a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator. Second, we use coverage to keep track of what has been summarized, which discourages repetition. We apply our model to the CNN / Daily Mail summarization task, outperforming the current abstractive state-of-the-art by at least 2 ROUGE points.",
}
@inproceedings{DBLP:conf/nips/HermannKGEKSB15,
author={Karl Moritz Hermann and Tomás Kociský and Edward Grefenstette and Lasse Espeholt and Will Kay and Mustafa Suleyman and Phil Blunsom},
title={Teaching Machines to Read and Comprehend},
year={2015},
pages={1693-1701},
url={http://papers.nips.cc/paper/5945-teaching-machines-to-read-and-comprehend},
booktitle={NIPS}
}
Dataset license:
apache-2.0
GSM8K
Data description:
GSM8K (Grade School Math 8K) is a dataset of approximately 8,500 high-quality, linguistically diverse grade-school math word problems. The dataset was created to support question answering on basic math problems that require multi-step reasoning.
- Problems take between 2 and 8 steps to solve.
- Solutions primarily involve performing a sequence of elementary calculations using the four basic arithmetic operations (+ − × ÷) to reach the final answer.
- Solutions are written in natural language rather than as pure mathematical expressions.
Dataset composition and specifications:
Source data size:
Data | Train | Test |
---|---|---|
main | 7473 | 1319 |
socratic | 7473 | 1319 |
Data fields:
KEYS | EXPLAIN |
---|---|
question | a string containing the grade-school math question |
answer | a string containing the full solution to the question, comprising multiple reasoning steps, calculator annotations, and the final numeric answer |
Source dataset example:
main
Each instance contains a string for the grade-school level math question and a string for the corresponding answer with multiple steps of reasoning and calculator annotations.
{
'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
'answer': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72',
}
socratic
Each instance contains a string for a grade-school level math question and a string for the corresponding answer with multiple steps of reasoning, calculator annotations, and Socratic sub-questions.
{
'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
'answer': 'How many clips did Natalia sell in May? ** Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nHow many clips did Natalia sell altogether in April and May? ** Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72',
}
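The `<<...>>` calculator annotations and the `#### <answer>` terminator make GSM8K answers easy to post-process. A minimal parsing sketch using the sample record above (the helper name is illustrative, not part of the dataset):

```python
import re

def parse_gsm8k_answer(answer: str) -> tuple[str, str]:
    """Split a GSM8K answer into its reasoning steps (with the <<...>>
    calculator annotations stripped) and the final answer after '####'."""
    steps, _, final = answer.partition("####")
    steps = re.sub(r"<<[^>]*>>", "", steps).strip()
    return steps, final.strip()

sample = ("Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
          "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
          "#### 72")
steps, final = parse_gsm8k_answer(sample)
```

Extracting the string after `####` is the usual way to grade model outputs on GSM8K by exact match against the final numeric answer.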
Paper citations:
@article{cobbe2021gsm8k,
title={Training Verifiers to Solve Math Word Problems},
author={Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John},
journal={arXiv preprint arXiv:2110.14168},
year={2021}
}
Dataset license:
MIT License
LAMBADA
Data description:
The LAMBADA dataset evaluates the text-understanding capabilities of computational models through a word prediction task. The dataset consists of narrative passages sharing the characteristic that human subjects can guess their last word when shown the whole passage, but cannot when shown only the final sentence preceding the target word. To succeed on LAMBADA, computational models cannot rely on local context alone; they must be able to track information across the broader discourse.
Dataset composition and specifications:
Source data size:
Training set (2662 novels), development set (4869 passages), test set (5153 passages)
Data fields:
KEYS | EXPLAIN |
---|---|
category | the sub-category the book belongs to (training set only) |
text | the text (comprising the context, the target sentence, and the target word); the word to be guessed is the last one |
Source dataset example:
{"category": "Mystery",
"text": "bob could have been called in at this point , but he was n't miffed at his exclusion at all . he was relieved at not being brought into this initial discussion with central command . `` let 's go make some grub , '' said bob as he turned to danny . danny did n't keep his stoic expression , but with a look of irritation got up and left the room with bob",
}
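Since the target word is simply the last whitespace-separated token of `text`, an example can be turned into a (context, target) pair with a one-line split. A minimal sketch, using the closing words of the sample record above (the helper name is illustrative):

```python
def split_context_target(text: str) -> tuple[str, str]:
    """Split a LAMBADA passage into the context a model conditions on
    and the final word it must predict."""
    words = text.split()
    return " ".join(words[:-1]), words[-1]

# The tail of the 'text' field from the sample record above.
sample = ("danny did n't keep his stoic expression , but with a look of "
          "irritation got up and left the room with bob")
context, target = split_context_target(sample)
```

Accuracy on LAMBADA is then just the fraction of passages for which a model's predicted next word equals `target`.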
Paper citations:
@InProceedings{paperno-EtAl:2016:P16-1,
author = {Paperno, Denis and Kruszewski, Germ\'{a}n and Lazaridou, Angeliki and Pham, Ngoc Quan and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fernandez, Raquel},
title = {The {LAMBADA} dataset: Word prediction requiring a broad discourse context},
booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
month = {August},
year = {2016},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {1525--1534},
url = {http://www.aclweb.org/anthology/P16-1144}
}
Dataset license:
cc-by-4.0
MATH
Data description:
The Mathematics Aptitude Test of Heuristics (MATH) dataset contains problems from mathematics competitions such as the AMC 10, AMC 12, and AIME. Every problem in MATH comes with a full step-by-step solution.
Dataset composition and specifications:
Source data size:
Training set (7500), test set (5000)
Data fields:
KEYS | EXPLAIN |
---|---|
problem | the competition math problem |
solution | the detailed step-by-step solution |
level | the difficulty of the problem, from "Level 1" to "Level 5"; the easiest problems in a subject are assigned "Level 1" and the hardest "Level 5" |
type | the subject the problem belongs to: Algebra, Counting & Probability, Geometry, Intermediate Algebra, Number Theory, Prealgebra, or Precalculus |
Source dataset example:
{'problem': 'A board game spinner is divided into three parts labeled $A$, $B$ and $C$. The probability of the spinner landing on $A$ is $\\frac{1}{3}$ and the probability of the spinner landing on $B$ is $\\frac{5}{12}$. What is the probability of the spinner landing on $C$? Express your answer as a common fraction.',
'level': 'Level 1',
'type': 'Counting & Probability',
'solution': 'The spinner is guaranteed to land on exactly one of the three regions, so we know that the sum of the probabilities of it landing in each region will be 1. If we let the probability of it landing in region $C$ be $x$, we then have the equation $1 = \\frac{5}{12}+\\frac{1}{3}+x$, from which we have $x=\\boxed{\\frac{1}{4}}$.'}
Paper citations:
@article{hendrycksmath2021,
title={Measuring Mathematical Problem Solving With the MATH Dataset},
author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},
journal={arXiv preprint arXiv:2103.03874},
year={2021}
}
Dataset license:
MIT License
MS MARCO
Data description:
The first release of the dataset contained 100,000 real Bing questions with human-generated answers. Subsequent releases added a 1,000,000-question dataset, a natural language generation dataset, a passage ranking dataset, a keyphrase extraction dataset, a crawling dataset, and a conversational search dataset. The data is organized into three tasks/formats: the original question-answering dataset (v1.1), question answering (v2.1), and natural language generation (v2.1).
Dataset composition and specifications:
Source data size:
Data | Train | Validation | Test |
---|---|---|---|
v1.1 | 82326 | 10047 | 9650 |
v2.1 | 808731 | 101093 | 101092 |
Data fields:
KEYS | EXPLAIN |
---|---|
answers | a list of strings |
passages | a dict with is_selected (int), passage_text (string), and url (string) |
query | a string |
query_id | an integer |
query_type | a string |
wellFormedAnswers | a list of strings |
Source dataset example:
{
"answers": ["Capillaries."],
"passages": {
"is_selected": [ 1, 0, 0, 0, 0, 0 ],
"passage_text": [ "Gas exchange is the delivery of oxygen from the lungs to the bloodstream, and the elimination of carbon dioxide from the bloodstream to the lungs. It occurs in the lungs between the alveoli and a network of tiny blood vessels called capillaries, which are located in the walls of the alveoli. The walls of the alveoli actually share a membrane with the capillaries in which oxygen and carbon dioxide move freely between the respiratory system and the bloodstream.", "Arterioles branch into capillaries, the smallest of all blood vessels. Capillaries are the sites of nutrient and waste exchange between the blood and body cells. Capillaries are microscopic vessels that join the arterial system with the venous system.", "Arterioles are the smallest arteries and regulate blood flow into capillary beds through vasoconstriction and vasodilation. Capillaries are the smallest vessels and allow for exchange of substances between the blood and interstitial fluid. Continuous capillaries are most common and allow passage of fluids and small solutes. Fenestrated capillaries are more permeable to fluids and solutes than continuous capillaries.", "Tweet. The smallest blood vessels in the human body are capillaries. They are responsible for the absorption of oxygen into the blood stream and for removing the deoxygenated red blood cells for return to the heart and lungs for reoxygenation.", "2. Capillaries-these are the sites of gas exchange between the tissues. 3. Veins-these return oxygen poor blood to the heart, except for the vein that carries blood from the lungs. On the right is a diagram showing how the three connect. Notice the artery and vein are much larger than the capillaries.", "Gas exchange occurs in the capillaries which are the smallest blood vessels in the body. Each artery that comes from the heart is surrounded by capillaries so that they can ta … ke it to the various parts of the body." ],
"url": [ "https://www.nlm.nih.gov/medlineplus/ency/anatomyvideos/000059.htm", "http://www.biosbcc.net/doohan/sample/htm/vessels.htm", "http://classes.midlandstech.edu/carterp/Courses/bio211/chap19/chap19.html", "http://www.wereyouwondering.com/what-is-the-smallest-blood-vessel-in-the-human-body/", "http://peer.tamu.edu/curriculum_modules/OrganSystems/module_4/whatweknow_circulation.htm", "http://www.answers.com/Q/What_are_the_smallest_blood_vessels_where_exchange_occurs_called" ]
},
"query": "The smallest blood vessels in your body,where gas exchange occurs are called",
"query_id": 19726,
"query_type": "description",
"wellFormedAnswers": []
}
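As the sample shows, the parallel `is_selected` and `passage_text` lists inside `passages` mark which retrieved passages the annotator used to write the answer. A minimal sketch of filtering a record down to its selected passages (the helper name and the shortened record are illustrative):

```python
def selected_passages(passages: dict) -> list[str]:
    """Return the passage texts the annotator marked as supporting the
    answer (is_selected == 1) in an MS MARCO record."""
    return [text for flag, text in
            zip(passages["is_selected"], passages["passage_text"])
            if flag == 1]

# A shortened version of the 'passages' field from the sample above.
record = {
    "is_selected": [1, 0, 0],
    "passage_text": ["Gas exchange occurs in the capillaries.",
                     "Arterioles branch into capillaries.",
                     "Veins return oxygen-poor blood to the heart."],
}
chosen = selected_passages(record)
```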
Paper citations:
@article{DBLP:journals/corr/NguyenRSGTMD16,
author = {Tri Nguyen and
Mir Rosenberg and
Xia Song and
Jianfeng Gao and
Saurabh Tiwary and
Rangan Majumder and
Li Deng},
title = {{MS} {MARCO:} {A} Human Generated MAchine Reading COmprehension Dataset},
journal = {CoRR},
volume = {abs/1611.09268},
year = {2016},
url = {http://arxiv.org/abs/1611.09268},
archivePrefix = {arXiv},
eprint = {1611.09268},
timestamp = {Mon, 13 Aug 2018 16:49:03 +0200},
biburl = {https://dblp.org/rec/journals/corr/NguyenRSGTMD16.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Dataset license:
Not specified
Natural_QA
Data description:
The Natural Questions corpus is a question answering dataset. Each example consists of a google.com query and a corresponding Wikipedia page. Each Wikipedia page has an annotated passage that answers the question (the long answer) and one or more short spans containing the actual answer. The long-answer and short-answer annotations can, however, be empty. If both are empty, there is no answer on the page. If the long-answer annotation is non-empty but the short-answer annotation is empty, the annotated passage answers the question but no explicit short answer could be found.
Dataset composition and specifications:
Source data size:
Training set (307373), development set (7830), test set (7842)
Data fields:
KEYS | EXPLAIN |
---|---|
id | a string |
document | a dict with title (string), url (string), html (string), and tokens (a dict with token (string), is_html (bool), start_byte (int), and end_byte (int)) |
question | a dict with text (string) and tokens (a list of strings) |
long_answer_candidates | a dict with start_token (int), end_token (int), start_byte (int), end_byte (int), and top_level (bool) |
annotations | a dict with id (string), long_answers (a dict with start_token (int), end_token (int), start_byte (int), end_byte (int), and candidate_index (int)), short_answers (a dict with start_token (int), end_token (int), start_byte (int), end_byte (int), and text (string)), and yes_no_answer (a class label whose possible values include NO (0) and YES (1)) |
Source dataset example:
{
"id": "797803103760793766",
"document": {
"title": "Google",
"url": "http://www.wikipedia.org/Google",
"html": "<html><body><h1>Google Inc.</h1><p>Google was founded in 1998 By:<ul><li>Larry</li><li>Sergey</li></ul></p></body></html>",
"tokens":[
{"token": "<h1>", "start_byte": 12, "end_byte": 16, "is_html": True},
{"token": "Google", "start_byte": 16, "end_byte": 22, "is_html": False},
{"token": "inc", "start_byte": 23, "end_byte": 26, "is_html": False},
{"token": ".", "start_byte": 26, "end_byte": 27, "is_html": False},
{"token": "</h1>", "start_byte": 27, "end_byte": 32, "is_html": True},
{"token": "<p>", "start_byte": 32, "end_byte": 35, "is_html": True},
{"token": "Google", "start_byte": 35, "end_byte": 41, "is_html": False},
{"token": "was", "start_byte": 42, "end_byte": 45, "is_html": False},
{"token": "founded", "start_byte": 46, "end_byte": 53, "is_html": False},
{"token": "in", "start_byte": 54, "end_byte": 56, "is_html": False},
{"token": "1998", "start_byte": 57, "end_byte": 61, "is_html": False},
{"token": "by", "start_byte": 62, "end_byte": 64, "is_html": False},
{"token": ":", "start_byte": 64, "end_byte": 65, "is_html": False},
{"token": "<ul>", "start_byte": 65, "end_byte": 69, "is_html": True},
{"token": "<li>", "start_byte": 69, "end_byte": 73, "is_html": True},
{"token": "Larry", "start_byte": 73, "end_byte": 78, "is_html": False},
{"token": "</li>", "start_byte": 78, "end_byte": 83, "is_html": True},
{"token": "<li>", "start_byte": 83, "end_byte": 87, "is_html": True},
{"token": "Sergey", "start_byte": 87, "end_byte": 92, "is_html": False},
{"token": "</li>", "start_byte": 92, "end_byte": 97, "is_html": True},
{"token": "</ul>", "start_byte": 97, "end_byte": 102, "is_html": True},
{"token": "</p>", "start_byte": 102, "end_byte": 106, "is_html": True}
],
},
"question" :{
"text": "who founded google",
"tokens": ["who", "founded", "google"]
},
"long_answer_candidates": [
{"start_byte": 32, "end_byte": 106, "start_token": 5, "end_token": 22, "top_level": True},
{"start_byte": 65, "end_byte": 102, "start_token": 13, "end_token": 21, "top_level": False},
{"start_byte": 69, "end_byte": 83, "start_token": 14, "end_token": 17, "top_level": False},
{"start_byte": 83, "end_byte": 92, "start_token": 17, "end_token": 20 , "top_level": False}
],
"annotations": [{
"id": "6782080525527814293",
"long_answer": {"start_byte": 32, "end_byte": 106, "start_token": 5, "end_token": 22, "candidate_index": 0},
"short_answers": [
{"start_byte": 73, "end_byte": 78, "start_token": 15, "end_token": 16, "text": "Larry"},
{"start_byte": 87, "end_byte": 92, "start_token": 18, "end_token": 19, "text": "Sergey"}
],
"yes_no_answer": -1
}]
}
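Answer annotations refer back into the document's token list via `start_token`/`end_token` offsets, as in the `"Larry"` short answer above (tokens 15-16 of the full token list). A minimal sketch of recovering the answer text from such a span, using a shortened token list with the same layout (the helper name is illustrative):

```python
def span_text(tokens: list[dict], start_token: int, end_token: int) -> str:
    """Join the non-HTML tokens in [start_token, end_token) to recover
    the text of a Natural Questions answer span."""
    return " ".join(t["token"] for t in tokens[start_token:end_token]
                    if not t["is_html"])

# A shortened token list in the same format as the sample record above.
tokens = [
    {"token": "<p>", "is_html": True},
    {"token": "Google", "is_html": False},
    {"token": "was", "is_html": False},
    {"token": "founded", "is_html": False},
    {"token": "by", "is_html": False},
    {"token": "Larry", "is_html": False},
]
answer = span_text(tokens, 5, 6)
```

The same function recovers long answers: passing a candidate's `start_token`/`end_token` yields the full annotated passage with its HTML markup dropped.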
Paper citations:
@article{47761,
title = {Natural Questions: a Benchmark for Question Answering Research},
author = {Tom Kwiatkowski and Jennimaria Palomaki and Olivia Redfield and Michael Collins and Ankur Parikh and Chris Alberti and Danielle Epstein and Illia Polosukhin and Matthew Kelcey and Jacob Devlin and Kenton Lee and Kristina N. Toutanova and Llion Jones and Ming-Wei Chang and Andrew Dai and Jakob Uszkoreit and Quoc Le and Slav Petrov},
year = {2019},
journal = {Transactions of the Association for Computational Linguistics}
}
Dataset license:
cc-by-sa-3.0
TriviaQA
Data description:
TriviaQA is a reading comprehension dataset containing over 650,000 question-answer-evidence triples. It includes 95,000 question-answer pairs authored by trivia enthusiasts, together with independently gathered evidence documents (six per question on average) that provide high-quality distant supervision for answering the questions.
Dataset composition and specifications:
Source data size:
Data | Train | Validation | Test |
---|---|---|---|
rc | 138384 | 18669 | 17210 |
rc.nocontext | 138384 | 18669 | 17210 |
unfiltered | 87622 | 11313 | 10832 |
unfiltered.nocontext | 87622 | 11313 | 10832 |
Data fields:
KEYS | EXPLAIN |
---|---|
question | a string |
question_id | a string |
question_source | a string |
entity_pages | a dict with doc_source (string), filename (string), title (string), and wiki_context (string) |
search_results | a dict with description (string), filename (string), rank (int), title (string), url (string), and search_context (string) |
answer | a dict with aliases (list of strings), normalized_aliases (list of strings), matched_wiki_entity_name (string), normalized_matched_wiki_entity_name (string), normalized_value (string), type (string), and value (string) |
Source dataset example:
{
"question": "Where in England was Dame Judi Dench born?",
"question_id": "tc_3",
"question_source": "http://www.triviacountry.com/",
"entity_pages":
{"doc_source": ["TagMe", "TagMe"],
"filename": ["England.txt", "Judi_Dench.txt"],
"title": ["England", "Judi(...TRUNCATED)},
"search_results":
{"description": ["Judi Dench, Actress: Skyfall. Judi Dench was born in York, ... Judi Dench was born (...TRUNCATED)},
"answer":
{"aliases": ["Park Grove (1895)", "York UA", "Yorkish", "UN/LOCODE:GBYRK", "York, UK", "Eoforwic", "Park Gr(...TRUNCATED)}
}
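The `aliases` / `normalized_aliases` lists exist so that predictions can be scored against every acceptable surface form of the answer ("York, UK", "York UA", ...). A simplified sketch of alias-based scoring, assuming a lowercase/strip-punctuation normalization in the spirit of (but not identical to) the official evaluation script:

```python
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())

def is_correct(prediction: str, answer: dict) -> bool:
    """Accept a prediction if its normalized form matches any of the
    gold answer's normalized aliases."""
    return normalize(prediction) in answer["normalized_aliases"]

# A shortened version of the 'answer' field from the sample above.
gold = {"normalized_aliases": ["york", "york uk", "city of york"]}
```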
Paper citations:
@article{2017arXivtriviaqa,
author = {{Joshi}, Mandar and {Choi}, Eunsol and {Weld}, Daniel and {Zettlemoyer}, Luke},
title = "{TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension}",
journal = {arXiv e-prints},
year = 2017,
eid = {arXiv:1705.03551},
pages = {arXiv:1705.03551},
archivePrefix = {arXiv},
eprint = {1705.03551},
}
Dataset license:
Not specified
XSum
Data description:
XSum is a dataset for evaluating abstractive single-document summarization systems. The goal is to create a short, one-sentence summary answering the question "What is the article about?". The dataset contains 226,711 news articles, each accompanied by a one-sentence summary. The articles were collected from BBC articles (2010 to 2017) and cover a wide variety of domains (e.g. news, politics, sports, weather, business, technology, science, health, family, education, entertainment, and the arts).
Dataset composition and specifications:
Source data size:
Training set (204045), validation set (11332), test set (11334)
Data fields:
KEYS | EXPLAIN |
---|---|
document | a string |
summary | a string |
id | a string |
Source dataset example:
{
"document": "Authorities said the incident took place on Sao Joao beach in Caparica, south-west of Lisbon. The National Maritime Authority said a middle-aged man and a young girl died after they were unable to avoid the plane. [6 sentences with 139 words are abbreviated from here.] Other reports said the victims had been sunbathing when the plane made its emergency landing. [Another 4 sentences with 67 words are abbreviated from here.] Video footage from the scene carried by local broadcasters showed a small recreational plane parked on the sand, apparently intact and surrounded by beachgoers and emergency workers. [Last 2 sentences with 19 words are abbreviated.]",
"id": "29750031",
"summary": "A man and a child have been killed after a light aircraft made an emergency landing on a beach in Portugal."
}
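Because each XSum summary is written rather than extracted, many of its words never occur in the source document; the novel-word rate is a common way to quantify this abstractiveness. A minimal sketch on whitespace tokens (the helper name is illustrative, and real analyses usually apply proper tokenization first):

```python
def novel_word_rate(document: str, summary: str) -> float:
    """Fraction of summary words (lowercased, whitespace-tokenized)
    that do not appear anywhere in the source document."""
    doc_words = set(document.lower().split())
    sum_words = summary.lower().split()
    if not sum_words:
        return 0.0
    novel = [w for w in sum_words if w not in doc_words]
    return len(novel) / len(sum_words)

rate = novel_word_rate("the plane landed on the beach",
                       "a light aircraft made an emergency landing")
```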
Paper citations:
@article{Narayan2018DontGM,
title={Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization},
author={Shashi Narayan and Shay B. Cohen and Mirella Lapata},
journal={ArXiv},
year={2018},
volume={abs/1808.08745}
}
Dataset license:
Not specified