Evaluation Data
All datasets below are converted into standard evaluation prompts before evaluation.
MATH-hard
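Data description:
The Mathematics Aptitude Test of Heuristics (MATH) dataset contains problems from mathematics competitions such as AMC 10, AMC 12, and AIME. Every problem in MATH comes with a detailed step-by-step solution. This evaluation uses the hard subset officially integrated in lm-eval-harness, which contains only the Level 5 problems from the original data.
Dataset composition and specification:
Source data size:
The original data contains 1,324 Level 5 problems.
Data fields: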
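Key | Description |
---|---|
problem | The competition math problem |
solution | The detailed step-by-step solution |
level | Difficulty of the problem, from "Level 1" to "Level 5". Within a subject, the easiest problems are assigned "Level 1" and the hardest "Level 5" |
type | Subject of the problem: Algebra, Counting & Probability, Geometry, Intermediate Algebra, Number Theory, Prealgebra, or Precalculus |
Source dataset sample: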
{
"problem": "John draws a regular five pointed star in the sand, and at each of the 5 outward-pointing points and 5 inward-pointing points he places one of ten different sea shells. How many ways can he place the shells, if reflections and rotations of an arrangement are considered equivalent?",
"level": "Level 5",
"type": "Counting & Probability"
}
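The Level 5 restriction described above amounts to a filter on the `level` field. The following is a minimal sketch of that filter, not the lm-eval-harness implementation: the three records are made-up placeholders that only mimic the field layout, and `select_hard` is a hypothetical helper name.

```python
from collections import Counter

# Placeholder records that follow the documented fields (not real MATH data).
records = [
    {"problem": "p1", "solution": "s1", "level": "Level 5", "type": "Counting & Probability"},
    {"problem": "p2", "solution": "s2", "level": "Level 2", "type": "Algebra"},
    {"problem": "p3", "solution": "s3", "level": "Level 5", "type": "Geometry"},
]


def select_hard(rows):
    """Keep only the rows whose `level` field is exactly "Level 5"."""
    return [row for row in rows if row.get("level") == "Level 5"]


hard = select_hard(records)
# Count the retained problems per subject to sanity-check the subset.
per_subject = Counter(row["type"] for row in hard)
print(len(hard), dict(per_subject))
```

Applied to the full source data, the same filter should leave the 1,324 problems stated above, provided the `level` strings match this spelling exactly.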
Paper citation:
MATH: https://arxiv.org/abs/2103.03874
@inproceedings{hendrycks2021MATH,
author = {Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob},
booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
title = {Measuring Mathematical Problem Solving With the MATH Dataset},
volume = {1},
year = {2021}
}
Dataset license:
MIT License
GPQA
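Data description:
Google-Proof Q&A (GPQA) is a set of specialist questions written by domain experts in biology, physics, and chemistry. It takes its name from the fact that non-experts reach only 34% accuracy even with unrestricted web search. The data totals 546 multiple-choice questions, which include the 448-question main set and, within it, the 198-question "diamond" set of the most challenging questions.
Source dataset sample: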
{
'Question': 'Two quantum states with energies E1 and E2 have a lifetime of 10^-9 sec and 10^-8 sec, respectively. We want to clearly distinguish these two energy levels. Which one of the following options could be their energy difference so that they can be clearly resolved?',
'Subdomain': 'Physics (general)',
'Correct Answer': '10^-4 eV',
'Incorrect Answer 1': '10^-11 eV',
'Incorrect Answer 2': '10^-8 eV',
'Incorrect Answer 3': '10^-9 eV',
'Explanation': 'According to the uncertainty principle, Delta E* Delta t=hbar/2. Delta t is the lifetime and Delta E is the width of the energy level. With Delta t=10^-9 s==> Delta E1= 3.3 10^-7 ev. And Delta t=10^-11 s gives Delta E2=3.310^-8 eV. Therefore, the energy difference between the two states must be significantly greater than 10^-7 ev. So the answer is 10^-4 ev.'
}
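Because each record stores one correct answer and three incorrect answers in separate fields, turning it into a multiple-choice prompt requires shuffling the options and remembering where the correct one lands. The sketch below shows one way to do that. The prompt layout, the `build_prompt` name, and the fixed seed are assumptions for illustration, and the sample record is a placeholder rather than real GPQA data.

```python
import random


def build_prompt(record, seed=0):
    """Shuffle the four answers and return (prompt_text, gold_letter)."""
    options = [
        record["Correct Answer"],
        record["Incorrect Answer 1"],
        record["Incorrect Answer 2"],
        record["Incorrect Answer 3"],
    ]
    rng = random.Random(seed)  # a fixed seed keeps the option order reproducible
    rng.shuffle(options)
    letters = "ABCD"
    # Locate the correct answer after shuffling to get the gold label.
    gold = letters[options.index(record["Correct Answer"])]
    lines = [record["Question"]]
    lines += [f"({letter}) {text}" for letter, text in zip(letters, options)]
    lines.append("Answer:")
    return "\n".join(lines), gold


# Placeholder record using the same field names as the sample above.
sample = {
    "Question": "Which option is marked as correct?",
    "Correct Answer": "right",
    "Incorrect Answer 1": "wrong 1",
    "Incorrect Answer 2": "wrong 2",
    "Incorrect Answer 3": "wrong 3",
}
prompt, gold = build_prompt(sample)
print(prompt)
print("gold:", gold)
```

Shuffling per record avoids a position bias toward the first option; deriving the seed from a record index would be one way to vary the order across questions while staying reproducible.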
Paper citation:
GPQA: https://openreview.net/forum?id=Ti67584b98
@inproceedings{
rein2024gpqa,
title={{GPQA}: A Graduate-Level Google-Proof Q\&A Benchmark},
author={David Rein and Betty Li Hou and Asa Cooper Stickland and Jackson Petty and Richard Yuanzhe Pang and Julien Dirani and Julian Michael and Samuel R. Bowman},
booktitle={First Conference on Language Modeling},
year={2024}
}
TheoremQA
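Data description:
TheoremQA contains 800 high-quality specialist question-answer pairs built around 350 well-known theorems from science and engineering disciplines (for example, Taylor's theorem, Lagrange's theorem, and the Quantum Theorem).
Dataset composition and specification:
Evaluation data size:
The test set contains 800 examples for evaluation.
Data fields: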
Key | Description |
---|---|
Question | The question |
Answer | The reference answer |
Answer_type | The type of the answer |
Picture | The image, if any |
Source dataset sample:
{'Question': 'How many ways are there to divide a set of 8 elements into 5 non-empty ordered subsets?',
'Answer': '11760',
'Answer_type': 'integer',
'Picture': ''}
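The `Answer_type` field makes it possible to compare a model prediction with the reference in a type-aware way. The following is a rough sketch of such a comparison, not the official TheoremQA scorer. Only the `integer` type appears in the sample above; the `float` and `bool` labels, the `is_correct` name, and the 1% relative tolerance are assumptions for illustration.

```python
import math


def is_correct(prediction, answer, answer_type, rel_tol=1e-2):
    """Compare a predicted string with the reference according to its type."""
    prediction = prediction.strip()
    answer = answer.strip()
    try:
        if answer_type == "integer":
            value = float(prediction)
            # Accept "11760" and "11760.0", reject non-integral values.
            return value.is_integer() and int(value) == int(answer)
        if answer_type == "float":  # assumed type label
            return math.isclose(float(prediction), float(answer), rel_tol=rel_tol)
        if answer_type == "bool":  # assumed type label
            return prediction.lower() == answer.lower()
    except ValueError:
        # The prediction could not be parsed as a number.
        return False
    # Fallback for any other type: exact string match.
    return prediction == answer


print(is_correct("11760", "11760", "integer"))
print(is_correct("11761", "11760", "integer"))
```

A numeric tolerance matters mostly for floating-point answers, where a prediction rounded differently from the reference should still count as correct.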
Paper citation:
TheoremQA: https://arxiv.org/abs/2305.12524
@inproceedings{chen2023theoremqa,
title = "{T}heorem{QA}: A Theorem-driven Question Answering Dataset",
author = "Chen, Wenhu and
Yin, Ming and
Ku, Max and
Lu, Pan and
Wan, Yixin and
Ma, Xueguang and
Xu, Jianyu and
Wang, Xinyi and
Xia, Tony",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
year = "2023",
pages = "7889--7901"
}