Evaluation Data
All of the datasets below are converted into a standard evaluation prompt before evaluation.
MATH-hard
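Data description:
The Mathematics Aptitude Test of Heuristics (MATH) dataset contains problems from mathematics competitions such as AMC 10, AMC 12, and AIME. Every problem in MATH comes with a detailed step-by-step solution. This evaluation uses the hard subset integrated in the official lm-eval-harness, which contains only the Level 5 problems from the original data.
Dataset composition and specification:
Source data size:
The original data contains 1,324 Level 5 problems.
Data fields: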
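KEY | DESCRIPTION |
---|---|
problem | Math competition problem |
solution | Detailed step-by-step solution |
level | Difficulty level of the problem, from "Level 1" to "Level 5". The easiest problems in a subject are assigned "Level 1" and the hardest "Level 5" |
type | Subject of the problem: Algebra, Counting & Probability, Geometry, Intermediate Algebra, Number Theory, Prealgebra, or Precalculus |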
Source dataset sample:
{
"problem": "John draws a regular five pointed star in the sand, and at each of the 5 outward-pointing points and 5 inward-pointing points he places one of ten different sea shells. How many ways can he place the shells, if reflections and rotations of an arrangement are considered equivalent?",
"level": "Level 5",
"type": "Counting & Probability"
}
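The hard subset can be reproduced from the `level` field listed in the table. The following is a minimal sketch, assuming the records are already loaded as Python dictionaries; the function name `select_hard` and the three in-memory records are illustrative assumptions, not part of lm-eval-harness.

```python
from typing import Dict, Iterable, List


def select_hard(records: Iterable[Dict[str, str]], level: str = "Level 5") -> List[Dict[str, str]]:
    """Keep only the records whose "level" field equals the requested level."""
    return [r for r in records if r.get("level") == level]


# Hypothetical records that reuse the field names from the table above.
records = [
    {"problem": "p1", "level": "Level 5", "type": "Counting & Probability"},
    {"problem": "p2", "level": "Level 3", "type": "Algebra"},
    {"problem": "p3", "level": "Level 5", "type": "Geometry"},
]
hard = select_hard(records)
print(len(hard))
```

Comparing the full string "Level 5" rather than parsing the digit keeps the filter robust to records whose level is not a plain number.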
Paper citation:
MATH: https://arxiv.org/abs/2103.03874
@inproceedings{hendrycks2021MATH,
author = {Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob},
booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
title = {Measuring Mathematical Problem Solving With the MATH Dataset},
volume = {1},
year = {2021}
}
Dataset license:
MIT License
GPQA
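Data description:
Google-Proof Q&A (GPQA) is a set of expert-written questions in biology, physics, and chemistry. Its name reflects that non-experts reach only 34% accuracy even with unrestricted web search. The extended set contains 546 multiple-choice questions; the main set is a 448-question subset of it, and the most challenging "diamond" set is a 198-question subset.
Source dataset sample: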
{
'Question': 'Two quantum states with energies E1 and E2 have a lifetime of 10^-9 sec and 10^-8 sec, respectively. We want to clearly distinguish these two energy levels. Which one of the following options could be their energy difference so that they can be clearly resolved?',
'Subdomain': 'Physics (general)',
'Correct Answer': '10^-4 eV',
'Incorrect Answer 1': '10^-11 eV',
'Incorrect Answer 2': '10^-8 eV',
'Incorrect Answer 3': '10^-9 eV',
'Explanation': 'According to the uncertainty principle, Delta E* Delta t=hbar/2. Delta t is the lifetime and Delta E is the width of the energy level. With Delta t=10^-9 s==> Delta E1= 3.3 10^-7 ev. And Delta t=10^-11 s gives Delta E2=3.310^-8 eV. Therefore, the energy difference between the two states must be significantly greater than 10^-7 ev. So the answer is 10^-4 ev.'
}
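Each record stores one correct and three incorrect answers in separate fields, so a multiple-choice prompt has to be assembled from them. The sketch below shows one possible way to do that, assuming a seeded shuffle and an `(A)`-`(D)` layout; the helper names `build_choices` and `format_prompt` and the prompt wording are assumptions, not the template used by any particular harness.

```python
import random
from typing import Dict, List, Tuple

LETTERS = "ABCD"


def build_choices(record: Dict[str, str], seed: int = 0) -> Tuple[List[str], str]:
    """Shuffle the four answers and return (options, letter of the correct option)."""
    options = [
        record["Correct Answer"],
        record["Incorrect Answer 1"],
        record["Incorrect Answer 2"],
        record["Incorrect Answer 3"],
    ]
    rng = random.Random(seed)  # fixed seed so the option order is reproducible
    rng.shuffle(options)
    gold = LETTERS[options.index(record["Correct Answer"])]
    return options, gold


def format_prompt(record: Dict[str, str], seed: int = 0) -> str:
    """Render the question followed by lettered options (assumed layout)."""
    options, _ = build_choices(record, seed)
    lines = [record["Question"]]
    lines += [f"({letter}) {option}" for letter, option in zip(LETTERS, options)]
    lines.append("Answer:")
    return "\n".join(lines)


# Shortened, hypothetical record that reuses the field names of the sample.
record = {
    "Question": "Which energy difference can be clearly resolved?",
    "Correct Answer": "10^-4 eV",
    "Incorrect Answer 1": "10^-11 eV",
    "Incorrect Answer 2": "10^-8 eV",
    "Incorrect Answer 3": "10^-9 eV",
}
options, gold = build_choices(record, seed=1)
print(format_prompt(record, seed=1))
```

Shuffling matters because the source always lists the correct answer first; without it, a model could score well by always choosing the first option.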
Paper citation:
GPQA: https://openreview.net/forum?id=Ti67584b98
@inproceedings{
rein2024gpqa,
title={{GPQA}: A Graduate-Level Google-Proof Q\&A Benchmark},
author={David Rein and Betty Li Hou and Asa Cooper Stickland and Jackson Petty and Richard Yuanzhe Pang and Julien Dirani and Julian Michael and Samuel R. Bowman},
booktitle={First Conference on Language Modeling},
year={2024}
}
TheoremQA
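Data description:
TheoremQA contains 800 high-quality question-answer pairs written around 350 well-known theorems from science and engineering (such as Taylor's theorem, Lagrange's theorem, and quantum theorems).
Dataset composition and specification:
Evaluation data size:
The test set contains 800 examples for evaluation.
Data fields: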
KEY | DESCRIPTION |
---|---|
Question | The question |
Answer | The answer |
Answer_type | Type of the answer |
Picture | Image (if any) |
Source dataset sample:
{'Question': 'How many ways are there to divide a set of 8 elements into 5 non-empty ordered subsets?',
'Answer': '11760',
'Answer_type': 'integer',
'Picture': ''}
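The sample answer can be checked under one reading of the question. Assuming "ordered subsets" means that the elements inside each of the 5 blocks are linearly ordered while the blocks themselves are unlabeled (this interpretation is an assumption, not stated in the dataset), the count is the unsigned Lah number L(n, k) = C(n-1, k-1) * n! / k!. A minimal sketch, with `lah` as an illustrative helper name:

```python
from math import comb, factorial


def lah(n: int, k: int) -> int:
    """Unsigned Lah number: ways to split n elements into k non-empty ordered lists."""
    return comb(n - 1, k - 1) * factorial(n) // factorial(k)


# C(7, 4) = 35 and 8! / 5! = 336, so the product is 11760.
print(lah(8, 5))
```

Under this reading the result agrees with the `Answer` field of the sample.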
Paper citation:
TheoremQA: https://arxiv.org/abs/2305.12524
@inproceedings{chen2023theoremqa,
title = "{T}heorem{QA}: A Theorem-driven Question Answering Dataset",
author = "Chen, Wenhu and
Yin, Ming and
Ku, Max and
Lu, Pan and
Wan, Yixin and
Ma, Xueguang and
Xu, Jianyu and
Wang, Xinyi and
Xia, Tony",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
year = "2023",
pages = "7889--7901"
}
GSM
GSM-8K:https://github.com/openai/grade-school-math
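Data description:
GSM8K is a dataset released by OpenAI to evaluate and improve the ability of large language models to solve math word problems. It contains 8.5K high-quality grade school math problems created by human problem writers, split into 7.5K training problems and 1K test problems. The problems take between 2 and 8 steps to solve, and solutions primarily involve a sequence of elementary arithmetic operations (addition +, subtraction -, division /, multiplication *) to reach the final answer. A bright middle school student should be able to solve every problem.
- Original data file locations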
grade_school_math/data/train.jsonl
grade_school_math/data/test.jsonl
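Each line in these files corresponds to one grade school math problem, stored as a JSON dictionary with two keys, "question" and "answer". The answer contains the calculation steps with annotations, and the final numeric answer is on the last line of the solution, prefixed with ####.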
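Because the final answer always follows the #### prefix, it can be pulled out of the "answer" field with a regular expression. The following is a minimal sketch, assuming numeric answers that may carry a sign, thousands separators, or a decimal point; the name `extract_final_answer` and the one-line sample record are hypothetical and do not come from the dataset.

```python
import json
import re
from typing import Optional

# Number after "####": optional sign, digits with optional commas, optional decimals.
FINAL_ANSWER_RE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_final_answer(answer: str) -> Optional[str]:
    """Return the number that follows '####', without thousands separators."""
    match = FINAL_ANSWER_RE.search(answer)
    if match is None:
        return None
    return match.group(1).replace(",", "")


# Hypothetical JSONL line in the same two-key layout as train.jsonl / test.jsonl.
line = json.dumps({"question": "What is 2 + 3?", "answer": "2 + 3 = 5\n#### 5"})
record = json.loads(line)
print(extract_final_answer(record["answer"]))
```

Returning a normalized string rather than a float avoids rounding surprises when the extracted value is later compared with a model's output.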
Source dataset question sample:
Paper citation:
GSM: https://arxiv.org/abs/2110.14168
@article{cobbe2021gsm8k,
title={Training Verifiers to Solve Math Word Problems},
author={Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John},
journal={arXiv preprint arXiv:2110.14168},
year={2021}
}
Data license:
MIT License
AIME 2024 (MATH)
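Data description:
This dataset collects the 30 original English problems of the 2024 American Invitational Mathematics Examination (AIME) in JSONL format. The problems span geometry, algebra, number theory, and other areas of mathematics, and each comes with a complete, detailed solution. The set is suited to testing a model's advanced multi-step mathematical reasoning and creative thinking.
Source dataset sample (simplified):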
2024-II-4
Question: Let $x,y$ and $z$ be positive real numbers that satisfy the following system of equations:
\[\log_2\left({x \over yz}\right) = {1 \over 2}\]
\[\log_2\left({y \over xz}\right) = {1 \over 3}\]
\[\log_2\left({z \over xy}\right) = {1 \over 4}\]
Then the value of $\left|\log_2(x^4y^3z^2)\right|$ is $\tfrac{m}{n}$ where $m$ and $n$ are relatively prime positive integers. Find $m+n$.
Solution:
Denote $\log_2(x) = a$, $\log_2(y) = b$, and $\log_2(z) = c$.
Then, we have:
$a-b-c = \frac{1}{2}$,
$-a+b-c = \frac{1}{3}$,
$-a-b+c = \frac{1}{4}$.
Now, we can solve to get $a = \frac{-7}{24}, b = \frac{-9}{24}, c = \frac{-5}{12}$.
Plugging these values in, we obtain $|4a + 3b + 2c| = \frac{25}{8} \implies \boxed{033}$.
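The solution skips the elimination step. Adding the three linear equations pairwise cancels two unknowns at a time, which the following sketch reproduces with exact rational arithmetic; the variable names are illustrative only.

```python
from fractions import Fraction

half, third, quarter = Fraction(1, 2), Fraction(1, 3), Fraction(1, 4)

# Pairwise sums of the three equations:
#   (1) + (2): -2c = 1/2 + 1/3
#   (1) + (3): -2b = 1/2 + 1/4
#   (2) + (3): -2a = 1/3 + 1/4
c = -(half + third) / 2
b = -(half + quarter) / 2
a = -(third + quarter) / 2

value = abs(4 * a + 3 * b + 2 * c)  # |log_2(x^4 y^3 z^2)|
answer = value.numerator + value.denominator  # m + n with m/n in lowest terms
print(a, b, c, value, answer)
```

`Fraction` keeps every intermediate value in lowest terms, so b is reported as -3/8, which equals the -9/24 written in the solution.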
Dataset license:
Source: AIME 2024 I & II
License: MIT License
minerva_math_algebra
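Data description:
The mathematical data used by Minerva (including the algebra portion) comes mainly from:
- The MATH dataset (created by Hendrycks et al.): 12,500 competition-level middle and high school math problems covering algebra, geometry, number theory, and other branches, with problems written in LaTeX.
- arXiv papers and mathematical web pages: Google additionally collected 118GB of scientific papers (with LaTeX formulas) and web pages containing math marked up with MathJax/LaTeX, preserving symbolic expressions (for example, keeping E=mc^2 rather than flattening it to E=mc2).
Source dataset sample (simplified):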
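Question: A line parallel to y=4x+6 passes through the point (5,10). What is the y-coordinate of the point where this line crosses the y-axis?
Answer: -10
Solution steps:
- The slope is k=4, so the line has the equation y=4x+b;
- Substituting the point (5,10) gives 10=4⋅5+b, so b=-10;
- The y-axis intersection is at x=0, so y=-10. Answer: this gives 10=4⋅5+b⇒b=−10, which is what we wanted.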
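The arithmetic in the steps above can be confirmed with a few lines of code. This is only an illustrative check; the helper name `y_intercept` is an assumption and is not part of the dataset or of Minerva.

```python
from fractions import Fraction


def y_intercept(slope, x0, y0) -> Fraction:
    """Return b for the line y = slope * x + b that passes through (x0, y0)."""
    return Fraction(y0) - Fraction(slope) * Fraction(x0)


# Parallel lines share a slope, so the slope 4 is taken from y = 4x + 6.
b = y_intercept(4, 5, 10)
print(b)
```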
Paper citation:
https://arxiv.org/abs/2206.14858
Dataset license:
CC BY 4.0
math_500
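Data description:
Math500 is a benchmark dataset focused on evaluating the mathematical reasoning ability of models. It contains 500 math problems in three difficulty levels: easy, medium, and difficult. The dataset was first adopted by the HKUST NLP group in the simpleRL-reason project to test the mathematical reasoning of language models such as Qwen. The original data is provided in JSONL format, and each entry contains the problem statement, the solution steps, and the final answer.
Source dataset sample (simplified):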
{
"question": "If x + 2 = 5, find the value of x.",
"solution": "x = 5 - 2 = 3",
"answer": "3",
"difficulty": "easy",
"topic": "algebra"
}
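Entries in this layout can be read line by line and grouped by the "difficulty" field, for example to report accuracy per level. The sketch below assumes the field names of the simplified sample above; `load_jsonl`, `is_correct`, and the two in-memory records are hypothetical.

```python
import json
from collections import Counter
from typing import Dict, List


def load_jsonl(text: str) -> List[Dict[str, str]]:
    """Parse JSONL text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def is_correct(prediction: str, record: Dict[str, str]) -> bool:
    """Exact string match against the "answer" field after trimming whitespace."""
    return prediction.strip() == record["answer"].strip()


# Two hypothetical entries serialized the same way as the sample.
text = "\n".join(
    json.dumps(entry)
    for entry in [
        {"question": "If x + 2 = 5, find the value of x.", "answer": "3", "difficulty": "easy"},
        {"question": "If 2x = 10, find the value of x.", "answer": "5", "difficulty": "easy"},
    ]
)
records = load_jsonl(text)
per_level = Counter(r["difficulty"] for r in records)
print(per_level["easy"], is_correct(" 3 ", records[0]))
```

Exact matching is the simplest possible scorer; answers that admit several equivalent forms, such as 1/2 and 0.5, would need normalization first.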
Paper citation:
None
Dataset license:
MIT License