Evaluation Data

All of the datasets below are converted into a standard evaluation prompt before evaluation.

MATH-hard

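Data description:

The Mathematics Aptitude Test of Heuristics (MATH) dataset contains problems from mathematics competitions such as AMC 10, AMC 12, and AIME. Every problem in MATH comes with a detailed step-by-step solution. This evaluation uses the hard subset officially integrated into lm-eval-harness, which contains only the Level 5 problems from the original data.

Dataset composition and conventions:

Source data size:

The original data contains 1,324 Level 5 problems.

Data fields: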

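| KEYS | EXPLAIN |
| --- | --- |
| problem | The competition math problem |
| solution | The detailed worked solution |
| level | Difficulty of the problem, from "Level 1" to "Level 5". Within a subject, the easiest problems are assigned "Level 1" and the hardest "Level 5" |
| type | Subject of the problem: Algebra, Counting & Probability, Geometry, Intermediate Algebra, Number Theory, Prealgebra, and Precalculus |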

Source dataset sample:

{
    "problem": "John draws a regular five pointed star in the sand, and at each of the 5 outward-pointing points and 5 inward-pointing points he places one of ten different sea shells. How many ways can he place the shells, if reflections and rotations of an arrangement are considered equivalent?",
    "level": "Level 5",
    "type": "Counting & Probability"
}
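
As a rough illustration of how a Level 5 subset relates to the full data, the sketch below keeps only the records whose `level` field equals "Level 5". The `records` list and the `select_hard` helper are hypothetical names introduced for this example; this is not the lm-eval-harness implementation.

```python
from typing import Dict, List


def select_hard(records: List[Dict[str, str]], level: str = "Level 5") -> List[Dict[str, str]]:
    """Return only the records whose 'level' field matches the given level."""
    return [r for r in records if r.get("level") == level]


# Hypothetical toy records that follow the field layout of the sample above.
records = [
    {"problem": "What is 1 + 1?", "level": "Level 1", "type": "Prealgebra"},
    {
        "problem": "John draws a regular five pointed star in the sand ...",
        "level": "Level 5",
        "type": "Counting & Probability",
    },
]

hard = select_hard(records)
print(len(hard), hard[0]["type"])
```

Filtering on the exact string keeps the example independent of how the levels might be encoded elsewhere; a real pipeline may normalise the field first.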

Paper citation:

MATH: https://arxiv.org/abs/2103.03874

@inproceedings{hendrycks2021MATH,
 author = {Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob},
 booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
 title = {Measuring Mathematical Problem Solving With the MATH Dataset},
 volume = {1},
 year = {2021}
}

Dataset license and usage terms:

MIT License

GPQA

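Data description:

Google-Proof Q&A (GPQA) is a set of specialist questions written by domain experts in biology, physics, and chemistry. It takes its name from the fact that non-experts reach only 34% accuracy even with unrestricted web search. The data totals 546 multiple-choice questions; 448 of them make up the main set, and 198 of those make up the most challenging "diamond" set.

Source dataset sample: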

{
  'Question': 'Two quantum states with energies E1 and E2 have a lifetime of 10^-9 sec and 10^-8 sec, respectively. We want to clearly distinguish these two energy levels. Which one of the following options could be their energy difference so that they can be clearly resolved?',
  'Subdomain': 'Physics (general)',
  'Correct Answer': '10^-4 eV',
  'Incorrect Answer 1': '10^-11 eV',
  'Incorrect Answer 2': '10^-8 eV',
  'Incorrect Answer 3': '10^-9 eV',
  'Explanation': 'According to the uncertainty principle, Delta E* Delta t=hbar/2. Delta t is the lifetime and Delta E is the width of the energy level. With Delta t=10^-9 s==> Delta E1= 3.3 10^-7 ev. And Delta t=10^-11 s gives Delta E2=3.310^-8 eV. Therefore, the energy difference between the two states must be significantly greater than 10^-7 ev. So the answer is 10^-4 ev.'
}
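
A record like the one above can be turned into a multiple-choice prompt by shuffling the correct answer in with the three incorrect ones. The sketch below shows one way to do that; the prompt layout, the `build_prompt` helper, and the fixed seed are assumptions made for illustration, not the template actually used in this evaluation.

```python
import random
from typing import Dict, Tuple


def build_prompt(record: Dict[str, str], seed: int = 0) -> Tuple[str, str]:
    """Shuffle the four answers and return (prompt_text, correct_letter)."""
    options = [
        record["Correct Answer"],
        record["Incorrect Answer 1"],
        record["Incorrect Answer 2"],
        record["Incorrect Answer 3"],
    ]
    rng = random.Random(seed)  # a fixed seed keeps the option order reproducible
    rng.shuffle(options)
    letters = "ABCD"
    lines = [record["Question"]]
    lines += [f"({letter}) {text}" for letter, text in zip(letters, options)]
    lines.append("Answer:")
    correct_letter = letters[options.index(record["Correct Answer"])]
    return "\n".join(lines), correct_letter


# Answer fields copied from the sample above; the question text is abbreviated.
record = {
    "Question": "Which energy difference lets the two levels be clearly resolved?",
    "Correct Answer": "10^-4 eV",
    "Incorrect Answer 1": "10^-11 eV",
    "Incorrect Answer 2": "10^-8 eV",
    "Incorrect Answer 3": "10^-9 eV",
}

prompt, gold = build_prompt(record)
print(prompt)
print("gold:", gold)
```

Shuffling matters because the source record always stores the correct answer in its own field; without it, the gold option would sit in the same position for every question.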

Paper citation:

GPQA: https://openreview.net/forum?id=Ti67584b98

@inproceedings{
rein2024gpqa,
title={{GPQA}: A Graduate-Level Google-Proof Q\&A Benchmark},
author={David Rein and Betty Li Hou and Asa Cooper Stickland and Jackson Petty and Richard Yuanzhe Pang and Julien Dirani and Julian Michael and Samuel R. Bowman},
booktitle={First Conference on Language Modeling},
year={2024}
}

TheoremQA

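Data description:

TheoremQA contains 800 high-quality specialist question-answer pairs, written around a total of 350 well-known theorems from science and engineering disciplines (such as Taylor's theorem, Lagrange's theorem, and the quantum theorem).

Dataset composition and conventions:

Evaluation data size:

The test set contains 800 examples for evaluation.

Data fields: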

| KEYS | EXPLAIN |
| --- | --- |
| Question | The question |
| Answer | The answer |
| Answer_type | The type of the answer |
| Picture | An image (if any) |

Source dataset sample:

{'Question': 'How many ways are there to divide a set of 8 elements into 5 non-empty ordered subsets?',
 'Answer': '11760',
 'Answer_type': 'integer',
 'Picture': ''}
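
The sample answer can be reproduced under one reading of the question: each of the 5 subsets is itself an ordered list, while the collection of subsets is unordered. That count is the Lah number L(n, k) = C(n-1, k-1) · n! / k!. This interpretation and the helper name `lah` are assumptions made for this sketch and are not part of the dataset.

```python
from math import comb, factorial


def lah(n: int, k: int) -> int:
    """Number of ways to split n labelled elements into k non-empty ordered lists."""
    return comb(n - 1, k - 1) * factorial(n) // factorial(k)


# C(7, 4) = 35 and 8! / 5! = 336, so the product is 35 * 336.
print(lah(8, 5))  # 11760, matching the 'Answer' field above
```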

Paper citation:

TheoremQA: https://arxiv.org/abs/2305.12524

@inproceedings{chen2023theoremqa,
    title = "{T}heorem{QA}: A Theorem-driven Question Answering Dataset",
    author = "Chen, Wenhu  and
      Yin, Ming  and
      Ku, Max  and
      Lu, Pan  and
      Wan, Yixin  and
      Ma, Xueguang  and
      Xu, Jianyu  and
      Wang, Xinyi  and
      Xia, Tony",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    year = "2023",
    pages = "7889--7901"
}