
Evaluation Data

The following datasets were converted to standard evaluation prompts before evaluation.

MATH-hard

Data description:

The Mathematics Aptitude Test of Heuristics (MATH) dataset contains problems from math competitions such as AMC 10, AMC 12, and AIME. Each problem is accompanied by a detailed step-by-step solution. Here we use the hard subset from the official lm-eval-harness integration, which contains only the Level 5 questions from the original data.

Dataset composition and specification:

Source data volume:

The test split of the original data contains 1,324 Level 5 questions.

Data Fields:

| Key | Explanation |
| --- | --- |
| problem | Math competition problem |
| solution | Detailed solution steps |
| level | Difficulty level of the problem, from "Level 1" (easiest) to "Level 5" (hardest) |
| type | Subject category of the problem: Algebra, Counting & Probability, Geometry, Intermediate Algebra, Number Theory, Prealgebra, or Precalculus |

Sample of source dataset:

{
    "problem": "John draws a regular five pointed star in the sand, and at each of the 5 outward-pointing points and 5 inward-pointing points he places one of ten different sea shells. How many ways can he place the shells, if reflections and rotations of an arrangement are considered equivalent?",
    "level": "Level 5",
    "type": "Counting & Probability"
}
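
As a rough illustration of how a Level 5 subset can be selected, the sketch below filters a list of records on the `level` key described earlier in this section. The function name `select_hard` and the toy records are hypothetical and are not taken from the lm-eval-harness integration.

```python
from typing import Dict, List


def select_hard(records: List[Dict[str, str]], level: str = "Level 5") -> List[Dict[str, str]]:
    """Keep only the records whose "level" field equals the requested level."""
    return [r for r in records if r.get("level") == level]


# Toy records that reuse the documented keys; the contents are placeholders.
records = [
    {"problem": "placeholder easy problem", "solution": "placeholder",
     "level": "Level 1", "type": "Prealgebra"},
    {"problem": "placeholder hard problem", "solution": "placeholder",
     "level": "Level 5", "type": "Number Theory"},
]

hard = select_hard(records)
print(len(hard))
```

Matching on the exact string "Level 5" follows the format shown in the sample record; a different storage format for the level would need a different comparison.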

Paper Citation:

MATH: https://arxiv.org/abs/2103.03874

@inproceedings{hendrycks2021MATH,
 author = {Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob},
 booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
 title = {Measuring Mathematical Problem Solving With the MATH Dataset},
 volume = {1},
 year = {2021}
}

MIT License

GPQA

Data description:

Google-Proof Q&A (GPQA) is a set of specialised questions written by experts in biology, physics, and chemistry. It is called "Google-proof" because non-experts reach only 34% accuracy even with unrestricted web access. The data consists of 546 multiple-choice questions in the extended set, of which 448 form the main set and 198 form the most challenging ‘diamond’ subset.

Sample of source dataset:

{
  'Question': 'Two quantum states with energies E1 and E2 have a lifetime of 10^-9 sec and 10^-8 sec, respectively. We want to clearly distinguish these two energy levels. Which one of the following options could be their energy difference so that they can be clearly resolved?',
  'Subdomain': 'Physics (general)',
  'Correct Answer': '10^-4 eV',
  'Incorrect Answer 1': '10^-11 eV',
  'Incorrect Answer 2': '10^-8 eV',
  'Incorrect Answer 3': '10^-9 eV',
  'Explanation': 'According to the uncertainty principle, Delta E* Delta t=hbar/2. Delta t is the lifetime and Delta E is the width of the energy level. With Delta t=10^-9 s==> Delta E1= 3.3 10^-7 ev. And Delta t=10^-11 s gives Delta E2=3.310^-8 eV. Therefore, the energy difference between the two states must be significantly greater than 10^-7 ev. So the answer is 10^-4 ev.'
}
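
This page does not state how GPQA records are turned into prompts, so the following is only a minimal sketch of one possible conversion. It shuffles the four answers with a fixed seed and records the letter of the correct one. The function name `build_prompt`, the prompt template, and the seed are assumptions made for illustration.

```python
import random
from typing import Dict, Tuple


def build_prompt(record: Dict[str, str], seed: int = 0) -> Tuple[str, str]:
    """Return (prompt_text, correct_letter) for one GPQA-style record."""
    options = [
        record["Correct Answer"],
        record["Incorrect Answer 1"],
        record["Incorrect Answer 2"],
        record["Incorrect Answer 3"],
    ]
    # A fixed seed keeps the option order reproducible across runs.
    random.Random(seed).shuffle(options)
    letters = "ABCD"
    lines = [record["Question"]]
    lines += [f"({letter}) {option}" for letter, option in zip(letters, options)]
    lines.append("Answer:")
    correct_letter = letters[options.index(record["Correct Answer"])]
    return "\n".join(lines), correct_letter


# Answer fields copied from the sample above; the question text is abbreviated.
record = {
    "Question": "Which energy difference lets the two levels be clearly resolved?",
    "Correct Answer": "10^-4 eV",
    "Incorrect Answer 1": "10^-11 eV",
    "Incorrect Answer 2": "10^-8 eV",
    "Incorrect Answer 3": "10^-9 eV",
}

prompt, answer = build_prompt(record)
print(prompt)
print("gold:", answer)
```

Shuffling matters because the source record always stores the correct answer in its own field, so a fixed option order would leak the answer position.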

Paper Citation:

GPQA: https://openreview.net/forum?id=Ti67584b98

@inproceedings{
rein2024gpqa,
title={{GPQA}: A Graduate-Level Google-Proof Q\&A Benchmark},
author={David Rein and Betty Li Hou and Asa Cooper Stickland and Jackson Petty and Richard Yuanzhe Pang and Julien Dirani and Julian Michael and Samuel R. Bowman},
booktitle={First Conference on Language Modeling},
year={2024}
}

TheoremQA

Data description:

TheoremQA contains 800 high-quality, expert-curated question-answer pairs covering about 350 well-known theorems (e.g., Taylor's theorem, Lagrange's theorem, Quantum Theorem) from science and technology disciplines.

Dataset composition and specification:

Assessment Data Volume:

The test set contains 800 items of data for evaluation.

Data Fields:

| Key | Explanation |
| --- | --- |
| Question | Question |
| Answer | Answer |
| Answer_type | Type of the answer |
| Picture | Image (if available) |

Sample from the source dataset:

{'Question': 'How many ways are there to divide a set of 8 elements into 5 non-empty ordered subsets?',
 'Answer': '11760',
 'Answer_type': 'integer',
 'Picture': ''}
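
The listed answer can be checked with a short calculation, assuming "ordered subsets" means that the elements inside each subset are arranged in order. That reading is an interpretation, not something stated on this page. Under it, the count is the unsigned Lah number L(n, k) = C(n-1, k-1) * n! / k!. The helper name `lah` is hypothetical.

```python
from math import comb, factorial


def lah(n: int, k: int) -> int:
    """Unsigned Lah number for n >= 1: ways to split n labeled elements
    into k non-empty, internally ordered lists."""
    if k < 1 or k > n:
        return 0
    return comb(n - 1, k - 1) * factorial(n) // factorial(k)


print(lah(8, 5))  # 11760
```

Here C(7, 4) = 35 and 8!/5! = 336, and 35 * 336 = 11760, which agrees with the `Answer` field of the sample.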

Paper Citation:

TheoremQA: https://arxiv.org/abs/2305.12524

@inproceedings{chen2023theoremqa,
    title = "{T}heorem{QA}: A Theorem-driven Question Answering Dataset",
    author = "Chen, Wenhu  and
      Yin, Ming  and
      Ku, Max  and
      Lu, Pan  and
      Wan, Yixin  and
      Ma, Xueguang  and
      Xu, Jianyu  and
      Wang, Xinyi  and
      Xia, Tony",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    year = "2023",
    pages = "7889--7901"
}