Evaluation Data
The following datasets were converted to standard evaluation prompts before evaluation.
MATH-hard
Data description:
The Mathematics Aptitude Test of Heuristics (MATH) dataset contains problems from math competitions such as AMC 10, AMC 12, and AIME. Each problem is accompanied by a detailed solution. We use the hard subset from the official lm-eval-harness integration, which contains only the Level 5 questions from the original data.
Dataset composition and specification:
Source data volume:
The original data contains 1,324 Level 5 questions.
Data Segments:
KEY | DESCRIPTION |
---|---|
problem | Math competition problem |
solution | Detailed solution steps |
level | The difficulty level of the problem ranges from "Level 1" to "Level 5". Level 1 is the easiest, and Level 5 is the hardest. |
type | The subject category of the problem: Algebra, Counting & Probability, Geometry, Intermediate Algebra, Number Theory, Prealgebra, and Precalculus |
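As a minimal sketch of how a hard subset could be selected using the `level` field described above, the snippet below filters a list of records down to those labelled "Level 5". The helper name `select_level5` and the toy records are hypothetical and are not part of the lm-eval-harness integration.

```python
from typing import Iterable


def select_level5(records: Iterable[dict]) -> list[dict]:
    """Keep only records whose `level` field is exactly "Level 5"."""
    return [r for r in records if r.get("level") == "Level 5"]


# Toy records made up for illustration; they are not taken from the dataset.
toy_records = [
    {"problem": "An easy problem", "level": "Level 1", "type": "Prealgebra"},
    {"problem": "A hard problem", "level": "Level 5", "type": "Number Theory"},
]

hard_subset = select_level5(toy_records)
print(len(hard_subset))
```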
Sample of source dataset:
{
"problem": "John draws a regular five pointed star in the sand, and at each of the 5 outward-pointing points and 5 inward-pointing points he places one of ten different sea shells. How many ways can he place the shells, if reflections and rotations of an arrangement are considered equivalent?",
"level": "Level 5",
"type": "Counting & Probability"
}
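The conversion of a record like the one above into an evaluation prompt might look roughly as follows. The "Problem: / Solution:" template and the function name `build_math_prompt` are assumptions made for illustration; the actual template used for this evaluation is not specified here.

```python
def build_math_prompt(record: dict) -> str:
    """Format one MATH record as a zero-shot prompt (hypothetical template)."""
    return f"Problem:\n{record['problem']}\n\nSolution:"


# Shortened stand-in for a source record, using the same keys as the sample.
record = {
    "problem": "How many ways can he place the shells?",
    "level": "Level 5",
    "type": "Counting & Probability",
}

prompt = build_math_prompt(record)
print(prompt)
```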
Paper Citation:
MATH: https://arxiv.org/abs/2103.03874
@inproceedings{hendrycks2021MATH,
author = {Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob},
booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
title = {Measuring Mathematical Problem Solving With the MATH Dataset},
volume = {1},
year = {2021}
}
Dataset Copyright Usage Instructions:
MIT License
GPQA
Data description:
Google-Proof Q&A (GPQA) is a set of specialised questions written by experts in biology, physics, and chemistry. It is called "Google-proof" because non-experts reach only 34% accuracy even with unrestricted web access. The extended set consists of 546 multiple-choice questions, of which 448 form the main set and 198 form the most challenging ‘diamond’ subset.
Sample of source dataset:
{
'Question': 'Two quantum states with energies E1 and E2 have a lifetime of 10^-9 sec and 10^-8 sec, respectively. We want to clearly distinguish these two energy levels. Which one of the following options could be their energy difference so that they can be clearly resolved?',
'Subdomain': 'Physics (general)',
'Correct Answer': '10^-4 eV',
'Incorrect Answer 1': '10^-11 eV',
'Incorrect Answer 2': '10^-8 eV',
'Incorrect Answer 3': '10^-9 eV',
'Explanation': 'According to the uncertainty principle, Delta E* Delta t=hbar/2. Delta t is the lifetime and Delta E is the width of the energy level. With Delta t=10^-9 s==> Delta E1= 3.3 10^-7 ev. And Delta t=10^-11 s gives Delta E2=3.310^-8 eV. Therefore, the energy difference between the two states must be significantly greater than 10^-7 ev. So the answer is 10^-4 ev.'
}
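Because each record stores one correct answer and three incorrect answers in separate fields, turning it into a multiple-choice prompt requires shuffling the options and recording which letter holds the correct one. The sketch below shows one way to do this; the prompt layout, the function name `build_gpqa_prompt`, and the `seed` parameter are assumptions, not the template actually used in this evaluation.

```python
import random


def build_gpqa_prompt(record: dict, seed: int = 0) -> tuple[str, str]:
    """Return (prompt, gold_letter) for one GPQA record (hypothetical layout)."""
    options = [
        record["Correct Answer"],
        record["Incorrect Answer 1"],
        record["Incorrect Answer 2"],
        record["Incorrect Answer 3"],
    ]
    # A seeded generator keeps the option order reproducible across runs.
    random.Random(seed).shuffle(options)

    letters = "ABCD"
    lines = [record["Question"], ""]
    lines += [f"({letter}) {text}" for letter, text in zip(letters, options)]
    lines += ["", "Answer:"]

    gold_letter = letters[options.index(record["Correct Answer"])]
    return "\n".join(lines), gold_letter


# Answer fields copied from the sample above; the question is shortened.
sample = {
    "Question": "Which energy difference allows the two levels to be clearly resolved?",
    "Correct Answer": "10^-4 eV",
    "Incorrect Answer 1": "10^-11 eV",
    "Incorrect Answer 2": "10^-8 eV",
    "Incorrect Answer 3": "10^-9 eV",
}

prompt, gold = build_gpqa_prompt(sample, seed=0)
print(prompt)
print("gold:", gold)
```

Shuffling matters here because the source fields always list the correct answer first; presenting the options in stored order would leak the answer position.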
Paper Citation:
GPQA: https://openreview.net/forum?id=Ti67584b98
@inproceedings{
rein2024gpqa,
title={{GPQA}: A Graduate-Level Google-Proof Q\&A Benchmark},
author={David Rein and Betty Li Hou and Asa Cooper Stickland and Jackson Petty and Richard Yuanzhe Pang and Julien Dirani and Julian Michael and Samuel R. Bowman},
booktitle={First Conference on Language Modeling},
year={2024}
}
TheoremQA
Data description:
TheoremQA contains 800 high-quality question-answer pairs covering roughly 350 well-known theorems (e.g., Taylor's Theorem, Lagrange's Theorem, Quantum Theorem) across science and technology disciplines.
Dataset composition and specification:
Assessment Data Volume:
The test set contains 800 items for evaluation.
Data Segments:
KEY | DESCRIPTION |
---|---|
Question | Question |
Answer | Answer |
Answer_type | Type of the answer |
Picture | Image (if available) |
Sample from the source dataset:
{'Question': 'How many ways are there to divide a set of 8 elements into 5 non-empty ordered subsets?',
'Answer': '11760',
'Answer_type': 'integer',
'Picture': ''}
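The `Answer_type` field suggests that predictions should be compared differently depending on the expected type. The sketch below illustrates the idea under stated assumptions: only the `integer` type appears in the sample above, so the `float` branch, the relative tolerance, and the string fallback are hypothetical choices rather than the official scoring rules.

```python
import math


def is_correct(prediction: str, record: dict, rel_tol: float = 1e-2) -> bool:
    """Compare a model prediction with the gold answer, guided by Answer_type."""
    answer_type = record["Answer_type"]
    gold = record["Answer"]
    try:
        if answer_type == "integer":
            return int(prediction.strip()) == int(gold)
        if answer_type == "float":  # assumed type name, not shown in the sample
            return math.isclose(float(prediction), float(gold), rel_tol=rel_tol)
    except ValueError:
        # The prediction could not be parsed as a number.
        return False
    # Fallback for any other type: case-insensitive string match (assumption).
    return prediction.strip().lower() == gold.strip().lower()


sample = {
    "Question": "How many ways are there to divide a set of 8 elements "
                "into 5 non-empty ordered subsets?",
    "Answer": "11760",
    "Answer_type": "integer",
    "Picture": "",
}

print(is_correct("11760", sample))
print(is_correct("not a number", sample))
```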
Paper Citation:
TheoremQA: https://arxiv.org/abs/2305.12524
@inproceedings{chen2023theoremqa,
title = "{T}heorem{QA}: A Theorem-driven Question Answering Dataset",
author = "Chen, Wenhu and
Yin, Ming and
Ku, Max and
Lu, Pan and
Wan, Yixin and
Ma, Xueguang and
Xu, Jianyu and
Wang, Xinyi and
Xia, Tony",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
year = "2023",
pages = "7889--7901"
}