
Evaluation Data

The following datasets were converted into standard evaluation prompts before being evaluated.

MMLU-Pro

Evaluation metric: Exact Match

Data Description:

Multiple-choice questions drawn from many branches of knowledge; effectively an upgraded version of the massive multitask benchmark MMLU:

  • The number of options has been increased from 4 to 10, which greatly reduces the likelihood of guessing the right answer.
  • Additional data sources were incorporated and the overall difficulty was raised, with greater emphasis on applying knowledge and reasoning.
  • The original 57 subjects were reorganized into 14 broader disciplinary categories, including mathematics, physics, chemistry, economics, computer science, psychology, and law.

Evaluation Data Volume:

The evaluation set consists of the 12,032 instances in the source dataset's test split.

Data Fields:

KEYS        EXPLANATION
question    Question text
options     A list containing the candidate choices
answer      The correct option
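
For readers who want to inspect the data directly, a minimal loading sketch is shown below. It assumes the Hugging Face copy of the dataset (TIGER-Lab/MMLU-Pro), which this page does not itself reference; adjust the identifier if your copy differs.

# Minimal sketch: load the MMLU-Pro test split and inspect its fields.
# Assumes the Hugging Face mirror "TIGER-Lab/MMLU-Pro" (not named in this document).
from datasets import load_dataset

ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
print(len(ds))                # expected to match the 12,032 instances noted above
example = ds[0]
print(example["question"])    # the question text
print(example["options"])     # list of up to 10 candidate answers
print(example["answer"])      # the correct option, given as a letter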

Sample questions from the source dataset:

{
  "question": "According to Moore’s “ideal utilitarianism,” the right action is the one that brings about the greatest amount of:",
  "choices": ["wealth.", "virtue.", "fairness.", "pleasure.", "peace.", "justice.", "happiness.", "power.", "good.", "knowledge."]
}
{
  "question": "A new compound is synthesized and found to be a monoprotic acid with a molar mass of 248 g/mol. When 0.0050 mol of this acid are dissolved in 0.500 L of water, the pH is measured as 3.89. What is the pKa of this acid?",
  "choices": ["5.78", "4.78", "4.56", "6.89", "7.78", "3.89", "1.23", "2.89", "2.33", "5.33"]
}
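
This page states that records are converted into standard evaluation prompts and scored by Exact Match, but does not show the template. The sketch below illustrates one plausible conversion: the options are lettered A-J, the letter emitted by the model is extracted, and it is compared exactly against the gold letter. The prompt wording and the extraction pattern are assumptions for illustration, not the exact template used here.

import re
import string

def build_prompt(record: dict) -> str:
    """Turn an MMLU-Pro record into a lettered multiple-choice prompt.
    The wording is illustrative; the actual template is not shown in this document."""
    letters = string.ascii_uppercase  # A..J for up to 10 options
    lines = [record["question"], ""]
    for letter, option in zip(letters, record["choices"]):
        lines.append(f"{letter}. {option}")
    lines.append("")
    lines.append('Answer with the letter of the correct option in the form "Answer: X".')
    return "\n".join(lines)

def exact_match(model_output: str, gold_letter: str) -> bool:
    """Extract the model's final answer letter and compare it exactly to the gold letter."""
    m = re.search(r"Answer:\s*([A-J])", model_output)
    predicted = m.group(1) if m else ""
    return predicted == gold_letter

record = {
    "question": "According to Moore's \"ideal utilitarianism\", the right action is the one that brings about the greatest amount of:",
    "choices": ["wealth.", "virtue.", "fairness.", "pleasure.", "peace.",
                "justice.", "happiness.", "power.", "good.", "knowledge."],
}
print(build_prompt(record))
print(exact_match("... Answer: I", "I"))  # True; "good." is the ninth option, i.e. letter I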

Paper citation:

MMLU-Pro: https://arxiv.org/abs/2406.01574

@inproceedings{wang2024mmlupro,
 author = {Yubo Wang and Xueguang Ma and Ge Zhang and Yuansheng Ni and Abhranil Chandra and Shiguang Guo and Weiming Ren and Aaran Arulraj and Xuan He and Ziyan Jiang and Tianle Li and Max Ku and Kai Wang and Alex Zhuang and Rongqi Fan and Xiang Yue and Wenhu Chen},
 booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
 pages = {},
 title = {{MMLU-Pro}: A More Robust and Challenging Multi-Task Language Understanding Benchmark},
 year = {2024}
}

Original MMLU: https://arxiv.org/abs/2009.03300

@article{hendryckstest2021,
  title   = {Measuring Massive Multitask Language Understanding},
  author  = {Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
  journal = {Proceedings of the International Conference on Learning Representations (ICLR)},
  year    = {2021}
}

MIT License

LiveBench

Evaluation metric: Exact Match

Data Description:

To avoid dataset leakage (contamination) from affecting the results, this benchmark is built from frequently refreshed, dynamic sources covering six main ability categories:

  • Maths: questions from recent high school maths competitions (labelled Competitions in this data) and the most recent Olympiads, plus synthetic hard maths questions (AMPS_Hard).
  • Code: code-generation questions from LeetCode and AtCoder (LCB Generation, adapted from LiveCodeBench), and original code-completion questions (Completion, in which the second half of a GitHub solution to a recent LiveCodeBench problem is erased and must be filled in).
  • Reasoning: harder "who is lying" questions (web of lies), zebra puzzles, and more.
  • Language comprehension: New York Times Connections word-grouping puzzles, fixing typos, and reordering jumbled sentences.
  • Instruction following: paraphrasing, simplifying, summarising, and story generation over recent Guardian news content, with answers required in a specific format.
  • Data analysis: format conversion, detecting columns that can be joined, and predicting a column's class, based on recent Kaggle and Socrata data.

All questions are objective (e.g. multiple-choice or fill-in-the-blank), so the correctness of an answer can be determined automatically.

Evaluation Data Volume:

2024-08-31 release: 1,136 questions = 368 (Maths) + 128 (Code) + 150 (Reasoning) + 140 (Language) + 200 (Instruction Following) + 150 (Data Analysis)

Data Fields:

KEYS           EXPLANATION
turns          Question (including options)
ground_truth   Correct answer

Samples from the source dataset:

{
  "category": "math",
  "turns": ["Let $ABCDEF$ be a convex equilateral hexagon in which all pairs of opposite sides are parallel. The triangle whose sides are extensions of segments $\\overline{AB}$, $\\overline{CD}$, and $\\overline{EF}$ has side lengths $200, 240,$ and $300$. Find the side length of the hexagon. Please think step by step, and then display the answer at the very end of your response. The answer is an integer consisting of exactly 3 digits (including leading zeros), ranging from 000 to 999, inclusive. For example, the answer might be 068 or 972. If you cannot determine the correct answer, take your best guess. Remember to have the three digits as the last part of the response."]
}
{
  "category": "reasoning",
  "turns": ["There are 3 people standing in a line numbered 1 through 3 in a left to right order.\nEach person has a set of attributes: Sport, Music-Genre, Hobby, Nationality.\nThe attributes have the following possible values:\n... Answer the following question:\nWhat is the nationality of the person who listens to dubstep? Return your answer as a single word, in the following format: ***X***, where X is the answer."]
}
{
  "category": "data_analysis",
  "turns": ["Pick the column's class based on the provided column sample. Choose exactly one of the listed classes. Please respond only with the name of the class. \n Column sample: [[1995], [1964], [1986], [2022], [1985]] \n Classes: ['Maize yield' 'code country' 'Year' 'country'] \n Output: \n"]
}
{
  "category": "instruction_following",
  "turns": ["The following are the beginning sentences of a news article from the Guardian: ... Please summarize based on the sentences provided. Your answer must contain a title, wrapped in double angular brackets, such as <<poem of joy>>. Finish your response with this exact phrase Any other questions?. No other words should follow this phrase. There should be 4 paragraphs. Paragraphs are separated with the markdown divider: ***"]
}
{
  "category": "language",
  "turns": ["Please output this exact text, with no changes at all except for fixing the misspellings. Please leave all other stylistic decisions like commas and US vs British spellings as in the original text. ..."]
}
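
Each LiveBench question instructs the model to emit its answer in a fixed format (a trailing three-digit integer in the maths sample, ***X*** in the reasoning sample, and so on), so Exact Match scoring reduces to extracting that span and comparing it to ground_truth. The extraction rules below are an illustrative sketch based only on the sample instructions quoted above, not LiveBench's official scoring code; the example outputs and answers are placeholders.

import re

def extract_answer(category: str, model_output: str) -> str:
    """Pull the final answer out of a model response, following the answer formats
    requested in the sample prompts above (illustrative, not the official parser)."""
    text = model_output.strip()
    if category == "math":
        # "The answer is an integer consisting of exactly 3 digits ... as the last part of the response."
        m = re.search(r"(\d{3})\s*$", text)
    elif category == "reasoning":
        # "Return your answer ... in the following format: ***X***"
        m = re.search(r"\*\*\*(.+?)\*\*\*", text)
    else:
        # Fall back to the whole stripped response for the remaining categories.
        return text
    return m.group(1).strip() if m else ""

def exact_match(category: str, model_output: str, ground_truth: str) -> bool:
    return extract_answer(category, model_output) == ground_truth.strip()

# Placeholder outputs and ground truths, purely to demonstrate the comparison:
print(exact_match("math", "Working step by step ... so the answer is 123", "123"))   # True
print(exact_match("reasoning", "Therefore the answer is ***dutch***", "dutch"))      # True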

Paper citation:

LiveBench: https://arxiv.org/abs/2406.19314

@article{livebench,
  author    = {White, Colin and Dooley, Samuel and Roberts, Manley and Pal, Arka and Feuer, Ben and Jain, Siddhartha and Shwartz-Ziv, Ravid and Jain, Neel and Saifullah, Khalid and Naidu, Siddartha and Hegde, Chinmay and LeCun, Yann and Goldstein, Tom and Neiswanger, Willie and Goldblum, Micah},
  title     = {LiveBench: A Challenging, Contamination-Free LLM Benchmark},
  journal   = {arXiv preprint arXiv:2406.19314},
  year      = {2024},
}

Apache 2.0 License (see the benchmark's detailed copyright statement)