
Evaluation Data

The following datasets were converted to a standard evaluation prompt format before being evaluated.

MMLU-Pro

Evaluation metric: Exact Match

Data description:

Includes multiple-choice questions from many branches of knowledge; MMLU-Pro is effectively an upgraded version of the multitask benchmark MMLU:

  • The number of options per question has been increased from 4 to 10, greatly reducing the likelihood of guessing the right answer.
  • Additional data sources have been incorporated and the overall difficulty raised, placing greater emphasis on the application of knowledge and reasoning skills.
  • The original 57 subjects have been restructured into 14 broader disciplinary categories, including mathematics, physics, chemistry, economics, computer science, psychology, and law.

Evaluation data volume:

The evaluation set uses all 12,032 instances from the source dataset's test split.

Data Fields:

| Key | Explanation |
| --- | --- |
| question | The question text |
| options | A list containing the multiple choices |
| answer | The correct option |

Sample questions from the source dataset:

{
  "question": "According to Moore’s “ideal utilitarianism,” the right action is the one that brings about the greatest amount of:",
  "choices": ["wealth.", "virtue.", "fairness.", "pleasure.", "peace.", "justice.", "happiness.", "power.", "good.", "knowledge."]
}
{
  "question": "A new compound is synthesized and found to be a monoprotic acid with a molar mass of 248 g/mol. When 0.0050 mol of this acid are dissolved in 0.500 L of water, the pH is measured as 3.89. What is the pKa of this acid?",
  "choices": ["5.78", "4.78", "4.56", "6.89", "7.78", "3.89", "1.23", "2.89", "2.33", "5.33"]
}

Paper citation:

MMLU-Pro: https://arxiv.org/abs/2406.01574

@inproceedings{wang2024mmlupro,
 author = {Yubo Wang and Xueguang Ma and Ge Zhang and Yuansheng Ni and Abhranil Chandra and Shiguang Guo and Weiming Ren and Aaran Arulraj and Xuan He and Ziyan Jiang and Tianle Li and Max Ku and Kai Wang and Alex Zhuang and Rongqi Fan and Xiang Yue and Wenhu Chen},
 booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
 pages = {},
 title = {{MMLU-Pro}: A More Robust and Challenging Multi-Task Language Understanding Benchmark},
 year = {2024}
}

original MMLU: https://arxiv.org/abs/2009.03300

@article{hendryckstest2021,
      title={Measuring Massive Multitask Language Understanding},
      author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
      journal={Proceedings of the International Conference on Learning Representations (ICLR)},
      year={2021}
    }

MIT License

LiveBench

Evaluation metric: Exact Match

Data description:

To avoid dataset leakage from distorting evaluation results, this data is constructed from continuously refreshed sources covering six main ability categories:

  • Math: questions from recent high-school math competitions and olympiads (labelled Competitions in this data), plus synthetic hard math questions (AMPS_Hard).
  • Code: code-generation questions from LeetCode and AtCoder (LCB Generation, adapted from LiveCodeBench), and original code-completion questions (Completion), in which the second half of a GitHub solution to a recent LiveCodeBench question is erased and must be reconstructed.
  • Reasoning: harder "who is lying" puzzles (Web of Lies), zebra puzzles, and similar tasks.
  • Language comprehension: Connections word-grouping puzzles, fixing typos, and reordering shuffled sentences.
  • Instruction following: paraphrasing, simplifying, summarizing, or generating stories from recent Guardian articles, in a specified format.
  • Data analysis: format conversion, detecting mergeable columns, and predicting column names, based on recent Kaggle and Socrata data.

All questions are objective (multiple choice, fill-in-the-blank, etc.), so the correctness of an answer can be determined exactly.

Evaluation data volume:

2024-08-31 release: 1,136 questions = 368 (math) + 128 (code) + 150 (reasoning) + 140 (language) + 200 (instruction following) + 150 (data analysis)

Data fields:

| Key | Explanation |
| --- | --- |
| turns | The question (including any options) |
| ground_truth | The correct answer |

Samples from the source dataset:

{
  "category": "math",
  "turns": ["Let $ABCDEF$ be a convex equilateral hexagon in which all pairs of opposite sides are parallel. The triangle whose sides are extensions of segments $\\overline{AB}$, $\\overline{CD}$, and $\\overline{EF}$ has side lengths $200, 240,$ and $300$. Find the side length of the hexagon. Please think step by step, and then display the answer at the very end of your response. The answer is an integer consisting of exactly 3 digits (including leading zeros), ranging from 000 to 999, inclusive. For example, the answer might be 068 or 972. If you cannot determine the correct answer, take your best guess. Remember to have the three digits as the last part of the response."]
}
{
  "category": "reasoning",
  "turns": ["There are 3 people standing in a line numbered 1 through 3 in a left to right order.\nEach person has a set of attributes: Sport, Music-Genre, Hobby, Nationality.\nThe attributes have the following possible values:\n... Answer the following question:\nWhat is the nationality of the person who listens to dubstep? Return your answer as a single word, in the following format: ***X***, where X is the answer."]
}
{
  "category": "data_analysis",
  "turns": ["Pick the column's class based on the provided column sample. Choose exactly one of the listed classes. Please respond only with the name of the class. \n Column sample: [[1995], [1964], [1986], [2022], [1985]] \n Classes: ['Maize yield' 'code country' 'Year' 'country'] \n Output: \n"]
}
{
  "category": "instruction_following",
  "turns": ["The following are the beginning sentences of a news article from the Guardian: ... Please summarize based on the sentences provided. Your answer must contain a title, wrapped in double angular brackets, such as <<poem of joy>>. Finish your response with this exact phrase Any other questions?. No other words should follow this phrase. There should be 4 paragraphs. Paragraphs are separated with the markdown divider: ***"]
}
{
  "category": "language",
  "turns": ["Please output this exact text, with no changes at all except for fixing the misspellings. Please leave all other stylistic decisions like commas and US vs British spellings as in the original text. ..."]
}
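
Exact-match scoring for these tasks hinges on pulling the final answer out of free-form model output in the formats the sample prompts above request (a trailing three-digit number, a `***X***` wrapper). A minimal sketch, not LiveBench's official parsing code:

```python
import re

def extract_starred(response):
    """Return the last ***X*** span in the response, per the reasoning prompt above."""
    found = re.findall(r"\*\*\*(.+?)\*\*\*", response)
    return found[-1] if found else None

def extract_three_digits(response):
    """Return the last standalone 3-digit group, per the math prompt above."""
    found = re.findall(r"\b\d{3}\b", response)
    return found[-1] if found else None

def exact_match(prediction, ground_truth):
    """Case-insensitive exact match on the extracted answer."""
    return (prediction is not None
            and prediction.strip().lower() == ground_truth.strip().lower())

print(extract_starred("So the nationality is ***British***."))  # British
print(extract_three_digits("Therefore the answer is 080"))      # 080
```

Real scorers also have to handle responses where extraction fails (counted as incorrect) and per-category answer formats beyond these two.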

Paper citation:

LiveBench: https://arxiv.org/abs/2406.19314

@article{livebench,
  author    = {White, Colin and Dooley, Samuel and Roberts, Manley and Pal, Arka and Feuer, Ben and Jain, Siddhartha and Shwartz-Ziv, Ravid and Jain, Neel and Saifullah, Khalid and Naidu, Siddartha and Hegde, Chinmay and LeCun, Yann and Goldstein, Tom and Neiswanger, Willie and Goldblum, Micah},
  title     = {LiveBench: A Challenging, Contamination-Free LLM Benchmark},
  url       = {arXiv preprint arXiv:2406.19314},
  year      = {2024},
}

Apache 2.0 License

CMMU

Evaluation metric: Exact Match

Data description:

CMMU v0.1 contains 3,603 questions, 2,585 of which provide detailed answer explanations. The dataset is split 1:1 into a validation set (1,800 questions) and a test set (1,803 questions). The validation set is fully open to facilitate model testing by researchers.

  • In terms of educational stage, there are 250 elementary school questions, 1,697 middle school questions, and 1,656 high school questions. The elementary school set includes only math questions, while the middle school and high school sets cover seven subjects.

  • The difficulty distribution of the questions is roughly 80% "normal" and 20% "hard." The difficulty levels were determined by experienced teachers who categorized the questions into "normal" and "hard" based on their complexity.

Sample question from the source dataset:

The original question is:
{
    "type": "fill-in-the-blank",
    "question_info": "question",
    "id": "subject_1234",
    "sub_questions": ["sub_question_0", "sub_question_1"],
    "answer": ["answer_0", "answer_1"]
}
Converted questions are:
[
{
    "type": "fill-in-the-blank",
    "question_info": "question" + "sub_question_0",
    "id": "subject_1234-0",
    "answer": "answer_0"
},
{
    "type": "fill-in-the-blank",
    "question_info": "question" + "sub_question_1",
    "id": "subject_1234-1",
    "answer": "answer_1"
}
]
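
The sub-question expansion shown above maps directly to code. A sketch of the conversion, using the field names from the sample (the helper name `expand_sub_questions` is mine):

```python
def expand_sub_questions(item):
    """Split a multi-part CMMU item into one record per sub-question,
    mirroring the original/converted example above: the shared question
    text is prefixed to each sub-question, and the id gains a -<index>
    suffix."""
    return [
        {
            "type": item["type"],
            "question_info": item["question_info"] + sub_question,
            "id": f"{item['id']}-{index}",
            "answer": answer,
        }
        for index, (sub_question, answer)
        in enumerate(zip(item["sub_questions"], item["answer"]))
    ]

item = {
    "type": "fill-in-the-blank",
    "question_info": "question",
    "id": "subject_1234",
    "sub_questions": ["sub_question_0", "sub_question_1"],
    "answer": ["answer_0", "answer_1"],
}
converted = expand_sub_questions(item)
```

Each converted record can then be scored independently by exact match against its own `answer`.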

Paper citation:

CMMU: https://arxiv.org/pdf/2401.14011v3

@article{he2024cmmu,
  title={CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning},
  author={Zheqi He and Xinya Wu and Pengfei Zhou and Richeng Xuan and Guang Liu and Xi Yang and Qiannan Zhu and Hua Huang},
  journal={arXiv preprint arXiv:2401.14011},
  year={2024},
}

Apache License 2.0

CMMLU

Evaluation metric: Exact Match

Data description:

CMMLU is a comprehensive Chinese evaluation benchmark specifically designed to assess the knowledge and reasoning capabilities of language models within a Chinese context. CMMLU covers 67 subjects, ranging from basic academic disciplines to advanced professional fields. It includes natural sciences that require calculation and reasoning, humanities and social sciences that demand domain knowledge, as well as practical knowledge such as Chinese driving regulations.

Moreover, many tasks in CMMLU contain answers that are specific to China and may not be generally applicable in other regions or languages, making it a fully localized Chinese benchmark.

Each question in the dataset is a multiple-choice question with four options, of which only one is correct. The data is stored in comma-separated .csv files.
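Since the data ships as comma-separated .csv files with one four-option question per row, loading and prompt construction are straightforward. The header names below are hypothetical (check the actual CMMLU files for the real column layout), and the row content is a made-up toy example:

```python
import csv
import io

# Hypothetical header names and a toy row, for illustration only.
sample_csv = (
    "Question,A,B,C,D,Answer\n"
    "Which organelle carries out translation?,Nucleus,Ribosome,Golgi,Lysosome,B\n"
)

def iter_items(fh):
    """Yield (prompt, answer) pairs from a four-option multiple-choice CSV."""
    for row in csv.DictReader(fh):
        options = "\n".join(f"{letter}. {row[letter]}" for letter in "ABCD")
        yield f"{row['Question']}\n{options}\nAnswer:", row["Answer"]

pairs = list(iter_items(io.StringIO(sample_csv)))
```

Exact match then compares the letter the model emits against the `Answer` column.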

Sample questions from the source dataset:

### Question 1

Two types of cells from the same species each produce a secreted protein. The amino acid content of these two proteins is identical, but the sequence of the amino acids is different. The reason for this difference is that the:

- A. Types of tRNA involved are different  
- B. Same codon determines different amino acids  
- C. mRNA nucleotide sequences are different  
- D. Ribosomal components are different  

**Answer:** C


### Question 2

A certain plant virus, Virus V, is transmitted between rice plants by rice planthoppers when they feed on rice sap. An increase in the number of frogs in the rice fields can reduce the transmission of this virus among rice plants. Which of the following statements is correct?

- A. Frogs and rice planthoppers have a predator-prey relationship  
- B. Rice plants and Virus V have a mutualistic relationship  
- C. Virus V and frogs have a parasitic relationship  
- D. Rice plants and frogs are competitors  

**Answer:** (To be filled)

Paper citation:

CMMLU: https://arxiv.org/abs/2306.09212

@misc{li2023cmmlu,
      title={CMMLU: Measuring massive multitask language understanding in Chinese},
      author={Haonan Li and Yixuan Zhang and Fajri Koto and Yifei Yang and Hai Zhao and Yeyun Gong and Nan Duan and Timothy Baldwin},
      year={2023},
      eprint={2306.09212},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

The CMMLU dataset is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.