评测数据

以下数据集均转化为标准评测Prompt再进行评测

MMLU-Pro

#评测指标-Exact Match

数据描述：

包括来自不同知识分支的多项选择题，相当于大型多任务测试数据集MMLU的一个升级版本：

选项由原来的4个增加到10个，大大降低了“蒙对”答案的可能
补充了一些数据来源，难度上也有所增加，更考验知识活用与推理。
把原来的57个学科主题合并为14个学科大类，包括数学、物理、化学、经济学、计算机科学、心理学、法律等。

评测数据量：

评测数据为源数据测试集中的12,032个实例

数据字段：

KEYS	EXPLAIN
question	问题
options	包含多个选项的列表
answer	正确选项

源数据集问题样例：

{
  "question": "According to Moore’s “ideal utilitarianism,” the right action is the one that brings about the greatest amount of:",
  "choices": ["wealth.", "virtue.", "fairness.", "pleasure.", "peace.", "justice.", "happiness.", "power.", "good.", "knowledge."]
}
{
  "question": "A new compound is synthesized and found to be a monoprotic acid with a molar mass of 248 g/mol. When 0.0050 mol of this acid are dissolved in 0.500 L of water, the pH is measured as 3.89. What is the pKa of this acid?",
  "choices": ["5.78", "4.78", "4.56", "6.89", "7.78", "3.89", "1.23", "2.89", "2.33", "5.33"]
}

论文引用：

MMLU-Pro: https://arxiv.org/abs/2406.01574

@inproceedings{wang2024mmlupro,
 author = {Yubo Wang and Xueguang Ma and Ge Zhang and Yuansheng Ni and Abhranil Chandra and Shiguang Guo and Weiming Ren and Aaran Arulraj and Xuan He and Ziyan Jiang and Tianle Li and Max Ku and Kai Wang and Alex Zhuang and Rongqi Fan and Xiang Yue and Wenhu Chen},
 booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
 pages = {},
 title = {{MMLU-Pro}: A More Robust and Challenging Multi-Task Language Understanding Benchmark},
 year = {2024}
}

原始MMLU: https://arxiv.org/abs/2009.03300

@article{hendryckstest2021,
      title={Measuring Massive Multitask Language Understanding},
      author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
      journal={Proceedings of the International Conference on Learning Representations (ICLR)},
      year={2021}
    }

数据集版权使用说明：

MIT License

LiveBench

#评测指标-Exact Match

数据描述：

为避免数据集泄露对评测结果产生影响，该数据根据动态来源构造了针对六大类能力的评测数据：

数学：多项高中数学竞赛（本数据中计为Competitions）或奥赛（Olympiad）最近一届赛题，以及合成的困难数学题（AMPS_Hard）
代码：LeetCode、AtCoder中的代码生成类问题（LCB Generation，改编自LiveCodeBench），以及原创代码补全问题（Completion，根据LiveCodeBench近期原题从GitHub上的题解里抹掉后半部分代码）
推理：增强难度的“谁在说谎”类问题（web of lies）、斑马逻辑谜题（zebra puzzle）等
语言理解：华尔街日报Connections词汇分组题、修改错别字、乱序语句重排序
指令遵循：按特定格式要求对卫报近期新闻内容进行复述、简化、总结、故事生成
数据分析：根据近期Kaggle和Socrata数据进行格式转换、可合并数据列检测、数据列名预测

题型均为可以精确判定答案正确与否的客观题形式，如选择题、填空题等。

评测数据量：

2024-08-31数据：1,136 = 368（数学）+ 128（代码）+ 150（推理）+ 140（语言）+ 200（指令）+ 150（分析）

数据字段：

KEYS	EXPLAIN
turns	问题（包括选项）
ground_truth	正确答案

源数据集问题样例：

{
  <!-- 类别: "math", -->
  "turns": ["Let $ABCDEF$ be a convex equilateral hexagon in which all pairs of opposite sides are parallel. The triangle whose sides are extensions of segments $\\overline{AB}$, $\\overline{CD}$, and $\\overline{EF}$ has side lengths $200, 240,$ and $300$. Find the side length of the hexagon. Please think step by step, and then display the answer at the very end of your response. The answer is an integer consisting of exactly 3 digits (including leading zeros), ranging from 000 to 999, inclusive. For example, the answer might be 068 or 972. If you cannot determine the correct answer, take your best guess. Remember to have the three digits as the last part of the response."]
}
{
  <!-- 类别: "reasoning", -->
  "turns": ["There are 3 people standing in a line numbered 1 through 3 in a left to right order.\nEach person has a set of attributes: Sport, Music-Genre, Hobby, Nationality.\nThe attributes have the following possible values:\n... Answer the following question:\nWhat is the nationality of the person who listens to dubstep? Return your answer as a single word, in the following format: ***X***, where X is the answer."]
}
{
  <!-- 类别: "data_analysis", -->
  "turns": ["Pick the column's class based on the provided column sample. Choose exactly one of the listed classes. Please respond only with the name of the class. \n Column sample: [[1995], [1964], [1986], [2022], [1985]] \n Classes: ['Maize yield' 'code country' 'Year' 'country'] \n Output: \n"]
}
{
  <!-- 类别: "instruction_following", -->
  "turns": ["The following are the beginning sentences of a news article from the Guardian: ... Please summarize based on the sentences provided. Your answer must contain a title, wrapped in double angular brackets, such as <<poem of joy>>. Finish your response with this exact phrase Any other questions?. No other words should follow this phrase. There should be 4 paragraphs. Paragraphs are separated with the markdown divider: ***"]
}
{
  <!-- 类别: "language", -->
  "turns": ["Please output this exact text, with no changes at all except for fixing the misspellings. Please leave all other stylistic decisions like commas and US vs British spellings as in the original text. ..."]
}

论文引用：

LiveBench: https://arxiv.org/abs/2406.19314

@article{livebench,
  author    = {White, Colin and Dooley, Samuel and Roberts, Manley and Pal, Arka and Feuer, Ben and Jain, Siddhartha and Shwartz-Ziv, Ravid and Jain, Neel and Saifullah, Khalid and Naidu, Siddartha and Hegde, Chinmay and LeCun, Yann and Goldstein, Tom and Neiswanger, Willie and Goldblum, Micah},
  title     = {LiveBench: A Challenging, Contamination-Free LLM Benchmark},
  url       = {arXiv preprint arXiv:2406.19314},
  year      = {2024},
}

数据集版权使用说明：

Apache 2.0 （详细版权说明）

评测数据 ​

MMLU-Pro ​

数据描述： ​

评测数据量： ​

数据字段： ​

源数据集问题样例： ​

论文引用： ​

数据集版权使用说明： ​

LiveBench ​

数据描述： ​

评测数据量： ​

数据字段： ​

源数据集问题样例： ​

论文引用： ​

数据集版权使用说明： ​

评测数据

MMLU-Pro

数据描述：

评测数据量：

数据字段：

源数据集问题样例：

论文引用：

数据集版权使用说明：

LiveBench

数据描述：

评测数据量：

数据字段：

源数据集问题样例：

论文引用：

数据集版权使用说明：