
Evaluation Datasets

TACO

Evaluation metric: Pass@k

Dataset description:

TACO (Topics in Algorithmic COde generation dataset) is an algorithm-focused code generation dataset, designed to provide a more challenging training set and evaluation benchmark for code generation models. It consists of harder, competitive-programming problems that are closer to real-world programming scenarios, emphasizing the training and evaluation of a model's problem understanding and reasoning abilities in practical settings, rather than merely its ability to implement a predefined function.

  • Larger scale: TACO consists of a training set (25,443 problems) and a test set (1,000 problems), making it the largest code generation dataset to date.
  • Higher quality: each problem in TACO is matched with as diverse a set of reference solutions as possible, 1.55 million solutions in total, which reduces overfitting during training and improves the validity of evaluation results.
  • Fine-grained labels: each problem in TACO carries fine-grained labels for task topic, algorithm, skill type, and difficulty, providing a more precise reference for training and evaluating code generation models.

Dataset composition and format:

Dataset size:

The source dataset consists of a training set (25,443 problems) and a test set (1,000 problems).

Evaluation data size:

The FlagEval evaluation data is the 1,000 samples of the test set, with 200 problems per difficulty level.

Source data format:

from datasets import load_dataset
load_dataset("BAAI/TACO")

DatasetDict({
    train: Dataset({
        features: ['question', 'solutions', 'starter_code', 'input_output', 'difficulty', 'raw_tags', 'name', 'source', 'tags', 'skill_types', 'url', 'Expected Auxiliary Space', 'time_limit', 'date', 'picture_num', 'memory_limit', 'Expected Time Complexity'],
        num_rows: 25443
    })
    test: Dataset({
        features: ['question', 'solutions', 'starter_code', 'input_output', 'difficulty', 'raw_tags', 'name', 'source', 'tags', 'skill_types', 'url', 'Expected Auxiliary Space', 'time_limit', 'date', 'picture_num', 'memory_limit', 'Expected Time Complexity'],
        num_rows: 1000
    })
})

Source data fields:

KEYS and their meanings (all fields are of type string):

question: problem description
solutions: some Python solutions
input_output: JSON string with the "inputs" and "outputs" of the test cases; may also include "fn_name", the name of the function to call
difficulty: difficulty level of the problem
picture_num: the number of pictures in the problem
source: the source of the problem
url: URL of the source of the problem
date: the date of the problem
starter_code: starter code to include in prompts
time_limit: the time limit for solving the problem
memory_limit: the memory limit for solving the problem
Expected Auxiliary Space: the auxiliary space expected to solve the problem
Expected Time Complexity: the time complexity expected to solve the problem
raw_tags: the topics of the programming task
tags: the manually annotated algorithms needed to solve the problem
skill_types: the programming skill types mapped to the problem
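Since `input_output` is stored as a JSON string, it has to be decoded before use; the presence of "fn_name" distinguishes function-call problems from stdin/stdout problems. A minimal decoding sketch (the payload below is a hypothetical example, not taken from the dataset):

```python
import json

# Hypothetical call-based problem: "fn_name" signals that candidates are
# judged by calling a function, not by piping stdin/stdout.
raw = json.dumps({
    "inputs": [[1, 2], [3, 4]],
    "outputs": [3, 7],
    "fn_name": "add",
})

spec = json.loads(raw)
style = "function call" if "fn_name" in spec else "stdin/stdout"
print(style, len(spec["inputs"]))  # function call 2
```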

Source dataset sample:

{
  "question": "You have a deck of $n$ cards, and you'd like to reorder it to a new one.\n\nEach card has a value between $1$ and $n$ equal to $p_i$. ...",
  "solutions": [
    "import heapq\nfrom math import sqrt\nimport operator\nimport sys\ninf_var = 0\nif inf_var == 1:\n\tinf = open('input.txt', 'r')\nelse:\n\tinf = sys.stdin\n ...",
    "t = int(input())\nfor _ in range(t):\n\tn = int(input())\n\tp = list(map(int, input().split()))\n\tans = []\n\tp1 = [-1] * (n + 1)\n\tfor i in range(n):\n\t\tp1[p[i]] = i\n\ti = n\n\twhile i:\n\t\twhile i > 0 and p1[i] == -1:\n\t\t\ti -= 1\n\t\telse:\n\t\t\tif i:\n\t\t\t\tk = 0\n\t\t\t\tfor j in range(p1[i], n):\n\t\t\t\t\tans.append(p[j])\n\t\t\t\t\tp1[p[j]] = -1\n\t\t\t\t\tk += 1\n\t\t\t\tn -= k\n\t\t\t\ti -= 1\n\t\t\telse:\n\t\t\t\tbreak\n\tprint(*ans)\n",
    "import sys\n\ndef get_ints():\n\treturn map(int, sys.stdin.readline().strip().split())\n\ndef get_list():\n\treturn list(map(int, sys.stdin.readline().strip().split()))\n\ndef get_list_string():\n\treturn list(map(str, sys.stdin.readline().strip().split()))\n\ndef get_string():\n\treturn sys.stdin.readline().strip()\n\ndef get_int():\n\treturn int(sys.stdin.readline().strip())\n\ndef get_print_int(x):\n\tsys.stdout.write(str(x) + '\\n')\n\ndef get_print(x):\n\tsys.stdout.write(x + '\\n')\n\ndef get_print_int_same(x):\n\tsys.stdout.write(str(x) + ' ')\n\ndef get_print_same(x):\n\tsys.stdout.write(x + ' ')\nfrom sys import maxsize\n\ndef solve():\n\tfor _ in range(get_int()):\n\t\tn = get_int()\n\t\tarr = get_list()\n\t\ti = n - 1\n\t\tj = n - 1\n\t\ttemp = sorted(arr)\n\t\tvis = [False] * n\n\t\tans = []\n\t\twhile j >= 0:\n\t\t\tt = j\n\t\t\ttt = []\n\t\t\twhile t >= 0 and arr[t] != temp[i]:\n\t\t\t\tvis[arr[t] - 1] = True\n\t\t\t\ttt.append(arr[t])\n\t\t\t\tt -= 1\n\t\t\tvis[arr[t] - 1] = True\n\t\t\ttt.append(arr[t])\n\t\t\ttt = tt[::-1]\n\t\t\tfor k in tt:\n\t\t\t\tans.append(k)\n\t\t\tj = t - 1\n\t\t\twhile i >= 0 and vis[i]:\n\t\t\t\ti -= 1\n\t\tget_print(' '.join(map(str, ans)))\nsolve()\n",
    ...
  ],
  "starter_code": "",
  "input_output": {
    "inputs": [
      "4\n4\n1 2 3 4\n5\n1 5 2 4 3\n6\n4 2 5 3 6 1\n1\n1\n",
      "4\n4\n2 1 3 4\n5\n1 5 2 4 3\n6\n4 2 5 3 6 1\n1\n1\n",
      "4\n4\n2 1 3 4\n5\n1 5 2 4 3\n6\n2 4 5 3 6 1\n1\n1\n",
      "4\n4\n1 2 3 4\n5\n1 5 2 4 3\n6\n4 2 5 3 6 1\n1\n1\n"
    ],
    "outputs": [
      "4 3 2 1\n5 2 4 3 1\n6 1 5 3 4 2\n1\n",
      "4 3 2 1\n5 2 4 3 1\n6 1 5 3 4 2\n1\n",
      "4 3 2 1\n5 2 4 3 1\n6 1 5 3 4 2\n1\n",
      "\n4 3 2 1\n5 2 4 3 1\n6 1 5 3 4 2\n1\n"
    ]
  },
  "difficulty": "EASY",
  "raw_tags": [
    "data structures",
    "greedy",
    "math"
  ],
  "name": null,
  "source": "codeforces",
  "tags": [
    "Data structures",
    "Mathematics",
    "Greedy algorithms"
  ],
  "skill_types": [
    "Data structures",
    "Greedy algorithms"
  ],
  "url": "https://codeforces.com/problemset/problem/1492/B",
  "Expected Auxiliary Space": null,
  "time_limit": "1 second",
  "date": "2021-02-23",
  "picture_num": "0",
  "memory_limit": "512 megabytes",
  "Expected Time Complexity": null
}
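The sample above carries no "fn_name", so candidate programs are judged by feeding each entry of "inputs" to stdin and comparing stdout with the expected output. A minimal judging sketch (an assumption about the harness, not the official TACO evaluator), demonstrated on a hypothetical toy problem rather than the competition problem above:

```python
import json
import subprocess
import sys

def judge(candidate_source, input_output):
    """Run a candidate program once per test case and compare its
    stripped stdout with the expected output."""
    cases = json.loads(input_output) if isinstance(input_output, str) else input_output
    for stdin_data, expected in zip(cases["inputs"], cases["outputs"]):
        result = subprocess.run(
            [sys.executable, "-c", candidate_source],
            input=stdin_data, capture_output=True, text=True, timeout=10,
        )
        if result.stdout.strip() != expected.strip():
            return False
    return True

# Hypothetical toy problem: read an integer, print its double.
candidate = "print(2 * int(input()))"
io_spec = {"inputs": ["3\n", "10\n"], "outputs": ["6\n", "20\n"]}
print(judge(candidate, io_spec))  # True
```

A production harness would additionally enforce the per-problem time and memory limits and sandbox the candidate process.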

Citation:

@article{li2023taco,
  title={TACO: Topics in Algorithmic COde generation dataset},
  author={Rongao Li and Jie Fu and Bo-Wen Zhang and Tao Huang and Zhihong Sun and Chen Lyu and Guang Liu and Zhi Jin and Ge Li},
  journal={arXiv preprint arXiv:2312.14852},
  year={2023}
}

Dataset license:

Apache 2.0 (see the detailed license statement).

HumanEval

Evaluation metric: Pass@k

Metric description:

For each unit-test problem, the model generates k (k = 1, 10, 100) code samples; if any sample passes the unit tests, the problem is considered solved, and the total fraction of problems solved is reported as the Pass@k score. Note: in practice, the usual approach is to draw n = 200 samples per problem, count the number c that pass, and compute the unbiased estimator 1 - C(n - c, k) / C(n, k) to reduce the variance of the metric.
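The estimator can be computed directly with `math.comb`, guarding the case where every size-k subset of samples must contain a passing one:

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: n samples drawn, c of them passed.
    pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:  # every size-k subset contains at least one passing sample
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

print(pass_at_k(200, 0, 1))    # 0.0: no sample passed
print(pass_at_k(200, 200, 1))  # 1.0: every sample passed
print(pass_at_k(200, 50, 1))   # 0.25: reduces to c/n when k = 1
```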

Dataset description:

The HumanEval dataset released by OpenAI contains 164 programming problems, each with a function signature, docstring, body, and several unit tests. The problems are written in Python, with English natural-language text in the comments and docstrings. All problems were hand-written to ensure they are not contained in the training sets of code generation models.

Dataset composition and format:

Dataset size:

The source dataset contains 164 problems in total.

Evaluation data size:

The evaluation data is the 164 samples of the test set.

Source data format:

from datasets import load_dataset
load_dataset("openai_humaneval")

DatasetDict({
    test: Dataset({
        features: ['task_id', 'prompt', 'canonical_solution', 'test', 'entry_point'],
        num_rows: 164
    })
})

Source data fields:

KEYS and their meanings:

task_id: identifier of the problem
prompt: model input, consisting of the function header and docstring
canonical_solution: the reference solution to the prompt's problem
test: contains a function that checks the correctness of generated code
entry_point: the function name used as the entry point for testing

Source dataset sample:

{
    "task_id": "test/0",
    "prompt": "def return1():\n",
    "canonical_solution": "    return 1",
    "test": "def check(candidate):\n    assert candidate() == 1",
    "entry_point": "return1"
}
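The fields fit together as follows: the model completes the `prompt`, the completion is appended to it and executed, and then the `check` function from `test` is run against the `entry_point`. A minimal sketch (not OpenAI's official harness) using the sample above, with the canonical solution standing in for a model completion:

```python
sample = {
    "task_id": "test/0",
    "prompt": "def return1():\n",
    "canonical_solution": "    return 1",
    "test": "def check(candidate):\n    assert candidate() == 1",
    "entry_point": "return1",
}

namespace = {}
# Build the full function from prompt + completion and define it.
exec(sample["prompt"] + sample["canonical_solution"], namespace)
# Define the check function, then run it on the entry point.
exec(sample["test"], namespace)
namespace["check"](namespace[sample["entry_point"]])  # raises if the solution fails
print("passed")
```

A real harness runs each candidate in an isolated, sandboxed subprocess with a timeout, since generated code is untrusted.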

Citation:

@article{chen2021codex,
  title={Evaluating Large Language Models Trained on Code},
  author={Mark Chen and Jerry Tworek and Heewoo Jun and Qiming Yuan and Henrique Ponde de Oliveira Pinto and Jared Kaplan and Harri Edwards and Yuri Burda and Nicholas Joseph and Greg Brockman and Alex Ray and Raul Puri and Gretchen Krueger and Michael Petrov and Heidy Khlaaf and Girish Sastry and Pamela Mishkin and Brooke Chan and Scott Gray and Nick Ryder and Mikhail Pavlov and Alethea Power and Lukasz Kaiser and Mohammad Bavarian and Clemens Winter and Philippe Tillet and Felipe Petroski Such and Dave Cummings and Matthias Plappert and Fotios Chantzis and Elizabeth Barnes and Ariel Herbert-Voss and William Hebgen Guss and Alex Nichol and Alex Paino and Nikolas Tezak and Jie Tang and Igor Babuschkin and Suchir Balaji and Shantanu Jain and William Saunders and Christopher Hesse and Andrew N. Carr and Jan Leike and Josh Achiam and Vedant Misra and Evan Morikawa and Alec Radford and Matthew Knight and Miles Brundage and Mira Murati and Katie Mayer and Peter Welinder and Bob McGrew and Dario Amodei and Sam McCandlish and Ilya Sutskever and Wojciech Zaremba},
  year={2021},
  eprint={2107.03374},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

Dataset license:

MIT License