Evaluation Task and Evaluation Data Introduction

Natural Language Processing (NLP)

The evaluation focuses on distinct capabilities of large language models. In addition to self-constructed datasets, for several mainstream capability categories we have also selected public datasets that are not yet saturated:

Comprehensive capability: evaluation datasets such as MMLU-Pro.

Reasoning capability: evaluation datasets such as MuSR.

Mathematical capability: evaluation datasets such as GPQA.

Programming capability: evaluation datasets such as LiveCodeBench.

Tool invocation: evaluation datasets such as CLCC.
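Most of the public datasets above (MMLU-Pro, MuSR, GPQA) are multiple-choice benchmarks, where a model's score reduces to accuracy over question/choices/answer records. The sketch below illustrates that scoring loop with made-up toy items and a placeholder predictor; none of the names or records come from any real dataset or API.

```python
def score_multiple_choice(items, predict):
    """Return accuracy of `predict` over question/choices/answer records."""
    correct = 0
    for item in items:
        # A real harness would prompt the model here; `predict` is a stand-in.
        if predict(item["question"], item["choices"]) == item["answer"]:
            correct += 1
    return correct / len(items)

# Toy records standing in for real benchmark items (hypothetical data).
items = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5"], "answer": "4"},
    {"question": "Capital of France?", "choices": ["Paris", "Rome"], "answer": "Paris"},
]

# A trivial stand-in "model" that always picks the first choice.
def first_choice(question, choices):
    return choices[0]

print(score_multiple_choice(items, first_choice))  # → 0.5
```

Generative benchmarks such as LiveCodeBench instead execute model-written code against test cases, so their scoring replaces the string comparison above with a sandboxed test run.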