Evaluation Tasks and Data
Natural Language Processing (NLP)
Our evaluation focuses on measuring distinct capabilities of large language models. In addition to self-constructed datasets, for several mainstream capability categories we also selected public datasets that have not yet saturated:
Comprehensive capability: including evaluation datasets such as MMLU-Pro.
Reasoning capability: including evaluation datasets such as MuSR.
Scientific capability: including evaluation datasets such as GPQA (graduate-level science questions).
Programming capability: including evaluation datasets such as LiveCodeBench.
Tool invocation: including evaluation datasets such as CLCC.
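Most of the benchmarks above are scored by checking a model's chosen answer against a reference. As a minimal sketch of how such multiple-choice evaluation is typically computed (the letter-extraction heuristic and function names here are illustrative assumptions, not the actual harness used):

```python
import re

def extract_choice(response: str):
    """Illustrative heuristic: take the first standalone option letter (A-D)
    found in a free-form model response."""
    match = re.search(r"\b([A-D])\b", response)
    return match.group(1) if match else None

def accuracy(predictions, references):
    """Fraction of responses whose extracted letter matches the reference."""
    correct = sum(extract_choice(p) == r for p, r in zip(predictions, references))
    return correct / len(references)

# Toy example: 3 of the 4 extracted letters match the references.
preds = ["The answer is B.", "C", "I think (A) is correct.", "Not sure."]
refs = ["B", "C", "A", "D"]
print(accuracy(preds, refs))  # -> 0.75
```

Real harnesses differ mainly in how robustly they parse the model's answer; unsaturated benchmarks such as MMLU-Pro additionally expand the option set beyond four letters to reduce guessing.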