Introduction to Evaluation Tasks and Evaluation Data
Natural Language Processing (NLP)
This part focuses on evaluating the different capabilities of large language models. In addition to our self-constructed datasets, we have selected several public datasets that are not yet saturated to cover some of the mainstream capability categories:
- Synthesis Capability: includes datasets such as MMLU-Pro
- Reasoning Capability: includes datasets such as MuSR
- Mathematical Capability: includes datasets such as GPQA
- Programming Capability: includes datasets such as LiveCodeBench
- Tool Invocation: includes datasets such as CLCC
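
The category-to-dataset mapping above could be written down as a small lookup table, for example when selecting which public benchmarks to run for a given capability. The following is only an illustrative sketch: the names `PUBLIC_BENCHMARKS` and `datasets_for` are hypothetical and not part of any existing tool, and each list holds only the one dataset named above (the original lists say "such as", so the real sets may be larger).

```python
# Hypothetical registry mirroring the list above. Only the datasets named in
# the text are included; the actual evaluation suite may contain more.
PUBLIC_BENCHMARKS: dict[str, list[str]] = {
    "Synthesis Capability": ["MMLU-Pro"],
    "Reasoning Capability": ["MuSR"],
    "Mathematical Capability": ["GPQA"],
    "Programming Capability": ["LiveCodeBench"],
    "Tool Invocation": ["CLCC"],
}


def datasets_for(category: str) -> list[str]:
    """Return the public datasets listed for a capability category.

    Unknown categories yield an empty list rather than raising an error.
    """
    # Return a copy so callers cannot mutate the registry by accident.
    return list(PUBLIC_BENCHMARKS.get(category, []))


if __name__ == "__main__":
    for name in PUBLIC_BENCHMARKS:
        print(f"{name}: {', '.join(datasets_for(name))}")
```

Keeping the mapping in one place makes it easy to add the self-constructed datasets alongside the public ones later.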