Introduction to Evaluation Tasks and Evaluation Data
Natural Language Processing (NLP)
Natural language processing (NLP) evaluation mainly assesses three major capabilities of models in downstream tasks: 1) basic capabilities, including simple understanding, knowledge application, and reasoning; 2) advanced capabilities, including special generation and contextual understanding; 3) comprehensive capabilities, including general and domain-specific comprehensive capabilities. In addition to these three capabilities, it assesses a fourth dimension: security and values.
The NLP evaluation currently includes the following tasks:
- Chinese Choice QA: evaluation datasets include Chinese_MMLU, CSL, ChID, etc.
- English Choice QA: evaluation datasets include MMLU, HellaSwag, OpenBookQA, etc.
- Chinese Classification: evaluation datasets include EPRSTMT, TNEWS, OCNLI, etc.
- English Classification: evaluation datasets include IMDB, RAFT, etc.
- Chinese Open QA: evaluation datasets include CLCC, etc.
- English Open QA: evaluation datasets include CNN/DailyMail, GSM8K, etc.
- Code Generation: evaluation datasets include HumanEval, etc.
Click a task name to view its details, which include introductions to the task, the evaluation data, the evaluation metrics, and the evaluation prompts.
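For readers who want to work with the task list programmatically, the mapping above can be sketched as a small lookup table. This is only an illustration: the task and dataset names are taken from the list above (which is itself partial, as each entry ends in "etc."), while the `NLP_EVAL_TASKS` dictionary and the `tasks_for_dataset` helper are hypothetical names and do not belong to any platform API.

```python
# Hypothetical registry: task and dataset names come from the list above;
# the dictionary layout and helper function are illustrative assumptions.
NLP_EVAL_TASKS = {
    "Chinese Choice QA": ["Chinese_MMLU", "CSL", "ChID"],
    "English Choice QA": ["MMLU", "HellaSwag", "OpenBookQA"],
    "Chinese Classification": ["EPRSTMT", "TNEWS", "OCNLI"],
    "English Classification": ["IMDB", "RAFT"],
    "Chinese Open QA": ["CLCC"],
    "English Open QA": ["CNN/DailyMail", "GSM8K"],
    "Code Generation": ["HumanEval"],
}


def tasks_for_dataset(name):
    """Return the tasks whose listed datasets match `name` exactly (case-insensitive)."""
    key = name.lower()
    return [
        task
        for task, datasets in NLP_EVAL_TASKS.items()
        if any(dataset.lower() == key for dataset in datasets)
    ]
```

For example, `tasks_for_dataset("gsm8k")` looks up the task that lists GSM8K. Matching is on the full dataset name, so `"MMLU"` does not match `"Chinese_MMLU"`.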