Evaluation Dataset

The following datasets are all transformed into standard Evaluation Prompts before evaluation.
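Most of the datasets below are scored with exact match: the model's answer must equal the gold label exactly. A minimal sketch of such a scorer (the normalization applied here, stripping whitespace and uppercasing, is an assumption; the actual evaluation pipeline may normalize differently):

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Return True when the normalized prediction equals the reference.

    Stripping whitespace and uppercasing is an assumed normalization;
    the real pipeline may differ.
    """
    return prediction.strip().upper() == reference.strip().upper()


def exact_match_score(predictions, references):
    """Fraction of examples whose prediction exactly matches the reference."""
    assert len(predictions) == len(references)
    matches = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return matches / len(references)
```

For example, `exact_match_score(["D", "A"], ["D", "B"])` returns `0.5`.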

Dataset 1 (Chinese_MMLU)

#Metrics-Exact Match

Data description:

Chinese_MMLU, translated from MMLU, is a large multi-task test dataset consisting of multiple-choice questions from different branches of knowledge. The tests cover the humanities, the social sciences, the natural sciences, and other important fields. It spans 57 tasks, including elementary mathematics, US history, computer science, law, and more.

Dataset structure:

Amount of source data:

The dataset is split into auxiliary train (99842), validation (1531), test (14042), and development (285).

Data detail:

KEYS      EXPLAIN
question  a string feature
choices   a list of four options
answer    the correct choice

Sample of source dataset:

{
  "question": "舌骨的胚胎起源是什么?",
  "choices": ["第一咽弓", "第一和第二咽弓", "第二咽弓", "第二和第三咽弓"],
  "answer": "D"
}
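The `answer` field stores the letter of the correct option, so recovering the answer text means indexing `choices` by the letter's offset from "A". A small helper (the function name is illustrative, not part of any dataset tooling):

```python
def answer_text(record: dict) -> str:
    """Map the answer letter ("A".."D") to the corresponding entry of `choices`."""
    index = ord(record["answer"]) - ord("A")  # "A" -> 0, "B" -> 1, ...
    return record["choices"][index]


sample = {
    "question": "舌骨的胚胎起源是什么?",
    "choices": ["第一咽弓", "第一和第二咽弓", "第二咽弓", "第二和第三咽弓"],
    "answer": "D",
}
# "D" is offset 3, so the gold option is the fourth choice.
print(answer_text(sample))  # 第二和第三咽弓
```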

Licensing information:

MIT License

Dataset 2 (CSL)

#Metrics-Exact Match

Data description:

The Chinese Scientific Literature dataset (CSL) is drawn from Chinese paper abstracts and their keywords; the papers are selected from core Chinese social-science and natural-science journals. The task is to judge, given the abstract, whether all of the listed keywords are genuine keywords of the paper (genuine = 1, fake = 0).

Dataset structure:

Amount of source data:

The dataset is split into train (32), validation (32), public test (2828), test (3000), and unsupervised (19841).

Data detail:

KEYS     EXPLAIN
id       paper ID
abst     paper abstract
keyword  list of keywords
label    true/fake label (1 or 0)

Sample of dataset:

{"id": 1, 
"abst": "为解决传统均匀FFT波束形成算法引起的3维声呐成像分辨率降低的问题,该文提出分区域FFT波束形成算法.远场条件下,
以保证成像分辨率为约束条件,以划分数量最少为目标,采用遗传算法作为优化手段将成像区域划分为多个区域.在每个区域内选取一个波束方向,获得每一个接收阵元收到该方向回波时的解调输出,以此为原始数据在该区域内进行传统均匀FFT波束形成.对FFT计算过程进行优化,降低新算法的计算量,使其满足3维成像声呐实时性的要求.仿真与实验结果表明,采用分区域FFT波束形成算法的成像分辨率较传统均匀FFT波束形成算法有显著提高,且满足实时性要求.",
 "keyword": ["水声学", "FFT", "波束形成", "3维成像声呐"], 
"label": "1"}

Citation information:

@misc{FewCLUE,
  title={FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark},
  author={Liang Xu and Xiaojing Lu and Chenyang Yuan and Xuanwei Zhang and Huilin Xu and Hu Yuan and Guoao Wei and Xiang Pan and Xin Tian and Libo Qin and Hai Hu},
  year={2021},
  howpublished={\url{https://arxiv.org/abs/2107.07498}},
}

Dataset 3 (ChID)

#Metrics-Exact Match

Data description:

ChID is a large-scale Chinese cloze test dataset for studying the comprehension of idioms (chengyu), a linguistic phenomenon unique to Chinese. In this corpus, idioms in a passage are replaced with a blank placeholder, turning comprehension into an idiom cloze task. Multiple idioms in each text may be masked, and the candidate set includes near-synonymous idioms as distractors.

Dataset structure:

Amount of source data:

The dataset is split into train (42), validation (42), public test (2002), test (2000), and unsupervised (7585).

Data detail:

KEYS        EXPLAIN
id          data ID
candidates  list of candidate idioms
content     passage containing the #idiom# blank
answer      index of the correct idiom in candidates

Sample of source dataset:

{"id": 1421, 

"candidates": ["巧言令色", "措手不及", "风流人物", "八仙过海", "平铺直叙", "草木皆兵", "言行一致"],
"content": "当广州憾负北控,郭士强黯然退场那一刻,CBA季后赛悬念仿佛一下就消失了,可万万没想到,就在时隔1天后,北控外援约瑟夫-杨因个人裁决案(拖欠上一家经纪公司的费用),导致被禁赛,打了马布里一个#idiom#,加上郭士强带领广州神奇逆转天津,让...", 

"answer": 1}

Citation information:

@misc{FewCLUE,
  title={FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark},
  author={Liang Xu and Xiaojing Lu and Chenyang Yuan and Xuanwei Zhang and Huilin Xu and Hu Yuan and Guoao Wei and Xiang Pan and Xin Tian and Libo Qin and Hai Hu},
  year={2021},
  howpublished={\url{https://arxiv.org/abs/2107.07498}},
}

Dataset 4 (CLUEWSC)

#Metrics-Exact Match

Data description:

The Winograd Schema Challenge (WSC) is a pronoun disambiguation task: determine which noun a pronoun in a sentence refers to. Questions are posed as true/false judgments, for example: "At that moment the [mobile phone], which was lying beside the [pillow] on the [bed], rang. I found it strange, because [it] had been cut off for two months over unpaid bills, and now [it] suddenly rang." Does "it" refer to "bed", "pillow", or "phone"? The sentences are drawn from the works of modern and contemporary Chinese writers, then manually screened and annotated by language experts.

Dataset structure:

Amount of source data:

The dataset is split into train (32), validation (32), public test (976), test (290), and unsupervised (0).

Data detail:

KEYS    EXPLAIN
target  the pronoun and candidate noun, with their character positions in the text
idx     data ID
label   "true" if the pronoun refers to the noun in span1_text, "false" otherwise
text    the sentence text

Sample of source dataset:

 {"target": 
     {"span2_index": 37, 
     "span1_index": 5, 
     "span1_text": "床", 
     "span2_text": "它"}, 
 "idx": 261, 
 "label": "false", 
 "text": "这时候放在床上枕头旁边的手机响了,我感到奇怪,因为欠费已被停机两个月,现在它突然响了。"}

Citation information:

@misc{FewCLUE,
  title={FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark},
  author={Liang Xu and Xiaojing Lu and Chenyang Yuan and Xuanwei Zhang and Huilin Xu and Hu Yuan and Guoao Wei and Xiang Pan and Xin Tian and Libo Qin and Hai Hu},
  year={2021},
  howpublished={\url{https://arxiv.org/abs/2107.07498}},
}

Dataset 5 (C-SEM)

#Metrics-Exact Match

Semantic understanding is seen as a key cornerstone in the research and application of natural language processing. However, in the field of evaluating large Chinese language models, publicly available benchmarks that approach evaluation from a linguistic perspective are still lacking.

Peking University and Minjiang College, as co-builders of the FlagEval flagship project, have collaborated to create the C-SEM (Chinese SEMantic evaluation dataset) semantic evaluation benchmark dataset.

C-SEM constructs evaluation data at multiple levels and difficulties to probe potential flaws and inadequacies of current large models. Referencing human language-cognition habits, it examines the model's "thinking" process when interpreting semantics. The currently open-source version, C-SEM v1.0, includes four sub-tasks that assess semantic understanding at both the lexical and the sentence level, offering broad applicability for research comparison.

The sub-evaluation items of C-SEM are: Lexical Level Semantic Relationship Classification (LLSRC), Sentence Level Semantic Relationship Classification (SLSRC), Sentence Level Polysemous Words Classification (SLPWC), and Sentence Level Rhetoric Figure Classification (SLRFC). Future iterations of the C-SEM benchmark will continue to evolve, covering more semantic understanding-related knowledge and forming a multi-level semantic understanding evaluation system. Meanwhile, the FlagEval large model evaluation platform will integrate the latest versions promptly to enhance the comprehensiveness of evaluating Chinese capabilities of large language models.

Note: To ensure fair and impartial evaluation results and prevent leakage of the evaluation set, the C-SEM evaluation set used on the FlagEval official website (flageval.baai.ac.cn) is updated asynchronously relative to the open-source version. The current FlagEval version, compared to the open-source one, has more questions and richer formats, and uses a 5-shot format for evaluation, referencing the HELM approach.

LLSRC

Data description:

LLSRC (Lexical Level Semantic Relationship Classification) adopts a multiple-choice format covering relationship selection, word selection, and word-pair selection, requiring the model to give the correct answer. The lexical semantic relationships involved include synonymy, antonymy, and hypernym-hyponym relations, and are used to evaluate the model's understanding of semantic relationships at the lexical level. This dataset is unpublished and not available for public use.

Dataset structure:

Data detail:

KEYS      EXPLAIN
question  String
choices   List with four options
answer    The correct answer

Sample of source dataset:
{
  "question": "花与菊花是什么关系?",
  "choices": ["上下位", "整体与部分", "近义", "反义"],
  "answer": "A"
}

SLSRC

Data description:

SLSRC (Sentence Level Semantic Relationship Classification) adopts a multiple-choice format, requiring the model to provide the correct semantic relationship judgment based on the context of the sentence when given a sentence and a specified word. This is used to evaluate the model's ability to understand the semantics of vocabulary in the context of sentences. This dataset is not publicly available for use.

Dataset structure:

Data detail:

KEYS      EXPLAIN
question  String
choices   List with four options
answer    The correct answer

Sample of source dataset:
{
  "question": "“我最喜欢吃包心菜了。”这句话中“包心菜”与哪个词是同义或近义关系?",
  "choices":["大头菜", "茼蒿", "圆白菜", "西兰花"],
  "answer":"A"
}

SLPWC

Data description:

SLPWC (Sentence Level Polysemous Words Classification) adopts a multiple-choice format: given a polysemous word and several candidate sentences, the model must judge in which sentence the word carries a different sense. This is used to evaluate the model's ability to understand polysemous vocabulary in sentence context. This dataset is not publicly available for use.

Dataset structure:

Data detail:

KEYS      EXPLAIN
question  String
choices   List with four options
answer    The correct answer

Sample of source dataset:
{
  "question": "以下哪句话中“泰山”的含义与其他句子意思不同。",
  "choices":[
    "为人民而死重于泰山。", 
    "登上泰山顶峰,眺望海上日出。", 
    "我们都知道,岳父还有一个称呼,叫“老泰山”", 
    "人固有一死,或重于泰山,或轻于鸿毛。司马迁"],
  "answer":"C"
}

SLRFC

Data description:

SLRFC (Sentence Level Rhetoric Figure Classification) adopts a multiple-choice format, requiring the model to correctly identify the rhetorical figure used in a sentence, covering metaphor, parallelism, rhetorical question, and personification. This is used to evaluate the model's ability to recognize rhetorical figures at the sentence level. This dataset is not publicly available for use.

Dataset structure:

Data detail:

KEYS      EXPLAIN
question  String
choices   List with four options
answer    The correct answer

Sample of source dataset:
{
  "question": "以下哪个句子使用了比喻修辞手法?",
  "choices": [
    "友谊是火,在寒风中给你温暖。",
    "桃树杏树梨树,你不让我,我不让你,都开满了花赶趟儿",
    "成功是什么,是一次考试的优异成绩,成功是什么,是给我们自信的泉源,成功是什么,是经过不懈努力最终达到目的的喜悦……",
    "月明人静的夜里,它们便唱起歌来,织,织,织,织呀。织,织,织,织呀。那歌声真好听。赛过催眠曲。"],
  "answer": "A"
}

Dataset 6 (Gaokao2023_v2)

#Metrics-Exact Match

Data description:

The GaoKao2023_v2 dataset compiles 364 objective questions from the 2023 Gaokao (national college entrance examination) papers, with disruptive elements such as special symbols removed. The questions are categorized by discipline: 62 in biology, 20 in chemistry, 12 in Chinese, 59 in English, 13 in geography, 64 in history, 66 in math, 11 in physics, and 57 in politics.

Dataset structure:

Amount of source data:

Test set (364)

Data detail:

KEYS      EXPLAIN
question  String
choices   List with four options
answer    The correct answer
source    The source of the test papers

Sample of source dataset:

{
  "question": "孟子说:“五亩之宅,树之以桑,五十(岁)者可以衣帛矣;鸡豚狗彘之畜,无失其时,七十(岁)者可以食肉矣;百亩之田,勿夺其时,数口之家可以无饥矣。”这一观点所依托的时代背景是",
  "choices": [
    "休养生息政策的实施",
    "井田制度的繁荣",
    "农业生产技术的发展",
    "商业活动的衰退"],
  "answer": "C",
  "source": "2023年全国乙卷文综历史高考真题文档版"
}

Dataset 7 (C-Eval)

Data description:

C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13,948 multiple-choice questions spanning 52 diverse disciplines and four difficulty levels. Each subject has three splits: dev, val, and test. The dev split provides five exemplars with explanations for few-shot evaluation, the val split is for hyperparameter tuning, and the test split is for model evaluation.

Dataset structure:

Amount of source data:

Test (12342), Val (1346), Dev (260)

Data detail:

KEYS         EXPLAIN
id           integer
question     a string feature
A            choice A string
B            choice B string
C            choice C string
D            choice D string
answer       a string feature
explanation  a string feature

Sample of source dataset:

id: 1
question: 25 °C时,将pH=2的强酸溶液与pH=13的强碱溶液混合,所得混合液的pH=11,则强酸溶液与强碱溶液 的体积比是(忽略混合后溶液的体积变化)____
A: 11:1
B: 9:1
C: 1:11
D: 1:9
answer: B
explanation: 
1. pH=13的强碱溶液中c(OH-)=0.1mol/L, pH=2的强酸溶液中c(H+)=0.01mol/L,酸碱混合后pH=11,即c(OH-)=0.001mol/L。
2. 设强酸和强碱溶液的体积分别为x和y,则:c(OH-)=(0.1y-0.01x)/(x+y)=0.001,解得x:y=9:1。
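The arithmetic in the explanation above can be checked directly: with acid volume x and base volume y, mixing 0.01 mol/L H+ (pH = 2) with 0.1 mol/L OH- (pH = 13) at the ratio x : y = 9 : 1 should leave a residual c(OH-) of 0.001 mol/L, i.e. pOH = 3 and pH = 11, matching answer B:

```python
# Verify the worked solution for the C-Eval sample above.
x, y = 9.0, 1.0     # volumes of acid and base, in the ratio 9:1 (answer B)
c_h = 0.01          # mol/L H+ in the strong acid (pH = 2)
c_oh = 0.1          # mol/L OH- in the strong base (pH = 13)

# Excess hydroxide after neutralization, diluted into the total volume.
residual_oh = (c_oh * y - c_h * x) / (x + y)
print(residual_oh)  # approximately 0.001 mol/L, i.e. pOH = 3, pH = 11
```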

Citation information:

@inproceedings{huang2023ceval,
               title={C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models}, 
               author={Huang, Yuzhen and Bai, Yuzhuo and Zhu, Zhihao and Zhang, Junlei and Zhang, Jinghan and Su, Tangjun and Liu, Junteng and Lv, Chuancheng and Zhang, Yikai and Lei, Jiayi and Fu, Yao and Sun, Maosong and He, Junxian},
               booktitle={Advances in Neural Information Processing Systems},
               year={2023}
}

Licensing information:

CC BY-NC-SA 4.0; MIT License