Evaluation Datasets
The following datasets are all transformed into standard Evaluation Prompts before evaluation.
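As an illustration of that transformation, below is a minimal, hypothetical Python sketch of how a multiple-choice record of the kind documented in this section might be rendered into an evaluation prompt. The template, the Chinese "答案:" ("Answer:") suffix, and the function name are assumptions, not the evaluation platform's actual prompt format.

```python
# Hypothetical sketch only: the real Evaluation Prompt template is not
# specified in this document. Field names follow the dataset samples below.
def to_prompt(record: dict) -> tuple[str, str]:
    """Render a multiple-choice record into (prompt text, gold answer letter)."""
    letters = "ABCDE"
    lines = [record["question"]]
    for letter, choice in zip(letters, record["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("答案:")  # assumed "Answer:" suffix
    return "\n".join(lines), record["answer"]

sample = {
    "question": "舌骨的胚胎起源是什么?",
    "choices": ["第一咽弓", "第一和第二咽弓", "第二咽弓", "第二和第三咽弓"],
    "answer": "D",
}
prompt, gold = to_prompt(sample)
```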
Dataset 1 (Chinese_MMLU)
Data description:
Chinese_MMLU, translated from MMLU, is a large multi-task test dataset consisting of multiple-choice questions from different branches of knowledge. It covers the humanities, social sciences, natural sciences, and other important fields, spanning 57 tasks including elementary mathematics, American history, computer science, law, and more.
Dataset structure:
Amount of source data:
The dataset is split into auxiliary train (99842), validation (1531), test (14042), and development (285).
Data detail:
KEYS | EXPLAIN |
---|---|
question | a string feature |
choices | a list of four options |
answer | correct choice |
Sample of source dataset:
{
"question": "舌骨的胚胎起源是什么?",
"choices": ["第一咽弓", "第一和第二咽弓", "第二咽弓", "第二和第三咽弓"],
"answer": "D"
}
Licensing information:
Dataset 2 (CSL)
Data description:
The Chinese Scientific Literature dataset (CSL) consists of abstracts of Chinese papers together with their keywords, drawn from selected core Chinese social science and natural science journals. The task is to determine, from the abstract, whether all of the listed keywords are genuine keywords (genuine labeled 1, fake labeled 0).
Dataset structure:
Amount of source data:
The dataset is split into train (32), validation (32), public test (2828), test (3000), and unsupervised (19841).
Data detail:
KEYS | EXPLAIN |
---|---|
id | paper ID |
abst | paper abstract |
keyword | list of keywords |
label | true/false label (1 = genuine, 0 = fake) |
Sample of source dataset:
{"id": 1,
"abst": "为解决传统均匀FFT波束形成算法引起的3维声呐成像分辨率降低的问题,该文提出分区域FFT波束形成算法.远场条件下,
以保证成像分辨率为约束条件,以划分数量最少为目标,采用遗传算法作为优化手段将成像区域划分为多个区域.在每个区域内选取一个波束方向,获得每一个接收阵元收到该方向回波时的解调输出,以此为原始数据在该区域内进行传统均匀FFT波束形成.对FFT计算过程进行优化,降低新算法的计算量,使其满足3维成像声呐实时性的要求.仿真与实验结果表明,采用分区域FFT波束形成算法的成像分辨率较传统均匀FFT波束形成算法有显著提高,且满足实时性要求.",
"keyword": ["水声学", "FFT", "波束形成", "3维成像声呐"],
"label": "1"}
Citation information:
@misc{FewCLUE,
title={FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark},
author={Liang Xu and Xiaojing Lu and Chenyang Yuan and Xuanwei Zhang and Huilin Xu and Hu Yuan and Guoao Wei and Xiang Pan and Xin Tian and Libo Qin and Hu Hai},
year={2021},
howpublished={\url{https://arxiv.org/abs/2107.07498}},
}
Dataset 3 (ChID)
Data description:
ChID is a large-scale Chinese cloze-test dataset for studying the comprehension of idioms, a linguistic phenomenon unique to Chinese. In this corpus, idioms in a passage are replaced by blank placeholders, turning the task into an idiom cloze test. Many idioms in the text are masked, and the candidate options include idioms with similar meanings.
Dataset structure:
Amount of source data:
The dataset is split into train (42), validation (42), public test (2002), test (2000), and unsupervised (7585).
Data detail:
KEYS | EXPLAIN |
---|---|
id | data id |
candidates | list of candidate idioms |
content | passage text containing the #idiom# blank |
answer | index of the correct idiom in the candidate list |
Sample of source dataset:
{"id": 1421,
"candidates": ["巧言令色", "措手不及", "风流人物", "八仙过海", "平铺直叙", "草木皆兵", "言行一致"],
"content": "当广州憾负北控,郭士强黯然退场那一刻,CBA季后赛悬念仿佛一下就消失了,可万万没想到,就在时隔1天后,北控外援约瑟夫-杨因个人裁决案(拖欠上一家经纪公司的费用),导致被禁赛,打了马布里一个#idiom#,加上郭士强带领广州神奇逆转天津,让...",
"answer": 1}
Citation information:
@misc{FewCLUE,
title={FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark},
author={Liang Xu and Xiaojing Lu and Chenyang Yuan and Xuanwei Zhang and Huilin Xu and Hu Yuan and Guoao Wei and Xiang Pan and Xin Tian and Libo Qin and Hu Hai},
year={2021},
howpublished={\url{https://arxiv.org/abs/2107.07498}},
}
Dataset 4 (CLUEWSC)
Data description:
The Winograd Schema Challenge (WSC) is a pronoun disambiguation task: determine which noun a pronoun in a sentence refers to. Questions are posed as true/false judgments, for example: "At that moment the [mobile phone] lying on the [pillow] on the [bed] rang. I felt it was strange, because [it] had been suspended for two months due to unpaid fees, and now [it] suddenly rang." Does "it" refer to the "bed", the "pillow", or the "mobile phone"? Sentences are selected from works by modern and contemporary Chinese writers and are then manually curated and annotated by language experts.
Dataset structure:
Amount of source data:
The dataset is split into train (32), validation (32), public test (976), test (290), and unsupervised (0).
Data detail:
KEYS | EXPLAIN |
---|---|
target | the candidate noun (span1_text) and the pronoun (span2_text), together with their character positions in the text (span1_index, span2_index) |
idx | data id |
label | The true-false tag, "true" means the pronoun does refer to the noun in span1_text, and "false" means it does not |
text | text |
Sample of source dataset:
{"target":
{"span2_index": 37,
"span1_index": 5,
"span1_text": "床",
"span2_text": "它"},
"idx": 261,
"label": "false",
"text": "这时候放在床上枕头旁边的手机响了,我感到奇怪,因为欠费已被停机两个月,现在它突然响了。"}
Citation information:
@misc{FewCLUE,
title={FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark},
author={Liang Xu and Xiaojing Lu and Chenyang Yuan and Xuanwei Zhang and Huilin Xu and Hu Yuan and Guoao Wei and Xiang Pan and Xin Tian and Libo Qin and Hu Hai},
year={2021},
howpublished={\url{https://arxiv.org/abs/2107.07498}},
}
Dataset 5 (C-SEM)
Semantic understanding is a cornerstone of natural language processing research and applications. However, publicly available benchmarks that evaluate large Chinese language models from a linguistic perspective are still scarce.
Peking University and Minjiang College, as co-builders of the FlagEval flagship project, have collaborated to create the C-SEM (Chinese SEMantic evaluation dataset) semantic evaluation benchmark dataset.
C-SEM constructs evaluation data at multiple levels and difficulties to probe potential flaws and weaknesses of current large models, examining how models "reason" about semantics with reference to human language-cognition habits. The currently open-source version, C-SEM v1.0, includes four sub-tasks that assess semantic understanding at both the lexical and sentence levels, offering broad applicability for research comparison.
The sub-evaluation items of C-SEM are: Lexical Level Semantic Relationship Classification (LLSRC), Sentence Level Semantic Relationship Classification (SLSRC), Sentence Level Polysemous Words Classification (SLPWC), and Sentence Level Rhetoric Figure Classification (SLRFC). Future iterations of the C-SEM benchmark will continue to evolve, covering more semantic understanding-related knowledge and forming a multi-level semantic understanding evaluation system. Meanwhile, the FlagEval large model evaluation platform will integrate the latest versions promptly to enhance the comprehensiveness of evaluating Chinese capabilities of large language models.
Note: To ensure fair and impartial evaluation results and to prevent leakage of the evaluation set, the C-SEM evaluation set used on the FlagEval official website (flageval.baai.ac.cn) is updated asynchronously relative to the open-source version. Compared with the open-source version, the current FlagEval version has more questions and richer formats, and uses a 5-shot format for evaluation, following the HELM approach.
LLSRC
Data description:
LLSRC (Lexical Level Semantic Relationship Classification) uses a multiple-choice format, covering relationship selection, word selection, and word-pair selection, and requires the model to give the correct answer. The Chinese lexical semantic relationships involved include synonymy, antonymy, and hypernym-hyponym relations; the task evaluates the model's understanding of semantic relationships at the lexical level. This dataset is unpublished and not available for public use.
Dataset structure:
Data detail:
KEYS | EXPLAIN |
---|---|
question | String |
choices | List with four options |
answer | The correct answer |
Sample of source dataset:
{
"question": "花与菊花是什么关系?",
"choices": ["上下位", "整体与部分", "近义", "反义"],
"answer": "A"
}
SLSRC
Data description:
SLSRC (Sentence Level Semantic Relationship Classification) uses a multiple-choice format: given a sentence and a specified word, the model must make the correct semantic-relationship judgment based on the sentence context. This evaluates the model's ability to understand the semantics of words in sentential context. This dataset is not publicly available for use.
Dataset structure:
Data detail:
KEYS | EXPLAIN |
---|---|
question | String |
choices | List with four options |
answer | The correct answer |
Sample of source dataset:
{
"question": "“我最喜欢吃包心菜了。”这句话中“包心菜”与哪个词是同义或近义关系?",
"choices":["大头菜", "茼蒿", "圆白菜", "西兰花"],
"answer":"A"
}
SLPWC
Data description:
SLPWC (Sentence Level Polysemous Words Classification) uses a multiple-choice format: given a specified word and several candidate sentences, the model must identify the sentence in which the word is used with a different sense. This evaluates the model's ability to understand polysemous words in sentences. This dataset is not publicly available for use.
Dataset structure:
Data detail:
KEYS | EXPLAIN |
---|---|
question | String |
choices | List with four options |
answer | The correct answer |
Sample of source dataset:
{
"question": "以下哪句话中“泰山”的含义与其他句子意思不同。",
"choices":[
"为人民而死重于泰山。",
"登上泰山顶峰,眺望海上日出。",
"我们都知道,岳父还有一个称呼,叫“老泰山”",
"人固有一死,或重于泰山,或轻于鸿毛。司马迁"],
"answer":"C"
}
SLRFC
Data description:
SLRFC (Sentence Level Rhetoric Figure Classification) uses a multiple-choice format, requiring the model to correctly identify the rhetorical figure used in a sentence, covering metaphor, parallelism, rhetorical question, and personification. This evaluates the model's ability to recognize rhetorical figures at the sentence level. This dataset is not publicly available for use.
Dataset structure:
Data detail:
KEYS | EXPLAIN |
---|---|
question | String |
choices | List with four options |
answer | The correct answer |
Sample of source dataset:
{
"question": "以下哪个句子使用了比喻修辞手法?",
"choices":[
"友谊是火,在寒风中给你温暖。",
"桃树杏树梨树,你不让我,我不让你,都开满了花赶趟儿",
"成功是什么,是一次考试的优异成绩,成功是什么,是给我们自信的泉源,成功是什么,是经过不懈努力最终达到目的的喜悦……", "月明人静的夜里,它们便唱起歌来,织,织,织,织呀。织,织,织,织呀。那歌声真好听。赛过催眠曲。"],
"answer":"A"
}
Dataset 6 (Gaokao2023_v2)
Data description:
The Gaokao2023_v2 dataset compiles 364 objective questions from the 2023 Gaokao (national college entrance examination) papers, with disruptive elements such as special symbols removed. The questions are categorized by discipline: 62 in biology, 20 in chemistry, 12 in Chinese, 59 in English, 13 in geography, 64 in history, 66 in math, 11 in physics, and 57 in politics.
Dataset structure:
Amount of source data:
Testset(364)
Data detail:
KEYS | EXPLAIN |
---|---|
question | String |
choices | List with four options |
answer | The correct answer |
source | The source of the test papers |
Sample of source dataset:
{
"question": "孟子说:“五亩之宅,树之以桑,五十(岁)者可以衣帛矣;鸡豚狗彘之畜,无失其时,七十(岁)者可以食肉矣;百亩之田,勿夺其时,数口之家可以无饥矣。”这一观点所依托的时代背景是",
"choices":[
"休养生息政策的实施",
"井田制度的繁荣",
"农业生产技术的发展",
"商业活动的衰退"],
"answer":"C"
"source":"2023年全国乙卷文综历史高考真题文档版"
}
C-Eval
Data description:
C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13,948 multiple-choice questions spanning 52 diverse disciplines and four difficulty levels. Each subject has three splits: dev, val, and test. The dev split contains five exemplars with explanations for few-shot evaluation, the val split is intended for hyperparameter tuning, and the test split is used for model evaluation.
Dataset structure:
Amount of source data:
Test (12342), Val (1346), Dev (260)
Data detail:
KEYS | EXPLAIN |
---|---|
id | integer |
question | a string feature |
A | choice A string |
B | choice B string |
C | choice C string |
D | choice D string |
answer | a string feature |
explanation | a string feature |
Sample of source dataset:
id: 1
question: 25 °C时,将pH=2的强酸溶液与pH=13的强碱溶液混合,所得混合液的pH=11,则强酸溶液与强碱溶液 的体积比是(忽略混合后溶液的体积变化)____
A: 11:1
B: 9:1
C: 1:11
D: 1:9
answer: B
explanation:
1. pH=13的强碱溶液中c(OH-)=0.1mol/L, pH=2的强酸溶液中c(H+)=0.01mol/L,酸碱混合后pH=11,即c(OH-)=0.001mol/L。
2. 设强酸和强碱溶液的体积分别为x和y,则:c(OH-)=(0.1y-0.01x)/(x+y)=0.001,解得x:y=9:1。
Citation information:
@inproceedings{huang2023ceval,
title={C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models},
author={Huang, Yuzhen and Bai, Yuzhuo and Zhu, Zhihao and Zhang, Junlei and Zhang, Jinghan and Su, Tangjun and Liu, Junteng and Lv, Chuancheng and Zhang, Yikai and Lei, Jiayi and Fu, Yao and Sun, Maosong and He, Junxian},
booktitle={Advances in Neural Information Processing Systems},
year={2023}
}
Licensing information:
cc-by-nc-sa 4.0, MIT License
C-IDM
Data Description
The Idiom Comprehension dataset is constructed as standardized multiple-choice questions and consists of two core question-type modules. The first is the context-matching multiple-choice question: the question provides a complete sentence context with an idiom blank at a key position, and the examinee must select, from several options, the idiom that fits the semantic, grammatical, and logical relationships of the context. The second focuses on discriminating relationships between idioms: the question lists several idioms, and the examinee must judge the logical associations or distinguishing characteristics among them along dimensions such as near-synonym relationships, antonym relationships, differences in emotional connotation, and differences in applicable contexts.
Adaptation Method
LoRA Adaptation
LoRA fine-tuning uses low-rank decomposition to represent the parameter updates of a large model, thereby reducing the resources and time required for fine-tuning. LoRA stands for "Low-Rank Adaptation of Large Language Models" and originates from the paper LoRA: Low-Rank Adaptation of Large Language Models. Its basic idea is to assume that the change in the model's weights during task adaptation is low-rank, so the parameter update can be represented by the product of two much smaller matrices while the pre-trained weights are kept frozen. LoRA can be applied to various natural language processing tasks, such as content understanding and generation. Experiments show that LoRA significantly reduces the number of trainable parameters without adding inference latency, while maintaining or improving model performance.
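Below is a minimal PyTorch sketch of this idea; the class name, rank, and scaling values are illustrative assumptions, not the adaptation code actually used for these datasets.

```python
# Minimal LoRA sketch: the frozen pre-trained weight stays unchanged; the
# update is factored into two small matrices B (out x r) and A (r x in).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # keep pre-trained weights frozen
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = base(x) + scaling * x A^T B^T, i.e. the effective update to W is B @ A
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))                  # only lora_A and lora_B receive gradients
```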
Dataset Composition and Specifications
Source Data Volume
There are 700 items in the training set, 100 items in the validation set, and 200 items in the test set.
Evaluation Data Volume
The evaluation data volume is the publicly available test set of 200 items.
Source Data Fields
KEYS | EXPLAIN |
---|---|
question | question data |
choices | A list containing four options |
answer | The correct answer |
Sample of the Source Dataset
{
"question": "Which of the following idioms has the least similar meaning to the other three?",
"choices": [
"Lead by example",
"Flat-as-a-pancake",
"Take the lead",
"Set an example"
],
"answer": "B"
}
Paper Citation
@inproceedings{hu2022lora,
title={LoRA: Low-Rank Adaptation of Large Language Models},
author={Hu, Edward J and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
booktitle={International Conference on Learning Representations},
year={2022}
}
Source Dataset Copyright Usage Notice:
Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
C-RDL
Data Description
The Two-Part Allegorical Saying Comprehension Dataset is structured as standardized multiple-choice questions, comprising two main categories:
Upper-Lower Sentence Correlation Multiple-Choice Questions
Questions present the complete structural elements of two-part allegorical sayings:
- Deriving the lower sentence from the upper sentence: Given the first half (the metaphorical description) of an allegorical saying (e.g., "Confucius moving house"), examinees must select the logically matching second half (the explanatory meaning) from the options (e.g., "All books (losses)"). Distractors may include homophone misunderstandings or semantically related but inaccurate expressions.
- Deriving the upper sentence from the lower sentence: Given the second half of an allegorical saying, examinees must match it back to the corresponding first half. This tests the accuracy of memory for fixed collocations and the depth of understanding of the metaphorical logic. Through structured option design, such questions assess examinees' mastery of fixed collocations and their ability to analyze the semantic mapping between the two parts.
Contextual Application Multiple-Choice Questions
Questions construct real-life language scenarios with a gap for an allegorical saying at a key position, requiring examinees to select the most context-appropriate option based on:
- Semantic fit: Determining whether the allegorical saying's meaning aligns with the sentence's core message (e.g., distinguishing between "Monkey breaking corn" and "grass on top of a wall" (a fence-sitter) in a context describing capricious behavior).
- Rhetorical appropriateness: Evaluating whether rhetorical devices (e.g., metaphors, puns) match the context’s style (e.g., judging the suitability of colloquial allegorical sayings in formal written contexts).
- Cultural metaphor comprehension: Assessing knowledge of cultural allusions (e.g., understanding "Zhou Yu beats Huang Gai" requires familiarity with the relevant allusion from Romance of the Three Kingdoms). These questions evaluate practical application skills through semantic discrimination in specific contexts, covering competencies from basic memory to contextual transfer.
Adaptation Method
LoRA Adaptation
LoRA fine-tuning uses low-rank decomposition to represent the parameter updates of large models, reducing the resources and time required. LoRA ("Low-Rank Adaptation of Large Language Models") originates from the paper LoRA: Low-Rank Adaptation of Large Language Models. Its core idea is to assume that the weight changes during task adaptation are low-rank, so parameter updates can be expressed via two smaller matrices while the pre-trained weights stay frozen. Applicable to NLP tasks such as content understanding and generation, LoRA significantly reduces the number of trainable parameters without adding inference latency, while maintaining or improving performance.
Dataset Composition and Specifications
Source Data Volume
- Training set: 796 items
- Validation set: 114 items
- Test set: 227 items
Evaluation Data Volume
- Publicly available test set: 227 items
Source Data Fields
KEY | EXPLAIN |
---|---|
question | Question data |
choices | List of four options |
answer | Correct answer (e.g., "C") |
Source Dataset Sample
{
"question": "____, which means not forcing others but waiting for willing participants or acceptors.",
"choices": [
"Confucius moving house — all books (losses)",
"The flood washes away the Dragon King's temple — family members not recognizing each other",
"Jiang Taigong fishing — those who are willing take the bait",
"Hanging a sheep's head but selling dog meat — having a reputation but no real substance"
],
"answer": "C"
}
Paper Citation
@inproceedings{hu2022lora,
title={LoRA: Low-Rank Adaptation of Large Language Models},
author={Hu, Edward J and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
booktitle={International Conference on Learning Representations},
year={2022}
}
Source Dataset Copyright Usage Notice
Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
C-KLR
Data Description
The Knowledge Application and Logical Reasoning Dataset includes various question types such as multiple-choice and fill-in-the-blank, with reasoning difficulty levels ranging from 1 to 10 points.
Adaptation Method
LoRA Adaptation
LoRA fine-tuning refers to using low-rank decomposition to represent the parameter updates of large models, thereby reducing the resources and time required for fine-tuning. LoRA is the abbreviation of "Low-Rank Adaptation of Large Language Models," originating from the paper LoRA: Low-Rank Adaptation of Large Language Models.
The basic idea of LoRA is to assume that the change in the model's weights during task adaptation is low-rank. Therefore, parameter updates can be represented by two smaller matrices while the pre-trained weights remain unchanged. LoRA can be applied to various natural language processing tasks, such as content understanding and generation. Experiments show that LoRA can significantly reduce the number of trainable parameters without adding inference latency, while maintaining or improving model performance.
Dataset Composition and Specifications
Source Data Volume
- Training set: 1,400 items
- Validation set: 200 items
- Test set: 400 items
Evaluation Data Volume
Publicly available test set of 400 items.
Source Data Fields
KEYS | EXPLAIN |
---|---|
question | Question data |
choices | List of four options |
answer | Correct answer |
Source Dataset Sample
{
"question": "Xiao Yan, Xiao Yi, and Xiao Kong stood out from their unit and participated in a job competition in the city. Five predictions were made: (1) Both Xiao Yan and Xiao Yi are selected; (2) At most one of Xiao Yan and Xiao Yi is selected; (3) Xiao Yan is selected, but Xiao Yi is not; (4) Xiao Yan is not selected, but Xiao Yi is selected; (5) If Xiao Yan is selected, then Xiao Kong is also selected. It turned out that only one prediction was correct. Which of the following can be inferred?",
"choices": [
"Neither Xiao Yan nor Xiao Yi is selected",
"Both Xiao Yi and Xiao Kong are selected",
"Neither Xiao Yan nor Xiao Kong is selected",
"Both Xiao Yan and Xiao Yi are selected"
],
"answer": "D"
}
Paper Citation
@inproceedings{hu2022lora,
title={LoRA: Low-Rank Adaptation of Large Language Models},
author={Hu, Edward J and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
booktitle={International Conference on Learning Representations},
year={2022}
}
Source Dataset Copyright Usage Notice
Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
C-CRM
Metric: Accuracy
Data Description
The National College Entrance Examination rewriting dataset contains authentic multiple-choice questions from mathematics, chemistry, biology, history, and geography sections of provincial/municipal college entrance exams. These questions were manually rewritten following strict rules to maintain validity and consistency.
Rewriting Methods
Rewriting Options
- Original Correct Option Modification:
- Replace the original correct option with an incorrect one.
- Introduce a new correct option (substantive changes, not just value adjustments or order swaps).
- Option Modification Limit:
- Only modify two options (the original correct option and the newly added correct option).
- Keep all other options unchanged.
Rewriting the Question
- Make minimal modifications to the question stem (e.g., adjusting numerical values, reaction formulas, or factual descriptions).
- Ensure the question remains single-choice with a changed correct answer.
Dataset Composition and Specifications
Evaluation Data Size
Subject | Number of Questions |
---|---|
Mathematics | 150 |
Chemistry | 150 |
Biology | 304 |
History | 50 |
Geography | 35 |
Total | 689 |
Source Data Fields
KEYS | EXPLANATION |
---|---|
question | Question stem |
options | List of all candidate options |
answer | Correct answer after rewriting |
Source Dataset Example
{
"question": "(5 points) In the complex plane, the point representing the complex number z=i(2+i) is located in ( )",
"options": [
"First quadrant",
"Second quadrant",
"Third quadrant",
"Fourth quadrant",
"None of the above"
],
"answer": "B"
}
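As the metric line above indicates, C-CRM is scored by accuracy. Below is a minimal sketch of one way such a score might be computed over predicted option letters; the function name and normalization are illustrative assumptions, not the evaluation platform's actual code.

```python
# Illustrative accuracy computation: the share of questions whose predicted
# option letter matches the (rewritten) gold answer.
def accuracy(predictions: list[str], references: list[str]) -> float:
    assert len(predictions) == len(references) and references
    correct = sum(p.strip().upper() == r.strip().upper()
                  for p, r in zip(predictions, references))
    return correct / len(references)

print(accuracy(["B", "A", "C"], ["B", "D", "C"]))  # -> 0.666...
```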
C-NRM
Metric: Accuracy
Data Description
The Civil Servant Exam rewriting dataset contains 300 multiple-choice questions adapted from real administrative aptitude test questions. These questions were manually rewritten following strict rules to maintain validity and consistency.
Rewriting Methods
Rewriting Options
- Original Correct Option Modification:
- Replace the original correct option with an incorrect one
- Introduce a new correct option (substantive changes required, not just value adjustments or order swaps)
- Option Modification Limit:
- Only modify two options (the original correct option and the newly added correct option)
- Keep all other options unchanged
Rewriting the Question
- Make minimal modifications to the question stem (e.g., adjusting values, factual descriptions, or contextual details)
- Ensure the question remains single-choice with a changed correct answer
Dataset Composition and Specifications
Evaluation Data Size
- Publicly available test set of 300 questions
Source Data Fields
KEYS | EXPLANATION |
---|---|
question | Question stem |
options | List of all candidate options |
answer | Correct answer after rewriting |
Source Dataset Example
{
"question": "Which of the following statements about China's military and national defense is incorrect?",
"options": [
"Nuclear capability serves as the strategic cornerstone for safeguarding national sovereignty and security",
"Under the new circumstances, our military's strategic guideline is to actively advance and move toward the deep blue waters",
"China is situated at the junction of the maritime geostrategic region and the Eurasian continental geostrategic region",
"Maintaining regional and world peace is one of the primary strategic tasks of our armed forces",
"None of the above"
],
"answer": "B"
}
C-LRM
Metric: Accuracy
Data Description
The National Unified Legal Professional Qualification Examination rewriting dataset contains 100 multiple-choice questions adapted from real exam questions. These questions were generated through manual rewriting while adhering to specific rules to ensure validity and consistency.
Rewriting Methods
Rewriting Options
- Original Correct Option Modification:
- Change the original correct option to an incorrect one.
- Introduce a completely new correct option (not merely adjusting values or reordering).
- Option Modification Limit:
- Only modify two options (the original correct option and the newly added correct option).
- Keep all other options unchanged.
Rewriting the Question
- Make minimal modifications to the question stem (e.g., adjusting numerical values, reaction formulas, or factual details).
- Ensure the question remains single-choice with a changed correct answer.
Dataset Composition and Specifications
Evaluation Data Size
Publicly available test set of 100 questions.
Source Data Fields
KEYS | EXPLANATION |
---|---|
question | Question stem |
options | List of all candidate options |
answer | Correct answer after rewriting |
Source Dataset Example
{
"question": "Xing is a world-renowned ceramic art master. On April 1, 2006, during an interview with CCTV-7's 'Countryside Date' program, Xing showcased his work—a five-layer 'hanging ball'—and boasted to the national audience: 'This is my first work, and it remains a global mystery. The layers are not fastened with wires but are interlocked in a way that no one has yet figured out. If anyone can replicate it, I will give them my three-story villa in downtown Dalian—the Xing Art Center—covering 2,000 square meters and worth 16 million yuan, along with all its assets.' After the broadcast, Sun, a ceramic enthusiast from Luoyang, Henan, successfully replicated the work. How should Xing's statement be legally characterized?",
"options": [
"An unconscionable contract",
"A contractual offer",
"A jesting statement, which Xing may revoke at any time",
"A reward advertisement, requiring Xing to hand over the villa",
"None of the above"
],
"answer": "C"
}
C-JRM
Metric: Accuracy
Data Description
The Academic Test rewriting dataset contains authentic multiple-choice questions from mathematics, chemistry, biology, history, and geography sections of provincial/municipal junior high school graduation exams. These questions were manually rewritten following strict rules to maintain validity and consistency.
Rewriting Methods
Rewriting Options
- Original Correct Option Modification:
- Replace the original correct option with an incorrect one
- Introduce a new correct option (substantive changes required, not just value adjustments or order swaps)
- Option Modification Limit:
- Only modify two options (the original correct option and the newly added correct option)
- Keep all other options unchanged
Rewriting the Question
- Make minimal modifications to the question stem (e.g., adjusting values, formulas, or factual descriptions)
- Ensure the question remains single-choice with a changed correct answer
Dataset Composition and Specifications
Evaluation Data Size
Subject | Number of Questions |
---|---|
Mathematics | 150 |
Chemistry | 150 |
Biology | 221 |
History | 50 |
Geography | 65 |
Total | 636 |
Source Data Fields
KEYS | EXPLANATION |
---|---|
question | Question stem |
options | List of all candidate options |
answer | Correct answer after rewriting |
Source Dataset Example
{
"question": "The result of calculating $\\left(a^{3}\\right)^{2} \\cdot a^{3}\\cdot \\frac{1}{a} $ is $(\\quad)$",
"options": [
"$a^{8}$",
"$a^{9}$",
"$a^{10}$",
"$a^{11}$",
"None of the above"
],
"answer": "A"
}