Evaluation Datasets
Each of the following datasets is converted into standard Evaluation Prompts before evaluation; a minimal conversion sketch is given below.
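The sketch below shows one way such a conversion could look. The helper names and the assumption that each split is stored as a JSON-lines file are ours, not part of the benchmark:

```python
import json
from pathlib import Path

def load_jsonl(path: str) -> list[dict]:
    """Read one JSON sample per line from a split file (file layout assumed)."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def render_prompt(template: str, sample: dict) -> str:
    """Fill the {field} slots of a prompt template from a sample's keys."""
    return template.format(**sample)
```

The per-dataset sketches below reuse this pattern with dataset-specific templates.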
Dataset 1 (EPRSTMT)
Data description:
EPRSTMT (E-commerce Product Review Dataset for Sentiment Analysis) is a binary sentiment-analysis dataset of e-commerce product reviews.
Dataset structure:
Amount of source data:
The dataset is split into train (32), validation (32), public test (610), test (753), and unsupervised (19565).
Data detail:
KEY | EXPLANATION |
---|---|
id | ID of the sample in the JSON file |
sentence | review text |
label | sentiment label: 'Positive' or 'Negative' |
Sample of source dataset:
{"id": 23,
"sentence": "外包装上有点磨损,试听后感觉不错",
"label": "Positive"}
Citation information:
@misc{FewCLUE,
title={FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark},
author={Liang Xu and Xiaojing Lu and Chenyang Yuan and Xuanwei Zhang and Huilin Xu and Hu Yuan and Guoao Wei and Xiang Pan and Xin Tian and Libo Qin and Hai Hu},
year={2021},
howpublished={\url{https://arxiv.org/abs/2107.07498}},
}
Dataset 2 (TNEWS)
Data description:
TNEWS (Toutiao Short Text Classification for News) is drawn from the news section of Toutiao and covers news in 15 categories, including tourism, education, finance, and military.
Dataset structure:
Amount of source data:
Evaluation samples are drawn from the test split of the source dataset (618 samples).
Data detail:
KEY | EXPLANATION |
---|---|
label | classification ID |
label_des | classification name |
sentence | news string (title only) |
Sample of source dataset:
{"label": "102",
"label_des": "news_entertainment",
"sentence": "江疏影甜甜圈自拍,迷之角度竟这么好看,美吸引一切事物"}
Citation information:
@misc{FewCLUE,
title={FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark},
author={Liang Xu and Xiaojing Lu and Chenyang Yuan and Xuanwei Zhang and Huilin Xu and Hu Yuan and Guoao Wei and Xiang Pan and Xin Tian and Libo Qin and Hai Hu},
year={2021},
howpublished={\url{https://arxiv.org/abs/2107.07498}},
}
Dataset 3 (OCNLI)
Data description:
OCNLI (Original Chinese Natural Language Inference) is the first large-scale Chinese natural language inference dataset built from native, non-translated Chinese text. It contains more than 50,000 training examples, 3,000 validation examples, and 3,000 test examples. Data and labels are provided for all splits except the test set, which provides data only. OCNLI is part of the Chinese Language Understanding Evaluation benchmark (CLUE).
Dataset structure:
Amount of source data:
The dataset is split into train (32), validation (32), public test (2520), test (3000), and unsupervised (20000).
Data detail:
KEY | EXPLANATION |
---|---|
level | difficulty: 'easy', 'medium', and 'hard' mark the first, second, and third hypothesis an annotator wrote for a given label (such as entailment) |
sentence1 | sentence 1, the premise |
sentence2 | sentence 2, the hypothesis |
label | gold label, the majority vote over label0 - label4; samples labeled '-' have no majority and should be removed |
label0 - label4 | the five individual annotator labels; every validation and test sample carries all five, while only part of the training set does |
genre | text category, one of five: government bulletin, news, literature, TV talk show, telephone conversation |
prem_id | premise ID |
id | sample ID |
Sample of source dataset:
{
"level": "medium",
"sentence1": "身上裹一件工厂发的棉大衣,手插在袖筒里",
"sentence2": "身上至少一件衣服",
"label": "entailment",
"label0": "entailment", "label1": "entailment", "label2": "entailment", "label3": "entailment", "label4": "entailment",
"genre": "lit",
"prem_id": "lit_635",
"id": 0
}
(English gloss: sentence1 "Wrapped in a factory-issued cotton overcoat, hands tucked into the sleeves"; sentence2 "Wearing at least one piece of clothing.")
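The label fields above imply two small preprocessing steps, sketched here under the assumption that samples are already parsed into dicts:

```python
from collections import Counter

def clean_ocnli(samples: list[dict]) -> list[dict]:
    """Drop samples whose gold label is '-', as the field description requires."""
    return [s for s in samples if s.get("label") != "-"]

def majority_label(sample: dict) -> str:
    """Recover the gold label as the majority vote over label0..label4."""
    votes = [sample[f"label{i}"] for i in range(5) if f"label{i}" in sample]
    return Counter(votes).most_common(1)[0][0]
```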
Citation information:
@inproceedings{ocnli,
title={OCNLI: Original Chinese Natural Language Inference},
author={Hai Hu and Kyle Richardson and Liang Xu and Lu Li and Sandra Kuebler and Larry Moss},
booktitle={Findings of EMNLP},
year={2020},
url={https://arxiv.org/abs/2010.05444}
}
Licensing information:
• Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0)
• News-genre premises were sampled from the LCMC corpus (ISLRN ID: 990-638-120-222-2, ELRA reference: ELRA-W0039) with the permission of ELRA.
Dataset 4 (BUSTM)
Data description:
BUSTM is a conversational short-text semantic matching dataset derived from XiaoBu Assistant (小布助手), the voice assistant developed by OPPO for its branded mobile phones and IoT devices, which provides users with convenient conversational services. Intent recognition is a core task in dialogue systems, and short-text semantic matching is one of its main algorithms. The task is to predict whether the two queries in a short-text query pair share the same semantics.
Dataset structure:
Amount of source data:
The dataset is split into train(32), validation(32), public test(1772), test(2000), unsupervised (4251)
Data detail:
KEY | EXPLANATION |
---|---|
id | data ID |
sentence1 | first sentence of the query pair |
sentence2 | second sentence of the query pair |
label | binary label: "1" means the two sentences share the same semantics, "0" means they do not |
Sample of source dataset:
{"id": 5,
"sentence1": "女孩子到底是不是你",
"sentence2": "你不是女孩子吗",
"label": "1"}
Citation information:
@misc{FewCLUE,
title={FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark},
author={Liang Xu and Xiaojing Lu and Chenyang Yuan and Xuanwei Zhang and Huilin Xu and Hu Yuan and Guoao Wei and Xiang Pan and Xin Tian and Libo Qin and Hai Hu},
year={2021},
howpublished={\url{https://arxiv.org/abs/2107.07498}},
}
Dataset 5 (C-FCT, the CiteCheck dataset)
Dataset Description
The CiteCheck dataset contains 3,000 Chinese samples for evaluating citation faithfulness in Retrieval-Augmented Generation (RAG) systems. Each sample consists of:
- Question: Input query to the RAG system
- Answer: System-generated response with citations
- Statement: Extracted claim from the answer with document references
- Documents: Cited reference texts (1-5 documents per statement)
- Label: Binary annotation (1 = fully supported, 0 = not fully supported)
Key Characteristics:
- Balanced distribution: 1,500 positive and 1,500 negative samples
- Real-world RAG outputs with verifiable citations
- Focus on factual accuracy of cited information
Adaptation Method
Zero-shot Prompt-based Evaluation
```text
Determine whether the statement is fully supported by the reference text.
Statement: {statement}
Reference text: {quote}
Answer (output only one word, yes or no):
```
Prediction Rules (a minimal parsing sketch follows this list):
- Output contains only "yes" → Positive (1)
- Output contains only "no" → Negative (0)
- All other outputs → considered an invalid prediction
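A minimal Python rendering of these rules; lower-casing and whitespace-stripping of the raw output are our assumptions:

```python
def parse_citecheck_output(output: str) -> int | None:
    """Apply the prediction rules: 'yes' -> 1, 'no' -> 0, anything else invalid."""
    answer = output.strip().lower()  # normalization is an assumption, not a spec
    if answer == "yes":
        return 1
    if answer == "no":
        return 0
    return None  # invalid prediction
```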
Dataset Composition
Data Volume
Total samples: 3,000
Data Fields
Field | Description |
---|---|
query | Original input question |
answer | RAG system response |
statement | Extracted factual claim |
quote | Cited reference text |
label | Support label (0/1) |
Dataset Samples
{
"query": "What is Tesla's market share in China's EV sector?",
"answer": "Tesla held 21.7% market share in the EV sector in H1 2023.",
"statement": "Tesla held 21.7% market share in the EV sector in H1 2023.",
"quote": "[1] Global EV sales report shows Tesla leading with 21.7% market share...",
"label": 1
}
{
"query": "How many USB ports does Hisense 40E2F have?",
"answer": "Hisense 40E2F has 2 USB ports located on the side panel...",
"statement": "The USB ports are located on the side panel.",
"quote": "[1] Product specs: 2 USB ports, HDMI inputs...",
"label": 0
}
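Putting the pieces together, a brief usage sketch that fills the zero-shot template from the second sample above; the model call itself is elided and stubbed with a fixed string:

```python
CITECHECK_PROMPT = (
    "Determine whether the statement is fully supported by the reference text.\n"
    "Statement: {statement}\n"
    "Reference text: {quote}\n"
    "Answer (output only one word, yes or no):"
)

# Second sample from above.
sample = {
    "statement": "The USB ports are located on the side panel.",
    "quote": "[1] Product specs: 2 USB ports, HDMI inputs...",
    "label": 0,
}
prompt = CITECHECK_PROMPT.format(statement=sample["statement"], quote=sample["quote"])
model_output = "no"  # stand-in for a real model response
prediction = {"yes": 1, "no": 0}.get(model_output.strip().lower())  # None = invalid
assert prediction == sample["label"]
```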
Citation
@misc{xu2025citecheck,
title={CiteCheck: Towards Accurate Citation Faithfulness Detection},
author={Xu, Ziyao and Wei, Shaohang and Han, Zhuoheng and others},
year={2025},
eprint={2502.10881},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
License
MIT License. Permissions include commercial use, modification, and distribution, with attribution required.