
Evaluation Dataset

The following datasets are all transformed into standard Evaluation Prompts before evaluation.

Dataset 1 (EPRSTMT)

Metrics: Exact match

Data description:

EPRSTMT (E-commerce Product Review Dataset for Sentiment Analysis) is a sentiment-analysis dataset of Chinese e-commerce product reviews.

Dataset structure:

Amount of source data:

The dataset is split into train (32), validation (32), public test (610), test (753), and unsupervised (19,565).

Data detail:

| KEYS | EXPLAIN |
| --- | --- |
| id | id of the example in the JSON file |
| sentence | the review text |
| label | sentiment label; 'Positive' means positive and 'Negative' means negative |

Sample of source dataset:

```json
{
  "id": 23,
  "sentence": "外包装上有点磨损,试听后感觉不错",
  "label": "Positive"
}
```
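Since every dataset on this page is converted into a standard Evaluation Prompt and scored by exact match, the conversion can be sketched as below. The prompt template is illustrative only; the actual standard Evaluation Prompt format is not specified on this page.

```python
import json

# Hypothetical prompt template; the real "standard Evaluation Prompt"
# wording is not given on this page.
PROMPT_TEMPLATE = "Review: {sentence}\nSentiment (Positive or Negative):"

def build_prompt(sample):
    """Render one EPRSTMT example into an evaluation prompt."""
    return PROMPT_TEMPLATE.format(sentence=sample["sentence"])

def exact_match(prediction, sample):
    """Exact-match scoring: the model output must equal the gold label."""
    return prediction.strip() == sample["label"]

sample = json.loads(
    '{"id": 23, "sentence": "外包装上有点磨损,试听后感觉不错", "label": "Positive"}'
)
print(build_prompt(sample))
print(exact_match("Positive", sample))  # True
print(exact_match("positive", sample))  # False: exact match is case-sensitive
```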

Citation information:

```bibtex
@misc{FewCLUE,
  title={FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark},
  author={Liang Xu and Xiaojing Lu and Chenyang Yuan and Xuanwei Zhang and Huilin Xu and Hu Yuan and Guoao Wei and Xiang Pan and Xin Tian and Libo Qin and Hai Hu},
  year={2021},
  howpublished={\url{https://arxiv.org/abs/2107.07498}},
}
```

Dataset 2 (TNEWS)

Metrics: Exact match

Data description:

TNEWS (Toutiao Short Text Classification for News) comes from the news section of Toutiao and covers news in 15 categories, including tourism, education, finance, and military.

Dataset structure:

Amount of source data:

618 samples are drawn from the test split of the source dataset.

Data detail:

| KEYS | EXPLAIN |
| --- | --- |
| label | classification ID |
| label_desc | classification name |
| sentence | news string (title only) |

Sample of source dataset:

```json
{
  "label": "102",
  "label_desc": "news_entertainment",
  "sentence": "江疏影甜甜圈自拍,迷之角度竟这么好看,美吸引一切事物"
}
```

Citation information:

```bibtex
@misc{FewCLUE,
  title={FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark},
  author={Liang Xu and Xiaojing Lu and Chenyang Yuan and Xuanwei Zhang and Huilin Xu and Hu Yuan and Guoao Wei and Xiang Pan and Xin Tian and Libo Qin and Hai Hu},
  year={2021},
  howpublished={\url{https://arxiv.org/abs/2107.07498}},
}
```

Dataset 3 (OCNLI)

Metrics: Exact match

Data description:

OCNLI (Original Chinese Natural Language Inference) is the first large-scale non-translated natural language inference dataset written in native Chinese. It contains more than 50,000 training examples, 3,000 validation examples, and 3,000 test examples. Data and labels are provided for all splits except the test set, which provides data only. OCNLI is part of the Chinese Language Understanding Evaluation benchmark (CLUE).

Dataset structure:

Amount of source data:

The dataset is split into train (32), validation (32), public test (2,520), test (3,000), and unsupervised (20,000).

Data detail:

| KEYS | EXPLAIN |
| --- | --- |
| level | difficulty: 'easy', 'medium', and 'hard' respectively denote the first, second, and third hypothesis written by the annotator for a given label (such as entailment) |
| sentence1 | the premise |
| sentence2 | the hypothesis |
| label | the majority vote over label0–label4; examples labeled '-' should be removed |
| label0–label4 | the five annotator labels; every validation and test example has all five labels, while only part of the training set does |
| genre | text category, one of five: government bulletin, news, literature, TV talk show, telephone conversation |
| prem_id | premise ID |
| id | example ID |

Sample of source dataset:

```json
{
  "level": "medium",
  "sentence1": "身上裹一件工厂发的棉大衣,手插在袖筒里",
  "sentence2": "身上至少一件衣服",
  "label": "entailment",
  "label0": "entailment",
  "label1": "entailment",
  "label2": "entailment",
  "label3": "entailment",
  "label4": "entailment",
  "genre": "lit",
  "prem_id": "lit_635",
  "id": 0
}
```
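The label field is described above as the majority vote over label0–label4. A minimal sketch of that vote, assuming a strict majority of the five annotations is required (ties correspond to the removable '-' case):

```python
from collections import Counter

def majority_label(annotations):
    """Return the majority label over label0..label4, or None when no
    strict majority exists (the case the dataset marks '-' and removes)."""
    label, count = Counter(annotations).most_common(1)[0]
    return label if count > len(annotations) / 2 else None

votes = ["entailment", "entailment", "entailment", "neutral", "contradiction"]
print(majority_label(votes))  # entailment
```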

Citation information:

```bibtex
@inproceedings{ocnli,
  title={OCNLI: Original Chinese Natural Language Inference},
  author={Hai Hu and Kyle Richardson and Liang Xu and Lu Li and Sandra Kuebler and Larry Moss},
  booktitle={Findings of EMNLP},
  year={2020},
  url={https://arxiv.org/abs/2010.05444}
}
```

Licensing information:

  • Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0)
  • News-genre premises were sampled from the LCMC corpus (ISLRN ID: 990-638-120-222-2, ELRA reference: ELRA-W0039) with the permission of ELRA.

Dataset 4 (BUSTM)

Metrics: Exact match

Data description:

BUSTM is a conversational short-text semantic matching dataset derived from XiaoBu Assistant (小布助手), a voice assistant developed by OPPO for its branded mobile phones and IoT devices that provides users with convenient conversational services. Intent recognition is a core task in dialogue systems, and short-text semantic matching is one of its main algorithms. The task is to predict whether the two sentences in a short-text query pair share the same semantics.

Dataset structure:

Amount of source data:

The dataset is split into train (32), validation (32), public test (1,772), test (2,000), and unsupervised (4,251).

Data detail:

| KEYS | EXPLAIN |
| --- | --- |
| id | data id |
| sentence1 | the first sentence |
| sentence2 | the second sentence |
| label | binary label; "1" means the two sentences share the same semantics, "0" means they do not |

Sample of source dataset:

```json
{
  "id": 5,
  "sentence1": "女孩子到底是不是你",
  "sentence2": "你不是女孩子吗",
  "label": "1"
}
```

Citation information:

```bibtex
@misc{FewCLUE,
  title={FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark},
  author={Liang Xu and Xiaojing Lu and Chenyang Yuan and Xuanwei Zhang and Huilin Xu and Hu Yuan and Guoao Wei and Xiang Pan and Xin Tian and Libo Qin and Hai Hu},
  year={2021},
  howpublished={\url{https://arxiv.org/abs/2107.07498}},
}
```

C-FCT (CiteCheck Dataset)

Dataset Description

The CiteCheck dataset contains 3,000 Chinese samples for evaluating citation faithfulness in Retrieval-Augmented Generation (RAG) systems. Each sample consists of:

  • Question: Input query to the RAG system
  • Answer: System-generated response with citations
  • Statement: Extracted claim from the answer with document references
  • Documents: Cited reference texts (1-5 documents per statement)
  • Label: Binary annotation (1 = fully supported, 0 = not fully supported)

Key Characteristics:

  • Balanced distribution: 1,500 positive and 1,500 negative samples
  • Real-world RAG outputs with verifiable citations
  • Focus on factual accuracy of cited information

Adaptation Method

Zero-shot Prompt-based Evaluation

```text
Determine whether the statement is fully supported by the reference text.
Statement: {statement}
Reference text: {quote}
Answer (output only one word, yes or no):
```

Prediction Rules:

  • Output contains only "yes" → Positive (1)
  • Output contains only "no" → Negative (0)
  • All other outputs → Considered invalid prediction
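The prediction rules above can be sketched as a small parser. Stripping whitespace and lower-casing the output are assumptions here; the rules only require a bare "yes" or "no".

```python
def parse_prediction(output):
    """Map raw model output to 1 (fully supported), 0 (not fully
    supported), or None for an invalid prediction."""
    normalized = output.strip().lower()
    if normalized == "yes":
        return 1
    if normalized == "no":
        return 0
    return None  # anything other than a bare yes/no is invalid

print(parse_prediction("Yes"))                   # 1
print(parse_prediction("no"))                    # 0
print(parse_prediction("yes, it is supported"))  # None
```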

Dataset Composition

Data Volume

Total samples: 3,000

Data Fields

| Field | Description |
| --- | --- |
| query | Original input question |
| answer | RAG system response |
| statement | Extracted factual claim |
| quote | Cited reference text |
| label | Support label (0/1) |

Dataset Samples

```json
{
  "query": "What is Tesla's market share in China's EV sector?",
  "answer": "Tesla held 21.7% market share in the EV sector in H1 2023.",
  "statement": "Tesla held 21.7% market share in the EV sector in H1 2023.",
  "quote": "[1] Global EV sales report shows Tesla leading with 21.7% market share...",
  "label": 1
}
{
  "query": "How many USB ports does Hisense 40E2F have?",
  "answer": "Hisense 40E2F has 2 USB ports located on the side panel...",
  "statement": "The USB ports are located on the side panel.",
  "quote": "[1] Product specs: 2 USB ports, HDMI inputs...",
  "label": 0
}
```
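A minimal sketch of sanity-checking a CiteCheck record against the field table above; the field names come from this page, not from an official loader.

```python
# The five fields listed in the Data Fields table.
REQUIRED_FIELDS = {"query", "answer", "statement", "quote", "label"}

def validate_sample(sample):
    """Return True when all five fields are present and the label is binary."""
    return REQUIRED_FIELDS <= sample.keys() and sample["label"] in (0, 1)

sample = {
    "query": "How many USB ports does Hisense 40E2F have?",
    "answer": "Hisense 40E2F has 2 USB ports located on the side panel...",
    "statement": "The USB ports are located on the side panel.",
    "quote": "[1] Product specs: 2 USB ports, HDMI inputs...",
    "label": 0,
}
print(validate_sample(sample))  # True
```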

Citation

```bibtex
@misc{xu2025citecheck,
  title={CiteCheck: Towards Accurate Citation Faithfulness Detection},
  author={Xu, Ziyao and Wei, Shaohang and Han, Zhuoheng and others},
  year={2025},
  eprint={2502.10881},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

License

MIT License. Permissions include commercial use, modification, and distribution with attribution.