SF Evaluation Data
SNIPS
#Accuracy, Robustness
Adaptation Method:
CTC Decoder: Features output by the upstream model are passed through two LSTM layers followed by a fully connected linear layer that serves as the classifier. The input dimension equals the feature vector dimension, and the output dimension equals the number of slot types.
Related paper citation:
@inproceedings{graves2006connectionist,
title={Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks},
author={Graves, Alex and Fern{\'a}ndez, Santiago and Gomez, Faustino and Schmidhuber, J{\"u}rgen},
booktitle={Proceedings of the 23rd international conference on Machine learning},
pages={369--376},
year={2006}
}
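The description above covers the model layers but not how frame-level predictions become a label sequence. As a minimal sketch of the greedy (best-path) CTC rule from the cited paper, the snippet below collapses repeated frame labels and then drops blanks. The function name `ctc_greedy_decode` and the choice of index 0 for the blank symbol are assumptions for illustration, not details taken from this benchmark's implementation.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, blank_id=0):
    """Collapse consecutive repeats, then remove blank labels.

    frame_ids: per-frame argmax label indices from the classifier.
    blank_id:  index of the CTC blank symbol (assumed to be 0 here).
    """
    # Step 1: merge runs of identical labels into a single label.
    collapsed = [label for label, _ in groupby(frame_ids)]
    # Step 2: drop the blank symbol, which only separates repeats.
    return [label for label in collapsed if label != blank_id]


# Hypothetical per-frame predictions; 3 and 5 stand in for slot-type indices.
frames = [0, 3, 3, 0, 0, 5, 5, 5, 0, 5]
decoded = ctc_greedy_decode(frames)
print(decoded)  # [3, 5, 5]
```

The blank between the two runs of 5 is what lets the same label appear twice in a row in the decoded output.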
Data Description:
The SNIPS Natural Language Understanding benchmark is a dataset containing over 16,000 crowdsourced queries in English, distributed across 7 user intents of varying complexity: SearchCreativeWork (e.g., find me a robot TV show), GetWeather (e.g., is it windy in Boston, Massachusetts right now?), BookRestaurant (e.g., I want to book a highly rated restaurant in Paris for tomorrow night), PlayMusic (e.g., play the last track by Beyoncé on Spotify), AddToPlaylist (e.g., add Diamonds to my road trip playlist), RateBook (e.g., give six stars to Of Mice and Men), SearchScreeningEvent (e.g., check the screening time for Wonder Woman in Paris).
Dataset structure:
Amount of source data:
Training set: 13,084 items, Validation set: 700 items, Test set: 700 items.
Amount of Evaluation data:
The evaluation data is the public test set of 700 items.
Data detail:
Key | Description |
---|---|
id | Path to the data's MP3 file |
text | Text corresponding to the MP3 file |
label | Slot type for each token |
Sample of source dataset:
{
"id":"Aditi-snips-test-0",
"text":"BOS I'D LIKE TO HAVE THIS TRACK ONTO MY CLASSICAL RELAXATIONS PLAYLIST EOS",
"label":"O O O O O O music_item O playlist_owner playlist playlist O AddToPlaylist"
}
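In the sample above, `text` and `label` contain the same number of whitespace-separated tokens, and the label aligned with `EOS` appears to carry the intent. The sketch below pairs tokens with labels under that reading. Treating the final label as the intent is an assumption drawn from this one sample, and `split_intent_and_slots` is a hypothetical helper name.

```python
sample = {
    "id": "Aditi-snips-test-0",
    "text": "BOS I'D LIKE TO HAVE THIS TRACK ONTO MY CLASSICAL RELAXATIONS PLAYLIST EOS",
    "label": "O O O O O O music_item O playlist_owner playlist playlist O AddToPlaylist",
}


def split_intent_and_slots(record):
    """Return (intent, {slot_type: phrase}) for one record."""
    tokens = record["text"].split()
    labels = record["label"].split()
    if len(tokens) != len(labels):
        raise ValueError("text and label must have the same number of tokens")

    # Assumption: the label aligned with the trailing EOS token is the intent.
    intent = labels[-1]

    # Skip BOS/EOS and collect the words tagged with each slot type.
    # Simplification: words sharing a slot type are joined into one phrase,
    # even if they are not adjacent in the utterance.
    slots = {}
    for token, label in zip(tokens[1:-1], labels[1:-1]):
        if label != "O":
            slots.setdefault(label, []).append(token)
    return intent, {name: " ".join(words) for name, words in slots.items()}


intent, slots = split_intent_and_slots(sample)
print(intent)  # AddToPlaylist
print(slots)   # {'music_item': 'TRACK', 'playlist_owner': 'MY', 'playlist': 'CLASSICAL RELAXATIONS'}
```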
Citation information:
@article{coucke2018snips,
title={Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces},
author={Coucke, Alice and Saade, Alaa and Ball, Adrien and Bluche, Th{\'e}odore and Caulier, Alexandre and Leroy, David and Doumouro, Cl{\'e}ment and Gisselbrecht, Thibault and Caltagirone, Francesco and Lavril, Thibaut and others},
journal={arXiv preprint arXiv:1805.10190},
year={2018}
}
Licensing information:
Creative Commons Zero v1.0 Universal
RealTalk-CN
#Accuracy
Adaptation Method:
CTC Decoder: The features output by the upstream model are passed through two LSTM layers followed by a fully connected linear layer as the classifier. The input dimension equals the feature vector dimension, and the output dimension equals the number of slot types.
Data Description:
RealTalk-CN is the first Chinese multi-turn, multi-domain speech-text bimodal task-oriented dialogue (TOD) benchmark dataset, designed to evaluate speech-based large language models (Speech-LLMs) for their understanding, robustness, and cross-modal interaction abilities in realistic spoken environments.
The dataset contains 5.4k dialogues (approximately 60,000 utterances), totaling 150 hours of audio recordings. All dialogues are collected from real spoken Chinese and cover 58 topical domains, 55 intent categories, and 115 slot types. It also explicitly annotates natural speech disfluencies.
RealTalk-CN introduces a Cross-Modal Chat task that allows users to dynamically switch between speech and text inputs, simulating real-world multimodal human-computer interaction scenarios.
Dataset Structure and Specification:
Amount of Source Data:
Set | Subset | Samples | Avg. Utterance Length | Avg. Turns per Dialogue |
---|---|---|---|---|
Training | MD-Col | 6,269 | 27.60 | 8.54 |
Training | MD-Sys | 28,363 | 19.36 | 7.74 |
Training | SD-Col | 1,458 | 25.56 | 8.23 |
Training | SD-Sys | 5,848 | 28.90 | 7.58 |
Validation | MD-Col | 2,687 | 27.62 | 8.54 |
Validation | MD-Sys | 8,728 | 19.51 | 7.72 |
Validation | SD-Col | 626 | 25.00 | 8.17 |
Validation | SD-Sys | 2,504 | 20.89 | 7.75 |
Test | MD-Col | 3,837 | 27.42 | 8.54 |
Test | MD-Sys | 3,837 | 19.27 | 7.73 |
Test | SD-Col | 892 | 25.61 | 8.14 |
Test | SD-Sys | 892 | 20.76 | 7.58 |
The dataset includes 5,400 dialogues in total, split approximately 7:2:1 among the training, validation, and test sets. “MD” = Multi-Domain, “SD” = Single-Domain; “Col” = Colloquial, “Sys” = Systematic Text.
Amount of Evaluation Data:
The evaluation data is the public test set, which contains 9,458 samples across all four subsets (MD-Col, MD-Sys, SD-Col, SD-Sys).
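The evaluation-set size can be checked against the Test rows of the table above. This short sketch simply sums the four per-subset sample counts; the dictionary name is illustrative.

```python
# Sample counts for the Test split, copied from the table above.
test_samples = {
    "MD-Col": 3837,
    "MD-Sys": 3837,
    "SD-Col": 892,
    "SD-Sys": 892,
}

total_test_samples = sum(test_samples.values())
print(total_test_samples)  # 9458
```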
Data Fields:
Key | Description |
---|---|
id | Unique sample ID |
audio_file | Path to the audio file |
text | Speech transcription |
original_data.dialogueID | Unique dialogue identifier |
original_data.roleID | Role ID (1 = user, 2 = system) |
original_data.gender | Speaker gender |
original_data.age | Speaker age |
original_data.region | Speaker’s region of origin |
original_data.topicName | Dialogue topic (e.g., weather, food, travel) |
original_data.context | Previous dialogue turns |
original_data.text_content | Current utterance text |
original_data.intent | Intent ID |
original_data.slot_type | Slot type encoding |
original_data.generative_label | Generative label with slot filling |
original_data.slot_value_dict | Dictionary of slot types and corresponding values |
original_data.choices | Candidate intent list |
original_data.hdTimeStart / hdTimeEnd | Audio start and end timestamps (in seconds) |
Sample of Source Dataset:
{
"id": "G40032S1017_3",
"audio_file": "Spoken3MC/wavs/G40032/G40032S1017.wav",
"text": "Provide location",
"original_data": {
"dialogueID": "G40032S1017",
"roleID": 2,
"gender": "Male",
"age": 21.0,
"region": "Hefei, Anhui",
"topicName": "Weather, Food, Travel",
"context": [
{
"roleID": 1,
"text": "Please recommend a two-day food trip.",
"hdTimeStart": 0.055,
"hdTimeEnd": 2.645,
"gender": "Male",
"age": 21.0,
"region": "Xinyang, Henan"
},
{
"roleID": 2,
"text": "You might consider visiting Snow Town in Heilongjiang National Forest Park — great snow views and local Northeastern cuisine.",
"hdTimeStart": 9.790,
"hdTimeEnd": 16.500,
"gender": "Male",
"age": 21.0,
"region": "Hefei, Anhui"
}
],
"text_content": "Snow Town is located in Mudanjiang, Heilongjiang Province.",
"intent": 24,
"slot_type": "58 58 0 0 0 0 0 0 0 88 88 88 88 0 32 32 32 32 0",
"generative_label": "Provide location (Tourist spot=Snow Town, Province=Heilongjiang, City=Mudanjiang)",
"slot_value_dict": {
"Tourist spot": ["Snow Town"],
"Province": ["Heilongjiang"],
"City": ["Mudanjiang"]
},
"choices": "['Introduce works','Introduce history','Provide location','Recommend attractions','Ask weather','Ask route']",
"hdTimeStart": 54.795,
"hdTimeEnd": 58.365
}
}
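Two fields in the sample above are stored as strings rather than structured values: `slot_type` is a space-separated sequence of integer IDs, and `choices` is a string that looks like a Python list literal. The sketch below shows one way to read them. It assumes that ID 0 means "no slot" and that positions index units of the original utterance; neither is stated in the source, and the mapping from IDs to slot names is not given here. `slot_spans` is a hypothetical helper name.

```python
import ast
from itertools import groupby

# Fields copied from the sample record above.
text = "Provide location"
original_data = {
    "slot_type": "58 58 0 0 0 0 0 0 0 88 88 88 88 0 32 32 32 32 0",
    "choices": "['Introduce works','Introduce history','Provide location',"
               "'Recommend attractions','Ask weather','Ask route']",
}


def slot_spans(slot_type, outside_id=0):
    """Group runs of equal slot IDs into (slot_id, start, end) spans.

    `end` is exclusive. `outside_id` marks positions without a slot
    (assumed to be 0).
    """
    ids = [int(x) for x in slot_type.split()]
    spans, start = [], 0
    for slot_id, run in groupby(ids):
        length = len(list(run))
        if slot_id != outside_id:
            spans.append((slot_id, start, start + length))
        start += length
    return spans


# literal_eval parses the list literal without executing arbitrary code.
choices = ast.literal_eval(original_data["choices"])
answer_index = choices.index(text)
spans = slot_spans(original_data["slot_type"])

print(answer_index)  # 2
print(spans)         # [(58, 0, 2), (88, 9, 13), (32, 14, 18)]
```

The three spans line up in number with the three entries of `slot_value_dict` in the sample, which is consistent with, but does not confirm, the reading of `slot_type` used here.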
Citation Information:
@article{wang2025realtalkcn,
title={RealTalk-CN: A Realistic Chinese Speech-Text Dialogue Benchmark With Cross-Modal Interaction Analysis},
author={Wang, Enzhi and Li, Qicheng and Zhao, Shiwan and Kong, Aobo and Zhou, Jiaming and Yang, Xi and Wang, Yequan and Lin, Yonghua and Qin, Yong},
journal={arXiv preprint arXiv:2508.10015},
year={2025}
}
Licensing Information:
CC BY-NC-SA 4.0. See the official RealTalk-CN page for details.