IC Evaluation Data

Fluent Speech Commands

#Accuracy

Adaptation Method:

Linear Classifier: Features output by the upstream model are first passed through a global average pooling layer, then fed into a linear classifier consisting of a single fully connected layer. The input dimension of the classifier equals the dimensionality of the feature vector, and the output dimension equals the number of intent categories.
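The probing head described above can be sketched in NumPy. The 768-dimensional feature size and 31-way output below are illustrative assumptions, not values fixed by the benchmark:

```python
import numpy as np

def linear_probe(features: np.ndarray, weights: np.ndarray, bias: np.ndarray) -> int:
    """Global average pooling over time, then a single fully connected layer."""
    pooled = features.mean(axis=0)      # (feature_dim,) - average over time steps
    logits = pooled @ weights + bias    # (n_classes,)
    return int(np.argmax(logits))       # predicted intent index

# Illustrative shapes: 120 frames of 768-dim upstream features, 31 intent classes.
rng = np.random.default_rng(0)
feats = rng.normal(size=(120, 768))
W = rng.normal(size=(768, 31))
b = np.zeros(31)
pred = linear_probe(feats, W, b)
```

In an actual evaluation the weights would be trained with cross-entropy loss while the upstream model stays frozen; only the pooling-plus-linear head shown here is fit to the downstream task.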

Data Description:

The Fluent Speech Commands dataset contains 30,043 utterances from 97 speakers. Each file contains a voice command for controlling smart appliances or virtual assistants. The dataset includes three categories of intent (Action, Object, Location), encompassing a total of 31 unique sub-intents. The language is English.

Dataset structure:

Amount of source data:

Training set: 23,132 items, Validation set: 3,118 items, Test set: 3,793 items

Amount of Evaluation data:

The evaluation data volume is the public test set of 3,793 items.

Data detail:

KEYS            EXPLAIN
id              Data ID
path            Path to the corresponding WAV file
speakerId       Speaker ID
transcription   Text corresponding to the speech
action          Action-type intent
object          Object-type intent
location        Location-type intent

Sample of source dataset:

{
  "id":0,
  "path":"wavs/speakers/4BrX8aDqK2cLZRYl/cbdf5700-452c-11e9-b1e4-e5985dca719e.wav",
  "speakerId":"4BrX8aDqK2cLZRYl",
  "transcription":"Turn on the lights",
  "action":"activate",
  "object":"lights",
  "location":"none"
}
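The sample above carries three sub-intent fields. For intent classification the (action, object, location) triple is commonly collapsed into one composite label; a hypothetical helper (not part of the dataset release) might look like:

```python
# Join the three sub-intent slots into a single composite intent label.
# The "|" separator is an illustrative choice, not a dataset convention.
def composite_intent(sample: dict) -> str:
    return "|".join([sample["action"], sample["object"], sample["location"]])

sample = {
    "action": "activate",
    "object": "lights",
    "location": "none",
}
label = composite_intent(sample)   # "activate|lights|none"
```

Enumerating the distinct composite labels over the full dataset would recover the 31 unique sub-intent combinations mentioned in the data description.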

Citation information:

@article{lugosch2019speech,
  title={Speech model pre-training for end-to-end spoken language understanding},
  author={Lugosch, Loren and Ravanelli, Mirco and Ignoto, Patrick and Tomar, Vikrant Singh and Bengio, Yoshua},
  journal={arXiv preprint arXiv:1904.03670},
  year={2019}
}

Licensing information:

Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license

IC Evaluation Data

RealTalk-CN

#Accuracy

Adaptation Method:

Linear Classifier: Features output by the upstream model are first passed through a global average pooling layer for feature extraction, then input into a linear classifier containing a single fully connected linear layer. The input dimension of the classifier equals the dimensionality of the feature vector, and the output dimension equals the number of target categories.


Data Description:

RealTalk-CN is the first Chinese multi-turn, multi-domain speech-text bimodal task-oriented dialogue (TOD) benchmark dataset, designed to evaluate speech-based large language models (Speech-LLMs) in real-world spoken scenarios. It assesses models’ understanding ability, robustness, and cross-modal interaction capabilities.

The dataset includes 5.4k dialogues (around 60,000 utterances), totaling approximately 150 hours of audio. All dialogues are recorded from real Chinese speech and cover 58 topic domains, 55 intent categories, and 115 slot types. Natural disfluency phenomena are explicitly annotated.

RealTalk-CN also supports a Cross-Modal Chat setup, enabling users to dynamically switch between speech and text inputs, simulating authentic multimodal human-computer interaction.


Dataset Structure:

Amount of Source Data:

Set          Subset    Samples    Avg. Utterance Length    Avg. Turns per Dialogue
Training     MD-Col    6,269      27.60                    8.54
Training     MD-Sys    28,363     19.36                    7.74
Training     SD-Col    1,458      25.56                    8.23
Training     SD-Sys    5,848      28.90                    7.58
Validation   MD-Col    2,687      27.62                    8.54
Validation   MD-Sys    8,728      19.51                    7.72
Validation   SD-Col    626        25.00                    8.17
Validation   SD-Sys    2,504      20.89                    7.75
Test         MD-Col    3,837      27.42                    8.54
Test         MD-Sys    3,837      19.27                    7.73
Test         SD-Col    892        25.61                    8.14
Test         SD-Sys    892        20.76                    7.58

The dataset contains 5,400 dialogues in total, split approximately 7:2:1 for training, validation, and test sets. “MD” = Multi-Domain, “SD” = Single-Domain; “Col” = Colloquial, “Sys” = Systematic Text.
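As a quick arithmetic check, the four test subsets in the table above sum to the evaluation total reported below:

```python
# Test-split sizes taken from the statistics table above.
test_sizes = {"MD-Col": 3837, "MD-Sys": 3837, "SD-Col": 892, "SD-Sys": 892}
total_test = sum(test_sizes.values())   # 9458
```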


Amount of Evaluation Data:

The evaluation data correspond to the public test set, containing 9,458 samples across all four subsets (MD-Col, MD-Sys, SD-Col, SD-Sys).


Data Detail:

KEYS                                    EXPLAIN
id                                      Unique sample identifier
audio_file                              Path to the audio file
text                                    Speech transcription
original_data.dialogueID                Unique dialogue ID
original_data.roleID                    Role ID (1 = user, 2 = system)
original_data.gender                    Speaker gender
original_data.age                       Speaker age
original_data.region                    Speaker's region of origin
original_data.topicName                 Dialogue topic (e.g., weather, food, travel)
original_data.context                   Context of previous dialogue turns
original_data.text_content              Current utterance text
original_data.intent                    Intent ID
original_data.slot_type                 Slot type encoding
original_data.generative_label          Generative label with slot filling
original_data.slot_value_dict           Dictionary mapping slot types to values
original_data.choices                   List of candidate intents
original_data.hdTimeStart / hdTimeEnd   Audio start and end time (seconds)

Sample of Source Dataset:

{
  "id": "G40032S1017_3",
  "audio_file": "Spoken3MC/wavs/G40032/G40032S1017.wav",
  "text": "Provide location",
  "original_data": {
    "dialogueID": "G40032S1017",
    "roleID": 2,
    "gender": "Male",
    "age": 21.0,
    "region": "Hefei, Anhui",
    "topicName": "Weather, Food, Travel",
    "context": [
      {
        "roleID": 1,
        "text": "Please recommend a two-day food trip.",
        "hdTimeStart": 0.055,
        "hdTimeEnd": 2.645,
        "gender": "Male",
        "age": 21.0,
        "region": "Xinyang, Henan"
      },
      {
        "roleID": 2,
        "text": "You might consider visiting Snow Town in Heilongjiang National Forest Park — great snow views and local Northeastern cuisine.",
        "hdTimeStart": 9.790,
        "hdTimeEnd": 16.500,
        "gender": "Male",
        "age": 21.0,
        "region": "Hefei, Anhui"
      }
    ],
    "text_content": "Snow Town is located in Mudanjiang, Heilongjiang Province.",
    "intent": 24,
    "slot_type": "58 58 0 0 0 0 0 0 0 88 88 88 88 0 32 32 32 32 0",
    "generative_label": "Provide location (Tourist spot=Snow Town, Province=Heilongjiang, City=Mudanjiang)",
    "slot_value_dict": {
      "Tourist spot": ["Snow Town"],
      "Province": ["Heilongjiang"],
      "City": ["Mudanjiang"]
    },
    "choices": "['Introduce works','Introduce history','Provide location','Recommend attractions','Ask weather','Ask route']",
    "hdTimeStart": 54.795,
    "hdTimeEnd": 58.365
  }
}
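The generative_label string in the sample bundles the intent with its slot/value pairs. A hypothetical parser (not part of the dataset tooling) could recover them like this, assuming the "Intent (slot=value, ...)" shape shown above:

```python
import re

def parse_generative_label(label: str) -> tuple:
    """Split 'Intent (slot=value, ...)' into the intent string and a slot dict.

    Labels without a parenthesized slot list yield an empty dict. Values
    containing ', ' would break this simple split; fine for a sketch.
    """
    m = re.fullmatch(r"(.*?)\s*\((.*)\)", label)
    if m is None:
        return label, {}
    intent, body = m.group(1), m.group(2)
    slots = dict(pair.split("=", 1) for pair in body.split(", "))
    return intent, slots

intent, slots = parse_generative_label(
    "Provide location (Tourist spot=Snow Town, Province=Heilongjiang, City=Mudanjiang)"
)
# intent == "Provide location"
# slots  == {"Tourist spot": "Snow Town", "Province": "Heilongjiang", "City": "Mudanjiang"}
```

The recovered dict matches the slot_value_dict field of the sample (modulo its list-valued entries), which is what a generative slot-filling metric would compare against.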

Citation Information:

@article{wang2025realtalkcn,
  title={RealTalk-CN: A Realistic Chinese Speech-Text Dialogue Benchmark With Cross-Modal Interaction Analysis},
  author={Wang, Enzhi and Li, Qicheng and Zhao, Shiwan and Kong, Aobo and Zhou, Jiaming and Yang, Xi and Wang, Yequan and Lin, Yonghua and Qin, Yong},
  journal={arXiv preprint arXiv:2508.10015},
  year={2025}
}

Licensing Information:

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license. See the official RealTalk-CN page for details.