SER Evaluation Data
The Interactive Emotional Dyadic Motion Capture (IEMOCAP)
Metrics: WAR, UAR
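WAR (weighted average recall) weights each class's recall by its frequency and is equivalent to overall accuracy, while UAR (unweighted average recall) is the plain mean of per-class recalls and is robust to class imbalance. Below is a minimal sketch of computing both with scikit-learn; the helper name `compute_war_uar` and the toy labels are illustrative, not part of the benchmark code:

```python
from sklearn.metrics import recall_score

def compute_war_uar(y_true, y_pred):
    """Compute WAR (weighted average recall) and UAR (unweighted average recall)."""
    # WAR: per-class recall weighted by class support, i.e. overall accuracy.
    war = recall_score(y_true, y_pred, average="weighted")
    # UAR: unweighted mean of per-class recalls, robust to class imbalance.
    uar = recall_score(y_true, y_pred, average="macro")
    return war, uar

# Toy example with the four evaluation classes (neu/ang/sad/hap).
labels_true = ["neu", "ang", "sad", "hap", "hap", "neu"]
labels_pred = ["neu", "ang", "hap", "hap", "sad", "neu"]
print(compute_war_uar(labels_true, labels_pred))
```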
Adaptation Method
Linear Classifier. The features output by the upstream model are first aggregated by a global average pooling layer and then fed into the linear classifier, which consists of a single fully connected layer. The input dimension of the linear classifier equals the dimension of the pooled feature vector, and its output dimension equals the number of emotion classes.
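A minimal sketch of this adaptation head, assuming PyTorch and upstream features of shape (batch, time, feature_dim); the class name `LinearClassifierHead` and the example dimensions are illustrative, not taken from the benchmark:

```python
import torch
import torch.nn as nn

class LinearClassifierHead(nn.Module):
    """Global average pooling over time followed by a single linear layer."""
    def __init__(self, feature_dim: int, num_classes: int):
        super().__init__()
        # Input dim = upstream feature dim, output dim = number of emotion classes.
        self.linear = nn.Linear(feature_dim, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, feature_dim) from the upstream model.
        pooled = features.mean(dim=1)      # global average pooling over time
        return self.linear(pooled)         # (batch, num_classes) logits

# Example: 768-dim upstream features, 4 emotion classes (neu/ang/sad/hap).
head = LinearClassifierHead(feature_dim=768, num_classes=4)
logits = head(torch.randn(2, 100, 768))
print(logits.shape)  # torch.Size([2, 4])
```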
Data description
The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database is an acted, multimodal, and multi-speaker emotion database. It contains approximately 12 hours of audio-visual data, including video, audio, facial motion capture, and text transcriptions. It comprises dyadic sessions in which actors perform improvisations or scripted scenarios.
Dataset structure
Amount of source data
The source data are distributed across Neutral (1,708), Angry (1,103), Sad (1,084), Happy (595), Excited (1,041), Scared (40), Surprised (107), Frustrated (1,849), and Other (2,507) instances.
Amount of evaluation data
The evaluation set consists of 5,531 instances from the source data, covering four classes (Neutral, Angry, Sad, Happy), with the Excited samples merged into the Happy category.
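A minimal sketch of how such a four-class subset can be derived from the source annotations, assuming the short label codes used in the samples below (`neu`, `ang`, `sad`, `hap`, `exc`, and `fru` for Frustrated); the mapping function is illustrative rather than the benchmark's exact preprocessing:

```python
# Keep the four evaluation classes plus Excited, which is folded into Happy.
KEEP = {"neu", "ang", "sad", "hap", "exc"}
MERGE = {"exc": "hap"}

def to_eval_label(label):
    """Map a source label to its evaluation label, or None if the utterance is excluded."""
    if label not in KEEP:
        return None
    return MERGE.get(label, label)

print(to_eval_label("exc"))  # 'hap'
print(to_eval_label("fru"))  # None (Frustrated is not evaluated)
```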
Data detail
| KEYS | EXPLAIN |
|---|---|
| id | data id |
| sentence | the content of the speech |
| label | emotion label |
Sample of dataset
{
  "id": "Ses01F_impro01_F000",
  "sentence": "Excuse me.",
  "label": "neu"
}
Citation information
@article{busso2008iemocap,
title={IEMOCAP: Interactive emotional dyadic motion capture database},
author={Busso, Carlos and Bulut, Murtaza and Lee, Chi-Chun and Kazemzadeh, Abe and Mower, Emily and Kim, Samuel and Chang, Jeannette N and Lee, Sungbok and Narayanan, Shrikanth S},
journal={Language Resources and Evaluation},
volume={42},
pages={335--359},
year={2008},
publisher={Springer}
}
Licensing information
MSP-IMPROV
Metrics: WAR, UAR
Adaptation Method
Linear Classifier. The features output by the upstream model are first aggregated by a global average pooling layer and then fed into the linear classifier, which consists of a single fully connected layer. The input dimension of the linear classifier equals the dimension of the pooled feature vector, and its output dimension equals the number of emotion classes.
Data description
The MSP-IMPROV database is an acted, multimodal, and multi-speaker emotion database. It is constructed similarly to the IEMOCAP dataset, but with 12 actors and six sessions.
Dataset structure
Amount of evaluation data
The evaluation set consists of 7,798 instances covering four classes (Neutral, Angry, Sad, Happy).
Data detail
| KEYS | EXPLAIN |
|---|---|
| id | data id |
| sentence | the content of the speech |
| label | emotion label |
Sample of dataset
{
  "id": "MSP-IMPROV-S01A-F01-P-FM01",
  "sentence": "I have to go to class. How can I not? Okay.",
  "label": "ang"
}
Citation information
@article{busso2016msp,
title={MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception},
author={Busso, Carlos and Parthasarathy, Srinivas and Burmania, Alec and AbdelWahab, Mohammed and Sadoughi, Najmeh and Provost, Emily Mower},
journal={IEEE Transactions on Affective Computing},
volume={8},
number={1},
pages={67--80},
year={2016},
publisher={IEEE}
}
Licensing information
EmotionTalk
Metrics: WAR, UAR
Adaptation Method
Linear Classifier. The features output by the upstream model are first aggregated by a global average pooling layer and then fed into the linear classifier, which consists of a single fully connected layer. The input dimension of the linear classifier equals the dimension of the pooled feature vector, and its output dimension equals the number of emotion classes.
Data description
EmotionTalk is an interactive Chinese multimodal emotion dataset with rich annotations. It was released by Nankai University. This dataset provides multimodal information from 19 actors participating in dyadic conversation settings, incorporating acoustic, visual, and textual modalities. It includes 23.6 hours of speech (19,250 utterances), annotations for 7 utterance-level emotion categories (happy, surprise, sad, disgust, anger, fear, and neutral), 5-dimensional sentiment labels (negative, weakly negative, neutral, weakly positive, and positive) and 4-dimensional speech captions (speaker, speaking style, emotion and overall).
Dataset structure
Amount of source data
Training set: 15,413; Validation set: 1,908; Test set: 1,929
Amount of evaluation data
Public test set: 1,929 utterances
Data detail
Includes video, audio files, sentence transcriptions, and discrete / continuous / caption emotion annotations.
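Each utterance carries discrete and continuous labels from five annotators (A–E in the sample shown below) together with aggregated results. A minimal sketch of one plausible aggregation, assuming majority vote for the discrete emotion and the mean for the continuous label (consistent with the sample values, but not necessarily the dataset's official procedure):

```python
from collections import Counter

def aggregate(sample):
    """Aggregate per-annotator labels: majority-vote emotion, mean continuous label (assumed)."""
    annots = sample["data"].values()
    emotions = [a["emotion"] for a in annots]
    continuous = [float(a["Continuous_label"]) for a in annots]
    emotion_result = Counter(emotions).most_common(1)[0][0]
    continuous_result = sum(continuous) / len(continuous)
    return emotion_result, continuous_result

sample = {"data": {
    "A": {"emotion": "happy", "Continuous_label": 1},
    "B": {"emotion": "happy", "Continuous_label": 0},
    "C": {"emotion": "happy", "Continuous_label": 1},
    "D": {"emotion": "happy", "Continuous_label": 1},
    "E": {"emotion": "happy", "Continuous_label": 1},
}}
print(aggregate(sample))  # ('happy', 0.8)
```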
data/
├── audio/*.tar
├── Text/*.tar
├── Video/*.tar
└── Multimodal/*.tar
Sample of dataset
{
"data": {
"A": {
"emotion": "happy",
"Confidence_degree": "9",
"Continuous_label": 1
},
"B": {
"emotion": "happy",
"Confidence_degree": "9",
"Continuous_label": 0
},
"C": {
"emotion": "happy",
"Confidence_degree": "9",
"Continuous_label": 1
},
"D": {
"emotion": "happy",
"Confidence_degree": "9",
"Continuous_label": 1
},
"E": {
"emotion": "happy",
"Confidence_degree": "7",
"Continuous_label": 1
}
},
"speaker_id": "07",
"emotion_result": "happy",
"content": "哎,发现我有什么变化没有?",
"Continuous label_result": 0.8,
"file_name": "G00002/G00002_01/G00002_01_07/G00002_01_07_001.mp4"
}
Citation information
@article{sun2025emotiontalk,
title={EmotionTalk: An Interactive Chinese Multimodal Emotion Dataset With Rich Annotations},
author={Sun, Haoqin and Wang, Xuechen and Zhao, Jinghua and Zhao, Shiwan and Zhou, Jiaming and Wang, Hui and He, Jiabei and Kong, Aobo and Yang, Xi and Wang, Yequan and others},
journal={arXiv preprint arXiv:2505.23018},
year={2025}
}
Licensing information
CC BY-NC-SA 4.0