
SER Evaluation Data

The Interactive Emotional Dyadic Motion Capture (IEMOCAP)

Metrics: WAR, UAR
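
WAR (weighted average recall) corresponds to overall recognition accuracy, i.e. per-class recall weighted by class frequency, while UAR (unweighted average recall) is the arithmetic mean of per-class recalls and is insensitive to class imbalance. A minimal sketch of computing both with scikit-learn; the label lists below are illustrative placeholders, not taken from the dataset:

# Sketch: computing WAR and UAR with scikit-learn.
# The y_true / y_pred lists are illustrative, not real predictions.
from sklearn.metrics import accuracy_score, recall_score

y_true = ["neu", "ang", "sad", "hap", "neu", "hap"]
y_pred = ["neu", "ang", "neu", "hap", "sad", "hap"]

war = accuracy_score(y_true, y_pred)                 # weighted average recall (overall accuracy)
uar = recall_score(y_true, y_pred, average="macro")  # unweighted average recall (mean per-class recall)

print(f"WAR = {war:.3f}, UAR = {uar:.3f}")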

Adaptation Method

Linear Classifier. The features output by the upstream model are first aggregated by a global average pooling layer and then fed into a linear classifier consisting of a single fully connected layer. The classifier's input dimension equals the dimension of the pooled feature vector, and its output dimension equals the number of emotion classes.
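
A minimal PyTorch sketch of this adaptation head, assuming the upstream model emits a (batch, time, feature_dim) sequence of frame-level features; the class name and the dimensions in the example are illustrative, not taken from the benchmark:

# Sketch of the adaptation head: global average pooling over the time axis
# followed by a single fully connected layer (feature_dim -> num_classes).
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    def __init__(self, feature_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, feature_dim) from the upstream model
        pooled = features.mean(dim=1)      # global average pooling over time
        return self.classifier(pooled)     # (batch, num_classes) logits

# Example: 768-dim upstream features, 4 emotion classes (neu / ang / sad / hap).
probe = LinearProbe(feature_dim=768, num_classes=4)
logits = probe(torch.randn(2, 100, 768))
print(logits.shape)  # torch.Size([2, 4])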

Data description

The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database is an acted, multimodal, and multispeaker emotion database. It contains approximately 12 hours of audio-visual data, including video, audio, facial motion capture, and text transcriptions. The recordings consist of dyadic conversational sessions in which actors perform improvised or scripted scenarios.

Dataset structure

Amount of source data

The source data are distributed across the following emotion categories: Neutral (1,708), Angry (1,103), Sad (1,084), Happy (595), Excited (1,041), Scared (40), Surprised (107), Frustrated (1,849), and Other (2,507).

Amount of evaluation data

The evaluation set consists of 5,531 instances from the source dataset covering four categories (Neutral, Angry, Sad, Happy), with samples from the Excited category merged into Happy.
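
A sketch of how such a four-class subset can be assembled, assuming label abbreviations like those in the sample below ("neu", "ang", "sad", "hap", plus "exc" for Excited); apart from "neu", which appears in the documented sample, the label strings and the extra record ids are assumptions:

# Sketch: keep the four evaluation classes and merge Excited into Happy.
# Label abbreviations other than "neu" are assumed here.
LABEL_MAP = {
    "neu": "neu",
    "ang": "ang",
    "sad": "sad",
    "hap": "hap",
    "exc": "hap",  # Excited is merged with Happy
}

def build_eval_subset(samples):
    """Keep samples whose label maps to one of the four evaluation classes."""
    subset = []
    for sample in samples:
        mapped = LABEL_MAP.get(sample["label"])
        if mapped is not None:
            subset.append({**sample, "label": mapped})
    return subset

# Illustrative records shaped like the dataset sample below; only the first
# id comes from the documentation, the others are made up.
source = [
    {"id": "Ses01F_impro01_F000", "sentence": "Excuse me.", "label": "neu"},
    {"id": "example_exc", "sentence": "Great!", "label": "exc"},
    {"id": "example_fru", "sentence": "Why?", "label": "fru"},
]
print(build_eval_subset(source))  # the "fru" record is dropped, "exc" becomes "hap"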

Data detail

KEYS        EXPLAIN
id          data id
sentence    the content of the speech
label       emotion label

Sample of dataset

{
  "id": Ses01F_impro01_F000,
  "sentence": "Excuse me.",
  "label": "neu"
}

Citation information

@article{busso2008iemocap,
  title={IEMOCAP: Interactive emotional dyadic motion capture database},
  author={Busso, Carlos and Bulut, Murtaza and Lee, Chi-Chun and Kazemzadeh, Abe and Mower, Emily and Kim, Samuel and Chang, Jeannette N and Lee, Sungbok and Narayanan, Shrikanth S},
  journal={Language resources and evaluation},
  volume={42},
  pages={335--359},
  year={2008},
  publisher={Springer}
}

Licensing information

IEMOCAP License

MSP-IMPROV

Metrics: WAR, UAR

Adaptation Method

Linear Classifier. The features output by the upstream model are first aggregated by a global average pooling layer and then fed into a linear classifier consisting of a single fully connected layer. The classifier's input dimension equals the dimension of the pooled feature vector, and its output dimension equals the number of emotion classes.

Data description

The MSP-IMPROV database is an acted, multimodal, and multispeaker emotion database. It is constructed similarly to IEMOCAP but with 12 actors recorded across six sessions.

Dataset structure

Amount of evaluation data

The evaluation set consists of 7,798 instances covering four categories (Neutral, Angry, Sad, Happy).

Data detail

KEYS        EXPLAIN
id          data id
sentence    the content of the speech
label       emotion label

Sample of dataset

{
  "id": MSP-IMPROV-S01A-F01-P-FM01,
  "sentence": "I have to go to class. How can I not? Okay.",
  "label": "ang"
}

Citation information

@article{busso2016msp,
  title={MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception},
  author={Busso, Carlos and Parthasarathy, Srinivas and Burmania, Alec and AbdelWahab, Mohammed and Sadoughi, Najmeh and Provost, Emily Mower},
  journal={IEEE Transactions on Affective Computing},
  volume={8},
  number={1},
  pages={67--80},
  year={2016},
  publisher={IEEE}
}

Licensing information

MSP-IMPROV License

EmotionTalk

Metrics: WAR, UAR

Adaptation Method

Linear Classifier. The features output by the upstream model are first aggregated by a global average pooling layer and then fed into a linear classifier consisting of a single fully connected layer. The classifier's input dimension equals the dimension of the pooled feature vector, and its output dimension equals the number of emotion classes.

Data description

EmotionTalk is an interactive Chinese multimodal emotion dataset with rich annotations, released by Nankai University. It provides multimodal information from 19 actors participating in dyadic conversation settings, covering acoustic, visual, and textual modalities. It includes 23.6 hours of speech (19,250 utterances), annotations for 7 utterance-level emotion categories (happy, surprise, sad, disgust, anger, fear, and neutral), 5-dimensional sentiment labels (negative, weakly negative, neutral, weakly positive, and positive), and 4-dimensional speech captions (speaker, speaking style, emotion, and overall).

Dataset Composition and Specifications

Source Data Volume

Training set: 15,413
Validation set: 1,908
Test set: 1,929

Evaluation Data Volume

Public test set: 1,929 utterances

Data Fields

Includes video and audio files, sentence transcriptions, and discrete, continuous, and caption emotion annotations, organized as follows:

data/  
├── audio/*.tar  
├── Text/*.tar  
├── Video/*.tar  
└── Multimodal/*.tar
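
A small sketch of unpacking these archives with the Python standard library, assuming they are plain .tar files laid out exactly as shown above (the paths and extraction directory are assumptions):

# Sketch: extract the audio archives and list a few members from each.
# Assumes plain .tar files under data/audio/ as in the layout above.
import glob
import tarfile

for archive_path in glob.glob("data/audio/*.tar"):
    with tarfile.open(archive_path) as archive:
        members = archive.getnames()
        archive.extractall("data/audio_extracted")
        print(archive_path, "->", len(members), "files, e.g.", members[:3])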

Dataset Example

{
    "data": {
        "A": {
            "emotion": "happy",
            "Confidence_degree": "9",
            "Continuous_label": 1
        },
        "B": {
            "emotion": "happy",
            "Confidence_degree": "9",
            "Continuous_label": 0
        },
        "C": {
            "emotion": "happy",
            "Confidence_degree": "9",
            "Continuous_label": 1
        },
        "D": {
            "emotion": "happy",
            "Confidence_degree": "9",
            "Continuous_label": 1
        },
        "E": {
            "emotion": "happy",
            "Confidence_degree": "7",
            "Continuous_label": 1
        }
    },
    "speaker_id": "07",
    "emotion_result": "happy",
    "content": "哎,发现我有什么变化没有?",
    "Continuous label_result": 0.8,
    "file_name": "G00002/G00002_01/G00002_01_07/G00002_01_07_001.mp4"
}
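
In the example above, emotion_result agrees with the majority vote of the five annotators (A through E), and Continuous label_result equals the mean of their Continuous_label values. A sketch of recomputing both from a loaded record, under the assumption that this aggregation rule holds in general:

# Sketch: aggregate the per-annotator fields of an EmotionTalk record.
# Assumes (as in the example above) that emotion_result is the majority vote
# and Continuous label_result is the mean of the Continuous_label values.
from collections import Counter

record = {
    "data": {
        "A": {"emotion": "happy", "Confidence_degree": "9", "Continuous_label": 1},
        "B": {"emotion": "happy", "Confidence_degree": "9", "Continuous_label": 0},
        "C": {"emotion": "happy", "Confidence_degree": "9", "Continuous_label": 1},
        "D": {"emotion": "happy", "Confidence_degree": "9", "Continuous_label": 1},
        "E": {"emotion": "happy", "Confidence_degree": "7", "Continuous_label": 1},
    },
}

votes = Counter(annotator["emotion"] for annotator in record["data"].values())
majority_emotion = votes.most_common(1)[0][0]
mean_continuous = sum(a["Continuous_label"] for a in record["data"].values()) / len(record["data"])

print(majority_emotion)  # "happy", matching emotion_result in the example
print(mean_continuous)   # 0.8, matching "Continuous label_result"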

Citation

@article{sun2025emotiontalk,
  title={EmotionTalk: An Interactive Chinese Multimodal Emotion Dataset With Rich Annotations},
  author={Sun, Haoqin and Wang, Xuechen and Zhao, Jinghua and Zhao, Shiwan and Zhou, Jiaming and Wang, Hui and He, Jiabei and Kong, Aobo and Yang, Xi and Wang, Yequan and others},
  journal={arXiv preprint arXiv:2505.23018},
  year={2025}
}

License

CC BY-NC-SA 4.0