ASR Evaluation Data
Librispeech
# Word Error Rate (WER)
Dataset Description
Librispeech is a corpus of approximately 1000 hours of 16kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from audiobooks from the LibriVox project, carefully segmented and aligned.
Source Data Volume
Training data: 960h (train-clean-100, train-clean-360, train-other-500)
Validation sets: dev-clean, dev-other
Test sets: test-clean, test-other
Evaluation Data Volume
test-clean: 5.4h, 2620 utterances
test-other: 5.1h, 2939 utterances
Data Fields
The dataset is organized by audio IDs. For example, the audio file 19-198-0001.flac corresponds to the transcript in 19-198.trans.txt, with fields: wav_id text
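The mapping between an audio ID and its transcript file can be sketched as follows. This is a minimal illustration that assumes every ID has the three-part form seen in the example (19-198-0001); the function names are hypothetical and not part of any LibriSpeech tooling.

```python
def parse_trans_line(line):
    """Split one '<wav_id> <text>' transcript line into its two fields."""
    wav_id, _, text = line.strip().partition(" ")
    return wav_id, text


def transcript_name(wav_id):
    """Name of the transcript file for an ID such as '19-198-0001'.

    Assumes the ID has exactly three dash-separated parts and that the
    transcript file is named after the first two, as in the example above.
    """
    first, second, _ = wav_id.split("-")
    return f"{first}-{second}.trans.txt"


def audio_name(wav_id):
    """Name of the audio file for a given ID."""
    return f"{wav_id}.flac"
```

Given a line from a transcript file, `parse_trans_line` returns the ID and the text, and `transcript_name` recovers which file the line should live in.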
Dataset Example
78-368-0000 CHAPTER TWENTY THREE IT WAS EIGHT O'CLOCK WHEN WE LANDED WE WALKED FOR A SHORT TIME ON THE SHORE ENJOYING THE TRANSITORY LIGHT AND THEN RETIRED TO THE INN
Evaluation Metric
Word Error Rate (WER)
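WER is the word-level edit distance (substitutions, deletions, and insertions) between hypothesis and reference, divided by the number of reference words. The sketch below is one straightforward way to compute it for a single utterance. It assumes whitespace tokenization and applies no text normalization, which may differ from the scoring setup actually used for this benchmark.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference token
                            curr[j - 1] + 1,      # insertion of a hypothesis token
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate of one hypothesis against one reference."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.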
Citation
@inproceedings{panayotov2015librispeech, title={Librispeech: an asr corpus based on public domain audio books}, author={Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev}, booktitle={2015 IEEE international conference on acoustics, speech and signal processing (ICASSP)}, pages={5206--5210}, year={2015}, organization={IEEE} }
License
CC BY 4.0
AISHELL-1
# Character Error Rate (CER)
Dataset Description
AISHELL-1 is a 178-hour open-source Mandarin speech dataset released by Beijing Shell Shell. It is one of the most widely used Mandarin speech corpora. A total of 400 speakers participated in the recording. The audio was recorded in a quiet indoor environment with high-fidelity microphones and downsampled to 16kHz. The transcription accuracy exceeds 95%.
Dataset Composition and Specifications
Source Data Volume
Training set: 150h
Validation set: 10h
Test set: 5h
Evaluation Data Volume
Public test set: 5h, 7176 utterances
Data Fields
Each of the training, validation, and test sets contains two files: wav.scp and text.
wav.scp: wav_id wav_path
text: wav_id text
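Since both files are keyed by wav_id, they can be joined into a single manifest. The following is a minimal sketch, assuming each line holds an ID followed by a single space-separated value; the helper names are hypothetical.

```python
def read_kv(lines):
    """Parse '<wav_id> <value>' lines into a dict, skipping blank lines."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        wav_id, _, value = line.partition(" ")
        table[wav_id] = value.strip()
    return table


def join_manifest(wav_scp_lines, text_lines):
    """Pair each wav_path with its transcript.

    Only IDs present in both files are kept; the order follows wav.scp.
    """
    paths = read_kv(wav_scp_lines)
    texts = read_kv(text_lines)
    return [(wav_id, paths[wav_id], texts[wav_id])
            for wav_id in paths if wav_id in texts]
```

Both arguments accept any iterable of lines, so open file handles for wav.scp and text can be passed in directly.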
Dataset Example
wav.scp:
BAC009S0002W0122 /mnt/sda/jiaming_space/datasets/aishell/data_aishell/wav/train/S0002/BAC009S0002W0122.wav
text:
BAC009S0002W0122 而对楼市成交抑制作用最大的限购
Evaluation Metric
Character Error Rate (CER)
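CER applies the same edit-distance idea at the character level, which suits Mandarin text that has no word boundaries. The sketch below aggregates over a whole test set by dividing total character errors by total reference characters. That aggregation, the removal of whitespace, and the treatment of a missing hypothesis as an empty string are all assumptions here, not something the dataset description specifies.

```python
def char_edit_distance(ref, hyp):
    """Levenshtein distance between two strings, character by character."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_cer(references, hypotheses):
    """Corpus-level CER over dicts that map wav_id to text."""
    errors = 0
    total = 0
    for wav_id, ref in references.items():
        ref_chars = "".join(ref.split())
        # A missing hypothesis is scored as empty (all deletions).
        hyp_chars = "".join(hypotheses.get(wav_id, "").split())
        errors += char_edit_distance(ref_chars, hyp_chars)
        total += len(ref_chars)
    if total == 0:
        raise ValueError("references contain no characters")
    return errors / total
```

Summing errors before dividing weights long utterances more heavily than averaging per-utterance rates would.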
Citation
@inproceedings{bu2017aishell, title={Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline}, author={Bu, Hui and Du, Jiayu and Na, Xingyu and Wu, Bengu and Zheng, Hao}, booktitle={2017 20th conference of the oriental chapter of the international coordinating committee on speech databases and speech I/O systems and assessment (O-COCOSDA)}, pages={1--5}, year={2017}, organization={IEEE} }
License
Apache License v.2.0
ChildMandarin
# Character Error Rate (CER)
Dataset Description
ChildMandarin is an open-source Mandarin speech dataset of young children (ages 3–5), jointly released by Nankai University and the Beijing Academy of Artificial Intelligence (BAAI). It contains 41.25 hours of speech from 397 child speakers across 22 provincial-level regions in China.
Dataset Composition and Specifications
Source Data Volume
Training set: 33.35h
Validation set: 3.78h
Test set: 4.12h
Evaluation Data Volume
Public test set: 4.12h, 4198 utterances
Data Fields
Each of the training, validation, and test sets contains two files: wav.scp and text.
wav.scp: wav_id wav_path
text: wav_id text
Dataset Example
text:
./data/148/148_5_F_L_ZIBO_Android_021.pcm 小鱼跳出水面没地方游泳了。
./data/148/148_5_F_L_ZIBO_Android_071.pcm 我很乖,我没有哭。
./data/148/148_5_F_L_ZIBO_Android_088.pcm 我喜欢画画想当画家。
Evaluation Metric
Character Error Rate (CER)
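The ChildMandarin example transcripts contain punctuation such as 。 and commas. This description does not say whether punctuation is removed before CER is computed. If it were, one possible normalization step could look like the sketch below, which relies on Unicode character categories; the function name is hypothetical.

```python
import unicodedata


def strip_punctuation(text):
    """Drop whitespace and any character in a Unicode punctuation category (P*)."""
    return "".join(
        ch for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )
```

Applying the same normalization to both reference and hypothesis keeps punctuation differences from being counted as character errors.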
Citation
@inproceedings{zhou-etal-2025-childmandarin,
title = "{C}hild{M}andarin: A Comprehensive {M}andarin Speech Dataset for Young Children Aged 3-5",
author = "Zhou, Jiaming and
Wang, Shiyao and
Zhao, Shiwan and
He, Jiabei and
Sun, Haoqin and
Wang, Hui and
Liu, Cheng and
Kong, Aobo and
Guo, Yujie and
Yang, Xi and
Wang, Yequan and
Lin, Yonghua and
Qin, Yong",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.614/",
doi = "10.18653/v1/2025.acl-long.614",
pages = "12524--12537",
ISBN = "979-8-89176-251-0",
}
License
CC BY-NC-SA 4.0
SeniorTalk
# Character Error Rate (CER)
Dataset Description
SeniorTalk is a comprehensive open-source Mandarin speech dataset specifically targeting elderly speakers aged 75–85. It was released by Nankai University to address the severe lack of publicly available resources for this demographic, aiming to advance research in Automatic Speech Recognition (ASR) and related fields.
Dataset Composition and Specifications
Source Data Volume
Training set: 29.95h
Validation set: 4.09h
Test set: 3.77h
Evaluation Data Volume
Public test set: 3.77h, 5869 utterances
Data Fields
Includes audio files, sentence transcriptions, and speaker annotations.
sentence_data/
├── wav
│ ├── train/*.tar
│ ├── dev/*.tar
│ └── test/*.tar
└── transcript/*.txt
UTTERANCEINFO.txt # annotation of topics and duration
SPKINFO.txt # annotation of location, age, gender, and device
Dataset Example
Elderly0122S0001W0003.wav 找个有趣的地方玩一玩。
Elderly0122S0001W0005.wav 朝阳公园吧。
Elderly0122S0001W0010.wav 又遮阳光。
Elderly0122S0001W0016.wav 现在。
Elderly0122S0001W0023.wav 就是不太新鲜。
Elderly0122S0001W0026.wav 要新鲜一点的。
Elderly0122S0001W0027.wav 吃了才有营养。
Elderly0122S0001W0029.wav 好。
Evaluation Metric
Character Error Rate (CER)
Citation
@misc{chen2025seniortalkchineseconversationdataset,
title={SeniorTalk: A Chinese Conversation Dataset with Rich Annotations for Super-Aged Seniors},
author={Yang Chen and Hui Wang and Shiyao Wang and Junyang Chen and Jiabei He and Jiaming Zhou and Xi Yang and Yequan Wang and Yonghua Lin and Yong Qin},
year={2025},
eprint={2503.16578},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2503.16578},
}
License
CC BY-NC-SA 4.0