ASR Evaluation Datasets
LibriSpeech
#Word Error Rate (WER)
Data Description
LibriSpeech is a corpus of approximately 1,000 hours of 16 kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is drawn from LibriVox audiobooks and has been carefully segmented and aligned.
Source Data Size
Training data totals 960 h: train-clean-100, train-clean-360, train-other-500
There are two pairs of validation and test sets: dev-clean / test-clean and dev-other / test-other
Evaluation Data Size:
test-clean: 5.4 h, 2,620 utterances; test-other: 5.1 h, 2,939 utterances
Data Fields
LibriSpeech is organized by audio ID. For example, the transcript of 19-198-0001.flac is stored in 19-198.trans.txt, with the fields: wav_id text
Dataset Sample
78-368-0000 CHAPTER TWENTY THREE IT WAS EIGHT O'CLOCK WHEN WE LANDED WE WALKED FOR A SHORT TIME ON THE SHORE ENJOYING THE TRANSITORY LIGHT AND THEN RETIRED TO THE INN
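The per-chapter transcript files can be read with a short helper. A minimal sketch in Python (the function name `load_transcripts` is ours, not part of any official tooling):

```python
def load_transcripts(trans_path):
    """Parse a LibriSpeech *.trans.txt file into {utterance_id: transcript}."""
    mapping = {}
    with open(trans_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Each line is "<utt_id> <TRANSCRIPT>"; split on the first space only.
            utt_id, text = line.split(" ", 1)
            mapping[utt_id] = text
    return mapping
```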
Evaluation Metric
Word Error Rate (WER)
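WER is the edit (Levenshtein) distance between the reference and hypothesis word sequences, divided by the number of reference words. A minimal dynamic-programming sketch:

```python
def wer(ref, hyp):
    """Word error rate: (S + D + I) / N, via Levenshtein distance over words."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(r)][len(h)] / len(r)
```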
Citation
@inproceedings{panayotov2015librispeech,
  title={Librispeech: an {ASR} corpus based on public domain audio books},
  author={Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev},
  booktitle={2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={5206--5210},
  year={2015},
  organization={IEEE}
}
Dataset License:
CC BY 4.0
AISHELL-1
#Character Error Rate (CER)
Data Description
AISHELL-1 is a 178-hour Mandarin Chinese corpus open-sourced by AISHELL (Beijing Shell Shell Technology) and is among the most widely used Chinese speech datasets. 400 speakers took part in the recording. It was recorded in a quiet indoor environment with high-fidelity microphones and downsampled to 16 kHz; manual transcription accuracy exceeds 95%.
Dataset Composition and Specification
Source Data Size
Training set: 150 h; validation set: 10 h; test set: 5 h
Evaluation Data Size
The evaluation data is the public 5 h test set, comprising 7,176 utterances
Data Fields
The training, validation, and test sets each contain two files: wav.scp and text
wav.scp: wav_id wav_path
text: wav_id text
Dataset Sample
wav.scp:
BAC009S0002W0122 /mnt/sda/jiaming_space/datasets/aishell/data_aishell/wav/train/S0002/BAC009S0002W0122.wav
text:
BAC009S0002W0122 而对楼市成交抑制作用最大的限购
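The wav.scp and text files follow the Kaldi two-column convention: an utterance ID, whitespace, then the value. A small helper (function names are ours) to read them and pair audio paths with transcripts:

```python
def read_kaldi_map(path):
    """Read a Kaldi-style two-column file (wav.scp or text) into {key: value}."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the first run of whitespace; the value may contain spaces.
            key, value = line.split(maxsplit=1)
            table[key] = value
    return table

def paired_utterances(wav_scp, text_file):
    """Return {utt_id: (wav_path, transcript)} for IDs present in both files."""
    wavs = read_kaldi_map(wav_scp)
    texts = read_kaldi_map(text_file)
    return {uid: (wavs[uid], texts[uid]) for uid in wavs.keys() & texts.keys()}
```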
Evaluation Metric
Character Error Rate (CER)
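CER is computed like WER but over characters rather than words, which suits Mandarin's lack of explicit word boundaries. A minimal sketch using a rolling-row Levenshtein distance (stripping whitespace first is a common scoring convention, not mandated by the dataset):

```python
def cer(ref, hyp):
    """Character error rate: edit distance over characters / reference length."""
    r = ref.replace(" ", "")
    h = hyp.replace(" ", "")
    # prev[j] holds the edit distance between r[:i-1] and h[:j].
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(r, 1):
        cur = [i]
        for j, hc in enumerate(h, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (rc != hc)))  # substitution / match
        prev = cur
    return prev[len(h)] / len(r)
```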
Citation
@inproceedings{bu2017aishell,
  title={{AISHELL}-1: An open-source {M}andarin speech corpus and a speech recognition baseline},
  author={Bu, Hui and Du, Jiayu and Na, Xingyu and Wu, Bengu and Zheng, Hao},
  booktitle={2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)},
  pages={1--5},
  year={2017},
  organization={IEEE}
}
Open-Source License
Apache License v.2.0
ChildMandarin
#Character Error Rate (CER)
Data Description
ChildMandarin is a dataset of speech from children aged 3-5, jointly open-sourced by Nankai University and the Beijing Academy of Artificial Intelligence (BAAI). It comprises 41.25 hours of speech from 397 children across 22 provincial-level administrative regions of China.
Dataset Composition and Specification
Source Data Size
Training set: 33.35 h; validation set: 3.78 h; test set: 4.12 h
Evaluation Data Size
The evaluation data is the public 4.12 h test set, comprising 4,198 utterances
Data Fields
The training, validation, and test sets each contain two files: wav.scp and text
wav.scp: wav_id wav_path
text: wav_id text
Dataset Sample
text:
./data/148/148_5_F_L_ZIBO_Android_021.pcm 小鱼跳出水面没地方游泳了。
./data/148/148_5_F_L_ZIBO_Android_071.pcm 我很乖,我没有哭。
./data/148/148_5_F_L_ZIBO_Android_088.pcm 我喜欢画画想当画家。
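Note that the audio files are raw .pcm rather than .wav. Assuming headerless 16-bit little-endian mono PCM at 16 kHz (a common format for such corpora, but an assumption here; verify against the official release), the samples can be loaded with the standard library alone:

```python
import array
import sys

def read_pcm16(path, sample_rate=16000):
    """Load a headerless 16-bit little-endian mono PCM file into a list of
    floats in [-1.0, 1.0). The format is assumed, not stated by the dataset."""
    samples = array.array("h")  # signed 16-bit integers
    with open(path, "rb") as f:
        samples.frombytes(f.read())
    if sys.byteorder == "big":
        samples.byteswap()  # file is little-endian; fix on big-endian hosts
    return [s / 32768.0 for s in samples], sample_rate
```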
Evaluation Metric
Character Error Rate (CER)
Citation
@inproceedings{zhou-etal-2025-childmandarin,
title = "{C}hild{M}andarin: A Comprehensive {M}andarin Speech Dataset for Young Children Aged 3-5",
author = "Zhou, Jiaming and
Wang, Shiyao and
Zhao, Shiwan and
He, Jiabei and
Sun, Haoqin and
Wang, Hui and
Liu, Cheng and
Kong, Aobo and
Guo, Yujie and
Yang, Xi and
Wang, Yequan and
Lin, Yonghua and
Qin, Yong",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.614/",
doi = "10.18653/v1/2025.acl-long.614",
pages = "12524--12537",
ISBN = "979-8-89176-251-0",
}
Open-Source License
CC BY-NC-SA 4.0 license
SeniorTalk
#Character Error Rate (CER)
Data Description
SeniorTalk is a comprehensive, open-source Mandarin speech dataset released by Nankai University for elderly speakers aged 75 to 85. It targets this age group specifically to address the severe shortage of publicly available resources, advancing research in automatic speech recognition (ASR) and related fields.
Dataset Composition and Specification
Source Data Size
Training set: 29.95 h; validation set: 4.09 h; test set: 3.77 h
Evaluation Data Size
The evaluation data is the public 3.77 h test set, comprising 5,869 utterances
Data Fields
Audio and sentence-level transcripts, plus speaker attribute annotations
sentence_data/
├── wav
│ ├── train/*.tar
│ ├── dev/*.tar
│ └── test/*.tar
└── transcript/*.txt
UTTERANCEINFO.txt  # annotation of topics and duration
SPKINFO.txt        # annotation of location, age, gender and device
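The per-split audio ships as multiple .tar archives. A minimal stdlib-only sketch (assuming the sentence_data layout above; the function name is ours) that unpacks one split into a working directory:

```python
import tarfile
from pathlib import Path

def extract_split(split_dir, out_dir):
    """Unpack every *.tar archive of one split (train/dev/test) into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for archive in sorted(Path(split_dir).glob("*.tar")):
        with tarfile.open(archive) as tf:
            tf.extractall(out)  # members keep their in-archive paths
```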
Dataset Sample
Elderly0122S0001W0003.wav 找个有趣的地方玩一玩。
Elderly0122S0001W0005.wav 朝阳公园吧。
Elderly0122S0001W0010.wav 又遮阳光。
Elderly0122S0001W0016.wav 现在。
Elderly0122S0001W0023.wav 就是不太新鲜。
Elderly0122S0001W0026.wav 要新鲜一点的。
Elderly0122S0001W0027.wav 吃了才有营养。
Elderly0122S0001W0029.wav 好。
Evaluation Metric
Character Error Rate (CER)
Citation
@misc{chen2025seniortalkchineseconversationdataset,
title={SeniorTalk: A Chinese Conversation Dataset with Rich Annotations for Super-Aged Seniors},
author={Yang Chen and Hui Wang and Shiyao Wang and Junyang Chen and Jiabei He and Jiaming Zhou and Xi Yang and Yequan Wang and Yonghua Lin and Yong Qin},
year={2025},
eprint={2503.16578},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2503.16578},
}
Open-Source License
CC BY-NC-SA 4.0 license