Evaluation Dataset
Common Voice-en2 (en-de) + CoVoST 2
Metric: Bilingual Evaluation Understudy (BLEU)
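For orientation, the idea behind BLEU can be sketched in a few lines of Python. This is a minimal, unsmoothed corpus-level version that assumes whitespace tokenization and a single reference per hypothesis. Those choices, and the function names `ngrams` and `corpus_bleu`, are illustrative assumptions and not the scoring setup used for this dataset. Reported scores normally come from a standard tool with its own tokenization, so numbers from this sketch are not directly comparable.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU (0 to 100), one reference per hypothesis.

    Assumption: plain whitespace tokenization, which differs from the
    tokenizers used by standard BLEU tools.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: a hypothesis n-gram is credited at most as
            # often as it occurs in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


score = corpus_bleu(["the cat sat on the mat"], ["the cat sat on a mat"])
print(f"BLEU = {score:.2f}")
```

Because the sketch applies no smoothing, a short test set with no matching 4-grams scores 0, which is one reason BLEU is usually reported at corpus level.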
Data description:
Common Voice is a large multilingual dataset of transcribed speech for research on and validation of speech technology. It is collected and validated through crowdsourcing, covers 60 languages, and contains more than 7,327 hours of validated speech.
CoVoST 2 extends the Common Voice dataset with translations and comprises 2,900 hours of speech. It covers translation from English (En) into 15 languages: Arabic (Ar), Catalan (Ca), Welsh (Cy), German (De), Estonian (Et), Persian (Fa), Indonesian (Id), Japanese (Ja), Latvian (Lv), Mongolian (Mn), Slovenian (Sl), Swedish (Sv), Tamil (Ta), Turkish (Tr), and Chinese (Zh). It also covers translation from 21 languages into English: the same 15 languages plus Spanish (Es), French (Fr), Italian (It), Dutch (Nl), Portuguese (Pt), and Russian (Ru).
Dataset structure:
Amount of source data:
Training set: 429.24 h; validation set: 26.07 h; test set: 24.64 h
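As a quick sanity check on the split sizes, the total duration and the share of each split can be worked out directly from the three figures above. The dictionary keys are illustrative names, not identifiers from the dataset.

```python
# Split durations in hours, copied from the figures above.
split_hours = {"train": 429.24, "validation": 26.07, "test": 24.64}

total_hours = sum(split_hours.values())
# Fraction of the total audio that falls into each split.
shares = {name: hours / total_hours for name, hours in split_hours.items()}

print(f"total: {total_hours:.2f} h")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The splits add up to 479.95 h, with roughly 89.4% in training, 5.4% in validation, and 5.1% in test.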
Data detail:
train.tsv, dev.tsv, test.tsv
id audio n_frames sr src_text tgt_text
Sample of source dataset:
id: common_voice_en_78232
audio: common_voice_en_78232.mp3
n_frames: 47232
sr: 48000
src_text: i wish you wouldn't
tgt_text: Ich wünschte, du ließest es bleiben.
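The manifest files can be read with Python's standard `csv` module. The sketch below assumes that the files are tab-separated with a header line holding the six column names listed above, and that `n_frames` counts audio samples at the sampling rate `sr`, so that `n_frames / sr` gives the clip duration in seconds. Neither assumption is stated in this card. The helper name `read_manifest` is hypothetical, and an in-memory string stands in for `train.tsv` so the example runs without the data.

```python
import csv
import io

# Column order taken from the header line in this card.
COLUMNS = ["id", "audio", "n_frames", "sr", "src_text", "tgt_text"]


def read_manifest(handle):
    """Yield one dict per row of a tab-separated manifest with a header line."""
    # QUOTE_NONE keeps quotation marks inside the text fields untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        row["n_frames"] = int(row["n_frames"])
        row["sr"] = int(row["sr"])
        # Assumption: n_frames is a sample count, so this is seconds of audio.
        row["duration_s"] = row["n_frames"] / row["sr"]
        yield row


# In-memory stand-in for train.tsv, built from the sample row in this card.
sample_row = [
    "common_voice_en_78232",
    "common_voice_en_78232.mp3",
    "47232",
    "48000",
    "i wish you wouldn't",
    "Ich wünschte, du ließest es bleiben.",
]
sample_tsv = "\t".join(COLUMNS) + "\n" + "\t".join(sample_row) + "\n"

rows = list(read_manifest(io.StringIO(sample_tsv)))
for row in rows:
    print(row["id"], f"{row['duration_s']:.3f} s", row["tgt_text"])
```

For a real file, the same function could be called on `open("train.tsv", encoding="utf-8", newline="")`. Under the sample-count assumption, the sample row corresponds to 0.984 s of audio.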
Citation information:
@article{ardila2019common,
  title={Common voice: A massively-multilingual speech corpus},
  author={Ardila, Rosana and Branson, Megan and Davis, Kelly and Henretty, Michael and Kohler, Michael and Meyer, Josh and Morais, Reuben and Saunders, Lindsay and Tyers, Francis M and Weber, Gregor},
  journal={arXiv preprint arXiv:1912.06670},
  year={2019}
}

@article{wang2020covost,
  title={Covost 2 and massively multilingual speech-to-text translation},
  author={Wang, Changhan and Wu, Anne and Pino, Juan},
  journal={arXiv preprint arXiv:2007.10310},
  year={2020}
}
Licensing information:
Creative Commons CC0 license