Evaluation Dataset
Common Voice (En) + CoVoST 2 (En–De)
Metric: Bilingual Evaluation Understudy (BLEU)
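BLEU scores a hypothesis translation by its n-gram overlap with a reference, combined with a brevity penalty. A minimal sketch of sentence-level BLEU (the corpus-level variant used for reported scores aggregates counts over all sentence pairs; the smoothing here is an illustrative choice):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU with uniform n-gram weights and a brevity penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped matches
        total = max(sum(hyp_ngrams.values()), 1)
        # Add-one smoothing so one empty n-gram order does not zero the score.
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    # Brevity penalty: hypotheses shorter than the reference are discounted.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

For example, `sentence_bleu("a b c d", "a b c d")` returns 1.0, while any shorter or partially matching hypothesis scores strictly lower.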
Adaptation Method
A 3-layer Transformer encoder plus a 3-layer Transformer decoder. The features output by the upstream model are first aggregated by a global average pooling layer and then fed into the encoder. The encoder's input dimension equals the dimension of the feature vectors, and the decoder's output dimension equals the vocabulary size.
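A minimal PyTorch sketch of such an adaptation head. All dimensions, the pooling window, and the class name are illustrative assumptions, not the values used in the actual setup (in particular, pooling is applied here along the time axis to shorten the feature sequence):

```python
import torch
import torch.nn as nn

class AdaptationHead(nn.Module):
    """3-layer Transformer encoder-decoder on top of upstream speech features.

    Hypothetical sketch: feature size, vocabulary size, and pooling window
    are placeholders.
    """
    def __init__(self, feat_dim=768, vocab_size=1000, pool_window=4, n_layers=3):
        super().__init__()
        # Average pooling along the time axis before the encoder.
        self.pool = nn.AvgPool1d(kernel_size=pool_window, stride=pool_window)
        self.transformer = nn.Transformer(
            d_model=feat_dim, num_encoder_layers=n_layers,
            num_decoder_layers=n_layers, batch_first=True)
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.out_proj = nn.Linear(feat_dim, vocab_size)  # logits over the vocabulary

    def forward(self, feats, tgt_tokens):
        # feats: (batch, time, feat_dim) from the upstream model
        pooled = self.pool(feats.transpose(1, 2)).transpose(1, 2)
        dec = self.transformer(pooled, self.embed(tgt_tokens))
        return self.out_proj(dec)  # (batch, tgt_len, vocab_size)
```

A forward pass on a batch of 2 utterances with 64 feature frames and 10 target tokens yields logits of shape (2, 10, 1000).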
Data description
Common Voice is a large, multilingual dataset of transcribed speech used for research and validation of speech technology. Collected and validated through crowdsourcing, it covers 60 languages and contains over 7,327 hours of verified speech data.
CoVoST 2, an expansion of the Common Voice dataset, comprises a corpus of 2,900 hours of speech. It offers translations from English (En) into 15 languages: Arabic (Ar), Catalan (Ca), Welsh (Cy), German (De), Estonian (Et), Persian (Fa), Indonesian (Id), Japanese (Ja), Latvian (Lv), Mongolian (Mn), Slovenian (Sl), Swedish (Sv), Tamil (Ta), Turkish (Tr), and Chinese (Zh). Additionally, it provides translations from 21 languages into English: the 15 languages above plus Spanish (Es), French (Fr), Italian (It), Dutch (Nl), Portuguese (Pt), and Russian (Ru).
Dataset structure
Amount of source data
Training set 429.24h, validation set 26.07h, test set 24.64h
Data detail
train.tsv, dev.tsv, test.tsv
Each TSV file has the columns: id, audio, n_frames, sr, src_text, tgt_text
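A manifest in this layout can be read with Python's csv module; a minimal sketch (the embedded row mirrors the sample entry from this dataset, and `load_manifest` is an illustrative helper name):

```python
import csv
import io

# One illustrative row in the train.tsv/dev.tsv/test.tsv layout.
MANIFEST = (
    "id\taudio\tn_frames\tsr\tsrc_text\ttgt_text\n"
    "common_voice_en_78232\tcommon_voice_en_78232.mp3\t47232\t48000\t"
    "i wish you wouldn't\tIch wünschte, du ließest es bleiben.\n"
)

def load_manifest(fileobj):
    """Parse a tab-separated manifest into a list of dicts, one per utterance."""
    reader = csv.DictReader(fileobj, delimiter="\t")
    rows = []
    for row in reader:
        row["n_frames"] = int(row["n_frames"])  # audio length in frames
        row["sr"] = int(row["sr"])              # sample rate in Hz
        rows.append(row)
    return rows

rows = load_manifest(io.StringIO(MANIFEST))
# Clip duration in seconds: n_frames / sr = 47232 / 48000 = 0.984 s
print(rows[0]["id"], rows[0]["n_frames"] / rows[0]["sr"])
```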
Sample of source dataset
common_voice_en_78232 common_voice_en_78232.mp3 47232 48000 i wish you wouldn't Ich wünschte, du ließest es bleiben.
Citation information
@article{ardila2019common,
  title={Common voice: A massively-multilingual speech corpus},
  author={Ardila, Rosana and Branson, Megan and Davis, Kelly and Henretty, Michael and Kohler, Michael and Meyer, Josh and Morais, Reuben and Saunders, Lindsay and Tyers, Francis M and Weber, Gregor},
  journal={arXiv preprint arXiv:1912.06670},
  year={2019}
}

@article{wang2020covost,
  title={Covost 2 and massively multilingual speech-to-text translation},
  author={Wang, Changhan and Wu, Anne and Pino, Juan},
  journal={arXiv preprint arXiv:2007.10310},
  year={2020}
}
Licensing information
Creative Commons CC0 license