Evaluation Dataset

Common Voice-en2 (en-de) + CoVoST 2

Evaluation metric: Bilingual Evaluation Understudy (BLEU)

Data description:

Common Voice is a large multilingual dataset of transcribed speech, used for research and validation of speech technology. It is collected and validated through crowdsourcing, covers 60 languages, and contains more than 7,327 hours of validated speech.

CoVoST 2, an expansion of the Common Voice dataset, is a speech-to-text translation corpus of 2,900 hours of speech. It offers translations from English (En) into 15 languages: Arabic (Ar), Catalan (Ca), Welsh (Cy), German (De), Estonian (Et), Persian (Fa), Indonesian (Id), Japanese (Ja), Latvian (Lv), Mongolian (Mn), Slovenian (Sl), Swedish (Sv), Tamil (Ta), Turkish (Tr), and Chinese (Zh). It also provides translations from 21 languages into English: the 15 languages listed above plus Spanish (Es), French (Fr), Italian (It), Dutch (Nl), Portuguese (Pt), and Russian (Ru).

Dataset structure:

Amount of source data:

Training set: 429.24 h; validation set: 26.07 h; test set: 24.64 h
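
As a quick sanity check on the split sizes, the sketch below sums the three figures quoted above and derives each split's share of the total. The dictionary keys (`train`, `dev`, `test`) are illustrative labels chosen to match the file names; only the hour values come from this card.

```python
# Hours per split, copied from the figures quoted above.
SPLIT_HOURS = {"train": 429.24, "dev": 26.07, "test": 24.64}


def split_shares(hours_by_split):
    """Return (total_hours, {split: fraction_of_total})."""
    total = sum(hours_by_split.values())
    shares = {name: hours / total for name, hours in hours_by_split.items()}
    return total, shares


total_hours, shares = split_shares(SPLIT_HOURS)
print(f"total: {total_hours:.2f} h")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The total comes to 479.95 h, with the training set accounting for roughly nine tenths of it.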

Data detail:

Files: train.tsv, dev.tsv, test.tsv

Columns (tab-separated): id, audio, n_frames, sr, src_text, tgt_text

Sample of source dataset:

common_voice_en_78232	common_voice_en_78232.mp3	47232	48000	i wish you wouldn't	Ich wünschte, du ließest es bleiben.
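
A minimal sketch of how a row in this layout could be parsed, using only the Python standard library. It reads the sample row above from an in-memory string; a real run would pass an open handle to train.tsv instead. Two points are assumptions rather than facts stated in this card: that `n_frames` counts audio samples at the rate given by `sr` (so their ratio is the clip duration), and whether the files start with a header row (exposed here as the hypothetical `has_header` flag). The helper names `read_rows` and `duration_seconds` are likewise illustrative.

```python
import csv
import io

# Column names as listed in the data detail section.
COLUMNS = ["id", "audio", "n_frames", "sr", "src_text", "tgt_text"]

# The sample row from this card, tab-separated.
SAMPLE = (
    "common_voice_en_78232\tcommon_voice_en_78232.mp3\t47232\t48000\t"
    "i wish you wouldn't\tIch wünschte, du ließest es bleiben.\n"
)


def read_rows(handle, has_header=False):
    """Yield one dict per TSV row, converting the numeric fields to int.

    has_header is an assumption knob: set it to True if the file's first
    line holds column names rather than data.
    """
    # QUOTE_NONE keeps quote characters inside sentences as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)
    for fields in reader:
        row = dict(zip(COLUMNS, fields))
        row["n_frames"] = int(row["n_frames"])
        row["sr"] = int(row["sr"])
        yield row


def duration_seconds(row):
    # Assumption: n_frames is a sample count at sampling rate sr.
    return row["n_frames"] / row["sr"]


rows = list(read_rows(io.StringIO(SAMPLE)))
first = rows[0]
print(first["src_text"], "->", first["tgt_text"])
print(f"{duration_seconds(first):.3f} s")
```

Under the sample-count assumption, the row above corresponds to a clip of 47232 / 48000 = 0.984 seconds.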

Citation information:

@article{ardila2019common,
 title={Common voice: A massively-multilingual speech corpus},
 author={Ardila, Rosana and Branson, Megan and Davis, Kelly and Henretty, Michael and Kohler, Michael and Meyer, Josh and Morais, Reuben and Saunders, Lindsay and Tyers, Francis M and Weber, Gregor},
 journal={arXiv preprint arXiv:1912.06670},
 year={2019}
}

@article{wang2020covost,
 title={Covost 2 and massively multilingual speech-to-text translation},
 author={Wang, Changhan and Wu, Anne and Pino, Juan},
 journal={arXiv preprint arXiv:2007.10310},
 year={2020}
}

Licensing information:

Creative Commons CC0 license