Evaluation Dataset

Common Voice-en2 (en-de) + CoVoST 2

Evaluation metric: Bilingual Evaluation Understudy (BLEU)
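As a rough illustration of how the metric is computed, the sketch below implements a simplified corpus-level BLEU with a single reference per sentence, whitespace tokenization, and no smoothing. The function names `ngram_counts` and `corpus_bleu` are hypothetical, and the tokenization is an assumption. Reported scores would normally come from a standard BLEU tool, so numbers from this sketch are not directly comparable.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU (0-100), one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = 0
    ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp_tokens, n)
            ref_ngrams = ngram_counts(ref_tokens, n)
            # Clipped matches: each n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((hyp_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(hyp_ngrams.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)
```

A hypothesis identical to its reference scores 100, and a hypothesis that shares no 4-gram with its reference scores 0 because no smoothing is applied.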

Adaptation Method

A 3-layer Transformer encoder followed by a 3-layer Transformer decoder. The features output by the upstream model first pass through a global average pooling layer and are then fed into the encoder. The input dimension of the encoder equals the dimension of the feature vector, and the output dimension equals the vocabulary size.
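The following PyTorch sketch shows one possible reading of this adaptation method. It is not the benchmark's actual implementation. The class name `TranslationHead`, the number of attention heads, the feed-forward width, and the token embedding are all assumptions. The sketch also assumes that the global average pooling runs over the time axis, so the encoder receives a single pooled vector per utterance. Positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn


class TranslationHead(nn.Module):
    """Hypothetical downstream head: 3 encoder layers + 3 decoder layers."""

    def __init__(self, feature_dim, vocab_size, num_layers=3, nhead=4):
        super().__init__()
        # Target tokens are embedded into the same dimension as the features.
        self.token_embedding = nn.Embedding(vocab_size, feature_dim)
        self.transformer = nn.Transformer(
            d_model=feature_dim,          # encoder input dim = feature dim
            nhead=nhead,                  # assumption: not stated in the text
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            dim_feedforward=4 * feature_dim,
            batch_first=True,
        )
        # Final projection so the output dimension equals the vocabulary size.
        self.output_projection = nn.Linear(feature_dim, vocab_size)

    def forward(self, upstream_features, target_tokens):
        # upstream_features: (batch, time, feature_dim)
        # target_tokens:     (batch, target_length)
        # Assumption: global average pooling over the time axis.
        pooled = upstream_features.mean(dim=1, keepdim=True)
        target_length = target_tokens.size(1)
        # Causal mask so each position only attends to earlier target tokens.
        causal_mask = torch.triu(
            torch.full((target_length, target_length), float("-inf"),
                       device=target_tokens.device),
            diagonal=1,
        )
        decoded = self.transformer(
            pooled, self.token_embedding(target_tokens), tgt_mask=causal_mask
        )
        return self.output_projection(decoded)
```

Given features of shape (batch, time, feature_dim) and target tokens of shape (batch, target_length), the module returns logits of shape (batch, target_length, vocab_size).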

Data description

Common Voice is a large multilingual dataset of transcribed speech, used for research on and validation of speech technology. It is collected and validated through crowdsourcing, covers 60 languages, and contains more than 7,327 hours of validated speech data.

CoVoST 2, an expansion of the Common Voice dataset, is a corpus of 2,900 hours of speech. It offers translations from English (En) into 15 languages: Arabic (Ar), Catalan (Ca), Welsh (Cy), German (De), Estonian (Et), Persian (Fa), Indonesian (Id), Japanese (Ja), Latvian (Lv), Mongolian (Mn), Slovenian (Sl), Swedish (Sv), Tamil (Ta), Turkish (Tr), and Chinese (Zh). It also provides translations from 21 languages into English: the 15 target languages above plus Spanish (Es), French (Fr), Italian (It), Dutch (Nl), Portuguese (Pt), and Russian (Ru).
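The translation directions listed above can be enumerated as follows. This is a small sketch that uses lowercase versions of the two-letter codes from the description. The variable names are hypothetical, and the identifiers in the released corpus files may differ from these codes.

```python
# Target languages for English -> X, as listed in the description.
EN_TO_X_TARGETS = [
    "ar", "ca", "cy", "de", "et", "fa", "id", "ja",
    "lv", "mn", "sl", "sv", "ta", "tr", "zh",
]
# Additional source languages that are only available as X -> English.
EXTRA_X_TO_EN_SOURCES = ["es", "fr", "it", "nl", "pt", "ru"]

en_to_x = [("en", target) for target in EN_TO_X_TARGETS]
x_to_en = [(source, "en") for source in EN_TO_X_TARGETS + EXTRA_X_TO_EN_SOURCES]
directions = en_to_x + x_to_en

print(len(en_to_x), len(x_to_en), len(directions))
```

The en-de direction used for this evaluation is the pair `("en", "de")` in `en_to_x`.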

Dataset structure

Amount of source data

Training set: 429.24 h; validation set: 26.07 h; test set: 24.64 h
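For reference, the total duration and the relative size of each split follow directly from these figures. The short sketch below does the arithmetic; the dictionary keys are illustrative names, not identifiers from the dataset.

```python
# Hours per split, copied from the figures above.
split_hours = {"train": 429.24, "dev": 26.07, "test": 24.64}

total_hours = sum(split_hours.values())
split_share = {
    name: 100 * hours / total_hours for name, hours in split_hours.items()
}

for name, hours in split_hours.items():
    print(f"{name}: {hours:.2f} h ({split_share[name]:.1f}%)")
print(f"total: {total_hours:.2f} h")
```

The three splits add up to 479.95 hours, of which the training set makes up roughly 89%.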

Data detail

Files: train.tsv, dev.tsv, test.tsv

Columns (tab-separated): id, audio, n_frames, sr, src_text, tgt_text

Sample of source dataset

common_voice_en_78232	common_voice_en_78232.mp3	47232	48000	i wish you wouldn't	Ich wünschte, du ließest es bleiben.
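A row in this format can be parsed with Python's standard `csv` module, as sketched below. The helper `read_manifest` is hypothetical. The sketch assumes that `n_frames` counts audio samples at the sampling rate `sr`, so that `n_frames / sr` gives the clip duration in seconds. It also tolerates an optional header row, since the description does not say whether the files contain one.

```python
import csv
import io

COLUMNS = ["id", "audio", "n_frames", "sr", "src_text", "tgt_text"]


def read_manifest(handle):
    """Read tab-separated manifest rows into dictionaries."""
    # QUOTE_NONE keeps quotes and apostrophes in the text fields untouched.
    reader = csv.DictReader(
        handle, fieldnames=COLUMNS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    rows = []
    for row in reader:
        if row["id"] == "id":
            continue  # skip a header row if the file has one
        row["n_frames"] = int(row["n_frames"])
        row["sr"] = int(row["sr"])
        # Assumption: n_frames is the number of audio samples at rate sr.
        row["duration_s"] = row["n_frames"] / row["sr"]
        rows.append(row)
    return rows


sample = (
    "common_voice_en_78232\tcommon_voice_en_78232.mp3\t47232\t48000\t"
    "i wish you wouldn't\tIch wünschte, du ließest es bleiben.\n"
)
rows = read_manifest(io.StringIO(sample))
print(rows[0]["src_text"], "->", rows[0]["tgt_text"])
print(rows[0]["duration_s"], "seconds")
```

Under that assumption, the sample row describes a clip of just under one second (47232 samples at 48000 Hz).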

Citation information

@article{ardila2019common,
 title={Common voice: A massively-multilingual speech corpus},
 author={Ardila, Rosana and Branson, Megan and Davis, Kelly and Henretty, Michael and Kohler, Michael and Meyer, Josh and Morais, Reuben and Saunders, Lindsay and Tyers, Francis M and Weber, Gregor},
 journal={arXiv preprint arXiv:1912.06670},
 year={2019}
}

@article{wang2020covost,
 title={Covost 2 and massively multilingual speech-to-text translation},
 author={Wang, Changhan and Wu, Anne and Pino, Juan},
 journal={arXiv preprint arXiv:2007.10310},
 year={2020}
}

Licensing information

Creative Commons CC0 license
