Evaluation Dataset

Common Language

#Accuracy

Adapting Method:

Linear classifier. The features output by the upstream model are pooled before being fed into the linear classifier. The classifier consists of two linear layers and a ReLU activation layer. Its input dimension equals the dimension of the feature vector, and its output dimension equals the number of categories.
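The adapting method described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: mean pooling over time, the Linear -> ReLU -> Linear ordering, the hidden size, the feature size, and the random initialization are all assumed here, since the text specifies only "pooled" features, two linear layers, and one ReLU.

```python
import random

def mean_pool(frames):
    """Average frame-level feature vectors into one utterance-level vector.
    Mean pooling is an assumption; the text only says the features are pooled."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]

def linear(x, weight, bias):
    """Apply one linear layer (weight @ x + bias) to a single vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in x]

def init_layer(in_dim, out_dim, rng):
    """Create small random weights and zero biases (illustrative initialization)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def classify(frames, layer1, layer2):
    """Pool the upstream features, then apply Linear -> ReLU -> Linear (assumed order)."""
    pooled = mean_pool(frames)
    hidden = relu(linear(pooled, *layer1))
    return linear(hidden, *layer2)

FEATURE_DIM = 8    # hypothetical upstream feature dimension
HIDDEN_DIM = 16    # hypothetical hidden size, not given in the text
NUM_CLASSES = 10   # ten languages in the evaluation subset

rng = random.Random(0)
layer1 = init_layer(FEATURE_DIM, HIDDEN_DIM, rng)
layer2 = init_layer(HIDDEN_DIM, NUM_CLASSES, rng)

# Fake upstream output: 50 frames of FEATURE_DIM-dimensional features.
frames = [[rng.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)] for _ in range(50)]
logits = classify(frames, layer1, layer2)
predicted = max(range(NUM_CLASSES), key=lambda i: logits[i])
```

The input dimension of the first layer matches the feature dimension and the output dimension of the second layer matches the number of categories, as stated above; in practice the weights would be trained rather than left random.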

Data description:

This dataset is extracted from Common Voice and is used for training language identification systems. It contains 45 languages with a total recording time of 45.1 hours (roughly 1 hour of data per language). The dataset is already divided into training, validation, and test sets. Ten languages are selected from Common Language to build the language identification system used in this evaluation.

Dataset structure:

Amount of source data:

Training set: 30 h; validation set: 7.5 h; test set: 7.5 h.

Data detail:

The evaluation data is selected from the original dataset and covers ten languages: "Chinese_China", "Dutch", "English", "Greek", "Italian", "Mongolian", "Russian", "Spanish", "Swedish", and "Welsh". Note that the label file shown in the sample spells Mongolian as "Mangolian".

The selected subset contains 6.7 hours of training data, 1.7 hours of validation data, and 1.7 hours of test data.
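As a quick consistency check on the figures above, the following sketch sums the reported hours and computes each split's share. The dictionary and function names are illustrative only; the hour values are the ones quoted in this section. The full-dataset splits add up to 45.0 hours, slightly below the quoted 45.1 hours, which is presumably a rounding effect.

```python
# Hours reported for the full Common Language dataset and for the 10-language subset.
full_hours = {"train": 30.0, "validation": 7.5, "test": 7.5}
subset_hours = {"train": 6.7, "validation": 1.7, "test": 1.7}

def split_shares(hours):
    """Return each split's fraction of the total duration, rounded to 3 decimals."""
    total = sum(hours.values())
    return {name: round(value / total, 3) for name, value in hours.items()}

full_total = sum(full_hours.values())
subset_total = round(sum(subset_hours.values()), 1)  # round away float noise

full_shares = split_shares(full_hours)
subset_shares = split_shares(subset_hours)

print(full_total, full_shares)
print(subset_total, subset_shares)
```

Both the full dataset and the subset follow approximately the same two-thirds / one-sixth / one-sixth division, and the subset total of about 10 hours matches ten languages at roughly one hour each.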

The training, validation, and test data are each recorded in a corresponding JSON file. Each JSON file contains "labels" and "meta_data". "labels" maps each category name to its integer code, while "meta_data" lists, for every sample, its path, its label, and the split it belongs to (training, validation, or test). In the sample below, the split is stored under the "speaker" key.

Sample of source dataset:

{
    "labels": {
        "Chinese_China": 0,
        "Dutch": 1,
        "English": 2,
        "Greek": 3,
        "Italian": 4,
        "Mangolian": 5,
        "Russian": 6,
        "Spanish": 7,
        "Swedish": 8,
        "Welsh": 9
    },
    "meta_data": [
        {
            "path": "Chinese_China/train/chch_trn_sp_75/common_voice_zh-CN_20785511.wav",
            "label": "Chinese_China",
            "speaker": "train"
        },
        {
            "path": "English/train/eng_trn_sp_503/common_voice_en_21530721.wav",
            "label": "English",
            "speaker": "train"
        },
        {
            "path": "Italian/train/itln_trn_sp_555/common_voice_it_20265825.wav",
            "label": "Italian",
            "speaker": "train"
        },
        {
            "path": "Spanish/train/spa_trn_sp_227/common_voice_es_19635818.wav",
            "label": "Spanish",
            "speaker": "train"
        },
        {
            "path": "Welsh/train/wls_trn_sp_546/common_voice_cy_19197031.wav",
            "label": "Welsh",
            "speaker": "train"
        }
    ]
}
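A file with this structure could be read along the following lines. This is a minimal sketch rather than the benchmark's own loader: the function name `load_samples` is hypothetical, the manifest text is an abbreviated copy of the sample above, and treating the "speaker" field as the split name is an assumption based on the values shown in that sample.

```python
import json
from collections import defaultdict

# Abbreviated copy of the sample manifest shown above.
manifest_text = """
{
    "labels": {"Chinese_China": 0, "English": 2},
    "meta_data": [
        {"path": "Chinese_China/train/chch_trn_sp_75/common_voice_zh-CN_20785511.wav",
         "label": "Chinese_China", "speaker": "train"},
        {"path": "English/train/eng_trn_sp_503/common_voice_en_21530721.wav",
         "label": "English", "speaker": "train"}
    ]
}
"""

def load_samples(text):
    """Parse a manifest and group (path, label_id) pairs by split.

    Assumes the "speaker" field carries the split name, as in the sample.
    """
    manifest = json.loads(text)
    label_to_id = manifest["labels"]
    by_split = defaultdict(list)
    for entry in manifest["meta_data"]:
        label_id = label_to_id[entry["label"]]
        by_split[entry["speaker"]].append((entry["path"], label_id))
    return dict(by_split)

samples = load_samples(manifest_text)
for split, items in samples.items():
    print(split, len(items), items[0])
```

Looking labels up through the "labels" mapping, instead of hard-coding integer codes, keeps the loader consistent with whatever spelling the file uses for each category.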

Citation information:

@dataset{ganesh_sinisetty_2021_5036977,
  author       = {Ganesh Sinisetty and
                  Pavlo Ruban and
                  Oleksandr Dymov and
                  Mirco Ravanelli},
  title        = {CommonLanguage},
  month        = jun,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {0.1},
  doi          = {10.5281/zenodo.5036977},
  url          = {https://doi.org/10.5281/zenodo.5036977}
}

Licensing information:

CC BY 4.0
