
Intent Classification (IC) Evaluation Data

Fluent Speech Commands

Metric: Accuracy

Adaptation Method:

Linear Classifier: Features output by the upstream model are first passed through a global average pooling layer, then fed into a linear classifier consisting of a single fully connected layer. The classifier's input dimension equals the dimension of the pooled feature vector, and its output dimension equals the number of classes.
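
A minimal sketch of this adaptation head, assuming PyTorch and upstream features shaped (batch, time, feature_dim); the feature dimension and class count used below are illustrative values, not taken from this page:

# Rough sketch of the linear-probe head described above (assumed PyTorch;
# feature_dim=768 and num_classes=31 are illustrative, not specified here).
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    def __init__(self, feature_dim, num_classes):
        super().__init__()
        # Single fully connected layer: feature_dim -> num_classes
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, features):
        # features: (batch, time, feature_dim) from the upstream model
        pooled = features.mean(dim=1)   # global average pooling over time
        return self.classifier(pooled)  # (batch, num_classes) logits

probe = LinearProbe(feature_dim=768, num_classes=31)
logits = probe(torch.randn(4, 200, 768))  # -> shape (4, 31)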

Data Description:

The Fluent Speech Commands dataset contains 30,043 utterances from 97 speakers. Each file contains a voice command for controlling smart appliances or virtual assistants. The dataset includes three categories of intent (Action, Object, Location), encompassing a total of 31 unique sub-intents. The language is English.

Dataset structure:

Amount of source data:

Training set: 23,132 items, Validation set: 3,118 items, Test set: 3,793 items

Amount of evaluation data:

The evaluation data is the public test set of 3,793 items.

Data detail:

KEYS            EXPLANATION
id              Data ID
path            Path to the corresponding WAV file
speakerId       Speaker ID
transcription   Text corresponding to the speech
action          Action-type intent
object          Object-type intent
location        Location-type intent

Sample of source dataset:

{
  "id":0,
  "path":"wavs/speakers/4BrX8aDqK2cLZRYl/cbdf5700-452c-11e9-b1e4-e5985dca719e.wav",
  "speakerId":"4BrX8aDqK2cLZRYl",
  "transcription":"Turn on the lights",
  "action":"activate",
  "object":"lights",
  "location":"none"
}
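
A small sketch of reading such a record and combining its three intent slots into a single classification label, assuming records are stored as JSON in the form shown above; the joint "action_object_location" labeling is an assumption, not something this page specifies:

# Parse one record with the fields listed above and build a joint intent label.
import json

record = json.loads("""
{
  "id": 0,
  "path": "wavs/speakers/4BrX8aDqK2cLZRYl/cbdf5700-452c-11e9-b1e4-e5985dca719e.wav",
  "speakerId": "4BrX8aDqK2cLZRYl",
  "transcription": "Turn on the lights",
  "action": "activate",
  "object": "lights",
  "location": "none"
}
""")

# Combine the three slots into one label, e.g. "activate_lights_none"
intent = "_".join([record["action"], record["object"], record["location"]])
print(record["path"], intent)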

Citation information:

@article{lugosch2019speech,
  title={Speech model pre-training for end-to-end spoken language understanding},
  author={Lugosch, Loren and Ravanelli, Mirco and Ignoto, Patrick and Tomar, Vikrant Singh and Bengio, Yoshua},
  journal={arXiv preprint arXiv:1904.03670},
  year={2019}
}

Licensing information:

Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license