
SF (Slot Filling) Evaluation Data

SNIPS

#Accuracy, Robustness

Adaptation method:

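CTC decoder: the features output by the upstream model are passed through a two-layer LSTM followed by a linear classifier consisting of a single fully connected layer. The input dimension equals the dimension of the feature vectors, and the output dimension equals the number of slot types.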
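As an illustration of how per-frame classifier outputs become a label sequence under CTC, here is a minimal sketch of greedy (best-path) decoding: take the top class per frame, merge consecutive repeats, then drop blanks. The blank index `0`, the function name `ctc_greedy_decode`, and the toy scores are assumptions for this example, not details taken from the benchmark's implementation.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank class (hypothetical choice)


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Greedy (best-path) CTC decoding over per-frame class scores."""
    # Pick the highest-scoring class index for every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # Merge consecutive repeats, then remove blanks.
    collapsed = [cls for cls, _ in groupby(best_path)]
    return [cls for cls in collapsed if cls != blank]


# Toy input: 6 frames, 3 classes (0 = blank, 1 and 2 = hypothetical slot types).
frame_scores = [
    [0.10, 0.80, 0.10],  # class 1
    [0.20, 0.70, 0.10],  # class 1 again, merged with the previous frame
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.80, 0.10],  # class 1, a new token because a blank separates it
    [0.10, 0.20, 0.70],  # class 2
    [0.80, 0.10, 0.10],  # blank
]
decoded = ctc_greedy_decode(frame_scores)
print(decoded)  # [1, 1, 2]
```

The blank between the two runs of class 1 is what lets CTC emit the same label twice in a row; without it the repeats would collapse into one.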

Related paper citation:

@inproceedings{graves2006connectionist,
title={Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks},
author={Graves, Alex and Fern{\'a}ndez, Santiago and Gomez, Faustino and Schmidhuber, J{\"u}rgen},
booktitle={Proceedings of the 23rd international conference on Machine learning},
pages={369--376},
year={2006}
}

Data description:

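The SNIPS Natural Language Understanding benchmark is a dataset of more than 16,000 crowdsourced queries in English, distributed across 7 user intents of varying complexity: SearchCreativeWork (e.g. Find me the I, Robot television show), GetWeather (e.g. Is it windy in Boston, MA right now?), BookRestaurant (e.g. I want to book a highly rated restaurant in Paris tomorrow night), PlayMusic (e.g. Play the last track from Beyoncé off Spotify), AddToPlaylist (e.g. Add Diamonds to my roadtrip playlist), RateBook (e.g. Give 6 stars to Of Mice and Men), and SearchScreeningEvent (e.g. Check the showtimes for Wonder Woman in Paris).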

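Dataset composition and specifications:

Source data size:

Training set: 13,084 samples; validation set: 700 samples; test set: 700 samples.

Evaluation data size:

The evaluation data is the public test set of 700 samples.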

Source data fields:

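| Key | Description |
| --- | --- |
| id | Path to the sample's MP3 file |
| text | Transcript corresponding to the MP3 file |
| label | Slot type of each token |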

Source dataset sample:

{
    "id": "Aditi-snips-test-0",
    "text": "BOS I'D LIKE TO HAVE THIS TRACK ONTO MY CLASSICAL RELAXATIONS PLAYLIST EOS",
    "label": "O O O O O O music_item O playlist_owner playlist playlist O AddToPlaylist"
}
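To make the token-to-label alignment in the sample concrete, the following sketch splits `text` and `label` on whitespace, pairs them up, and groups adjacent tokens that share a slot type. It assumes, based only on this one sample, that `BOS` and `EOS` are sentence markers and that the label aligned with `EOS` carries the intent; `parse_record` is a hypothetical helper, not part of any released tooling.

```python
import json
from itertools import groupby

sample = json.loads("""
{
    "id": "Aditi-snips-test-0",
    "text": "BOS I'D LIKE TO HAVE THIS TRACK ONTO MY CLASSICAL RELAXATIONS PLAYLIST EOS",
    "label": "O O O O O O music_item O playlist_owner playlist playlist O AddToPlaylist"
}
""")


def parse_record(record):
    """Return (intent, [(slot_type, phrase), ...]) for one record."""
    tokens = record["text"].split()
    labels = record["label"].split()
    if len(tokens) != len(labels):
        raise ValueError("text and label must contain the same number of tokens")
    # Assumption: the label aligned with the trailing EOS token is the intent.
    intent = labels[-1]
    # Drop the BOS/EOS positions and keep the word-level pairs.
    pairs = list(zip(tokens[1:-1], labels[1:-1]))
    slots = []
    # Group runs of adjacent tokens that share the same slot type.
    for slot_type, run in groupby(pairs, key=lambda pair: pair[1]):
        if slot_type != "O":
            slots.append((slot_type, " ".join(token for token, _ in run)))
    return intent, slots


intent, slots = parse_record(sample)
print(intent)  # AddToPlaylist
print(slots)
```

Because the labels carry no B-/I- prefixes, this grouping cannot tell apart two different slots of the same type that happen to be adjacent; it would merge them into one phrase.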

Paper citation:

@article{coucke2018snips,
  title={Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces},
  author={Coucke, Alice and Saade, Alaa and Ball, Adrien and Bluche, Th{\'e}odore and Caulier, Alexandre and Leroy, David and Doumouro, Cl{\'e}ment and Gisselbrecht, Thibault and Caltagirone, Francesco and Lavril, Thibaut and others},
  journal={arXiv preprint arXiv:1805.10190},
  year={2018}
}

Source dataset license:

Creative Commons Zero v1.0 Universal