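Task Introduction
Text-to-video retrieval takes a natural-language description as input and automatically finds the video clips in a large-scale video collection that are most semantically relevant to that text.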
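The sketch below makes the task concrete. It ranks candidate videos by cosine similarity between a text embedding and precomputed video embeddings. This scoring scheme is an illustrative assumption, not this benchmark's specification. The encoders that would produce the embeddings, and the names `cosine` and `rank_videos`, are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_videos(text_emb: list[float], video_embs: dict[str, list[float]]) -> list[str]:
    """Return video ids sorted from most to least similar to the text query."""
    return sorted(video_embs, key=lambda vid: cosine(text_emb, video_embs[vid]), reverse=True)


# Toy 3-d embeddings (hypothetical); real ones would come from text and video encoders.
videos = {
    "video_a": [0.9, 0.1, 0.0],
    "video_b": [0.1, 0.9, 0.0],
    "video_c": [0.5, 0.5, 0.0],
}
query = [1.0, 0.0, 0.0]
print(rank_videos(query, videos))  # ['video_a', 'video_c', 'video_b']
```

Cosine similarity is a common choice because it ignores embedding magnitude. Any other similarity function could be substituted in `rank_videos`.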
Evaluation Data
MSR-VTT
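Data Description:
MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset of videos paired with text annotations. It consists of 10,000 video clips from 20 categories, and each clip is annotated with 20 English sentences.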
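Dataset Composition and Specifications:
Source Data Size:
The dataset is split into a training set (6513 videos), a validation set (497 videos) and a test set (2990 videos). Each video has 20 corresponding text descriptions.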
Evaluation Data Size:
The evaluation data consists of the 2990 videos in the source test set and their corresponding text descriptions.
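As a consistency check, the split sizes above sum to the 10,000 clips stated in the description. Suppose each of the 20 captions per test video is used as a separate text query. The document does not say how queries are formed, so this is an assumption. Under it, the test split yields 2990 × 20 = 59,800 queries.

```python
# Split sizes (number of videos) as listed above.
splits = {"train": 6513, "validation": 497, "test": 2990}
CAPTIONS_PER_VIDEO = 20  # per the dataset description

total_videos = sum(splits.values())
# Assumption: every caption of every test video is used as its own query.
test_queries = splits["test"] * CAPTIONS_PER_VIDEO

print(total_videos)  # 10000
print(test_queries)  # 59800
```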
Source Data Fields:
KEY | DESCRIPTION |
---|---|
vid | video |
texts | corresponding text descriptions |
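The sketch below shows one way to hold a record with the `vid` and `texts` fields. It also includes a helper that flattens records into (text query, ground-truth video) pairs for retrieval evaluation. The class `MsrVttRecord`, the helper `to_query_pairs`, the `vid` values and the caption "a man plays basketball" are hypothetical. This document does not give the actual storage format or video identifiers.

```python
from dataclasses import dataclass


@dataclass
class MsrVttRecord:
    vid: str          # video identifier (hypothetical format)
    texts: list[str]  # caption annotations for this video


def to_query_pairs(records: list[MsrVttRecord]) -> list[tuple[str, str]]:
    """Flatten records into (text query, ground-truth vid) pairs."""
    return [(text, rec.vid) for rec in records for text in rec.texts]


# Hypothetical records for illustration only.
records = [
    MsrVttRecord(vid="video_x", texts=["a woman rolls dough", "a person making pastries"]),
    MsrVttRecord(vid="video_y", texts=["a man plays basketball"]),
]
pairs = to_query_pairs(records)
print(len(pairs))  # 3
print(pairs[0])    # ('a woman rolls dough', 'video_x')
```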
Source Dataset Example:
vid:
texts:
- a baker is demonstrating a cooking technique
- a female giving a baking demonstration in her kitchen
- a girl explaining to prepare a dish
- a lady with a scarf is cooking with dough
- a person is preparing some food
- a person making pastries
- a woman is making a pastry
- a woman is rolling doe
- a woman is rolling dough around a stick
- a woman is rolling dough
- a woman is rolling dough
- a woman is wrapping dough around some food item
- a woman rolling up pastry while giving instructions
- a woman rolls dough
- a woman showing an easy way to make crescent rolls
- how to prepare food rolls
- the pastry should have five creases
- a person is preparing some food
- a woman is rolling dough around a stick
- a woman rolls dough
Citation:
@inproceedings{xu2016msr-vtt,
author = {Xu, Jun and Mei, Tao and Yao, Ting and Rui, Yong},
title = {MSR-VTT: A Large Video Description Dataset for Bridging Video and Language},
year = {2016},
month = {June},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
}
UCF-101
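Data Description:
UCF101 is a video dataset collected from YouTube by the University of Central Florida. It covers 101 action categories and contains 13,320 videos in total.
Dataset Composition and Specifications:
Source Data Size:
The dataset is split into a training set (9537 videos) and a test set (3783 videos).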
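As a quick check, the two UCF101 splits add up to the 13,320 videos stated in the description. The test set makes up about 28.4% of the data.

```python
# Split sizes (number of videos) as listed above.
train_videos, test_videos = 9537, 3783
total_videos = train_videos + test_videos

print(total_videos)                                # 13320
print(round(test_videos / total_videos * 100, 1))  # 28.4 (percent of videos in the test set)
```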
Evaluation Data Size:
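The evaluation data consists of the videos in the source test set (3783 in total) and their corresponding action-category labels.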
Source Data Fields:
KEY | DESCRIPTION |
---|---|
vid | video |
label | action-category label of the video |
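UCF101 provides a category label rather than free-form captions, so a label-based query typically has many relevant videos. This document does not specify the evaluation protocol. The sketch below assumes that a retrieved video counts as relevant when its `label` equals the query label, and counts such hits among the top k results. The function `hits_at_k` and the example ids and labels are hypothetical.

```python
def hits_at_k(query_label: str, ranked_vids: list[str], labels: dict[str, str], k: int) -> int:
    """Count relevant videos in the top-k; relevance = label matches the query (assumption)."""
    return sum(1 for vid in ranked_vids[:k] if labels[vid] == query_label)


# Hypothetical video ids and labels for illustration.
labels = {"v1": "Playing Basketball", "v2": "Playing Basketball", "v3": "Surfing"}
ranking = ["v1", "v3", "v2"]  # e.g. the output of a text-to-video ranker
print(hits_at_k("Playing Basketball", ranking, labels, k=2))  # 1
print(hits_at_k("Playing Basketball", ranking, labels, k=3))  # 2
```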
Source Dataset Example:
vid:
label:
Playing Basketball
Citation:
@article{soomro2012ucf101,
title={UCF101: A dataset of 101 human actions classes from videos in the wild},
author={Soomro, Khurram and Zamir, Amir Roshan and Shah, Mubarak},
journal={arXiv preprint arXiv:1212.0402},
year={2012}
}