
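Task Introduction

Text-to-video generation takes a user-supplied text description and, by "associating" and "creating" from it, automatically generates a video that is semantically consistent with the text, realistic in content, temporally coherent, and logically plausible.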

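Evaluation Data

MSR-VTT

Data description:

MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset of videos with corresponding text annotations. It consists of 10,000 video clips from 20 categories, and each clip is annotated with 20 English sentences.

Dataset composition and specification:

Source data size:

The dataset is split into a training set (6,513 clips), a validation set (497 clips), and a test set (2,990 clips). Each clip has 20 corresponding text descriptions.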
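As a quick consistency check on the figures above, the following minimal sketch sums the split sizes and derives the caption counts they imply. The constant and function names are illustrative and not part of any official MSR-VTT tooling.

```python
# Split sizes and captions-per-clip as stated in the text above.
# The names below are illustrative assumptions, not an official API.
MSRVTT_SPLITS = {"train": 6513, "val": 497, "test": 2990}
CAPTIONS_PER_VIDEO = 20


def total_videos(splits):
    """Return the number of clips across all given splits."""
    return sum(splits.values())


def total_captions(splits, captions_per_video=CAPTIONS_PER_VIDEO):
    """Return the number of captions implied by the given splits."""
    return total_videos(splits) * captions_per_video


if __name__ == "__main__":
    print(total_videos(MSRVTT_SPLITS))    # 10000
    print(total_captions(MSRVTT_SPLITS))  # 200000
    # Captions attached to the test split alone
    print(total_captions({"test": MSRVTT_SPLITS["test"]}))  # 59800
```

The three splits add up to the 10,000 clips given in the data description, which implies 200,000 captions overall and 59,800 for the test split.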

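Evaluation data size:

The evaluation data consists of the 2,990 clips in the source test set together with their text descriptions.

Source data fields:

KEYS    EXPLAIN
vid     video
texts   corresponding text descriptions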
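One way to hold a record with these two fields in code is sketched below. This is a hypothetical container: the class name, the choice to represent `vid` as a path string, and the helper methods are assumptions made for illustration, not a format defined by the dataset.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MSRVTTSample:
    """Hypothetical in-memory record mirroring the `vid` / `texts` fields."""

    vid: str          # assumed to be a path or identifier of the video clip
    texts: List[str]  # the text descriptions attached to this clip

    def validate(self, expected_captions: int = 20) -> None:
        """Raise ValueError if the record does not match the stated layout."""
        if not self.vid:
            raise ValueError("vid must be non-empty")
        if len(self.texts) != expected_captions:
            raise ValueError(
                f"expected {expected_captions} captions, got {len(self.texts)}"
            )
        if any(not t.strip() for t in self.texts):
            raise ValueError("captions must be non-empty strings")

    def unique_texts(self) -> List[str]:
        """Return the captions with exact duplicates removed, order preserved."""
        return list(dict.fromkeys(self.texts))


if __name__ == "__main__":
    # "video_0001.mp4" is a made-up file name used only for this example.
    captions = ["a woman rolls dough"] * 2 + [f"caption {i}" for i in range(18)]
    sample = MSRVTTSample(vid="video_0001.mp4", texts=captions)
    sample.validate()
    print(len(sample.unique_texts()))  # 19
```

The `unique_texts` helper is included because annotators may write identical sentences for the same clip, so a record can carry 20 captions but fewer distinct ones.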

Source dataset sample:

vid:
[video clip]
texts:

  1. a baker is demonstrating a cooking technique
  2. a female giving a baking demonstration in her kitchen
  3. a girl explaining to prepare a dish
  4. a lady with a scarf is cooking with dough
  5. a person is preparing some food
  6. a person making pastries
  7. a woman is making a pastry
  8. a woman is rolling doe
  9. a woman is rolling dough around a stick
  10. a woman is rolling dough
  11. a woman is rolling dough
  12. a woman is wrapping dough around some food item
  13. a woman rolling up pastry while giving instructions
  14. a woman rolls dough
  15. a woman showing an easy way to make crescent rolls
  16. how to prepare food rolls
  17. the pastry should have five creases
  18. a person is preparing some food
  19. a woman is rolling dough around a stick
  20. a woman rolls dough

Citation:

@inproceedings{xu2016msr-vtt,
author = {Xu, Jun and Mei, Tao and Yao, Ting and Rui, Yong},
title = {MSR-VTT: A Large Video Description Dataset for Bridging Video and Language},
year = {2016},
month = {June},
booktitle = {IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)},
}

UCF-101

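Data description:

UCF101 is a video dataset of 101 action categories collected from YouTube by the University of Central Florida. It contains 13,320 videos in total.

Dataset composition and specification:

Source data size:

The dataset is split into a training set (9,537 videos) and a test set (3,783 videos).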

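Evaluation data size:

The evaluation data consists of 2,990 videos from the source test set and their corresponding text descriptions.

Source data fields:

KEYS    EXPLAIN
vid     video
label   action category label of the video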
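A record with these two fields, plus a per-class count over a collection of records, can be sketched as follows. The class name, the helper function, and the file names are hypothetical and used only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class UCF101Sample:
    """Hypothetical record mirroring the `vid` / `label` fields."""

    vid: str    # assumed to be a path or identifier of the video
    label: str  # action category label, e.g. "Playing Basketball"


def label_histogram(samples: Iterable[UCF101Sample]) -> Counter:
    """Count how many videos carry each action label."""
    return Counter(s.label for s in samples)


if __name__ == "__main__":
    # File names and the second label are made up for this example.
    samples = [
        UCF101Sample(vid="clip_a.avi", label="Playing Basketball"),
        UCF101Sample(vid="clip_b.avi", label="Playing Basketball"),
        UCF101Sample(vid="clip_c.avi", label="Example Action"),
    ]
    hist = label_histogram(samples)
    print(hist["Playing Basketball"])  # 2
    print(len(hist))                   # 2
```

A histogram like this is a simple way to confirm that a loaded split covers the expected number of action categories.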

Source dataset sample:

vid:
[video clip]

label:
Playing Basketball

Citation:

@article{soomro2012ucf101,
  title={UCF101: A dataset of 101 human actions classes from videos in the wild},
  author={Soomro, Khurram and Zamir, Amir Roshan and Shah, Mubarak},
  journal={arXiv preprint arXiv:1212.0402},
  year={2012}
}