Evaluation Data

MS-COCO

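Data description:

MS-COCO stands for Microsoft Common Objects in Context (COCO for short). The dataset originated in 2014, when Microsoft funded its annotation. COCO covers 1.5 million object instances, 80 object categories, and 91 stuff categories, and is used for object detection, segmentation, text-to-image generation, image captioning, and related tasks.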

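Dataset composition and specification:

Source data size:

The dataset is split into a training set (118,287 images), a validation set (5,000), and a test set (40,670). Each image has 5 corresponding text descriptions.

Evaluation data size:

The evaluation data consists of the 40,670 images in the source test set and their corresponding text descriptions.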

Source data fields:

KEYS	EXPLAIN
img	Image
texts	Corresponding text descriptions
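The two fields above can be modeled as a small record type. The following is a minimal sketch, not the evaluation pipeline's actual loader: the class name `CaptionedImage`, the use of a file path to stand in for the image, and the path itself are assumptions made for illustration. The check on the caption count reflects the 5 descriptions per image stated above.

```python
from dataclasses import dataclass
from typing import List

CAPTIONS_PER_IMAGE = 5  # stated above for MS-COCO


@dataclass
class CaptionedImage:
    """One record: an image reference plus its text descriptions."""

    img: str          # assumption: a file path stands in for the image itself
    texts: List[str]  # the corresponding text descriptions

    def __post_init__(self) -> None:
        # Reject records that do not carry the expected number of captions.
        if len(self.texts) != CAPTIONS_PER_IMAGE:
            raise ValueError(
                f"expected {CAPTIONS_PER_IMAGE} captions, got {len(self.texts)}"
            )


record = CaptionedImage(
    img="images/example.jpg",  # hypothetical path
    texts=[
        "A red hair woman holding an open box of pizza.",
        "A young woman holding a pizza in a box.",
        "a woman is holding a box of pizza.",
        "A woman is posing with an open pizza box.",
        "A woman holds an open box of pizza.",
    ],
)
print(record.img, len(record.texts))
```

Other datasets on this page use a different caption count, so the constant would need to change per dataset.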

Source dataset sample:

img:

texts:

A red hair woman holding an open box of pizza.
A young woman holding a pizza in a box.
a woman is holding a box of pizza.
A woman is posing with an open pizza box.
A woman holds an open box of pizza.

Source dataset license:

Creative Commons Attribution 4.0 License

Citation:

@misc{MS-COCO,
  title={Microsoft coco: Common objects in context},
  author={Lin, Tsung-Yi and others},
  year={2014},
  howpublished={ECCV 2014}
}

CUB

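Data description:

CUB-200, in full the Caltech-UCSD Birds-200-2011 dataset, is a bird image database provided by the California Institute of Technology. It contains 11,788 images of 200 bird species. In practice the species are commonly divided into a training set (100 species), a validation set (50 species), and a test set (50 species).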

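Dataset composition and specification:

Source data size:

The dataset is split into a training set (8,855 images) and a test set (2,933). Each image has 10 corresponding text descriptions.

Evaluation data size:

The evaluation data consists of the 2,933 images in the source test set and their corresponding text descriptions.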
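As a quick consistency check on the figures above, the two splits should add up to the 11,788 images stated in the description, and the caption count gives the number of image-text pairs in the evaluation data. This is only an arithmetic sketch using the numbers quoted on this page; the variable names are illustrative.

```python
# Figures as stated for CUB on this page.
TOTAL_IMAGES = 11_788
SPLITS = {"train": 8_855, "test": 2_933}
CAPTIONS_PER_IMAGE = 10

# The two splits should account for every image in the dataset.
splits_total = sum(SPLITS.values())
assert splits_total == TOTAL_IMAGES

# Number of (image, caption) pairs in the evaluation data.
eval_pairs = SPLITS["test"] * CAPTIONS_PER_IMAGE
print(eval_pairs)  # 29330
```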

Source data fields:

KEYS	EXPLAIN
img	Image
texts	Corresponding text descriptions

Source dataset sample:

img:

texts:

this small blue bird has a white bill and black legs.
this bird has a short white bill along with a vibrant blue belly, and fluffy blue breast.
a small sized bird that is mostly blue and has a short thick bill
small, but wide bird with a small beak and an almost non existent head, all blue body.
small chubby bird with a blue body, and bluish green wings and tail
this bird is blue with black and has a very short beak.
the small bird is blue in color with a small grey beak.
this bird is vivid blue and black in color, with a stubby multi colored beak.
a small bird that is blue, has narrow legs, a long tail, and a short beak that curves downward.
this bird has wings that are black and has a blue belly

Citation:

@techreport{WahCUB_200_2011,
	Title = {{The Caltech-UCSD Birds-200-2011 Dataset}},
	Author = {Wah, C. and Branson, S. and Welinder, P. and Perona, P. and Belongie, S.},
	Year = {2011},
	Institution = {California Institute of Technology},
	Number = {CNS-TR-2011-001}
}

Oxford-102 Flower

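Data description:

Oxford-102 Flower is a flower dataset released in 2008 by the Department of Engineering Science at the University of Oxford. The flowers chosen are ones commonly found in the United Kingdom, and the dataset covers 102 flower categories in total.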

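Dataset composition and specification:

Source data size:

The dataset is split into a training set (6,149 images) and a test set (2,040). Each image has 10 corresponding text descriptions.

Evaluation data size:

The evaluation data consists of the 2,040 images in the source test set and their corresponding text descriptions.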

Source data fields:

KEYS	EXPLAIN
img	Image
texts	Corresponding text descriptions

Source dataset sample:

img:

texts:

the petals of the flower are pink in color and have a yellow center.
this flower is pink and white in color, with petals that are multi colored.
the geographical shapes of the bright purple petals set off the orange stamen and filament and the cross shaped stigma is beautiful.
the purple petals have shades of white with white anther and filament
this flower has large pink petals and a white stigma in the center
this flower has petals that are pink and has a yellow stamen
a flower with short and wide petals that is light purple.
this flower has small pink petals with a yellow center.
this flower has large rounded pink petals with curved edges and purple veins.
this flower has purple petals as well as a white stamen.

Citation:

@inproceedings{nilsback2008automated,
  title={Automated flower classification over a large number of classes},
  author={Nilsback, Maria-Elena and Zisserman, Andrew},
  booktitle={2008 Sixth Indian conference on computer vision, graphics \& image processing},
  pages={722--729},
  year={2008},
  organization={IEEE}
}

mg18_en

Evaluation metrics: FID, CLIP Score
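To make the two metrics concrete, here is a minimal sketch of the quantities behind them, using only the standard library. It is not the evaluation code used for this benchmark. Two assumptions are made and should be read as such: embeddings and feature statistics are taken as already computed (no CLIP or Inception model is loaded), and the Fréchet distance is shown under a diagonal-covariance simplification rather than the full-covariance form that FID normally uses. The `scale` factor in the CLIP-style score is also an assumption, since conventions vary between implementations.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def clip_style_score(
    image_emb: Sequence[float], text_emb: Sequence[float], scale: float = 100.0
) -> float:
    """Scaled cosine similarity, clipped at zero (scale is an assumption)."""
    return scale * max(cosine_similarity(image_emb, text_emb), 0.0)


def frechet_distance_diag(
    mean_a: Sequence[float],
    var_a: Sequence[float],
    mean_b: Sequence[float],
    var_b: Sequence[float],
) -> float:
    """Squared Frechet distance between two Gaussians with DIAGONAL covariance.

    Simplification: with diagonal covariances the trace term reduces to the
    sum of squared differences of the per-dimension standard deviations.
    """
    mean_term = sum((ma - mb) ** 2 for ma, mb in zip(mean_a, mean_b))
    cov_term = sum(
        (math.sqrt(va) - math.sqrt(vb)) ** 2 for va, vb in zip(var_a, var_b)
    )
    return mean_term + cov_term


# Toy inputs: identical directions score highest, orthogonal ones score zero.
print(clip_style_score([1.0, 0.0], [2.0, 0.0]))
print(clip_style_score([1.0, 0.0], [0.0, 1.0]))
print(frechet_distance_diag([0.0], [1.0], [3.0], [4.0]))
```

A higher CLIP-style score indicates closer image-text agreement, while a lower Fréchet distance indicates that generated and reference feature distributions are closer.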

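Data description

This dataset is used to evaluate the quality of multilingual image generation. It contains 7,000 high-quality image-text pairs in 18 languages and was built by extending the XM-3600 dataset and combining it with high-quality images from the WIT dataset. It is used to evaluate a model's ability to generate general-purpose images.

This is the English version of the dataset.

Dataset composition

Two languages, Chinese and English, were selected, with 2,500 prompts for each language.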

Source dataset license:

apache-2.0

Citation:

@misc{ye2023altdiffusion,
  title={AltDiffusion: A Multilingual Text-to-Image Diffusion Model},
  author={Fulong Ye and Guang Liu and Xinya Wu and Ledell Wu},
  year={2023},
  eprint={2308.09991},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

mg18_zh

Evaluation metrics: FID, CLIP Score

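Data description

This is the Chinese version of the dataset.

Dataset composition

Two languages, Chinese and English, were selected, with 2,500 prompts for each language.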

Source dataset license:

apache-2.0

Citation:

Image-gen-v1.0

Evaluation metrics: subjective human evaluation along three dimensions: image-text consistency, image quality, and safety.
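One simple way to summarize human ratings along these three dimensions is a per-dimension average across annotators. The sketch below illustrates that idea only. The rating scale, the dictionary keys, and the use of a plain mean are all assumptions, since the actual scoring protocol is not described here.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical keys for the three dimensions named above.
DIMENSIONS = ("consistency", "quality", "safety")


def aggregate_ratings(ratings: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each dimension over all annotator ratings for one image."""
    return {dim: mean(r[dim] for r in ratings) for dim in DIMENSIONS}


# Two annotators rating the same generated image on an assumed 1-5 scale.
ratings = [
    {"consistency": 4.0, "quality": 3.0, "safety": 5.0},
    {"consistency": 5.0, "quality": 4.0, "safety": 5.0},
]
summary = aggregate_ratings(ratings)
print(summary)
```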

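Data description

A text-to-image evaluation dataset newly created by BAAI (智源), containing 414 prompts, mainly in Chinese and English. The prompts are designed to cover a range of entities (people, animals and plants, scenery, weather, etc.), attributes (color, emotion, atmosphere, etc.), and styles (realistic, anime, photographic, etc.), as well as content that requires reasoning and complex text understanding, so that models are evaluated comprehensively along different dimensions.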

CelebA-HQ

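Data description:

CelebA stands for CelebFaces Attributes, a dataset of celebrity face attributes. It contains 202,599 face images of 10,177 celebrity identities. CelebA is made publicly available by The Chinese University of Hong Kong and is widely used for face-related computer vision training tasks.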

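Dataset composition and specification:

Source data size:

Training set (24,183 images), validation set (2,993), and test set (2,824). Each image has 10 captions.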

Data fields:

KEYS	EXPLAIN
img	Image
texts	Image captions

Source dataset sample:

img:

texts:

The person has pale skin, wavy hair, black hair, pointy nose, high cheekbones, big lips, and arched eyebrows and is wearing heavy makeup.
This attractive person has wavy hair, and big nose.
This person has black hair, wavy hair, arched eyebrows, pointy nose, pale skin, big nose, and big lips. She is attractive, and young and is wearing heavy makeup, and lipstick.
The woman wears lipstick. She has big nose, high cheekbones, arched eyebrows, wavy hair, and big lips. She is smiling, and young.
She wears earrings. She has big nose, and pointy nose. She is smiling.
This attractive person has pale skin.
She is wearing lipstick, and earrings. She is attractive, and smiling and has arched eyebrows, wavy hair, big lips, high cheekbones, big nose, and black hair.
This smiling, and young woman has pointy nose.
This woman is attractive and has wavy hair, high cheekbones, black hair, arched eyebrows, big nose, big lips, and pointy nose.
This person has black hair, high cheekbones, wavy hair, big lips, pointy nose, and pale skin and is wearing heavy makeup.

Citation:

@inproceedings{liu2015faceattributes,
 title = {Deep Learning Face Attributes in the Wild},
 author = {Liu, Ziwei and Luo, Ping and Wang, Xiaogang and Tang, Xiaoou},
 booktitle = {Proceedings of International Conference on Computer Vision (ICCV)},
 year = {2015} 
}

@inproceedings{karras2017progressive,
  title={Progressive growing of gans for improved quality, stability, and variation},
  author={Karras, Tero and Aila, Timo and Laine, Samuli and Lehtinen, Jaakko},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2018}
}

Dataset license:

Use is restricted to non-commercial research and educational purposes.

MSR-VTT

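Data description:

MSR-VTT stands for Microsoft Research Video to Text, a large-scale dataset of videos with corresponding text annotations. It contains 10,000 video clips in 20 categories, and each clip is annotated with 20 English sentences.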

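Dataset composition and specification:

Source data size:

Training set (6,513 videos), validation set (497), and test set (2,990). Each video has 20 captions.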

Data fields:

KEYS	EXPLAIN
vid	Video
texts	Video captions

Source dataset sample:

vid:

texts:

a baker is demonstrating a cooking technique
a female giving a baking demonstration in her kitchen
a girl explaining to prepare a dish
a lady with a scarf is cooking with dough
a person is preparing some food
a person making pastries
a woman is making a pastry
a woman is rolling doe
a woman is rolling dough around a stick
a woman is rolling dough
a woman is rolling dough
a woman is wrapping dough around some food item
a woman rolling up pastry while giving instructions
a woman rolls dough
a woman showing an easy way to make crescent rolls
how to prepare food rolls
the pastry should have five creases
a person is preparing some food
a woman is rolling dough around a stick
a woman rolls dough
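The 20 captions in the sample above include exact repeats (for example, "a woman is rolling dough" appears more than once). If distinct captions per video are wanted downstream, they can be deduplicated while keeping their original order. Whether the evaluation itself deduplicates captions is not stated here, so the sketch below is an assumption about one possible preprocessing step, and the function name is illustrative.

```python
from typing import Iterable, List


def unique_captions(captions: Iterable[str]) -> List[str]:
    """Drop exact duplicate captions while keeping first-seen order."""
    # dict keys preserve insertion order, so they act as an ordered set.
    return list(dict.fromkeys(captions))


# A subset of the sample captions above, including two repeated lines.
sample = [
    "a woman is rolling dough around a stick",
    "a woman is rolling dough",
    "a woman is rolling dough",
    "a woman rolls dough",
    "a woman rolls dough",
]
distinct = unique_captions(sample)
print(len(sample), len(distinct))
```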

Citation:

@inproceedings{xu2016msr-vtt,
author = {Xu, Jun and Mei, Tao and Yao, Ting and Rui, Yong},
title = {MSR-VTT: A Large Video Description Dataset for Bridging Video and Language},
year = {2016},
month = {June},
booktitle = {IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)}
}

UCF-101

Data description:

UCF101 is a video dataset collected by the University of Central Florida. It contains 13,320 YouTube videos covering 101 action categories.

Dataset composition and specification:

Source data size:

Training set (9,537 videos) and test set (3,783).

Data fields:

KEYS	EXPLAIN
vid	Video
labels	Video label

Source dataset sample:

vid:

labels:
Playing Basketball

Citation:

@article{soomro2012ucf101,
  title={UCF101: A dataset of 101 human actions classes from videos in the wild},
  author={Soomro, Khurram and Zamir, Amir Roshan and Shah, Mubarak},
  journal={arXiv preprint arXiv:1212.0402},
  year={2012}
}

RelScene

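Data description:

RelScene is a comprehensive collection of 3D scenes with text descriptions. The spatial relations between objects are annotated, and both templated and free-form natural-language descriptions are provided.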

Dataset composition and specification:

Source data size:

The dataset is split into a training set (4,854) and a test set (900).

Data fields:

KEYS	EXPLAIN
img	Image
prompt	Text description of the image (caption)

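Source dataset sample:

img:

prompt: In this 3D scene, a Pendant Lamp is directly above a Dining Table. To the left of the dining table is a Drawer Chest / Corner cabinet, aligned with the table. Another dining table is to the left of the first dining table. The first dining table is to the right rear of the Drawer Chest / Corner cabinet and is also aligned with it. In addition, the first dining table is directly below the pendant lamp. Finally, there is a second pendant lamp to the left of the first pendant lamp.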
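The sample prompt reads like a sequence of (object, spatial relation, object) statements, which suggests how a templated description could be assembled from annotated relations. The sketch below illustrates that idea only. It is not RelScene's actual annotation format or generation code, and the `SpatialRelation` type, its field names, and the sentence template are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SpatialRelation:
    """Hypothetical triple: subject, spatial relation, reference object."""

    subject: str
    relation: str  # e.g. "directly above", "to the left of"
    target: str


def render_template(relations: List[SpatialRelation]) -> str:
    """Turn relation triples into one templated sentence per relation."""
    sentences = [
        f"The {r.subject} is {r.relation} the {r.target}." for r in relations
    ]
    return " ".join(sentences)


relations = [
    SpatialRelation("pendant lamp", "directly above", "dining table"),
    SpatialRelation("drawer chest", "to the left of", "dining table"),
]
description = render_template(relations)
print(description)
```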

Citation:

@inproceedings{ye2024relscene,
  title={RelScene: A Benchmark and baseline for Spatial Relations in text-driven 3D Scene Generation},
  author={Ye, Zhaoda and Zheng, Xinhan and Liu, Yang and Peng, Yuxin},
  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
  pages={10563--10571},
  year={2024}
}
