Evaluation Dataset

MS-COCO

Data description：

MS-COCO stands for Microsoft Common Objects in Context, a dataset funded by Microsoft and released in 2014. It covers 1.5 million object instances. Its detection annotations use 80 object categories, drawn from the 91 categories defined in the original paper. It is used for object detection, segmentation, text generation, image captioning, and related tasks.

Dataset structure：

Amount of source data：

The dataset is split into train (118,287), validation (5,000), and test (40,670) sets. Each image has 5 captions.

Data detail：

KEY	DESCRIPTION
img	image
texts	captions of the image

Sample of source dataset：

img:
[image]
texts:

A red hair woman holding an open box of pizza.
A young woman holding a pizza in a box.
a woman is holding a box of pizza.
A woman is posing with an open pizza box.
A woman holds an open box of pizza.
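
The record layout described above (an `img` key and a `texts` key) can be sketched as a small data structure. This is only an illustrative sketch: the class name `CaptionedImage`, the helper `iter_pairs`, and the file path are hypothetical and are not part of any loader for this dataset. The captions are the first three of the sample shown above.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class CaptionedImage:
    """One record with the two documented keys: `img` and `texts`."""
    img: str          # hypothetical: a file path standing in for the image
    texts: List[str]  # the captions of the image


def iter_pairs(record: CaptionedImage) -> Iterator[Tuple[str, str]]:
    """Yield one (image, caption) pair per caption of the record."""
    for caption in record.texts:
        yield record.img, caption


sample = CaptionedImage(
    img="images/example.jpg",  # hypothetical path, not a real dataset file
    texts=[
        "A red hair woman holding an open box of pizza.",
        "A young woman holding a pizza in a box.",
        "a woman is holding a box of pizza.",
    ],
)

pairs = list(iter_pairs(sample))
print(len(pairs))
```

Flattening a record into (image, caption) pairs like this is one common way to feed caption data to a text-to-image evaluation loop, since each caption can then serve as an independent prompt.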

Licensing information：

Creative Commons Attribution 4.0 License

Citation information：

@inproceedings{MS-COCO,
  title={Microsoft {COCO}: Common objects in context},
  author={Lin, Tsung-Yi and others},
  booktitle={ECCV 2014},
  year={2014}
}

CUB

Data description：

The full name of the CUB dataset is Caltech-UCSD Birds-200-2011. It is a bird image database provided by the California Institute of Technology and contains a total of 11,788 images of 200 bird species. It is usually divided by class into a training set (100 classes), a validation set (50 classes), and a test set (50 classes).

Dataset structure：

Amount of source data：

The dataset is split into train (8,855) and test (2,933) sets. Each image has 10 captions.
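
As a quick sanity check on these figures, the split sizes should add up to the 11,788 images given in the data description, and the caption count per split follows from 10 captions per image. The sketch below only restates numbers from this section; the names `SPLITS` and `caption_counts` are hypothetical.

```python
# Figures copied from this section; nothing here is read from the dataset.
SPLITS = {"train": 8855, "test": 2933}
CAPTIONS_PER_IMAGE = 10
TOTAL_IMAGES = 11788  # total stated in the data description


def caption_counts(splits, captions_per_image):
    """Return the number of captions in each split."""
    return {name: n * captions_per_image for name, n in splits.items()}


total = sum(SPLITS.values())
counts = caption_counts(SPLITS, CAPTIONS_PER_IMAGE)

# The two splits together should cover every image exactly once.
assert total == TOTAL_IMAGES
print(total, counts)
```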

Data detail：

KEY	DESCRIPTION
img	image
texts	captions of the image

Sample of source dataset：

img:
[image]
texts:

this small blue bird has a white bill and black legs.
this bird has a short white bill along with a vibrant blue belly, and fluffy blue breast.
a small sized bird that is mostly blue and has a short thick bill
small, but wide bird with a small beak and an almost non existent head, all blue body.
small chubby bird with a blue body, and bluish green wings and tail
this bird is blue with black and has a very short beak.
the small bird is blue in color with a small grey beak.
this bird is vivid blue and black in color, with a stubby multi colored beak.
a small bird that is blue, has narrow legs, a long tail, and a short beak that curves downward.
this bird has wings that are black and has a blue belly

Citation information：

@techreport{WahCUB_200_2011,
	Title = {The Caltech-UCSD Birds-200-2011 Dataset},
	Author = {Wah, C. and Branson, S. and Welinder, P. and Perona, P. and Belongie, S.},
	Year = {2011},
	Institution = {California Institute of Technology},
	Number = {CNS-TR-2011-001}
}

CelebA-HQ

Data description：

CelebA stands for CelebFaces Attributes, a celebrity face attribute dataset. It contains 202,599 face images of 10,177 celebrity identities. CelebA is openly provided by the Chinese University of Hong Kong and is widely used for training face-related computer vision models. CelebA-HQ is a high-quality subset of 30,000 CelebA images, which matches the split sizes below.

Dataset structure：

Amount of source data：

The dataset is split into train (24,183), validation (2,993), and test (2,824) sets. Each image has 10 captions.

Data detail：

KEY	DESCRIPTION
img	image
texts	captions of the image

Sample of source dataset：

img:
[image]
texts:

The person has pale skin, wavy hair, black hair, pointy nose, high cheekbones, big lips, and arched eyebrows and is wearing heavy makeup.
This attractive person has wavy hair, and big nose.
This person has black hair, wavy hair, arched eyebrows, pointy nose, pale skin, big nose, and big lips. She is attractive, and young and is wearing heavy makeup, and lipstick.
The woman wears lipstick. She has big nose, high cheekbones, arched eyebrows, wavy hair, and big lips. She is smiling, and young.
She wears earrings. She has big nose, and pointy nose. She is smiling.
This attractive person has pale skin.
She is wearing lipstick, and earrings. She is attractive, and smiling and has arched eyebrows, wavy hair, big lips, high cheekbones, big nose, and black hair.
This smiling, and young woman has pointy nose.
This woman is attractive and has wavy hair, high cheekbones, black hair, arched eyebrows, big nose, big lips, and pointy nose.
This person has black hair, high cheekbones, wavy hair, big lips, pointy nose, and pale skin and is wearing heavy makeup.

Citation information：

@inproceedings{liu2015faceattributes,
 title = {Deep Learning Face Attributes in the Wild},
 author = {Liu, Ziwei and Luo, Ping and Wang, Xiaogang and Tang, Xiaoou},
 booktitle = {Proceedings of International Conference on Computer Vision (ICCV)},
 year = {2015} 
}

@inproceedings{karras2017progressive,
  title={Progressive growing of gans for improved quality, stability, and variation},
  author={Karras, Tero and Aila, Timo and Laine, Samuli and Lehtinen, Jaakko},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2018}
}

Licensing information：

The use of this software is RESTRICTED to non-commercial research and educational purposes.

Oxford-102 Flower

Data description：

Oxford-102 Flower is a flower dataset released in 2008 by the University of Oxford's Department of Engineering Science. The flowers were chosen from species commonly found in the United Kingdom and span 102 categories.

Dataset structure：

Amount of source data：

The dataset is split into train (6,149) and test (2,040) sets. Each image has 10 captions.

Data detail：

KEY	DESCRIPTION
img	image
texts	captions of the image

Sample of source dataset：

img:

texts:

the petals of the flower are pink in color and have a yellow center.
this flower is pink and white in color, with petals that are multi colored.
the geographical shapes of the bright purple petals set off the orange stamen and filament and the cross shaped stigma is beautiful.
the purple petals have shades of white with white anther and filament
this flower has large pink petals and a white stigma in the center
this flower has petals that are pink and has a yellow stamen
a flower with short and wide petals that is light purple.
this flower has small pink petals with a yellow center.
this flower has large rounded pink petals with curved edges and purple veins.
this flower has purple petals as well as a white stamen.

Citation information：

@inproceedings{nilsback2008automated,
  title={Automated flower classification over a large number of classes},
  author={Nilsback, Maria-Elena and Zisserman, Andrew},
  booktitle={2008 Sixth Indian conference on computer vision, graphics \& image processing},
  pages={722--729},
  year={2008},
  organization={IEEE}
}

MSR-VTT

Data description：

MSR-VTT stands for Microsoft Research Video to Text. It is a large-scale dataset of videos with corresponding text annotations. It consists of 10,000 video clips from 20 categories, and each clip is annotated with 20 English sentences.

Dataset structure：

Amount of source data：

The dataset is split into train (6,513), validation (497), and test (2,990) sets. Each video has 20 captions.

Data detail：

KEY	DESCRIPTION
vid	video
texts	captions of the video

Sample of source dataset：

vid:
[video]
texts:

a baker is demonstrating a cooking technique
a female giving a baking demonstration in her kitchen
a girl explaining to prepare a dish
a lady with a scarf is cooking with dough
a person is preparing some food
a person making pastries
a woman is making a pastry
a woman is rolling doe
a woman is rolling dough around a stick
a woman is rolling dough
a woman is rolling dough
a woman is wrapping dough around some food item
a woman rolling up pastry while giving instructions
a woman rolls dough
a woman showing an easy way to make crescent rolls
how to prepare food rolls
the pastry should have five creases
a person is preparing some food
a woman is rolling dough around a stick
a woman rolls dough
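
The 20 captions in the sample above include exact repeats (for example, "a woman is rolling dough" appears twice). When distinct reference captions are wanted, the repeats can be dropped while keeping the original order. The sketch below is one way to do that on a subset of the sample. Whether to deduplicate at all is an assumption here, not something this document prescribes, and `unique_captions` is a hypothetical helper name.

```python
from typing import Iterable, List


def unique_captions(captions: Iterable[str]) -> List[str]:
    """Drop exact duplicate captions, keeping first-seen order."""
    # dict preserves insertion order, so fromkeys() acts as an ordered set.
    return list(dict.fromkeys(captions))


# A subset of the sample captions shown above, including repeats.
captions = [
    "a woman is rolling dough around a stick",
    "a woman is rolling dough",
    "a woman is rolling dough",
    "a woman rolls dough",
    "a person is preparing some food",
    "a woman is rolling dough around a stick",
    "a woman rolls dough",
]

deduped = unique_captions(captions)
print(len(captions), len(deduped))
```

Only exact string matches are merged, so near-duplicates such as "a woman is rolling doe" would be kept as separate captions.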

Citation information：

@inproceedings{xu2016msr-vtt,
author = {Xu, Jun and Mei, Tao and Yao, Ting and Rui, Yong},
title = {MSR-VTT: A Large Video Description Dataset for Bridging Video and Language},
year = {2016},
month = {June},
booktitle = {IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)}
}

UCF-101

Data description：

UCF101 is a video dataset with 101 action categories collected from YouTube by the University of Central Florida, containing a total of 13,320 videos.

Dataset structure：

Amount of source data：

The dataset is split into train (9,537) and test (3,783) sets.

Data detail：

KEY	DESCRIPTION
vid	video
label	the label of the video

Sample of source dataset：

vid:
[video]

label:
Playing Basketball

Citation information：

@article{soomro2012ucf101,
  title={UCF101: A dataset of 101 human actions classes from videos in the wild},
  author={Soomro, Khurram and Zamir, Amir Roshan and Shah, Mubarak},
  journal={arXiv preprint arXiv:1212.0402},
  year={2012}
}

mg18

Data description：

mg18 is a dataset for evaluating the quality of multilingual image generation. It contains 7,000 high-quality image-text pairs in 18 languages and was constructed by expanding the XM-3600 dataset and combining it with high-quality images from the WIT dataset. It is used to evaluate a model's ability to generate general-domain images.

Dataset structure：

The prompts selected from this dataset are in Chinese and English, with 2,500 prompts in each language.

Citation information：

@misc{ye2023altdiffusion,
      title={AltDiffusion: A Multilingual Text-to-Image Diffusion Model}, 
      author={Fulong Ye and Guang Liu and Xinya Wu and Ledell Wu},
      year={2023},
      eprint={2308.09991},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Licensing information:

Apache-2.0

Image-gen-v1.0

Data description：

Evaluation metrics: manual subjective evaluation of three aspects: image-text consistency, image quality, and security.
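
This document does not specify how the manual ratings are collected or combined. As a minimal sketch, assume each annotator gives one numeric score per aspect and that the scores are averaged per aspect. The 1-to-5 scale, the dictionary keys, and the function `aggregate` are all hypothetical.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical short keys for the three aspects named above.
ASPECTS = ("consistency", "quality", "security")


def aggregate(ratings: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each aspect's score over all annotator ratings."""
    return {aspect: mean(r[aspect] for r in ratings) for aspect in ASPECTS}


# Made-up ratings from two annotators on an assumed 1-to-5 scale.
ratings = [
    {"consistency": 4, "quality": 5, "security": 5},
    {"consistency": 3, "quality": 4, "security": 5},
]

summary = aggregate(ratings)
print(summary)
```

Reporting the three aspects separately, rather than folding them into one number, keeps a weakness in one aspect from being hidden by strength in another.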

Dataset structure：

This text-to-image generation evaluation dataset, newly developed by BAAI, has a total of 414 prompts, mainly in Chinese and English. The prompts are designed to cover many kinds of entities (tasks, animals and plants, landscapes, weather, etc.), attributes (colors, moods, vibe, etc.), and styles (realism, animation, photography, etc.), as well as content that requires reasoning and complex text comprehension, so that models can be evaluated along many different dimensions.

RelScene

Data description：

RelScene is a collection of 3D scenes paired with textual descriptions. It annotates the spatial relations between objects and provides both template-based and free-form natural language descriptions.

Dataset structure：

Amount of source data：

The dataset is split into train (4,854) and test (900) sets.

Data detail：

KEY	DESCRIPTION
img	image
prompt	the caption of the image

Sample of source dataset：

img:
[image]

prompt:
In this 3D scene, there is a Pendant Lamp positioned directly above a Dining Table. To the left of the Dining Table, there is a Drawer Chest / Corner cabinet that is aligned with the table. Another Dining Table is located to the left of the first one. The first Dining Table is positioned to the back right of the Drawer Chest / Corner cabinet, also aligned with it. Additionally, the first Dining Table is directly below the Pendant Lamp. Lastly, there is a second Pendant Lamp to the left of the first one.
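
To illustrate what a template-style description built from spatial relations might look like, the sketch below stores each relation as a (subject, relation, object) triple and renders the triples with a fixed sentence pattern. This is not RelScene's actual annotation format or template. The `Relation` type, the `describe` function, and the sentence pattern are assumptions made for illustration, using object names from the sample prompt above.

```python
from typing import List, NamedTuple


class Relation(NamedTuple):
    """A hypothetical spatial-relation triple."""
    subject: str
    relation: str
    obj: str


def describe(relations: List[Relation]) -> str:
    """Render relation triples with a fixed sentence template."""
    sentences = [
        f"There is a {r.subject} {r.relation} a {r.obj}." for r in relations
    ]
    return "In this 3D scene: " + " ".join(sentences)


relations = [
    Relation("Pendant Lamp", "directly above", "Dining Table"),
    Relation("Drawer Chest / Corner cabinet", "to the left of", "Dining Table"),
]

text = describe(relations)
print(text)
```

A free-form description, like the sample prompt above, expresses the same relations in varied natural phrasing rather than one repeated pattern.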

Citation information：

@inproceedings{ye2024relscene,
  title={RelScene: A Benchmark and baseline for Spatial Relations in text-driven 3D Scene Generation},
  author={Ye, Zhaoda and Zheng, Xinhan and Liu, Yang and Peng, Yuxin},
  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
  pages={10563--10571},
  year={2024}
}
