Evaluation Dataset
MS-COCO
Data description:
MS-COCO (Microsoft Common Objects in Context) is a dataset released in 2014 with funding from Microsoft. It covers 1.5 million object instances across 80 object categories and 91 stuff categories, and is used for object detection, segmentation, text generation, image captioning, and related tasks.
Dataset structure:
Amount of source data:
The dataset is split into train (118287), validation (5000), and test (40670) sets. Each image has 5 captions.
Data detail:
| Key | Description |
|---|---|
| img | image |
| texts | captions of the image |
Sample of source dataset:
img:
texts:
- A red hair woman holding an open box of pizza.
- A young woman holding a pizza in a box.
- a woman is holding a box of pizza.
- A woman is posing with an open pizza box.
- A woman holds an open box of pizza.
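A record with the `img` and `texts` keys can be modelled as a small typed structure. The sketch below is illustrative only: `CaptionedImage` and `validate_record` are hypothetical names, and storing `img` as a file path is an assumption, since the source does not say how images are represented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CaptionedImage:
    """One record: an image reference plus its captions (hypothetical structure)."""
    img: str          # assumed to be a file path; the storage format is not specified
    texts: List[str]  # captions of the image


def validate_record(record: CaptionedImage, expected_captions: int = 5) -> bool:
    """Return True if the record has the expected number of non-empty captions."""
    return (
        len(record.texts) == expected_captions
        and all(caption.strip() for caption in record.texts)
    )


sample = CaptionedImage(
    img="example.jpg",  # placeholder path, not a real file in the dataset
    texts=[
        "A red hair woman holding an open box of pizza.",
        "A young woman holding a pizza in a box.",
        "a woman is holding a box of pizza.",
        "A woman is posing with an open pizza box.",
        "A woman holds an open box of pizza.",
    ],
)
print(validate_record(sample))  # True
```

The same check could be reused for the other caption datasets by passing a different `expected_captions` value.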
Licensing information:
Creative Commons Attribution 4.0 License
Citation information:
@misc{MS-COCO,
title={Microsoft COCO: Common Objects in Context},
author={Lin, Tsung-Yi and others},
year={2014},
howpublished={ECCV 2014},
}
CUB
Data description:
CUB-200 refers to the Caltech-UCSD Birds-200-2011 dataset, a bird image dataset provided by the California Institute of Technology. It contains 11,788 images of 200 bird species. It is usually divided by class into a training set (100 classes), a validation set (50 classes), and a test set (50 classes).
Dataset structure:
Amount of source data:
The dataset is split into train (8855) and test (2933) sets. Each image has 10 captions.
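As a quick sanity check, the split sizes can be summed and compared with the 11,788 images mentioned in the data description. The snippet below is only an arithmetic sketch; the variable names are illustrative.

```python
# Split sizes as listed above; the stated total comes from the data description.
cub_splits = {"train": 8855, "test": 2933}
captions_per_image = 10
stated_total = 11788

total_images = sum(cub_splits.values())
total_captions = total_images * captions_per_image

print(total_images == stated_total)  # True
print(total_captions)                # 117880
```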
Data detail:
| Key | Description |
|---|---|
| img | image |
| texts | captions of the image |
Sample of source dataset:
img:
texts:
- this small blue bird has a white bill and black legs.
- this bird has a short white bill along with a vibrant blue belly, and fluffy blue breast.
- a small sized bird that is mostly blue and has a short thick bill
- small, but wide bird with a small beak and an almost non existent head, all blue body.
- small chubby bird with a blue body, and bluish green wings and tail
- this bird is blue with black and has a very short beak.
- the small bird is blue in color with a small grey beak.
- this bird is vivid blue and black in color, with a stubby multi colored beak.
- a small bird that is blue, has narrow legs, a long tail, and a short beak that curves downward.
- this bird has wings that are black and has a blue belly
Citation information:
@techreport{WahCUB_200_2011,
Title = {The Caltech-UCSD Birds-200-2011 Dataset},
Author = {Wah, C. and Branson, S. and Welinder, P. and Perona, P. and Belongie, S.},
Year = {2011},
Institution = {California Institute of Technology},
Number = {CNS-TR-2011-001}
}
CelebA-HQ
Data description:
CelebA stands for CelebFaces Attributes, a celebrity face attribute dataset. It contains 202,599 face images of 10,177 celebrity identities. CelebA is openly provided by the Chinese University of Hong Kong and is widely used for training face-related computer vision models. CelebA-HQ is a high-quality subset of 30,000 images derived from CelebA.
Dataset structure:
Amount of source data:
The dataset is split into train (24183), validation (2993), and test (2824) sets. Each image has 10 captions.
Data detail:
| Key | Description |
|---|---|
| img | image |
| texts | captions of the image |
Sample of source dataset:
img:
texts:
- The person has pale skin, wavy hair, black hair, pointy nose, high cheekbones, big lips, and arched eyebrows and is wearing heavy makeup.
- This attractive person has wavy hair, and big nose.
- This person has black hair, wavy hair, arched eyebrows, pointy nose, pale skin, big nose, and big lips. She is attractive, and young and is wearing heavy makeup, and lipstick.
- The woman wears lipstick. She has big nose, high cheekbones, arched eyebrows, wavy hair, and big lips. She is smiling, and young.
- She wears earrings. She has big nose, and pointy nose. She is smiling.
- This attractive person has pale skin.
- She is wearing lipstick, and earrings. She is attractive, and smiling and has arched eyebrows, wavy hair, big lips, high cheekbones, big nose, and black hair.
- This smiling, and young woman has pointy nose.
- This woman is attractive and has wavy hair, high cheekbones, black hair, arched eyebrows, big nose, big lips, and pointy nose.
- This person has black hair, high cheekbones, wavy hair, big lips, pointy nose, and pale skin and is wearing heavy makeup.
Citation information:
@inproceedings{liu2015faceattributes,
title = {Deep Learning Face Attributes in the Wild},
author = {Liu, Ziwei and Luo, Ping and Wang, Xiaogang and Tang, Xiaoou},
booktitle = {Proceedings of International Conference on Computer Vision (ICCV)},
year = {2015}
}
@inproceedings{karras2017progressive,
title={Progressive growing of gans for improved quality, stability, and variation},
author={Karras, Tero and Aila, Timo and Laine, Samuli and Lehtinen, Jaakko},
booktitle={International Conference on Learning Representations (ICLR)},
year={2018}
}
Licensing information:
The use of this software is RESTRICTED to non-commercial research and educational purposes.
Oxford-102 Flower
Data description:
Oxford-102 Flower is a flower dataset released in 2008 by the Department of Engineering Science at the University of Oxford. It covers 102 flower categories, chosen from flowers commonly found in the UK.
Dataset structure:
Amount of source data:
The dataset is split into train (6149) and test (2040) sets. Each image has 10 captions.
Data detail:
| Key | Description |
|---|---|
| img | image |
| texts | captions of the image |
Sample of source dataset:
img:
texts:
- the petals of the flower are pink in color and have a yellow center.
- this flower is pink and white in color, with petals that are multi colored.
- the geographical shapes of the bright purple petals set off the orange stamen and filament and the cross shaped stigma is beautiful.
- the purple petals have shades of white with white anther and filament
- this flower has large pink petals and a white stigma in the center
- this flower has petals that are pink and has a yellow stamen
- a flower with short and wide petals that is light purple.
- this flower has small pink petals with a yellow center.
- this flower has large rounded pink petals with curved edges and purple veins.
- this flower has purple petals as well as a white stamen.
Citation information:
@inproceedings{nilsback2008automated,
title={Automated flower classification over a large number of classes},
author={Nilsback, Maria-Elena and Zisserman, Andrew},
booktitle={2008 Sixth Indian conference on computer vision, graphics \& image processing},
pages={722--729},
year={2008},
organization={IEEE}
}
MSR-VTT
Data description:
MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset of videos with corresponding text annotations. It consists of 10,000 video clips from 20 categories, and each clip is annotated with 20 English sentences.
Dataset structure:
Amount of source data:
The dataset is split into train (6513), validation (497), and test (2990) sets. Each video has 20 captions.
Data detail:
| Key | Description |
|---|---|
| vid | video |
| texts | captions of the video |
Sample of source dataset:
vid:
texts:
- a baker is demonstrating a cooking technique
- a female giving a baking demonstration in her kitchen
- a girl explaining to prepare a dish
- a lady with a scarf is cooking with dough
- a person is preparing some food
- a person making pastries
- a woman is making a pastry
- a woman is rolling doe
- a woman is rolling dough around a stick
- a woman is rolling dough
- a woman is rolling dough
- a woman is wrapping dough around some food item
- a woman rolling up pastry while giving instructions
- a woman rolls dough
- a woman showing an easy way to make crescent rolls
- how to prepare food rolls
- the pastry should have five creases
- a person is preparing some food
- a woman is rolling dough around a stick
- a woman rolls dough
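The sample above contains exact duplicates (for example, "a woman is rolling dough" appears twice). The source does not say whether duplicates are kept during evaluation. If they need to be removed, a minimal sketch could look like the following, where `unique_captions` is a hypothetical helper rather than part of any dataset tooling.

```python
from typing import List


def unique_captions(captions: List[str]) -> List[str]:
    """Drop exact duplicate captions while keeping first-seen order."""
    # dict preserves insertion order, so the first occurrence of each caption wins
    return list(dict.fromkeys(captions))


texts = [
    "a baker is demonstrating a cooking technique",
    "a female giving a baking demonstration in her kitchen",
    "a girl explaining to prepare a dish",
    "a lady with a scarf is cooking with dough",
    "a person is preparing some food",
    "a person making pastries",
    "a woman is making a pastry",
    "a woman is rolling doe",
    "a woman is rolling dough around a stick",
    "a woman is rolling dough",
    "a woman is rolling dough",
    "a woman is wrapping dough around some food item",
    "a woman rolling up pastry while giving instructions",
    "a woman rolls dough",
    "a woman showing an easy way to make crescent rolls",
    "how to prepare food rolls",
    "the pastry should have five creases",
    "a person is preparing some food",
    "a woman is rolling dough around a stick",
    "a woman rolls dough",
]

deduped = unique_captions(texts)
print(len(texts), len(deduped))  # 20 16
```

Only exact string matches are merged here, so near-duplicates such as "a woman is rolling doe" are left untouched.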
Citation information:
@inproceedings{xu2016msr-vtt,
author = {Xu, Jun and Mei, Tao and Yao, Ting and Rui, Yong},
title = {MSR-VTT: A Large Video Description Dataset for Bridging Video and Language},
year = {2016},
month = {June},
booktitle = {IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)},
}
UCF-101
Data description:
UCF101 is a video dataset with 101 action categories collected from YouTube by the University of Central Florida, containing a total of 13,320 videos.
Dataset structure:
Amount of source data:
The dataset is split into train(9537) and test(3783).
Data detail:
| Key | Description |
|---|---|
| vid | video |
| label | the label of the video |
Sample of source dataset:
vid:
label:
Playing Basketball
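Because each record carries a text `label` rather than captions, a common preprocessing step is to map labels to integer ids. The sketch below assumes records are `(vid, label)` pairs. The file names and the second label are placeholders for illustration, and `build_label_index` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def build_label_index(labels: List[str]) -> Dict[str, int]:
    """Assign a stable integer id to each distinct label (alphabetical order)."""
    return {name: idx for idx, name in enumerate(sorted(set(labels)))}


# Illustrative records only: file names and "Another Action" are placeholders.
records: List[Tuple[str, str]] = [
    ("clip_a.avi", "Playing Basketball"),
    ("clip_b.avi", "Another Action"),
    ("clip_c.avi", "Playing Basketball"),
]

label_to_id = build_label_index([label for _, label in records])
encoded = [(vid, label_to_id[label]) for vid, label in records]

print(label_to_id)  # {'Another Action': 0, 'Playing Basketball': 1}
print(encoded)      # [('clip_a.avi', 1), ('clip_b.avi', 0), ('clip_c.avi', 1)]
```

Sorting the labels before numbering keeps the ids reproducible across runs, regardless of the order in which records are read.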
Citation information:
@article{soomro2012ucf101,
title={UCF101: A dataset of 101 human actions classes from videos in the wild},
author={Soomro, Khurram and Zamir, Amir Roshan and Shah, Mubarak},
journal={arXiv preprint arXiv:1212.0402},
year={2012}
}
mg18
Data description:
This is a dataset for evaluating the quality of multilingual image generation, containing 7,000 high-quality image-text pairs in 18 languages. This dataset is constructed by expanding the XM-3600 dataset and combining high-quality images from the WIT dataset. It is used to evaluate the model's ability to generate generic images.
Dataset structure:
Prompts were selected in both Chinese and English, with 2500 prompts per language.
Citation information:
@misc{ye2023altdiffusion,
title={AltDiffusion: A Multilingual Text-to-Image Diffusion Model},
author={Fulong Ye and Guang Liu and Xinya Wu and Ledell Wu},
year={2023},
eprint={2308.09991},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Licensing information:
apache-2.0
Image-gen-v1.0
Data description:
Evaluation metrics: manual subjective evaluation on 3 aspects: image-text consistency, image quality, and security.
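The source does not describe how the manual ratings are recorded or combined. As one possible illustration, the sketch below averages per-aspect scores across annotators. The 1 to 5 scale, the key names, and the `aggregate_ratings` helper are all assumptions made for this example.

```python
from statistics import mean
from typing import Dict, List

# Aspect names mirror the three aspects listed above; the keys are hypothetical.
ASPECTS = ("consistency", "quality", "security")


def aggregate_ratings(ratings: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each aspect's score over all annotator ratings for one image."""
    return {aspect: mean([r[aspect] for r in ratings]) for aspect in ASPECTS}


# Made-up scores on an assumed 1-5 scale, for illustration only.
ratings = [
    {"consistency": 4.0, "quality": 5.0, "security": 5.0},
    {"consistency": 3.0, "quality": 4.0, "security": 5.0},
]

print(aggregate_ratings(ratings))
# {'consistency': 3.5, 'quality': 4.5, 'security': 5.0}
```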
Dataset structure:
Image-gen-v1.0 is a text-to-image generation evaluation dataset newly developed by BAAI, with 414 prompts in total, mainly in Chinese and English. The prompts are designed to cover a range of entities (tasks, animals and plants, landscapes, weather, etc.), attributes (colors, moods, vibe, etc.), and styles (realism, animation, photography, etc.), as well as content that requires reasoning and complex text comprehension, so that models can be evaluated comprehensively along different dimensions.