Evaluation Dataset

MS-COCO

Data description:

MS-COCO stands for Microsoft Common Objects in Context, a dataset funded by Microsoft and first released in 2014. It covers 1.5 million object instances, 80 object categories, and 91 stuff categories, and is used for object detection, segmentation, image captioning, and related tasks.

Dataset structure:

Amount of source data:

The dataset is split into train (118287), validation (5000), and test (40670) sets; each image has 5 captions.

Data detail:

KEYS     EXPLAIN
img      image
texts    captions of the image
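
As a rough illustration of the record layout in the table above, one sample could be modelled in Python as follows. This is only a sketch: the `CaptionSample` class, the `validate` helper, and the file name `example.jpg` are hypothetical, and the source does not say how records are actually stored.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CaptionSample:
    """One record with the two keys listed above (hypothetical container)."""
    img: str          # assumed here to be a path or identifier for the image
    texts: List[str]  # captions of the image


def validate(sample: CaptionSample, expected_captions: int = 5) -> bool:
    """Return True if the sample has the expected number of non-empty captions."""
    return (
        len(sample.texts) == expected_captions
        and all(caption.strip() for caption in sample.texts)
    )


sample = CaptionSample(
    img="example.jpg",  # placeholder name, not a real file from the dataset
    texts=[
        "A red hair woman holding an open box of pizza.",
        "A young woman holding a pizza in a box.",
        "a woman is holding a box of pizza.",
        "A woman is posing with an open pizza box.",
        "A woman holds an open box of pizza.",
    ],
)
print(validate(sample))
```

The default of 5 captions matches the MS-COCO figure quoted above; other datasets in this document would pass a different `expected_captions` value.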

Sample of source dataset:

img:
[image]
texts:

  1. A red hair woman holding an open box of pizza.
  2. A young woman holding a pizza in a box.
  3. a woman is holding a box of pizza.
  4. A woman is posing with an open pizza box.
  5. A woman holds an open box of pizza.

Licensing information:

Creative Commons Attribution 4.0 License

Citation information:

@misc{MS-COCO,
  title={Microsoft COCO: Common Objects in Context},
  author={Lin, Tsung-Yi and others},
  year={2014},
  howpublished={ECCV 2014},
}

CUB

Data description:

The full name of the CUB-200 dataset is the Caltech-UCSD Birds-200-2011 dataset. It is a bird image dataset provided by the California Institute of Technology and contains a total of 11,788 images of 200 bird species. It is usually divided by species into a training set (100 species), a validation set (50 species), and a test set (50 species).

Dataset structure:

Amount of source data:

The dataset is split into train (8855) and test (2933) sets; each image has 10 captions.

Data detail:

KEYS     EXPLAIN
img      image
texts    captions of the image

Sample of source dataset:

img:
[image]
texts:

  1. this small blue bird has a white bill and black legs.
  2. this bird has a short white bill along with a vibrant blue belly, and fluffy blue breast.
  3. a small sized bird that is mostly blue and has a short thick bill
  4. small, but wide bird with a small beak and an almost non existent head, all blue body.
  5. small chubby bird with a blue body, and bluish green wings and tail
  6. this bird is blue with black and has a very short beak.
  7. the small bird is blue in color with a small grey beak.
  8. this bird is vivid blue and black in color, with a stubby multi colored beak.
  9. a small bird that is blue, has narrow legs, a long tail, and a short beak that curves downward.
  10. this bird has wings that are black and has a blue belly

Citation information:

@techreport{WahCUB_200_2011,
	Title = {The Caltech-UCSD Birds-200-2011 Dataset},
	Author = {Wah, C. and Branson, S. and Welinder, P. and Perona, P. and Belongie, S.},
	Year = {2011},
	Institution = {California Institute of Technology},
	Number = {CNS-TR-2011-001}
}

CelebA-HQ

Data description:

CelebA stands for CelebFaces Attributes, a celebrity face attribute dataset. It contains 202,599 face images of 10,177 celebrity identities. CelebA is openly provided by the Chinese University of Hong Kong and is widely used for training face-related computer vision models.

Dataset structure:

Amount of source data:

The dataset is split into train (24183), validation (2993), and test (2824) sets; each image has 10 captions.

Data detail:

KEYS     EXPLAIN
img      image
texts    captions of the image

Sample of source dataset:

img:
[image]
texts:

  1. The person has pale skin, wavy hair, black hair, pointy nose, high cheekbones, big lips, and arched eyebrows and is wearing heavy makeup.
  2. This attractive person has wavy hair, and big nose.
  3. This person has black hair, wavy hair, arched eyebrows, pointy nose, pale skin, big nose, and big lips. She is attractive, and young and is wearing heavy makeup, and lipstick.
  4. The woman wears lipstick. She has big nose, high cheekbones, arched eyebrows, wavy hair, and big lips. She is smiling, and young.
  5. She wears earrings. She has big nose, and pointy nose. She is smiling.
  6. This attractive person has pale skin.
  7. She is wearing lipstick, and earrings. She is attractive, and smiling and has arched eyebrows, wavy hair, big lips, high cheekbones, big nose, and black hair.
  8. This smiling, and young woman has pointy nose.
  9. This woman is attractive and has wavy hair, high cheekbones, black hair, arched eyebrows, big nose, big lips, and pointy nose.
  10. This person has black hair, high cheekbones, wavy hair, big lips, pointy nose, and pale skin and is wearing heavy makeup.

Citation information:

@inproceedings{liu2015faceattributes,
 title = {Deep Learning Face Attributes in the Wild},
 author = {Liu, Ziwei and Luo, Ping and Wang, Xiaogang and Tang, Xiaoou},
 booktitle = {Proceedings of International Conference on Computer Vision (ICCV)},
 year = {2015} 
}

@inproceedings{karras2017progressive,
  title={Progressive growing of gans for improved quality, stability, and variation},
  author={Karras, Tero and Aila, Timo and Laine, Samuli and Lehtinen, Jaakko},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2018}
}

Licensing information:

The use of this software is RESTRICTED to non-commercial research and educational purposes.

Oxford-102 Flower

Data description:

Oxford-102 Flower is a flower dataset released in 2008 by the Department of Engineering Science at the University of Oxford. It contains a total of 102 flower categories, chosen from flowers commonly found in the UK.

Dataset structure:

Amount of source data:

The dataset is split into train (6149) and test (2040) sets; each image has 10 captions.

Data detail:

KEYS     EXPLAIN
img      image
texts    captions of the image

Sample of source dataset:

img:
[image]
texts:

  1. the petals of the flower are pink in color and have a yellow center.
  2. this flower is pink and white in color, with petals that are multi colored.
  3. the geographical shapes of the bright purple petals set off the orange stamen and filament and the cross shaped stigma is beautiful.
  4. the purple petals have shades of white with white anther and filament
  5. this flower has large pink petals and a white stigma in the center
  6. this flower has petals that are pink and has a yellow stamen
  7. a flower with short and wide petals that is light purple.
  8. this flower has small pink petals with a yellow center.
  9. this flower has large rounded pink petals with curved edges and purple veins.
  10. this flower has purple petals as well as a white stamen.

Citation information:

@inproceedings{nilsback2008automated,
  title={Automated flower classification over a large number of classes},
  author={Nilsback, Maria-Elena and Zisserman, Andrew},
  booktitle={2008 Sixth Indian conference on computer vision, graphics \& image processing},
  pages={722--729},
  year={2008},
  organization={IEEE}
}

MSR-VTT

Data description:

MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset of videos with corresponding text annotations. It consists of 10,000 video clips from 20 categories, and each clip is annotated with 20 English sentences.

Dataset structure:

Amount of source data:

The dataset is split into train (6513), validation (497), and test (2990) sets; each video has 20 captions.

Data detail:

KEYS     EXPLAIN
vid      video
texts    captions of the video
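
If individual (video, caption) pairs are needed, a record with the keys above can be expanded as in the following minimal sketch. It assumes `vid` holds a string identifier; `VideoSample`, `to_pairs`, and the identifier `video_0001` are hypothetical names that do not come from the dataset.

```python
from typing import List, Tuple, TypedDict


class VideoSample(TypedDict):
    """One record with the two keys listed above (hypothetical type)."""
    vid: str          # assumed to be a path or identifier for the video
    texts: List[str]  # captions of the video


def to_pairs(sample: VideoSample) -> List[Tuple[str, str]]:
    """Expand one record into a list of (vid, caption) pairs."""
    return [(sample["vid"], caption) for caption in sample["texts"]]


sample: VideoSample = {
    "vid": "video_0001",  # placeholder identifier
    "texts": [
        # Only three captions are listed here for brevity.
        "a baker is demonstrating a cooking technique",
        "a person making pastries",
        "a woman is rolling dough",
    ],
}

pairs = to_pairs(sample)
for vid, caption in pairs:
    print(vid, "|", caption)
```

A full MSR-VTT record would yield 20 pairs per video, in line with the caption count stated above.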

Sample of source dataset:

vid:
[video]
texts:

  1. a baker is demonstrating a cooking technique
  2. a female giving a baking demonstration in her kitchen
  3. a girl explaining to prepare a dish
  4. a lady with a scarf is cooking with dough
  5. a person is preparing some food
  6. a person making pastries
  7. a woman is making a pastry
  8. a woman is rolling doe
  9. a woman is rolling dough around a stick
  10. a woman is rolling dough
  11. a woman is rolling dough
  12. a woman is wrapping dough around some food item
  13. a woman rolling up pastry while giving instructions
  14. a woman rolls dough
  15. a woman showing an easy way to make crescent rolls
  16. how to prepare food rolls
  17. the pastry should have five creases
  18. a person is preparing some food
  19. a woman is rolling dough around a stick
  20. a woman rolls dough

Citation information:

@inproceedings{xu2016msr-vtt,
author = {Xu, Jun and Mei, Tao and Yao, Ting and Rui, Yong},
title = {MSR-VTT: A Large Video Description Dataset for Bridging Video and Language},
year = {2016},
month = {June},
booktitle = {IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)},
}

UCF-101

Data description:

UCF101 is a video dataset with 101 action categories collected from YouTube by the University of Central Florida, containing a total of 13,320 videos.

Dataset structure:

Amount of source data:

The dataset is split into train (9537) and test (3783) sets.
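
For the datasets in this document whose description also states an overall total (CUB, MSR-VTT, and UCF-101), the split sizes can be checked against that total with simple arithmetic. The sketch below uses only numbers quoted in this document; the names `SPLITS`, `STATED_TOTALS`, and `split_total` are illustrative.

```python
# Split sizes as quoted in the "Amount of source data" paragraphs.
SPLITS = {
    "CUB": {"train": 8855, "test": 2933},
    "MSR-VTT": {"train": 6513, "validation": 497, "test": 2990},
    "UCF-101": {"train": 9537, "test": 3783},
}

# Overall totals as quoted in the corresponding data descriptions.
STATED_TOTALS = {
    "CUB": 11788,
    "MSR-VTT": 10000,
    "UCF-101": 13320,
}


def split_total(name: str) -> int:
    """Sum the split sizes listed for one dataset."""
    return sum(SPLITS[name].values())


for name, stated in STATED_TOTALS.items():
    computed = split_total(name)
    status = "OK" if computed == stated else "MISMATCH"
    print(f"{name}: {computed} (stated {stated}) {status}")
```

All three datasets come out consistent: the split sizes sum to the totals given in their descriptions.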

Data detail:

KEYS     EXPLAIN
vid      video
label    the label of the video

Sample of source dataset:

vid:
[video]

label:
Playing Basketball

Citation information:

@article{soomro2012ucf101,
  title={UCF101: A dataset of 101 human actions classes from videos in the wild},
  author={Soomro, Khurram and Zamir, Amir Roshan and Shah, Mubarak},
  journal={arXiv preprint arXiv:1212.0402},
  year={2012}
}

mg18

Data description:

This is a dataset for evaluating the quality of multilingual image generation, containing 7,000 high-quality image-text pairs in 18 languages. It was constructed by extending the XM-3600 dataset with high-quality images from the WIT dataset, and it is used to evaluate a model's ability to generate general-purpose images.

Dataset structure:

Prompts were selected in both Chinese and English, with 2500 prompts in each language.

Citation information:

@misc{ye2023altdiffusion,
      title={AltDiffusion: A Multilingual Text-to-Image Diffusion Model}, 
      author={Fulong Ye and Guang Liu and Xinya Wu and Ledell Wu},
      year={2023},
      eprint={2308.09991},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Licensing information:

apache-2.0

Image-gen-v1.0

Data description:

Evaluation metrics: manual subjective evaluation on three aspects: image-text consistency, image quality, and safety.

Dataset structure:

The text-to-image generation evaluation dataset newly developed by BAAI contains a total of 414 prompts, mainly in Chinese and English. The prompts are designed to cover a wide range of entities (tasks, animals and plants, landscapes, weather, etc.), attributes (colors, moods, vibes, etc.), and styles (realism, animation, photography, etc.), as well as content that requires reasoning and complex text comprehension, so that models can be evaluated comprehensively across different dimensions.