Evaluation Datasets
Dataset 1: Flickr30k (F30k)
Data description:
Flickr30k (F30k) is an image-sentence paired dataset: the images come from the Flickr website, and each image is manually annotated with 5 different English sentences. Flickr30k is a small-sized vision-language multimodal training and testing benchmark that can be used to evaluate tasks such as image-text matching / cross-modal retrieval. The images mostly depict real-life scenes, and the sentences are usually direct descriptions of the image content.
Dataset structure:
Size of downloaded dataset files: 4.3 GB
Amount of source data:
The dataset is split into train (39,905), validation (10,042), and test (10,003).
Data detail:
KEY | EXPLANATION |
---|---|
sentids | list of the ids of the sentences paired with the image |
imgid | image id |
sentences | list of matched sentence records |
tokens | tokenized form of a sentence |
raw | the original sentence text |
sentid | sentence id |
split | dataset split the image belongs to |
filename | image file name |
Sample of source dataset:
This example was too long and was cropped:
{
'sentids': [125, 126, 127, 128, 129],
'imgid': 25,
'sentences': [
{
'tokens': ['the', 'man', 'with', 'pierced', 'ears', 'is', 'wearing', 'glasses', 'and', 'an', 'orange', 'hat'],
'raw': 'The man with pierced ears is wearing glasses and an orange hat.',
'imgid': 25,
'sentid': 125
},
{
'tokens': ['a', 'man', 'with', 'glasses', 'is', 'wearing', 'a', 'beer', 'can', 'crocheted', 'hat'],
'raw': 'A man with glasses is wearing a beer can crocheted hat.',
'imgid': 25,
'sentid': 126
},
{
'tokens': ['a', 'man', 'with', 'gauges', 'and', 'glasses', 'is', 'wearing', 'a', 'blitz', 'hat'],
'raw': 'A man with gauges and glasses is wearing a Blitz hat.',
'imgid': 25,
'sentid': 127
},
{
'tokens': ['a', 'man', 'in', 'an', 'orange', 'hat', 'starring', 'at', 'something'],
'raw': 'A man in an orange hat starring at something.',
'imgid': 25,
'sentid': 128
},
{
'tokens': ['a', 'man', 'wears', 'an', 'orange', 'hat', 'and', 'glasses'],
'raw': 'A man wears an orange hat and glasses.',
'imgid': 25,
'sentid': 129
}
],
'split': 'test',
'filename': '1007129816.jpg'
}
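To make the record layout concrete, here is a minimal sketch of reading records shaped like the sample above and grouping them by split. The file name 'dataset_flickr30k.json' and the top-level 'images' key follow the widely used Karpathy-split release of the annotations and are assumptions here; adjust them to match your copy of the data.

```python
import json
from collections import defaultdict

# A minimal sketch, not the official loader: the file name
# 'dataset_flickr30k.json' and the top-level 'images' key follow the
# widely used Karpathy-split release and are assumptions here.
with open("dataset_flickr30k.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Group image records (shaped like the sample above) by their split.
by_split = defaultdict(list)
for record in data["images"]:
    by_split[record["split"]].append(record)

for split, records in sorted(by_split.items()):
    n_sent = sum(len(r["sentences"]) for r in records)
    print(f"{split}: {len(records)} images, {n_sent} sentences")

# Each image carries five captions; 'raw' is the full sentence and
# 'tokens' its lower-cased tokenization.
first = by_split["test"][0]
print(first["filename"], "->", first["sentences"][0]["raw"])
```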
Citation information:
@article{young2014image,
title={From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions},
author={Young, Peter and Lai, Alice and Hodosh, Micah and Hockenmaier, Julia},
journal={Transactions of the Association for Computational Linguistics},
volume={2},
pages={67--78},
year={2014},
url={https://aclanthology.org/Q14-1006/},
}
Licensing information:
The Flickr30k & Denotation Graph data are released by the dataset authors; use of the images is subject to the Flickr Terms & Conditions of Use.
Dataset 2: Microsoft COCO (MSCOCO)
Data description:
Microsoft COCO (MSCOCO) is an image-sentence paired dataset: the images come from the Flickr website, and each image is manually annotated with 5 different English sentences. MSCOCO is a medium-sized vision-language multimodal training and testing benchmark that can be used to evaluate tasks such as image-text matching / cross-modal retrieval. The images mostly depict real-life scenes, and the sentences are usually direct descriptions of the image content.
Dataset structure:
Size of downloaded dataset files: 39 GB
Amount of source data:
Following the Karpathy split, the dataset is divided into train (images/sentences: 82,783/413,915), restval (30,504/152,520), validation/dev (5,000/25,000), and test (5,000/25,000). A sketch of rebuilding these splits is shown below.
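Many retrieval setups fold the restval portion into the training set (82,783 + 30,504 = 113,287 training images). The following is a minimal sketch of that step; the file name 'dataset_coco.json' and the 'split' field values are assumptions based on the common Karpathy-split release.

```python
import json

# A minimal sketch, assuming the Karpathy-split file 'dataset_coco.json'
# whose image records carry a 'split' field in
# {'train', 'restval', 'val', 'test'}; both names are assumptions.
with open("dataset_coco.json", "r", encoding="utf-8") as f:
    records = json.load(f)["images"]

# Many retrieval setups fold 'restval' into training:
# 82,783 + 30,504 = 113,287 training images.
train = [r for r in records if r["split"] in ("train", "restval")]
val = [r for r in records if r["split"] == "val"]
test = [r for r in records if r["split"] == "test"]
print(len(train), len(val), len(test))  # expected: 113287 5000 5000
```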
Data detail:
KEY | EXPLANATION |
---|---|
image_id | image id (in a caption record) |
id (txt) | sentence id |
caption | sentence |
license | image license id |
file_name | image file name |
coco_url | image URL (COCO) |
height | image height (pixels) |
width | image width (pixels) |
date_captured | capture date |
flickr_url | image URL (Flickr) |
id (img) | image id (in an image record) |
Sample of source dataset (a caption record followed by the matching image-metadata record):
{
'image_id': 391895,
'id': 770337,
'caption': 'A man with a red helmet on a small moped on a dirt road. '
}
[
{
'license': 3,
'file_name': 'COCO_val2014_000000391895.jpg',
'coco_url': 'http://images.cocodataset.org/val2014/COCO_val2014_000000391895.jpg',
'height': 360,
'width': 640,
'date_captured': '2013-11-14 11:18:45',
'flickr_url': 'http://farm9.staticflickr.com/8186/8119368305_4e622c8349_z.jpg',
'id': 391895
}
]
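The two record types above join on image_id. Here is a minimal sketch of attaching captions to their image metadata; it assumes the official caption file 'annotations/captions_val2014.json', whose top level holds both an 'images' list and an 'annotations' list, and the path is an assumption.

```python
import json

# A minimal sketch of joining the two record types above on 'image_id'.
# Assumes the official caption file 'annotations/captions_val2014.json',
# whose top level holds both an 'images' list (metadata) and an
# 'annotations' list (captions); the path is an assumption.
with open("annotations/captions_val2014.json", "r", encoding="utf-8") as f:
    coco = json.load(f)

# Index image metadata by image id for O(1) lookup.
images = {img["id"]: img for img in coco["images"]}

# Attach the first few captions to their images' file names.
for ann in coco["annotations"][:3]:
    meta = images[ann["image_id"]]
    print(meta["file_name"], "->", ann["caption"].strip())
```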
Citation information:
@inproceedings{lin2014microsoft,
title={Microsoft COCO: Common Objects in Context},
author={Lin, Tsung-Yi and Maire, Michael and Belongie, Serge and Hays, James and Perona, Pietro and Ramanan, Deva and Doll{\'a}r, Piotr and Zitnick, C. Lawrence},
booktitle={European Conference on Computer Vision (ECCV)},
year={2014},
url={http://link.springer.com/chapter/10.1007/978-3-319-10602-1_48},
}