Evaluation Dataset

Dataset 1: Flickr30k (F30k)

Evaluation Metric: Recall

Data description:

Flickr30k (F30k) is an image-sentence paired dataset. The images come from the Flickr website, and the sentences are manually annotated, with 5 different English sentences per image. Flickr30k is a small-sized vision-language multimodal training and testing benchmark that can be used to evaluate tasks such as image-text matching / cross-modal retrieval. The image content mostly comes from real-life scenes, and the sentence descriptions are usually intuitive descriptions of the image content.
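Recall@K, the evaluation metric for this benchmark, is the fraction of queries whose ground-truth match appears among the top-K retrieved items. The following is a minimal sketch, not the benchmark's official evaluation code; the similarity matrix and function name are illustrative assumptions:

```python
import numpy as np

def recall_at_k(sim, gt_index, k):
    """Fraction of queries whose ground-truth item ranks in the top k.

    sim: (num_queries, num_items) similarity matrix (higher = more similar).
    gt_index: gt_index[i] is the index of the correct item for query i.
    """
    # Rank candidate items for each query by descending similarity.
    ranking = np.argsort(-sim, axis=1)
    topk = ranking[:, :k]
    # A query is a "hit" if its ground-truth index appears in its top-k list.
    hits = (topk == np.asarray(gt_index)[:, None]).any(axis=1)
    return hits.mean()

# Toy example: 3 text queries retrieving from 4 images.
sim = np.array([
    [0.9, 0.1, 0.2, 0.0],   # query 0: correct image 0 ranked first
    [0.2, 0.3, 0.8, 0.1],   # query 1: correct image 2 ranked first
    [0.5, 0.6, 0.1, 0.4],   # query 2: correct image 0 ranked second
])
gt = [0, 2, 0]
print(recall_at_k(sim, gt, 1))  # 2 of 3 queries hit at K=1
print(recall_at_k(sim, gt, 2))  # all 3 queries hit at K=2
```

In image-text matching papers, Recall@1/5/10 is typically reported in both directions (image-to-text and text-to-image) by transposing the similarity matrix.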

Dataset structure:

Size of downloaded dataset files: 4.3 GB

Amount of source data:

The dataset is split into train (39905), validation (10042), and test (10003).

Data detail:

KEYS        EXPLAIN
sentids     sentence id list
imgid       image id
sentences   matched sentences
tokens      tokens in a sentence
raw         sentence
sentid      sentence id
split       dataset split
filename    image file name

Sample of source dataset:

This example was too long and was cropped:

{
'sentids': [125, 126, 127, 128, 129], 
'imgid': 25, 
'sentences': [
                {
                'tokens': ['the', 'man', 'with', 'pierced', 'ears', 'is', 'wearing', 'glasses', 'and', 'an', 'orange', 'hat'], 
                'raw': 'The man with pierced ears is wearing glasses and an orange hat.', 
                'imgid': 25, 
                'sentid': 125
                }, 
                {
                'tokens': ['a', 'man', 'with', 'glasses', 'is', 'wearing', 'a', 'beer', 'can', 'crocheted', 'hat'], 
                'raw': 'A man with glasses is wearing a beer can crocheted hat.', 
                'imgid': 25, 
                'sentid': 126
                }, 
                {
                'tokens': ['a', 'man', 'with', 'gauges', 'and', 'glasses', 'is', 'wearing', 'a', 'blitz', 'hat'], 
                'raw': 'A man with gauges and glasses is wearing a Blitz hat.', 
                'imgid': 25, 
                'sentid': 127
                }, 
                {
                'tokens': ['a', 'man', 'in', 'an', 'orange', 'hat', 'starring', 'at', 'something'], 
                'raw': 'A man in an orange hat starring at something.', 
                'imgid': 25, 
                'sentid': 128
                }, 
                {
                'tokens': ['a', 'man', 'wears', 'an', 'orange', 'hat', 'and', 'glasses'], 
                'raw': 'A man wears an orange hat and glasses.', 
                'imgid': 25, 
                'sentid': 129
                }
             ], 
'split': 'test', 
'filename': '1007129816.jpg'
}
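A record like the one above can be consumed directly in Python. This sketch uses the field names from the sample; the single-caption record below is an abbreviated stand-in for the full five-caption entry:

```python
import json

# Abbreviated version of the sample record shown above.
record_json = '''
{"sentids": [125, 126, 127, 128, 129],
 "imgid": 25,
 "sentences": [
   {"tokens": ["a", "man", "wears", "an", "orange", "hat", "and", "glasses"],
    "raw": "A man wears an orange hat and glasses.",
    "imgid": 25, "sentid": 129}],
 "split": "test",
 "filename": "1007129816.jpg"}
'''

record = json.loads(record_json)
# Each image carries a list of caption dicts; "raw" holds the full sentence.
captions = [s["raw"] for s in record["sentences"]]
print(record["filename"], record["split"])
print(captions)
```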

Citation information:

@misc{Flickr30k,
  title={From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions},
  author={Peter Young and Alice Lai and Micah Hodosh and Julia Hockenmaier},
  year={2014},
  howpublished={https://aclanthology.org/Q14-1006/},
}

Licensing information:

Flickr30k & Denotation Graph data
Flickr Terms & Conditions of Use

Dataset 2: Microsoft COCO (MSCOCO)

Evaluation Metric: Recall

Data description:

Microsoft COCO (MSCOCO) is an image-sentence paired dataset. The images come from the Flickr website, and the sentences are manually annotated, with 5 different English sentences per image. MSCOCO is a medium-sized vision-language multimodal training and testing benchmark that can be used to evaluate tasks such as image-text matching / cross-modal retrieval. The image content mostly comes from real-life scenes, and the sentence descriptions are usually intuitive descriptions of the image content.

Dataset structure:

Size of downloaded dataset files: 39 GB

Amount of source data:

According to the Karpathy split, the dataset is divided into train (images/sentences: 82783/413915), restval (30504/152520), val/dev (5000/25000), and test (5000/25000).

Data detail:

KEYS           EXPLAIN
image_id       image id
id (txt)       sentence id
caption        sentence
license        license
file_name      file name
coco_url       image url (COCO)
height         image height
width          image width
date_captured  date captured
flickr_url     image url (Flickr)
id (img)       image id

Sample of source dataset:

{
'image_id': 391895, 
'id': 770337, 
'caption': 'A man with a red helmet on a small moped on a dirt road. '
}
[
    {
    'license': 3, 
    'file_name': 'COCO_val2014_000000391895.jpg', 
    'coco_url': 'http://images.cocodataset.org/val2014/COCO_val2014_000000391895.jpg', 
    'height': 360, 
    'width': 640, 
    'date_captured': '2013-11-14 11:18:45', 
    'flickr_url': 'http://farm9.staticflickr.com/8186/8119368305_4e622c8349_z.jpg', 
    'id': 391895
    }
]
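In the COCO annotation files, caption entries and image metadata live in separate lists and are joined on `image_id` == `id`. A minimal sketch of that join, using only the fields from the sample above (the list variable names are illustrative):

```python
# Caption entries: each references its image via "image_id".
captions = [
    {"image_id": 391895, "id": 770337,
     "caption": "A man with a red helmet on a small moped on a dirt road. "},
]

# Image metadata entries: "id" is the image id.
images = [
    {"license": 3, "file_name": "COCO_val2014_000000391895.jpg",
     "height": 360, "width": 640, "id": 391895},
]

# Index images by id, then resolve each caption to its image record.
images_by_id = {img["id"]: img for img in images}
for ann in captions:
    img = images_by_id[ann["image_id"]]
    print(img["file_name"], "->", ann["caption"].strip())
```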

Citation information:

@misc{MSCOCO,
  title={Microsoft COCO: Common Objects in Context},
  author={Tsung-Yi Lin and Michael Maire and Serge Belongie and James Hays and Pietro Perona and Deva Ramanan and Piotr Doll{\'a}r and C. Lawrence Zitnick},
  year={2014},
  howpublished={http://link.springer.com/chapter/10.1007/978-3-319-10602-1_48},
}

Licensing information:

COCO Terms of Use