Evaluation Dataset

Dataset 1: Flickr30k (F30k)

Evaluation Metric: Recall

Data description:

Flickr30k (F30k) is an image-sentence paired dataset. The images come from the Flickr website, and the sentences are manually annotated, with 5 different English sentences per image. Flickr30k is a small-sized vision-language multimodal training and testing benchmark that can be used to evaluate tasks such as image-text matching / cross-modal retrieval. The image content mostly comes from real-life scenes, and the sentence descriptions are usually intuitive descriptions of the image content.
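Recall@K, the evaluation metric for this benchmark, is the fraction of queries whose ground-truth match appears among the top-K retrieved items. The following is a minimal sketch, not the benchmark's official evaluation code; the similarity matrix and function name are illustrative assumptions:

```python
import numpy as np

def recall_at_k(sim, gt_index, k):
    """Fraction of queries whose ground-truth item ranks in the top k.

    sim: (num_queries, num_items) similarity matrix (higher = more similar).
    gt_index: gt_index[i] is the index of the correct item for query i.
    """
    # Rank candidate items for each query by descending similarity.
    ranking = np.argsort(-sim, axis=1)
    topk = ranking[:, :k]
    # A query is a "hit" if its ground-truth index appears in its top-k list.
    hits = (topk == np.asarray(gt_index)[:, None]).any(axis=1)
    return hits.mean()

# Toy example: 3 text queries retrieving from 4 images.
sim = np.array([
    [0.9, 0.1, 0.2, 0.0],   # query 0: correct image 0 ranked first
    [0.2, 0.3, 0.8, 0.1],   # query 1: correct image 2 ranked first
    [0.5, 0.6, 0.1, 0.4],   # query 2: correct image 0 ranked second
])
gt = [0, 2, 0]
print(recall_at_k(sim, gt, 1))  # 2 of 3 queries hit at K=1
print(recall_at_k(sim, gt, 2))  # all 3 queries hit at K=2
```

In image-text matching papers, Recall@1/5/10 is typically reported in both directions (image-to-text and text-to-image) by transposing the similarity matrix.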

Dataset structure:

Size of downloaded dataset files: 4.3 GB

Amount of source data:

The dataset is split into train (39905), validation (10042), and test (10003).

Data detail:

KEYS        EXPLAIN
sentids     sentence id list
imgid       image id
sentences   matched sentences
tokens      tokens in a sentence
raw         sentence
sentid      sentence id
split       dataset split
filename    image file name

Sample of source dataset:

This example was too long and was cropped:

{
'sentids': [125, 126, 127, 128, 129], 
'imgid': 25, 
'sentences': [
                {
                'tokens': ['the', 'man', 'with', 'pierced', 'ears', 'is', 'wearing', 'glasses', 'and', 'an', 'orange', 'hat'], 
                'raw': 'The man with pierced ears is wearing glasses and an orange hat.', 
                'imgid': 25, 
                'sentid': 125
                }, 
                {
                'tokens': ['a', 'man', 'with', 'glasses', 'is', 'wearing', 'a', 'beer', 'can', 'crocheted', 'hat'], 
                'raw': 'A man with glasses is wearing a beer can crocheted hat.', 
                'imgid': 25, 
                'sentid': 126
                }, 
                {
                'tokens': ['a', 'man', 'with', 'gauges', 'and', 'glasses', 'is', 'wearing', 'a', 'blitz', 'hat'], 
                'raw': 'A man with gauges and glasses is wearing a Blitz hat.', 
                'imgid': 25, 
                'sentid': 127
                }, 
                {
                'tokens': ['a', 'man', 'in', 'an', 'orange', 'hat', 'starring', 'at', 'something'], 
                'raw': 'A man in an orange hat starring at something.', 
                'imgid': 25, 
                'sentid': 128
                }, 
                {
                'tokens': ['a', 'man', 'wears', 'an', 'orange', 'hat', 'and', 'glasses'], 
                'raw': 'A man wears an orange hat and glasses.', 
                'imgid': 25, 
                'sentid': 129
                }
             ], 
'split': 'test', 
'filename': '1007129816.jpg'
}
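A record like the one above can be consumed directly in Python. This sketch uses the field names from the sample; the single-caption record below is an abbreviated stand-in for the full five-caption entry:

```python
import json

# Abbreviated version of the sample record shown above.
record_json = '''
{"sentids": [125, 126, 127, 128, 129],
 "imgid": 25,
 "sentences": [
   {"tokens": ["a", "man", "wears", "an", "orange", "hat", "and", "glasses"],
    "raw": "A man wears an orange hat and glasses.",
    "imgid": 25, "sentid": 129}],
 "split": "test",
 "filename": "1007129816.jpg"}
'''

record = json.loads(record_json)
# Each image carries a list of caption dicts; "raw" holds the full sentence.
captions = [s["raw"] for s in record["sentences"]]
print(record["filename"], record["split"])
print(captions)
```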

Citation information:

@misc{Flickr30k,
  title={From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions},
  author={Peter Young and Alice Lai and Micah Hodosh and Julia Hockenmaier},
  year={2014},
  howpublished={https://aclanthology.org/Q14-1006/},
}

Licensing information:

Flickr30k & Denotation Graph data
Flickr Terms & Conditions of Use

Dataset 2: Microsoft COCO (MSCOCO)

Evaluation Metric: Recall

Data description:

Microsoft COCO (MSCOCO) is an image-sentence paired dataset. The images come from the Flickr website, and the sentences are manually annotated, with 5 different English sentences per image. MSCOCO is a medium-sized vision-language multimodal training and testing benchmark that can be used to evaluate tasks such as image-text matching / cross-modal retrieval. The image content mostly comes from real-life scenes, and the sentence descriptions are usually intuitive descriptions of the image content.

Dataset structure:

Size of downloaded dataset files: 39 GB

Amount of source data:

According to the Karpathy split, the dataset is divided into train (images/sentences: 82783/413915), restval (30504/152520), val/dev (5000/25000), and test (5000/25000).

Data detail:

KEYS           EXPLAIN
image_id       image id
id (txt)       sentence id
caption        sentence
license        license
file_name      file name
coco_url       image url (COCO)
height         image height
width          image width
date_captured  date captured
flickr_url     image url (Flickr)
id (img)       image id

Sample of source dataset:

{
'image_id': 391895, 
'id': 770337, 
'caption': 'A man with a red helmet on a small moped on a dirt road. '
}
[
    {
    'license': 3, 
    'file_name': 'COCO_val2014_000000391895.jpg', 
    'coco_url': 'http://images.cocodataset.org/val2014/COCO_val2014_000000391895.jpg', 
    'height': 360, 
    'width': 640, 
    'date_captured': '2013-11-14 11:18:45', 
    'flickr_url': 'http://farm9.staticflickr.com/8186/8119368305_4e622c8349_z.jpg', 
    'id': 391895
    }
]
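In the COCO annotation files, caption entries and image metadata live in separate lists and are joined on `image_id` == `id`. A minimal sketch of that join, using only the fields from the sample above (the list variable names are illustrative):

```python
# Caption entries: each references its image via "image_id".
captions = [
    {"image_id": 391895, "id": 770337,
     "caption": "A man with a red helmet on a small moped on a dirt road. "},
]

# Image metadata entries: "id" is the image id.
images = [
    {"license": 3, "file_name": "COCO_val2014_000000391895.jpg",
     "height": 360, "width": 640, "id": 391895},
]

# Index images by id, then resolve each caption to its image record.
images_by_id = {img["id"]: img for img in images}
for ann in captions:
    img = images_by_id[ann["image_id"]]
    print(img["file_name"], "->", ann["caption"].strip())
```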

Citation information:

@misc{MSCOCO,
  title={Microsoft COCO: Common Objects in Context},
  author={Tsung-Yi Lin and Michael Maire and Serge Belongie and James Hays and Pietro Perona and Deva Ramanan and Piotr Doll{\'a}r and C. Lawrence Zitnick},
  year={2014},
  howpublished={http://link.springer.com/chapter/10.1007/978-3-319-10602-1_48},
}

Licensing information:

COCO Terms of Use