Evaluation Dataset

The following datasets are all transformed into standard Evaluation Prompts before evaluation.

Dataset 1 (RefCOCO)

Evaluation Metrics: 1. Precision or Accuracy

Data description:

RefCOCO: a referring expression/visual grounding dataset whose images come from the MS COCO website and whose image regions and corresponding phrases (referring expressions) are manually annotated. RefCOCO is a small-sized vision-language multimodal benchmark dataset for training and testing, and can be used to evaluate tasks such as referring expression comprehension/visual grounding. The images mostly depict real-life scenes, and the phrases (referring expressions) are usually intuitive descriptions of the image content.

Amount of source data:

The dataset is split into train (120,624), validation (10,834), testA (5,657), and testB (5,095).

Amount of evaluation data:

The evaluation data consist of 5,657 instances of image regions and corresponding phrases (referring expressions) from the testA split and 4,889 such instances from the testB split.
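
For context, precision/accuracy in referring expression grounding is conventionally computed by counting a prediction as correct when its predicted box overlaps the ground-truth box with IoU >= 0.5. The sketch below illustrates that convention; the 0.5 threshold and the COCO-style [x, y, width, height] box format are assumptions based on common practice, not a specification of this benchmark's exact scoring code.

def iou_xywh(box_a, box_b):
    """IoU of two boxes in COCO [x, y, width, height] format."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Corners of the intersection rectangle
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(iou_xywh(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)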

Data detail:

KEY             EXPLANATION
'images'
  id            image id
  file_name     image filename
  width         image width
  height        image height
  coco_url      image URL in the COCO dataset
  flickr_url    image URL on Flickr
  license       image license type
'annotations'
  id            image region id
  image_id      id of the image containing the region
  category_id   category id of the image region
  bbox          bounding box of the image region
  segmentation  polygon-edge segmentation of the image region
  area          area of the region's segmentation
  iscrowd       whether the segmentation is a multi-object crowd
'references'
  ref_id        referring expression (ref) id
  ann_id        id of the referred image region (annotation)
  split         dataset split of the ref
  sent_ids      list of sentence ids of the ref's expressions
  sentences     sentence contents of the ref's expressions

Sample of source dataset:

import json as jsonmod
import pickle

# COCO-style instances: 'images', 'annotations', 'categories'
refcoco = jsonmod.load(open('./refcoco/instances.json', 'r'))
# Pickled list of refs (referring expressions grouped per image region)
refcoco_p = pickle.load(open('./refcoco/refs(unc).p', 'rb'), fix_imports=True)

refcoco_p[0]
{'sent_ids': [0, 1, 2], 
 'file_name': 'COCO_train2014_000000581857_16.jpg', 
 'ann_id': 1719310, 
 'ref_id': 0, 
 'image_id': 581857, 
 'split': 'train', 
 'sentences': 
    [{'tokens': ['the', 'lady', 'with', 'the', 'blue', 'shirt'], 'raw': 'THE LADY WITH THE BLUE SHIRT', 'sent_id': 0, 'sent': 'the lady with the blue shirt'}, 
     {'tokens': ['lady', 'with', 'back', 'to', 'us'], 'raw': 'lady w back to us', 'sent_id': 1, 'sent': 'lady with back to us'}, 
     {'tokens': ['blue', 'shirt'], 'raw': 'blue shirt', 'sent_id': 2, 'sent': 'blue shirt'}
    ], 
 'category_id': 1
}
refcoco_p[50000-1]
{'sent_ids': [142208, 142209], 
 'file_name': 'COCO_train2014_000000000072_0.jpg', 
 'ann_id': 598731, 
 'ref_id': 49999, 
 'image_id': 72, 
 'split': 'train', 
 'sentences': 
    [{'tokens': ['right', 'giraffe'], 'raw': 'RIGHT GIRAFFE', 'sent_id': 142208, 'sent': 'right giraffe'}, 
     {'tokens': ['right', 'girafe'], 'raw': 'right girafe', 'sent_id': 142209, 'sent': 'right girafe'}
    ], 
 'category_id': 25
}

refcoco['images'][0]
{'license': 1, 
 'file_name': 'COCO_train2014_000000098304.jpg', 
 'coco_url': 'http://mscoco.org/images/98304', 
 'height': 424, 
 'width': 640, 
 'date_captured': '2013-11-21 23:06:41', 
 'flickr_url': 'http://farm6.staticflickr.com/5062/5896644212_a326e96ea9_z.jpg', 
 'id': 98304
}
refcoco['images'][19994-1]
{'license': 6, 
 'file_name': 'COCO_train2014_000000458751.jpg', 
 'coco_url': 'http://mscoco.org/images/458751', 
 'height': 576, 
 'width': 592, 
 'date_captured': '2013-11-16 21:13:51', 
 'flickr_url': 'http://farm8.staticflickr.com/7018/6821165845_48ebd9590f_z.jpg', 
 'id': 458751
}

refcoco['annotations'][0]
{'segmentation': [[267.52, 229.75, 265.6, 226.68, 265.79, 223.6, 263.87, 220.15, 263.87, 216.88, 266.94, 217.07, 268.48, 221.3, 272.32, 219.95, 276.35, 220.15, 279.62, 218.03, 283.46, 218.42, 285.0, 220.92, 285.0, 223.22, 284.42, 224.95, 280.96, 225.14, 279.81, 226.48, 281.73, 228.41, 279.43, 229.37, 275.78, 229.17, 273.86, 229.56, 274.24, 232.05, 269.82, 231.67, 267.14, 231.48, 266.75, 228.6]], 
 'area': 197.29899999999986, 
 'iscrowd': 0, 
 'image_id': 98304, 
 'bbox': [263.87, 216.88, 21.13, 15.17], 
 'category_id': 18, 
 'id': 3007
}
refcoco['annotations'][196771-1]
{'segmentation': [[203.42, 96.23, 216.68, 104.44, 216.05, 114.54, 226.15, 118.96, 228.67, 132.21, 247.61, 138.52, 250.13, 156.83, 236.88, 159.35, 234.35, 167.56, 274.12, 168.19, 281.69, 185.87, 284.85, 213.01, 267.81, 237.62, 243.19, 236.36, 238.14, 223.74, 232.46, 232.57, 231.2, 284.33, 159.87, 283.07, 159.87, 218.06, 151.67, 206.7, 154.19, 190.92, 159.87, 184.6, 158.61, 166.3, 140.3, 153.04, 142.2, 144.84, 178.81, 147.99, 183.86, 142.94, 169.97, 125.9, 173.13, 114.54, 176.28, 113.91, 185.75, 96.87, 200.9, 94.97]], 
 'area': 16238.20485, 
 'iscrowd': 0, 
 'image_id': 458751, 
 'bbox': [140.3, 94.97, 144.55, 189.36], 
 'category_id': 11, 
 'id': 1808941
}
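
To tie the three structures above together, index the annotations and images by id and follow each reference's ann_id and image_id. A minimal sketch, assuming refcoco and refcoco_p have been loaded as in the sample above:

# Build id-keyed lookups so each reference can be resolved in O(1).
ann_by_id = {a['id']: a for a in refcoco['annotations']}
img_by_id = {i['id']: i for i in refcoco['images']}

ref = refcoco_p[0]
ann = ann_by_id[ref['ann_id']]    # the referred image region
img = img_by_id[ref['image_id']]  # the image containing that region
for s in ref['sentences']:
    print(s['sent'], '->', img['file_name'], ann['bbox'])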


Dataset 2 (RefCOCO+)

Evaluation Metrics: 1. Precision or Accuracy

Data description:

RefCOCO+: a referring expression/visual grounding dataset whose images come from the MS COCO website and whose image regions and corresponding phrases (referring expressions) are manually annotated. RefCOCO+ is a small-sized vision-language multimodal benchmark dataset for training and testing, and can be used to evaluate tasks such as referring expression comprehension/visual grounding. The images mostly depict real-life scenes, and the phrases (referring expressions) are usually intuitive descriptions of the image content.

Amount of source data:

The dataset is split into train (120,191), validation (10,758), testA (5,726), and testB (4,889).

Amount of evaluation data:

The evaluation data consist of 5,726 instances of image regions and corresponding phrases (referring expressions) from the testA split and 4,889 such instances from the testB split.
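
The split sizes quoted above appear to count individual referring expressions (sentences) rather than refs, since each ref bundles several sentences. A quick sanity check, assuming the refs(unc).p layout shown in the sample below:

from collections import Counter
import pickle

refs = pickle.load(open('./refcoco+/refs(unc).p', 'rb'))
# Count sentences per split; this should reproduce the
# train/validation/testA/testB sizes listed above.
counts = Counter()
for ref in refs:
    counts[ref['split']] += len(ref['sentences'])
print(counts)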

Data detail:

KEY             EXPLANATION
'images'
  id            image id
  file_name     image filename
  width         image width
  height        image height
  coco_url      image URL in the COCO dataset
  flickr_url    image URL on Flickr
  license       image license type
'annotations'
  id            image region id
  image_id      id of the image containing the region
  category_id   category id of the image region
  bbox          bounding box of the image region
  segmentation  polygon-edge segmentation of the image region
  area          area of the region's segmentation
  iscrowd       whether the segmentation is a multi-object crowd
'references'
  ref_id        referring expression (ref) id
  ann_id        id of the referred image region (annotation)
  split         dataset split of the ref
  sent_ids      list of sentence ids of the ref's expressions
  sentences     sentence contents of the ref's expressions

Sample of source dataset:

import json as jsonmod
import pickle

refcoco_plus = jsonmod.load(open('./refcoco+/instances.json', 'r'))
refcoco_plus_p = pickle.load(open('./refcoco+/refs(unc).p', 'rb'), fix_imports=True)

refcoco_plus_p[0]
{'sent_ids': [0, 1, 2], 
 'file_name': 'COCO_train2014_000000581857_16.jpg', 
 'ann_id': 1719310, 
 'ref_id': 0, 
 'image_id': 581857, 
 'split': 'train', 
 'sentences': [
    {'tokens': ['navy', 'blue', 'shirt'], 'raw': 'navy blue shirt', 'sent_id': 0, 'sent': 'navy blue shirt'}, 
    {'tokens': ['woman', 'back', 'in', 'blue'], 'raw': 'woman back in blue', 'sent_id': 1, 'sent': 'woman back in blue'}, 
    {'tokens': ['blue', 'shirt'], 'raw': 'blue shirt', 'sent_id': 2, 'sent': 'blue shirt'}
 ], 
 'category_id': 1
}
refcoco_plus_p[49856-1]
{'sent_ids': [141560, 141561, 141562, 141563], 
 'file_name': 'COCO_train2014_000000000072_0.jpg', 
 'ann_id': 598731, 
 'ref_id': 49855, 
 'image_id': 72, 
 'split': 'train', 
 'sentences': [
    {'tokens': ['shorter', 'giraffe'], 'raw': 'shorter giraffe', 'sent_id': 141560, 'sent': 'shorter giraffe'}, 
    {'tokens': ['giraffe', 'closest', 'to', 'camera'], 'raw': 'giraffe closest to camera', 'sent_id': 141561, 'sent': 'giraffe closest to camera'}, 
    {'tokens': ['bent', 'neck'], 'raw': 'bent neck', 'sent_id': 141562, 'sent': 'bent neck'}, 
    {'tokens': ['shorter', 'animal'], 'raw': 'shorter animal', 'sent_id': 141563, 'sent': 'shorter animal'}
 ], 
 'category_id': 25
}

refcoco_plus['images'][0]
{'license': 1, 
 'file_name': 'COCO_train2014_000000098304.jpg', 
 'coco_url': 'http://mscoco.org/images/98304', 
 'height': 424, 
 'width': 640, 
 'date_captured': '2013-11-21 23:06:41', 
 'flickr_url': 'http://farm6.staticflickr.com/5062/5896644212_a326e96ea9_z.jpg', 
 'id': 98304
}
refcoco_plus['images'][19992-1]
{'license': 6, 
 'file_name': 'COCO_train2014_000000458751.jpg', 
 'coco_url': 'http://mscoco.org/images/458751', 
 'height': 576, 
 'width': 592, 
 'date_captured': '2013-11-16 21:13:51', 
 'flickr_url': 'http://farm8.staticflickr.com/7018/6821165845_48ebd9590f_z.jpg', 
 'id': 458751
}

refcoco_plus['annotations'][0]
{'segmentation': [[267.52, 229.75, 265.6, 226.68, 265.79, 223.6, 263.87, 220.15, 263.87, 216.88, 266.94, 217.07, 268.48, 221.3, 272.32, 219.95, 276.35, 220.15, 279.62, 218.03, 283.46, 218.42, 285.0, 220.92, 285.0, 223.22, 284.42, 224.95, 280.96, 225.14, 279.81, 226.48, 281.73, 228.41, 279.43, 229.37, 275.78, 229.17, 273.86, 229.56, 274.24, 232.05, 269.82, 231.67, 267.14, 231.48, 266.75, 228.6]], 
 'area': 197.29899999999986, 
 'iscrowd': 0, 
 'image_id': 98304, 
 'bbox': [263.87, 216.88, 21.13, 15.17], 
 'category_id': 18, 
 'id': 3007
}
refcoco_plus['annotations'][196737-1]
{'segmentation': [[203.42, 96.23, 216.68, 104.44, 216.05, 114.54, 226.15, 118.96, 228.67, 132.21, 247.61, 138.52, 250.13, 156.83, 236.88, 159.35, 234.35, 167.56, 274.12, 168.19, 281.69, 185.87, 284.85, 213.01, 267.81, 237.62, 243.19, 236.36, 238.14, 223.74, 232.46, 232.57, 231.2, 284.33, 159.87, 283.07, 159.87, 218.06, 151.67, 206.7, 154.19, 190.92, 159.87, 184.6, 158.61, 166.3, 140.3, 153.04, 142.2, 144.84, 178.81, 147.99, 183.86, 142.94, 169.97, 125.9, 173.13, 114.54, 176.28, 113.91, 185.75, 96.87, 200.9, 94.97]], 
 'area': 16238.20485, 
 'iscrowd': 0, 
 'image_id': 458751, 
 'bbox': [140.3, 94.97, 144.55, 189.36], 
 'category_id': 11, 
 'id': 1808941
}


Dataset 3 (RefCOCOg)

Evaluation Metrics: 1. Precision or Accuracy

Data description:

RefCOCOg: a referring expression/visual grounding dataset whose images come from the MS COCO website and whose image regions and corresponding phrases (referring expressions) are manually annotated. RefCOCOg is a small-sized vision-language multimodal benchmark dataset for training and testing, and can be used to evaluate tasks such as referring expression comprehension/visual grounding. The images mostly depict real-life scenes, and the phrases (referring expressions) are usually intuitive descriptions of the image content.

Amount of source data:

The dataset is split into train (80,512), validation (4,896), and test (9,602).

Amount of evaluation data:

The evaluation data consist of 9,602 instances of image regions and corresponding phrases (referring expressions) from the test split.

Data detail:

KEY             EXPLANATION
'images'
  id            image id
  file_name     image filename
  width         image width
  height        image height
  coco_url      image URL in the COCO dataset
  flickr_url    image URL on Flickr
  license       image license type
'annotations'
  id            image region id
  image_id      id of the image containing the region
  category_id   category id of the image region
  bbox          bounding box of the image region
  segmentation  polygon-edge segmentation of the image region
  area          area of the region's segmentation
  iscrowd       whether the segmentation is a multi-object crowd
'references'
  ref_id        referring expression (ref) id
  ann_id        id of the referred image region (annotation)
  split         dataset split of the ref
  sent_ids      list of sentence ids of the ref's expressions
  sentences     sentence contents of the ref's expressions

Sample of source dataset:

import json as jsonmod
import pickle

refcoco_g = jsonmod.load(open('./refcocog/instances.json', 'r'))
refcoco_g_p = pickle.load(open('./refcocog/refs(umd).p', 'rb'), fix_imports=True)

refcoco_g_p[0]
{'image_id': 380440, 
 'split': 'test', 
 'sentences': [
    {'tokens': ['the', 'man', 'in', 'yellow', 'coat'], 'raw': 'the man in yellow coat', 'sent_id': 8, 'sent': 'the man in yellow coat'}, 
    {'tokens': ['skiier', 'in', 'red', 'pants'], 'raw': 'Skiier in red pants.', 'sent_id': 9, 'sent': 'skiier in red pants'}
 ], 
 'file_name': 'COCO_train2014_000000380440_491042.jpg', 
 'category_id': 1, 
 'ann_id': 491042, 
 'sent_ids': [8, 9], 
 'ref_id': 0
}
refcoco_g_p[49822-1]
{'image_id': 573297, 
 'split': 'train', 
 'sentences': [
    {'tokens': ['a', 'person', 'in', 'red', 'dress', 'and', 'he', 'is', 'seeing', 'his', 'mobile'], 'raw': 'A person in red dress and he is seeing his mobile.', 'sent_id': 104558, 'sent': 'a person in red dress and he is seeing his mobile'}, 
    {'tokens': ['man', 'wearing', 'a', 'red', 'costume'], 'raw': 'Man wearing a red costume.', 'sent_id': 104559, 'sent': 'man wearing a red costume'}
 ], 
 'file_name': 'COCO_train2014_000000573297_472971.jpg', 
 'category_id': 1, 
 'ann_id': 472971, 
 'sent_ids': [104558, 104559], 
 'ref_id': 49821
}

refcoco_g['images'][0]
{'license': 1, 
 'file_name': 'COCO_train2014_000000131074.jpg', 
 'coco_url': 'http://mscoco.org/images/131074', 
 'height': 428, 
 'width': 640, 
 'date_captured': '2013-11-21 01:03:06', 
 'flickr_url': 'http://farm9.staticflickr.com/8308/7908210548_33e532d119_z.jpg', 
 'id': 131074
}
refcoco_g['images'][25799-1]
{'license': 5, 
 'file_name': 'COCO_train2014_000000524286.jpg', 
 'coco_url': 'http://mscoco.org/images/524286', 
 'height': 480, 
 'width': 640, 
 'date_captured': '2013-11-22 01:08:02', 
 'flickr_url': 'http://farm4.staticflickr.com/3286/3160643026_c2691d2c55_z.jpg', 
 'id': 524286
}

refcoco_g['annotations'][0]
{'segmentation': [[21.11, 239.09, 16.31, 274.6, 198.65, 349.45, 240.87, 336.98, 320.52, 293.79, 334.91, 248.69, 357.95, 273.64, 353.15, 289.0, 398.25, 267.88, 437.6, 251.57, 412.65, 228.54, 240.87, 210.31, 219.76, 141.21, 113.24, 153.69, 63.34, 156.57, 26.87, 169.04]], 
 'area': 48667.84089999999, 
 'iscrowd': 0, 
 'image_id': 131074, 
 'bbox': [16.31, 141.21, 421.29, 208.24], 
 'category_id': 65, 
 'id': 318235
}
refcoco_g['annotations'][208960-1]
{'segmentation': [[158.56, 212.49, 158.56, 94.92, 467.06, 85.21, 476.76, 209.26]], 
 'area': 37887.193, 
 'iscrowd': 0, 
 'image_id': 524286, 
 'bbox': [158.56, 85.21, 318.2, 127.28], 
 'category_id': 76, 
 'id': 1635174
}
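
For evaluation, the test-split references can be flattened into (image file, ground-truth box, expression) triples. A minimal sketch, assuming refcoco_g and refcoco_g_p have been loaded as in the sample above:

ann_by_id = {a['id']: a for a in refcoco_g['annotations']}
img_by_id = {i['id']: i for i in refcoco_g['images']}

eval_instances = [
    (img_by_id[ref['image_id']]['file_name'],  # image containing the region
     ann_by_id[ref['ann_id']]['bbox'],         # ground-truth box [x, y, w, h]
     sent['sent'])                             # one referring expression
    for ref in refcoco_g_p if ref['split'] == 'test'
    for sent in ref['sentences']
]
print(len(eval_instances))  # should match the 9602 test instances listed above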


Citation information:

RefCOCO, RefCOCO+:

@inproceedings{yu2016modeling,
  title={Modeling context in referring expressions},
  author={Yu, Licheng and Poirson, Patrick and Yang, Shan and Berg, Alexander C and Berg, Tamara L},
  booktitle={Computer Vision--ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14},
  pages={69--85},
  year={2016},
  organization={Springer}
}

RefCOCOg:

@inproceedings{mao2016generation,
  title={Generation and comprehension of unambiguous object descriptions},
  author={Mao, Junhua and Huang, Jonathan and Toshev, Alexander and Camburu, Oana and Yuille, Alan L and Murphy, Kevin},
  booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages={11--20},
  year={2016}
}

Licensing information:

[Datasets]
ALL RefCOCO* Datasets
MS COCO Image Dataset
RefCOCO
RefCOCO+
RefCOCOg

[Licenses]
Attribution-NonCommercial-ShareAlike License
Attribution-NonCommercial License
Attribution-NonCommercial-NoDerivs License
Attribution License
Attribution-ShareAlike License
Attribution-NoDerivs License
No known copyright restrictions
United States Government Work