Evaluation Datasets
The following datasets are all transformed into standard Evaluation Prompts before evaluation.
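The exact prompt template is not specified in this card; purely as a hypothetical illustration, transforming one instance might look like the following (the wording and the [x, y, w, h] output format are assumptions, not the actual template):

def to_eval_prompt(image_file, expression):
    # Hypothetical template: ask the model to localize the referred object
    # and answer with a bounding box in [x, y, w, h] pixel coordinates.
    return {
        'image': image_file,
        'text': f'Find "{expression}" in the image and output its bounding box as [x, y, w, h].',
    }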
Dataset 1 (RefCOCO)
Evaluation metric: Precision or Accuracy
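For visual grounding, precision/accuracy is conventionally computed as Acc@0.5: a predicted box counts as correct when its IoU with the ground-truth box is at least 0.5. A minimal sketch under that assumption (the 0.5 threshold is the common convention, not stated in this card; boxes use the COCO [x, y, w, h] format of the annotations below):

def iou_xywh(a, b):
    # IoU of two boxes given as [x, y, w, h].
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(pred_boxes, gt_boxes, thr=0.5):
    # Fraction of predictions whose IoU with the ground truth reaches thr.
    hits = sum(iou_xywh(p, g) >= thr for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)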
Data description:
RefCOCO: a referring expression / visual grounding dataset with images from the MS COCO website; the image regions and their corresponding phrases (referring expressions) are manually annotated. RefCOCO is a small vision-language multimodal benchmark for training and testing, suitable for evaluating tasks such as referring expression comprehension / visual grounding. The images mostly depict real-life scenes, and the referring expressions are usually intuitive descriptions of the image content.
Amount of source data:
The dataset is split into train (120,624), validation (10,834), testA (5,657), and testB (5,095).
Amount of evaluation data:
The evaluation data consist of 5,657 image regions with corresponding phrases (referring expressions) from the testA split and 5,095 image regions with corresponding phrases from the testB split.
Data detail:
| KEY | EXPLANATION |
|---|---|
| 'images' | |
| id | image id |
| file_name | image filename |
| width | image width |
| height | image height |
| coco_url | image URL on the COCO site |
| flickr_url | image URL on Flickr |
| license | image license type |
| 'annotations' | |
| id | image region id |
| image_id | id of the image containing the region |
| category_id | object category id of the region |
| bbox | bounding box of the region ([x, y, w, h]) |
| segmentation | polygon segmentation of the region |
| area | segmentation area of the region |
| iscrowd | whether the region covers a multi-object crowd |
| 'references' | |
| ref_id | referring expression id |
| ann_id | id of the referred image region (annotation) |
| split | dataset split of the referring expression |
| sent_ids | list of sentence ids for this reference |
| sentences | sentence contents of the referring expressions |
Sample of source dataset:
import json as jsonmod
import pickle

refcoco = jsonmod.load(open('./refcoco/instances.json', 'r'))
refcoco_p = pickle.load(open('./refcoco/refs(unc).p', 'rb'), fix_imports=True)
refcoco_p[0]
{'sent_ids': [0, 1, 2],
'file_name': 'COCO_train2014_000000581857_16.jpg',
'ann_id': 1719310,
'ref_id': 0,
'image_id': 581857,
'split': 'train',
'sentences':
[{'tokens': ['the', 'lady', 'with', 'the', 'blue', 'shirt'], 'raw': 'THE LADY WITH THE BLUE SHIRT', 'sent_id': 0, 'sent': 'the lady with the blue shirt'},
{'tokens': ['lady', 'with', 'back', 'to', 'us'], 'raw': 'lady w back to us', 'sent_id': 1, 'sent': 'lady with back to us'},
{'tokens': ['blue', 'shirt'], 'raw': 'blue shirt', 'sent_id': 2, 'sent': 'blue shirt'}
],
'category_id': 1
}
refcoco_p[50000-1]
{'sent_ids': [142208, 142209],
'file_name': 'COCO_train2014_000000000072_0.jpg',
'ann_id': 598731,
'ref_id': 49999,
'image_id': 72,
'split': 'train',
'sentences':
[{'tokens': ['right', 'giraffe'], 'raw': 'RIGHT GIRAFFE', 'sent_id': 142208, 'sent': 'right giraffe'},
{'tokens': ['right', 'girafe'], 'raw': 'right girafe', 'sent_id': 142209, 'sent': 'right girafe'}
],
'category_id': 25
}
refcoco['images'][0]
{'license': 1,
'file_name': 'COCO_train2014_000000098304.jpg',
'coco_url': 'http://mscoco.org/images/98304',
'height': 424,
'width': 640,
'date_captured': '2013-11-21 23:06:41',
'flickr_url': 'http://farm6.staticflickr.com/5062/5896644212_a326e96ea9_z.jpg',
'id': 98304
}
refcoco['images'][19994-1]
{'license': 6,
'file_name': 'COCO_train2014_000000458751.jpg',
'coco_url': 'http://mscoco.org/images/458751',
'height': 576,
'width': 592,
'date_captured': '2013-11-16 21:13:51',
'flickr_url': 'http://farm8.staticflickr.com/7018/6821165845_48ebd9590f_z.jpg',
'id': 458751
}
refcoco['annotations'][0]
{'segmentation': [[267.52, 229.75, 265.6, 226.68, 265.79, 223.6, 263.87, 220.15, 263.87, 216.88, 266.94, 217.07, 268.48, 221.3, 272.32, 219.95, 276.35, 220.15, 279.62, 218.03, 283.46, 218.42, 285.0, 220.92, 285.0, 223.22, 284.42, 224.95, 280.96, 225.14, 279.81, 226.48, 281.73, 228.41, 279.43, 229.37, 275.78, 229.17, 273.86, 229.56, 274.24, 232.05, 269.82, 231.67, 267.14, 231.48, 266.75, 228.6]],
'area': 197.29899999999986,
'iscrowd': 0,
'image_id': 98304,
'bbox': [263.87, 216.88, 21.13, 15.17],
'category_id': 18,
'id': 3007
}
refcoco['annotations'][196771-1]
{'segmentation': [[203.42, 96.23, 216.68, 104.44, 216.05, 114.54, 226.15, 118.96, 228.67, 132.21, 247.61, 138.52, 250.13, 156.83, 236.88, 159.35, 234.35, 167.56, 274.12, 168.19, 281.69, 185.87, 284.85, 213.01, 267.81, 237.62, 243.19, 236.36, 238.14, 223.74, 232.46, 232.57, 231.2, 284.33, 159.87, 283.07, 159.87, 218.06, 151.67, 206.7, 154.19, 190.92, 159.87, 184.6, 158.61, 166.3, 140.3, 153.04, 142.2, 144.84, 178.81, 147.99, 183.86, 142.94, 169.97, 125.9, 173.13, 114.54, 176.28, 113.91, 185.75, 96.87, 200.9, 94.97]],
'area': 16238.20485,
'iscrowd': 0,
'image_id': 458751,
'bbox': [140.3, 94.97, 144.55, 189.36],
'category_id': 11,
'id': 1808941
}
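To build evaluation instances from the raw files, each reference is joined to its region and image: ref['ann_id'] matches an entry of refcoco['annotations'] by 'id', and ref['image_id'] matches refcoco['images'] by 'id'. A minimal sketch of that join for the testA/testB splits (variable names follow the loading snippet above; enumerating one instance per sentence is an assumption):

anns = {a['id']: a for a in refcoco['annotations']}
imgs = {i['id']: i for i in refcoco['images']}

eval_instances = []
for ref in refcoco_p:
    if ref['split'] not in ('testA', 'testB'):
        continue
    ann = anns[ref['ann_id']]    # ground-truth region for this reference
    img = imgs[ref['image_id']]  # image containing the region
    for sent in ref['sentences']:
        eval_instances.append({
            'file_name': img['file_name'],  # image to ground in
            'expression': sent['sent'],     # referring expression
            'bbox': ann['bbox'],            # ground-truth box, [x, y, w, h]
            'split': ref['split'],
        })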
Dataset 2 (RefCOCO+)
Evaluation metric: Precision or Accuracy
Data description:
RefCOCO+: a referring expression / visual grounding dataset with images from the MS COCO website; the image regions and their corresponding phrases (referring expressions) are manually annotated. RefCOCO+ is a small vision-language multimodal benchmark for training and testing, suitable for evaluating tasks such as referring expression comprehension / visual grounding. The images mostly depict real-life scenes, and the referring expressions are usually intuitive descriptions of the image content.
Amount of source data:
The dataset is split into train (120,191), validation (10,758), testA (5,726), and testB (4,889).
Amount of evaluation data:
The evaluation data consist of 5,726 image regions with corresponding phrases (referring expressions) from the testA split and 4,889 image regions with corresponding phrases from the testB split.
Data detail:
| KEY | EXPLANATION |
|---|---|
| 'images' | |
| id | image id |
| file_name | image filename |
| width | image width |
| height | image height |
| coco_url | image URL on the COCO site |
| flickr_url | image URL on Flickr |
| license | image license type |
| 'annotations' | |
| id | image region id |
| image_id | id of the image containing the region |
| category_id | object category id of the region |
| bbox | bounding box of the region ([x, y, w, h]) |
| segmentation | polygon segmentation of the region |
| area | segmentation area of the region |
| iscrowd | whether the region covers a multi-object crowd |
| 'references' | |
| ref_id | referring expression id |
| ann_id | id of the referred image region (annotation) |
| split | dataset split of the referring expression |
| sent_ids | list of sentence ids for this reference |
| sentences | sentence contents of the referring expressions |
Sample of source dataset:
import json as jsonmod
import pickle

refcoco_plus = jsonmod.load(open('./refcoco+/instances.json', 'r'))
refcoco_plus_p = pickle.load(open('./refcoco+/refs(unc).p', 'rb'), fix_imports=True)
refcoco_plus_p[0]
{'sent_ids': [0, 1, 2],
'file_name': 'COCO_train2014_000000581857_16.jpg',
'ann_id': 1719310,
'ref_id': 0,
'image_id': 581857,
'split': 'train',
'sentences': [
{'tokens': ['navy', 'blue', 'shirt'], 'raw': 'navy blue shirt', 'sent_id': 0, 'sent': 'navy blue shirt'},
{'tokens': ['woman', 'back', 'in', 'blue'], 'raw': 'woman back in blue', 'sent_id': 1, 'sent': 'woman back in blue'},
{'tokens': ['blue', 'shirt'], 'raw': 'blue shirt', 'sent_id': 2, 'sent': 'blue shirt'}
],
'category_id': 1
}
refcoco_plus_p[49856-1]
{'sent_ids': [141560, 141561, 141562, 141563],
'file_name': 'COCO_train2014_000000000072_0.jpg',
'ann_id': 598731,
'ref_id': 49855,
'image_id': 72,
'split': 'train',
'sentences': [
{'tokens': ['shorter', 'giraffe'], 'raw': 'shorter giraffe', 'sent_id': 141560, 'sent': 'shorter giraffe'},
{'tokens': ['giraffe', 'closest', 'to', 'camera'], 'raw': 'giraffe closest to camera', 'sent_id': 141561, 'sent': 'giraffe closest to camera'},
{'tokens': ['bent', 'neck'], 'raw': 'bent neck', 'sent_id': 141562, 'sent': 'bent neck'},
{'tokens': ['shorter', 'animal'], 'raw': 'shorter animal', 'sent_id': 141563, 'sent': 'shorter animal'}
],
'category_id': 25
}
refcoco_plus['images'][0]
{'license': 1,
'file_name': 'COCO_train2014_000000098304.jpg',
'coco_url': 'http://mscoco.org/images/98304',
'height': 424,
'width': 640,
'date_captured': '2013-11-21 23:06:41',
'flickr_url': 'http://farm6.staticflickr.com/5062/5896644212_a326e96ea9_z.jpg',
'id': 98304
}
refcoco_plus['images'][19992-1]
{'license': 6,
'file_name': 'COCO_train2014_000000458751.jpg',
'coco_url': 'http://mscoco.org/images/458751',
'height': 576,
'width': 592,
'date_captured': '2013-11-16 21:13:51',
'flickr_url': 'http://farm8.staticflickr.com/7018/6821165845_48ebd9590f_z.jpg',
'id': 458751
}
refcoco_plus['annotations'][0]
{'segmentation': [[267.52, 229.75, 265.6, 226.68, 265.79, 223.6, 263.87, 220.15, 263.87, 216.88, 266.94, 217.07, 268.48, 221.3, 272.32, 219.95, 276.35, 220.15, 279.62, 218.03, 283.46, 218.42, 285.0, 220.92, 285.0, 223.22, 284.42, 224.95, 280.96, 225.14, 279.81, 226.48, 281.73, 228.41, 279.43, 229.37, 275.78, 229.17, 273.86, 229.56, 274.24, 232.05, 269.82, 231.67, 267.14, 231.48, 266.75, 228.6]],
'area': 197.29899999999986,
'iscrowd': 0,
'image_id': 98304,
'bbox': [263.87, 216.88, 21.13, 15.17],
'category_id': 18,
'id': 3007
}
refcoco_plus['annotations'][196737-1]
{'segmentation': [[203.42, 96.23, 216.68, 104.44, 216.05, 114.54, 226.15, 118.96, 228.67, 132.21, 247.61, 138.52, 250.13, 156.83, 236.88, 159.35, 234.35, 167.56, 274.12, 168.19, 281.69, 185.87, 284.85, 213.01, 267.81, 237.62, 243.19, 236.36, 238.14, 223.74, 232.46, 232.57, 231.2, 284.33, 159.87, 283.07, 159.87, 218.06, 151.67, 206.7, 154.19, 190.92, 159.87, 184.6, 158.61, 166.3, 140.3, 153.04, 142.2, 144.84, 178.81, 147.99, 183.86, 142.94, 169.97, 125.9, 173.13, 114.54, 176.28, 113.91, 185.75, 96.87, 200.9, 94.97]],
'area': 16238.20485,
'iscrowd': 0,
'image_id': 458751,
'bbox': [140.3, 94.97, 144.55, 189.36],
'category_id': 11,
'id': 1808941
}
Dataset 3 (RefCOCOg)
Evaluation metric: Precision or Accuracy
Data description:
RefCOCOg: a referring expression / visual grounding dataset with images from the MS COCO website; the image regions and their corresponding phrases (referring expressions) are manually annotated. RefCOCOg is a small vision-language multimodal benchmark for training and testing, suitable for evaluating tasks such as referring expression comprehension / visual grounding. The images mostly depict real-life scenes, and the referring expressions are usually intuitive descriptions of the image content.
Amount of source data:
The dataset is split into train (80,512), validation (4,896), and test (9,602).
Amount of evaluation data:
The evaluation data consist of 9,602 image regions with corresponding phrases (referring expressions) from the test split.
Data detail:
| KEY | EXPLANATION |
|---|---|
| 'images' | |
| id | image id |
| file_name | image filename |
| width | image width |
| height | image height |
| coco_url | image URL on the COCO site |
| flickr_url | image URL on Flickr |
| license | image license type |
| 'annotations' | |
| id | image region id |
| image_id | id of the image containing the region |
| category_id | object category id of the region |
| bbox | bounding box of the region ([x, y, w, h]) |
| segmentation | polygon segmentation of the region |
| area | segmentation area of the region |
| iscrowd | whether the region covers a multi-object crowd |
| 'references' | |
| ref_id | referring expression id |
| ann_id | id of the referred image region (annotation) |
| split | dataset split of the referring expression |
| sent_ids | list of sentence ids for this reference |
| sentences | sentence contents of the referring expressions |
Sample of source dataset:
import json as jsonmod
import pickle

refcoco_g = jsonmod.load(open('./refcocog/instances.json', 'r'))
refcoco_g_p = pickle.load(open('./refcocog/refs(umd).p', 'rb'), fix_imports=True)
refcoco_g_p[0]
{'image_id': 380440,
'split': 'test',
'sentences': [
{'tokens': ['the', 'man', 'in', 'yellow', 'coat'], 'raw': 'the man in yellow coat', 'sent_id': 8, 'sent': 'the man in yellow coat'},
{'tokens': ['skiier', 'in', 'red', 'pants'], 'raw': 'Skiier in red pants.', 'sent_id': 9, 'sent': 'skiier in red pants'}
],
'file_name': 'COCO_train2014_000000380440_491042.jpg',
'category_id': 1,
'ann_id': 491042,
'sent_ids': [8, 9],
'ref_id': 0
}
refcoco_g_p[49822-1]
{'image_id': 573297,
'split': 'train',
'sentences': [
{'tokens': ['a', 'person', 'in', 'red', 'dress', 'and', 'he', 'is', 'seeing', 'his', 'mobile'], 'raw': 'A person in red dress and he is seeing his mobile.', 'sent_id': 104558, 'sent': 'a person in red dress and he is seeing his mobile'},
{'tokens': ['man', 'wearing', 'a', 'red', 'costume'], 'raw': 'Man wearing a red costume.', 'sent_id': 104559, 'sent': 'man wearing a red costume'}
],
'file_name': 'COCO_train2014_000000573297_472971.jpg',
'category_id': 1,
'ann_id': 472971,
'sent_ids': [104558, 104559],
'ref_id': 49821
}
refcoco_g['images'][0]
{'license': 1,
'file_name': 'COCO_train2014_000000131074.jpg',
'coco_url': 'http://mscoco.org/images/131074',
'height': 428,
'width': 640,
'date_captured': '2013-11-21 01:03:06',
'flickr_url': 'http://farm9.staticflickr.com/8308/7908210548_33e532d119_z.jpg',
'id': 131074
}
refcoco_g['images'][25799-1]
{'license': 5,
'file_name': 'COCO_train2014_000000524286.jpg',
'coco_url': 'http://mscoco.org/images/524286',
'height': 480,
'width': 640,
'date_captured': '2013-11-22 01:08:02',
'flickr_url': 'http://farm4.staticflickr.com/3286/3160643026_c2691d2c55_z.jpg',
'id': 524286
}
refcoco_g['annotations'][0]
{'segmentation': [[21.11, 239.09, 16.31, 274.6, 198.65, 349.45, 240.87, 336.98, 320.52, 293.79, 334.91, 248.69, 357.95, 273.64, 353.15, 289.0, 398.25, 267.88, 437.6, 251.57, 412.65, 228.54, 240.87, 210.31, 219.76, 141.21, 113.24, 153.69, 63.34, 156.57, 26.87, 169.04]],
'area': 48667.84089999999,
'iscrowd': 0,
'image_id': 131074,
'bbox': [16.31, 141.21, 421.29, 208.24],
'category_id': 65,
'id': 318235
}
refcoco_g['annotations'][208960-1]
{'segmentation': [[158.56, 212.49, 158.56, 94.92, 467.06, 85.21, 476.76, 209.26]],
'area': 37887.193,
'iscrowd': 0,
'image_id': 524286,
'bbox': [158.56, 85.21, 318.2, 127.28],
'category_id': 76,
'id': 1635174
}
Citation information:
RefCOCO and RefCOCO+:
@inproceedings{RefCOCO,
title={Modeling context in referring expressions},
author={Yu, Licheng and Poirson, Patrick and Yang, Shan and Berg, Alexander C and Berg, Tamara L},
booktitle={Computer Vision -- ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11--14, 2016, Proceedings, Part II},
pages={69--85},
year={2016},
organization={Springer}
}
RefCOCOg:
@inproceedings{RefCOCOg,
title={Generation and comprehension of unambiguous object descriptions},
author={Mao, Junhua and Huang, Jonathan and Toshev, Alexander and Camburu, Oana and Yuille, Alan L and Murphy, Kevin},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
pages={11--20},
year={2016}
}
Licensing information:
[Datasets] All RefCOCO* datasets: MS COCO Image Dataset, RefCOCO, RefCOCO+, RefCOCOg
[Licenses] Attribution-NonCommercial-ShareAlike License; Attribution-NonCommercial License; Attribution-NonCommercial-NoDerivs License; Attribution License; Attribution-ShareAlike License; Attribution-NoDerivs License; No known copyright restrictions; United States Government Work
Dataset 4 (ARPGrounding)
Evaluation metric: Precision or Accuracy
Data description:
ARPGrounding is a visual grounding benchmark designed to evaluate the compositional reasoning ability of vision-language models (VLMs). It is built on top of the Visual Genome (VG) dataset via dependency parsing followed by manual filtering. ARPGrounding aims to assess fine-grained compositional understanding in vision and language, especially in confusable scenarios where current models are prone to failure. Images come from real, complex visual scenes. The referring phrases are organized as ambiguous positive/negative pairs, requiring the model to distinguish between two visually similar or semantically related objects in the same image based on differences in attribute, relation, or priority.
Amount of source data:
The dataset is derived from 108,249 Visual Genome images, which are then filtered and paired to construct ARPGrounding.
Amount of evaluation data:
The evaluation set contains 11,425 region-level instances paired with phrases (referring expressions). It consists of three subsets:
- Attribute subset: 6,632 samples
- Relation subset: 370 samples
- Priority subset: 4,423 samples
Data detail:
| KEY | EXPLANATION |
|---|---|
| attribute_vg.pkl | Attribute subset |
| phrase | Positive/negative phrase describing an object (e.g., "red building") |
| x | Top-left x coordinate of the bounding box |
| y | Top-left y coordinate of the bounding box |
| w | Bounding box width |
| h | Bounding box height |
| attributes | List of attributes (e.g., ['brown', 'red']) |
| names | List of object names (e.g., ['building']) |
| object_id | Unique object ID in Visual Genome |
| synsets | WordNet synsets (e.g., ['building.n.01']) |
| relationship_vg.pkl | Relation subset |
| phrase | Phrase describing a spatial/action relation (e.g., "grass on top of sand") |
| x, y, w, h | Bounding box of the involved region (xywh format) |
| priority_vg.pkl | Priority/Saliency subset |
| phrase | Phrase describing positional priority or contrastive relations |
| x, y, w, h | Bounding box of the salient region (xywh format) |
| Data hierarchy | |
| num_images | Total number of images in the dataset |
| pairs | List of positive/negative paired samples for each image |
| pair[0] (pos) | Positive sample dictionary |
| pair[1] (neg) | Negative sample dictionary |
Sample of source dataset:
import pickle
data = pickle.load(open("attribute_vg.pkl", "rb"))
Example 1: Attribute - Disambiguating Color
Example (Attribute VG Pairs, positive and negative):
[
{
'synsets': ['building.n.01'],
'h': 298,
'object_id': 1023846,
'names': ['building'],
'w': 282,
'attributes': ['brown', 'red'],
'y': 13,
'x': 165,
'phrase': 'red building'
},
{
'synsets': ['building.n.01'],
'h': 384,
'object_id': 1023819,
'names': ['building'],
'w': 251,
'attributes': ['orange', 'brown', 'tall'],
'y': 4,
'x': 547,
'phrase': 'orange building'
}
]
Example 2: Relation - Disambiguating Spatial Position
Example (Relation VG Pairs, positive and negative):
[
{
'h': 156,
'w': 799,
'y': 264,
'x': 1,
'phrase': 'grass on top of sand'
},
{
'h': 80,
'w': 374,
'y': 468,
'x': 273,
'phrase': 'grass in sand'
}
]
Example 3: Priority - Disambiguating the Subject
Example (Priority VG Pairs, positive and negative):
[
{
'h': 200,
'w': 196,
'y': 391,
'x': 294,
'phrase': 'cpu on floor'
},
{
'h': 73,
'w': 402,
'y': 526,
'x': 237,
'phrase': 'floor under cpu'
}
]
Dataset structure:
Dataset Composition and Protocol
ARPGrounding consists of three core parts, each testing compositional reasoning along a different dimension (a sketch of a possible pair-based evaluation follows the list):
- Attribute: 6,632 samples. Tests whether a model can distinguish objects of the same category with different attributes (e.g., color/material/state). Example: distinguishing “a brown dog” vs “a black dog”.
- Relation: 370 samples. Tests whether a model can identify the target purely based on relations between objects (typically spatial/action relations). Example: “a computer on the table” vs “a computer under the table”.
- Priority: 4,423 samples. Tests whether a model can correctly identify the grammatical subject, without being misled by other nouns mentioned in the text. Often involves complex sentence structures where subject/object positions are swapped.
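Given the pair structure, a natural protocol is contrastive: the model grounds the positive phrase, and the prediction counts as correct only if it matches the positive region rather than the negative one. The sketch below illustrates this under stated assumptions: model_predict is a hypothetical model call returning an [x, y, w, h] box, and each pickle entry is assumed to yield (pos, neg) dicts as in the examples above; this is an illustration, not the paper's exact protocol.

import pickle

def iou(box, obj):
    # box: [x, y, w, h]; obj: dict with 'x', 'y', 'w', 'h' keys as in the subsets above.
    gt = [obj['x'], obj['y'], obj['w'], obj['h']]
    x2 = min(box[0] + box[2], gt[0] + gt[2])
    y2 = min(box[1] + box[3], gt[1] + gt[3])
    inter = max(0.0, x2 - max(box[0], gt[0])) * max(0.0, y2 - max(box[1], gt[1]))
    union = box[2] * box[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0

pairs = pickle.load(open('attribute_vg.pkl', 'rb'))  # assumed: iterable of (pos, neg) pairs
correct = 0
for pos, neg in pairs:
    # Hypothetical: model grounds the phrase in the paired image (image lookup omitted).
    pred = model_predict(pos['phrase'])
    # Correct when the prediction hits the positive region and beats the negative one.
    correct += iou(pred, pos) >= 0.5 and iou(pred, pos) > iou(pred, neg)
print('attribute accuracy:', correct / len(pairs))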
Citation information:
@inproceedings{ARPGrounding,
title={Investigating Compositional Challenges in Vision-Language Models for Visual Grounding},
author={Zeng, Yunan and Huang, Yan and Zhang, Jinjin and Jie, Zequn and Chai, Zhenhua and Wang, Liang},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
pages={14141--14151},
year={2024}
}
Source Dataset / Copyright & Usage Notes
[Datasets] ARPGrounding (Based on Visual Genome)
[Licenses] This dataset is derived from Visual Genome. Image licenses follow the original Visual Genome image sources (e.g., Flickr) and may vary per image (e.g., Attribution, NonCommercial, NoDerivs, ShareAlike, etc.). Please refer to the Visual Genome website for licensing information and use the images according to their original licenses. ARPGrounding annotations and processing scripts are provided under the same license as this repository unless otherwise stated.