Evaluation Datasets
The following datasets are all transformed into standard Evaluation Prompts before evaluation.
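The exact prompt template is not specified in this card; purely as a hypothetical illustration, transforming one instance might look like the following (the wording and the [x, y, w, h] output format are assumptions, not the actual template):

def to_eval_prompt(image_file, expression):
    # Hypothetical template: ask the model to localize the referred object
    # and answer with a bounding box in [x, y, w, h] pixel coordinates.
    return {
        'image': image_file,
        'text': f'Find "{expression}" in the image and output its bounding box as [x, y, w, h].',
    }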
Dataset 1 (RefCOCO)
Evaluation metric: Precision or Accuracy
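For visual grounding, precision/accuracy is conventionally computed as Acc@0.5: a predicted box counts as correct when its IoU with the ground-truth box is at least 0.5. A minimal sketch under that assumption (the 0.5 threshold is the common convention, not stated in this card; boxes use the COCO [x, y, w, h] format of the annotations below):

def iou_xywh(a, b):
    # IoU of two boxes given as [x, y, w, h].
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(pred_boxes, gt_boxes, thr=0.5):
    # Fraction of predictions whose IoU with the ground truth reaches thr.
    hits = sum(iou_xywh(p, g) >= thr for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)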
Data description:
RefCOCO: a referring expression / visual grounding dataset with images from the MS COCO website; the image regions and their corresponding phrases (referring expressions) are manually annotated. RefCOCO is a small vision-language multimodal benchmark for training and testing, suitable for evaluating tasks such as referring expression comprehension / visual grounding. The images mostly depict real-life scenes, and the referring expressions are usually intuitive descriptions of the image content.
Amount of source data:
The dataset is split into train (120,624), validation (10,834), testA (5,657), and testB (5,095).
Amount of evaluation data:
The evaluation data consist of 5,657 image regions with corresponding phrases (referring expressions) from the testA split and 5,095 image regions with corresponding phrases from the testB split.
Data detail:
| KEY | EXPLANATION |
|---|---|
| 'images' | |
| id | image id |
| file_name | image filename |
| width | image width |
| height | image height |
| coco_url | image URL on the COCO site |
| flickr_url | image URL on Flickr |
| license | image license type |
| 'annotations' | |
| id | image region id |
| image_id | id of the image containing the region |
| category_id | object category id of the region |
| bbox | bounding box of the region ([x, y, w, h]) |
| segmentation | polygon segmentation of the region |
| area | segmentation area of the region |
| iscrowd | whether the region covers a multi-object crowd |
| 'references' | |
| ref_id | referring expression id |
| ann_id | id of the referred image region (annotation) |
| split | dataset split of the referring expression |
| sent_ids | list of sentence ids for this reference |
| sentences | sentence contents of the referring expressions |
Sample of source dataset:
import json as jsonmod
import pickle

refcoco = jsonmod.load(open('./refcoco/instances.json', 'r'))
refcoco_p = pickle.load(open('./refcoco/refs(unc).p', 'rb'), fix_imports=True)
refcoco_p[0]
{'sent_ids': [0, 1, 2],
'file_name': 'COCO_train2014_000000581857_16.jpg',
'ann_id': 1719310,
'ref_id': 0,
'image_id': 581857,
'split': 'train',
'sentences':
[{'tokens': ['the', 'lady', 'with', 'the', 'blue', 'shirt'], 'raw': 'THE LADY WITH THE BLUE SHIRT', 'sent_id': 0, 'sent': 'the lady with the blue shirt'},
{'tokens': ['lady', 'with', 'back', 'to', 'us'], 'raw': 'lady w back to us', 'sent_id': 1, 'sent': 'lady with back to us'},
{'tokens': ['blue', 'shirt'], 'raw': 'blue shirt', 'sent_id': 2, 'sent': 'blue shirt'}
],
'category_id': 1
}
refcoco_p[50000-1]
{'sent_ids': [142208, 142209],
'file_name': 'COCO_train2014_000000000072_0.jpg',
'ann_id': 598731,
'ref_id': 49999,
'image_id': 72,
'split': 'train',
'sentences':
[{'tokens': ['right', 'giraffe'], 'raw': 'RIGHT GIRAFFE', 'sent_id': 142208, 'sent': 'right giraffe'},
{'tokens': ['right', 'girafe'], 'raw': 'right girafe', 'sent_id': 142209, 'sent': 'right girafe'}
],
'category_id': 25
}
refcoco['images'][0]
{'license': 1,
'file_name': 'COCO_train2014_000000098304.jpg',
'coco_url': 'http://mscoco.org/images/98304',
'height': 424,
'width': 640,
'date_captured': '2013-11-21 23:06:41',
'flickr_url': 'http://farm6.staticflickr.com/5062/5896644212_a326e96ea9_z.jpg',
'id': 98304
}
refcoco['images'][19994-1]
{'license': 6,
'file_name': 'COCO_train2014_000000458751.jpg',
'coco_url': 'http://mscoco.org/images/458751',
'height': 576,
'width': 592,
'date_captured': '2013-11-16 21:13:51',
'flickr_url': 'http://farm8.staticflickr.com/7018/6821165845_48ebd9590f_z.jpg',
'id': 458751
}
refcoco['annotations'][0]
{'segmentation': [[267.52, 229.75, 265.6, 226.68, 265.79, 223.6, 263.87, 220.15, 263.87, 216.88, 266.94, 217.07, 268.48, 221.3, 272.32, 219.95, 276.35, 220.15, 279.62, 218.03, 283.46, 218.42, 285.0, 220.92, 285.0, 223.22, 284.42, 224.95, 280.96, 225.14, 279.81, 226.48, 281.73, 228.41, 279.43, 229.37, 275.78, 229.17, 273.86, 229.56, 274.24, 232.05, 269.82, 231.67, 267.14, 231.48, 266.75, 228.6]],
'area': 197.29899999999986,
'iscrowd': 0,
'image_id': 98304,
'bbox': [263.87, 216.88, 21.13, 15.17],
'category_id': 18,
'id': 3007
}
refcoco['annotations'][196771-1]
{'segmentation': [[203.42, 96.23, 216.68, 104.44, 216.05, 114.54, 226.15, 118.96, 228.67, 132.21, 247.61, 138.52, 250.13, 156.83, 236.88, 159.35, 234.35, 167.56, 274.12, 168.19, 281.69, 185.87, 284.85, 213.01, 267.81, 237.62, 243.19, 236.36, 238.14, 223.74, 232.46, 232.57, 231.2, 284.33, 159.87, 283.07, 159.87, 218.06, 151.67, 206.7, 154.19, 190.92, 159.87, 184.6, 158.61, 166.3, 140.3, 153.04, 142.2, 144.84, 178.81, 147.99, 183.86, 142.94, 169.97, 125.9, 173.13, 114.54, 176.28, 113.91, 185.75, 96.87, 200.9, 94.97]],
'area': 16238.20485,
'iscrowd': 0,
'image_id': 458751,
'bbox': [140.3, 94.97, 144.55, 189.36],
'category_id': 11,
'id': 1808941
}
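To build evaluation instances from the raw files, each reference is joined to its region and image: ref['ann_id'] matches an entry of refcoco['annotations'] by 'id', and ref['image_id'] matches refcoco['images'] by 'id'. A minimal sketch of that join for the testA/testB splits (variable names follow the loading snippet above; enumerating one instance per sentence is an assumption):

anns = {a['id']: a for a in refcoco['annotations']}
imgs = {i['id']: i for i in refcoco['images']}

eval_instances = []
for ref in refcoco_p:
    if ref['split'] not in ('testA', 'testB'):
        continue
    ann = anns[ref['ann_id']]    # ground-truth region for this reference
    img = imgs[ref['image_id']]  # image containing the region
    for sent in ref['sentences']:
        eval_instances.append({
            'file_name': img['file_name'],  # image to ground in
            'expression': sent['sent'],     # referring expression
            'bbox': ann['bbox'],            # ground-truth box, [x, y, w, h]
            'split': ref['split'],
        })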
Dataset 2 (RefCOCO+)
Evaluation metric: Precision or Accuracy
Data description:
RefCOCO+: a referring expression / visual grounding dataset with images from the MS COCO website; the image regions and their corresponding phrases (referring expressions) are manually annotated. RefCOCO+ is a small vision-language multimodal benchmark for training and testing, suitable for evaluating tasks such as referring expression comprehension / visual grounding. The images mostly depict real-life scenes, and the referring expressions are usually intuitive descriptions of the image content.
Amount of source data:
The dataset is split into train (120,191), validation (10,758), testA (5,726), and testB (4,889).
Amount of evaluation data:
The evaluation data consist of 5,726 image regions with corresponding phrases (referring expressions) from the testA split and 4,889 image regions with corresponding phrases from the testB split.
Data detail:
| KEY | EXPLANATION |
|---|---|
| 'images' | |
| id | image id |
| file_name | image filename |
| width | image width |
| height | image height |
| coco_url | image URL on the COCO site |
| flickr_url | image URL on Flickr |
| license | image license type |
| 'annotations' | |
| id | image region id |
| image_id | id of the image containing the region |
| category_id | object category id of the region |
| bbox | bounding box of the region ([x, y, w, h]) |
| segmentation | polygon segmentation of the region |
| area | segmentation area of the region |
| iscrowd | whether the region covers a multi-object crowd |
| 'references' | |
| ref_id | referring expression id |
| ann_id | id of the referred image region (annotation) |
| split | dataset split of the referring expression |
| sent_ids | list of sentence ids for this reference |
| sentences | sentence contents of the referring expressions |
Sample of source dataset:
import json as jsonmod
import pickle

refcoco_plus = jsonmod.load(open('./refcoco+/instances.json', 'r'))
refcoco_plus_p = pickle.load(open('./refcoco+/refs(unc).p', 'rb'), fix_imports=True)
refcoco_plus_p[0]
{'sent_ids': [0, 1, 2],
'file_name': 'COCO_train2014_000000581857_16.jpg',
'ann_id': 1719310,
'ref_id': 0,
'image_id': 581857,
'split': 'train',
'sentences': [
{'tokens': ['navy', 'blue', 'shirt'], 'raw': 'navy blue shirt', 'sent_id': 0, 'sent': 'navy blue shirt'},
{'tokens': ['woman', 'back', 'in', 'blue'], 'raw': 'woman back in blue', 'sent_id': 1, 'sent': 'woman back in blue'},
{'tokens': ['blue', 'shirt'], 'raw': 'blue shirt', 'sent_id': 2, 'sent': 'blue shirt'}
],
'category_id': 1
}
refcoco_plus_p[49856-1]
{'sent_ids': [141560, 141561, 141562, 141563],
'file_name': 'COCO_train2014_000000000072_0.jpg',
'ann_id': 598731,
'ref_id': 49855,
'image_id': 72,
'split': 'train',
'sentences': [
{'tokens': ['shorter', 'giraffe'], 'raw': 'shorter giraffe', 'sent_id': 141560, 'sent': 'shorter giraffe'},
{'tokens': ['giraffe', 'closest', 'to', 'camera'], 'raw': 'giraffe closest to camera', 'sent_id': 141561, 'sent': 'giraffe closest to camera'},
{'tokens': ['bent', 'neck'], 'raw': 'bent neck', 'sent_id': 141562, 'sent': 'bent neck'},
{'tokens': ['shorter', 'animal'], 'raw': 'shorter animal', 'sent_id': 141563, 'sent': 'shorter animal'}
],
'category_id': 25
}
refcoco_plus['images'][0]
{'license': 1,
'file_name': 'COCO_train2014_000000098304.jpg',
'coco_url': 'http://mscoco.org/images/98304',
'height': 424,
'width': 640,
'date_captured': '2013-11-21 23:06:41',
'flickr_url': 'http://farm6.staticflickr.com/5062/5896644212_a326e96ea9_z.jpg',
'id': 98304
}
refcoco_plus['images'][19992-1]
{'license': 6,
'file_name': 'COCO_train2014_000000458751.jpg',
'coco_url': 'http://mscoco.org/images/458751',
'height': 576,
'width': 592,
'date_captured': '2013-11-16 21:13:51',
'flickr_url': 'http://farm8.staticflickr.com/7018/6821165845_48ebd9590f_z.jpg',
'id': 458751
}
refcoco_plus['annotations'][0]
{'segmentation': [[267.52, 229.75, 265.6, 226.68, 265.79, 223.6, 263.87, 220.15, 263.87, 216.88, 266.94, 217.07, 268.48, 221.3, 272.32, 219.95, 276.35, 220.15, 279.62, 218.03, 283.46, 218.42, 285.0, 220.92, 285.0, 223.22, 284.42, 224.95, 280.96, 225.14, 279.81, 226.48, 281.73, 228.41, 279.43, 229.37, 275.78, 229.17, 273.86, 229.56, 274.24, 232.05, 269.82, 231.67, 267.14, 231.48, 266.75, 228.6]],
'area': 197.29899999999986,
'iscrowd': 0,
'image_id': 98304,
'bbox': [263.87, 216.88, 21.13, 15.17],
'category_id': 18,
'id': 3007
}
refcoco_plus['annotations'][196737-1]
{'segmentation': [[203.42, 96.23, 216.68, 104.44, 216.05, 114.54, 226.15, 118.96, 228.67, 132.21, 247.61, 138.52, 250.13, 156.83, 236.88, 159.35, 234.35, 167.56, 274.12, 168.19, 281.69, 185.87, 284.85, 213.01, 267.81, 237.62, 243.19, 236.36, 238.14, 223.74, 232.46, 232.57, 231.2, 284.33, 159.87, 283.07, 159.87, 218.06, 151.67, 206.7, 154.19, 190.92, 159.87, 184.6, 158.61, 166.3, 140.3, 153.04, 142.2, 144.84, 178.81, 147.99, 183.86, 142.94, 169.97, 125.9, 173.13, 114.54, 176.28, 113.91, 185.75, 96.87, 200.9, 94.97]],
'area': 16238.20485,
'iscrowd': 0,
'image_id': 458751,
'bbox': [140.3, 94.97, 144.55, 189.36],
'category_id': 11,
'id': 1808941
}
Dataset 3 (RefCOCOg)
Evaluation metric: Precision or Accuracy
Data description:
RefCOCOg: a referring expression / visual grounding dataset with images from the MS COCO website; the image regions and their corresponding phrases (referring expressions) are manually annotated. RefCOCOg is a small vision-language multimodal benchmark for training and testing, suitable for evaluating tasks such as referring expression comprehension / visual grounding. The images mostly depict real-life scenes, and the referring expressions are usually intuitive descriptions of the image content.
Amount of source data:
The dataset is split into train (80,512), validation (4,896), and test (9,602).
Amount of evaluation data:
The evaluation data consist of 9,602 image regions with corresponding phrases (referring expressions) from the test split.
Data detail:
| KEY | EXPLANATION |
|---|---|
| 'images' | |
| id | image id |
| file_name | image filename |
| width | image width |
| height | image height |
| coco_url | image URL on the COCO site |
| flickr_url | image URL on Flickr |
| license | image license type |
| 'annotations' | |
| id | image region id |
| image_id | id of the image containing the region |
| category_id | object category id of the region |
| bbox | bounding box of the region ([x, y, w, h]) |
| segmentation | polygon segmentation of the region |
| area | segmentation area of the region |
| iscrowd | whether the region covers a multi-object crowd |
| 'references' | |
| ref_id | referring expression id |
| ann_id | id of the referred image region (annotation) |
| split | dataset split of the referring expression |
| sent_ids | list of sentence ids for this reference |
| sentences | sentence contents of the referring expressions |
Sample of source dataset:
import json as jsonmod
import pickle

refcoco_g = jsonmod.load(open('./refcocog/instances.json', 'r'))
refcoco_g_p = pickle.load(open('./refcocog/refs(umd).p', 'rb'), fix_imports=True)
refcoco_g_p[0]
{'image_id': 380440,
'split': 'test',
'sentences': [
{'tokens': ['the', 'man', 'in', 'yellow', 'coat'], 'raw': 'the man in yellow coat', 'sent_id': 8, 'sent': 'the man in yellow coat'},
{'tokens': ['skiier', 'in', 'red', 'pants'], 'raw': 'Skiier in red pants.', 'sent_id': 9, 'sent': 'skiier in red pants'}
],
'file_name': 'COCO_train2014_000000380440_491042.jpg',
'category_id': 1,
'ann_id': 491042,
'sent_ids': [8, 9],
'ref_id': 0
}
refcoco_g_p[49822-1]
{'image_id': 573297,
'split': 'train',
'sentences': [
{'tokens': ['a', 'person', 'in', 'red', 'dress', 'and', 'he', 'is', 'seeing', 'his', 'mobile'], 'raw': 'A person in red dress and he is seeing his mobile.', 'sent_id': 104558, 'sent': 'a person in red dress and he is seeing his mobile'},
{'tokens': ['man', 'wearing', 'a', 'red', 'costume'], 'raw': 'Man wearing a red costume.', 'sent_id': 104559, 'sent': 'man wearing a red costume'}
],
'file_name': 'COCO_train2014_000000573297_472971.jpg',
'category_id': 1,
'ann_id': 472971,
'sent_ids': [104558, 104559],
'ref_id': 49821
}
refcoco_g['images'][0]
{'license': 1,
'file_name': 'COCO_train2014_000000131074.jpg',
'coco_url': 'http://mscoco.org/images/131074',
'height': 428,
'width': 640,
'date_captured': '2013-11-21 01:03:06',
'flickr_url': 'http://farm9.staticflickr.com/8308/7908210548_33e532d119_z.jpg',
'id': 131074
}
refcoco_g['images'][25799-1]
{'license': 5,
'file_name': 'COCO_train2014_000000524286.jpg',
'coco_url': 'http://mscoco.org/images/524286',
'height': 480,
'width': 640,
'date_captured': '2013-11-22 01:08:02',
'flickr_url': 'http://farm4.staticflickr.com/3286/3160643026_c2691d2c55_z.jpg',
'id': 524286
}
refcoco_g['annotations'][0]
{'segmentation': [[21.11, 239.09, 16.31, 274.6, 198.65, 349.45, 240.87, 336.98, 320.52, 293.79, 334.91, 248.69, 357.95, 273.64, 353.15, 289.0, 398.25, 267.88, 437.6, 251.57, 412.65, 228.54, 240.87, 210.31, 219.76, 141.21, 113.24, 153.69, 63.34, 156.57, 26.87, 169.04]],
'area': 48667.84089999999,
'iscrowd': 0,
'image_id': 131074,
'bbox': [16.31, 141.21, 421.29, 208.24],
'category_id': 65,
'id': 318235
}
refcoco_g['annotations'][208960-1]
{'segmentation': [[158.56, 212.49, 158.56, 94.92, 467.06, 85.21, 476.76, 209.26]],
'area': 37887.193,
'iscrowd': 0,
'image_id': 524286,
'bbox': [158.56, 85.21, 318.2, 127.28],
'category_id': 76,
'id': 1635174
}
Citation information:
RefCOCO and RefCOCO+:
@inproceedings{RefCOCO,
title={Modeling context in referring expressions},
author={Yu, Licheng and Poirson, Patrick and Yang, Shan and Berg, Alexander C and Berg, Tamara L},
booktitle={Computer Vision -- ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11--14, 2016, Proceedings, Part II},
pages={69--85},
year={2016},
organization={Springer}
}
RefCOCOg:
@inproceedings{RefCOCOg,
title={Generation and comprehension of unambiguous object descriptions},
author={Mao, Junhua and Huang, Jonathan and Toshev, Alexander and Camburu, Oana and Yuille, Alan L and Murphy, Kevin},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
pages={11--20},
year={2016}
}
Licensing information:
[Datasets] All RefCOCO* datasets: MS COCO Image Dataset, RefCOCO, RefCOCO+, RefCOCOg
[Licenses] Attribution-NonCommercial-ShareAlike License; Attribution-NonCommercial License; Attribution-NonCommercial-NoDerivs License; Attribution License; Attribution-ShareAlike License; Attribution-NoDerivs License; No known copyright restrictions; United States Government Work
Dataset 4 (ARPGrounding)
Evaluation metric: Precision or Accuracy
Data description:
ARPGrounding is a visual grounding benchmark designed to evaluate the compositional reasoning ability of vision-language models (VLMs). It is built on top of the Visual Genome (VG) dataset via dependency parsing followed by manual filtering. ARPGrounding aims to assess fine-grained compositional understanding in vision and language, especially in confusable scenarios where current models are prone to failure. Images come from real, complex visual scenes. The referring phrases are organized as ambiguous positive/negative pairs, requiring the model to distinguish between two visually similar or semantically related objects in the same image based on differences in attribute, relation, or priority.
Amount of source data:
The dataset is derived from 108,249 Visual Genome images, which are then filtered and paired to construct ARPGrounding.
Amount of evaluation data:
The evaluation set contains 11,425 region-level instances paired with phrases (referring expressions). It consists of three subsets:
- Attribute subset: 6,632 samples
- Relation subset: 370 samples
- Priority subset: 4,423 samples
Data detail:
| KEY | EXPLANATION |
|---|---|
| attribute_vg.pkl | Attribute subset |
| phrase | Positive/negative phrase describing an object (e.g., "red building") |
| x | Top-left x coordinate of the bounding box |
| y | Top-left y coordinate of the bounding box |
| w | Bounding box width |
| h | Bounding box height |
| attributes | List of attributes (e.g., ['brown', 'red']) |
| names | List of object names (e.g., ['building']) |
| object_id | Unique object ID in Visual Genome |
| synsets | WordNet synsets (e.g., ['building.n.01']) |
| relationship_vg.pkl | Relation subset |
| phrase | Phrase describing a spatial/action relation (e.g., "grass on top of sand") |
| x, y, w, h | Bounding box of the involved region (xywh format) |
| priority_vg.pkl | Priority/Saliency subset |
| phrase | Phrase describing positional priority or contrastive relations |
| x, y, w, h | Bounding box of the salient region (xywh format) |
| Data hierarchy | |
| num_images | Total number of images in the dataset |
| pairs | List of positive/negative paired samples for each image |
| pair[0] (pos) | Positive sample dictionary |
| pair[1] (neg) | Negative sample dictionary |
Sample of source dataset:
import pickle
data = pickle.load(open("attribute_vg.pkl", "rb"))
Example 1: Attribute - Disambiguating Color
Example (Attribute VG Pairs, positive and negative):
[
{
'synsets': ['building.n.01'],
'h': 298,
'object_id': 1023846,
'names': ['building'],
'w': 282,
'attributes': ['brown', 'red'],
'y': 13,
'x': 165,
'phrase': 'red building'
},
{
'synsets': ['building.n.01'],
'h': 384,
'object_id': 1023819,
'names': ['building'],
'w': 251,
'attributes': ['orange', 'brown', 'tall'],
'y': 4,
'x': 547,
'phrase': 'orange building'
}
]
Example 2: Relation - Disambiguating Spatial Position
Example (Relation VG Pairs, positive and negative):
[
{
'h': 156,
'w': 799,
'y': 264,
'x': 1,
'phrase': 'grass on top of sand'
},
{
'h': 80,
'w': 374,
'y': 468,
'x': 273,
'phrase': 'grass in sand'
}
]
Example 3: Priority - Disambiguating the Subject
Example (Priority VG Pairs, positive and negative):
[
{
'h': 200,
'w': 196,
'y': 391,
'x': 294,
'phrase': 'cpu on floor'
},
{
'h': 73,
'w': 402,
'y': 526,
'x': 237,
'phrase': 'floor under cpu'
}
]
Dataset structure:
Dataset Composition and Protocol
ARPGrounding consists of three core parts, each testing compositional reasoning along a different dimension (a sketch of a possible pair-based evaluation follows the list):
- Attribute: 6,632 samples. Tests whether a model can distinguish objects of the same category with different attributes (e.g., color/material/state). Example: distinguishing “a brown dog” vs “a black dog”.
- Relation: 370 samples. Tests whether a model can identify the target purely based on relations between objects (typically spatial/action relations). Example: “a computer on the table” vs “a computer under the table”.
- Priority: 4,423 samples. Tests whether a model can correctly identify the grammatical subject, without being misled by other nouns mentioned in the text. Often involves complex sentence structures where subject/object positions are swapped.
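Given the pair structure, a natural protocol is contrastive: the model grounds the positive phrase, and the prediction counts as correct only if it matches the positive region rather than the negative one. The sketch below illustrates this under stated assumptions: model_predict is a hypothetical model call returning an [x, y, w, h] box, and each pickle entry is assumed to yield (pos, neg) dicts as in the examples above; this is an illustration, not the paper's exact protocol.

import pickle

def iou(box, obj):
    # box: [x, y, w, h]; obj: dict with 'x', 'y', 'w', 'h' keys as in the subsets above.
    gt = [obj['x'], obj['y'], obj['w'], obj['h']]
    x2 = min(box[0] + box[2], gt[0] + gt[2])
    y2 = min(box[1] + box[3], gt[1] + gt[3])
    inter = max(0.0, x2 - max(box[0], gt[0])) * max(0.0, y2 - max(box[1], gt[1]))
    union = box[2] * box[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0

pairs = pickle.load(open('attribute_vg.pkl', 'rb'))  # assumed: iterable of (pos, neg) pairs
correct = 0
for pos, neg in pairs:
    # Hypothetical: model grounds the phrase in the paired image (image lookup omitted).
    pred = model_predict(pos['phrase'])
    # Correct when the prediction hits the positive region and beats the negative one.
    correct += iou(pred, pos) >= 0.5 and iou(pred, pos) > iou(pred, neg)
print('attribute accuracy:', correct / len(pairs))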
Citation information:
@inproceedings{ARPGrounding,
title={Investigating Compositional Challenges in Vision-Language Models for Visual Grounding},
author={Zeng, Yunan and Huang, Yan and Zhang, Jinjin and Jie, Zequn and Chai, Zhenhua and Wang, Liang},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
pages={14141--14151},
year={2024}
}
Source Dataset / Copyright & Usage Notes
[Datasets] ARPGrounding (Based on Visual Genome)
[Licenses] This dataset is derived from Visual Genome. Image licenses follow the original Visual Genome image sources (e.g., Flickr) and may vary per image (e.g., Attribution, NonCommercial, NoDerivs, ShareAlike, etc.). Please refer to the Visual Genome website for licensing information and use the images according to their original licenses. ARPGrounding annotations and processing scripts are provided under the same license as this repository unless otherwise stated.