
Evaluation Data

All of the datasets below are converted into a standard evaluation prompt before evaluation.

Dataset 1 (RefCOCO)

# Evaluation Metric 1: Accuracy (Precision or Accuracy)

Data description:

RefCOCO is a referring-expression / visual-grounding dataset. Its images are a subset of those on the MS COCO website, and the region images with their corresponding phrases (referring expressions) come from human annotation. RefCOCO is a small vision-language multimodal training and test benchmark that can be used to evaluate tasks such as referring-expression comprehension and visual grounding. Most images depict everyday scenes, and the phrases (referring expressions) are typically direct descriptions of the content of an image region.

Source data volume:

The dataset is split into a training set (120,624), a validation set (10,834), test set A (5,657), and test set B (5,095).

Evaluation data volume:

The evaluation data consists of the 5,657 region-image/phrase (referring-expression) instances in test set A and the 5,095 region-image/phrase (referring-expression) instances in test set B of the source data.
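The metric heading above does not spell out how accuracy is computed for grounding. A common convention (an assumption here, not stated by this document) is that a predicted box counts as correct when its IoU with the ground-truth box is at least 0.5, with boxes in COCO's [x, y, w, h] format. A minimal sketch:

```python
def iou(box_a, box_b):
    """IoU of two boxes in [x, y, w, h] format (as in the COCO 'bbox' field)."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh
    # Intersection rectangle (clamped to zero when boxes are disjoint).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(preds, gts, thresh=0.5):
    """Fraction of predicted boxes whose IoU with ground truth >= thresh.

    The 0.5 threshold is the usual grounding convention, not something
    this document specifies."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)
```

The same helper applies unchanged to test sets A and B, since both store regions as xywh bounding boxes.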

Source data fields:

KEYS            EXPLAIN
'images'
  id            image id
  file_name     image file name
  width         image width
  height        image height
  coco_url      URL of the image in the COCO dataset
  flickr_url    URL of the image on Flickr
  license       image license type
'annotations'
  id            region id
  image_id      id of the image the region belongs to
  category_id   category id of the region
  bbox          rectangular bounding box of the region
  segmentation  polygonal segmentation of the region
  area          area of the region's segmentation
  iscrowd       whether the segmentation covers a multi-object crowd
'references'
  ref_id        referring-expression id
  ann_id        id of the region the referring expression belongs to
  split         dataset split of the referring expression
  sent_ids      list of ids of the referring expressions
  sentences     contents of the referring expressions

Source dataset samples:

import json as jsonmod
import pickle

refcoco = jsonmod.load(open('./refcoco/instances.json', 'r'))
refcoco_p = pickle.load(open('./refcoco/refs(unc).p', 'rb'), fix_imports=True)

refcoco_p[0]
{'sent_ids': [0, 1, 2], 
 'file_name': 'COCO_train2014_000000581857_16.jpg', 
 'ann_id': 1719310, 
 'ref_id': 0, 
 'image_id': 581857, 
 'split': 'train', 
 'sentences': 
    [{'tokens': ['the', 'lady', 'with', 'the', 'blue', 'shirt'], 'raw': 'THE LADY WITH THE BLUE SHIRT', 'sent_id': 0, 'sent': 'the lady with the blue shirt'}, 
     {'tokens': ['lady', 'with', 'back', 'to', 'us'], 'raw': 'lady w back to us', 'sent_id': 1, 'sent': 'lady with back to us'}, 
     {'tokens': ['blue', 'shirt'], 'raw': 'blue shirt', 'sent_id': 2, 'sent': 'blue shirt'}
    ], 
 'category_id': 1
}
refcoco_p[50000-1]
{'sent_ids': [142208, 142209], 
 'file_name': 'COCO_train2014_000000000072_0.jpg', 
 'ann_id': 598731, 
 'ref_id': 49999, 
 'image_id': 72, 
 'split': 'train', 
 'sentences': 
    [{'tokens': ['right', 'giraffe'], 'raw': 'RIGHT GIRAFFE', 'sent_id': 142208, 'sent': 'right giraffe'}, 
     {'tokens': ['right', 'girafe'], 'raw': 'right girafe', 'sent_id': 142209, 'sent': 'right girafe'}
    ], 
 'category_id': 25
}

refcoco['images'][0]
{'license': 1, 
 'file_name': 'COCO_train2014_000000098304.jpg', 
 'coco_url': 'http://mscoco.org/images/98304', 
 'height': 424, 
 'width': 640, 
 'date_captured': '2013-11-21 23:06:41', 
 'flickr_url': 'http://farm6.staticflickr.com/5062/5896644212_a326e96ea9_z.jpg', 
 'id': 98304
}
refcoco['images'][19994-1]
{'license': 6, 
 'file_name': 'COCO_train2014_000000458751.jpg', 
 'coco_url': 'http://mscoco.org/images/458751', 
 'height': 576, 
 'width': 592, 
 'date_captured': '2013-11-16 21:13:51', 
 'flickr_url': 'http://farm8.staticflickr.com/7018/6821165845_48ebd9590f_z.jpg', 
 'id': 458751
}

refcoco['annotations'][0]
{'segmentation': [[267.52, 229.75, 265.6, 226.68, 265.79, 223.6, 263.87, 220.15, 263.87, 216.88, 266.94, 217.07, 268.48, 221.3, 272.32, 219.95, 276.35, 220.15, 279.62, 218.03, 283.46, 218.42, 285.0, 220.92, 285.0, 223.22, 284.42, 224.95, 280.96, 225.14, 279.81, 226.48, 281.73, 228.41, 279.43, 229.37, 275.78, 229.17, 273.86, 229.56, 274.24, 232.05, 269.82, 231.67, 267.14, 231.48, 266.75, 228.6]], 
 'area': 197.29899999999986, 
 'iscrowd': 0, 
 'image_id': 98304, 
 'bbox': [263.87, 216.88, 21.13, 15.17], 
 'category_id': 18, 
 'id': 3007
}
refcoco['annotations'][196771-1]
{'segmentation': [[203.42, 96.23, 216.68, 104.44, 216.05, 114.54, 226.15, 118.96, 228.67, 132.21, 247.61, 138.52, 250.13, 156.83, 236.88, 159.35, 234.35, 167.56, 274.12, 168.19, 281.69, 185.87, 284.85, 213.01, 267.81, 237.62, 243.19, 236.36, 238.14, 223.74, 232.46, 232.57, 231.2, 284.33, 159.87, 283.07, 159.87, 218.06, 151.67, 206.7, 154.19, 190.92, 159.87, 184.6, 158.61, 166.3, 140.3, 153.04, 142.2, 144.84, 178.81, 147.99, 183.86, 142.94, 169.97, 125.9, 173.13, 114.54, 176.28, 113.91, 185.75, 96.87, 200.9, 94.97]], 
 'area': 16238.20485, 
 'iscrowd': 0, 
 'image_id': 458751, 
 'bbox': [140.3, 94.97, 144.55, 189.36], 
 'category_id': 11, 
 'id': 1808941
}
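The three tables link together: a 'references' entry points at an 'annotations' record via ann_id and at an 'images' record via image_id. A minimal join sketch; the toy records below are illustrative, trimmed from the samples above (the bbox shown is not claimed to be the real one for this ann_id), but the lookup logic is the same on the full files:

```python
# Illustrative in-memory records mirroring the structures printed above.
refs = [{'ref_id': 0, 'ann_id': 1719310, 'image_id': 581857,
         'sentences': [{'sent': 'the lady with the blue shirt'}]}]
annotations = [{'id': 1719310, 'image_id': 581857,
                'bbox': [263.87, 216.88, 21.13, 15.17]}]
images = [{'id': 581857, 'file_name': 'COCO_train2014_000000581857.jpg'}]

# Index the two target tables by their primary key once, then join.
ann_by_id = {a['id']: a for a in annotations}
img_by_id = {i['id']: i for i in images}

def resolve(ref):
    """Return (expression, bbox, file_name) for one 'references' entry."""
    ann = ann_by_id[ref['ann_id']]
    img = img_by_id[ref['image_id']]
    return ref['sentences'][0]['sent'], ann['bbox'], img['file_name']
```

With the real data you would build `refs` from refs(unc).p and the other two tables from instances.json, loaded as in the snippet at the top of this section.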

Dataset composition and conventions:

Dataset 2 (RefCOCO+)

# Evaluation Metric 1: Accuracy (Precision or Accuracy)

Data description:

RefCOCO+ is a referring-expression / visual-grounding dataset. Its images are a subset of those on the MS COCO website, and the region images with their corresponding phrases (referring expressions) come from human annotation. RefCOCO+ is a small vision-language multimodal training and test benchmark that can be used to evaluate tasks such as referring-expression comprehension and visual grounding. Most images depict everyday scenes, and the phrases (referring expressions) are typically direct descriptions of the content of an image region.

Source data volume:

The dataset is split into a training set (120,191), a validation set (10,758), test set A (5,726), and test set B (4,889).

Evaluation data volume:

The evaluation data consists of the 5,726 region-image/phrase (referring-expression) instances in test set A and the 4,889 region-image/phrase (referring-expression) instances in test set B of the source data.

Source data fields:

KEYS            EXPLAIN
'images'
  id            image id
  file_name     image file name
  width         image width
  height        image height
  coco_url      URL of the image in the COCO dataset
  flickr_url    URL of the image on Flickr
  license       image license type
'annotations'
  id            region id
  image_id      id of the image the region belongs to
  category_id   category id of the region
  bbox          rectangular bounding box of the region
  segmentation  polygonal segmentation of the region
  area          area of the region's segmentation
  iscrowd       whether the segmentation covers a multi-object crowd
'references'
  ref_id        referring-expression id
  ann_id        id of the region the referring expression belongs to
  split         dataset split of the referring expression
  sent_ids      list of ids of the referring expressions
  sentences     contents of the referring expressions

Source dataset samples:

import json as jsonmod
import pickle

refcoco_plus = jsonmod.load(open('./refcoco+/instances.json', 'r'))
refcoco_plus_p = pickle.load(open('./refcoco+/refs(unc).p', 'rb'), fix_imports=True)

refcoco_plus_p[0]
{'sent_ids': [0, 1, 2], 
 'file_name': 'COCO_train2014_000000581857_16.jpg', 
 'ann_id': 1719310, 
 'ref_id': 0, 
 'image_id': 581857, 
 'split': 'train', 
 'sentences': [
    {'tokens': ['navy', 'blue', 'shirt'], 'raw': 'navy blue shirt', 'sent_id': 0, 'sent': 'navy blue shirt'}, 
    {'tokens': ['woman', 'back', 'in', 'blue'], 'raw': 'woman back in blue', 'sent_id': 1, 'sent': 'woman back in blue'}, 
    {'tokens': ['blue', 'shirt'], 'raw': 'blue shirt', 'sent_id': 2, 'sent': 'blue shirt'}
 ], 
 'category_id': 1
}
refcoco_plus_p[49856-1]
{'sent_ids': [141560, 141561, 141562, 141563], 
 'file_name': 'COCO_train2014_000000000072_0.jpg', 
 'ann_id': 598731, 
 'ref_id': 49855, 
 'image_id': 72, 
 'split': 'train', 
 'sentences': [
    {'tokens': ['shorter', 'giraffe'], 'raw': 'shorter giraffe', 'sent_id': 141560, 'sent': 'shorter giraffe'}, 
    {'tokens': ['giraffe', 'closest', 'to', 'camera'], 'raw': 'giraffe closest to camera', 'sent_id': 141561, 'sent': 'giraffe closest to camera'}, 
    {'tokens': ['bent', 'neck'], 'raw': 'bent neck', 'sent_id': 141562, 'sent': 'bent neck'}, 
    {'tokens': ['shorter', 'animal'], 'raw': 'shorter animal', 'sent_id': 141563, 'sent': 'shorter animal'}
 ], 
 'category_id': 25
}

refcoco_plus['images'][0]
{'license': 1, 
 'file_name': 'COCO_train2014_000000098304.jpg', 
 'coco_url': 'http://mscoco.org/images/98304', 
 'height': 424, 
 'width': 640, 
 'date_captured': '2013-11-21 23:06:41', 
 'flickr_url': 'http://farm6.staticflickr.com/5062/5896644212_a326e96ea9_z.jpg', 
 'id': 98304
}
refcoco_plus['images'][19992-1]
{'license': 6, 
 'file_name': 'COCO_train2014_000000458751.jpg', 
 'coco_url': 'http://mscoco.org/images/458751', 
 'height': 576, 
 'width': 592, 
 'date_captured': '2013-11-16 21:13:51', 
 'flickr_url': 'http://farm8.staticflickr.com/7018/6821165845_48ebd9590f_z.jpg', 
 'id': 458751
}

refcoco_plus['annotations'][0]
{'segmentation': [[267.52, 229.75, 265.6, 226.68, 265.79, 223.6, 263.87, 220.15, 263.87, 216.88, 266.94, 217.07, 268.48, 221.3, 272.32, 219.95, 276.35, 220.15, 279.62, 218.03, 283.46, 218.42, 285.0, 220.92, 285.0, 223.22, 284.42, 224.95, 280.96, 225.14, 279.81, 226.48, 281.73, 228.41, 279.43, 229.37, 275.78, 229.17, 273.86, 229.56, 274.24, 232.05, 269.82, 231.67, 267.14, 231.48, 266.75, 228.6]], 
 'area': 197.29899999999986, 
 'iscrowd': 0, 
 'image_id': 98304, 
 'bbox': [263.87, 216.88, 21.13, 15.17], 
 'category_id': 18, 
 'id': 3007
}
refcoco_plus['annotations'][196737-1]
{'segmentation': [[203.42, 96.23, 216.68, 104.44, 216.05, 114.54, 226.15, 118.96, 228.67, 132.21, 247.61, 138.52, 250.13, 156.83, 236.88, 159.35, 234.35, 167.56, 274.12, 168.19, 281.69, 185.87, 284.85, 213.01, 267.81, 237.62, 243.19, 236.36, 238.14, 223.74, 232.46, 232.57, 231.2, 284.33, 159.87, 283.07, 159.87, 218.06, 151.67, 206.7, 154.19, 190.92, 159.87, 184.6, 158.61, 166.3, 140.3, 153.04, 142.2, 144.84, 178.81, 147.99, 183.86, 142.94, 169.97, 125.9, 173.13, 114.54, 176.28, 113.91, 185.75, 96.87, 200.9, 94.97]], 
 'area': 16238.20485, 
 'iscrowd': 0, 
 'image_id': 458751, 
 'bbox': [140.3, 94.97, 144.55, 189.36], 
 'category_id': 11, 
 'id': 1808941
}

Dataset composition and conventions:

Dataset 3 (RefCOCOg)

# Evaluation Metric 1: Accuracy (Precision or Accuracy)

Data description:

RefCOCOg is a referring-expression / visual-grounding dataset. Its images are a subset of those on the MS COCO website, and the region images with their corresponding phrases (referring expressions) come from human annotation. RefCOCOg is a small vision-language multimodal training and test benchmark that can be used to evaluate tasks such as referring-expression comprehension and visual grounding. Most images depict everyday scenes, and the phrases (referring expressions) are typically direct descriptions of the content of an image region.

Source data volume:

The dataset is split into a training set (80,512), a validation set (4,896), and a test set (9,602).

Evaluation data volume:

The evaluation data consists of the 9,602 region-image/phrase (referring-expression) instances in the test set of the source data.

Source data fields:

KEYS            EXPLAIN
'images'
  id            image id
  file_name     image file name
  width         image width
  height        image height
  coco_url      URL of the image in the COCO dataset
  flickr_url    URL of the image on Flickr
  license       image license type
'annotations'
  id            region id
  image_id      id of the image the region belongs to
  category_id   category id of the region
  bbox          rectangular bounding box of the region
  segmentation  polygonal segmentation of the region
  area          area of the region's segmentation
  iscrowd       whether the segmentation covers a multi-object crowd
'references'
  ref_id        referring-expression id
  ann_id        id of the region the referring expression belongs to
  split         dataset split of the referring expression
  sent_ids      list of ids of the referring expressions
  sentences     contents of the referring expressions

Source dataset samples:

import json as jsonmod
import pickle

refcoco_g = jsonmod.load(open('./refcocog/instances.json', 'r'))
refcoco_g_p = pickle.load(open('./refcocog/refs(umd).p', 'rb'), fix_imports=True)

refcoco_g_p[0]
{'image_id': 380440, 
 'split': 'test', 
 'sentences': [
    {'tokens': ['the', 'man', 'in', 'yellow', 'coat'], 'raw': 'the man in yellow coat', 'sent_id': 8, 'sent': 'the man in yellow coat'}, 
    {'tokens': ['skiier', 'in', 'red', 'pants'], 'raw': 'Skiier in red pants.', 'sent_id': 9, 'sent': 'skiier in red pants'}
 ], 
 'file_name': 'COCO_train2014_000000380440_491042.jpg', 
 'category_id': 1, 
 'ann_id': 491042, 
 'sent_ids': [8, 9], 
 'ref_id': 0
}
refcoco_g_p[49822-1]
{'image_id': 573297, 
 'split': 'train', 
 'sentences': [
    {'tokens': ['a', 'person', 'in', 'red', 'dress', 'and', 'he', 'is', 'seeing', 'his', 'mobile'], 'raw': 'A person in red dress and he is seeing his mobile.', 'sent_id': 104558, 'sent': 'a person in red dress and he is seeing his mobile'}, 
    {'tokens': ['man', 'wearing', 'a', 'red', 'costume'], 'raw': 'Man wearing a red costume.', 'sent_id': 104559, 'sent': 'man wearing a red costume'}
 ], 
 'file_name': 'COCO_train2014_000000573297_472971.jpg', 
 'category_id': 1, 
 'ann_id': 472971, 
 'sent_ids': [104558, 104559], 
 'ref_id': 49821
}

refcoco_g['images'][0]
{'license': 1, 
 'file_name': 'COCO_train2014_000000131074.jpg', 
 'coco_url': 'http://mscoco.org/images/131074', 
 'height': 428, 
 'width': 640, 
 'date_captured': '2013-11-21 01:03:06', 
 'flickr_url': 'http://farm9.staticflickr.com/8308/7908210548_33e532d119_z.jpg', 
 'id': 131074
}
refcoco_g['images'][25799-1]
{'license': 5, 
 'file_name': 'COCO_train2014_000000524286.jpg', 
 'coco_url': 'http://mscoco.org/images/524286', 
 'height': 480, 
 'width': 640, 
 'date_captured': '2013-11-22 01:08:02', 
 'flickr_url': 'http://farm4.staticflickr.com/3286/3160643026_c2691d2c55_z.jpg', 
 'id': 524286
}

refcoco_g['annotations'][0]
{'segmentation': [[21.11, 239.09, 16.31, 274.6, 198.65, 349.45, 240.87, 336.98, 320.52, 293.79, 334.91, 248.69, 357.95, 273.64, 353.15, 289.0, 398.25, 267.88, 437.6, 251.57, 412.65, 228.54, 240.87, 210.31, 219.76, 141.21, 113.24, 153.69, 63.34, 156.57, 26.87, 169.04]], 
 'area': 48667.84089999999, 
 'iscrowd': 0, 
 'image_id': 131074, 
 'bbox': [16.31, 141.21, 421.29, 208.24], 
 'category_id': 65, 
 'id': 318235
}
refcoco_g['annotations'][208960-1]
{'segmentation': [[158.56, 212.49, 158.56, 94.92, 467.06, 85.21, 476.76, 209.26]], 
 'area': 37887.193, 
 'iscrowd': 0, 
 'image_id': 524286, 
 'bbox': [158.56, 85.21, 318.2, 127.28], 
 'category_id': 76, 
 'id': 1635174
}

Dataset composition and conventions:

Paper citations:

RefCOCO / RefCOCO+:
@inproceedings{yu2016modeling,
  title={Modeling context in referring expressions},
  author={Yu, Licheng and Poirson, Patrick and Yang, Shan and Berg, Alexander C and Berg, Tamara L},
  booktitle={Computer Vision -- ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11--14, 2016, Proceedings, Part II},
  pages={69--85},
  year={2016},
  organization={Springer}
}

RefCOCOg:
@inproceedings{mao2016generation,
  title={Generation and comprehension of unambiguous object descriptions},
  author={Mao, Junhua and Huang, Jonathan and Toshev, Alexander and Camburu, Oana and Yuille, Alan L and Murphy, Kevin},
  booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages={11--20},
  year={2016}
}

Source dataset copyright and usage notes:

[Datasets] RefCOCO-family datasets; MS COCO image dataset; RefCOCO; RefCOCO+; RefCOCOg

[Licenses] Attribution-NonCommercial-ShareAlike License; Attribution-NonCommercial License; Attribution-NonCommercial-NoDerivs License; Attribution License; Attribution-ShareAlike License; Attribution-NoDerivs License; No known copyright restrictions; United States Government Work

Dataset 4 (ARPGrounding)

# Evaluation Metric 1: Accuracy (Precision or Accuracy)

Data description:

ARPGrounding is a visual-grounding dataset built specifically to evaluate the compositional-reasoning ability of vision-language models (VLMs). It is constructed on top of the Visual Genome (VG) dataset via dependency parsing followed by manual filtering. ARPGrounding assesses a model's understanding of fine-grained vision-language compositionality, targeting exactly the scenes where existing models are easily confused. Images come from real, complex visual scenes, and the phrases (referring expressions) are designed as ambiguous pairs: depending on differences in attributes, relations, or priority, the model must distinguish two visually similar or semantically related objects within the same image. The data comes in pairs; each sample is matched with a distractor negative sample, and the model must reject the negative sample and correctly localize the positive one.

Dataset composition and conventions:

Source data volume:

The dataset is derived from the 108,249 images of Visual Genome, through filtering and reconstruction.

Evaluation data volume:

The evaluation data contains 11,425 region-image/phrase (referring-expression) instances, split into three subsets: the Attribute subset with 6,632 samples, the Relation subset with 370 samples, and the Priority subset with 4,423 samples.

Source data fields:

ARPGrounding dataset field definitions

KEYS                     EXPLAIN
'attribute_vg.pkl'       attribute-recognition subset
  phrase                 positive/negative phrase describing the object (e.g. red building)
  x                      x-coordinate of the box's top-left corner
  y                      y-coordinate of the box's top-left corner
  w                      box width
  h                      box height
  attributes             list of the object's attributes (e.g. ['brown', 'red'])
  names                  list of the object's names (e.g. ['building'])
  object_id              unique object ID in Visual Genome
  synsets                WordNet synset definition (e.g. ['building.n.01'])
'relationship_vg.pkl'    relation-reasoning subset
  phrase                 phrase describing a spatial or action relation (e.g. grass on top of sand)
  x, y, w, h             box coordinates of the region involved in the relation (xywh format)
'priority_vg.pkl'        priority/saliency subset
  phrase                 phrase describing positional priority or a contrastive relation
  x, y, w, h             box coordinates of the salient region (xywh format)

Data structure hierarchy
  num_images             total number of images in the dataset
  pairs                  list of positive/negative sample pairs per image
  pair[0] (pos)          positive-sample dictionary
  pair[1] (neg)          negative-sample dictionary

Source dataset samples:

import pickle

data = pickle.load(open("attribute_vg.pkl", "rb"))

Sample 1: Attribute - distinguishing colors

Data example (Attribute VG pair, positive and negative)

    [
        {
            'synsets': ['building.n.01'],
            'h': 298,
            'object_id': 1023846,
            'names': ['building'],
            'w': 282,
            'attributes': ['brown', 'red'],
            'y': 13,
            'x': 165,
            'phrase': 'red building'
        },
        {
            'synsets': ['building.n.01'],
            'h': 384,
            'object_id': 1023819,
            'names': ['building'],
            'w': 251,
            'attributes': ['orange', 'brown', 'tall'],
            'y': 4,
            'x': 547,
            'phrase': 'orange building'
        }
    ]

Sample 2: Relation - distinguishing spatial positions

Data example (Relation VG pair, positive and negative)

    [
        {
            'h': 156,
            'w': 799,
            'y': 264,
            'x': 1,
            'phrase': 'grass on top of sand'
        },
        {
            'h': 80,
            'w': 374,
            'y': 468,
            'x': 273,
            'phrase': 'grass in sand'
        },
    ]

Sample 3: Priority - distinguishing the subject object

Data example (Priority VG pair, positive and negative)

    [
        {
            'h': 200,
            'w': 196,
            'y': 391,
            'x': 294,
            'phrase': 'cpu on floor'
        },
        {
            'h': 73,
            'w': 402,
            'y': 526,
            'x': 237,
            'phrase': 'floor under cpu'
        },
    ]

The ARPGrounding dataset consists of three core parts, designed to test a model's compositional-reasoning ability along different dimensions:

  1. Attribute: 6,632 samples. Tests whether a model can distinguish objects of the same category that differ in attributes (color, material, state, etc.), e.g. "the brown dog" vs. "the black dog".
  2. Relation: 370 samples. Tests whether a model can distinguish targets purely by the relation between objects (usually spatial or action relations), e.g. "the laptop on the table" vs. "the laptop under the table".
  3. Priority: 4,423 samples. Tests whether a model can identify the grammatical subject of the text without being misled by other nouns the text mentions, typically involving complex phrasings where the subject and object positions are swapped.
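Because every sample ships with a distractor box, a prediction can be scored against both members of the pair. A minimal sketch of one plausible scoring rule (an assumption for illustration; the paper's exact protocol may differ): the prediction for the positive phrase is correct if it clears an IoU threshold against the positive box and overlaps the positive box more than the negative one.

```python
def iou(a, b):
    """IoU of two boxes given as dicts with 'x', 'y', 'w', 'h' keys,
    matching the ARPGrounding sample records above."""
    iw = max(0, min(a['x'] + a['w'], b['x'] + b['w']) - max(a['x'], b['x']))
    ih = max(0, min(a['y'] + a['h'], b['y'] + b['h']) - max(a['y'], b['y']))
    inter = iw * ih
    union = a['w'] * a['h'] + b['w'] * b['h'] - inter
    return inter / union if union else 0.0

def pair_correct(pred, pair, thresh=0.5):
    """pred: predicted box for the positive phrase; pair: [pos, neg] dicts.

    Correct only if pred both clears the IoU threshold against the
    positive box and matches it better than the distractor box."""
    pos, neg = pair
    return iou(pred, pos) >= thresh and iou(pred, pos) > iou(pred, neg)
```

For example, with the Relation pair shown above, a prediction equal to the "grass on top of sand" box scores correct, while a prediction on the "grass in sand" distractor box does not.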

Paper citation:

@inproceedings{zeng2024investigating,
  title={Investigating Compositional Challenges in Vision-Language Models for Visual Grounding},
  author={Zeng, Yunan and Huang, Yan and Zhang, Jinjin and Jie, Zequn and Chai, Zhenhua and Wang, Liang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages={14141--14151},
  year={2024}
}

Source dataset copyright and usage notes:

[Datasets] ARPGrounding (based on Visual Genome)

[Licenses] This dataset is derived from Visual Genome. Image licenses follow the original Visual Genome image sources (e.g., Flickr) and may vary per image (e.g., Attribution, NonCommercial, NoDerivs, ShareAlike, etc.). Please refer to the Visual Genome website for licensing information and use the images according to their original licenses. ARPGrounding annotations and processing scripts are provided under the same license as this repository unless otherwise stated.