
Notes

The ground-truth annotations for counting-type questions in the TDIUC dataset are English words, not Arabic numerals. When evaluating, keep this in mind and consider using num2words to convert numeric predictions to the corresponding English words. In contrast, answers to this question type in the VQA2.0 and VQA-CP datasets are mostly Arabic numerals, with words such as "three" appearing mainly inside phrases. Therefore, be mindful of the answer format for this question type when evaluating different datasets.
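
For example, a minimal normalization sketch (assuming the third-party num2words package; the function and variable names are hypothetical):

# Hypothetical helper: convert numeric predictions to English words for TDIUC,
# while leaving Arabic numerals untouched for VQA2.0 / VQA-CP.
# Requires: pip install num2words
from num2words import num2words

def normalize_counting_answer(prediction: str, dataset: str) -> str:
    pred = prediction.strip().lower()
    if dataset == "TDIUC" and pred.isdigit():
        return num2words(int(pred))   # e.g. "3" -> "three"
    return pred

print(normalize_counting_answer("3", "TDIUC"))   # three
print(normalize_counting_answer("3", "VQA2.0"))  # 3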

Evaluation Dataset

VQA2.0

Accuracy

Data description:

VQA is a dataset containing open-ended questions about images. These questions require an understanding of vision, language, and commonsense knowledge to answer. The dataset contains 265,016 images from COCO and abstract scenes, with at least 3 questions (5.4 on average) per image, 10 ground truth answers per question, and 3 plausible (but likely incorrect) answers per question. We use the balanced real-image portion.

Dataset structure:

Amount of source data:

The dataset is split into train (82,783 images), validation (40,504 images), and test (81,434 images) sets. We evaluate on the validation set, which has 214,354 question-answer pairs.

Data detail:

Questions:
KEYS            EXPLAIN
info            information
task_type       type of annotations in the JSON file
data_type       source of the images
data_subtype    type of data subtype
questions       a list of questions
license         name and url of the license
Annotations:
KEYS            EXPLAIN
info            information
data_type       source of the images
data_subtype    type of data subtype
annotations     a list of answers
license         name and url of the license

Sample of source dataset:

Questions:
{
"info" : info,
"task_type" : str,
"data_type": str,
"data_subtype": str,
"questions" : [question],
"license" : license
}

info {
"year" : int,
"version" : str,
"description" : str,
"contributor" : str,
"url" : str,
"date_created" : datetime
}

license{
"name" : str,
"url" : str
}

question{
"question_id" : int,
"image_id" : int,
"question" : str
}
Annotations:
{
"info" : info,
"data_type": str,
"data_subtype": str,
"annotations" : [annotation],
"license" : license
}

info {
"year" : int,
"version" : str,
"description" : str,
"contributor" : str,
"url" : str,
"date_created" : datetime
}

license{
"name" : str,
"url" : str
}

annotation{
"question_id" : int,
"image_id" : int,
"question_type" : str,
"answer_type" : str,
"answers" : [answer],
"multiple_choice_answer" : str
}

answer{
"answer_id" : int,
"answer" : str,
"answer_confidence": str
}
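
Given the answers list above, the VQA accuracy metric treats a predicted answer as fully correct if at least 3 of the 10 annotators gave it. A simplified sketch (omitting the official script's answer normalization and averaging over annotator subsets):

def vqa_soft_accuracy(prediction: str, annotation: dict) -> float:
    # annotation follows the schema above: annotation["answers"] is a list of
    # 10 human answers, each with an "answer" string.
    pred = prediction.strip().lower()
    human_answers = [a["answer"].strip().lower() for a in annotation["answers"]]
    matches = sum(ans == pred for ans in human_answers)
    return min(matches / 3.0, 1.0)  # accuracy = min(#matches / 3, 1)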

Preset method

BLIP

Introduction

BLIP, proposed by Salesforce, is a Transformer-based multimodal pre-training model. By introducing the Multimodal mixture of Encoder-Decoder (MED) architecture, it can operate either as a unimodal encoder (an image encoder and a text encoder), as an image-grounded text encoder, or as an image-grounded text decoder, thereby unifying multimodal understanding and generation tasks. Additionally, it reduces the noise in web-collected captions by incorporating a Captioner-Filter mechanism. BLIP achieves strong performance and is widely applied across many vision-language tasks, including image-text retrieval, image captioning, and visual question answering.

Citation

@inproceedings{li2022blip,
  title={Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation},
  author={Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven},
  booktitle={International conference on machine learning},
  pages={12888--12900},
  year={2022},
  organization={PMLR}
}

Licensing information:

http://creativecommons.org/licenses/by/4.0/

Citation information:

@InProceedings{VQA,
author = {Stanislaw Antol and Aishwarya Agrawal and Jiasen Lu and Margaret Mitchell and Dhruv Batra and C. Lawrence Zitnick and Devi Parikh},
title = {{VQA}: {V}isual {Q}uestion {A}nswering},
booktitle = {International Conference on Computer Vision (ICCV)},
year = {2015},
}

TDIUC

MPT

Data description:

TDIUC is a dataset that divides VQA into 12 constituent tasks, making it easier to measure and compare the performance of VQA algorithms. VQA encompasses many other computer vision problems, e.g., object detection, object classification, attribute classification, positional reasoning, and counting. However, prior datasets are heavily unbalanced toward certain kinds of questions: for example, object-presence questions are far more common than questions requiring positional reasoning, so an algorithm that excels at positional reasoning cannot showcase its abilities on those datasets. TDIUC's performance metrics compensate for this bias. Another issue with other datasets is that many questions can be answered from the question alone, allowing the algorithm to ignore the image. TDIUC introduces absurd questions, which demand that an algorithm look at the image to determine whether the question is appropriate for it.

Dataset structure:

Amount of source data:

The dataset contains 167,437 images (from MS-COCO and Visual Genome), 1,654,167 question-answer pairs, with the training set consisting of 1,115,299 question-answer pairs, and the validation set consisting of 538,868 question-answer pairs.

Data detail:

Questions:
KEYS            EXPLAIN
info            information
task_type       type of annotations in the JSON file
data_type       source of the images
data_subtype    type of data subtype
questions       a list of questions
licence         name and url of the licence
Annotations:
KEYS            EXPLAIN
info            information
task_type       type of annotations in the JSON file
data_type       source of the images
data_subtype    type of data subtype
annotations     a list of answers
licence         name and url of the licence

Sample of dataset:

Questions:
{
"info" : str,
"task_type" : str,
"data_type": str,
"data_subtype": str,
"questions" : [question],
"licence" : str
}

question{
"question_id" : int,
"image_id" : int,
"question" : str
}
Annotations:
{
"info" : str,
"task_type" : str,
"data_type": str,
"data_subtype": str,
"annotations" : [annotation],
"licence" : licence
}

licence{
"name" : str,
"url" : str
}

annotation{
"question_id" : int,
"image_id" : int,
"question_type" : str,
"ans_source" : str,
"answers" : [answer]
}

answer{
"answer_id" : int,
"answer" : str,
"answer_confidence": str
}
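
The MPT (mean per-type) metric averages the per-question-type accuracies so that rare question types (e.g., positional reasoning) weigh as much as frequent ones. A minimal sketch of arithmetic MPT, assuming each prediction has already been scored against the annotation's question_type:

from collections import defaultdict

def mean_per_type_accuracy(scored):
    # scored: iterable of (question_type, is_correct) pairs, one per question
    per_type = defaultdict(list)
    for q_type, correct in scored:
        per_type[q_type].append(float(correct))
    type_accuracies = [sum(v) / len(v) for v in per_type.values()]
    return sum(type_accuracies) / len(type_accuracies)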

Preset method

BLIP

Introduction

BLIP, proposed by Salesforce, is a Transformer-based multimodal pre-training model. By introducing the Multimodal mixture of Encoder-Decoder (MED) architecture, it can operate either as a unimodal encoder (an image encoder and a text encoder), as an image-grounded text encoder, or as an image-grounded text decoder, thereby unifying multimodal understanding and generation tasks. Additionally, it reduces the noise in web-collected captions by incorporating a Captioner-Filter mechanism. BLIP achieves strong performance and is widely applied across many vision-language tasks, including image-text retrieval, image captioning, and visual question answering.

Citation

@inproceedings{li2022blip,
  title={Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation},
  author={Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven},
  booktitle={International conference on machine learning},
  pages={12888--12900},
  year={2022},
  organization={PMLR}
}

Citation information:

@inproceedings{kafle2017analysis,
  title={An Analysis of Visual Question Answering Algorithms},
  author={Kafle, Kushal and Kanan, Christopher},
  booktitle={ICCV},
  year={2017}
}

VQA-CP

Accuracy

Data description:

The Visual Question Answering under Changing Priors (VQA-CP) v1 and v2 datasets were created by re-organizing the train and val splits of the VQA v1 and VQA v2 datasets, respectively, such that the distribution of answers per question type (e.g., "how many", "what color is") differs by design between the test split and the train split. We choose VQA-CP v2 for evaluation.

Dataset structure:

Amount of source data:

The training set consists of 438,183 question-answer pairs, and the test set consists of 219,928 question-answer pairs, with each question having 10 ground truth answers.

Data detail:

Questions:
KEYS                      EXPLAIN
image_id                  id of the image
coco_split                train2014 / val2014
question_id               id of the question
question                  one question
Annotations:
KEYS                      EXPLAIN
image_id                  id of the image
coco_split                train2014 / val2014
question_id               id of the question
question_type             type of the question
answer_type               type of the answer
multiple_choice_answer    correct multiple choice answer
answers                   a list of answers

Sample of source dataset:

Questions:
[{
"question_id" : int,
"image_id" : int,
"coco_split": str,
"question": str
},
...
]
Annotations:
[{
"question_id" : int,
"image_id" : int,
"coco_split" : str,
"question_type" : str,
"answer_type" : str,
"answers" : [answer],
"multiple_choice_answer" : str
},
...
]

answer{
"answer_id" : int,
"answer" : str,
"answer_confidence": str
}
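
Note that, unlike VQA2.0, the VQA-CP question and annotation files are flat JSON lists with no top-level wrapper, so they must be paired by question_id. A minimal loading sketch (the file names are assumptions):

import json

with open("vqacp_v2_test_questions.json") as f:
    questions = json.load(f)      # a flat list of question dicts
with open("vqacp_v2_test_annotations.json") as f:
    annotations = json.load(f)    # a flat list of annotation dicts

anno_by_qid = {a["question_id"]: a for a in annotations}
for q in questions:
    a = anno_by_qid[q["question_id"]]
    # q["question"] is now paired with a["answers"] / a["multiple_choice_answer"]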

Preset method

BLIP

Introduction

BLIP, proposed by Salesforce, is a Transformer-based multimodal pre-training model. By introducing the Multimodal mixture of Encoder-Decoder (MED) architecture, it can operate either as a unimodal encoder (an image encoder and a text encoder), as an image-grounded text encoder, or as an image-grounded text decoder, thereby unifying multimodal understanding and generation tasks. Additionally, it reduces the noise in web-collected captions by incorporating a Captioner-Filter mechanism. BLIP achieves strong performance and is widely applied across many vision-language tasks, including image-text retrieval, image captioning, and visual question answering.

Citation

@inproceedings{li2022blip,
  title={Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation},
  author={Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven},
  booktitle={International conference on machine learning},
  pages={12888--12900},
  year={2022},
  organization={PMLR}
}

Citation information:

@InProceedings{vqa-cp,
author = {Aishwarya Agrawal and Dhruv Batra and Devi Parikh and Aniruddha Kembhavi},
title = {Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2018},
}

ChartQA

Data description:

Charts are very popular for data analysis. When exploring charts, people often ask complex reasoning questions that involve multiple logical and arithmetic operations, and they often refer to the visual features of the charts in their questions. However, most existing datasets do not focus on such complex reasoning questions because their questions are template-based and their answers come from a fixed vocabulary. ChartQA is a large-scale benchmark covering 9.6K manually written questions and 23.1K questions generated from human-written chart summaries.

Dataset structure:

We used the original test dataset, which includes 1250 questions.

Citation information:

@inproceedings{masry-etal-2022-chartqa,
               title = "{C}hart{QA}: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning",
               author = "Masry, Ahmed and
               Long, Do and
               Tan, Jia Qing and
               Joty, Shafiq and
               Hoque, Enamul",
               booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
               month = may,
               year = "2022",
               address = "Dublin, Ireland",
               publisher = "Association for Computational Linguistics",
               url = "https://aclanthology.org/2022.findings-acl.177",
               doi = "10.18653/v1/2022.findings-acl.177",
               pages = "2263--2279",
}

Licensing information:

GPL-3.0 license

CMMMU

Data description:

Similar to its companion MMMU, CMMMU includes 12,000 manually collected multimodal questions from university exams, quizzes, and textbooks across 6 core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Sciences, and Technology & Engineering. These questions cover 30 subjects and include 39 highly heterogeneous image types such as charts, maps, tables, musical notation, and chemical structures.

Dataset structure:

We used the original validation dataset, which includes 900 questions, comprising multiple-choice, true/false, and fill-in-the-blank questions.

Citation information:

@article{zhang2024cmmmu,
         title={CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark},
         author={Ge, Zhang and Xinrun, Du and Bei, Chen and Yiming, Liang and Tongxu, Luo and Tianyu, Zheng and Kang, Zhu and Yuyang, Cheng and Chunpu, Xu and Shuyue, Guo and Haoran, Zhang and Xingwei, Qu and Junjie, Wang and Ruibin, Yuan and Yizhi, Li and Zekun, Wang and Yudong, Liu and Yu-Hsuan, Tsai and Fengji, Zhang and Chenghua, Lin and Wenhao, Huang and Wenhu, Chen and Jie, Fu},
         journal={arXiv preprint arXiv:2401.20847},
         year={2024},
}

Licensing information:

apache-2.0

CMMU

Data description:

There is currently a lack of comprehensive and neutral evaluation benchmarks for Chinese multimodal models. To promote further development of this field, BAAI proposed CMMU, a Chinese multimodal benchmark for comprehension and reasoning across multiple question types. The current CMMU v0.1 release contains 3,603 questions drawn from examinations of the Chinese education system at primary, middle, and high school levels, including multiple-choice and fill-in-the-blank questions, and uses multiple evaluation methods to prevent a model from getting answers right by random guessing.

Dataset structure:

The CMMU v0.1 release includes 3,603 questions, 2,585 of which have an associated answer. The questions are split roughly 1:1 into a validation set (1,800 questions) and a test set (1,803 questions); the validation set is fully public, making it easy for researchers to test models.

  • By school level, there are 250 primary school questions, 1,697 middle school questions, and 1,656 high school questions; primary school questions cover only mathematics, while middle and high school questions cover 7 subjects.
  • Questions are labeled "normal" or "difficult" by experienced teachers, with a distribution of roughly 8:2.

Licensing information:

apache-2.0

HallusionBench

Data description:

HallusionBench is an advanced diagnostic suite designed for evaluating image-context reasoning. The dataset presents significant challenges to advanced large vision-language models (LVLMs) such as GPT-4V(ision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, emphasizing nuanced understanding and interpretation of visual data. HallusionBench consists of 346 images and 1,129 questions, all crafted by human experts. Its goal is to fill the gaps in existing benchmarks' assessment of hallucination by offering more subjects, more image types, and more visual input modalities (including images and videos). In addition, HallusionBench focuses on evaluating language and visual hallucinations beyond the narrow confines of object hallucination.

Dataset structure:

We used the original test dataset, which includes 346 images and 951 questions related to the images.

Citation information:

@misc{guan2023hallusionbench,
      title={HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models},
      author={Tianrui Guan and Fuxiao Liu and Xiyang Wu and Ruiqi Xian and Zongxia Li and Xiaoyu Liu and Xijun Wang and Lichang Chen and Furong Huang and Yaser Yacoob and Dinesh Manocha and Tianyi Zhou},
      year={2023},
      eprint={2310.14566},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Licensing information:

BSD 3-Clause License

MathVista

Data description:

MathVista is a benchmark designed to evaluate the mathematical reasoning ability of large language models (LLMs) and large multimodal models (LMMs) in visual contexts. The dataset consists of 6,141 examples derived from 28 existing multimodal datasets involving mathematical problems and 3 newly created datasets (IQTest, FunctionQA, and PaperQA). Accomplishing these tasks requires deep visual comprehension and compositional reasoning skills that are difficult even for the most advanced foundation models.

Dataset structure:

We used the original testmini dataset, which includes 1,000 questions, comprising multiple-choice and free-form Q&A questions.

Citation information:

@inproceedings{lu2024mathvista,
               author = {Lu, Pan and Bansal, Hritik and Xia, Tony and Liu, Jiacheng and Li, Chunyuan and Hajishirzi, Hannaneh and Cheng, Hao and Chang, Kai-Wei and Galley, Michel and Gao, Jianfeng},
               title = {MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts},
               booktitle = {International Conference on Learning Representations (ICLR)},
               year = {2024}
}

Licensing information:

cc-by-sa-4.0

MMBench

Data description:

MMBench is a multimodal benchmark designed to comprehensively evaluate the performance of large vision-language models (VLMs). The benchmark consists of approximately 3,000 multiple-choice questions covering 20 ability dimensions, designed to systematically assess the capabilities of VLMs in domains such as object localization and social reasoning. Each ability dimension contains more than 75 questions to ensure a balanced and comprehensive assessment of the various abilities.

Dataset structure:

We used the original dev dataset, which includes 1,164 multiple-choice questions, available in both Chinese and English versions.

Citation information:

@article{MMBench,
         author = {Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin},
         journal = {arXiv:2307.06281},
         title = {MMBench: Is Your Multi-modal Model an All-around Player?},
         year = {2023},
}

Licensing information:

Apache-2.0 license

MMMU

Data description:

MMMU (Massive Multi-discipline Multimodal Understanding) is a benchmark that aims to evaluate the performance of multimodal models on large-scale multidisciplinary tasks requiring university-level subject knowledge and deliberate reasoning. MMMU includes a curated collection of 11.5K multimodal questions from university exams, quizzes, and textbooks across 6 core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Sciences, and Technology & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types such as charts, maps, tables, musical notation, and chemical structures.

Dataset structure:

We used the original validation dataset, which includes 900 questions, comprising multiple-choice and fill-in-the-blank questions.

Citation information:

@article{yue2023mmmu,
         title={Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi},
         author={Yue, Xiang and Ni, Yuansheng and Zhang, Kai and Zheng, Tianyu and Liu, Ruoqi and Zhang, Ge and Stevens, Samuel and Jiang, Dongfu and Ren, Weiming and Sun, Yuxuan and others},
         journal={arXiv preprint arXiv:2311.16502},
         year={2023}
}

Licensing information:

apache-2.0

ScienceQA

Data description:

ScienceQA is a large-scale multimodal scientific question-answering dataset developed by UCLA, Arizona State University, and the Allen Institute for AI. The dataset aims to improve the multi-hop reasoning ability and interpretability of AI systems in answering scientific questions. It contains 21,208 questions drawn from elementary to high school science curricula, covering subject areas such as natural science, social science, and language science.

Each question is accompanied by several types of contextual information, such as text, images (including natural images and diagrams), options, and the correct answer. Uniquely, ScienceQA provides not only the correct answer but also detailed lectures and explanations designed to reveal the chain of thought (CoT) behind the solution, similar to the human problem-solving process. These detailed annotations help train and evaluate how well AI models understand and explain complex scientific questions.

Dataset structure:

We used the original test dataset, which includes 2017 questions.

Citation information:

@inproceedings{lu2022learn,
               title={Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering},
               author={Lu, Pan and Mishra, Swaroop and Xia, Tony and Qiu, Liang and Chang, Kai-Wei and Zhu, Song-Chun and Tafjord, Oyvind and Clark, Peter and Ashwin Kalyan},
               booktitle={The 36th Conference on Neural Information Processing Systems (NeurIPS)},
               year={2022}
}

Licensing information:

MIT license

Seed-bench

Data description:

Seed-bench is a large-scale benchmark for evaluating multimodal large language models (MLLMs). It includes 19,000 multiple-choice questions with accurate human annotations, covering 12 evaluation dimensions, including the understanding of both image and video modalities.

Seed-bench was collected in July 2023.

Dataset structure:

From the original dataset, we selected 14,232 multiple-choice questions as the test set for the platform.

Citation information:

@article{li2023seed,
         title={Seed-bench: Benchmarking multimodal llms with generative comprehension},
         author={Li, Bohao and Wang, Rui and Wang, Guangzhi and Ge, Yuying and Ge, Yixiao and Shan, Ying},
         journal={arXiv preprint arXiv:2307.16125},
         year={2023}
}

Licensing information:

cc-by-4.0

TextVQA

Data description:

TextVQA is a dataset focused on visual question answering that aims to push VQA models to understand and process text in images. It contains 45,336 questions over 28,408 images, and answering them requires reasoning about the text present in the images. The questions were asked by human annotators who were instructed to pose questions that can only be answered by reading text in the image. Each question has 10 answers provided by human annotators, and the diversity and complexity of these questions and answers mean that VQA models need the ability to read and understand text.

Dataset structure:

We used the original val dataset, which includes 5000 questions.

Citation information:

@article{Singh2019TextVQA,
         title={Towards VQA Models That Can Read},
         author={Singh, Amanpreet and Natarajan, Vivek and Shah, Meet and Jiang, Yu and Chen, Xinlei and Batra, Dhruv and Parikh, Devi and Rohrbach, Marcus},
         journal={arXiv preprint arXiv:1904.08920},
         year={2019}
}

Licensing information:

CC BY 4.0

Charxiv

Data description:

CharXiv is a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from scientific papers. It includes two types of questions: (1) descriptive questions that examine basic chart elements and (2) reasoning questions that require synthesizing information across complex visual elements in the chart. To ensure quality, all charts and questions are handpicked, curated, and verified by human experts.

Dataset structure:

Amount of source data:

The dataset is divided into a validation set (1k) and a test set (1.32k).

Licensing information:

CC BY-SA 4.0

Citation information:

@article{wang2024charxiv,
  title={CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs},
  author={Wang, Zirui and Xia, Mengzhou and He, Luxi and Chen, Howard and Liu, Yitao and Zhu, Richard and Liang, Kaiqu and Wu, Xindi and Liu, Haotian and Malladi, Sadhika and Chevalier, Alexis and Arora, Sanjeev and Chen, Danqi},
  journal={arXiv preprint arXiv:2406.18521},
  year={2024}
}

CV_Bench

Data description:

CV-Bench (Cambrian Vision-Centric Benchmark) is a comprehensive vision evaluation benchmark dataset containing 2,638 human-validated samples. This dataset evaluates the performance of multi-modal models on classic vision tasks by repurposing standard vision benchmark datasets such as ADE20k, COCO, and OMNI3D. The dataset focuses on two main aspects: 2D understanding (via spatial relationships and object counting) and 3D understanding (via depth order and relative distance). Each sample contains fields such as image, question, multiple options, and correct answer. The benchmark is unique in that it transforms traditional vision tasks into natural language problems, thereby testing the model's basic visual understanding capabilities in a multi-modal environment.

Dataset structure:

Amount of source data:

The dataset contains 2,638 human-validated samples.

Citation information:

@misc{tong2024cambrian1,
      title={Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs}, 
      author={Shengbang Tong and Ellis Brown and Penghao Wu and Sanghyun Woo and Manoj Middepogu and Sai Charitha Akula and Jihan Yang and Shusheng Yang and Adithya Iyer and Xichen Pan and Austin Wang and Rob Fergus and Yann LeCun and Saining Xie},
      year={2024},
      eprint={2406.16860},
}

Math_vers

Data description:

MathVerse is a benchmark dataset specifically designed to evaluate the ability of multimodal large language models (MLLMs) to solve visual math problems. The dataset contains 2,612 high-quality visual math problems covering three main areas (plane geometry, solid geometry, and functions), subdivided into 12 detailed categories. Each problem is transformed into 6 versions providing varying degrees of multimodal information, resulting in about 15,000 test samples in total. The dataset is unique in that it comprehensively assesses whether a model truly understands the mathematical diagrams it reasons over.

Dataset structure:

Amount of source data:

The dataset contains 2,612 high-quality visual math problems, each transformed into 6 versions, for a total of about 15,000 test samples.

Citation information:

@inproceedings{zhang2024mathverse,
  title={MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?},
  author={Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, Hongsheng Li},
  booktitle={arXiv},
  year={2024}
}

MMMU-PRO

Data description:

MMMU-Pro (A More Robust Multi-discipline Multimodal Understanding Benchmark) is an enhanced version of the MMMU benchmark, designed to more rigorously evaluate the multimodal understanding capabilities of advanced AI models. The dataset contains two subsets: a standard subset and a vision subset. The standard subset increases the number of candidate answers from 4 to 10, while the vision subset requires the model to answer by integrating visual and textual information directly from screenshots or photos. The dataset covers multiple subject areas, including art, science, and medicine.

Citation information:

@article{yue2024mmmu,
  title={MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark},
  author={Xiang Yue and Tianyu Zheng and Yuansheng Ni and Yubo Wang and Kai Zhang and Shengbang Tong and Yuxuan Sun and Botao Yu and Ge Zhang and Huan Sun and Yu Su and Wenhu Chen and Graham Neubig},
  journal={arXiv preprint arXiv:2409.02813},
  year={2024}
}

MM-Vet_v2

Data description:

MM-Vet v2 is an enhanced version of the MM-Vet benchmark, designed to evaluate the integrated capabilities of large multimodal models. On top of the original six core vision-language capabilities (recognition, knowledge, spatial awareness, language generation, OCR, and math), the dataset adds a new capability, image-text sequence understanding, to better reflect the interleaved image-text sequences that occur in real-world scenarios. The benchmark provides a more comprehensive and rigorous standard for evaluating the practical capabilities of multimodal models.

Ocrbench

Data description:

OCRBench is a comprehensive evaluation benchmark designed to assess the capabilities of large multimodal models on text-related visual tasks. It covers five main tasks: text recognition, scene text-centric visual question answering (VQA), document-oriented VQA, key information extraction (KIE), and handwritten mathematical expression recognition (HMER). The dataset contains 1,000 manually verified and corrected question-answer pairs drawn from 29 sub-datasets.

Dataset structure:

Amount of source data:

The dataset contains 1000 manually verified and corrected question-answer pairs across 29 sub-datasets.

Citation information:

@misc{liu2024ocrbenchhiddenmysteryocr,
      title={OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models}, 
      author={Yuliang Liu and Zhang Li and Mingxin Huang and Biao Yang and Wenwen Yu and Chunyuan Li and Xucheng Yin and Cheng-lin Liu and Lianwen Jin and Xiang Bai},
      year={2024},
      eprint={2305.07895},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2305.07895}, 
}

CII-Bench

Data description:

CII-Bench (Chinese Image Implication Understanding Benchmark) is the first benchmark designed to evaluate the ability of multimodal large language models to understand the deeper implications of Chinese images. The dataset contains 698 images covering six domains: life, art, society, politics, the environment, and traditional Chinese culture, with a total of 800 multiple-choice questions. It is unique in that all images were collected from the Chinese Internet and manually reviewed, with particular attention to famous paintings and other content that deeply reflects traditional Chinese culture.

Dataset structure:

Amount of source data:

The dataset contains 698 Chinese images covering six domains (life, art, society, politics, the environment, and traditional Chinese culture), with a total of 800 multiple-choice questions.

Citation information:

@misc{zhang2024mllmsunderstanddeepimplication,
      title={Can MLLMs Understand the Deep Implication Behind Chinese Images?}, 
      author={Chenhao Zhang and Xi Feng and Yuelin Bai and Xinrun Du and Jinchang Hou and Kaixin Deng and Guangzeng Han and Qinrui Li and Bingli Wang and Jiaheng Liu and Xingwei Qu and Yifei Zhang and Qixuan Zhao and Yiming Liang and Ziqiang Liu and Feiteng Fang and Min Yang and Wenhao Huang and Chenghua Lin and Ge Zhang and Shiwen Ni},
      year={2024},
      eprint={2410.13854},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.13854}, 
}

Math_Vision

Data description:

MATH-V (MATH-Vision) is a benchmark dataset specifically designed to evaluate the mathematical reasoning capabilities of large multimodal models. It contains 3,040 high-quality mathematics problems drawn from real mathematics competitions, covering 16 mathematical disciplines (including algebra, analytic geometry, arithmetic, and combinatorial geometry) and divided into 5 difficulty levels. The dataset is unique in its comprehensiveness and realism, enabling in-depth assessment of a model's visual mathematical reasoning capabilities.

Dataset structure:

Amount of source data:

The dataset contains 3,040 high-quality mathematics problems drawn from real mathematics competitions, covering 16 mathematical disciplines (including algebra, analytic geometry, arithmetic, and combinatorial geometry) and divided into 5 difficulty levels.

Citation information:

@misc{wang2024measuring,
      title={Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset}, 
      author={Ke Wang and Junting Pan and Weikang Shi and Zimu Lu and Mingjie Zhan and Hongsheng Li},
      year={2024},
      eprint={2402.14804},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}