Evaluation Task and Evaluation Data Introduction

Multimodal Domain Evaluation (Multimodal)

This domain evaluates the multi-dimensional performance of models on tasks such as image-text classification, image-text matching, and image-text generation.

Currently, it includes the following evaluation tasks:

Visual Question Answering: aims to enable computers to answer natural language questions about image content, often using metrics such as Accuracy to measure how well the generated answer matches the reference answer.
Text-to-Image Generation: aims to enable computers to "associate" and "create" based on a given text, automatically generating images that are semantically consistent with the text and realistic in content.
Image-Text Matching: measures the semantic correlation between visual and linguistic content, with the goal of semantic alignment between image and text. It includes two evaluation methods, image-to-text retrieval (i2t) and text-to-image retrieval (t2i), and often uses datasets such as Flickr30k and MS COCO.
Visual Grounding: also known as Referring Expression, a finer-grained task than Image-Text Matching. The input is an image and a sentence or phrase describing a referent, and the output is the referent's bounding box in the image. The RefCOCO, RefCOCO+, and RefCOCOg datasets are used for evaluation.
Video Retrieval: Aims to retrieve semantically relevant video clips from large-scale datasets based on text descriptions. Evaluation typically uses Top-k Recall (R@k), which measures the proportion of relevant videos retrieved within the top-k results, commonly used to assess retrieval coverage and effectiveness.
Video Question Answering: Focuses on fine-grained multimodal understanding by large vision-language models (LVLMs). Given a video and a question, the model generates an answer. The MVBench dataset covers 20 tasks including action prediction, object interaction, and causal reasoning. Accuracy is used as the evaluation metric by comparing model outputs with ground-truth answers.
Color Understanding: Aims to comprehensively evaluate the capabilities of Vision-Language Models (VLMs) in color understanding, including color perception, reasoning, and robustness. The ColorBench dataset is used for evaluation.
Text-to-Video: Aims to perform "association" and "creation" based on a user-provided textual description, automatically producing videos that are semantically consistent, visually plausible, temporally coherent, and logically sound. The MSR-VTT and UCF-101 datasets are used for evaluation.
Text-based Video Retrieval: Aims to automatically locate the video segments most semantically relevant to a given natural language description in a large-scale video database. The MSR-VTT dataset is used for evaluation.
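
The Top-k Recall (R@k) metric mentioned under Video Retrieval can be illustrated with a minimal sketch. It assumes the common single-match setting, in which each text query has exactly one ground-truth video and query i is paired with video i. The function name recall_at_k and the toy scores are hypothetical, chosen only for illustration; they are not taken from any particular evaluation toolkit.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth video is ranked in the top k.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for query_index, scores in enumerate(similarity):
        # Rank video indices by descending score for this query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if query_index in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores, invented for illustration only (3 queries x 3 videos).
toy_similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]

print("R@1:", recall_at_k(toy_similarity, 1))
print("R@2:", recall_at_k(toy_similarity, 2))
```

With these toy scores, two of the three queries rank their ground-truth video first, and all three rank it within the top two. When a query can have several relevant videos, the hit test would need to change accordingly.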
