Evaluation Data
MVBench
Adaptation Method:
Multimodal understanding evaluation: the qwen2.5-vl model directly processes the video and text inputs, with prompt engineering used to guide the model toward a well-formed answer. The model's text responses are then compared against the reference answers to compute accuracy.
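The scoring procedure described above can be sketched as a short loop. Here `query_model` is a hypothetical stand-in for a qwen2.5-vl inference call, and the lettered-option prompt format is an assumption for illustration, not the benchmark's exact template.

```python
# Minimal sketch of the multiple-choice evaluation loop described above.
# `query_model` is a hypothetical stand-in for a qwen2.5-vl inference call.
from typing import Callable


def format_prompt(question: str, candidates: list[str]) -> str:
    """Render the question and options as a lettered multiple-choice prompt."""
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(candidates))
    return f"{question}\nOptions:\n{options}\nAnswer with the option's letter."


def evaluate(samples: list[dict], query_model: Callable[[str, str], str]) -> float:
    """Compare the model's chosen letters against the gold answers; return accuracy."""
    correct = 0
    for s in samples:
        prompt = format_prompt(s["question"], s["candidates"])
        reply = query_model(s["video"], prompt).strip().upper()
        gold_letter = chr(65 + s["candidates"].index(s["answer"]))
        correct += reply.startswith(gold_letter)
    return correct / len(samples)
```

In practice a robust answer parser is also needed, since free-form model output does not always begin with a clean option letter.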
Data Description:
MVBench is a multimodal video understanding benchmark designed specifically for evaluating Large Vision Language Models (LVLMs). The dataset covers 20 video understanding tasks, including action recognition, object interaction, state change, and more, aiming to comprehensively assess a model's multimodal video understanding capabilities. For each task, 200 question-answer pairs were collected automatically, giving approximately 4,000 data points for efficient evaluation.
Dataset structure:
Amount of source data:
MVBench contains 20 subtasks with approximately 200 test samples each, about 4,000 samples in total.
Amount of Evaluation data:
The evaluation uses MVBench's complete test set, which includes test samples from all 20 subtasks.
Task Types:
MVBench covers the following 20 video understanding tasks:
- Action Sequence
- Action Prediction
- Action Antonym
- Fine-grained Action
- Unexpected Action
- Object Existence
- Object Interaction
- Object Shuffle
- Moving Direction
- Action Localization
- Scene Transition
- Action Count
- Moving Count
- Moving Attribute
- State Change
- Fine-grained Pose
- Character Order
- Egocentric Navigation
- Episodic Reasoning
- Counterfactual Inference
Data detail:
| KEYS | EXPLAIN |
|---|---|
| video | Path to the video file |
| question | Question about the video |
| candidates | List of candidate answers |
| answer | Reference answer (one of the candidates) |
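A record with these keys can be checked by a small loader before evaluation. The `REQUIRED_KEYS` constant and `load_record` helper below are illustrative names, not part of the benchmark's tooling.

```python
import json

# Keys from the table above; names here are illustrative, not official tooling.
REQUIRED_KEYS = {"video", "question", "candidates", "answer"}


def load_record(line: str) -> dict:
    """Parse one JSON record and verify it matches the schema above."""
    rec = json.loads(line)
    missing = REQUIRED_KEYS - rec.keys()
    if missing:
        raise ValueError(f"record missing keys: {sorted(missing)}")
    if rec["answer"] not in rec["candidates"]:
        raise ValueError("the reference answer must appear among the candidates")
    return rec
```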
Sample of source dataset:
{
"video": "166583.webm",
"question": "What is the action performed by the person in the video?",
"candidates": ["Not sure", "Scattering something down", "Piling something up"],
"answer": "Piling something up"
}
Citation information:
@inproceedings{li2024mvbench,
  title={MVBench: A Comprehensive Multi-modal Video Understanding Benchmark},
  author={Li, Kunchang and Wang, Yali and He, Yinan and Li, Yizhuo and Wang, Yi and Liu, Yi and Wang, Zun and Xu, Jilan and Chen, Guo and Luo, Ping and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages={22195--22206},
  year={2024}
}
Licensing information:
The MVBench dataset is released under an open license for research and non-commercial use.
Animal-Bench
Adaptation Method:
Multimodal understanding evaluation: vision-language models directly process the video and text inputs, with prompt engineering guiding answer generation. The model's text responses are compared against the reference answers to compute accuracy. This benchmark focuses on how well models generalize to non-human-centric scenarios, addressing the "agent bias" that existing models exhibit when interpreting animal behavior and ecological environments.
Data Description:
Animal-Bench is a multimodal video understanding benchmark dataset focused on animal-centric scenarios, designed specifically to evaluate the understanding capabilities of large vision-language models (LVLMs) in natural animal scenes.
Published at NeurIPS 2024, the dataset covers 7 major animal categories and 819 species across 13 video understanding tasks, involving animal behavior, conservation-relevant characteristics, and interactions with complex natural environments. The data is constructed through an automated pipeline and undergoes strict manual validation, aiming to comprehensively evaluate model perception and reasoning in real wild environments.
Dataset Composition and Specifications:
Amount of source data:
Animal-Bench contains 13 subtasks, covering various ecological environments such as land, ocean, and sky, with a total of 41,839 question-answer pairs.
Amount of evaluation data:
The evaluation uses the complete test set of Animal-Bench, including test samples from all 13 tasks.
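Scores over the 13 subtasks can be combined into per-task accuracies and an overall figure. The unweighted macro-average below is one reasonable aggregation shown as a sketch, not necessarily the benchmark's official scoring.

```python
from collections import defaultdict


def per_task_accuracy(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Turn (task_name, is_correct) pairs into an accuracy per subtask."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for task, ok in results:
        totals[task] += 1
        hits[task] += ok
    return {task: hits[task] / totals[task] for task in totals}


def macro_average(acc: dict[str, float]) -> float:
    """Unweighted mean over subtasks, so larger subtasks do not dominate."""
    return sum(acc.values()) / len(acc)
```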
Task Types:
Animal-Bench covers the following 13 animal video understanding tasks, divided into common tasks and special tasks:
Common Tasks
- Object: Object Existence, Object Recognition
- Action: Action Recognition, Action Sequence, Action Prediction
- Time: Action Localization
- Count: Action Count, Object Count
- Reasoning: Abductive Reasoning
Special Tasks
- Predator-Prey Behavior Monitoring
- Social Interaction Analysis
- Breeding Behavior Monitoring
- Stress and Pain Detection
Source Dataset Example:
{
"video": "leopard_hunt_001.mp4",
"question": "What behavior is the leopard demonstrating in the video?",
"candidates": ["Sleeping", "Hunting", "Playing", "Grooming"],
"answer": "Hunting"
}
Citation:
@inproceedings{jing2024animalbench,
title={Animal-Bench: Benchmarking Multimodal Video Models for Animal-centric Video Understanding},
author={Jing, Yinuo and Zhang, Ruxu and Liang, Kongming and Li, Yongxiang and He, Zhongjiang and Ma, Zhanyu and Guo, Jun},
booktitle={Proceedings of the Advances in Neural Information Processing Systems (NeurIPS)},
  year={2024},
  url={https://github.com/PRIS-CV/Animal-Bench}
}
Source Dataset Copyright Usage Instructions:
The Animal-Bench dataset is released under an open license for research and non-commercial use.