Evaluation Data

MVBench

Accuracy

Adaptation Method:

Multimodal understanding evaluation. The Qwen2.5-VL model processes the video and text inputs directly, with prompt engineering used to guide it toward generating an answer. The model's text responses are then compared against the standard answers to compute accuracy.
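
As a concrete illustration, the sketch below shows one way to run this adaptation with the Hugging Face transformers implementation of Qwen2.5-VL and the qwen_vl_utils helper package. The checkpoint name, video path, prompt wording, and decoding settings are illustrative assumptions, not the exact evaluation configuration.

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Illustrative checkpoint; the evaluation may use a different Qwen2.5-VL size.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# One MVBench-style multiple-choice question rendered into the prompt.
question = ("What is the action performed by the person in the video?\n"
            "(A) Not sure\n"
            "(B) Scattering something down\n"
            "(C) Piling something up\n"
            "Answer with the option's letter.")
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///data/mvbench/166583.webm"},  # hypothetical path
        {"type": "text", "text": question},
    ],
}]

# Build the chat prompt, extract the video frames, and run generation.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
_, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], videos=video_inputs, padding=True,
                   return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=16)

# Strip the prompt tokens and decode only the newly generated answer.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])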

Data Description:

MVBench is a multimodal video understanding benchmark designed specifically for evaluating Large Vision Language Models (LVLMs). The dataset covers 20 video understanding tasks, including action recognition, object interaction, and state change, aiming to comprehensively assess a model's multimodal video understanding capabilities. For each task, 200 question-answer pairs were collected automatically, totaling approximately 4000 samples for efficient evaluation.

Dataset structure:

Amount of source data:

MVBench contains 20 subtasks, with approximately 200 test samples per subtask, totaling about 4000 samples.

Amount of Evaluation data:

The evaluation uses MVBench's complete test set, which includes test samples from all 20 subtasks.
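
A minimal sketch of score aggregation follows, assuming the common MVBench convention of computing accuracy per subtask and reporting the unweighted average over the 20 subtasks; the result-record field names (task, correct) are hypothetical.

from collections import defaultdict

def aggregate(results):
    """results: iterable of dicts like {"task": "Action Count", "correct": True}."""
    per_task = defaultdict(lambda: [0, 0])  # task -> [num_correct, num_total]
    for r in results:
        per_task[r["task"]][0] += int(r["correct"])
        per_task[r["task"]][1] += 1
    task_acc = {t: c / n for t, (c, n) in per_task.items()}
    overall = sum(task_acc.values()) / len(task_acc)  # unweighted mean over subtasks
    return task_acc, overall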

Task Types:

MVBench covers the following 20 video understanding tasks:

  1. Action Sequence
  2. Action Prediction
  3. Action Antonym
  4. Fine-grained Action
  5. Unexpected Action
  6. Object Existence
  7. Object Interaction
  8. Object Shuffle
  9. Moving Direction
  10. Action Localization
  11. Scene Transition
  12. Action Count
  13. Moving Count
  14. Moving Attribute
  15. State Change
  16. Fine-grained Pose
  17. Character Order
  18. Egocentric Navigation
  19. Episodic Reasoning
  20. Counterfactual Inference

Data detail:

KEYS         EXPLAIN
video        Path to the video file
question     Question about the video
candidates   List of candidate answers
answer       Standard answer

Sample of source dataset:

{
  "video": "166583.webm",
  "question": "What is the action performed by the person in the video?",
  "candidates": ["Not sure", "Scattering something down", "Piling something up"],
  "answer": "Piling something up"
}
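
Building on the record format above, the following sketch shows one plausible way to render a sample into a lettered multiple-choice prompt and to match a model response against the standard answer. The prompt wording and the matching rule are illustrative conventions, not the exact evaluation harness.

import string

def build_prompt(sample):
    """Render question + candidates as a lettered multiple-choice prompt."""
    options = "\n".join(
        f"({string.ascii_uppercase[i]}) {c}" for i, c in enumerate(sample["candidates"])
    )
    return (f"{sample['question']}\n{options}\n"
            "Answer with the option's letter from the given choices directly.")

def is_correct(sample, response):
    """Accept either the gold option letter or the full answer text."""
    gold = string.ascii_uppercase[sample["candidates"].index(sample["answer"])]
    resp = response.strip().lstrip("(")
    return resp[:1].upper() == gold or sample["answer"].lower() in response.lower()

sample = {
    "video": "166583.webm",
    "question": "What is the action performed by the person in the video?",
    "candidates": ["Not sure", "Scattering something down", "Piling something up"],
    "answer": "Piling something up",
}
print(build_prompt(sample))
print(is_correct(sample, "(C) Piling something up"))  # -> True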

Citation information:

@inproceedings{li2024mvbench,
  title={{MVBench}: A comprehensive multi-modal video understanding benchmark},
  author={Li, Kunchang and Wang, Yali and He, Yinan and Li, Yizhuo and Wang, Yi and Liu, Yi and Wang, Zun and Xu, Jilan and Chen, Guo and Luo, Ping and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={22195--22206},
  year={2024}
}

Licensing information:

The MVBench dataset is released under an open license for research and non-commercial use.