Evaluation Data
MVBench
Adaptation Method:
Multimodal understanding evaluation: the qwen2.5-vl model processes the video and text inputs directly, with prompt engineering used to guide it toward an answer. The model's text responses are then compared against the ground-truth answers to compute accuracy.
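As a rough illustration of this setup, the sketch below runs a single video question through qwen2.5-vl via the Hugging Face transformers interface. It is a minimal sketch, not the exact harness used here: the checkpoint name, generation settings, and prompt wording are assumptions.

```python
# Minimal sketch: one MVBench-style video question through qwen2.5-vl.
# Requires: transformers >= 4.49, qwen-vl-utils (pip install qwen-vl-utils).
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"  # assumed checkpoint, for illustration
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

def answer_video_question(video_path: str, prompt: str) -> str:
    """Feed one video plus a text prompt to the model; return its text reply."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "video", "video": video_path},
            {"type": "text", "text": prompt},
        ],
    }]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    _, video_inputs = process_vision_info(messages)  # extracts video frames
    inputs = processor(
        text=[text], videos=video_inputs, padding=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        generated = model.generate(**inputs, max_new_tokens=32)
    # Strip the prompt tokens so only the newly generated answer is decoded.
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
```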
Data Description:
MVBench is a multimodal video understanding benchmark designed specifically for evaluating Large Vision Language Models (LVLMs). The dataset covers 20 video understanding tasks, including action recognition, object interaction, and state change, aiming to comprehensively assess a model's multimodal video understanding capabilities. For each task, 200 question-answer pairs were collected automatically, yielding approximately 4,000 samples in total for efficient evaluation.
Dataset structure:
Amount of source data:
MVBench contains 20 subtasks, with approximately 200 test samples per subtask, totaling about 4000 samples.
Amount of Evaluation data:
The evaluation uses MVBench's complete test set, which includes test samples from all 20 subtasks.
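Scores on MVBench are commonly summarized as per-subtask accuracy together with the mean over the 20 subtasks. A minimal sketch of that aggregation (the function name and equal task weighting are illustrative assumptions):

```python
def mvbench_average(per_task_acc: dict[str, float]) -> float:
    """Unweighted mean accuracy over subtasks.

    With roughly 200 samples per subtask, this task-level mean stays
    close to the sample-level (micro) accuracy.
    """
    return sum(per_task_acc.values()) / len(per_task_acc)
```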
Task Types:
MVBench covers the following 20 video understanding tasks:
- Action Sequence
- Action Prediction
- Action Antonym
- Fine-grained Action
- Unexpected Action
- Object Existence
- Object Interaction
- Object Shuffle
- Moving Direction
- Action Localization
- Scene Transition
- Action Count
- Moving Count
- Moving Attribute
- State Change
- Fine-grained Pose
- Character Order
- Egocentric Navigation
- Episodic Reasoning
- Counterfactual Inference
Data detail:
| Key | Description |
| --- | --- |
| video | Path to the video file |
| question | Question about the video |
| candidates | List of candidate answers |
| answer | Ground-truth answer |
Sample of source dataset:
```json
{
  "video": "166583.webm",
  "question": "What is the action performed by the person in the video?",
  "candidates": ["Not sure", "Scattering something down", "Piling something up"],
  "answer": "Piling something up"
}
```
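For illustration only, the sketch below turns such a record into a lettered multiple-choice prompt and scores a model reply against the answer field. The option layout and the lenient matching heuristic are assumptions, not the benchmark's mandated protocol.

```python
import string

def build_prompt(sample: dict) -> str:
    """Format an MVBench record as a lettered multiple-choice question."""
    letters = string.ascii_uppercase
    options = "\n".join(
        f"({letters[i]}) {c}" for i, c in enumerate(sample["candidates"])
    )
    return (
        f"Question: {sample['question']}\n"
        f"Options:\n{options}\n"
        "Answer with the option's letter from the given choices directly."
    )

def is_correct(response: str, sample: dict) -> bool:
    """Accept either the gold option letter or the gold answer text."""
    gold_letter = string.ascii_uppercase[
        sample["candidates"].index(sample["answer"])
    ]
    resp = response.strip()
    # Lenient heuristic: match a leading option letter or the answer text.
    return (resp.upper().lstrip("(").startswith(gold_letter)
            or sample["answer"].lower() in resp.lower())
```

Per-subtask accuracy is then the fraction of correct replies, e.g. `sum(is_correct(r, s) for r, s in zip(replies, samples)) / len(samples)`.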
Citation information:
```bibtex
@inproceedings{li2024mvbench,
  title={MVBench: A comprehensive multi-modal video understanding benchmark},
  author={Li, Kunchang and Wang, Yali and He, Yinan and Li, Yizhuo and Wang, Yi and Liu, Yi and Wang, Zun and Xu, Jilan and Chen, Guo and Luo, Ping and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={22195--22206},
  year={2024}
}
```
Licensing information:
The MVBench dataset is released under an open license for research and non-commercial use.