Evaluation Data
MVBench
Adaptation Method:
Multimodal understanding evaluation: the qwen2.5-vl model directly processes the video and text inputs, with prompt engineering used to guide the model toward a well-formed answer. The model's text responses are then compared against the reference answers to compute accuracy.
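The scoring procedure described above can be sketched as a short loop. Here `query_model` is a hypothetical stand-in for a qwen2.5-vl inference call, and the lettered-option prompt format is an assumption for illustration, not the benchmark's exact template.

```python
# Minimal sketch of the multiple-choice evaluation loop described above.
# `query_model` is a hypothetical stand-in for a qwen2.5-vl inference call.
from typing import Callable


def format_prompt(question: str, candidates: list[str]) -> str:
    """Render the question and options as a lettered multiple-choice prompt."""
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(candidates))
    return f"{question}\nOptions:\n{options}\nAnswer with the option's letter."


def evaluate(samples: list[dict], query_model: Callable[[str, str], str]) -> float:
    """Compare the model's chosen letters against the gold answers; return accuracy."""
    correct = 0
    for s in samples:
        prompt = format_prompt(s["question"], s["candidates"])
        reply = query_model(s["video"], prompt).strip().upper()
        gold_letter = chr(65 + s["candidates"].index(s["answer"]))
        correct += reply.startswith(gold_letter)
    return correct / len(samples)
```

In practice a robust answer parser is also needed, since free-form model output does not always begin with a clean option letter.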
Data Description:
MVBench is a multimodal video understanding benchmark designed specifically for evaluating Large Vision Language Models (LVLMs). The dataset covers 20 video understanding tasks, including action recognition, object interaction, state change, and more, aiming to comprehensively assess a model's multimodal video understanding capabilities. For each task, 200 question-answer pairs were collected automatically, giving approximately 4,000 data points for efficient evaluation.
Dataset structure:
Amount of source data:
MVBench contains 20 subtasks with approximately 200 test samples each, about 4,000 samples in total.
Amount of Evaluation data:
The evaluation uses MVBench's complete test set, which includes test samples from all 20 subtasks.
Task Types:
MVBench covers the following 20 video understanding tasks:
- Action Sequence
- Action Prediction
- Action Antonym
- Fine-grained Action
- Unexpected Action
- Object Existence
- Object Interaction
- Object Shuffle
- Moving Direction
- Action Localization
- Scene Transition
- Action Count
- Moving Count
- Moving Attribute
- State Change
- Fine-grained Pose
- Character Order
- Egocentric Navigation
- Episodic Reasoning
- Counterfactual Inference
Data detail:
| KEYS | EXPLAIN |
|---|---|
| video | Path to the video file |
| question | Question about the video |
| candidates | List of candidate answers |
| answer | Reference answer (one of the candidates) |
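A record with these keys can be checked by a small loader before evaluation. The `REQUIRED_KEYS` constant and `load_record` helper below are illustrative names, not part of the benchmark's tooling.

```python
import json

# Keys from the table above; names here are illustrative, not official tooling.
REQUIRED_KEYS = {"video", "question", "candidates", "answer"}


def load_record(line: str) -> dict:
    """Parse one JSON record and verify it matches the schema above."""
    rec = json.loads(line)
    missing = REQUIRED_KEYS - rec.keys()
    if missing:
        raise ValueError(f"record missing keys: {sorted(missing)}")
    if rec["answer"] not in rec["candidates"]:
        raise ValueError("the reference answer must appear among the candidates")
    return rec
```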
Sample of source dataset:
{
"video": "166583.webm",
"question": "What is the action performed by the person in the video?",
"candidates": ["Not sure", "Scattering something down", "Piling something up"],
"answer": "Piling something up"
}
Citation information:
@inproceedings{li2024mvbench,
  title={MVBench: A Comprehensive Multi-modal Video Understanding Benchmark},
  author={Li, Kunchang and Wang, Yali and He, Yinan and Li, Yizhuo and Wang, Yi and Liu, Yi and Wang, Zun and Xu, Jilan and Chen, Guo and Luo, Ping and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages={22195--22206},
  year={2024}
}
Licensing information:
The MVBench dataset is released under an open license for research and non-commercial use.
Animal-Bench
Adaptation Method:
Multimodal understanding evaluation: vision-language models directly process the video and text inputs, with prompt engineering guiding answer generation. The model's text responses are compared against the reference answers to compute accuracy. This benchmark focuses on how well models generalize to non-human-centric scenarios, addressing the "agent bias" that existing models exhibit when interpreting animal behavior and ecological environments.
Data Description:
Animal-Bench is a multimodal video understanding benchmark dataset focused on animal-centric scenarios, designed specifically to evaluate the understanding capabilities of large vision-language models (LVLMs) in natural animal scenes.
Published at NeurIPS 2024, the dataset covers 7 major animal categories and 819 species across 13 video understanding tasks, involving animal behavior, conservation-relevant characteristics, and interactions with complex natural environments. The data is constructed through an automated pipeline and undergoes strict manual validation, aiming to comprehensively evaluate model perception and reasoning in real wild environments.
Dataset Composition and Specifications:
Amount of source data:
Animal-Bench contains 13 subtasks, covering various ecological environments such as land, ocean, and sky, with a total of 41,839 question-answer pairs.
Amount of evaluation data:
The evaluation uses the complete test set of Animal-Bench, including test samples from all 13 tasks.
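Scores over the 13 subtasks can be combined into per-task accuracies and an overall figure. The unweighted macro-average below is one reasonable aggregation shown as a sketch, not necessarily the benchmark's official scoring.

```python
from collections import defaultdict


def per_task_accuracy(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Turn (task_name, is_correct) pairs into an accuracy per subtask."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for task, ok in results:
        totals[task] += 1
        hits[task] += ok
    return {task: hits[task] / totals[task] for task in totals}


def macro_average(acc: dict[str, float]) -> float:
    """Unweighted mean over subtasks, so larger subtasks do not dominate."""
    return sum(acc.values()) / len(acc)
```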
Task Types:
Animal-Bench covers the following 13 animal video understanding tasks, divided into common tasks and special tasks:
Common Tasks
- Object: Object Existence, Object Recognition
- Action: Action Recognition, Action Sequence, Action Prediction
- Time: Action Localization
- Count: Action Count, Object Count
- Reasoning: Abductive Reasoning
Special Tasks
- Predator-Prey Behavior Monitoring
- Social Interaction Analysis
- Breeding Behavior Monitoring
- Stress and Pain Detection
Source Dataset Example:
{
"video": "leopard_hunt_001.mp4",
"question": "What behavior is the leopard demonstrating in the video?",
"candidates": ["Sleeping", "Hunting", "Playing", "Grooming"],
"answer": "Hunting"
}
Citation:
@inproceedings{jing2024animalbench,
title={Animal-Bench: Benchmarking Multimodal Video Models for Animal-centric Video Understanding},
author={Jing, Yinuo and Zhang, Ruxu and Liang, Kongming and Li, Yongxiang and He, Zhongjiang and Ma, Zhanyu and Guo, Jun},
booktitle={Proceedings of the Advances in Neural Information Processing Systems (NeurIPS)},
  year={2024},
  url={https://github.com/PRIS-CV/Animal-Bench}
}
Source Dataset Copyright Usage Instructions:
The Animal-Bench dataset is released under an open license for research and non-commercial use.