Evaluation Data

MVBench

Evaluation Metric - Accuracy

Adaptation Method:

Multimodal understanding evaluation: the Qwen2.5-VL model processes video and text inputs directly, with prompt engineering used to guide it toward generating answers. Accuracy is computed by comparing the model's text responses against the standard answers.

Data Description:

MVBench is a multimodal video understanding benchmark designed specifically for evaluating Large Vision Language Models (LVLMs). The dataset covers 20 video understanding tasks, including action recognition, object interaction, state change, and more, aiming to comprehensively assess a model's multimodal video understanding capabilities. For each task, 200 question-answer pairs were collected automatically, yielding roughly 4,000 data points for efficient evaluation.

Dataset structure:

Amount of source data:

MVBench contains 20 subtasks, with approximately 200 test samples per subtask, totaling about 4000 samples.

Amount of Evaluation data:

The evaluation uses MVBench's complete test set, which includes test samples from all 20 subtasks.

Task Types:

MVBench covers the following 20 video understanding tasks:

  1. Action Sequence
  2. Action Prediction
  3. Action Antonym
  4. Fine-grained Action
  5. Unexpected Action
  6. Object Existence
  7. Object Interaction
  8. Object Shuffle
  9. Moving Direction
  10. Action Localization
  11. Scene Transition
  12. Action Count
  13. Moving Count
  14. Moving Attribute
  15. State Change
  16. Fine-grained Pose
  17. Character Order
  18. Egocentric Navigation
  19. Episodic Reasoning
  20. Counterfactual Inference

Data detail:

KEYS        EXPLAIN
video       Path to the video file
question    Question about the video
candidates  List of candidate answers
answer      Standard answer

Sample of source dataset:

{
  "video": "166583.webm",
  "question": "What is the action performed by the person in the video?",
  "candidates": ["Not sure", "Scattering something down", "Piling something up"],
  "answer": "Piling something up"
}
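A record in this format can be turned into a multiple-choice prompt before being passed to the model together with the video. The sketch below is illustrative: the lettered-option layout and the closing instruction are assumptions, since the benchmark only provides the candidate strings.

```python
def build_prompt(sample: dict) -> str:
    """Format an MVBench-style record as a multiple-choice text prompt.

    The option letters (A, B, C, ...) are an illustrative convention,
    not part of the dataset itself.
    """
    lines = [sample["question"]]
    for i, cand in enumerate(sample["candidates"]):
        lines.append(f"({chr(ord('A') + i)}) {cand}")
    lines.append("Answer with the letter of the best option.")
    return "\n".join(lines)


sample = {
    "video": "166583.webm",
    "question": "What is the action performed by the person in the video?",
    "candidates": ["Not sure", "Scattering something down", "Piling something up"],
    "answer": "Piling something up",
}
print(build_prompt(sample))
```

The `answer` field is not included in the prompt; it is kept aside as the reference for scoring.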

Citation information:

@inproceedings{li2024mvbench,
  title={Mvbench: A comprehensive multi-modal video understanding benchmark},
  author={Li, Kunchang and Wang, Yali and He, Yinan and Li, Yizhuo and Wang, Yi and Liu, Yi and Wang, Zun and Xu, Jilan and Chen, Guo and Luo, Ping and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={22195--22206},
  year={2024}
}

Licensing information:

The MVBench dataset is released under an open license for research and non-commercial use.

Animal-Bench

Evaluation Metric - Accuracy

Adaptation Method:

Multimodal understanding evaluation: a vision-language model processes video and text inputs directly, with prompt engineering used to guide it toward generating answers. Accuracy is computed by comparing the model's text responses against the standard answers. This benchmark focuses in particular on model generalization in non-human-centric scenarios, aiming to address the "agent bias" that existing models exhibit when understanding animal behavior and ecological environments.
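Because the benchmark spans many subtasks, results are typically reported both per task and overall. A minimal aggregation sketch, assuming each scored sample is recorded as a `(task_name, is_correct)` pair (the pair format is an illustrative choice, not part of the benchmark):

```python
from collections import defaultdict


def per_task_accuracy(results):
    """Aggregate (task_name, is_correct) pairs into per-task and overall accuracy.

    `results` is a list of (str, bool) pairs, e.g. collected while
    iterating over the benchmark's subtasks.
    """
    totals = defaultdict(int)  # samples seen per task
    hits = defaultdict(int)    # correct answers per task
    for task, ok in results:
        totals[task] += 1
        hits[task] += int(ok)
    per_task = {t: hits[t] / totals[t] for t in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_task, overall
```

Note that `overall` here is sample-weighted; averaging the per-task accuracies instead would weight each task equally, and which convention a leaderboard uses should be checked against its own documentation.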

Data Description:

Animal-Bench is a multimodal video understanding benchmark dataset focused on animal-centric scenarios, designed specifically to evaluate the understanding capabilities of large vision-language models (LVLMs) in natural animal scenes.

Published at NeurIPS 2024, this dataset covers 7 major animal categories, 819 species, and includes 13 video understanding tasks, involving animal behavior, conservation biology characteristics, and complex natural environment interactions. The data is constructed through an automated pipeline and undergoes strict manual validation, aiming to comprehensively evaluate the perception and reasoning capabilities of models in real wild environments.

Dataset Composition and Specifications:

Amount of source data:

Animal-Bench contains 13 subtasks, covering various ecological environments such as land, ocean, and sky, with a total of 41,839 question-answer pairs.

Amount of evaluation data:

The evaluation uses the complete test set of Animal-Bench, including test samples from all 13 tasks.

Task Types:

Animal-Bench covers the following 13 animal video understanding tasks, divided into common tasks and special tasks:

Common Tasks

- Object: Object Existence, Object Recognition
- Action: Action Recognition, Action Sequence, Action Prediction
- Time: Action Localization
- Count: Action Count, Object Count
- Reasoning: Abductive Reasoning

Special Tasks

- Predator-Prey Behavior Monitoring
- Social Interaction Analysis
- Breeding Behavior Monitoring
- Stress and Pain Detection

Source Dataset Example:

{
    "video": "leopard_hunt_001.mp4",
    "question": "What behavior is the leopard demonstrating in the video?",
    "candidates": ["Sleeping", "Hunting", "Playing", "Grooming"],
    "answer": "Hunting"
}

Citation:

@inproceedings{jing2024animalbench,
  title={Animal-Bench: Benchmarking Multimodal Video Models for Animal-centric Video Understanding},
  author={Jing, Yinuo and Zhang, Ruxu and Liang, Kongming and Li, Yongxiang and He, Zhongjiang and Ma, Zhanyu and Guo, Jun},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2024},
  url={https://github.com/PRIS-CV/Animal-Bench}
}

Licensing information:

The Animal-Bench dataset is released under an open license for research and non-commercial use.