Evaluation Task and Evaluation Data Introduction

Multimodal Domain Evaluation (Multimodal)

This domain evaluates the multi-dimensional performance of models on tasks such as image-text classification, image-text matching, and image-text generation.

Currently, it includes the following evaluation tasks:

  • Visual Question Answering: requires a model to answer natural-language questions about the content of an image. Metrics such as Accuracy are often used to measure how closely the generated answer matches the reference answer.
  • Text-to-Image Generation: requires a model to "associate" and "create" from a given text, automatically generating images that are semantically consistent with the text and realistic in content.
  • Image-Text Matching: measures the semantic correlation between visual and linguistic content, with the goal of semantic alignment between image and text. It includes two evaluation methods, image-to-text retrieval (i2t) and text-to-image retrieval (t2i), and often uses datasets such as Flickr30k and MS COCO.
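
As a rough illustration of how two of these tasks might be scored, the sketch below computes an exact-match Accuracy for visual question answering and a Recall@K for i2t and t2i retrieval. It is a minimal sketch under stated assumptions, not the platform's actual implementation: the function names are hypothetical, Recall@K is a commonly used retrieval metric that the text above does not name, the answer normalization (lowercasing and whitespace collapsing) is an assumption, and each image is assumed to have exactly one matching text at the same index (datasets such as Flickr30k provide several captions per image, which this simplification ignores).

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predicted answers that equal the reference answer.

    Assumption: answers are compared after lowercasing and collapsing
    whitespace; real VQA benchmarks may normalize differently.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0

    def normalize(text):
        return " ".join(text.lower().split())

    correct = sum(normalize(p) == normalize(r)
                  for p, r in zip(predictions, references))
    return correct / len(predictions)


def recall_at_k(similarity, k):
    """Recall@K where the correct candidate for query i is candidate i.

    similarity[i][j] is the model's score between query i and candidate j.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Indices of the k highest-scoring candidates for query i.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return hits / len(similarity)


def transpose(matrix):
    """Swap rows and columns so that texts become the queries."""
    return [list(column) for column in zip(*matrix)]


# Toy image-text similarity matrix: rows are images, columns are texts.
image_text_scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
]

i2t_recall_at_1 = recall_at_k(image_text_scores, k=1)             # images query texts
t2i_recall_at_1 = recall_at_k(transpose(image_text_scores), k=1)  # texts query images
vqa_accuracy = exact_match_accuracy(["Two dogs", "red"], ["two dogs", "blue"])
```

Reusing one similarity matrix for both directions shows that i2t and t2i differ only in which modality acts as the query: i2t ranks texts for each image (rows), while t2i ranks images for each text (columns).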