Evaluation Metrics
1. Recall
In query-based information retrieval tasks such as image-text matching, 'recall' refers to the fraction of evaluation instances (e.g., the 1,000 images and 5,000 sentences of the Flickr30k test set) for which the model ranks a correct retrieval result (e.g., one of the 5 matched texts in the image-to-text, or i2t, direction) within a specified ranking range (e.g., within the top 10). The score ranges from 0 to 1 (e.g., 0.900) and is sometimes reported as a percentage (e.g., 90.0); a higher score indicates stronger retrieval performance.
Because the exact meaning of recall varies with the evaluation setting, this section lists the main recall indicators used in this evaluation work, their application scenarios, and their formal definitions.
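As a general form (a sketch; the notation here is assumed rather than taken from the platform), Recall@K over a query set Q can be written as:

```latex
% Recall@K: fraction of queries whose correct result is ranked within the top K.
% rank_q denotes the rank of the (best) correct retrieval result for query q,
% and 1[.] is the indicator function.
R@K = \frac{1}{|Q|} \sum_{q \in Q} \mathbf{1}\!\left[\mathrm{rank}_q \le K\right]
```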
Additional notes:
The evaluation procedure differs slightly when 'recall' is applied to the two vision-language benchmark datasets with an image-text ratio of 1:5, Flickr30k and MS COCO.
For Flickr30k, the test set contains N = 1000 images, and the evaluation is called the 1K Test.
For MS COCO, the test set contains N = 5000 images, and the evaluation is called the 5K Test.
For simplicity, this evaluation platform does not currently provide evaluation APIs for the MS COCO 1K Test (N = 1000) or the 5-fold 1K Test (5 × 1000 images, with results averaged over the 5 folds).
Both benchmark datasets use six basic evaluation metrics: R@1/5/10 in the i2t direction (i.e., i2t_R@K) and R@1/5/10 in the t2i direction (i.e., t2i_R@K).
There are also two composite metrics: the sum of these six metrics, Recall Sum (R@Sum), and their average, mean Recall (m_R).
Note that Flickr30k and MS COCO both have an image-text ratio of exactly 1:5 (neither the ideal 1:1 ratio nor any other fixed or variable ratio): each image has 5 matched texts. In the i2t test, it therefore suffices for any one of the 5 matched texts to appear within the specified ranking range for the query to satisfy the 'recall' criterion. In the t2i test, however, each text query matches only a single image, so the criterion is stricter and the measured recall is typically lower than in the i2t test. When evaluating datasets with other image-text ratios, users should adjust the evaluation strategy to the actual matching structure. The metric descriptions below assume the 1:5 setting.
1.1 i2t_R@K
i2t_R@K is the average recall over image-to-text retrieval (the i2t test) in the image-text matching task: the fraction of query images for which at least one of the 5 matched sentences is ranked within the top K (each query's rank is taken as the best rank among its 5 matched sentences).
The i2t test on Flickr30k, MS COCO, and other datasets with an image-text ratio of 1:5 uses i2t_R@K (K = 1, 5, 10) as the default basic recall metric.
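A minimal sketch of how i2t_R@K can be computed from a similarity matrix, assuming the common layout in which the 5 captions of image i occupy text indices 5*i .. 5*i+4 (the function and variable names are illustrative, not the platform's API):

```python
import numpy as np

def i2t_recall_at_k(sims: np.ndarray, k: int) -> float:
    """i2t_R@K for an (N, 5N) image-to-text similarity matrix.

    Assumes the 5 matched captions of image i are text columns 5*i .. 5*i+4.
    An image query counts as a hit if its best-ranked matched caption
    appears within the top K retrieved texts.
    """
    n_images = sims.shape[0]
    hits = 0
    for i in range(n_images):
        # Text indices sorted by descending similarity to image i.
        ranking = np.argsort(-sims[i])
        # Position of each of the 5 matched captions; keep the best (smallest).
        positions = [np.where(ranking == t)[0][0] for t in range(5 * i, 5 * i + 5)]
        if min(positions) < k:
            hits += 1
    return hits / n_images
```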
1.2 t2i_R@K
t2i_R@K is the average recall over text-to-image retrieval (the t2i test) in the image-text matching task: the fraction of query sentences for which the single matched image is ranked within the top K.
The t2i test on Flickr30k, MS COCO, and other datasets with an image-text ratio of 1:5 uses t2i_R@K (K = 1, 5, 10) as the default basic recall metric.
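Correspondingly, a sketch of t2i_R@K under the same assumed caption layout (each text t matches only image t // 5; again illustrative names, not the platform's API):

```python
def t2i_recall_at_k(sims: np.ndarray, k: int) -> float:
    """t2i_R@K for an (N, 5N) image-to-text similarity matrix.

    Each text query t has exactly one matched image, t // 5; the query
    counts as a hit only if that single image is ranked within the top K.
    """
    n_texts = sims.shape[1]
    hits = 0
    for t in range(n_texts):
        # Image indices sorted by descending similarity to text t.
        ranking = np.argsort(-sims[:, t])
        position = np.where(ranking == t // 5)[0][0]
        if position < k:
            hits += 1
    return hits / n_texts
```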
1.3 R@Sum
R@Sum is the total recall obtained after completing both the image-to-text (i2t) and text-to-image (t2i) tests, computed as the sum of the six basic metrics: R@Sum = i2t_R@1 + i2t_R@5 + i2t_R@10 + t2i_R@1 + t2i_R@5 + t2i_R@10.
1.4 m_R
m_R is the mean recall obtained after completing both the i2t and t2i tests, computed as the average of the six basic metrics: m_R = R@Sum / 6.
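Putting the pieces together, R@Sum and m_R follow directly from the six basic scores. A small toy example reusing the illustrative helpers sketched above (the random similarity matrix is only for demonstration):

```python
import numpy as np

# Toy setup: N = 4 images, 20 captions, random similarity scores.
rng = np.random.default_rng(0)
sims = rng.standard_normal((4, 20))

ks = (1, 5, 10)
i2t_scores = [i2t_recall_at_k(sims, k) for k in ks]  # i2t_R@1/5/10
t2i_scores = [t2i_recall_at_k(sims, k) for k in ks]  # t2i_R@1/5/10

r_sum = sum(i2t_scores) + sum(t2i_scores)  # R@Sum
m_r = r_sum / 6                            # m_R (Mean Recall)
print(f"R@Sum = {r_sum:.3f}, m_R = {m_r:.3f}")
```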