Evaluation Overview

Beijing Academy of Artificial Intelligence (BAAI) has launched the FlagEval large model evaluation system and open platform, aiming to establish a scientific, fair, and open set of evaluation benchmarks, methods, and tools that help researchers comprehensively evaluate the performance of foundation models and training algorithms. It also explores the use of AI methods to assist subjective evaluation, greatly improving the efficiency and objectivity of evaluation.

The FlagEval open evaluation platform provides automated evaluation and self-adaptation evaluation mechanisms, supporting chip architectures such as NVIDIA, Ascend (Pengcheng Cloud Brain), Cambricon, and KUNLUNXIN, as well as deep learning frameworks such as PyTorch and MindSpore.

As an important topic of the "Scientific and Technological Innovation 2030" flagship project, FlagEval is being built jointly with partner institutions (in alphabetical order) including Beihang University, Beijing Normal University, Beijing University of Posts and Telecommunications, the China Electronics Standardization Institute, Minjiang University, Nankai University, Peking University, and the Institute of Automation of the Chinese Academy of Sciences, to publish authoritative evaluation rankings.

In the future, FlagEval will continue to serve as a "booster for AI large model innovation", promoting "optimization", "use", and "sharing" through evaluation.

  1. Promote "optimization" through evaluation: provide detailed evaluation results and analysis to help researchers and developers understand the strengths and weaknesses of models and conduct targeted optimization.
  2. Promote "use" through evaluation: provide evaluations of rich downstream tasks in multiple fields, so that users can consult the evaluation results and select the models and algorithms best suited to their needs.
  3. Promote "sharing" through evaluation: adhering to the spirit of open source and openness, encourage researchers and developers to evaluate and share their models and algorithms.

Natural Language Processing (NLP)

Since the end of 2022, the field of large language models has flourished, with new techniques published as often as weekly. With the rapid emergence of new models, however, research on evaluation methods and tools has lagged behind, making it difficult for model users to find suitable models. At the same time, model producers need fairer standards for judging the strengths and weaknesses of models, so that researchers can continuously optimize them.

The current difficulties faced by large language model evaluation mainly include three points:

  • The potential of foundation models is difficult to evaluate accurately, and traditional benchmark testing methods are no longer applicable. Foundation models are huge knowledge bases with tremendous potential, but we cannot yet determine the specific form or upper limit of that potential. Traditional evaluation methods quickly become outdated for foundation models: a single accuracy metric cannot fully reflect a model's potential to complete tasks, and more metrics need to be introduced to measure that potential comprehensively.
  • The training cost of large models is high, so evaluation results must be incorporated during the training process and training strategies adjusted in time to reduce the cost of trial and error.
  • There is a lack of authoritative, neutral rankings for extensive comparative evaluation. Most research teams and enterprises are limited by computing resources and cannot conduct extensive model comparisons. An authoritative, neutral ranking is therefore essential, and is crucial for selecting large models for industrial deployment.

Against this background, FlagEval's large language model evaluation system constructs an innovative three-dimensional "capacity-task-metric" evaluation framework that depicts the cognitive capacity boundaries of foundation models at fine granularity and visualizes the evaluation results. It currently covers 30+ capacities x 5 major tasks x 4 major metrics, for a total of 600+ sub-dimensions. The task dimension includes nearly 30 subjective and objective evaluation datasets with over 100,000 evaluation questions, and datasets covering more dimensions are being integrated.

As shown in the figure below, Model X and Model Y are evaluated on different metrics for the same capability and task.

Capacity Framework: Depicting Model Cognitive Capacity Boundaries

Task Framework: Refine the "Capacity" Labels of Tasks

By decoupling "tasks" from "capacities", each task corresponds to multiple capacities and is evaluated through diverse datasets. Currently there are 22 subjective and objective evaluation datasets with 84,433 evaluation questions. The types and quantities of datasets will continue to expand based on the capacity framework.

In addition to well-known public datasets such as HellaSwag and MMLU, FlagEval also integrates the Chinese Linguistics & Cognition Challenge (CLCC), a subjective evaluation dataset built by BAAI, as well as datasets for vocabulary-level semantic relationship judgement, sentence-level semantic relationship judgement, polysemous word understanding, and rhetorical device judgement, jointly built by Peking University and Minjiang University. Evaluation datasets covering further dimensions are being integrated continuously.

Metrics Framework: Different Tasks have Different Emphasized Metrics

FlagEval v0.5 currently supports only the Accuracy metric. Subsequent updates will add metrics such as Uncertainty, Robustness, and Efficiency.

  • Accuracy: Accuracy is the fundamental attribute of a model; the accuracy of its output determines whether the model is usable at all. In FlagEval, accuracy is the umbrella term for the accuracy measures of each evaluation scenario and task, including exact-match accuracy for text classification, word-overlap F1 for question answering, MRR and NDCG for information retrieval, ROUGE for summarization, etc.
  • Uncertainty: The confidence or certainty measure of a model's predictions, which is essential for setting appropriate expectations and responses when the model may be wrong. In high-stakes settings such as decision-making, uncertainty metrics allow us to anticipate possible incorrect results and make timely adjustments and interventions to avoid potential risks.
  • Robustness: The capability of a model to maintain its performance in the face of input perturbations. For example, a robust model should still answer a question correctly even if the question is slightly reworded or contains minor typos. Robustness is particularly important for practical applications, where inputs are often noisy or adversarial. In the context of language models, robustness can be evaluated by perturbing input texts and measuring the changes in model outputs.
  • Efficiency: The computational efficiency of a model, including the time and computing resources needed for training and inference. Efficiency affects the feasibility of a model in practical applications: a very accurate model that requires large amounts of computing resources or time for training or inference may not be suitable for resource-limited environments or those requiring rapid response.
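As a rough illustration of how two of these accuracy measures might be computed, here is a minimal sketch of exact-match accuracy and word-overlap F1 (our own helper names, not FlagEval's actual implementation):

```python
from collections import Counter

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answer."""
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)

def token_f1(prediction, reference):
    """F1 score based on word overlap, as commonly used in QA evaluation."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # per-word overlap counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Real evaluation suites additionally normalize punctuation, casing, and articles before matching; that is omitted here for brevity.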

Evaluation Method

FlagEval uses different evaluation methods for foundation models and fine-tuned models:

  • Foundation model evaluation is based mainly on objective evaluation, combining "adaptation evaluation + prompt learning evaluation".
    • Adaptation evaluation mainly examines the selection capability of foundation models over fixed options. We refer to the Language Model Evaluation Harness framework and extend its evaluation capability to Chinese.
    • Prompt learning evaluation mainly examines the open-ended generative capability of foundation models under prompt learning. We refer to the HELM evaluation framework and extend its evaluation capability to Chinese.
  • Fine-tuned model evaluation starts by reusing the objective evaluation of foundation models, examining whether fine-tuning has improved or degraded certain capabilities of the foundation model; subjective evaluation is then introduced.
    • Manual subjective evaluation: for subjective questions created manually, "multi-person back-to-back annotation + third-person arbitration" is used. GPT-4 annotations are also added to the back-to-back annotation to increase diversity.
    • Automatic subjective evaluation: subjective questions created by GPT-4 based on the capacity framework are annotated automatically by GPT-4.
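The adaptation evaluation above can be sketched as scoring each fixed option with the model and choosing the highest-scoring one. In the minimal illustration below, `score_option` stands in for a real model's log-likelihood function; all names are hypothetical and not FlagEval's API:

```python
def adaptation_evaluate(question, options, score_option):
    """Select the option the model scores highest.

    `score_option(question, option)` is a placeholder for a model-specific
    scoring function, e.g. the log-likelihood of the option text given the
    question as context (the approach used by option-selection harnesses).
    """
    scores = [score_option(question, opt) for opt in options]
    best = max(range(len(options)), key=lambda i: scores[i])
    return options[best]

# Contrived stand-in scorer, for illustration only: prefers the option
# whose length is closest to five characters.
def toy_scorer(question, option):
    return -abs(len(option) - 5)

choice = adaptation_evaluate("Capital of France?", ["Paris", "Lyon", "Marseille"], toy_scorer)
```

The point of the pattern is that the model never generates free text; it only ranks the given options, which is what makes this style of evaluation fully automatic.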

See the following table for the specific methods of objective and subjective evaluation:

|  | Objective Evaluation | Subjective Evaluation |
| --- | --- | --- |
| Tasks | Mainly tasks that can be evaluated automatically and have standard answers, such as classification, multiple-choice question answering, and information retrieval. | Mainly tasks that cannot be evaluated automatically and have no standard answers, such as open question answering and conditional text generation. |
| Method | Evaluation is conducted in in-context form, supporting Few-Shot/Zero-Shot. | Evaluation is conducted in dialogue form, mainly Zero-Shot. |
| Evaluation Specifications | Computing power, data, and other evaluation infrastructure remain unified. | Multiple rounds of evaluation-standard training and multi-person back-to-back annotation, assisted by annotation tools. |
| Characteristics | Large amount of evaluation data and fast speed, convenient for rapid verification, but the capability dimensions that can be evaluated are limited. | Small amount of evaluation data and slow speed, but rich capability dimensions can be evaluated, making it easy to discover model weaknesses. |
| Resource Support | Rich data sources, such as existing benchmarks; rich automated evaluation tools, such as data sampling and metric calculation. | An experienced evaluation and annotation team. |

Support Automated Evaluation and Self-Adaptation Evaluation

Automated evaluation mechanism:

  • Deployed inference services and a fully automated pipeline for subjective and objective evaluation
  • Automatic monitoring at each stage, with a fully automatic connection from inference service to evaluation

Self-adaptation evaluation mechanism:

  • Users can choose evaluation strategies based on model type and status, and the platform integrates the evaluation results
  • Automatic notification alarms for full-cycle events such as evaluation start, end, and errors

Vision Domain Evaluation

With the rapid development of deep learning, vision foundation models have become an important tool for solving complex vision tasks. Trained on massive data, these large models have strong representation learning capabilities, enabling them to achieve excellent performance on a variety of downstream tasks. However, as model scale continues to increase, evaluating their performance becomes increasingly difficult. Traditional evaluation methods often focus on a single task or metric and so struggle to fully reflect a model's universality and performance advantages. Building a reasonable, comprehensive, and objective evaluation system has therefore become an important issue for the industry.

In the era of small models, models were often optimized for specific tasks, and evaluation focused mainly on task performance. In the era of large models, because models have stronger versatility and transfer learning capability, single-task evaluation can no longer meet the need. We need a more comprehensive evaluation system that can evaluate not only a model's performance on specific tasks but also its versatility and performance across multiple tasks and scenarios.

Currently, the industry has various methods and standards for evaluating vision foundation models, but it lacks a unified and authoritative evaluation system. Moreover, given the diversity and complexity of vision tasks, it is difficult to find a universal evaluation metric that covers all situations. The selection and design of evaluation metrics must comprehensively consider the characteristics and requirements of different tasks as well as model performance at different levels. At the same time, as model parameter counts increase, the computing cost of evaluation also rises significantly, making it difficult for many research teams and enterprises to conduct extensive comparative evaluation.

Faced with these challenges, we believe that a good vision foundation model evaluation system should have several characteristics:

  • Comprehensiveness: Can evaluate the performance of models on multiple different tasks and datasets.
  • Objectivity: Evaluation metrics should be objective, repeatable, and not influenced by human factors.
  • Flexibility: Adapt to different models and task requirements, and update in a timely manner to reflect the latest research progress and technological developments.

By constructing such an evaluation system, we can not only understand the performance and advantages of vision foundation models more accurately, but also provide strong support for the development and application of models.

The FlagEval vision foundation model evaluation system currently covers 7 sub-capabilities across the perception, analysis, and understanding dimensions, spanning more than 10 vision tasks such as image classification, semantic segmentation, depth estimation, video classification, and few-shot image classification. It currently includes more than 20 evaluation datasets, such as ImageNet, Places365, COCO, NYUv2, KITTI, and ADE20K.

"Capacity-Task" Framework

Metrics System

The FlagEval Vision Foundation Model Plan uses the following metrics. For detailed metric descriptions, please refer to the introduction pages of the evaluation datasets:

  • Performance: Evaluating a model's performance on specific tasks is the most basic function of evaluation. Different tasks have different performance metrics, such as classification accuracy and retrieval recall.
  • Robustness: The capability of a model to maintain its performance in the face of input perturbations. Robustness is particularly important for practical applications, where inputs are often noisy or adversarial.
  • Efficiency: The computational efficiency of a model, including the time and computing resources needed for training and inference. Efficiency affects the feasibility of models in practical applications: a very accurate model that requires large amounts of computing resources or time for training or inference may not be suitable for resource-limited environments or those requiring rapid response.

Evaluation Method

  • Vision foundation models use adaptation evaluation
  • Specific task fine-tuned models use direct evaluation

Currently, the FlagEval platform has launched tasks such as image classification, semi-supervised image classification, and image retrieval. It will continue to improve and update the vision foundation model evaluation system, adding vision tasks such as object detection, instance segmentation, and video classification, together with evaluation metrics such as AP and mask AP.

Multimodal Domain Evaluation

FlagEval multimodal large model evaluation measures the performance and effectiveness of multimodal foundation models at processing and analyzing data across multiple modalities, helping us better understand their advantages and disadvantages. A multimodal foundation model's capability to process and analyze multimodal data is reflected in its performance on multimodal tasks. For example, visual question answering and text-to-image generation are two commonly used tasks: the former evaluates a model's understanding of text and images, while the latter evaluates its understanding of text and its generation of images. How to design multimodal tasks accurately and efficiently is therefore the main problem that multimodal foundation model evaluation must face.

FlagEval multimodal large model evaluation covers both comprehension and generative tasks. Comprehension tasks include image-text retrieval, video-text retrieval, image question answering, video question answering, visual grounding, etc. Generative tasks include text-to-image generation and text-to-video generation. The platform supports the evaluation of different frameworks and the flexible embedding of adaptation methods.

The evaluation datasets are planned to cover both public and self-built datasets, support cross-modal automatic generation, ensure data universality, cover data from different scenarios, and support dynamic expansion. For generative tasks, automatic evaluation, manual evaluation, and human-machine collaborative evaluation are all supported.

Capacity Framework

The FlagEval multimodal large model capability system currently includes 8 sub-capabilities across multimodal understanding, cross-modal understanding, and cross-modal generation. It plans to gradually cover multimodal tasks such as visual question answering, image-text retrieval, text-to-image generation, and visual grounding, and currently includes more than 10 evaluation datasets, such as VQA2.0, TDIUC, MS-COCO, CUB, CelebA-HQ, Oxford-102 Flower, MSR-VTT, UCF-101, and Flickr30k (F30k).

Metrics System

The purpose of the multimodal foundation model evaluation method is to measure the performance and effectiveness of multimodal foundation models, so that researchers and developers can better understand their advantages and limitations and promote their improvement and development. Performance and effectiveness can usually be measured with objective metrics such as accuracy, robustness, and generalization ability on tasks such as image generation and image-text retrieval. The multimodal foundation model evaluation method therefore plans to cover multiple evaluation metrics.

  • Performance: Evaluating a model's performance on specific tasks is the most basic function of evaluation. Different multimodal tasks have different performance metrics, such as visual question answering accuracy and image-text retrieval recall.
  • Robustness: The capability of a model to maintain its performance in the face of input perturbations. Robustness is particularly important for practical applications, where inputs are often noisy or adversarial.
  • Efficiency: The computational efficiency of a model, including the time and computing resources needed for training and inference. Efficiency affects the feasibility of models in practical applications: a very accurate model that requires large amounts of computing resources or time for training or inference may not be suitable for resource-limited environments or those requiring rapid response.
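For instance, the image-text retrieval recall mentioned above is typically reported as Recall@k over a similarity matrix between queries and candidates. A minimal sketch (assuming the matching caption for image i sits at index i, a common convention for paired datasets such as Flickr30k; not FlagEval's actual code):

```python
def recall_at_k(similarity, k):
    """Recall@k for paired image-text retrieval.

    similarity[i][j] is the score between query i and candidate j; the
    matching candidate for query i is assumed to be at index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # indices of the k highest-scoring candidates for query i
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return hits / len(similarity)
```

Retrieval benchmarks usually report R@1, R@5, and R@10 in both directions (image-to-text and text-to-image, by transposing the matrix).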

Evaluation Method

  • Multimodal foundation models use adaptation evaluation.
  • Specific task fine-tuned multimodal models use direct evaluation, using objective and subjective evaluation metrics.

Audio Domain Foundation Model Evaluation

As a hot field of artificial intelligence, audio and related multimodal technologies, including speech recognition, text-to-speech, voice conversion, speech translation, speech enhancement, spoken language understanding, voiceprint recognition, and speech authentication, have been integrated into daily life and production activities, with application scenarios including smart healthcare, intelligent manufacturing, and smart finance. Audio and related multimodal technologies are an important part of general artificial intelligence and a supporting technology for future multimodal, multilingual, multitask general large models. Evaluating the generalization ability, cognitive ability, robustness, and security of audio and related multimodal foundation models is of great significance for the popularization and promotion of artificial intelligence.

Research on foundation model evaluation and measurement has made some progress. However, this research still focuses on evaluating the results of simple tasks and cannot form a comprehensive picture of overall model performance. In terms of evaluation benchmarks, methods, and tools, we therefore need to pay more attention to evaluating models' generalization ability and generality across different tasks and scenarios.

FlagEval is building a fair evaluation system based on the generalization ability of multi-task audio large models, providing developers and researchers with a standardized evaluation framework and tools to ensure that their models have efficient, accurate, and secure performance in different tasks and scenarios.

Capacity Framework

FlagEval v1.0 currently covers only two task types: speech recognition and emotion recognition. In the future, it will support richer evaluation tasks.

Task Framework

The evaluation system evaluates the generalization ability of audio foundation models from a multi-task perspective, covering classification, recognition, generative, semantic understanding, and multimodal tasks. Specific tasks include speech recognition, voice conversion, language identification, dialogue slot filling, dialogue intention recognition, emotion recognition, speaker verification, speaker identification, speaker diarization, etc.

The first launched evaluation tasks include speech recognition and emotion recognition tasks. Stay tuned for more tasks.

Metrics Framework

Different tasks have different emphasized metrics:

  • Accuracy:

    • Speech recognition:
      • CER, Character Error Rate
      • WER, Word Error Rate
    • Emotion recognition:
      • WAR, Weighted Average Recall
      • UAR, Unweighted Average Recall
    • Speech generative tasks (general metrics):
      • Objective evaluation metrics:
        • MCD, Mel Cepstrum Distortion
        • CER, Character Error Rate
        • WER, Word Error Rate
        • Speaker Similarity
      • Subjective evaluation metrics:
        • MOS, Mean Opinion Score
    • Language identification:
      • EER, Equal Error Rate
      • ACC, Accuracy
    • Speaker identification:
      • CER, Classification Error Rate
    • Speaker verification:
      • EER, Equal Error Rate
    • Speaker diarization:
      • DER, Diarization Error Rate
      • SER, Speaker Error Rate
    • Audio classification:
      • ACC, Accuracy
    • Speech-to-text translation:
      • BLEU, BiLingual Evaluation Understudy
    • Speaker separation:
      • SDR, Signal-to-Distortion Ratio
      • SISNR, Scale-Invariant Signal-to-Noise Ratio
    • Speech enhancement:
      • PESQ, Perceptual Evaluation of Speech Quality
      • STOI, Short Time Objective Intelligibility
    • Dialogue slot filling:
      • F1, Slot Type F1 Score
      • WER, Slot Value Word Error Rate
    • Dialogue intention recognition:
      • ACC, Accuracy
  • Robustness:

    Universality is an important development direction for multimodal large models. As a key dimension of universality, robustness reflects how model performance changes under different acoustic or linguistic disturbances, including noise robustness, domain robustness, acoustic environment robustness, speaker robustness, oral robustness, accent robustness, and language robustness. For each task, evaluation datasets are designed around important interference factors to test the universality of audio foundation models.

    • Accent robustness: the impact of accents and dialects
    • Environmental robustness: the impact of noise and reverberation
    • Equipment robustness: the impact of near-field, far-field, and array microphones
    • Multi-speaker robustness: whether models can handle multiple speakers
    • Multilingual robustness: whether models can handle multiple languages
    • Domain robustness: reflects the generalization ability of foundation models in multitasking and different application scenarios
    • Oral robustness: spontaneous speech includes common phenomena such as repetition, hesitation, self-correction, and meaningless syllables; oral robustness reflects how a foundation model's performance changes in the real world
  • Fairness:

Fairness is an important dimension of trustworthy AI. By introducing user-related interference factors such as gender and age, the system provides an objective evaluation standard for the fairness of audio foundation models, for example the speech recognition performance for the elderly and for young children.

  • Efficiency:
    • RTF, Real Time Factor
    • Average Inference Cost, the average time spent by the model on each sample during inference
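Two of the metrics above can be made concrete with a short sketch: WER is the Levenshtein edit distance over words normalized by the reference length (using characters instead of words gives CER), and RTF is processing time divided by audio duration. The helper names below are our own, not FlagEval's implementation:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed via Levenshtein edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[-1][-1] / len(ref)

def real_time_factor(processing_seconds, audio_seconds):
    """RTF < 1 means the system runs faster than real time."""
    return processing_seconds / audio_seconds
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why it is an error rate rather than an accuracy.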

Evaluation Method

To evaluate the generalization ability of audio foundation models, unified universal head models, both linear and nonlinear, are constructed according to the characteristics of each task. This enables fair comparison of audio foundation models and lets foundation models cover the solution of most audio information processing problems.

Following the evaluation ideas of SUPERB, this evaluation system adopts different fine-tuning methods for evaluation across multiple tasks. Currently, FlagEval v0.5 supports only evaluation with frozen or unfrozen foundation model parameters. In the future, fine-tuning methods such as weighted-sum and LoRA will be added to evaluate the generalization ability and versatility of foundation models. The specific process is as follows:

Evaluation Process

Based on the differences between tasks, the evaluation system is divided into two modules: upstream model management and downstream task management. Upstream model management defines the foundation model interfaces that model providers must implement, focusing on customizing upstream models and their parameter settings. Downstream task management defines the task-specific datasets, data augmentation strategies, head models, loss functions, optimizers, and metric calculators.
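The linear head model mentioned above can be sketched as a linear probe trained on frozen foundation-model features. The minimal illustration below uses closed-form ridge regression against one-hot labels as a stand-in for the gradient-trained heads such a system would actually use; all names are hypothetical:

```python
import numpy as np

def fit_linear_head(features, labels, num_classes, l2=1e-3):
    """Fit a linear classification head on frozen foundation-model features
    via ridge regression against one-hot targets (a closed-form stand-in
    for a gradient-trained linear probe)."""
    X = np.hstack([features, np.ones((len(features), 1))])  # add bias column
    Y = np.eye(num_classes)[labels]                          # one-hot targets
    # Closed-form ridge solution: W = (X^T X + l2*I)^-1 X^T Y
    W = np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ Y)
    return W

def predict(W, features):
    """Class prediction = argmax of the head's per-class scores."""
    X = np.hstack([features, np.ones((len(features), 1))])
    return (X @ W).argmax(axis=1)
```

Because the foundation model stays frozen and only this small head is trained per task, the comparison across foundation models remains fair: every model gets the same downstream capacity.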

Future Outlook

In the future, through interdisciplinary cooperation and research, we hope to further improve the evaluation of model structure, generalization ability, and security, advance the construction of evaluation datasets for complex tasks, and establish a more comprehensive universal multimodal large model evaluation benchmark, method, and tool system.