
Evaluation Operation Process

After completing registration and filling in personal information as required for evaluation participation, users need to wait for platform approval. Once approved, users can log in and click the [Evaluation Console] button to enter the Evaluation Console Page. Click [Create Evaluation] to enter the parameter configuration interface. The first step is to select one of the five evaluation domains, followed by further parameter configuration.

Selection of Different Evaluation Domains

NLP Domain Model Evaluation

Create Evaluation

On the Create Evaluation page, users can click the [NLP] evaluation domain, fill in the relevant parameters, and click Submit to proceed to the Model Evaluation Instance Details page.

In the parameter input interface, the basic information that needs to be filled out includes:


Form Details:

Parameter Category | Parameter Name | Parameter Description
Basic Information | Evaluation Domain
  • Select the evaluation domain: NLP. (Required, single selection)
Model Name
  • Enter the name of the model to be evaluated. (Required)
  • It must be 3–32 visible characters long, containing only lowercase letters, numbers, and hyphens (-).
  • Must start with a lowercase letter and end with a lowercase letter or number.
  • Must be unique across the platform.
Description
  • Enter a description. (Optional)
  • Supports 0–256 visible characters.
Evaluation Dataset
  • Select the evaluation framework. (Required)
  • Currently, only the HELM evaluation framework is supported.
  • More datasets are being integrated.
Deployment Method Choose one of the following two options:
  • Private Deployment
  • API Evaluation (Required fields: API Endpoint; Optional fields: Online Model Name, API KEY)
Evaluation Dimension
  • Required: Choose whether to include robustness evaluation
  • Accuracy evaluation is a built-in option

Depending on the selected deployment method in the basic information section, different forms need to be filled out:

Deployment Method | Parameter Category | Parameter Name | Parameter Description | Picture Example
Private Deployment | Environment Configuration | Select Image
  • Required: Select the base image for task execution. The image contains all software dependencies required to run the program and provides the runtime environment for inference.
  • You can choose from platform-preloaded images or user-imported images. If the platform-preloaded images do not meet your needs, you may import your own image via Evaluation Management / Image Management.
  • For custom image building using Dockerfile, please refer to the recommended guidelines.
Task Priority
  • Required: Choose from High, Medium, or Low priority.
Select GPU Type
  • Required: Choose the type of GPU needed for evaluation. (Single selection only)
  • Note: Ascend cards support the MindSpore framework, while other cards support the PyTorch framework.
Set Resources
  • Required: Set the computing resources based on your model size. Only single-machine inference is supported.
  • If the resource list is empty after selecting a GPU type, it means there are currently no available resources for that type; please choose another.
  • The platform restricts the maximum resources allowed based on model size:
    • If model parameters ≤ 7B, you may use up to 1 A100 GPU.
    • If 7B < parameters ≤ 20B, you may use up to 4 A100 GPUs.
    • If parameters > 20B, you may use up to 8 A100 GPUs.
API Evaluation | Basic Information | Evaluation API
  • Required: Must comply with the platform’s API specification.
Online Model Name
  • Optional
Online API KEY
  • Optional

In addition to the information above, model-related details also need to be provided:


Parameter Category | Parameter Name | Parameter Description
Model Information | Model Type
  • Base: A foundational model obtained through pre-training.
  • SFT: A model fine-tuned on top of a base model.
  • Required, single selection.
Paper Link
Model Link

After the user checks the box “I acknowledge that manually labeled test set annotations must not be used during model training or in the submitted results,” the evaluation instance can be submitted. Upon submission, the system redirects to the Evaluation Instance Details Page, which is divided into three sections: upper, lower left, and lower right.


  • Upper half:
    • Displays the parameters specified when the user clicked [Create Evaluation].
    • User operations include: [Edit], [Upload Model & Code], [Inference Verification], [Launch Inference Evaluation].
    • [Edit] operation: all fields can be edited except the model name and evaluation domain.
    • [Upload Model & Code] operation: opens the guide page for uploading the model and code.


  • [Inference Verification] operation:
    • The platform randomly selects 1–3 samples from the test data to form an inference verification set, making it convenient for users to quickly verify that their models and services can run.
    • It is recommended to run inference verification before officially launching the inference evaluation so that any problems can be fixed immediately. You can also check whether the answers produced during inference verification meet expectations.
  • [Launch Inference Evaluation] operation: after verifying that the model and service are free of problems, the user can click this to launch the inference evaluation. Once launched, the entire inference evaluation process lasts several hours; an email notification is sent to the user when the evaluation task is completed.
    • Currently, it only supports single-machine inference evaluation.
  • Lower left:
    • Displays the previous inference launches and inference verifications initiated by the user.
    • Displayed parameters include: version, task type, status, submitter, and operation.
    • Task types include: inference verification and formal inference.
    • [Stop] operation: click to end the inference process.
  • Lower right: includes the inference verification details page, model inference progress page, metrics details page, and log page.
    • Inference verification tab: dataset progress display (task name, dataset name, status).
      • Verification results: the platform provides partial inference results that users can check on their own. Data fields: ID, Dataset, Sample, Label, Predict.


  • Model inference progress tab:
    • Mainly records the inference submission time, task name, dataset name, status, start time, and end time.
    • Inference process: task datasets are inferred serially. The state flow of each dataset is shown in the state-flow diagram on the details page.


  • Metrics Details Page: During the inference evaluation, users can view the performance results of datasets that have already been evaluated at any time. Each evaluation task will have its own calculated average score.


  • After the entire inference task is completed, users can decide, based on their scores, whether to submit to the leaderboard, choosing either [Anonymous Submission] or [Named Submission].
  • Clicking the [Named Submission] or [Anonymous Submission] button synchronizes the scores to the NLP leaderboard in the leaderboard's standard format. Only scores submitted this way appear on the leaderboard.
  • Named submission: the submitter's real name and organization are displayed. Anonymous submission: the submitter's name and organization are displayed as [anonymous].

Note:

  • Users only have five opportunities per month to launch formal inference evaluations.

Subjective Inference Evaluation

If the user selects the [Chinese Open QA] task and uses an SFT model, subjective inference evaluation is involved. Because subjective evaluation requires manual annotation, which is relatively expensive, each user is limited to a maximum of 2 manual evaluations per month. After the inference for [Chinese Open QA] is completed, click the [Launch Manual Evaluation] button.

Upload Model & Code Specifications

(1) Install flageval-serving

pip install flageval-serving

Refer to https://github.com/FlagOpen/FlagEval/tree/master/flageval-serving

(2) Prepare Model & Code

Directory structure: place the model and code in the same directory so that FlagEval can access them for offline model evaluation.

demo/
|—— model                # Model directory
|   —— info              # Model files; ensure the weight file format matches the selected card type and framework
|—— local_guniconf.py    # Service configuration file
|—— meta.json            # Model information
|—— service.py           # Service entrance (other dependent modules)

Define Model Information: meta.json

{    "parameter_scale_billion": 7        // Model Parameter Scale (unit billion)}

Define the Service: service.py must be defined as follows.

from flageval.serving.service import NLPModelService, NLPEvalRequest, NLPEvalResponse, NLPCompletion


class Service(NLPModelService):
    def global_init(self, model_path: str):
        print("Initial model with path", model_path)

    def infer(self, req: NLPEvalRequest) -> NLPEvalResponse:
        print(req)
        return NLPEvalResponse(
            completions=[
                NLPCompletion(
                    text='Hello, world!',
                    tokens='Hello, world!',
                ),
            ]
        )
  1. global_init is used for global model initialization and is called once when the service starts.
  2. infer handles inference requests and is called during evaluation.
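
For illustration only, here is a hedged sketch of how a real model might be served; it is not an official template. It assumes the uploaded model directory contains Hugging Face weights, and it treats req.prompt and req.max_new_tokens as attributes of NLPEvalRequest, mirroring the fields shown in the curl example under "Local Model Debugging" below.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from flageval.serving.service import (
    NLPModelService,
    NLPEvalRequest,
    NLPEvalResponse,
    NLPCompletion,
)


class Service(NLPModelService):
    def global_init(self, model_path: str):
        # Called once at service start-up: load the tokenizer and weights
        # from the uploaded model directory.
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path, torch_dtype=torch.float16, device_map="auto"
        )

    def infer(self, req: NLPEvalRequest) -> NLPEvalResponse:
        # Assumed request fields; see the curl example in "Local Model Debugging".
        inputs = self.tokenizer(req.prompt, return_tensors="pt").to(self.model.device)
        output_ids = self.model.generate(
            **inputs,
            max_new_tokens=req.max_new_tokens,
            do_sample=False,
        )
        text = self.tokenizer.decode(
            output_ids[0][inputs["input_ids"].shape[1]:],
            skip_special_tokens=True,
        )
        return NLPEvalResponse(
            completions=[NLPCompletion(text=text, tokens=text)]
        )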

Local Model Debugging

You can run the debugging service with the following command:

 flageval-serving --service demo/service.py dev demo/model

Then you can construct a curl request for testing:

curl -X POST http://127.0.0.1:5000/func -H 'Content-Type: application/json' -d '{
    "engine": "hello",
    "prompt": "hello",
    "temperature": 1.5,
    "num_return_sequences": 1,
    "max_new_tokens": 1,
    "top_p": 1,
    "echo_prompt": false,
    "top_k_per_token": 1,
    "stop_sequences": []
}'

Set the number of inference service processes

By default, 1 worker process is used to run the model inference service. If customization is required, you can adjust the workers configuration item in local_guniconf.py:

workers = 4    # Use 4 worker processes

Model Upload

Click [Upload Model & Code] on the page to obtain a token, then run the following command:

flageval-serving upload --token='token' demo

The upload speed depends on the outbound internet bandwidth of the upload host and the inbound bandwidth of the evaluation platform; the maximum speed is approximately 12.5 MB/s.

CV Domain Model Evaluation

Currently, the FlagEval platform supports uploading pre-trained CV backbone models. Users can freeze the backbone parameters and fine-tune the model on different tasks and datasets to generate downstream task models. These models can then be evaluated through inference on the corresponding test datasets, yielding performance metrics specific to each dataset.

Create Evaluation

Users can click [Create Evaluation] on the evaluation task list page, and a [Create Evaluation] dialog box will pop up. Fill in the relevant parameters and click Submit to go to the Model Evaluation Instance Details page.


Parameter Category | Parameter Name | Parameter Description
Basic Information | Evaluation Domain | Select the evaluation domain: CV. (Required, single selection)
Model Name | Enter the name of the model to be evaluated. (Required) Supports uppercase and lowercase letters, numbers, hyphens (-), underscores (_), and periods (.); must start with a letter and end with a letter or number; length 3–128 characters. The name must be unique across the platform; duplicates are not allowed.
Description | Enter a description. (Optional) Supports 0–256 visible characters.
Evaluation Task | Select the evaluation task(s). (Required, multiple selection) Different tasks are associated with different adaptation methods.
Adaptation Method | Select the adaptation method for fine-tuning. For details of the adaptation methods, please refer to the task description.
Evaluation Dimension | Required: choose whether to include robustness evaluation. Accuracy evaluation is built in by default.
Environment Configuration | Select Image | Required: select the base image for task execution. The image packages the software dependencies required to run the program and provides the runtime environment for inference. You can choose a platform-preloaded image or a user-imported image. If the platform-preloaded images do not meet your needs, you can import your own image under [Evaluation Management / Image Management]. For details, please refer to Image Management.
Select Card Type | Select the card type required for evaluation. (Required, single selection) Note: Ascend cards support the MindSpore framework, while other card types support the PyTorch framework.
Set Resources | Set the required resources based on the model size; only single-machine inference is supported. (Required, single selection) If the resource list is empty after selecting a card type, that card type currently has no available resources; please choose another. The platform limits the maximum resource value based on the model size. For example, in NLP model inference evaluation:
  • Parameters ≤ 7B → max 1 A100
  • 7B < parameters ≤ 20B → max 4 A100s
  • Parameters > 20B → max 8 A100s
Model Information | Model Type | Currently, only pre-trained Backbone models are supported in the CV domain.

After submitting the evaluation instance, the user is redirected to the Evaluation Instance Details page, which is divided into three parts: upper, lower left, and lower right.


  • Upper half:
    • Displays the parameters specified when the user clicked [Create Evaluation].
    • User operations include: [Edit], [Upload Model & Code], [Fine-Tuning & Inference Verification], [Fine-Tuning & Launch Inference Evaluation].
    • [Edit] operation: all fields can be edited except the model name and evaluation domain.
    • [Upload Model & Code] operation: opens the guide page for uploading the model and code.
    • [Fine-Tuning & Inference Verification] operation:
      • The platform randomly selects 1–3 epochs of data from the verification data to form a fine-tuning verification set, making it convenient for users to quickly verify that their models and services can run.
      • It is recommended to run this verification before officially launching fine-tuning & inference evaluation so that any problems can be fixed immediately.
    • [Fine-Tuning & Launch Inference Evaluation] operation: after verifying that the model and service are free of problems, the user can click this to launch fine-tuning & inference evaluation. Once launched, the entire fine-tuning + inference evaluation process lasts several hours; an email notification is sent to the user when the evaluation task is completed.
      • Currently, it only supports single-machine fine-tuning and inference evaluation.
  • Lower left:
    • Displays the previous inference launches and inference verifications initiated by the user.
    • Displayed parameters include: version, task type, status, submitter, and operation.
    • Task types include: inference verification and formal inference.
    • [Stop] operation: click to end the inference process.
  • Lower right: includes the verification details page, fine-tuning & inference details page, and log page.
    • Verification details page: dataset progress display (task name, dataset name, status).
    • Fine-tuning & inference details page:
      • Mainly records the task name, dataset name, status, evaluation metrics, start time, and end time.


  • Log page
    • If a task fails, users can locate the cause of failures by viewing the log.

Note:

  • Users only have five opportunities per month to launch formal fine-tuning & inference evaluations.

Upload Model & Code Specifications

For open-source sample code, please refer to GitHub.

(1) Install flageval-serving

pip install flageval-serving or pip install --upgrade flageval-serving

Refer to https://github.com/FlagOpen/FlagEval/tree/master/flageval-serving

(2) Prepare Model & Code

  • Directory Structure

The model and code need to be placed in the same directory. For open-source sample code, please refer to GitHub.

demo
|—— meta.json
|—— model                 # Model-related code; can be called by model_adapter.py
|   |—— checkpoint.pth
|—— model_adapter.py

Model parameter file: place it at model/checkpoint.pth. When the model is initialized, the model path is passed in as the string checkpoint_path, and users can read it and initialize the model on their own. Note that when uploading, the weight file must be placed in the model directory and named checkpoint.pth; otherwise, the system will fail to parse it.

  • model_adapter.py

Provide an interface for obtaining the model through the model_init method of the ModelAdapter class, as shown below. (A hedged sketch of the model/user_model.py module assumed by this example appears at the end of this list.)

import os
import sys

# current_dir refers to the directory where model_adapter.py is located.
# If you need to access the file system, you can use current_dir. For example:
# abs_path = os.path.join(current_dir, yourfile)  # 'yourfile' is the relative path to your file

current_dir = os.path.dirname(os.path.abspath(__file__))
sys.path.append(current_dir)

from model.user_model import get_model  # noqa E402

class ModelAdapter:
    def model_init(self, checkpoint_path):
        model = get_model()
        # Initialize weights from the checkpoint path
        return model
  • meta.json

The model configuration file provides basic information about the model. The train_batch_size and val_batch_size parameters are optional; default values are provided by the system.

{
    "model_name": "resnet_demo",
    "is_rgb": true,
    "output_channels": [256, 512, 1024, 2048], // Number of channels for features output by the backbone
    "output_dims": [4, 4, 4, 4], // Spatial dimensions of the features output by the backbone
    "mean": [123.675, 116.28, 103.53], // Mean values for input image preprocessing
    "std": [58.395, 57.12, 57.375], // Standard deviation values for input image preprocessing
    "tasks": {
        "classification": {
            "train_batch_size": 128,
            "val_batch_size": 128,
            "out_indices": [3]
        },
        "segmentation": { // Segmentation only supports features with an output_dim of 4
            "train_batch_size": 8,
            "val_batch_size": 8,
            "out_indices": [0, 1, 2, 3]
        },
        "semi_supervised_classification": {
            "train_batch_size": 32,
            "val_batch_size": 128,
            "out_indices": [3]
        }
    }
}
  • Model Upload

Click [Upload Model & Code] on the page to obtain a token, then run the following command:

flageval-serving upload --token='TOKEN' demo

The upload speed depends on the outbound internet bandwidth of the upload host and the inbound bandwidth of the evaluation platform; the maximum speed is approximately 12.5 MB/s.
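
For reference, here is a hedged sketch of the model/user_model.py module that the model_adapter.py example above imports. The names and the ResNet-50 architecture are placeholders only, not platform requirements; the point is that get_model() returns a backbone whose forward pass yields the stage feature maps as [B, C, H, W] tensors, consistent with the output_channels example in meta.json.

import torch.nn as nn
from torchvision.models import resnet50


class ResNetBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        net = resnet50(weights=None)  # weights are loaded later from checkpoint.pth
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.stages = nn.ModuleList([net.layer1, net.layer2, net.layer3, net.layer4])

    def forward(self, x):
        # Return the four stage feature maps as a tuple of [B, C, H, W] tensors,
        # with channel counts (256, 512, 1024, 2048) for ResNet-50.
        x = self.stem(x)
        outs = []
        for stage in self.stages:
            x = stage(x)
            outs.append(x)
        return tuple(outs)


def get_model():
    return ResNetBackbone()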

BackBone Model Format Specifications

  • BackBone Output Format Specification

For image classification, image retrieval, few-shot classification, and semi-supervised tasks, the platform provides a preset global average pooling layer as the Neck and a linear classifier as the Head. It requires the output feature map from the last layer of the user’s Backbone. The output feature can be either a 3D or 4D tensor, and it must match the output_dims specified in the meta.json file.

For intensive prediction tasks such as semantic segmentation and depth estimation, the platform presets different Heads and needs to obtain all four layers of feature maps output by the user's BackBone. It is recommended that users process the feature map format as [B, C, H, W] in the forward function of BackBone (B is Batch_size size, C is the feature dimension size, and H and W are the height and width of the image), then output the four layers of feature maps in tuple format.

Example code for the forward function of SwinTransformer is as follows:

def forward(self, x):
    x, hw_shape = self.patch_embed(x)
    if self.use_abs_pos_embed:
        x = x + resize_pos_embed(
            self.absolute_pos_embed, self.patch_resolution, hw_shape,
            self.interpolate_mode, self.num_extra_tokens)
    x = self.drop_after_pos(x)
    outs = []
    for i, stage in enumerate(self.stages):
        x, hw_shape = stage(x, hw_shape)
        if i in self.out_indices:
            norm_layer = getattr(self, f'norm{i}')
            out = norm_layer(x)
            out = out.view(-1, *hw_shape, stage.out_channels).permute(
                0, 3, 1, 2).contiguous()
            outs.append(out)
    return tuple(outs)
  • Precautions for Selecting Adaptation Methods

For semantic segmentation tasks, if the user-submitted BackBone uses a feature pyramid output form similar to Swin or ViT-Adapter, please choose UPerNet as the adaptation method. If the BackBone uses a normal ViT output form, please choose SETR as the adaptation method.

Depth estimation tasks currently do not support ordinary ViT architectures; only models with feature-pyramid outputs are supported, and for those, either adaptation method can be selected.

For image classification tasks and image retrieval tasks, all adaptation methods can be selected.

  • Model Input Size: for each task, the model input size is as follows:
    • Image classification, image retrieval, few-shot classification, and semi-supervised classification: 224×224
    • Depth estimation:
      • KITTI Eigen split dataset: 352×1120
      • NYU-Depth V2 dataset: 480×640
    • Semantic segmentation: 512×512

Multimodal Domain Model Evaluation

Currently, the FlagEval platform supports uploading direct-inference models (no fine-tuning required) in the Multimodal domain for inference evaluation.

Create Evaluation

Users can click [Create Evaluation] on the evaluation task list page, and a [Create Evaluation] dialog box will pop up. Fill in the relevant parameters and click Submit to go to the Model Evaluation Instance Details page.

Parameter Category | Parameter Name | Parameter Description
Basic Information | Evaluation Domain
  • Select evaluation domain Multimodal. Required, single choice.
Model Name
  • Enter the name of the model to be evaluated; required.
  • Support uppercase and lowercase letters, numbers, middle hyphen -, underscore _, decimal point ., start with English letters, end with English letters and numbers, length 3–128.
  • The name is unique on all platforms, duplicates not allowed.
Description
  • Fill in the description information. Optional
  • Supports 0–256 visible characters.
Evaluation Task
  • Select the evaluation task(s). (Required, multiple selection)
Deployment Method
  Choose one of the following two options:
  • Private Deployment
  • API Evaluation: requires additional input fields: Evaluation API (required), Online Model Name (optional), Online API KEY (optional).
Evaluation Dimension
  • Required: Option to enable robustness evaluation.
  • Accuracy evaluation is enabled by default as a built-in option.

Different forms need to be filled out depending on the Deployment Method selected in the Basic Information section.


Deployment Method | Parameter Category | Parameter Name | Parameter Description | Image Example
Private Deployment | Environment Configuration | Select Image
  • Required: Select the base image for task execution. The image packages all software dependencies needed for inference.
  • You can choose a platform-preloaded image or a user-imported image. If the platform image does not meet your needs, you may import a custom image under [Evaluation Management / Image Management].
  • For custom Dockerfile image building, please refer to the official guide.
Task Priority
  • Required: Choose from High, Medium, or Low.
Select GPU Type
  • Required: Select the type of GPU needed for evaluation. (Single selection)
  • Note: Ascend GPUs support the MindSpore framework, while other GPUs support PyTorch.
Set Resources
  • Set the required resources based on the model size. Only single-machine inference is supported. This field is required and allows only one selection.
  • If the resource list is empty after selecting a GPU type, it means that the selected GPU currently has no available resources. Please choose a different type.
  • The platform restricts the maximum available resources based on model size. For example, in NLP model inference evaluation:
  • If parameters ≤ 7B, you can use up to 1 A100 GPU.
  • If 7B < parameters ≤ 20B, you can use up to 4 A100 GPUs.
  • If parameters > 20B, you can use up to 8 A100 GPUs.
API Evaluation | Basic Information | Evaluation API
  • Required: Must conform to the platform’s API specification.
Online Model Name
  • Required
Online API KEY
  • Required

In addition to the above information, model-related details must also be provided:


Parameter Category | Parameter Name | Parameter Description
Model Information | Model Type
  • Base: A foundational model obtained through pre-training.
  • SFT: A model fine-tuned based on a base model.
  • Required, single selection.
Paper Link
Model Link

After submitting the evaluation instance, the user is redirected to the Evaluation Instance Details page, which is divided into three parts: upper, lower left, and lower right.


  • Upper half:
    • Displays the parameters specified when the user clicked [Create Evaluation].
    • User operations include: [Edit], [Upload Model & Code], [Inference Verification], [Launch Inference Evaluation].
    • [Edit] operation: all fields can be edited except the model name and evaluation domain.
    • [Upload Model & Code] operation: opens the guide page for uploading the model and code.
    • [Inference Verification] operation:
      • The platform randomly selects 1–3 samples from the test data to form an inference verification set, making it convenient for users to quickly verify that their models and services can run.
      • It is recommended to run inference verification before officially launching the inference evaluation so that any problems can be fixed immediately. You can also check whether the answers produced during inference verification meet expectations.
    • [Launch Inference Evaluation] operation: after verifying that the model and service are free of problems, the user can click this to launch the inference evaluation. An email notification is sent to the user when the evaluation task is completed.
      • Currently, it only supports single-machine inference evaluation.
  • Lower left:
    • Displays the previous inference launches and inference verifications initiated by the user.
    • Displayed parameters include: version, task type, status, submitter, and operation.
    • Task types include: inference verification and formal inference.
    • [Stop] operation: click to end the inference process.
  • Lower right: includes the verification details page, inference evaluation details page, and log page.
    • Verification details page: dataset progress display (task name, dataset name, status) and the verification results.
    • Inference evaluation details page:
      • Mainly records the task name, dataset name, status, evaluation metrics, start time, and end time.
    • Log page:
      • If a task fails, users can locate the cause of the failure by viewing the log.

Note:

  • Users only have five opportunities per month to launch formal inference evaluations.

Upload Model & Code Specifications

For open-source sample code, please refer to GitHub.

(1) Install flageval-serving

pip install flageval-serving or pip install --upgrade flageval-serving

Refer to https://github.com/FlagOpen/FlagEval/tree/master/flageval-serving

(2) Prepare Model & Code

  • File Structure

Among the files uploaded using flageval-serving, the following two files are required. They must be named checkpoint.pth and run.sh and must be stored in the root directory of the upload.

example_model
├── checkpoint.pth  # The weight file required for model initialization; users need to load and parse it in their own code.
├── run.sh
└── ….  # Other user files
  • checkpoint.pth

The weight file required for model initialization. Needs to be loaded and parsed by the user's code.

  • run.sh

run.sh must accept three command-line parameters: task_name, server_ip, and server_port. The first parameter is the task name (retrieval, vqa, or t2i); the last two are the data server's IP and port.

bash run.sh $task_name $server_ip $server_port

There are no restrictions on the contents of run.sh; it is sufficient that the program runs and produces the expected outputs.

  • Output Format
    • VQA Task
    • For each dataset, output a list whose elements have the format {"question_id": str, "answer": str}, i.e., the ID of the question and the answer given by the model, as follows:
[
    {"question_id": str, "answer": str},
    ....
]
  • Store the list in JSON format at $output_dir/$dataset_name.json

Retrieval Task

  • For each dataset, output an n×m matrix in np.ndarray format, where entry (n, m) is the similarity between the n-th image and the m-th caption. The storage path is
$output_dir/$dataset_name.npy

t2i Task

  • For each dataset, output n images to the $output_dir directory in PNG format, naming each image ${prompt_id}.png.
  • Also generate a file $output_dir/output_info.json whose content is a JSON list. Each element must contain "id" (the current question_id) and "image" (the name of the generated image, i.e., its path relative to $output_dir), for example:
[
    {    "id": "0",    "image": "0.png"  },
    {    "id": "1",    "image": "1.png"  },
    ....
]
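
As a hedged illustration of the layout described above, the snippet below writes t2i outputs; generate_image() is a hypothetical stand-in for the user's own text-to-image model.

import json
import os

from PIL import Image


def generate_image(prompt):
    # Hypothetical placeholder for the user's own text-to-image model.
    return Image.new("RGB", (512, 512))


def save_t2i_outputs(prompts, output_dir):
    # prompts: a list of {"prompt": str, "id": int} items fetched via get_data
    records = []
    for item in prompts:
        image = generate_image(item["prompt"])
        name = f"{item['id']}.png"               # ${prompt_id}.png
        image.save(os.path.join(output_dir, name))
        records.append({"id": str(item["id"]), "image": name})
    # output_info.json lists every generated image relative to $output_dir.
    with open(os.path.join(output_dir, "output_info.json"), "w") as f:
        json.dump(records, f)
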
  • Data Inference

Users need to obtain data, model paths, and dataset-related information through HTTP requests. The main interfaces are as follows:

  • io_info: the model checkpoint path and the directory for storing user output.
http://{server_ip}:{server_port}/io_info

{
    'checkpoint_path': 'model checkpoint path',
    'output_dir': 'output dir'
}
  • meta_info: meta information about the dataset.
http://{server_ip}:{server_port}/meta_info

# for vqa tasks
{
    'length': number of images,
    'name': dataset name,
    'type': 'vqa'
}

# for retrieval tasks
{
    'name': dataset name,
    'image_number': number of images,
    'caption_number': number of captions,
    'type': 'retrieval'
}
  • get_data: interface for fetching data
  • VQA Task
# Range of index is [0, number of images)
http://{server_ip}:{server_port}/get_data?index={index}

{
    "img_path": a list, each element is the absolute path of the images,
                can be read directly using PIL, cv2, etc.,
    "question": question, is a str, image location use <image1> <image2>,
    "question_id": question_id,
    "type": type is str, is the type of the questions
}

# type can be:
# - multiple-choice
# - multiple-response
# - fill-in-the-blank
# - short-answer
# - yes-no

# Below is a real example
{
    "img_path": ["/share/projset/mmdataset/MathVista/images/20.jpg"],
    "question": "<image 1>Hint: Please answer the question and provide the correct option letter, e.g., A, B, C, D, at the end.\nQuestion: Is the sum of smallest two bar greater than the largest bar?\nChoices:\n(A) Yes\n(B) No",
    "question_id": "20",
    "type": "multiple-choice"
}
  • Retrieval Task
# Read image, the range of index is [0, number of images)
http://{server_ip}:{server_port}/get_data?index={index}&type=img

{
    "img_path": the absolute path of images, can be read directly using PIL, cv2, etc.
}

# Read caption, the range of index is [0, number of captions)
http://{server_ip}:{server_port}/get_data?index={index}&type=text

{
    "caption": a caption
}
  • t2i Task (Text-to-Image Generation)

Language: Python

# Index Range: The index parameter should be in the range: [0, number_of_prompts)
url = f"http://{server_ip}:{server_port}/get_data?index={index}"

# Example response
data = {
    "prompt": "A textual description of the image",  # type: str
    "id": 123  # prompt ID, type: int
}
  • Usage samples: get the model checkpoint location and the output directory
import requests

url = f"http://{server_ip}:{server_port}/io_info"
data = requests.get(url).json()
checkpoint_path = data['checkpoint_path']
output_dir = data['output_dir']
  • Data related
import requests

# Get VQA data
url = f"http://{server_ip}:{server_port}/get_data?index=12"
data = requests.get(url).json()

# Get retrieval caption (text)
url = f"http://{server_ip}:{server_port}/get_data?index=10&type=text"
data = requests.get(url).json()

# Get retrieval image
url = f"http://{server_ip}:{server_port}/get_data?index=10&type=img"
data = requests.get(url).json()
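
Putting these interfaces together, here is a hedged end-to-end sketch of what the script launched by run.sh might do for a VQA task. run_model() is a hypothetical placeholder for the user's own inference code; the request and response fields follow the interfaces described above.

import json
import os
import sys

import requests


def run_model(question, img_paths):
    # Hypothetical placeholder for the user's own VQA model.
    return "dummy answer"


def main():
    # run.sh is invoked as: bash run.sh $task_name $server_ip $server_port
    task_name, server_ip, server_port = sys.argv[1], sys.argv[2], sys.argv[3]
    base = f"http://{server_ip}:{server_port}"

    io_info = requests.get(f"{base}/io_info").json()
    output_dir = io_info["output_dir"]            # directory where results must be written
    checkpoint_path = io_info["checkpoint_path"]  # model weights to load (unused in this stub)

    meta = requests.get(f"{base}/meta_info").json()
    results = []
    for index in range(meta["length"]):           # "length" = number of samples for VQA
        sample = requests.get(f"{base}/get_data?index={index}").json()
        answer = run_model(sample["question"], sample["img_path"])
        results.append({"question_id": sample["question_id"], "answer": answer})

    # Write the answers in the required layout: $output_dir/$dataset_name.json
    with open(os.path.join(output_dir, f"{meta['name']}.json"), "w") as f:
        json.dump(results, f)


if __name__ == "__main__":
    main()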

Model Upload

Click [Upload Model & Code] on the page to obtain a token, then run the following command:

flageval-serving upload --token='TOKEN' demo

The upload speed depends on the outbound internet bandwidth of the upload host and the inbound bandwidth of the evaluation platform; the maximum speed is approximately 12.5 MB/s.

For more tutorials on the use of flageval-serving, please refer to the FlagEval-Serving Tool Detailed Guide section below.

Audio Domain Model Evaluation

Currently, the FlagEval platform supports uploading pre-trained backbone models in the Audio domain. The backbone parameters can be frozen and fine-tuned to generate downstream task models for different tasks and datasets, and inference evaluation on the corresponding test datasets yields the model's evaluation metrics on each dataset.

Create Evaluation

Users can click [Create Evaluation] on the evaluation task list page, and a [Create Evaluation] dialog box will pop up. Fill in the relevant parameters and click Submit to go to the Model Evaluation Instance Details page.

Parameter Category | Parameter Name | Parameter Description
Basic Information | Evaluation Domain
  • Select the evaluation domain: Audio. (Required, single selection)
Model Name
  • Required: Enter the name of the model to be evaluated.
  • Must be 3–32 visible characters long and contain only lowercase letters, numbers, and hyphens (-).
  • Must start with a lowercase letter and end with a lowercase letter or number.
  • The name must be unique across the entire platform.
Description
  • Optional: Provide a description of the model.
  • Supports 0–256 visible characters.
Evaluation Task
  • Select the evaluation task(s). (Required, multiple selection allowed)
Evaluation Dimension
  • By default, accuracy evaluation is included.
Environment Configuration | Select Image
  • Required: Choose the base image used to run the task.
  • This image packages all necessary software dependencies and provides the runtime environment for inference.
  • You can choose a platform-preloaded image or a user-imported image.
  • If default images do not meet your needs, upload your own via [Evaluation Management / Image Management].
  • For building custom images using Dockerfile, refer to the recommended guidelines.
Task Priority
  • Required: Choose from High, Medium, or Low.
Select GPU Type
  • Required: Select the GPU type for evaluation. (Single selection)
  • Note: Ascend GPUs support the MindSpore framework; other GPU types support PyTorch.
Set Resources
  • Required: Set resources based on model size. Only single-machine inference is supported.
  • If the resource list is empty after selecting a GPU type, that GPU is currently unavailable—please select another.
Model Information | Paper Link
Model Link

After submitting the evaluation instance, the user is redirected to the Evaluation Instance Details page, which is divided into three parts: upper, lower left, and lower right.

  • Upper half:
    • Displays the parameters specified when the user clicked [Create Evaluation].
    • User operations include: [Edit], [Upload Model & Code], [Launch Inference Evaluation].
    • [Edit] operation: all fields can be edited except the model name and evaluation domain.
    • [Upload Model & Code] operation: opens the guide page for uploading the model and code.
    • [Launch Inference Evaluation] operation: after verifying that the model and service are free of problems, the user can click this to launch the inference evaluation. An email notification is sent to the user when the evaluation task is completed.
      • Currently, it only supports single-machine inference evaluation.
  • Lower left:
    • Displays the previous inference launches initiated by the user.
    • Displayed parameters include: version, task type, status, submitter, and operation.
    • Task types include: inference verification and formal inference.
    • [Stop] operation: click to end the inference process.
  • Lower right: includes the model inference details page and log page.
    • Model inference details page:
      • Mainly records the task name, dataset name, status, evaluation metric, metric value, start time, and end time.
    • Log page:
      • If a task fails, users can locate the cause of the failure by viewing the log.

Note:

  • Users only have five opportunities per month to launch formal inference evaluations.

Upload Model & Code Specifications

For open-source sample code, please refer to GitHub.

(1) Install flageval-serving

pip install flageval-serving or pip install --upgrade flageval-serving

Refer to https://github.com/FlagOpen/FlagEval/tree/master/flageval-serving

(2) Prepare Model & Code

Currently, the platform only supports inference evaluation after finetuning the BackBone model. Below are the steps for adding the upstream model (BackBone) and its required components:

First, the upstream model folder structure is as follows; the platform provides an example structure and code. Please refer to the sample code for details:

upstream/example_model
|—— model.py
|—— convert.py
|—— ModelWrapper.py

In model.py, two data classes need to be implemented: ExamplePretrainingConfig and ExampleConfig, which initialize the pre-training parameter configuration and the model structure, respectively. The specific requirements are as follows:

  • ExamplePretrainingConfig: use this class to configure pre-training parameters, such as the sample rate and the maximum and minimum lengths of the input data; see upstream/wav2vec2/wav2vec2_model.py (AudioPretrainingConfig) for an example.
  • ExampleConfig: this class configures the structural parameters of the custom model, such as the number of encoder layers, the input and output dimensions of each layer's network, the dropout sizes, etc.

In addition, model.py should implement the ExampleModel class, which defines the model structure. Please use PyTorch for the implementation where possible to avoid additional dependencies.

@dataclass
class ExamplePretrainingConfig:
    sample_rate: int = field(
        default=16_000,
        metadata={
            "help": "target sample rate. audio files will be up/down sampled to this rate"
        },
    )
    max_sample_size: Optional[int] = field(
        default=None, metadata={"help": "max sample size to crop to for batching"}
    )
    min_sample_size: Optional[int] = field(
        default=None, metadata={"help": "min sample size to skip small examples"}
    )
    pass


@dataclass
class ExampleConfig:
    encoder_layers: int = field(
        default=12, metadata={"help": "num encoder layers in the transformer"}
    )
    pass


class ExampleModel(nn.Module):
    def __init__(self, config):
        super(ExampleModel, self).__init__()
        pass

    def forward(self, wavs, padding_mask):
        pass
        return features, mask
  • convert.py

The convert.py file is used to convert the checkpoint to a format compatible with the platform. The structure of the converted checkpoint should be as follows:

{    "task_cfg": ,    # configuration for ExamplePretrainingConfig    "model_cfg",     # configuration for ExampleConfig    "model_weight",  # load model weights using model.load_state_dict(ckpt_state["model_weight"])}

Users can use the load_converted_model function provided in convert.py to verify the compatibility of the converted model.

Python

def load_converted_model(ckpt: str):
    ckpt_state = torch.load(ckpt, map_location="cpu")

    for required_key in ["task_cfg", "model_cfg", "model_weight"]:
        if required_key not in ckpt_state:
            raise ValueError(
                f"{ckpt} is not a valid checkpoint since the required key: {required_key} is missing"
            )

    task_cfg = merge_with_parent(ExamplePretrainingConfig, ckpt_state["task_cfg"])
    model_cfg = merge_with_parent(ExampleConfig, ckpt_state["model_cfg"])
    model = ExampleModel(model_cfg)
    model.load_state_dict(ckpt_state["model_weight"])
    return model, task_cfg
  • ModelWrapper.py

ModelWrapper.py serves as a unified interface for upstream models on the platform. A class needs to be created in this file, such as MyModel, with the following structure:

Python

class MyModel(nn.Module):
    def __init__(self, load_type, **kwargs):
        super(MyModel, self).__init__()
        # Initialize and load your model here:
        # get the config and the converted .pt file (ckpt is the checkpoint path).
        ckpt_state = torch.load(ckpt, map_location="cpu")

        self.task_cfg = merge_with_parent(ExamplePretrainingConfig, ckpt_state["task_cfg"])
        self.model_cfg = merge_with_parent(ExampleConfig, ckpt_state["model_cfg"])

        self.model = ExampleModel(self.model_cfg)
        self.model.load_state_dict(ckpt_state["model_weight"])

    def forward(self, wavs, padding_mask=None):
        device = wavs[0].device
        # Implement the forward pass here and return feature and feature_mask;
        # dimensions are (B * L * D).

    def get_output_dim(self):
        # Define your model's output dimension D, using 768 as an example.
        return 768

FlagEval-Serving Tool Detailed Guide

Users can upload models, code, data, and other evaluation files using the FlagEval-Serving tool. It also supports local testing.

The operation process is as follows:


(1) Install flageval-serving


pip install flageval-serving or pip install --upgrade flageval-serving

Open-source repository: https://github.com/FlagOpen/FlagEval/tree/master/flageval-serving

(2) Prepare Code, Model & Data

Different evaluation domains require different directory structures and API specifications.

Please refer to the specific requirements of each domain when preparing the files for upload.

For NLP model evaluations, local model debugging is also supported. Please refer to the operation specification for the NLP domain.

(3) Common Commands

  • A unique token is required to upload or update files.
  • You can get this token by clicking [Upload Model & Code] on the evaluation details page.
  • If you are appending files to the current version, use the cp command along with the token provided on the file list page.
  • Note: all commands below use English single quotes (').
Command | Meaning | Remarks
pip install --upgrade flageval-serving | Update to the latest version | Please make sure you are using the latest version.
flageval-serving upload --token='new token' /demo | Upload the local /demo folder to the FlagEval platform | 'new token' must be obtained from the [Upload Model & Code] section on the evaluation details page.
flageval-serving cp --token='old token' /temp /remote | Append a local folder to the remote directory linked to 'old token' | Similar to the Linux cp command: /temp can be either a directory or a file. If /remote/temp does not exist, it will be created; if it exists, files will be incrementally overwritten. Note: only use the old token from the file details page, not the new token.
flageval-serving ls --token='old token' /remote | List the directories and files under /remote on the platform |
flageval-serving | Display the help commands |

(4) File Upload

Click [Upload Model & Code] on the evaluation page to get your token, then use the following command:

flageval-serving upload --token='TOKEN' demo

Upload speed depends on the outbound bandwidth of the host machine and the inbound bandwidth of the evaluation platform. The maximum upload speed is approximately 12.5 MB/s.

Image Management

By clicking [Evaluation Management / Image Management], users can view both the preloaded image list provided by the platform and their own custom image list (images they have uploaded).

  • Preloaded Images: The FlagEval platform provides preconfigured images for various domains, including all necessary evaluation dependencies, which users can use directly.

Users can check detailed image information based on the image name, description, training framework, compatible hardware, and tags.


  • Custom Images: The FlagEval platform allows users to import their own pre-built images for use in evaluation tasks.

  • Currently, the platform only supports importing existing images. It does not support building images on the platform using Dockerfiles. The Dockerfile provided by the user is for review purposes only, to be examined by the platform administrators during the approval process.


When the user clicks [+ Import Image], a [+ Import Image] dialogue box will pop up. Fill in the required parameters and click Submit to proceed.


Parameter | Explanation
Image Name | Enter the name of the image. Must be 3–128 characters in length and consist of lowercase letters, numbers, hyphens (-), and dots (.). Must start with a lowercase letter and end with a lowercase letter or number. Required.
Image Tag | Enter the image tag. If left blank, the default will be a timestamp. The keyword latest is not allowed. Optional.
Import Method | Currently, only import from public image repositories is supported.
Image URL to Import | Enter the image URL. Required. Supports DockerHub, NGC, or other publicly accessible image repositories.
Dockerfile | Required. Provided for platform reviewers to inspect the image content. Note: for Dockerfile requirements, please refer to the documentation.
Image Description | Optional. Supports 0–256 characters. Chinese is supported.

After the user clicks [Submit], the system returns to the Image Management list page. The platform administrator will review the imported image. Once approved, the image will be directly imported into the platform’s image repository, and the user will be able to use the image when creating evaluation tasks.

If the image is not approved, please check the reason for rejection and modify the image URL or Dockerfile according to platform specifications before re-importing.

Notifications for approval, successful import, or rejection will be sent to the user via email.

If there is no status update for an extended period, please contact the platform administrator for assistance.

Custom Image Building with Dockerfile

A Dockerfile is a text file used to build container images. It contains a series of instructions and parameters that define how the image should be constructed. Docker builds the image automatically by executing the instructions in the Dockerfile using the docker build command. For details on Dockerfile syntax, please refer to the official documentation, and you may also consult the Quick Start Guide for a beginner-friendly overview.

When performing offline evaluations, if the platform’s preloaded images do not meet the user’s requirements, the user can provide a Dockerfile, and the platform will assist in generating a usable image.

Common Dockerfile keywords include:

Command Keyword | Meaning
FROM | Specifies the base image upon which the current image is built. This must be the first command in the Dockerfile. The selected base image must be secure and trustworthy; it is recommended to prioritize platform-preloaded images or official images.
RUN | Defines the commands to execute during the image build. These are used to install packages or perform setup steps while constructing the image.
ENV | Sets environment variables during the image build (e.g., ENV MY_PATH="/usr/mytest"). These variables are persisted in the container and can be accessed after the container starts.
ARG | Defines temporary variables used during the image build (e.g., ARG TMP="mytemp"). These variables are only available at build time and are not accessible after the container has started.

Note:

  • The following environment variables set via ENV are provided by the container and must not be modified:
    • NVIDIA_*, HOSTNAME, KUBERNETES_*, RANK, MASTER_*, AIRS_*, CUDA_*, NCCL_*, PADDLE_*

Dockerfile usage recommendations:

  • The FROM base image must be selected from the platform-provided candidate images.
  • Minimize the number of image layers. Commands like RUN and COPY generate new layers during the build process, so it is recommended to use && to chain multiple commands into a single RUN statement.

The following command creates two image layers:

RUN apt-get install -y openssh-server
RUN mkdir -p /run/sshd

After modifying to the following command, two layers are merged into one, which helps reduce the overall image size.

RUN apt-get install -y openssh-server \
    && mkdir -p /run/sshd
  • When the command is long, it is recommended to split it into multiple lines, and sort multi-line parameters for better readability.
    • For example, you can sort the list of packages alphabetically to avoid duplicate installations and improve code clarity. Here’s a sample:
RUN apt-get update && apt-get install -y \
    bzr \
    cvs \
    git \
    mercurial \
    subversion \
    && rm -rf /var/lib/apt/lists/*
  • Reduce image size by cleaning up caches:
RUN apt-get -y ... && apt-get clean
  • When installing Python packages, it is recommended to use:
RUN pip install --no-cache-dir -i {custom index URL} {package}
  • When installing Conda-related packages, it is recommended to include clean-up commands afterward:
    • RUN conda clean -p (remove unused packages)
    • RUN conda clean -t (remove tar files)
    • RUN conda clean -y --all (remove all packages and cached files)
  • Use the --no-cache=true option with the docker build command to disable build caching.

Dockerfile Example: this example corresponds to the platform image flageval-nlp-ngc-pytorch2303:v1.0.

This image already meets the environment requirements for mainstream models such as YuLan-Chat-2-13b, LLaMA, ChatGLM2-6B, and Baichuan-13B-Chat.

If this image meets your needs, it is recommended to use it.

FROM nvcr.io/nvidia/pytorch:23.03-py3
RUN apt-get update
RUN apt-get install -y openssh-server && mkdir -p /run/sshd
# Recommended versions; users may install packages based on their model's needs.
# It is advised to separate transformer-related packages from the others for clarity.
RUN pip install --no-cache-dir \
    transformer-engine==0.6.0 \
    transformers==4.31.0 \
    transformers-stream-generator==0.0.4
RUN pip install --no-cache-dir \
    sentencepiece==0.1.98 \
    accelerate==0.21.0 \
    colorama==0.4.6 \
    cpm_kernels==1.0.11 \
    streamlit==1.25.0 \
    fairscale==0.4.13
# Qwen model dependency
RUN pip install tiktoken

Platform Image Approval Principles

  • Currently, the platform mainly provides NVIDIA graphics cards. It is recommended to use NVIDIA's official images (nvcr.io/nvidia) as the base image; if you use the PyTorch framework, it is recommended to use an nvcr.io/nvidia PyTorch image as the FROM source. Images from other sources will not be approved if the reviewer judges that they pose a security risk.

  • If you use other brands of graphics cards, you can use the images provided by the platform for testing. If they do not meet your needs, please contact the operations and maintenance staff (flageval@baai.ac.cn).

  • Here are some suggestions for the Dockerfile:

    • The first line of the Dockerfile must be the FROM command, indicating the source of the base image. Otherwise, the image import request will be rejected.
    • The platform needs to communicate with the backend Kubernetes (k8s) cluster, so SSH must be installed in the image.
    • Try not to use commands such as COPY and ADD in the Dockerfile. If you do use them, add comments to explain their content.
    • For image security, RUN commands must not use wget, curl, or similar tools to download unknown packages.
    • Use the system's built-in package manager to install official system packages.
    • The image URL provided for import must be downloadable with the docker command.
    • When installing packages, specify mirror sources (e.g., domestic mirrors) where possible.
  • The platform has a size limit for images: 40Gi.

  • Image security essential requirements

    • Images containing publicly disclosed high-risk security vulnerabilities with published fixes are forbidden.
    • Using officially discontinued releases to build images is forbidden.
    • Images must not install any malicious programs such as viruses, Trojan horses, backdoors, mining, or idle-farming (AFK) programs.
    • The use of any pirated or cracked software is forbidden.
    • The use of any program that may endanger platform security is forbidden.

Note:

To ensure the security and compliance of service images, we will regularly scan the images that have been listed. If we find any security vulnerabilities or violations in an image, we will remove it and pursue the legal responsibility of the uploader. Thank you for your support and cooperation.

View Evaluation Results

  • Viewing Evaluation Results in the NLP Domain

Once the related task status shows Success, you can view the detailed results on the metrics details page.

  • Viewing Evaluation Results in the CV Domain

Once the evaluation task shows Inference Completed, click Fine-Tuning & Inference Details to view the evaluation results.

  • Viewing Evaluation Results in the Multimodal Domain

Once the evaluation task status shows Inference Completed, click Inference Evaluation Details to view the specific results.


  • Viewing Evaluation Results in the Audio (Speech) Domain

Once the evaluation task status shows Inference Completed, click Model Inference to view the specific results.