Quick Start
Register and Login
By default, unregistered users can view the platform’s homepage, news, leaderboard, and user manual, and can try out the Large Model Arena and Debate Competition features.
To access the evaluation functions, users must register and log in to the platform, apply for participation in the evaluation through the Evaluation Management section, and complete their personal information. Please ensure that all submitted information is accurate and valid. Once the information is submitted and approved by the administrator, users will be granted access to the evaluation features of the platform.
Detailed instructions are as follows:
Register
When users click the [Login/Register] button, the following interface will appear. First-time users should scan the QR code with WeChat to follow the “BAAI Community Assistant” official WeChat account.
After scanning the QR code and following the account, the interface will switch to the layout below, where users can complete registration online by entering their email, phone number, and verification code.
After completing registration, the system will redirect to the platform homepage. By clicking [Evaluation Console], users can apply for evaluation participation. Users are required to complete their personal information, which the platform administrator will review. Only users who pass the review will be granted access to the evaluation features. The review result will be sent to users via email.

| Parameter | Explanation |
|---|---|
| Username | |
| Real Name | |
| Organization | |
| Task to Register | |
| Whether to evaluate self-developed models | |
| Agreement statement | |
The registration process is shown in the following image:
Note:
- Please fill in your personal information carefully, as the administrator will review your application based on the provided details.
- Please provide a valid business email address. You will be notified of the review result via email and SMS, and future evaluation task updates will be sent by email.
- If a personal email address is provided, the administrator will send an email asking you to update it. Please change it to a business email address and wait for the review again. Each user may modify their email address only once per month.
Login
If the user has already completed registration, clicking the [Login/Register] button will bring up the Login page. The user can choose to log in by scanning the QR code at the top of the screen using WeChat Scan, or by using the Mobile Verification Code option.
Alternatively, the user can click [Hugging Face] to log in via a third-party platform, which will redirect to the Hugging Face login page.
The login process is as follows: 

On the Hugging Face login page, enter two pieces of information: your username or email address and your password.

If you do not have a Hugging Face account, please register one first, then authorize with your Hugging Face account to complete the login.

Create Evaluation
When users click [Evaluation Management], they will enter the Evaluation Management page, which mainly includes: Model Evaluation, Innovative Algorithm Evaluation, and Image Management.

Users can choose either Model Evaluation or Innovative Algorithm Evaluation based on their needs. Clicking [Create Evaluation] opens the Create Evaluation dialog box; fill in the form for the corresponding evaluation domain and submit it to generate an evaluation task.
After an evaluation is created, the system automatically redirects to the task details page. Click to view the "Upload Model & Code" specification and use flageval-serving to upload the model and code. After uploading, click "Inference Verification" to quickly check whether the inference evaluation code can run. Once verification passes, click "Start Inference Evaluation" to begin the formal inference evaluation, and wait for it to finish to view the results. If a problem causes the evaluation to terminate and fail, the error message can be viewed in the logs.

Upload Image
In [Image Management], several preset images are provided. If users need to use a custom image during actual evaluations, they can upload their own image under [Custom Image].
When users click [Image Management / Custom Image / Import Image], the [Import Image] dialog box will pop up. Users need to fill in and submit the form. After submission, the platform administrator will review it. Once approved, the image will be automatically imported. Only after a successful import can the image be used in evaluation tasks.
Currently, the platform only supports importing existing images. It does not support building images on the platform using a Dockerfile. The Dockerfile provided by the user is for review purposes only.

Auto-Evaluation
Introduction
Auto-Evaluation is a multi-chip-compatible automatic model evaluation tool built on the FlagEval platform. It covers three domains: LLM, VLM, and Embodied VLM, each bound to classic datasets in its field, and currently supports online evaluation only. The tool provides interfaces for starting evaluations, checking evaluation progress, viewing evaluation results, stopping and resuming evaluations, and analyzing differences in evaluation results across multiple chips. Whether you are an individual, a team, or an enterprise, you can use this tool for automatic evaluation as long as you launch your model service on compute resources that are reachable from the public network.
Tool Description
1. Address:
120.92.17.239:5050
2. Interface Overview
| Interface Name | Method | Description |
|---|---|---|
| /evaluation | POST | Start evaluation; invokes the uploaded service URL synchronously to begin model inference |
| /evaldiffs | GET | Query evaluation results |
| /stop_evaluation | POST | Stop evaluation; use this interface to pause if there are issues with the service |
| /resume_evaluation | POST | Resume evaluation; supports resuming from breakpoints |
| /evaluation_progress | POST | View evaluation progress; returns detailed information on completed evaluation status for each dataset |
| /evaluation_diffs | POST | View the differences of results between multiple models |
Detailed Interface Parameter Description
1. Start Evaluation
Request Interface
header
```json
"Content-Type": "application/json"
```
body:
| Parameter Name | Type | Description | Required |
|---|---|---|---|
| eval_infos | EvalInfo[] | List of model evaluation service information | Yes |
| domain | string | Evaluation domain: NLP, MM | Yes |
| mode | string | Evaluation project identifier; has a default value | No |
| region | string | Evaluation tool cluster: bj (default), sz | No |
| special_event | string | Whether this is a chip evaluation; defaults to yes | No |
| user_id | int | User ID on FlagEval platform | No |
EvalInfo Data Structure Parameters
| Parameter Name | Type | Description | Required |
|---|---|---|---|
| eval_model | string | Name of the evaluation task to start; must be unique (to mark an evaluation as an NVIDIA baseline model, use the xxx-nvidia-origin naming pattern) | Yes |
| model | string | The model name used when the model is deployed as a service (evaluations started with the same model share the same cache) | Yes |
| eval_url | string | The evaluation interface for each deployed model service, e.g., http://10.1.15.153:9010/v1/chat/completions | Yes |
| tokenizer | string | Vendor and model information, e.g., Qwen/Qwen3-8B | Yes |
| api_key | string | API_KEY for model invocation, defaults to "EMPTY" | No |
| batch_size | int | The batch_size that the model can handle, defaults to 1 | No |
| num_concurrent | int | Number of concurrent requests, defaults to 1 | No |
| num_retry | int | Number of retries, defaults to 10 | No |
| gen_kwargs | string | Generation parameters such as temperature, top_k, top_p, etc., separated by commas, e.g. temperature=0.6,top_k=20,top_p=0.95,min_p=0. NLP uses max_gen_toks (for a max_model_len of 16384, set max_gen_toks=16000); MM and EV also specify this via the max_gen_toks field | No |
| thinking | bool | Whether to enable thinking mode; currently only applicable to EmbodiedVerse's RoboBrain, defaults to False | No |
| retry_time | int | Timeout duration in seconds; currently supported for the MM and Embodied domains, defaults to 3600 | No |
| chip | string | Name of the chip used for evaluation, in Vendor-ChipName format; defaults to Nvidia-H100 | No |
| base_model_name | string | Base model used for evaluation, e.g.: Qwen3-8B | No |
Response Data
Note: Since model evaluation takes a long time, the interface returns the corresponding evaluation ID and batch ID right away; these can be used to query the evaluation results later.
| Parameter Name | Type | Description | Required |
|---|---|---|---|
| err_code | int | Whether the request was processed correctly, 0 for success, 1 for failure; if 1, request_id is not included in return data | Yes |
| err_msg | string | Request processing message | Yes |
| request_id | string | Unique Identifier | Yes |
| eval_tasks | EvalTask[] | Detailed information on evaluations started for each service | Yes |
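For reference, below is a minimal curl sketch of a Start Evaluation request. The http:// scheme and all field values (task name, service URL, model names, chip) are illustrative placeholders; only the parameter names come from the tables above.
```bash
# Sketch: start an evaluation for one deployed model service.
# All values below are placeholders; replace them with your own deployment details.
curl -X POST "http://120.92.17.239:5050/evaluation" \
  -H "Content-Type: application/json" \
  -d '{
    "eval_infos": [
      {
        "eval_model": "qwen3-8b-demo-chip-a",
        "model": "Qwen3-8B",
        "eval_url": "http://10.1.15.153:9010/v1/chat/completions",
        "tokenizer": "Qwen/Qwen3-8B",
        "batch_size": 1,
        "num_concurrent": 1,
        "gen_kwargs": "temperature=0.6,top_k=20,top_p=0.95,min_p=0",
        "chip": "Nvidia-H100",
        "base_model_name": "Qwen3-8B"
      }
    ],
    "domain": "NLP"
  }'
```
The response includes a request_id, which the remaining interfaces use to reference this evaluation run.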
2. View Evaluation Results
Request Interface
header
```json
"Content-Type": "application/json"
```
body:
| Parameter Name | Type | Description | Required |
|---|---|---|---|
| request_id | string | Unique identifier | Yes |
Response Data
| Parameter Name | Type | Description | Required |
|---|---|---|---|
| err_code | int | Whether the request was processed correctly, 0 for success, 1 for failure | Yes |
| err_message | string | Request processing message | Yes |
| eval_results | EvalResultMap | Evaluation result information for each service | |
EvalResultMap Data Structure
| Parameter Name | Type | Description | Required |
|---|---|---|---|
| EvalResultMap | Map<string, EvalResult> | Evaluation results for all models in one run | Yes |
| EvalResultMap.key | string | The eval_model corresponding to the started evaluation | Yes |
| EvalResultMap.value | EvalResult[] | Evaluation results for a single model | Yes |
EvalResult Data Structure
| Parameter Name | Type | Description | Required |
|---|---|---|---|
| status | string | Evaluation status. e.g.: S: Success, F: Failure, C: Cancelled, OOR: Out of Retries | Yes |
| details | Detail[] | Evaluation results for each dataset of the corresponding evaluation service (currently only supports mmlu, gsm8k) | Yes |
| release | bool | Whether the model can be released, i.e. whether the diff is within the acceptable range | Yes |
Detail Data Structure
| Parameter Name | Type | Description | Required |
|---|---|---|---|
| dataset | string | Dataset name | Yes |
| status | string | Running status of the corresponding evaluation service, e.g.: S: Success, F: Failure, C: Cancelled | Yes |
| accuracy | float | Dataset evaluation result | Yes |
| diff | float | Difference between dataset evaluation result and NVIDIA baseline | Yes |
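As a sketch, a results query might look like the following. The endpoint path is taken from the interface overview (/evaldiffs, GET), and it is assumed here that the request_id is passed in a JSON body as described above; the placeholder must be replaced with the request_id returned by /evaluation.
```bash
# Sketch: query evaluation results for a run started via /evaluation.
# <request_id> is a placeholder; it is assumed the GET endpoint accepts a JSON body.
curl -X GET "http://120.92.17.239:5050/evaldiffs" \
  -H "Content-Type: application/json" \
  -d '{"request_id": "<request_id>"}'
```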
3. Stop Evaluation
Request Interface
header
```json
"Content-Type": "application/json"
```
body:
| Parameter Name | Type | Description | Required |
|---|---|---|---|
| request_id | string | Unique Identifier | Yes |
Response Data
| Parameter Name | Type | Description | Required |
|---|---|---|---|
| err_code | int | Whether the request was processed correctly, 0 for success, 1 for failure | Yes |
| err_message | string | Request processing message | Yes |
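A minimal sketch of a stop request, assuming the http:// scheme; <request_id> is a placeholder for the identifier returned when the evaluation was started.
```bash
# Sketch: stop a running evaluation by its request_id (placeholder value).
curl -X POST "http://120.92.17.239:5050/stop_evaluation" \
  -H "Content-Type: application/json" \
  -d '{"request_id": "<request_id>"}'
```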
4. Resume Evaluation
Request Interface
header
```json
"Content-Type": "application/json"
```
body:
| Parameter Name | Type | Description | Required |
|---|---|---|---|
| request_id | string | Unique Identifier | Yes |
Response Data
| Parameter Name | Type | Description | Required |
|---|---|---|---|
| err_code | int | Whether the request was processed correctly, 0 for success, 1 for failure | Yes |
| err_message | string | Request processing message | Yes |
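The resume call mirrors the stop call above; a sketch with the same placeholder request_id follows.
```bash
# Sketch: resume a previously stopped evaluation from its breakpoint.
curl -X POST "http://120.92.17.239:5050/resume_evaluation" \
  -H "Content-Type: application/json" \
  -d '{"request_id": "<request_id>"}'
```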
5. Query Evaluation Progress
Request Interface
header
```json
"Content-Type": "application/json"
```
body:
| Parameter Name | Type | Description | Required |
|---|---|---|---|
| request_id | string | Unique Identifier | Yes |
| domain | string | Evaluation domain (NLP, MM) | Yes |
Response Data
| Parameter Name | Type | Description | Required |
|---|---|---|---|
| err_code | int | Whether the request was processed correctly, 0 for success, 1 for failure | Yes |
| err_message | string | Request processing message | Yes |
| finished | bool | Whether evaluation is finished | Yes |
| status | string | Evaluation status | Yes |
| datasets_progress | string | Dataset progress | Yes |
| running_dataset | string | Currently running dataset | Yes |
| running_progress | string | Evaluation progress within the running dataset | Yes |
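A sketch of a progress query; both values below are placeholders, and the domain must match the one used when the evaluation was started.
```bash
# Sketch: poll evaluation progress for a given run; "NLP" is an example domain.
curl -X POST "http://120.92.17.239:5050/evaluation_progress" \
  -H "Content-Type: application/json" \
  -d '{"request_id": "<request_id>", "domain": "NLP"}'
```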
6. Query Evaluation Differences
Request Interface
header
```json
"Content-Type": "application/json"
```
body:
| Parameter Name | Type | Description | Required |
|---|---|---|---|
| request_ids | string[] | List of unique identifiers of the evaluations to compare | Yes |
Response Data
| Parameter Name | Type | Description | Required |
|---|---|---|---|
| err_code | int | Whether the request was processed correctly, 0 for success, 1 for failure | Yes |
| err_message | string | Request processing message | Yes |
| eval_diffs | EvalDiff[] | List of evaluation result comparisons | Yes |
EvalDiff Data Structure
| Parameter Name | Type | Description | Required |
|---|---|---|---|
| request_id | string | UUID of the evaluated record | Yes |
| details | Detail[] | Detailed comparison data for each dataset | Yes |
| release | bool | Whether release conditions are met | Yes |
Detail Data Structure
| Parameter Name | Type | Description | Required |
|---|---|---|---|
| dataset | string | Dataset name | Yes |
| base_acc | float | Baseline score | Yes |
| accuracy | float | Score of evaluated dataset | Yes |
| diff | float | Difference between the evaluated score and the baseline score | Yes |
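A sketch of a differences query comparing two runs (for example, the same model evaluated on two different chips); the request IDs are placeholders.
```bash
# Sketch: compare evaluation results across multiple runs, e.g. the same model on two chips.
curl -X POST "http://120.92.17.239:5050/evaluation_diffs" \
  -H "Content-Type: application/json" \
  -d '{"request_ids": ["<request_id_chip_a>", "<request_id_chip_b>"]}'
```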