Quick Start

Register and Login

By default, unregistered users can view the platform’s homepage, news, leaderboard, and user manual, and can also try the Large Model Arena and Debate Competition features.

To access the evaluation functions, users must register and log in to the platform, apply for participation in the evaluation through the Evaluation Management section, and complete their personal information. Please ensure that all submitted information is accurate and valid. Once the information is submitted and approved by the administrator, users will be granted access to the evaluation features of the platform.

Detailed instructions are as follows:

Register

When users click the [Login/Register] button, the following interface will appear. For first-time users, please scan the QR code using WeChat to follow the “BAAI Community Assistant” official WeChat account.

After scanning the QR code and following the account, the interface will update to the following layout, where users can register online by entering their email, phone number, and verification code.

After completing registration, the system will redirect to the platform homepage. By clicking [Evaluation Console], users can apply for evaluation participation. Users are required to complete their personal information, which the platform administrator will review. Only users who pass the review will be granted access to the evaluation features. The review result will be sent to users via email.


| Parameter | Explanation |
| --- | --- |
| Username | • The username is the unique identifier on the platform. It is recommended to use the spelling of your full name plus numbers. Once filled in, the username cannot be modified.<br>• 3-32 characters; lowercase letters and numbers only, starting with a lowercase letter. |
| Real Name | • Fill in your real name; the platform administrator gives priority to real names during the approval process. |
| Organization | • It is recommended to use a combination of organization + department, e.g., Beijing Academy of Artificial Intelligence Computing Power Platform, Tsinghua University Computer Science Department. The platform administrator gives priority to real organization names during the approval process.<br>• The organization must be provided in both Chinese and English. |
| Task to Register | • Select "Online Evaluation" or "Offline Evaluation".<br>• Online Evaluation: users only need to provide the evaluation interface API, and the evaluation platform provides test data for inference evaluation. (Not supported yet.)<br>• Offline Evaluation: users upload trained models and inference code, and the evaluation platform provides inference computing power and data for inference evaluation. |
| Whether to evaluate self-developed models | • Yes or No, single choice. |
| Agreement statement | • Users must read and consent to the agreement before they can use the platform's evaluation functions. |

The registration process is shown in the following image:

Note:

  • Please fill in your personal information carefully, as the administrator will review your application based on the provided details.
  • Please fill in a valid business email address. The review status will be notified via email and SMS, and future evaluation task updates will be sent by email.
  • If a personal email is used, the administrator will send an email requesting you to update it. Please change it to a business email address and wait for the review again. Each user may modify their email only once per month.

Login

If the user has already completed registration, clicking the [Login/Register] button will bring up the Login page. The user can choose to log in by scanning the QR code at the top of the screen using WeChat Scan, or by using the Mobile Verification Code option.

Alternatively, the user can click [Hugging Face] to log in via a third-party platform, which will redirect to the Hugging Face login page.

The login process is as follows:

Enter the following two pieces of information: Username / Email Address and Password.


If you do not have a Hugging Face account, please register first. You can also log in by authorizing with your Hugging Face account.

Create Evaluation

When users click [Evaluation Management], they will enter the Evaluation Management page, which mainly includes: Model Evaluation, Innovative Algorithm Evaluation, and Image Management.

Users can choose either Model Evaluation or Innovative Algorithm Evaluation based on their needs. By clicking [Create Evaluation], a Create Evaluation dialogue box will pop up. Users should fill in the corresponding form according to the evaluation domain to submit and generate an evaluation task.

After creating an evaluation, the system automatically redirects to the task details page. Users can click to view the "Upload Model & Code" specification and use flageval-serving to upload models and code. After uploading, click "Inference Verification" to quickly verify that the inference evaluation code can run. Once verification passes, click "Start Inference Evaluation" to start the formal inference evaluation process. Wait for the evaluation to finish to view the results; if a problem causes termination or failure, the error message can be viewed in the logs.

Upload Image

In [Image Management], several preset images are provided. If users need to use a custom image during actual evaluations, they can upload their own image under [Custom Image].

When users click [Image Management / Custom Image / Import Image], the [Import Image] dialog box will pop up. Users need to fill in and submit the form. After submission, the platform administrator will review it. Once approved, the image will be automatically imported. Only after a successful import can the image be used in evaluation tasks.

Currently, the platform only supports importing existing images. It does not support building images on the platform using a Dockerfile. The Dockerfile provided by the user is for review purposes only.


Auto-Evaluation

Introduction

Auto-Evaluation is a multi-chip-compatible automatic model evaluation tool built on the FlagEval platform. It covers three domains: LLM, VLM, and Embodied VLM, each bound to classic datasets in the corresponding field. Currently, it only supports online evaluation. The tool provides interfaces for starting evaluations, checking evaluation progress, viewing evaluation results, stopping evaluations, resuming evaluations, and analyzing the differences in evaluation results across multiple chips. Whether you are an individual, a team, or an enterprise, as long as you launch your model service on compute resources reachable from the public network, you can use this tool for automatic evaluation.

Tool Description

1. Address

120.92.17.239:5050

2. Interface Overview

| Interface Name | Method | Description |
| --- | --- | --- |
| /evaluation | POST | Start evaluation; invokes the uploaded service URL synchronously to begin model inference |
| /evaldiffs | GET | Query evaluation results |
| /stop_evaluation | POST | Stop evaluation; use this interface to pause if there are issues with the service |
| /resume_evaluation | POST | Resume evaluation; supports resuming from breakpoints |
| /evaluation_progress | POST | View evaluation progress; returns detailed information on the completed evaluation status of each dataset |
| /evaluation_diffs | POST | View the differences in results between multiple models |

Detailed Interface Parameter Description

1. Start Evaluation

Request Interface

header

```json
"Content-Type": "application/json" 
```

body:

| Parameter Name | Type | Description | Required |
| --- | --- | --- | --- |
| eval_infos | EvalInfo[] | List of model evaluation service information | Yes |
| domain | string | Evaluation domain: NLP, MM | Yes |
| mode | string | Evaluation project identifier; has a default value | No |
| region | string | Evaluation tool cluster: bj (default), sz | No |
| special_event | string | Whether it is a chip evaluation; default is yes | No |
| user_id | int | User ID on the FlagEval platform | No |

EvalInfo Data Structure Parameters

| Parameter Name | Type | Description | Required |
| --- | --- | --- | --- |
| eval_model | string | Name of the evaluation task to start; must be unique (to mark an evaluation as an NVIDIA baseline model, use xxx-nvidia-origin naming) | Yes |
| model | string | The model used when the model is deployed as a service (multiple evaluations started with the same model use the same cache) | Yes |
| eval_url | string | The evaluation interface of each deployed model service, e.g., http://10.1.15.153:9010/v1/chat/completions | Yes |
| tokenizer | string | Vendor and model information, e.g., Qwen/Qwen3-8B | Yes |
| api_key | string | API_KEY for model invocation; defaults to "EMPTY" | No |
| batch_size | int | The batch_size the model can handle; defaults to 1 | No |
| num_concurrent | int | Number of concurrent requests; defaults to 1 | No |
| num_retry | int | Number of retries; defaults to 10 | No |
| gen_kwargs | string | Model generation parameters such as temperature, top_p, etc., separated by commas, e.g., temperature=0.6,top_k=20,top_p=0.95,min_p=0. NLP uses max_gen_toks (for a max_model_len of 16384, set max_gen_toks=16000); MM and EV also use the max_gen_toks field to specify this. | No |
| thinking | bool | Whether to enable thinking mode; currently only applicable to EmbodiedVerse's RoboBrain; defaults to False | No |
| retry_time | string | Timeout duration; currently supported for the MM and Embodied domains; current default is 3600s | No |
| chip | string | Name of the chip used for evaluation, format: Vendor-ChipName; default: Nvidia-H100 | No |
| base_model_name | string | Base model used for evaluation, e.g., Qwen3-8B | No |

Response Data

Note: Since model evaluation takes a long time, the interface temporarily returns the corresponding evaluation ID and batch ID, which can be used to query evaluation results later.

| Parameter Name | Type | Description | Required |
| --- | --- | --- | --- |
| err_code | int | Whether the request was processed correctly: 0 for success, 1 for failure; if 1, request_id is not included in the return data | Yes |
| err_msg | string | Request processing message | Yes |
| request_id | string | Unique identifier | Yes |
| eval_tasks | EvalTask[] | Detailed information on the evaluations started for each service | Yes |
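
For reference, here is a minimal sketch of starting an evaluation with Python and the requests library. It assumes the tool is reachable over plain HTTP at the address above; the eval_model, eval_url, and other values inside eval_infos are illustrative placeholders to replace with your own deployed service details.

```python
import requests

# Assumption: the tool is reachable over plain HTTP at the address given above.
BASE_URL = "http://120.92.17.239:5050"

payload = {
    "domain": "NLP",  # Evaluation domain: NLP or MM
    "eval_infos": [
        {
            # Illustrative values only -- replace with your own service details.
            "eval_model": "my-eval-task-001",
            "model": "Qwen3-8B",
            "eval_url": "http://10.1.15.153:9010/v1/chat/completions",
            "tokenizer": "Qwen/Qwen3-8B",
            "num_concurrent": 4,
            "gen_kwargs": "temperature=0.6,top_k=20,top_p=0.95,min_p=0",
            "chip": "Nvidia-H100",
        }
    ],
}

resp = requests.post(
    f"{BASE_URL}/evaluation",
    json=payload,
    headers={"Content-Type": "application/json"},
    timeout=60,
)
data = resp.json()
if data["err_code"] == 0:
    # Keep the request_id: it is needed later to query progress, results, and diffs.
    print("evaluation started, request_id =", data["request_id"])
else:
    print("failed to start evaluation:", data["err_msg"])
```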

2. View Evaluation Results

Request Interface

header

```json
"Content-Type": "application/json" 
```

body:

| Parameter Name | Type | Description | Required |
| --- | --- | --- | --- |
| request_id | string | Unique identifier | Yes |

Response Data

| Parameter Name | Type | Description | Required |
| --- | --- | --- | --- |
| err_code | int | Whether the request was processed correctly: 0 for success, 1 for failure | Yes |
| err_message | string | Request processing message | Yes |
| eval_results | EvalResultMap | Evaluation result information for each service | |

EvalResultMap Data Structure

| Parameter Name | Type | Description | Required |
| --- | --- | --- | --- |
| EvalResultMap | Map<string, EvalResult> | Evaluation results for all models in one run | Yes |
| EvalResultMap.key | string | The eval_model corresponding to the started evaluation | Yes |
| EvalResultMap.value | EvalResult[] | Evaluation results for a single model | Yes |

EvalResult Data Structure

| Parameter Name | Type | Description | Required |
| --- | --- | --- | --- |
| status | string | Evaluation status, e.g., S: Success, F: Failure, C: Cancelled, OOR: Out of Retries | Yes |
| details | Detail[] | Evaluation results for each dataset of the corresponding evaluation service (currently only supports mmlu, gsm8k) | Yes |
| release | bool | Whether the model can be released, i.e., whether the diff is within the acceptable range | Yes |

Detail Data Structure

| Parameter Name | Type | Description | Required |
| --- | --- | --- | --- |
| dataset | string | Dataset name | Yes |
| status | string | Running status of the corresponding evaluation service, e.g., S: Success, F: Failure, C: Cancelled | Yes |
| accuracy | float | Dataset evaluation result | Yes |
| diff | float | Difference between the dataset evaluation result and the NVIDIA baseline | Yes |
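
A minimal sketch of fetching and printing results follows. It assumes this section corresponds to the /evaldiffs endpoint listed in the interface overview and that the request_id can be sent as a JSON body on that GET endpoint; both are assumptions, so adjust the path, method, or parameter style to match your deployment.

```python
import requests

BASE_URL = "http://120.92.17.239:5050"  # plain-HTTP assumption
REQUEST_ID = "replace-with-the-request_id-returned-by-/evaluation"

# Assumption: /evaldiffs (listed as GET in the overview) accepts the request_id
# as a JSON body; switch to query parameters if your deployment expects them.
resp = requests.get(
    f"{BASE_URL}/evaldiffs",
    json={"request_id": REQUEST_ID},
    headers={"Content-Type": "application/json"},
    timeout=60,
)
data = resp.json()
if data["err_code"] != 0:
    raise SystemExit(data["err_message"])

# eval_results maps each eval_model name to a list of EvalResult entries.
for eval_model, results in data["eval_results"].items():
    for result in results:
        print(eval_model, "status:", result["status"], "release:", result["release"])
        for detail in result["details"]:
            print(f"  {detail['dataset']}: status={detail['status']} "
                  f"accuracy={detail['accuracy']} diff={detail['diff']}")
```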

3. Stop Evaluation

Request Interface

header

```json
"Content-Type": "application/json" 
```

body:

| Parameter Name | Type | Description | Required |
| --- | --- | --- | --- |
| request_id | string | Unique identifier | Yes |

Response Data

| Parameter Name | Type | Description | Required |
| --- | --- | --- | --- |
| err_code | int | Whether the request was processed correctly: 0 for success, 1 for failure | Yes |
| err_message | string | Request processing message | Yes |
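
A minimal sketch of stopping a running evaluation, under the same plain-HTTP assumption; replace the request_id placeholder with the value returned when the evaluation was started.

```python
import requests

BASE_URL = "http://120.92.17.239:5050"  # plain-HTTP assumption
REQUEST_ID = "replace-with-your-request_id"

# Pause a running evaluation, e.g. while fixing an issue with the model service.
resp = requests.post(
    f"{BASE_URL}/stop_evaluation",
    json={"request_id": REQUEST_ID},
    headers={"Content-Type": "application/json"},
    timeout=60,
)
data = resp.json()
print("stopped" if data["err_code"] == 0 else f"error: {data['err_message']}")
```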

4. Resume Evaluation

Request Interface

header

```json
"Content-Type": "application/json" 
```

body:

| Parameter Name | Type | Description | Required |
| --- | --- | --- | --- |
| request_id | string | Unique identifier | Yes |

Response Data

| Parameter Name | Type | Description | Required |
| --- | --- | --- | --- |
| err_code | int | Whether the request was processed correctly: 0 for success, 1 for failure | Yes |
| err_message | string | Request processing message | Yes |
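
Resuming uses the same request shape; the sketch below (same plain-HTTP assumption and request_id placeholder as above) asks the tool to continue the evaluation from its breakpoint.

```python
import requests

BASE_URL = "http://120.92.17.239:5050"  # plain-HTTP assumption
REQUEST_ID = "replace-with-your-request_id"

# Resume a previously stopped evaluation from its breakpoint.
resp = requests.post(
    f"{BASE_URL}/resume_evaluation",
    json={"request_id": REQUEST_ID},
    headers={"Content-Type": "application/json"},
    timeout=60,
)
data = resp.json()
print("resumed" if data["err_code"] == 0 else f"error: {data['err_message']}")
```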

5. Query Evaluation Progress

Request Interface

header

```json
"Content-Type": "application/json" 
```

body:

| Parameter Name | Type | Description | Required |
| --- | --- | --- | --- |
| request_id | string | Unique identifier | Yes |
| domain | string | Evaluation domain (NLP, MM) | Yes |

Response Data

| Parameter Name | Type | Description | Required |
| --- | --- | --- | --- |
| err_code | int | Whether the request was processed correctly: 0 for success, 1 for failure | Yes |
| err_message | string | Request processing message | Yes |
| finished | bool | Whether the evaluation is finished | Yes |
| status | string | Evaluation status | Yes |
| datasets_progress | string | Progress across datasets | Yes |
| running_dataset | string | Currently running dataset | Yes |
| running_progress | string | Evaluation progress within the running dataset | Yes |
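
A minimal sketch of polling progress, assuming plain HTTP and an NLP-domain evaluation; replace the request_id placeholder with your own.

```python
import requests

BASE_URL = "http://120.92.17.239:5050"  # plain-HTTP assumption
REQUEST_ID = "replace-with-your-request_id"

resp = requests.post(
    f"{BASE_URL}/evaluation_progress",
    json={"request_id": REQUEST_ID, "domain": "NLP"},
    headers={"Content-Type": "application/json"},
    timeout=60,
)
data = resp.json()
if data["err_code"] == 0:
    print("finished:", data["finished"], "status:", data["status"])
    print("datasets progress:", data["datasets_progress"])
    print("running dataset:", data["running_dataset"],
          "progress:", data["running_progress"])
else:
    print("error:", data["err_message"])
```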

6. Query Evaluation Differences

Request Interface

header

```json
"Content-Type": "application/json" 
```

body:

| Parameter Name | Type | Description | Required |
| --- | --- | --- | --- |
| request_ids | string[] | Unique identifiers | Yes |

Response Data

| Parameter Name | Type | Description | Required |
| --- | --- | --- | --- |
| err_code | int | Whether the request was processed correctly: 0 for success, 1 for failure | Yes |
| err_message | string | Request processing message | Yes |
| eval_diffs | EvalDiff[] | List of evaluation result comparisons | Yes |

EvalDiff Data Structure

| Parameter Name | Type | Description | Required |
| --- | --- | --- | --- |
| request_id | string | UUID of the evaluated record | Yes |
| details | Detail[] | Detailed comparison data for each dataset | Yes |
| release | bool | Whether release conditions are met | Yes |

Detail Data Structure

| Parameter Name | Type | Description | Required |
| --- | --- | --- | --- |
| dataset | string | Dataset name | Yes |
| base_acc | float | Baseline score | Yes |
| accuracy | float | Score of the evaluated dataset | Yes |
| diff | float | Difference between the evaluated dataset and the baseline dataset | Yes |
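
A minimal sketch of comparing two runs, for example a vendor-chip run against its xxx-nvidia-origin baseline; the request IDs are placeholders and the plain-HTTP assumption from earlier applies.

```python
import requests

BASE_URL = "http://120.92.17.239:5050"  # plain-HTTP assumption

# Placeholder IDs -- use the request_id values returned when the runs were started.
REQUEST_IDS = ["replace-with-request_id-1", "replace-with-request_id-2"]

resp = requests.post(
    f"{BASE_URL}/evaluation_diffs",
    json={"request_ids": REQUEST_IDS},
    headers={"Content-Type": "application/json"},
    timeout=60,
)
data = resp.json()
if data["err_code"] != 0:
    raise SystemExit(data["err_message"])

for diff in data["eval_diffs"]:
    print("request_id:", diff["request_id"], "release:", diff["release"])
    for detail in diff["details"]:
        print(f"  {detail['dataset']}: base_acc={detail['base_acc']} "
              f"accuracy={detail['accuracy']} diff={detail['diff']}")
```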