
Evaluation Metrics

1. Accuracy

Accuracy refers to the model's average correctness across all evaluation instances. Because the notion of correctness varies from task to task, we present the main accuracy measures used, the scenarios in which each measure applies, and the associated formal definitions.

1.1 Exact match

Exact match means that the model-generated answer matches the correct reference answer exactly as a string. Exact match is used as the default accuracy metric on datasets such as HellaSwag, OpenBookQA, TruthfulQA, MMLU, and others.
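As a minimal sketch, exact match reduces to a direct string comparison (the function name below is only illustrative):

    def exact_match(prediction: str, reference: str) -> int:
        # Score 1 only if the generated answer is string-identical to the reference.
        return int(prediction == reference)

    # Example: exact_match("Paris", "Paris") -> 1; exact_match("paris.", "Paris") -> 0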

1.2 Quasi-exact match

Quasi-exact match relaxes the correctness condition of exact match: the model-generated answer is lightly post-processed (e.g., lowercasing, removal of whitespace and punctuation) before being compared with the reference. Quasi-exact match is used on datasets such as BoolQ, IMDB, and RAFT.
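A minimal sketch of this kind of normalization, assuming lowercasing plus punctuation removal and whitespace collapsing (the exact post-processing steps differ per benchmark):

    import string

    def normalize(text: str) -> str:
        # Lowercase, drop punctuation, and collapse whitespace before comparing.
        text = text.lower()
        text = text.translate(str.maketrans("", "", string.punctuation))
        return " ".join(text.split())

    def quasi_exact_match(prediction: str, reference: str) -> int:
        return int(normalize(prediction) == normalize(reference))

    # Example: quasi_exact_match("  Yes. ", "yes") -> 1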

1.3 ROUGE-2

ROUGE-2 uses the standard ROUGE score (Lin, 2004), computed over bigram (2-gram) overlap, to determine correctness. This is the default accuracy metric for CNN/DailyMail and XSUM.
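For illustration, a simplified sketch that computes bigram-overlap F1 over whitespace tokens (reference ROUGE implementations additionally handle stemming and other tokenization details):

    from collections import Counter

    def rouge_2(prediction: str, reference: str) -> float:
        # Count overlapping bigrams between the candidate and the reference,
        # then combine precision and recall into an F1 score.
        def bigrams(text):
            tokens = text.lower().split()
            return Counter(zip(tokens, tokens[1:]))

        pred, ref = bigrams(prediction), bigrams(reference)
        if not pred or not ref:
            return 0.0
        overlap = sum((pred & ref).values())
        precision = overlap / sum(pred.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)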

1.4 Code

For HumanEval (Chen et al., 2021) and APPS (Hendrycks et al., 2021c), we use the relevant code metrics defined in their respective papers.

1.5 Accuracy

With manual (human) evaluation, a correct answer is scored 1 and an incorrect answer 0; the final accuracy is the number of correct answers divided by the total number of answers. This is the default accuracy metric for Baai-open.
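The aggregation itself is simply the mean of the per-instance 0/1 judgments, e.g.:

    def accuracy(judgments: list[int]) -> float:
        # judgments holds 1 for each answer judged correct, 0 otherwise.
        return sum(judgments) / len(judgments)

    # Example: accuracy([1, 0, 1, 1]) -> 0.75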

2. Pass@k

For evaluating code generation models, the model generates k (k = 1, 10, 100) code samples for each unit-test prompt. A problem is considered solved if any of its samples passes the unit tests, and the overall fraction of problems solved is reported as the Pass@k score.

Note: in practice it is common to sample n = 200 completions per problem, count the number c that pass, and use the unbiased estimate 1 - C(n-c, k)/C(n, k), where C(a, b) is the binomial coefficient "a choose b", to reduce the variance of the reported value.
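A sketch of this unbiased estimator, following the numerically stable product form given in Chen et al. (2021):

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased estimate of pass@k: 1 - C(n - c, k) / C(n, k),
        # computed as a product to avoid overflow in the binomial coefficients.
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    # Example: 200 samples per problem, 30 of which pass -> estimated pass@10
    # print(pass_at_k(200, 30, 10))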