Introduction to Robustness
Robustness refers to a model's ability to maintain stable and effective behavior in the face of anomalies, noise, interference, changes, or malicious attacks. In the abstract, a model (including learning-based deep models) maps a data input to an output; robustness asks whether that output remains correct when the input is perturbed.
We evaluate a model's robustness by perturbing its input instances. Specifically, we perturb each dataset to varying degrees at two broad levels. The first simulates common human mistakes in the real world and is divided into three sub-levels: character level, word level, and sentence level. The character level covers replacement of visually similar characters and of keyboard-adjacent characters; the word level covers replacement of words by synonyms and by neighbors in a proxy model's semantic (embedding) space; the sentence level mainly covers back-translation. The second is targeted perturbation, such as adversarial attacks mounted with a proxy model. After applying these perturbations, we obtain a perturbed counterpart of each original dataset, and we compute the model's robustness index from its evaluation results on the original and perturbed datasets.
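As a rough sketch (function and variable names here are hypothetical, not the benchmark's actual code), the first step of this procedure, producing a perturbed copy of a dataset, might look like:

```python
import random

def perturb_dataset(records, perturb_fn, seed=0):
    """Apply a perturbation function to the natural-language part of each record.

    `records` is a list of dicts with a "prompt" field, as in HumanEval;
    `perturb_fn` stands for any of the character/word/sentence-level
    perturbations described above.
    """
    rng = random.Random(seed)  # fixed seed so the perturbed set is reproducible
    return [{**r, "prompt": perturb_fn(r["prompt"], rng)} for r in records]

# Example with a trivial character-level perturbation (o -> 0, OCR-style):
ocr = lambda text, rng: text.replace("o", "0")
perturbed = perturb_dataset([{"task_id": "HumanEval/0", "prompt": "sort the list"}], ocr)
```

The perturbed copy keeps every field except `prompt` intact, so the original test harness can still score it.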
Datasets
HumanEval
The robustness datasets are constructed without using a proxy model to evaluate the perturbation results, and the perturbations are grouped into character level, word level, and sentence level. Specifically, the `prompt` field of each record is perturbed: the code-description part of the `prompt` (the natural-language docstring text that precedes the `>>>` doctest examples) is extracted and perturbed.
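Extracting the description from a HumanEval-style prompt can be sketched as follows (a simplification under the assumption that the description sits in the first triple-quoted docstring, before any `>>>` doctest lines):

```python
import re

def extract_description(prompt):
    """Pull the natural-language docstring text out of a HumanEval-style prompt."""
    m = re.search(r'"""(.*?)"""', prompt, re.DOTALL)
    if not m:
        return ""
    doc = m.group(1)
    # Keep only the prose before the first doctest example.
    return doc.split(">>>")[0].strip()

prompt = 'def add(a, b):\n    """ Return the sum of a and b.\n    >>> add(1, 2)\n    3\n    """\n'
```

Only this extracted span is perturbed; the surrounding code and doctests are left untouched so the task remains well-defined.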
The names of the perturbed datasets are as follows:
| Perturbed dataset | Perturbation method |
|---|---|
| C-keyboard | disturbance-char-keyboard |
| C-ocr | disturbance-char-ocr |
| C-morphonym | disturbance-char-morphonym |
| W-synonym | disturbance-word-synonym |
| W-wordembedding | disturbance-word-word-embedding |
| W-maskedlm | disturbance-word-masked-lm |
| S-backtranslation | disturbance-sentence-back-translation |
| Adv | adversarial |
C, W, S, and Adv are short for character, word, sentence, and adversarial, respectively.
Character level
Pick 1 to 3 words at random, choose 1 to 2 characters in each word to replace, and perturb as follows:
ocr (o → 0)
ocr-perturbed HumanEval example:

```json
{
  "task_id": "HumanEval/0",
  "prompt": "from typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n    \"\"\" Check if in given list of numbers, ake any tw0 numbers closer to each other than\n    given threshold.\n    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n    False\n    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n    True\n    \"\"\"\n",
  "canonical_solution": "    for idx, elem in enumerate(numbers):\n        for idx2, elem2 in enumerate(numbers):\n            if idx != idx2:\n                distance = abs(elem - elem2)\n                if distance < threshold:\n                    return True\n\n    return False\n",
  "test": "METADATA = {\n    'author': 'jt',\n    'dataset': 'test'\n}\n\n\ndef check(candidate):\n    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True\n    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False\n    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True\n    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False\n    assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True\n    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True\n    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False\n",
  "entry_point": "has_close_elements"
}
```
keyboard (q → w)
keyboard-perturbed HumanEval example:

```json
{
  "task_id": "HumanEval/0",
  "prompt": "from typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n    \"\"\" Check if in given list of numbers, are any two numbers closer to each oFher thqn\n    giFen threshold.\n    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n    False\n    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n    True\n    \"\"\"\n",
  "canonical_solution": "    for idx, elem in enumerate(numbers):\n        for idx2, elem2 in enumerate(numbers):\n            if idx != idx2:\n                distance = abs(elem - elem2)\n                if distance < threshold:\n                    return True\n\n    return False\n",
  "test": "METADATA = {\n    'author': 'jt',\n    'dataset': 'test'\n}\n\n\ndef check(candidate):\n    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True\n    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False\n    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True\n    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False\n    assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True\n    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True\n    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False\n",
  "entry_point": "has_close_elements"
}
```
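The character-level procedure described above can be sketched as follows. The confusion tables here are small illustrative subsets, not the benchmark's full OCR and keyboard-adjacency maps:

```python
import random

# Toy confusion tables (illustrative subsets, not the benchmark's full maps).
OCR_MAP = {"o": "0", "l": "1", "s": "5"}
KEYBOARD_MAP = {"q": "w", "a": "s", "o": "p"}  # adjacent keys on a QWERTY layout

def perturb_chars(text, table, n_words=(1, 3), n_chars=(1, 2), seed=0):
    """Pick 1-3 random words and replace 1-2 characters in each via `table`."""
    rng = random.Random(seed)
    words = text.split()
    k = min(rng.randint(*n_words), len(words))
    for i in rng.sample(range(len(words)), k):
        chars = list(words[i])
        # Positions whose character has an entry in the confusion table.
        hits = [j for j, c in enumerate(chars) if c in table]
        for j in rng.sample(hits, min(rng.randint(*n_chars), len(hits))):
            chars[j] = table[chars[j]]
        words[i] = "".join(chars)
    return " ".join(words)
```

Each output differs from the input only at mapped character positions, which keeps the perturbation plausible as a human typo or OCR error.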
Word level
Choose 1 to 3 words at random to replace, perturbing as follows:
word_embedding (the GloVe 6B-300d embeddings are used to replace selected words with semantically similar words)
word_embedding-perturbed HumanEval example:

```json
{
  "task_id": "HumanEval/0",
  "prompt": "from typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n    \"\"\" Check if in good list of numbers, are any between numbers closer to each most than\n    given threshold.\n    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n    False\n    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n    True\n    \"\"\"\n",
  "canonical_solution": "    for idx, elem in enumerate(numbers):\n        for idx2, elem2 in enumerate(numbers):\n            if idx != idx2:\n                distance = abs(elem - elem2)\n                if distance < threshold:\n                    return True\n\n    return False\n",
  "test": "METADATA = {\n    'author': 'jt',\n    'dataset': 'test'\n}\n\n\ndef check(candidate):\n    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True\n    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False\n    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True\n    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False\n    assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True\n    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True\n    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False\n",
  "entry_point": "has_close_elements"
}
```
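The embedding-based replacement is a nearest-neighbor lookup in vector space. The sketch below substitutes a tiny hand-made vector table for the real GloVe 6B-300d vectors, purely to show the mechanism (note how "given" maps to "good", as in the example above):

```python
import numpy as np

# Tiny stand-in for GloVe vectors (the real pipeline loads glove.6B.300d).
VECTORS = {
    "given":  np.array([1.0, 0.2, 0.0]),
    "good":   np.array([0.9, 0.3, 0.1]),
    "list":   np.array([0.0, 1.0, 0.2]),
    "number": np.array([0.1, 0.0, 1.0]),
}

def nearest_neighbor(word):
    """Replace `word` with its most cosine-similar neighbor in the table."""
    if word not in VECTORS:
        return word  # out-of-vocabulary words are left unchanged
    v = VECTORS[word]
    best, best_sim = word, -1.0
    for other, u in VECTORS.items():
        if other == word:
            continue
        sim = float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u)))
        if sim > best_sim:
            best, best_sim = other, sim
    return best
```

Because the neighbor is only semantically *similar*, not synonymous, this perturbation can change the meaning of the description more than the synonym-based one.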
synonym (replace selected words with synonyms)
synonym-perturbed HumanEval example:

```json
{
  "task_id": "HumanEval/0",
  "prompt": "from typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n    \"\"\" Condition if in sacrifice list of numbers, exist any two numbers closer to each other than\n    given threshold.\n    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n    False\n    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n    True\n    \"\"\"\n",
  "canonical_solution": "    for idx, elem in enumerate(numbers):\n        for idx2, elem2 in enumerate(numbers):\n            if idx != idx2:\n                distance = abs(elem - elem2)\n                if distance < threshold:\n                    return True\n\n    return False\n",
  "test": "METADATA = {\n    'author': 'jt',\n    'dataset': 'test'\n}\n\n\ndef check(candidate):\n    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True\n    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False\n    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True\n    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False\n    assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True\n    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True\n    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False\n",
  "entry_point": "has_close_elements"
}
```
Sentence level
A language model is used to perturb whole sentences.

back_translation (use the Helsinki-NLP/opus-mt-en-ROMANCE and Helsinki-NLP/opus-mt-ROMANCE-en models to translate sentences in the code-description part into another language and back again)
back_translation-perturbed HumanEval example:

```json
{
  "task_id": "HumanEval/0",
  "prompt": "from typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n    \"\"\" Check if in the given list of numbers, there are two numbers closer to each other than\n    the given threshold.\n    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n    False\n    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n    True\n    \"\"\"\n",
  "canonical_solution": "    for idx, elem in enumerate(numbers):\n        for idx2, elem2 in enumerate(numbers):\n            if idx != idx2:\n                distance = abs(elem - elem2)\n                if distance < threshold:\n                    return True\n\n    return False\n",
  "test": "METADATA = {\n    'author': 'jt',\n    'dataset': 'test'\n}\n\n\ndef check(candidate):\n    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True\n    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False\n    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True\n    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False\n    assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True\n    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True\n    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False\n",
  "entry_point": "has_close_elements"
}
```
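A minimal round-trip wrapper for back-translation might look like the following. The two translator callables are placeholders standing in for the Helsinki-NLP Marian models named above; any `str -> str` functions fit the same shape:

```python
def back_translate(text, to_romance, to_english):
    """English -> ROMANCE language -> English round trip.

    `to_romance` / `to_english` stand in for the Helsinki-NLP
    opus-mt-en-ROMANCE and opus-mt-ROMANCE-en models.
    """
    return to_english(to_romance(text))

# With the real models this would be roughly (not run here; requires a
# model download via the Hugging Face `transformers` library):
#   from transformers import pipeline
#   mt = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ROMANCE")
#   ...
```

The round trip tends to paraphrase rather than corrupt, which is why the back-translated example above reads as fluent English that merely rephrases the original description.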
Robustness Metrics (RB-index)
For the original dataset and each perturbed dataset, we evaluate the model and obtain a score on each. The robustness index (RB-index) for a perturbed dataset is then computed from these two scores.
Smaller values of the robustness metric indicate better model robustness; the value can be negative (most often observed on NLP tasks).
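The exact formula is not reproduced above; one common definition consistent with the description (smaller is better, zero means no degradation, negative when the perturbed score exceeds the original) is the relative performance drop. This is an assumption about the intended metric, not the benchmark's verbatim formula:

```python
def rb_index(score_original, score_perturbed):
    """Hypothetical robustness index: relative performance drop.

    0 means no degradation, larger values mean less robust, and the
    value is negative when the model scores *higher* on the perturbed
    set than on the original one.
    """
    return (score_original - score_perturbed) / score_original
```

For example, with pass@1 scores of 0.50 on the original set and 0.40 on a perturbed set, the index is positive (a robustness loss); with 0.50 and 0.55 it is negative (the perturbation happened to help).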