FlagEval

Introduction of Robustness

Robustness refers to the ability of a model to maintain stability and efficiency in the face of anomalies, noise, interference, changes, or malicious attacks. In the abstract, a foundation model (including learning-based deep learning models) takes a data input $X$, and the parametric model $F_{θ} (\cdot)$ applies its defined computation to produce the expected output $Y$. Robustness can usually be understood as whether the model still gives the correct output in the presence of noise. Specifically, given a perturbation $Δ X$, we ask whether the model's output on the perturbed input, $F_{θ} (X + Δ X)$, still equals the expected output $Y$, and we quantify the difference as $Δ Y$. In addition, the perturbation must not affect human understanding of $X$. Therefore, when constructing text noise, the evaluation designs $Δ X$ so that $X + Δ X$ differs little from the original $X$ to a human reader but easily causes the model to produce a wrong output.

We evaluate the robustness of the model by perturbing test instances. Specifically, we perturb each data set to varying degrees, using two categories of perturbation. The first is common mistakes made by humans in the real world, divided into three levels: character level, word level, and sentence level. The character level includes replacement with similar characters and replacement with adjacent characters on the keyboard; the word level includes synonym replacement and replacement with words that are close in the semantic space of a surrogate model; the sentence level is mainly back-translation. The second is targeted perturbation, such as using surrogate models to conduct adversarial attacks. After applying these perturbations, we obtain several perturbed data sets for each original data set, and we compute the model's robustness index on that data set from its evaluation results on the perturbed data sets.

Datasets

CSL

The robustness data set is constructed using three kinds of perturbation: character level, sentence level, and word level.

The names of the perturbed data sets are as follows:

disturbance dataset name	disturbance methods
C-morphonym	disturbance-char-morphonym
W-maskedlm	disturbance-word-masked-lm
S-backtranslation	disturbance-sentence-back-translation

C, W, S, and Adv are short for Char, Word, Sentence, and Adversarial.

Character (char) level

Randomly select 3 to 15 characters for replacement. The perturbation methods are as follows:

1. Similar character transformation

Disturbance sample

  {
    "id": 1,
    "abst": "免疫滋珠是免疫微球的一种,它是包被有单克隆抗体的球型磁性微粒,可峙异性地与靶物质结合使之具有磁响应姓,可以保证被粉离靶细胞的形态和功能的完整,具有灵敏度高、特异性高、检测速度快、重蝮性好、操作简单和不需要昂贵的仪器设备等优点,本文就该技术应用于肿瘤细胞的分离、富集与检测以及肿瘤的生物学研究和磁导向治疗、免疫磁性净化寺龄域的研究进展作一综述.",
    "keyword": ["综述", "循环肿瘤细胞", "免疫磁珠技术"],
    "label": "1"
  }

2. Homophone character transformation

Disturbance sample

 {
    "id": 1,
    "abst": "键立了三维格子玻尔幔方法(LBM)-元胞自动机(CA)耦恰数置模型,迸用该漠型模拟研究了Al-4.7％Cu(质量纷数)固溶体搭金的凝固过程.该耦哈模型采通元胞自动机方法模拟枝晶的生长,同时采基于分子动力学理论的格子玻尔滋曼方法模拟给金凝固过程中的温度场、流杨以及蓉质场.模拟结果再现了拾金凝固过程中的三维枝晶型貌变化以及溶质富集过程,并将三维流场因素考虑进去,定量研究了自然对流、过冷度对单枝晶形貌和盛粉分布的影响.研究表明,在纯扩散条件下,枝晶呈现忖称的生长现橡,模拟自庙枝晶隐态生长的尖端速度、尖端判泾和过铃度的关系与Lipton-Glicksman-Kurz(LGK)理论模型吻塔得较好.在自然对流条件下,枝晶的生长形貌呈现不对称性,膀枝晶性长在迎流方向上拉至了促进,在顺流方向上受到了抑制.榕体过伶度对枝晶生长的影响较大,过冷度的增架导致枝晶生长驾快,二姿枝晶增多且呈现出粗化现象,枝晶尖端固液界面处的溶质浓度篇高,加重了溶质偏析.",
    "keyword": ["枝晶生长", "三维LBM-CA模型", "数值模拟", "流场"],
    "label": "1"
  }
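
The character-level procedure described above can be sketched roughly as follows. This is only an illustration, not the tool used to build the data set: the `CONFUSION` table, the function name `perturb_chars`, and its parameters are hypothetical, and a real table of similar-looking characters or homophones would be much larger.

```python
import random

# Hypothetical confusion table (an assumption for illustration only):
# each character maps to similar-looking characters or homophones.
CONFUSION = {
    "磁": ["滋", "慈"],
    "特": ["峙", "持"],
    "性": ["姓", "幸"],
    "分": ["粉", "纷"],
    "复": ["蝮", "腹"],
}


def perturb_chars(text, confusion, min_n=3, max_n=15, seed=None):
    """Replace between min_n and max_n characters that appear in `confusion`.

    If the text has fewer replaceable characters than the number drawn,
    every replaceable character is perturbed instead.
    """
    rng = random.Random(seed)
    # Only positions whose character has a known substitute can be perturbed.
    candidates = [i for i, ch in enumerate(text) if ch in confusion]
    if not candidates:
        return text
    n = min(rng.randint(min_n, max_n), len(candidates))
    chars = list(text)
    for i in rng.sample(candidates, n):
        chars[i] = rng.choice(confusion[chars[i]])
    return "".join(chars)


if __name__ == "__main__":
    sample = "免疫磁珠可特异性地分离细胞,重复性好"
    print(perturb_chars(sample, CONFUSION, seed=0))
```

Passing a table of similar-looking characters gives the first transformation, and passing a homophone table gives the second; the sampling logic is the same in both cases.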

Sentence level

Perform back-translation perturbation on the abstract and keywords (translate into English, then translate back into Chinese). The perturbation variants are as follows:

csl_test_public_back_translation_abst_keyword (back-translation perturbation of both abstract and keywords)

Disturbance sample

{
    "id": 1,
    "abst": "免疫磁珠是免疫微球的一种。它们是涂有单克隆抗体的球形磁性颗粒。它们可以特异性地与目标物质结合，使其产生磁响应，确保分离的目标细胞的形态和功能。其完整且具有灵敏度高、特异性高、检测速度快、重现性好、操作简单、不需要昂贵的仪器设备等优点。本文讨论该技术在肿瘤细胞分离、富集和检测以及肿瘤生物学方面的应用。本文综述了科学研究及磁引导治疗、免疫磁净化等领域的研究进展。",
    "keyword": ["综述", "循环肿瘤细胞", "磁免疫珠技术"],
    "label": "1"
}

csl_test_public_back_translation_only_abst (only abstracts are perturbed with back translation)

Disturbance sample
{
    "id": 1,
    "abst": "针对船舶航行中的混沌运动控制问题，从船舶操纵运动的非线性模型出发，提出一种基于受控混沌系统Melnikov函数的矩形脉冲摄动控制方法。该控制方法利用矩形脉冲来扰动混沌系统的参数。通过求解混沌系统的同宿轨道，构造受控混沌系统的Melnikov函数，并结合Melnikov函数简单零点出现的边界条件，从数学上确定微扰脉冲参数的值，避免了需要实现混沌控制时控制脉冲参数。选择的盲目性。船舶混沌运动控制仿真实验表明，该方法能够快速将系统混沌运动稳定到周期轨道上，其幅度降低至原混沌系统的8.5%；同时，实验结果表明，该方法在船舶混沌运动控制中能够有效发挥作用。",
    "keyword": ["同宿轨道", "航向保持", "参量微扰", "矩形脉冲"],
    "label": "1"
  }

Word level

Use masked language modeling to replace words.

Disturbance sample

{
    "id": 1,
    "abst": "免疫磁珠是免疫微球的一种,它是包被有单克隆状的球型磁性微粒,可特异性地与靶物质结合使之具有磁响应性,可以保证被被靶细胞的形态和功能的完整,具有灵敏度高、特异性高、检测速度快、重复性好、操作简单和不需要维护的仪器设备等优点,本文就该技术运用于肿瘤细胞的形成、富集与检测以及其的生物学研究和磁导向治疗、免疫磁性净化等应用的研究进展作一综述.",
    "keyword": ["综述", "循环肿瘤细胞", "免疫磁珠技术"],
    "label": "1"
  }
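
A minimal sketch of masked-language-model word replacement is given below, under stated assumptions. The source does not say which model is used, how many words are replaced, or how the text is tokenized, so `perturb_words`, `predict_masked`, `n_replace`, and `mask_token` are all hypothetical names. The masked language model is passed in as a callable so that the sketch stays independent of any particular library.

```python
import random


def perturb_words(tokens, predict_masked, n_replace=3,
                  mask_token="[MASK]", seed=None):
    """Replace up to n_replace tokens with masked-LM predictions.

    `predict_masked(masked_tokens, position)` is a hypothetical callable
    that returns candidate words for the masked position, best first.
    """
    rng = random.Random(seed)
    out = list(tokens)  # work on a copy; the input list is not modified
    positions = rng.sample(range(len(out)), min(n_replace, len(out)))
    for pos in positions:
        original = out[pos]
        masked = out[:pos] + [mask_token] + out[pos + 1:]
        # Take the best-ranked candidate that differs from the original word.
        for candidate in predict_masked(masked, pos):
            if candidate != original:
                out[pos] = candidate
                break
    return out


if __name__ == "__main__":
    # Stand-in predictor for demonstration; a real one would query a model.
    def toy_predictor(masked_tokens, position):
        return ["应用", "形成"]

    print(perturb_words(["本文", "就", "该", "技术", "运用"],
                        toy_predictor, n_replace=2, seed=0))
```

Because each position is masked and predicted from its context, the replacement usually stays fluent but can change the meaning, as in the sample above where some words are swapped for plausible but different ones.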

Robustness Metrics（RB-index）

For the original data set and the perturbed data sets we have $Acc_{org}$, $Acc_{dist_1}$, $Acc_{dist_2}$, $Acc_{dist_3}$, $\ldots$, $Acc_{dist_T}$ ($Acc$ refers to the evaluation metric of the model on the given data set, $org$ refers to the original data set, and $dist_1, \ldots, dist_T$ refer to the $T$ perturbed data sets).

The calculation formula of the robustness index on this data set is:

$Robustness = \frac{1}{T \cdot Acc_{org}} \sum_{i=1}^{T} (Acc_{org} - Acc_{dist_i})$

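Smaller values of the robustness metric indicate better model robustness. The value can be negative, which happens when the model scores higher on the perturbed data sets than on the original one (mostly found in NLP).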
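
The robustness index defined above is the average relative drop in the evaluation metric across the $T$ perturbed data sets. A minimal sketch of the calculation follows; the function name `rb_index` and the accuracy values in the usage example are made up for illustration and are not results from any evaluated model.

```python
def rb_index(acc_org, acc_dist):
    """Robustness index: mean relative drop of the metric under perturbation.

    acc_org  -- metric on the original data set
    acc_dist -- metrics on the T perturbed data sets
    """
    if not acc_dist:
        raise ValueError("at least one perturbed data set is required")
    if acc_org == 0:
        raise ValueError("acc_org must be non-zero")
    t = len(acc_dist)
    return sum(acc_org - acc for acc in acc_dist) / (t * acc_org)


if __name__ == "__main__":
    # Illustrative numbers only: original accuracy 0.80, three perturbed sets.
    print(rb_index(0.80, [0.72, 0.76, 0.68]))  # approximately 0.1
    # A model that improves under perturbation yields a negative index.
    print(rb_index(0.50, [0.60]))  # approximately -0.2
```

With these illustrative numbers, the drops are 0.08, 0.04, and 0.12, which sum to 0.24; dividing by $T \cdot Acc_{org} = 3 \times 0.80 = 2.4$ gives about 0.1, meaning the metric falls by roughly 10% on average under perturbation.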
