Adversarial Training for Improving Model Robustness? Look at Both
Prediction and Interpretation
- URL: http://arxiv.org/abs/2203.12709v1
- Date: Wed, 23 Mar 2022 20:04:14 GMT
- Title: Adversarial Training for Improving Model Robustness? Look at Both
Prediction and Interpretation
- Authors: Hanjie Chen, Yangfeng Ji
- Abstract summary: We propose a novel feature-level adversarial training method named FLAT.
FLAT incorporates variational word masks in neural networks to learn global word importance.
Experiments show the effectiveness of FLAT in improving the robustness with respect to both predictions and interpretations.
- Score: 21.594361495948316
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural language models are vulnerable to adversarial examples that are
semantically similar to their original counterparts but have a few words replaced
by synonyms. A common way to improve model robustness is adversarial training,
which follows two steps: collecting adversarial examples by attacking a target
model, and fine-tuning the model on the dataset augmented with these adversarial
examples. The objective of traditional adversarial training is to make the model
produce the same correct prediction on an original/adversarial example pair.
However, the consistency of the model's decision making across the two similar
texts is ignored. We argue that a robust model should behave consistently on
original/adversarial example pairs, that is, make the same predictions (what)
based on the same reasons (how), which can be reflected by consistent
interpretations. In this work, we propose a novel feature-level adversarial
training method named FLAT. FLAT aims to improve model robustness in terms of
both predictions and interpretations. It incorporates variational word masks
into neural networks to learn global word importance; the masks act as a
bottleneck that teaches the model to make predictions based on important words.
FLAT explicitly targets the vulnerability caused by the mismatch between the
model's understanding of a replaced word and its synonym in an
original/adversarial example pair by regularizing the corresponding global word
importance scores. Experiments show the effectiveness of FLAT in improving the
robustness, with respect to both predictions and interpretations, of four neural
network models (LSTM, CNN, BERT, and DeBERTa) against two adversarial attacks on
four text classification tasks. Models trained with FLAT also show better
robustness than baseline models on unforeseen adversarial examples across
different attacks.
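The abstract describes two ingredients: a learned global word-importance mask that gates the input features, and a regularizer that ties the importance score of a replaced word to that of its synonym across an original/adversarial pair. The PyTorch-style sketch below is not the authors' implementation; the names (WordMaskClassifier, flat_style_loss), the Gumbel-Sigmoid relaxation used to keep the mask differentiable, and the squared-difference consistency penalty are illustrative assumptions that approximate the idea, not the paper's exact variational formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordMaskClassifier(nn.Module):
    """Hypothetical sketch of a feature-level masking bottleneck.

    Each vocabulary word gets one learnable global importance logit; word
    embeddings are scaled by a relaxed Bernoulli mask sampled from that
    logit before reaching the downstream encoder.
    """
    def __init__(self, vocab_size, embed_dim, num_classes, temperature=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One global importance logit per vocabulary word.
        self.importance_logits = nn.Parameter(torch.zeros(vocab_size))
        self.temperature = temperature
        self.encoder = nn.LSTM(embed_dim, 128, batch_first=True)
        self.classifier = nn.Linear(128, num_classes)

    def word_importance(self, token_ids):
        # Global importance in (0, 1) for each token in the batch.
        return torch.sigmoid(self.importance_logits[token_ids])

    def forward(self, token_ids):
        probs = self.word_importance(token_ids)
        if self.training:
            # Binary-Concrete (Gumbel-Sigmoid) sample keeps the mask differentiable.
            noise = torch.rand_like(probs).clamp(1e-6, 1 - 1e-6)
            logits = (torch.log(probs) - torch.log1p(-probs)
                      + torch.log(noise) - torch.log1p(-noise))
            mask = torch.sigmoid(logits / self.temperature)
        else:
            mask = probs
        embedded = self.embedding(token_ids) * mask.unsqueeze(-1)
        _, (hidden, _) = self.encoder(embedded)
        return self.classifier(hidden[-1]), probs


def flat_style_loss(model, orig_ids, adv_ids, labels, swapped_pairs, lam=1.0):
    """Hypothetical combined objective: task loss on both views plus a
    consistency penalty on the global importance of swapped word pairs."""
    logits_o, _ = model(orig_ids)
    logits_a, _ = model(adv_ids)
    task_loss = F.cross_entropy(logits_o, labels) + F.cross_entropy(logits_a, labels)

    # swapped_pairs: LongTensor of shape (num_pairs, 2) holding vocabulary ids
    # of (original word, synonym replacement) pairs produced by the attack.
    imp = torch.sigmoid(model.importance_logits[swapped_pairs])  # (P, 2)
    consistency = (imp[:, 0] - imp[:, 1]).pow(2).mean()
    return task_loss + lam * consistency
```

The LSTM encoder here is only to keep the sketch self-contained; the same masking layer could in principle sit in front of any of the four architectures the paper evaluates (LSTM, CNN, BERT, DeBERTa).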
Related papers
- CONTESTS: a Framework for Consistency Testing of Span Probabilities in Language Models [16.436592723426305]
It is unclear whether language models produce the same value for different ways of assigning joint probabilities to word spans.
Our work introduces a novel framework, ConTestS, involving statistical tests to assess score consistency across interchangeable completion and conditioning orders.
arXiv Detail & Related papers (2024-09-30T06:24:43Z)
- SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation [56.622250514119294]
In contrast to white-box adversarial attacks, transfer attacks are more reflective of real-world scenarios.
We propose a self-augment-based transfer attack method, termed SA-Attack.
arXiv Detail & Related papers (2023-12-08T09:08:50Z)
- Semantic Image Attack for Visual Model Diagnosis [80.36063332820568]
In practice, metric analysis on a specific train and test dataset does not guarantee reliable or fair ML models.
This paper proposes Semantic Image Attack (SIA), a method based on the adversarial attack that provides semantic adversarial images.
arXiv Detail & Related papers (2023-03-23T03:13:04Z)
- In and Out-of-Domain Text Adversarial Robustness via Label Smoothing [64.66809713499576]
We study the adversarial robustness provided by various label smoothing strategies in foundational models for diverse NLP tasks.
Our experiments show that label smoothing significantly improves adversarial robustness in pre-trained models like BERT, against various popular attacks.
We also analyze the relationship between prediction confidence and robustness, showing that label smoothing reduces over-confident errors on adversarial examples (a minimal sketch of standard label smoothing follows the related-papers list below).
arXiv Detail & Related papers (2022-12-20T14:06:50Z)
- How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness? [121.57551065856164]
We propose Robust Informative Fine-Tuning (RIFT) as a novel adversarial fine-tuning method from an information-theoretical perspective.
RIFT encourages an objective model to retain the features learned from the pre-trained model throughout the entire fine-tuning process.
Experimental results show that RIFT consistently outperforms the state-of-the-arts on two popular NLP tasks.
arXiv Detail & Related papers (2021-12-22T05:04:41Z)
- Understanding the Logit Distributions of Adversarially-Trained Deep Neural Networks [6.439477789066243]
Adversarial defenses train deep neural networks to be invariant to the input perturbations from adversarial attacks.
Although adversarial training is successful at mitigating adversarial attacks, the behavioral differences between adversarially-trained (AT) models and standard models are still poorly understood.
We identify three logit characteristics essential to learning adversarial robustness.
arXiv Detail & Related papers (2021-08-26T19:09:15Z)
- On the Lack of Robust Interpretability of Neural Text Classifiers [14.685352584216757]
We assess the robustness of interpretations of neural text classifiers based on pretrained Transformer encoders.
Both tests show surprising deviations from expected behavior, raising questions about the extent of insights that practitioners may draw from interpretations.
arXiv Detail & Related papers (2021-06-08T18:31:02Z)
- On the Transferability of Adversarial Attacks against Neural Text Classifier [121.6758865857686]
We investigate the transferability of adversarial examples for text classification models.
We propose a genetic algorithm to find an ensemble of models that can induce adversarial examples to fool almost all existing models.
We derive word replacement rules that can be used for model diagnostics from these adversarial examples.
arXiv Detail & Related papers (2020-11-17T10:45:05Z)
- Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial Perturbations [65.05561023880351]
Adversarial examples are malicious inputs crafted to induce misclassification.
This paper studies a complementary failure mode, invariance-based adversarial examples.
We show that defenses against sensitivity-based attacks actively harm a model's accuracy on invariance-based attacks.
arXiv Detail & Related papers (2020-02-11T18:50:23Z)
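As referenced in the label-smoothing entry above: label smoothing replaces the one-hot training target with a mixture of the one-hot vector and a uniform distribution over classes, which discourages over-confident predictions. The snippet below is a minimal sketch of the standard uniform variant only, not the specific strategies studied in that paper; the function name and the epsilon value are illustrative choices.

```python
import torch
import torch.nn.functional as F

def label_smoothing_cross_entropy(logits, targets, epsilon=0.1):
    """Cross-entropy against a smoothed target:
    y_k = (1 - epsilon) * one_hot_k + epsilon / num_classes."""
    log_probs = F.log_softmax(logits, dim=-1)
    # Negative log-likelihood of the true class.
    nll = -log_probs.gather(dim=-1, index=targets.unsqueeze(-1)).squeeze(-1)
    # Uniform component: average negative log-probability over all classes.
    uniform = -log_probs.mean(dim=-1)
    return ((1.0 - epsilon) * nll + epsilon * uniform).mean()

# Example: a batch of 4 samples over 3 classes.
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 0])
loss = label_smoothing_cross_entropy(logits, targets)
```

Recent PyTorch versions also expose an equivalent option directly via torch.nn.CrossEntropyLoss(label_smoothing=...).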
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.