Rethinking Textual Adversarial Defense for Pre-trained Language Models
- URL: http://arxiv.org/abs/2208.10251v1
- Date: Thu, 21 Jul 2022 07:51:45 GMT
- Title: Rethinking Textual Adversarial Defense for Pre-trained Language Models
- Authors: Jiayi Wang, Rongzhou Bao, Zhuosheng Zhang, Hai Zhao
- Abstract summary: A literature review shows that pre-trained language models (PrLMs) are vulnerable to adversarial attacks.
We propose a novel metric (Degree of Anomaly) to enable current adversarial attack approaches to generate more natural and imperceptible adversarial examples.
We show that our universal defense framework achieves after-attack accuracy comparable to or even higher than attack-specific defenses.
- Score: 79.18455635071817
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although pre-trained language models (PrLMs) have achieved significant
success, recent studies demonstrate that PrLMs are vulnerable to adversarial
attacks. By generating adversarial examples with slight perturbations on
different levels (sentence / word / character), adversarial attacks can fool
PrLMs to generate incorrect predictions, which questions the robustness of
PrLMs. However, we find that most existing textual adversarial examples are
unnatural and can be easily distinguished by both humans and machines. Based
on a general anomaly detector, we propose a novel metric (Degree of Anomaly) as
a constraint to enable current adversarial attack approaches to generate more
natural and imperceptible adversarial examples. Under this new constraint, the
success rate of existing attacks drops drastically, which reveals that PrLMs
are not as fragile as previously claimed. In addition, we find
that four types of randomization can invalidate a large portion of textual
adversarial examples. Based on the anomaly detector and randomization, we design a
universal defense framework, which is among the first to perform textual
adversarial defense without knowing the specific attack. Empirical results show
that our universal defense framework achieves after-attack accuracy comparable
to or even higher than attack-specific defenses, while preserving higher
original accuracy at the same time. Our work discloses the essence of textual
adversarial attacks and indicates that (1) future work on adversarial attacks
should focus more on overcoming detection and resisting randomization;
otherwise, the resulting adversarial examples will be easily detected and
invalidated; and (2) compared with unnatural and perceptible
adversarial examples, it is those undetectable adversarial examples that pose
real risks for PrLMs and require more attention for future robustness-enhancing
strategies.
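To make the abstract's defense idea concrete, the Python sketch below shows one plausible way to combine an anomaly detector with input randomization and voting. It is an assumption-laden illustration, not the authors' implementation: the `anomaly_score` detector, the 0.5 threshold, random token masking, and the majority vote are hypothetical stand-ins for the paper's Degree-of-Anomaly constraint and its four randomization schemes.

```python
import random
from typing import Callable, List, Sequence

ABSTAIN = -1  # label returned when the detector flags the input as anomalous


def defended_predict(
    classify: Callable[[str], int],          # victim PrLM wrapped as text -> label
    anomaly_score: Callable[[str], float],   # detector: higher = more anomalous
    text: str,
    anomaly_threshold: float = 0.5,          # assumed value, not from the paper
    num_votes: int = 5,
    mask_prob: float = 0.15,
    mask_token: str = "[MASK]",
) -> int:
    """Attack-agnostic prediction: reject inputs the detector finds unnatural,
    then classify several randomized copies and return the majority label."""
    # 1) Anomaly check: unnatural-looking inputs are rejected (abstain).
    if anomaly_score(text) > anomaly_threshold:
        return ABSTAIN

    # 2) Input randomization: mask a random fraction of tokens so the exact
    #    perturbations an attacker picked are likely to be disturbed.
    def randomize(tokens: Sequence[str]) -> str:
        return " ".join(
            mask_token if random.random() < mask_prob else tok for tok in tokens
        )

    tokens = text.split()
    votes: List[int] = [classify(randomize(tokens)) for _ in range(num_votes)]

    # 3) Majority vote over the randomized copies.
    return max(set(votes), key=votes.count)


# Toy stand-ins that only demonstrate the calling convention; a real setup
# would wrap a fine-tuned PrLM classifier and a trained anomaly detector.
if __name__ == "__main__":
    toy_classify = lambda s: int("terrible" in s)        # 1 = negative sentiment
    toy_anomaly = lambda s: 0.9 if "zxqv" in s else 0.1  # fake anomaly detector
    print(defended_predict(toy_classify, toy_anomaly, "the movie was terrible"))
    print(defended_predict(toy_classify, toy_anomaly, "the movie was zxqv"))
```

Because the randomization perturbs each copy independently, an attacker who tuned a specific word substitution against the undefended model cannot rely on that substitution surviving every vote.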
Related papers
- Meta Invariance Defense Towards Generalizable Robustness to Unknown Adversarial Attacks [62.036798488144306]
Current defenses mainly focus on known attacks, but adversarial robustness to unknown attacks is seriously overlooked.
We propose an attack-agnostic defense method named Meta Invariance Defense (MID).
We show that MID simultaneously achieves robustness to the imperceptible adversarial perturbations in high-level image classification and attack-suppression in low-level robust image regeneration.
arXiv Detail & Related papers (2024-04-04T10:10:38Z)
- Fooling the Textual Fooler via Randomizing Latent Representations [13.77424820701913]
Adversarial word-level perturbations are well-studied and effective attack strategies.
We propose a lightweight and attack-agnostic defense whose main goal is to perplex the process of generating an adversarial example.
We empirically demonstrate near state-of-the-art robustness of AdvFooler against representative adversarial word-level attacks.
arXiv Detail & Related papers (2023-10-02T06:57:25Z)
- Interpretability is a Kind of Safety: An Interpreter-based Ensemble for Adversary Defense [28.398901783858005]
We propose an interpreter-based ensemble framework called X-Ensemble for robust adversarial defense.
X-Ensemble employs the Random Forests (RF) model to combine sub-detectors into an ensemble detector for adversarial hybrid attacks defense.
arXiv Detail & Related papers (2023-04-14T04:32:06Z)
- Distinguishing Non-natural from Natural Adversarial Samples for More Robust Pre-trained Language Model [79.18455635071817]
We find that the adversarial samples on which PrLMs fail are mostly non-natural and do not appear in reality.
We propose an anomaly detector to evaluate the robustness of PrLMs with more natural adversarial samples.
arXiv Detail & Related papers (2022-03-19T14:06:46Z)
- TREATED: Towards Universal Defense against Textual Adversarial Attacks [28.454310179377302]
We propose TREATED, a universal adversarial detection method that can defend against attacks of various perturbation levels without making any assumptions.
Extensive experiments on three competitive neural networks and two widely used datasets show that our method achieves better detection performance than baselines.
arXiv Detail & Related papers (2021-09-13T03:31:20Z)
- Towards Defending against Adversarial Examples via Attack-Invariant Features [147.85346057241605]
Deep neural networks (DNNs) are vulnerable to adversarial noise.
Adversarial robustness can be improved by exploiting adversarial examples.
Models trained on seen types of adversarial examples generally cannot generalize well to unseen types of adversarial examples.
arXiv Detail & Related papers (2021-06-09T12:49:54Z)
- Are Adversarial Examples Created Equal? A Learnable Weighted Minimax Risk for Robustness under Non-uniform Attacks [70.11599738647963]
Adversarial Training is one of the few defenses that withstand strong attacks.
Traditional defense mechanisms assume a uniform attack over the examples according to the underlying data distribution.
We present a weighted minimax risk optimization that defends against non-uniform attacks.
arXiv Detail & Related papers (2020-10-24T21:20:35Z)
- Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks [65.20660287833537]
In this paper we propose two extensions of the PGD-attack overcoming failures due to suboptimal step size and problems of the objective function.
We then combine our novel attacks with two complementary existing ones to form a parameter-free, computationally affordable and user-independent ensemble of attacks to test adversarial robustness.
arXiv Detail & Related papers (2020-03-03T18:15:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.