Text Adversarial Purification as Defense against Adversarial Attacks
- URL: http://arxiv.org/abs/2203.14207v2
- Date: Wed, 3 May 2023 09:09:22 GMT
- Title: Text Adversarial Purification as Defense against Adversarial Attacks
- Authors: Linyang Li, Demin Song, Xipeng Qiu
- Abstract summary: Adversarial purification is a successful defense mechanism against adversarial attacks.
We introduce a novel adversarial purification method that focuses on defending against textual adversarial attacks.
We test our proposed adversarial purification method on several strong adversarial attack methods including Textfooler and BERT-Attack.
- Score: 46.80714732957078
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adversarial purification is a successful defense mechanism against
adversarial attacks without requiring knowledge of the form of the incoming
attack. Generally, adversarial purification aims to remove the adversarial
perturbations so that correct predictions can be made on the recovered
clean samples. Despite the success of adversarial purification in the computer
vision field that incorporates generative models such as energy-based models
and diffusion models, using purification as a defense strategy against textual
adversarial attacks is rarely explored. In this work, we introduce a novel
adversarial purification method that focuses on defending against textual
adversarial attacks. With the help of language models, we can inject noise by
masking input texts and reconstructing the masked texts based on the masked
language models. In this way, we construct an adversarial purification process
for textual models against the most widely used word-substitution adversarial
attacks. We test our proposed adversarial purification method on several strong
adversarial attack methods, including Textfooler and BERT-Attack, and
experimental results indicate that the purification algorithm can successfully
defend against strong word-substitution attacks.
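To make the masking-and-reconstruction step above concrete, here is a minimal sketch using an off-the-shelf HuggingFace masked language model. The checkpoint name, the 30% masking rate, and the single-pass top-1 reconstruction are illustrative assumptions, not the paper's exact configuration.
```python
import random
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def purify(text: str, mask_rate: float = 0.3) -> str:
    """Inject noise by masking random tokens, then let the MLM reconstruct them."""
    enc = tokenizer(text, return_tensors="pt")
    input_ids = enc["input_ids"].clone()
    # Candidate positions exclude the [CLS] and [SEP] special tokens.
    candidates = list(range(1, input_ids.shape[1] - 1))
    n_mask = max(1, int(len(candidates) * mask_rate))
    masked_positions = random.sample(candidates, n_mask)
    input_ids[0, masked_positions] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = mlm(input_ids=input_ids,
                     attention_mask=enc["attention_mask"]).logits
    # Reconstruct each masked position with the MLM's top prediction.
    for pos in masked_positions:
        input_ids[0, pos] = int(logits[0, pos].argmax())
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```
The purified text is then fed to the downstream classifier; one natural extension, again not necessarily the paper's exact setup, is to purify several independently masked copies of the input and average the classifier's predictions over them.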
Related papers
- DiffuseDef: Improved Robustness to Adversarial Attacks [38.34642687239535]
Adversarial attacks pose a critical challenge to systems built using pretrained language models.
We propose DiffuseDef, which incorporates a diffusion layer as a denoiser between the encoder and the classifier.
During inference, the adversarial hidden state is first combined with sampled noise, then denoised iteratively and finally ensembled to produce a robust text representation; a rough sketch of this inference step appears after this list.
arXiv Detail & Related papers (2024-06-28T22:36:17Z) - MaskPure: Improving Defense Against Text Adversaries with Stochastic Purification [7.136205674624813]
In computer vision settings, the noising and de-noising process has proven useful for purifying input images.
Some initial work has explored the use of random noising and de-noising to mitigate adversarial attacks in an NLP setting.
We extend methods of text input purification that are inspired by diffusion processes.
Our novel method, MaskPure, matches or exceeds the robustness of other contemporary defenses.
arXiv Detail & Related papers (2024-06-18T21:27:13Z) - MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models.
Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs.
Empirical evaluations conducted on different datasets validate the efficacy of our approach.
arXiv Detail & Related papers (2024-06-13T15:55:04Z) - Adversarial Text Purification: A Large Language Model Approach for Defense [25.041109219049442]
Adversarial purification is a defense mechanism for safeguarding classifiers against adversarial attacks.
We propose a novel adversarial text purification method that harnesses the generative capabilities of Large Language Models.
Our proposed method demonstrates remarkable performance across various classifiers, improving their accuracy under attack by over 65% on average; a prompt-based sketch of this idea also appears after this list.
arXiv Detail & Related papers (2024-02-05T02:36:41Z) - Language Guided Adversarial Purification [3.9931474959554496]
Adversarial purification using generative models demonstrates strong adversarial defense performance.
We introduce a new framework, Language Guided Adversarial Purification (LGAP), which utilizes pre-trained diffusion models and caption generators.
arXiv Detail & Related papers (2023-09-19T06:17:18Z) - Rethinking Textual Adversarial Defense for Pre-trained Language Models [79.18455635071817]
A literature review shows that pre-trained language models (PrLMs) are vulnerable to adversarial attacks.
We propose a novel metric (Degree of Anomaly) to enable current adversarial attack approaches to generate more natural and imperceptible adversarial examples.
We show that our universal defense framework achieves after-attack accuracy comparable to or even higher than other attack-specific defenses.
arXiv Detail & Related papers (2022-07-21T07:51:45Z) - Zero-Query Transfer Attacks on Context-Aware Object Detectors [95.18656036716972]
Adversarial attacks perturb images such that a deep neural network produces incorrect classification results.
A promising approach to defend against adversarial attacks on natural multi-object scenes is to impose a context-consistency check.
We present the first approach for generating context-consistent adversarial attacks that can evade the context-consistency check.
arXiv Detail & Related papers (2022-03-29T04:33:06Z) - Improving the Adversarial Robustness for Speaker Verification by Self-Supervised Learning [95.60856995067083]
This work is among the first to perform adversarial defense for ASV without knowing the specific attack algorithms.
We propose to perform adversarial defense from two perspectives: 1) adversarial perturbation purification and 2) adversarial perturbation detection.
Experimental results show that our detection module effectively shields the ASV by detecting adversarial samples with an accuracy of around 80%.
arXiv Detail & Related papers (2021-06-01T07:10:54Z) - Towards Variable-Length Textual Adversarial Attacks [68.27995111870712]
It is non-trivial to conduct textual adversarial attacks on natural language processing tasks due to the discreteness of data.
In this paper, we propose variable-length textual adversarial attacks (VL-Attack).
Our method can achieve $33.18$ BLEU score on IWSLT14 German-English translation, achieving an improvement of $1.47$ over the baseline model.
arXiv Detail & Related papers (2021-04-16T14:37:27Z) - Universal Adversarial Attacks with Natural Triggers for Text Classification [30.74579821832117]
We develop adversarial attacks that appear closer to natural English phrases and yet confuse classification systems.
Our attacks effectively reduce model accuracy on classification tasks while being less identifiable than prior models.
arXiv Detail & Related papers (2020-05-01T01:58:24Z)
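As referenced above, a rough sketch of DiffuseDef-style inference: the encoder's (possibly adversarial) hidden state is noised, denoised iteratively, and ensembled before classification. The module interfaces, noise scale, step count, and mean ensembling below are placeholder assumptions, not the paper's actual architecture or hyperparameters.
```python
import torch
import torch.nn as nn

def diffusedef_style_predict(encoder: nn.Module,
                             denoiser: nn.Module,
                             classifier: nn.Module,
                             input_ids: torch.Tensor,
                             n_samples: int = 5,
                             n_steps: int = 5,
                             noise_scale: float = 0.1) -> torch.Tensor:
    """Noise the hidden state, denoise it iteratively, and ensemble the
    denoised copies to obtain a robust representation for classification."""
    with torch.no_grad():
        hidden = encoder(input_ids)                      # [batch, seq, dim]
        denoised = []
        for _ in range(n_samples):
            h = hidden + noise_scale * torch.randn_like(hidden)  # sampled noise
            for _ in range(n_steps):
                h = denoiser(h)                          # one denoising iteration
            denoised.append(h)
        robust_repr = torch.stack(denoised).mean(dim=0)  # ensemble by averaging
        return classifier(robust_repr)
```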
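Likewise, the LLM-based purification from "Adversarial Text Purification: A Large Language Model Approach for Defense" can be sketched as a simple prompt-and-rewrite step; the prompt wording and the generate() callable are generic placeholders, not the paper's actual prompt or API.
```python
from typing import Callable

# Illustrative prompt; the actual paper's prompt and decoding setup may differ.
PURIFY_PROMPT = (
    "The following sentence may contain subtly substituted words intended to "
    "fool a text classifier. Rewrite it so that it reads naturally and keeps "
    "the original meaning, changing as few words as possible.\n\n"
    "Sentence: {text}\n\nRewritten sentence:"
)

def llm_purify(text: str, generate: Callable[[str], str]) -> str:
    """Ask a generative LLM to rewrite a possibly adversarial input before
    it is passed to the downstream classifier."""
    return generate(PURIFY_PROMPT.format(text=text)).strip()
```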