Adversarial Text Purification: A Large Language Model Approach for
Defense
- URL: http://arxiv.org/abs/2402.06655v1
- Date: Mon, 5 Feb 2024 02:36:41 GMT
- Title: Adversarial Text Purification: A Large Language Model Approach for
Defense
- Authors: Raha Moraffah, Shubh Khandelwal, Amrita Bhattacharjee, and Huan Liu
- Abstract summary: Adversarial purification is a defense mechanism for safeguarding classifiers against adversarial attacks.
We propose a novel adversarial text purification method that harnesses the generative capabilities of Large Language Models.
Our proposed method demonstrates remarkable performance over various classifiers, improving their accuracy under attack by over 65% on average.
- Score: 25.041109219049442
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adversarial purification is a defense mechanism for safeguarding classifiers
against adversarial attacks without knowing the type of attacks or training of
the classifier. These techniques characterize and eliminate adversarial
perturbations from the attacked inputs, aiming to restore purified samples that
retain similarity to the initially attacked ones and are correctly classified
by the classifier. Due to the inherent challenges associated with
characterizing noise perturbations for discrete inputs, adversarial text
purification has been relatively unexplored. In this paper, we investigate the
effectiveness of adversarial purification methods in defending text
classifiers. We propose a novel adversarial text purification method that
harnesses the generative capabilities of Large Language Models (LLMs) to purify
adversarial text without the need to explicitly characterize the discrete noise
perturbations. We utilize prompt engineering to guide LLMs in recovering
purified examples from given adversarial examples such that they are
semantically similar and correctly classified. Our proposed method demonstrates
remarkable performance over various classifiers, improving their accuracy under
attack by over 65% on average.
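As a rough illustration of the prompt-engineering idea described in the abstract, the sketch below wires an arbitrary text-in/text-out LLM callable into a purify-then-classify loop. The prompt wording, candidate count, and majority vote are illustrative assumptions, not the authors' exact procedure.

```python
from collections import Counter
from typing import Callable, Tuple

# Illustrative prompt; the paper relies on prompt engineering, but the exact
# wording and any few-shot examples used by the authors are not reproduced here.
PURIFY_PROMPT = (
    "The following sentence may have been adversarially perturbed with small "
    "word substitutions. Rewrite it as a fluent sentence that preserves its "
    "original meaning:\n\n{text}\n\nRewritten sentence:"
)

def purify_and_classify(
    adversarial_text: str,
    llm: Callable[[str], str],         # any text-in/text-out LLM interface (assumption)
    classifier: Callable[[str], int],  # downstream text classifier under attack
    n_candidates: int = 3,
) -> Tuple[str, int]:
    """Generate several purified rewrites, classify each, and return the
    majority-vote label together with one representative rewrite."""
    rewrites = [llm(PURIFY_PROMPT.format(text=adversarial_text)).strip()
                for _ in range(n_candidates)]
    votes = Counter(classifier(r) for r in rewrites)
    label, _ = votes.most_common(1)[0]
    representative = next(r for r in rewrites if classifier(r) == label)
    return representative, label
```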
Related papers
- Classifier Guidance Enhances Diffusion-based Adversarial Purification by Preserving Predictive Information [75.36597470578724]
Adversarial purification is one of the promising approaches to defend neural networks against adversarial attacks.
We propose gUided Purification (COUP) algorithm, which purifies while keeping away from the classifier decision boundary.
Experimental results show that COUP can achieve better adversarial robustness under strong attack methods.
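A minimal sketch of the guidance idea, assuming a diffusion-style purifier: each reverse step adds a term that raises the classifier's confidence, nudging the sample away from the decision boundary. The interfaces and guidance scale are placeholders, not the COUP implementation.

```python
import torch

def guided_purify_step(x_t, t, denoiser, classifier, guidance_scale=1.0):
    """One reverse-diffusion step with classifier guidance: apply the usual
    denoising update, plus a term that increases the classifier's confidence
    (i.e. pushes the sample away from the decision boundary)."""
    x_t = x_t.detach().requires_grad_(True)
    logits = classifier(x_t)
    # Confidence of the currently predicted class; its gradient points away
    # from the decision boundary.
    conf = torch.log_softmax(logits, dim=-1).max(dim=-1).values.sum()
    grad_conf = torch.autograd.grad(conf, x_t)[0]
    with torch.no_grad():
        x_prev = denoiser(x_t, t)                  # standard purification update
        x_prev = x_prev + guidance_scale * grad_conf
    return x_prev
```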
arXiv Detail & Related papers (2024-08-12T02:48:00Z)
- DiffuseDef: Improved Robustness to Adversarial Attacks [38.34642687239535]
Adversarial attacks pose a critical challenge to systems built using pretrained language models.
We propose DiffuseDef, which incorporates a diffusion layer as a denoiser between the encoder and the classifier.
During inference, the adversarial hidden state is first combined with sampled noise, then denoised iteratively and finally ensembled to produce a robust text representation.
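The inference procedure summarized above might be sketched roughly as follows; the denoiser interface, step count, noise scale, and averaging-based ensembling are assumptions rather than the DiffuseDef specifics.

```python
import torch

def diffusedef_inference(hidden, denoiser, n_steps=5, n_samples=8, noise_scale=0.1):
    """Noise the (possibly adversarial) hidden state, denoise it iteratively,
    and average several such purified copies into one robust representation."""
    purified = []
    for _ in range(n_samples):
        h = hidden + noise_scale * torch.randn_like(hidden)  # combine with sampled noise
        for step in reversed(range(n_steps)):                # iterative denoising
            h = denoiser(h, step)
        purified.append(h)
    return torch.stack(purified).mean(dim=0)                 # ensemble by averaging
```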
arXiv Detail & Related papers (2024-06-28T22:36:17Z)
- MaskPure: Improving Defense Against Text Adversaries with Stochastic Purification [7.136205674624813]
In computer vision settings, the noising and de-noising process has proven useful for purifying input images.
Some initial work has explored the use of random noising and de-noising to mitigate adversarial attacks in an NLP setting.
We extend input purification methods for text that are inspired by diffusion processes.
Our novel method, MaskPure, exceeds or matches robustness compared to other contemporary defenses.
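One common way to realize stochastic text purification of this kind is sketched below, under the assumption of masked-language-model infilling plus majority voting over purified copies; it illustrates the noising/de-noising idea rather than the exact MaskPure procedure.

```python
import random
from collections import Counter
from typing import Callable, List

def stochastic_purify(
    tokens: List[str],
    infill: Callable[[List[str]], List[str]],  # masked-LM infilling, e.g. BERT-style (assumption)
    mask_token: str = "[MASK]",
    mask_rate: float = 0.15,
) -> List[str]:
    """Randomly mask a fraction of tokens (noising) and reconstruct them with a
    masked language model (de-noising)."""
    noised = [mask_token if random.random() < mask_rate else tok for tok in tokens]
    return infill(noised)

def vote_classify(tokens, infill, classifier, n_runs=10):
    """Purify several stochastic copies and take a majority vote over labels."""
    votes = Counter(classifier(stochastic_purify(tokens, infill)) for _ in range(n_runs))
    return votes.most_common(1)[0][0]
```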
arXiv Detail & Related papers (2024-06-18T21:27:13Z)
- Scalable Ensemble-based Detection Method against Adversarial Attacks for Speaker Verification [73.30974350776636]
This paper comprehensively compares mainstream purification techniques in a unified framework.
We propose an easy-to-follow ensemble approach that integrates advanced purification modules for detection.
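A hedged sketch of how an ensemble of purification modules might serve detection: if purification shifts the verification score noticeably for most modules, the input is flagged. The decision rule and threshold are assumptions, not the paper's exact method.

```python
from typing import Callable, Sequence
import numpy as np

def ensemble_detect(
    utterance: np.ndarray,
    enroll: np.ndarray,
    purifiers: Sequence[Callable[[np.ndarray], np.ndarray]],  # purification modules (assumption)
    asv_score: Callable[[np.ndarray, np.ndarray], float],     # speaker-verification similarity
    threshold: float = 0.2,
) -> bool:
    """Flag an input as adversarial if purification shifts the verification
    score by more than a threshold for a majority of purification modules."""
    base = asv_score(utterance, enroll)
    shifts = [abs(base - asv_score(p(utterance), enroll)) for p in purifiers]
    return sum(s > threshold for s in shifts) > len(purifiers) // 2
```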
arXiv Detail & Related papers (2023-12-14T03:04:05Z)
- Diffusion Models for Adversarial Purification [69.1882221038846]
Adversarial purification refers to a class of defense methods that remove adversarial perturbations using a generative model.
We propose DiffPure that uses diffusion models for adversarial purification.
Our method achieves state-of-the-art results, outperforming current adversarial training and adversarial purification methods.
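The core recipe, sketched at a high level: diffuse the input forward for a short time so that small adversarial perturbations are drowned out, then run the reverse process to recover a clean sample. The callables below stand in for a pretrained diffusion model and are placeholders, not the DiffPure code.

```python
import torch

def diffpure(x, forward_diffuse, reverse_denoise, t_star=0.1):
    """Purify by diffusing the input up to time t_star (which washes out small
    adversarial perturbations) and then running the reverse process back to a
    clean sample. Both callables are placeholders for a pretrained diffusion model."""
    x_t = forward_diffuse(x, t_star, torch.randn_like(x))  # forward noising
    return reverse_denoise(x_t, t_star)                    # reverse denoising
```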
arXiv Detail & Related papers (2022-05-16T06:03:00Z)
- Text Adversarial Purification as Defense against Adversarial Attacks [46.80714732957078]
Adversarial purification is a successful defense mechanism against adversarial attacks.
We introduce a novel adversarial purification method that focuses on defending against textual adversarial attacks.
We test our proposed adversarial purification method on several strong adversarial attack methods including Textfooler and BERT-Attack.
arXiv Detail & Related papers (2022-03-27T04:41:55Z)
- Improving the Adversarial Robustness for Speaker Verification by Self-Supervised Learning [95.60856995067083]
This work is among the first to perform adversarial defense for ASV without knowing the specific attack algorithms.
We propose to perform adversarial defense from two perspectives: 1) adversarial perturbation purification and 2) adversarial perturbation detection.
Experimental results show that our detection module effectively shields the ASV by detecting adversarial samples with an accuracy of around 80%.
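A rough sketch of the two perspectives named above, assuming a self-supervised reconstruction model serves as the purifier and detection thresholds the resulting score change; the interfaces and threshold are illustrative, not the paper's exact design.

```python
import numpy as np
from typing import Callable, Tuple

def defend_asv(
    utterance: np.ndarray,
    enroll: np.ndarray,
    ssl_reconstruct: Callable[[np.ndarray], np.ndarray],   # self-supervised re-synthesis (assumption)
    asv_score: Callable[[np.ndarray, np.ndarray], float],  # speaker-verification similarity
    detect_threshold: float = 0.2,
) -> Tuple[float, bool]:
    """Purify the input by passing it through a self-supervised reconstruction
    model, and flag it as adversarial if purification changes the verification
    score noticeably."""
    purified = ssl_reconstruct(utterance)                  # perturbation purification
    raw_score = asv_score(utterance, enroll)
    pure_score = asv_score(purified, enroll)
    is_adversarial = abs(raw_score - pure_score) > detect_threshold  # perturbation detection
    return pure_score, is_adversarial
```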
arXiv Detail & Related papers (2021-06-01T07:10:54Z)
- Towards Robust Speech-to-Text Adversarial Attack [78.5097679815944]
This paper introduces a novel adversarial algorithm for attacking the state-of-the-art speech-to-text systems, namely DeepSpeech, Kaldi, and Lingvo.
Our approach is based on extending the conventional distortion condition of the adversarial optimization formulation.
Minimizing over this metric, which measures the discrepancies between original and adversarial samples' distributions, contributes to crafting signals very close to the subspace of legitimate speech recordings.
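Generically, such an attack can be written as minimizing an ASR loss toward a target transcription plus a distortion term that compares the adversarial and original signals. The sketch below uses placeholder callables for both; the specific discrepancy measure and weights are assumptions rather than the paper's formulation.

```python
import torch

def craft_adversarial(x, target, asr_loss, discrepancy, steps=100, lr=1e-3, c=1.0):
    """Gradient-based attack sketch: minimize the ASR loss toward a target
    transcription plus a discrepancy between the adversarial and original
    signals (a stand-in for the extended distortion condition)."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_adv = x + delta
        loss = asr_loss(x_adv, target) + c * discrepancy(x_adv, x)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (x + delta).detach()
```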
arXiv Detail & Related papers (2021-03-15T01:51:41Z)
- Universal Adversarial Attacks with Natural Triggers for Text Classification [30.74579821832117]
We develop adversarial attacks that appear closer to natural English phrases and yet confuse classification systems.
Our attacks effectively reduce model accuracy on classification tasks while being less identifiable than prior models.
arXiv Detail & Related papers (2020-05-01T01:58:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.