Text Adversarial Purification as Defense against Adversarial Attacks
- URL: http://arxiv.org/abs/2203.14207v2
- Date: Wed, 3 May 2023 09:09:22 GMT
- Title: Text Adversarial Purification as Defense against Adversarial Attacks
- Authors: Linyang Li, Demin Song, Xipeng Qiu
- Abstract summary: Adversarial purification is a successful defense mechanism against adversarial attacks.
We introduce a novel adversarial purification method that focuses on defending against textual adversarial attacks.
We test our proposed adversarial purification method on several strong adversarial attack methods including Textfooler and BERT-Attack.
- Score: 46.80714732957078
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adversarial purification is a successful defense mechanism against
adversarial attacks without requiring knowledge of the form of the incoming
attack. Generally, adversarial purification aims to remove the adversarial
perturbations so that correct predictions can be made on the recovered
clean samples. Despite the success of adversarial purification in the computer
vision field that incorporates generative models such as energy-based models
and diffusion models, using purification as a defense strategy against textual
adversarial attacks is rarely explored. In this work, we introduce a novel
adversarial purification method that focuses on defending against textual
adversarial attacks. With the help of language models, we can inject noise by
masking input texts and reconstructing the masked texts based on the masked
language models. In this way, we construct an adversarial purification process
for textual models against the most widely used word-substitution adversarial
attacks. We test our proposed adversarial purification method on several strong
adversarial attack methods, including Textfooler and BERT-Attack, and
experimental results indicate that the purification algorithm can successfully
defend against strong word-substitution attacks.
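To make the masking-and-reconstruction step above concrete, here is a minimal sketch using an off-the-shelf HuggingFace masked language model. The checkpoint name, the 30% masking rate, and the single-pass top-1 reconstruction are illustrative assumptions, not the paper's exact configuration.
```python
import random
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def purify(text: str, mask_rate: float = 0.3) -> str:
    """Inject noise by masking random tokens, then let the MLM reconstruct them."""
    enc = tokenizer(text, return_tensors="pt")
    input_ids = enc["input_ids"].clone()
    # Candidate positions exclude the [CLS] and [SEP] special tokens.
    candidates = list(range(1, input_ids.shape[1] - 1))
    n_mask = max(1, int(len(candidates) * mask_rate))
    masked_positions = random.sample(candidates, n_mask)
    input_ids[0, masked_positions] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = mlm(input_ids=input_ids,
                     attention_mask=enc["attention_mask"]).logits
    # Reconstruct each masked position with the MLM's top prediction.
    for pos in masked_positions:
        input_ids[0, pos] = int(logits[0, pos].argmax())
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```
The purified text is then fed to the downstream classifier; one natural extension, again not necessarily the paper's exact setup, is to purify several independently masked copies of the input and average the classifier's predictions over them.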
Related papers
- DiffuseDef: Improved Robustness to Adversarial Attacks [38.34642687239535]
Adversarial attacks pose a critical challenge to systems built using pretrained language models.
We propose DiffuseDef, which incorporates a diffusion layer as a denoiser between the encoder and the classifier.
During inference, the adversarial hidden state is first combined with sampled noise, then denoised iteratively and finally ensembled to produce a robust text representation; a rough sketch of this inference step appears after this list.
arXiv Detail & Related papers (2024-06-28T22:36:17Z) - MaskPure: Improving Defense Against Text Adversaries with Stochastic Purification [7.136205674624813]
In computer vision settings, the noising and de-noising process has proven useful for purifying input images.
Some initial work has explored the use of random noising and de-noising to mitigate adversarial attacks in an NLP setting.
We extend methods of text input purification that are inspired by diffusion processes.
Our novel method, MaskPure, matches or exceeds the robustness of other contemporary defenses.
arXiv Detail & Related papers (2024-06-18T21:27:13Z) - MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models.
Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs.
Empirical evaluations conducted on different datasets validate the efficacy of our approach.
arXiv Detail & Related papers (2024-06-13T15:55:04Z) - Adversarial Text Purification: A Large Language Model Approach for Defense [25.041109219049442]
Adversarial purification is a defense mechanism for safeguarding classifiers against adversarial attacks.
We propose a novel adversarial text purification method that harnesses the generative capabilities of Large Language Models.
Our proposed method demonstrates remarkable performance across various classifiers, improving their accuracy under attack by over 65% on average; a prompt-based sketch of this idea also appears after this list.
arXiv Detail & Related papers (2024-02-05T02:36:41Z) - Language Guided Adversarial Purification [3.9931474959554496]
Adversarial purification using generative models demonstrates strong adversarial defense performance.
We introduce a new framework, Language Guided Adversarial Purification (LGAP), which utilizes pre-trained diffusion models and caption generators.
arXiv Detail & Related papers (2023-09-19T06:17:18Z) - Rethinking Textual Adversarial Defense for Pre-trained Language Models [79.18455635071817]
A literature review shows that pre-trained language models (PrLMs) are vulnerable to adversarial attacks.
We propose a novel metric (Degree of Anomaly) to enable current adversarial attack approaches to generate more natural and imperceptible adversarial examples.
We show that our universal defense framework achieves after-attack accuracy comparable to or even higher than other attack-specific defenses.
arXiv Detail & Related papers (2022-07-21T07:51:45Z) - Zero-Query Transfer Attacks on Context-Aware Object Detectors [95.18656036716972]
Adversarial attacks perturb images such that a deep neural network produces incorrect classification results.
A promising approach to defend against adversarial attacks on natural multi-object scenes is to impose a context-consistency check.
We present the first approach for generating context-consistent adversarial attacks that can evade the context-consistency check.
arXiv Detail & Related papers (2022-03-29T04:33:06Z) - Improving the Adversarial Robustness for Speaker Verification by Self-Supervised Learning [95.60856995067083]
This work is among the first to perform adversarial defense for ASV without knowing the specific attack algorithms.
We propose to perform adversarial defense from two perspectives: 1) adversarial perturbation purification and 2) adversarial perturbation detection.
Experimental results show that our detection module effectively shields the ASV by detecting adversarial samples with an accuracy of around 80%.
arXiv Detail & Related papers (2021-06-01T07:10:54Z) - Towards Variable-Length Textual Adversarial Attacks [68.27995111870712]
It is non-trivial to conduct textual adversarial attacks on natural language processing tasks due to the discreteness of data.
In this paper, we propose variable-length textual adversarial attacks (VL-Attack).
Our method can achieve $33.18$ BLEU score on IWSLT14 German-English translation, achieving an improvement of $1.47$ over the baseline model.
arXiv Detail & Related papers (2021-04-16T14:37:27Z) - Universal Adversarial Attacks with Natural Triggers for Text Classification [30.74579821832117]
We develop adversarial attacks that appear closer to natural English phrases and yet confuse classification systems.
Our attacks effectively reduce model accuracy on classification tasks while being less identifiable than prior models.
arXiv Detail & Related papers (2020-05-01T01:58:24Z)
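As referenced above, a rough sketch of DiffuseDef-style inference: the encoder's (possibly adversarial) hidden state is noised, denoised iteratively, and ensembled before classification. The module interfaces, noise scale, step count, and mean ensembling below are placeholder assumptions, not the paper's actual architecture or hyperparameters.
```python
import torch
import torch.nn as nn

def diffusedef_style_predict(encoder: nn.Module,
                             denoiser: nn.Module,
                             classifier: nn.Module,
                             input_ids: torch.Tensor,
                             n_samples: int = 5,
                             n_steps: int = 5,
                             noise_scale: float = 0.1) -> torch.Tensor:
    """Noise the hidden state, denoise it iteratively, and ensemble the
    denoised copies to obtain a robust representation for classification."""
    with torch.no_grad():
        hidden = encoder(input_ids)                      # [batch, seq, dim]
        denoised = []
        for _ in range(n_samples):
            h = hidden + noise_scale * torch.randn_like(hidden)  # sampled noise
            for _ in range(n_steps):
                h = denoiser(h)                          # one denoising iteration
            denoised.append(h)
        robust_repr = torch.stack(denoised).mean(dim=0)  # ensemble by averaging
        return classifier(robust_repr)
```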
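Likewise, the LLM-based purification from "Adversarial Text Purification: A Large Language Model Approach for Defense" can be sketched as a simple prompt-and-rewrite step; the prompt wording and the generate() callable are generic placeholders, not the paper's actual prompt or API.
```python
from typing import Callable

# Illustrative prompt; the actual paper's prompt and decoding setup may differ.
PURIFY_PROMPT = (
    "The following sentence may contain subtly substituted words intended to "
    "fool a text classifier. Rewrite it so that it reads naturally and keeps "
    "the original meaning, changing as few words as possible.\n\n"
    "Sentence: {text}\n\nRewritten sentence:"
)

def llm_purify(text: str, generate: Callable[[str], str]) -> str:
    """Ask a generative LLM to rewrite a possibly adversarial input before
    it is passed to the downstream classifier."""
    return generate(PURIFY_PROMPT.format(text=text)).strip()
```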