Repairing Adversarial Texts through Perturbation
- URL: http://arxiv.org/abs/2201.02504v1
- Date: Wed, 29 Dec 2021 03:57:02 GMT
- Title: Repairing Adversarial Texts through Perturbation
- Authors: Guoliang Dong, Jingyi Wang, Jun Sun, Sudipta Chattopadhyay, Xinyu
Wang, Ting Dai, Jie Shi and Jin Song Dong
- Abstract summary: It is known that neural networks are subject to attacks through adversarial perturbations.
adversarial perturbation is still possible after applying mitigation methods such as adversarial training.
We propose an approach to automatically repair adversarial texts at runtime.
- Score: 11.65808514109149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is known that neural networks are subject to attacks through adversarial
perturbations, i.e., inputs which are maliciously crafted through perturbations
to induce wrong predictions. Furthermore, such attacks are impossible to
eliminate, i.e., the adversarial perturbation is still possible after applying
mitigation methods such as adversarial training. Multiple approaches have been
developed to detect and reject such adversarial inputs, mostly in the image
domain. Rejecting suspicious inputs however may not be always feasible or
ideal. First, normal inputs may be rejected due to false alarms generated by
the detection algorithm. Second, denial-of-service attacks may be conducted by
feeding such systems with adversarial inputs. To address the gap, in this work,
we propose an approach to automatically repair adversarial texts at runtime.
Given a text which is suspected to be adversarial, we novelly apply multiple
adversarial perturbation methods in a positive way to identify a repair, i.e.,
a slightly mutated but semantically equivalent text that the neural network
correctly classifies. Our approach has been experimented with multiple models
trained for natural language processing tasks and the results show that our
approach is effective, i.e., it successfully repairs about 80\% of the
adversarial texts. Furthermore, depending on the applied perturbation method,
an adversarial text could be repaired in as short as one second on average.
Related papers
- Detecting Adversarial Attacks in Semantic Segmentation via Uncertainty Estimation: A Deep Analysis [12.133306321357999]
We propose an uncertainty-based method for detecting adversarial attacks on neural networks for semantic segmentation.
We conduct a detailed analysis of uncertainty-based detection of adversarial attacks and various state-of-the-art neural networks.
Our numerical experiments show the effectiveness of the proposed uncertainty-based detection method.
arXiv Detail & Related papers (2024-08-19T14:13:30Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures
and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z) - How adversarial attacks can disrupt seemingly stable accurate classifiers [76.95145661711514]
Adversarial attacks dramatically change the output of an otherwise accurate learning system using a seemingly inconsequential modification to a piece of input data.
Here, we show that this may be seen as a fundamental feature of classifiers working with high dimensional input data.
We introduce a simple generic and generalisable framework for which key behaviours observed in practical systems arise with high probability.
arXiv Detail & Related papers (2023-09-07T12:02:00Z) - Adversarial Training Should Be Cast as a Non-Zero-Sum Game [121.95628660889628]
Two-player zero-sum paradigm of adversarial training has not engendered sufficient levels of robustness.
We show that the commonly used surrogate-based relaxation used in adversarial training algorithms voids all guarantees on robustness.
A novel non-zero-sum bilevel formulation of adversarial training yields a framework that matches and in some cases outperforms state-of-the-art attacks.
arXiv Detail & Related papers (2023-06-19T16:00:48Z) - Uncertainty-based Detection of Adversarial Attacks in Semantic
Segmentation [16.109860499330562]
We introduce an uncertainty-based approach for the detection of adversarial attacks in semantic segmentation.
We demonstrate the ability of our approach to detect perturbed images across multiple types of adversarial attacks.
arXiv Detail & Related papers (2023-05-22T08:36:35Z) - Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z) - Reverse engineering adversarial attacks with fingerprints from
adversarial examples [0.0]
Adversarial examples are typically generated by an attack algorithm that optimize a perturbation added to a benign input.
We take a "fight fire with fire" approach, training deep neural networks to classify these perturbations.
We achieve an accuracy of 99.4% with a ResNet50 model trained on the perturbations.
arXiv Detail & Related papers (2023-01-31T18:59:37Z) - Randomized Substitution and Vote for Textual Adversarial Example
Detection [6.664295299367366]
A line of work has shown that natural text processing models are vulnerable to adversarial examples.
We propose a novel textual adversarial example detection method, termed Randomized Substitution and Vote (RS&V)
Empirical evaluations on three benchmark datasets demonstrate that RS&V could detect the textual adversarial examples more successfully than the existing detection methods.
arXiv Detail & Related papers (2021-09-13T04:17:58Z) - Extracting Grammars from a Neural Network Parser for Anomaly Detection
in Unknown Formats [79.6676793507792]
Reinforcement learning has recently shown promise as a technique for training an artificial neural network to parse sentences in some unknown format.
This paper presents procedures for extracting production rules from the neural network, and for using these rules to determine whether a given sentence is nominal or anomalous.
arXiv Detail & Related papers (2021-07-30T23:10:24Z) - Learning to Separate Clusters of Adversarial Representations for Robust
Adversarial Detection [50.03939695025513]
We propose a new probabilistic adversarial detector motivated by a recently introduced non-robust feature.
In this paper, we consider the non-robust features as a common property of adversarial examples, and we deduce it is possible to find a cluster in representation space corresponding to the property.
This idea leads us to probability estimate distribution of adversarial representations in a separate cluster, and leverage the distribution for a likelihood based adversarial detector.
arXiv Detail & Related papers (2020-12-07T07:21:18Z) - Adversarial Feature Desensitization [12.401175943131268]
We propose a novel approach to adversarial robustness, which builds upon the insights from the domain adaptation field.
Our method, called Adversarial Feature Desensitization (AFD), aims at learning features that are invariant towards adversarial perturbations of the inputs.
arXiv Detail & Related papers (2020-06-08T14:20:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.