Randomized Substitution and Vote for Textual Adversarial Example
Detection
- URL: http://arxiv.org/abs/2109.05698v1
- Date: Mon, 13 Sep 2021 04:17:58 GMT
- Title: Randomized Substitution and Vote for Textual Adversarial Example
Detection
- Authors: Xiaosen Wang, Yifeng Xiong, Kun He
- Abstract summary: A line of work has shown that natural language processing models are vulnerable to adversarial examples.
We propose a novel textual adversarial example detection method, termed Randomized Substitution and Vote (RS&V).
Empirical evaluations on three benchmark datasets demonstrate that RS&V detects textual adversarial examples more successfully than existing detection methods.
- Score: 6.664295299367366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A line of work has shown that natural language processing models
are vulnerable to adversarial examples. Correspondingly, various defense
methods have been proposed to mitigate the threat of textual adversarial
examples, e.g., adversarial training, certified defense, input pre-processing,
and detection. In this work, we treat the optimization process of synonym
substitution based textual adversarial attacks as a specific sequence of word
replacements in which each word mutually influences the others. We observe
that this mutual interaction can be destroyed, and the adversarial
perturbation eliminated, by randomly substituting words with their synonyms.
Based on this observation, we propose a novel textual adversarial example
detection method, termed Randomized Substitution and Vote (RS&V), which votes
on the prediction label by accumulating the logits of k samples generated by
randomly substituting words in the input text with synonyms. RS&V is
applicable to any existing neural network without architectural modification
or extra training, and it is orthogonal to prior work on making the
classification network itself more robust. Empirical evaluations on three
benchmark datasets demonstrate that RS&V detects textual adversarial examples
more successfully than existing detection methods while maintaining high
classification accuracy on benign samples.
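The abstract specifies the voting mechanism but not a full algorithm, so the following is only a minimal sketch of the RS&V idea, not the authors' released implementation. `get_logits` (a classifier returning class logits), `get_synonyms` (a synonym lookup, e.g. WordNet), and the sampling parameters `k` and `sub_rate` are placeholders, and flagging an input when the voted label disagrees with the original prediction is one plausible reading of the detection rule.

```python
import random
import numpy as np

def rsv_detect(text, get_logits, get_synonyms, k=25, sub_rate=0.3):
    """Vote over k randomized-synonym copies and compare with the original label."""
    words = text.split()
    original_logits = np.asarray(get_logits(text), dtype=float)
    accumulated = np.zeros_like(original_logits)
    for _ in range(k):
        # Randomly substitute a fraction of the words with one of their synonyms.
        sample = []
        for w in words:
            synonyms = get_synonyms(w)
            sample.append(random.choice(synonyms)
                          if synonyms and random.random() < sub_rate else w)
        accumulated += np.asarray(get_logits(" ".join(sample)), dtype=float)
    voted_label = int(np.argmax(accumulated))         # label chosen by the accumulated logits
    original_label = int(np.argmax(original_logits))  # label on the unmodified input
    # A mismatch between the two suggests an adversarial perturbation was present.
    return voted_label, voted_label != original_label
```

Summing logits rather than counting discrete votes lets each randomized copy contribute in proportion to its confidence, which matches how the abstract describes the vote.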
Related papers
- Text-CRS: A Generalized Certified Robustness Framework against Textual Adversarial Attacks [39.51297217854375]
We propose Text-CRS, a certified robustness framework for natural language processing (NLP) based on randomized smoothing.
We show that Text-CRS can address all four different word-level adversarial operations and achieve a significant accuracy improvement.
We also provide the first benchmark on the certified accuracy and radius of the four word-level operations, in addition to outperforming the state-of-the-art certification against synonym substitution attacks.
arXiv Detail & Related papers (2023-07-31T13:08:16Z)
- Lost In Translation: Generating Adversarial Examples Robust to Round-Trip Translation [66.33340583035374]
We present a comprehensive study on the robustness of current text adversarial attacks to round-trip translation.
We demonstrate that 6 state-of-the-art text-based adversarial attacks do not maintain their efficacy after round-trip translation.
We introduce an intervention-based solution to this problem, by integrating Machine Translation into the process of adversarial example generation.
arXiv Detail & Related papers (2023-07-24T04:29:43Z)
- Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- TextShield: Beyond Successfully Detecting Adversarial Sentences in Text Classification [6.781100829062443]
Adversarial attacks pose a major challenge for neural network models in NLP, precluding their deployment in safety-critical applications.
Previous detection methods are incapable of giving correct predictions on adversarial sentences.
We propose a saliency-based detector, which can effectively detect whether an input sentence is adversarial or not.
arXiv Detail & Related papers (2023-02-03T22:58:07Z)
- In and Out-of-Domain Text Adversarial Robustness via Label Smoothing [64.66809713499576]
We study the adversarial robustness provided by various label smoothing strategies in foundational models for diverse NLP tasks.
Our experiments show that label smoothing significantly improves adversarial robustness in pre-trained models like BERT, against various popular attacks.
We also analyze the relationship between prediction confidence and robustness, showing that label smoothing reduces over-confident errors on adversarial examples.
arXiv Detail & Related papers (2022-12-20T14:06:50Z)
- Learning-based Hybrid Local Search for the Hard-label Textual Attack [53.92227690452377]
We consider a rarely investigated but more rigorous setting, the hard-label attack, in which the attacker can only access the prediction label.
We propose a novel hard-label attack, called the Learning-based Hybrid Local Search (LHLS) algorithm.
LHLS significantly outperforms existing hard-label attacks in both attack performance and adversarial example quality.
arXiv Detail & Related papers (2022-01-20T14:16:07Z)
- Contrasting Human- and Machine-Generated Word-Level Adversarial Examples for Text Classification [12.750016480098262]
We report on crowdsourcing studies in which we task humans with iteratively modifying words in an input text.
We analyze how human-generated adversarial examples compare to the recently proposed TextFooler, Genetic, BAE and SememePSO attack algorithms.
arXiv Detail & Related papers (2021-09-09T16:16:04Z)
- MASKER: Masked Keyword Regularization for Reliable Text Classification [73.90326322794803]
We propose a fine-tuning method, coined masked keyword regularization (MASKER), that facilitates context-based prediction.
MASKER regularizes the model to reconstruct the keywords from the rest of the words and make low-confidence predictions without enough context.
We demonstrate that MASKER improves OOD detection and cross-domain generalization without degrading classification accuracy.
arXiv Detail & Related papers (2020-12-17T04:54:16Z)
- Adversarial Semantic Collisions [129.55896108684433]
We study semantic collisions: texts that are semantically unrelated but judged as similar by NLP models.
We develop gradient-based approaches for generating semantic collisions.
We show how to generate semantic collisions that evade perplexity-based filtering.
arXiv Detail & Related papers (2020-11-09T20:42:01Z)
- Frequency-Guided Word Substitutions for Detecting Textual Adversarial Examples [16.460051008283887]
We show that adversarial attacks against CNN, LSTM and Transformer-based classification models perform word substitutions.
We propose frequency-guided word substitutions (FGWS) for the detection of adversarial examples.
FGWS achieves strong performance by accurately detecting adversarial examples on the SST-2 and IMDb sentiment datasets.
arXiv Detail & Related papers (2020-04-13T12:11:36Z)
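The FGWS entry above names its technique but its summary does not spell out the mechanics, so the following is a speculative sketch of a frequency-guided substitution check rather than that paper's method: infrequent words are swapped for more frequent synonyms, and an input is flagged when the model's confidence in its original prediction collapses. `predict_probs`, `freq`, `synonyms`, and both thresholds are assumed placeholders, not the paper's actual settings.

```python
def frequency_guided_detect(text, predict_probs, freq, synonyms,
                            min_freq=100, max_conf_drop=0.5):
    """Flag inputs whose prediction confidence drops sharply after substitution."""
    substituted = []
    for w in text.split():
        if freq.get(w.lower(), 0) < min_freq:
            # Replace an infrequent word with its most frequent synonym, if any exists.
            candidates = [s for s in synonyms(w)
                          if freq.get(s, 0) > freq.get(w.lower(), 0)]
            if candidates:
                w = max(candidates, key=lambda s: freq[s])
        substituted.append(w)
    original = predict_probs(text)                   # class probabilities on the raw input
    smoothed = predict_probs(" ".join(substituted))  # probabilities after substitution
    label = max(range(len(original)), key=original.__getitem__)
    # Assumed premise: adversarial substitutions tend to introduce rare words,
    # so undoing them reduces the confidence the attack had induced.
    return (original[label] - smoothed[label]) > max_conf_drop
```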
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.