Randomized Substitution and Vote for Textual Adversarial Example Detection
- URL: http://arxiv.org/abs/2109.05698v1
- Date: Mon, 13 Sep 2021 04:17:58 GMT
- Title: Randomized Substitution and Vote for Textual Adversarial Example Detection
- Authors: Xiaosen Wang, Yifeng Xiong, Kun He
- Abstract summary: A line of work has shown that natural language processing models are vulnerable to adversarial examples.
We propose a novel textual adversarial example detection method, termed Randomized Substitution and Vote (RS&V).
Empirical evaluations on three benchmark datasets demonstrate that RS&V detects textual adversarial examples more successfully than existing detection methods.
- Score: 6.664295299367366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A line of work has shown that natural language processing models are
vulnerable to adversarial examples. Correspondingly, various defense methods have
been proposed to mitigate the threat of textual adversarial examples, such as
adversarial training, certified defense, input pre-processing, and detection. In
this work, we treat the optimization process of synonym substitution based textual
adversarial attacks as a specific sequence of word replacements, in which each
word mutually influences the others. We observe that we can destroy such mutual
interaction and eliminate the adversarial perturbation by randomly substituting a
word with its synonyms. Based on this observation, we propose a novel textual
adversarial example detection method, termed Randomized Substitution and Vote
(RS&V), which votes on the prediction label by accumulating the logits of k
samples generated by randomly substituting the words in the input text with
synonyms. The proposed RS&V is generally applicable to any existing neural
network without modification to the architecture or extra training, and it is
orthogonal to prior work on making the classification network itself more robust.
Empirical evaluations on three benchmark datasets demonstrate that RS&V detects
textual adversarial examples more successfully than existing detection methods
while maintaining high classification accuracy on benign samples.
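To make the voting procedure concrete, here is a minimal sketch of how such a randomized-substitution-and-vote detector could be implemented. It assumes a logit-returning classifier and a generic synonym lookup (e.g., a WordNet-style dictionary); the substitution rate, the number of samples k, and the detection criterion shown (flagging an input when the voted label disagrees with the model's original prediction) are illustrative assumptions rather than the paper's exact configuration.

import random
import numpy as np

def rs_and_v_detect(tokens, predict_logits, synonyms, k=25, sub_rate=0.3):
    # tokens: input text as a list of words.
    # predict_logits: callable mapping a token list to a numpy vector of class logits.
    # synonyms: dict mapping a word to a list of its synonyms (assumed lookup).
    original_label = int(np.argmax(predict_logits(tokens)))
    logit_sum = np.zeros_like(predict_logits(tokens), dtype=float)
    for _ in range(k):
        copy = list(tokens)
        candidates = [i for i, w in enumerate(copy) if synonyms.get(w)]
        n_sub = max(1, int(sub_rate * len(candidates))) if candidates else 0
        for i in random.sample(candidates, n_sub):
            copy[i] = random.choice(synonyms[copy[i]])  # random synonym substitution
        logit_sum += predict_logits(copy)               # accumulate logits over k samples
    voted_label = int(np.argmax(logit_sum))
    # Flag the input as adversarial if the voted label disagrees with the original prediction.
    return voted_label != original_label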
Related papers
- Text-CRS: A Generalized Certified Robustness Framework against Textual Adversarial Attacks [39.51297217854375]
We propose Text-CRS, a certified robustness framework for natural language processing (NLP) based on randomized smoothing.
We show that Text-CRS can address all four different word-level adversarial operations and achieve a significant accuracy improvement.
We also provide the first benchmark on certified accuracy and radius for four word-level operations, in addition to outperforming state-of-the-art certification against synonym substitution attacks.
arXiv Detail & Related papers (2023-07-31T13:08:16Z)
- Lost In Translation: Generating Adversarial Examples Robust to Round-Trip Translation [66.33340583035374]
We present a comprehensive study on the robustness of current text adversarial attacks to round-trip translation.
We demonstrate that 6 state-of-the-art text-based adversarial attacks do not maintain their efficacy after round-trip translation.
We introduce an intervention-based solution to this problem, by integrating Machine Translation into the process of adversarial example generation.
arXiv Detail & Related papers (2023-07-24T04:29:43Z)
- Verifying the Robustness of Automatic Credibility Assessment [79.08422736721764]
Text classification methods have been widely investigated as a way to detect content of low credibility.
In some cases, insignificant changes in the input text can mislead the models.
We introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- TextShield: Beyond Successfully Detecting Adversarial Sentences in Text Classification [6.781100829062443]
Adversarial attacks pose a major challenge for neural network models in NLP, precluding their deployment in safety-critical applications.
Previous detection methods are incapable of giving correct predictions on adversarial sentences.
We propose a saliency-based detector, which can effectively detect whether an input sentence is adversarial or not.
arXiv Detail & Related papers (2023-02-03T22:58:07Z)
- In and Out-of-Domain Text Adversarial Robustness via Label Smoothing [64.66809713499576]
We study the adversarial robustness provided by various label smoothing strategies in foundational models for diverse NLP tasks.
Our experiments show that label smoothing significantly improves adversarial robustness in pre-trained models like BERT against various popular attacks.
We also analyze the relationship between prediction confidence and robustness, showing that label smoothing reduces over-confident errors on adversarial examples.
arXiv Detail & Related papers (2022-12-20T14:06:50Z)
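For reference, standard label smoothing replaces one-hot training targets with a mixture of the true label and a uniform distribution, which discourages over-confident predictions. The sketch below shows this generic formulation with numpy; the smoothing value epsilon=0.1 is an arbitrary illustration, and the paper's specific smoothing strategies may differ.

import numpy as np

def smoothed_targets(labels, num_classes, epsilon=0.1):
    # Spread epsilon uniformly over all classes and keep 1 - epsilon on the true class.
    targets = np.full((len(labels), num_classes), epsilon / num_classes)
    targets[np.arange(len(labels)), labels] += 1.0 - epsilon
    return targets

def smoothed_cross_entropy(logits, labels, epsilon=0.1):
    # Cross-entropy against smoothed targets instead of one-hot targets.
    shifted = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    targets = smoothed_targets(labels, logits.shape[1], epsilon)
    return -np.mean(np.sum(targets * log_probs, axis=1))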
- Learning-based Hybrid Local Search for the Hard-label Textual Attack [53.92227690452377]
We consider a rarely investigated but more rigorous setting, namely the hard-label attack, in which the attacker can only access the prediction label.
Based on this observation, we propose a novel hard-label attack, called Learning-based Hybrid Local Search (LHLS) algorithm.
Our LHLS significantly outperforms existing hard-label attacks regarding the attack performance as well as adversary quality.
arXiv Detail & Related papers (2022-01-20T14:16:07Z)
- Contrasting Human- and Machine-Generated Word-Level Adversarial Examples for Text Classification [12.750016480098262]
We report on crowdsourcing studies in which we task humans with iteratively modifying words in an input text.
We analyze how human-generated adversarial examples compare to the recently proposed TextFooler, Genetic, BAE and SememePSO attack algorithms.
arXiv Detail & Related papers (2021-09-09T16:16:04Z)
- MASKER: Masked Keyword Regularization for Reliable Text Classification [73.90326322794803]
We propose a fine-tuning method, coined masked keyword regularization (MASKER), that facilitates context-based prediction.
MASKER regularizes the model to reconstruct the keywords from the rest of the words and to make low-confidence predictions when there is not enough context.
We demonstrate that MASKER improves OOD detection and cross-domain generalization without degrading classification accuracy.
arXiv Detail & Related papers (2020-12-17T04:54:16Z)
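Based only on the summary above, MASKER's fine-tuning objective can be pictured as a combination of three terms: the usual classification loss, a keyword-reconstruction loss on inputs whose keywords are masked out, and a regularizer that pushes predictions toward the uniform distribution when only the keywords (without context) are given. The PyTorch-style sketch below is a rough illustration of that idea; the model.classify and model.reconstruct methods, the masking scheme, and the loss weighting are hypothetical placeholders, not the paper's actual implementation.

import torch
import torch.nn.functional as F

def masker_losses(model, input_ids, keyword_mask, labels, mask_token_id, num_classes):
    # input_ids: (batch, seq_len) token ids; keyword_mask: boolean tensor marking keyword positions.
    # (1) Standard classification loss on the original input.
    cls_loss = F.cross_entropy(model.classify(input_ids), labels)
    # (2) Keyword reconstruction: mask the keywords and predict them from the remaining context.
    masked = input_ids.masked_fill(keyword_mask, mask_token_id)
    recon_logits = model.reconstruct(masked)               # (batch, seq_len, vocab) token logits
    recon_loss = F.cross_entropy(recon_logits[keyword_mask], input_ids[keyword_mask])
    # (3) Low-confidence regularization: with the context removed (keywords only),
    #     push the class distribution toward uniform.
    keywords_only = input_ids.masked_fill(~keyword_mask, mask_token_id)
    kw_log_probs = F.log_softmax(model.classify(keywords_only), dim=-1)
    uniform = torch.full_like(kw_log_probs, 1.0 / num_classes)
    conf_loss = F.kl_div(kw_log_probs, uniform, reduction="batchmean")
    return cls_loss + recon_loss + conf_loss               # loss weights omitted for brevity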
- Adversarial Semantic Collisions [129.55896108684433]
We study semantic collisions: texts that are semantically unrelated but judged as similar by NLP models.
We develop gradient-based approaches for generating semantic collisions.
We show how to generate semantic collisions that evade perplexity-based filtering.
arXiv Detail & Related papers (2020-11-09T20:42:01Z)
- Frequency-Guided Word Substitutions for Detecting Textual Adversarial Examples [16.460051008283887]
We show that adversarial attacks against CNN, LSTM and Transformer-based classification models perform word substitutions.
We propose frequency-guided word substitutions (FGWS) for the detection of adversarial examples.
FGWS achieves strong performance by accurately detecting adversarial examples on the SST-2 and IMDb sentiment datasets.
arXiv Detail & Related papers (2020-04-13T12:11:36Z)
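The frequency-guided idea lends itself to a compact sketch: infrequent words are replaced with more frequent substitutes, and a large drop in the model's confidence on its original prediction flags the input as adversarial. The code below assumes a probability-returning classifier, a generic synonym lookup, and precomputed corpus word frequencies; the exact substitution rule and thresholds used in the paper may differ.

import numpy as np

def fgws_detect(tokens, predict_probs, synonyms, freq, freq_threshold, gamma):
    # tokens: list of words; predict_probs: callable returning class probabilities;
    # synonyms: word -> candidate substitutes; freq: word -> corpus frequency (assumed inputs).
    probs = predict_probs(tokens)
    label = int(np.argmax(probs))
    transformed = list(tokens)
    for i, w in enumerate(tokens):
        if freq.get(w, 0) < freq_threshold:
            # Swap an infrequent word for its most frequent, higher-frequency substitute.
            better = [s for s in synonyms.get(w, []) if freq.get(s, 0) > freq.get(w, 0)]
            if better:
                transformed[i] = max(better, key=lambda s: freq.get(s, 0))
    confidence_drop = probs[label] - predict_probs(transformed)[label]
    # Flag as adversarial if confidence in the original prediction drops by more than gamma.
    return confidence_drop > gamma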
This list is automatically generated from the titles and abstracts of the papers on this site.