Frequency-Guided Word Substitutions for Detecting Textual Adversarial
Examples
- URL: http://arxiv.org/abs/2004.05887v2
- Date: Tue, 26 Jan 2021 09:55:19 GMT
- Title: Frequency-Guided Word Substitutions for Detecting Textual Adversarial
Examples
- Authors: Maximilian Mozes, Pontus Stenetorp, Bennett Kleinberg, Lewis D.
Griffin
- Abstract summary: We show that adversarial attacks against CNN, LSTM and Transformer-based classification models perform word substitutions.
We propose frequency-guided word substitutions (FGWS) for the detection of adversarial examples.
FGWS achieves strong performance by accurately detecting adversarial examples on the SST-2 and IMDb sentiment datasets.
- Score: 16.460051008283887
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent efforts have shown that neural text processing models are vulnerable
to adversarial examples, but the nature of these examples is poorly understood.
In this work, we show that adversarial attacks against CNN, LSTM and
Transformer-based classification models perform word substitutions that are
identifiable through frequency differences between replaced words and their
corresponding substitutions. Based on these findings, we propose
frequency-guided word substitutions (FGWS), a simple algorithm exploiting the
frequency properties of adversarial word substitutions for the detection of
adversarial examples. FGWS achieves strong performance by accurately detecting
adversarial examples on the SST-2 and IMDb sentiment datasets, with F1
detection scores of up to 91.4% against RoBERTa-based classification models. We
compare our approach against a recently proposed perturbation discrimination
framework and show that we outperform it by up to 13.0% F1.
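As a reading aid, the following is a minimal sketch of the frequency-guided substitution idea described in the abstract: replace low-frequency words with higher-frequency candidates and flag the input if the classifier's confidence in its original prediction drops sharply. The thresholds, the frequency table, the synonym lexicon, and the `predict_proba` interface are illustrative assumptions, not the paper's reference implementation.

```python
"""Minimal sketch of frequency-guided word substitution (FGWS) detection.

The frequency table, synonym lexicon, classifier interface and thresholds
below are illustrative assumptions, not the paper's reference implementation.
"""
from typing import Callable, Dict, List, Sequence


def fgws_detect(
    tokens: Sequence[str],
    freq: Dict[str, int],                                # corpus word frequencies
    synonyms: Dict[str, List[str]],                      # candidate replacements per word
    predict_proba: Callable[[List[str]], List[float]],   # classifier: tokens -> class probabilities
    freq_threshold: int = 100,                           # words rarer than this get replaced
    gamma: float = 0.5,                                  # prediction-shift threshold
) -> bool:
    """Return True if the input is flagged as adversarial."""
    original_probs = predict_proba(list(tokens))
    label = max(range(len(original_probs)), key=original_probs.__getitem__)

    # Replace each low-frequency word with its most frequent candidate replacement.
    transformed = []
    for tok in tokens:
        candidates = synonyms.get(tok, [])
        if freq.get(tok, 0) < freq_threshold and candidates:
            best = max(candidates, key=lambda w: freq.get(w, 0))
            tok = best if freq.get(best, 0) > freq.get(tok, 0) else tok
        transformed.append(tok)

    # Adversarial word substitutions tend to use rarer words, so undoing them
    # should shift probability mass away from the manipulated prediction.
    new_probs = predict_proba(transformed)
    return (original_probs[label] - new_probs[label]) > gamma
```

In the paper, the substitution candidates and thresholds are derived from the training data; in this sketch any word-frequency table and synonym lexicon can stand in for them.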
Related papers
- Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z) - Less is More: Understanding Word-level Textual Adversarial Attack via n-gram Frequency Descend [34.58191062593758]
This work aims to interpret word-level attacks by examining their $n$-gram frequency patterns.
Our comprehensive experiments reveal that in approximately 90% of cases, word-level attacks lead to the generation of examples where the frequency of $n$-grams decreases.
This finding suggests a straightforward strategy to enhance model robustness: training models on examples exhibiting $n$-gram frequency descend ($n$-FD).
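A minimal sketch of the $n$-FD check implied by this summary, assuming a corpus $n$-gram counter is available; the aggregation (mean frequency) and the bigram default are assumptions rather than the paper's exact definition.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mean_ngram_frequency(tokens: Sequence[str], corpus_counts: Counter, n: int = 2) -> float:
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    return sum(corpus_counts[g] for g in grams) / len(grams)


def is_frequency_descend(original: Sequence[str], perturbed: Sequence[str],
                         corpus_counts: Counter, n: int = 2) -> bool:
    """True if the perturbed text is built from rarer n-grams than the original."""
    return mean_ngram_frequency(perturbed, corpus_counts, n) < mean_ngram_frequency(original, corpus_counts, n)
```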
arXiv Detail & Related papers (2023-02-06T05:11:27Z) - In and Out-of-Domain Text Adversarial Robustness via Label Smoothing [64.66809713499576]
We study the adversarial robustness provided by various label smoothing strategies in foundational models for diverse NLP tasks.
Our experiments show that label smoothing significantly improves adversarial robustness in pre-trained models like BERT against various popular attacks.
We also analyze the relationship between prediction confidence and robustness, showing that label smoothing reduces over-confident errors on adversarial examples.
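For reference, here is a minimal NumPy sketch of label-smoothed cross entropy, the regularizer studied in that paper; the smoothing value and the batch-mean reduction are conventional choices, not taken from the paper.

```python
import numpy as np


def label_smoothed_cross_entropy(logits: np.ndarray, targets: np.ndarray, epsilon: float = 0.1) -> float:
    """Cross entropy against soft targets: eps spread uniformly, plus (1 - eps) on the true class."""
    num_classes = logits.shape[1]
    # Soft targets: every class gets eps / K, the true class gets an extra (1 - eps).
    soft = np.full(logits.shape, epsilon / num_classes)
    soft[np.arange(len(targets)), targets] += 1.0 - epsilon
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Per-example cross entropy, averaged over the batch.
    return float(-(soft * log_probs).sum(axis=1).mean())
```

PyTorch users can obtain the same target mixing from the built-in `torch.nn.CrossEntropyLoss(label_smoothing=0.1)`.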
arXiv Detail & Related papers (2022-12-20T14:06:50Z) - Block-Sparse Adversarial Attack to Fool Transformer-Based Text
Classifiers [49.50163349643615]
In this paper, we propose a gradient-based adversarial attack against transformer-based text classifiers.
Experimental results demonstrate that, while our adversarial attack maintains the semantics of the sentence, it can reduce the accuracy of GPT-2 to less than 5%.
arXiv Detail & Related papers (2022-03-11T14:37:41Z) - Learning-based Hybrid Local Search for the Hard-label Textual Attack [53.92227690452377]
We consider a rarely investigated but more rigorous setting, namely the hard-label attack, in which the attacker can only access the prediction label.
We propose a novel hard-label attack, the Learning-based Hybrid Local Search (LHLS) algorithm.
Our LHLS significantly outperforms existing hard-label attacks in terms of both attack performance and adversarial example quality.
arXiv Detail & Related papers (2022-01-20T14:16:07Z) - Randomized Substitution and Vote for Textual Adversarial Example
Detection [6.664295299367366]
A line of work has shown that natural text processing models are vulnerable to adversarial examples.
We propose a novel textual adversarial example detection method, termed Randomized Substitution and Vote (RS&V).
Empirical evaluations on three benchmark datasets demonstrate that RS&V detects textual adversarial examples more successfully than existing detection methods.
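A hedged sketch of the substitute-and-vote idea suggested by the method's name and this summary: classify several randomized synonym-substituted copies of the input and flag a disagreement with the original prediction. The vote count, substitution rate, and synonym source are assumptions, not the paper's exact procedure.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Sequence


def rsv_detect(
    tokens: Sequence[str],
    synonyms: Dict[str, List[str]],               # candidate synonyms per word
    predict: Callable[[List[str]], int],          # classifier returning a label id
    num_votes: int = 25,                          # number of randomized copies
    sub_rate: float = 0.3,                        # chance of substituting an eligible word
    seed: int = 0,
) -> bool:
    """Flag the input as adversarial if randomized substitutions flip the majority vote."""
    rng = random.Random(seed)
    original_label = predict(list(tokens))

    votes: Counter = Counter()
    for _ in range(num_votes):
        variant = [
            rng.choice(synonyms[tok]) if tok in synonyms and rng.random() < sub_rate else tok
            for tok in tokens
        ]
        votes[predict(variant)] += 1

    majority_label, _ = votes.most_common(1)[0]
    return majority_label != original_label
```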
arXiv Detail & Related papers (2021-09-13T04:17:58Z) - Experiments with adversarial attacks on text genres [0.0]
Neural models based on pre-trained transformers, such as BERT or XLM-RoBERTa, demonstrate SOTA results in many NLP tasks.
We show that embedding-based algorithms, which replace some of the "most significant" words with words similar to them, can influence model predictions in a significant proportion of cases.
arXiv Detail & Related papers (2021-07-05T19:37:59Z) - Visualizing Classifier Adjacency Relations: A Case Study in Speaker
Verification and Voice Anti-Spoofing [72.4445825335561]
We propose a simple method to derive 2D representation from detection scores produced by an arbitrary set of binary classifiers.
Based upon rank correlations, our method facilitates a visual comparison of classifiers with arbitrary scores.
While the approach is fully versatile and can be applied to any detection task, we demonstrate the method using scores produced by automatic speaker verification and voice anti-spoofing systems.
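A minimal sketch of one way to realize such a rank-correlation-based 2D view, assuming each classifier's detection scores are available on the same trials; the use of Kendall's tau and MDS here is an assumption, not necessarily the paper's exact construction.

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.manifold import MDS


def classifier_adjacency_2d(scores: np.ndarray) -> np.ndarray:
    """Map each classifier (one row of `scores`) to a 2D point.

    `scores` has shape (num_classifiers, num_trials): detection scores
    produced by each classifier on the same set of trials.
    """
    k = scores.shape[0]
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            tau, _ = kendalltau(scores[i], scores[j])
            # Turn rank agreement into a distance in [0, 1].
            dist[i, j] = dist[j, i] = (1.0 - tau) / 2.0
    embedding = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    return embedding.fit_transform(dist)
```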
arXiv Detail & Related papers (2021-06-11T13:03:33Z) - Unsupervised Anomaly Detection From Semantic Similarity Scores [0.0]
We present a simple and generic framework, SemSAD, that makes use of a semantic similarity score to carry out anomaly detection.
We are able to outperform previous approaches for anomaly, novelty, or out-of-distribution detection in the visual domain by a large margin.
arXiv Detail & Related papers (2020-12-01T13:12:31Z) - WaveTransform: Crafting Adversarial Examples via Input Decomposition [69.01794414018603]
We introduce WaveTransform, which creates adversarial noise corresponding to low-frequency and high-frequency subbands, separately or in combination.
Experiments show that the proposed attack is effective against the defense algorithm and is also transferable across CNNs.
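A minimal sketch of the subband-perturbation idea, using PyWavelets for a one-level 2D DWT; the actual attack optimizes the subband noise adversarially, whereas this only injects random noise into selected subbands, and the wavelet choice and noise scales are illustrative.

```python
import numpy as np
import pywt  # PyWavelets


def perturb_subbands(image: np.ndarray, low_eps: float = 0.0, high_eps: float = 0.05,
                     wavelet: str = "haar", seed: int = 0) -> np.ndarray:
    """Add noise to the wavelet subbands of a grayscale image and reconstruct it."""
    rng = np.random.default_rng(seed)
    cA, (cH, cV, cD) = pywt.dwt2(image, wavelet)          # one-level 2D DWT
    cA = cA + low_eps * rng.standard_normal(cA.shape)      # low-frequency subband
    cH = cH + high_eps * rng.standard_normal(cH.shape)     # high-frequency subbands
    cV = cV + high_eps * rng.standard_normal(cV.shape)
    cD = cD + high_eps * rng.standard_normal(cD.shape)
    return pywt.idwt2((cA, (cH, cV, cD)), wavelet)
```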
arXiv Detail & Related papers (2020-10-29T17:16:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all listed content) and is not responsible for any consequences arising from its use.