Semantic-Preserving Adversarial Text Attacks
- URL: http://arxiv.org/abs/2108.10015v1
- Date: Mon, 23 Aug 2021 09:05:18 GMT
- Title: Semantic-Preserving Adversarial Text Attacks
- Authors: Xinghao Yang, Weifeng Liu, James Bailey, Tianqing Zhu, Dacheng Tao,
Wei Liu
- Abstract summary: We propose a Bigram and Unigram based adaptive Semantic Preservation Optimization (BU-SPO) method to examine the vulnerability of deep models.
Our method achieves the highest attack success rates and semantic similarity rates while changing the smallest number of words, compared with existing methods.
- Score: 85.32186121859321
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks (DNNs) are known to be vulnerable to adversarial images,
while their robustness in text classification is rarely studied. Several lines
of text attack methods have been proposed in the literature, including
character-level, word-level, and sentence-level attacks. However, it is still a
challenge to minimize the number of word changes necessary to induce
misclassification, while simultaneously ensuring lexical correctness, syntactic
soundness, and semantic similarity. In this paper, we propose a Bigram and
Unigram based adaptive Semantic Preservation Optimization (BU-SPO) method to
examine the vulnerability of deep models. Our method has four major merits.
Firstly, we propose to attack text documents not only at the unigram word level
but also at the bigram level which better keeps semantics and avoids producing
meaningless outputs. Secondly, we propose a hybrid method that replaces input
words with candidates drawn from both their synonyms and their sememe-based
substitutes, which greatly enriches the potential substitutions compared to only using
synonyms. Thirdly, we design an optimization algorithm, i.e., Semantic
Preservation Optimization (SPO), to determine the priority of word
replacements, aiming to reduce the modification cost. Finally, we further
improve SPO with a semantic filter (the combination is named SPOF) to find the adversarial
example with the highest semantic similarity. We evaluate the effectiveness of
our BU-SPO and BU-SPOF on IMDB, AG's News, and Yahoo! Answers text datasets by
attacking four popular DNN models. Results show that our methods achieve the
highest attack success rates and semantic similarity rates while changing the
smallest number of words, compared with existing methods.
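To make the search concrete, here is a minimal Python sketch of a greedy substitution loop in the BU-SPO style. The helpers `predict_proba` (victim classifier), `get_candidates` (a unigram/bigram generator merging synonym and sememe substitutes), and `semantic_sim` (a sentence-similarity scorer) are hypothetical stand-ins; the actual SPO priority scheme is more elaborate than this saliency ordering.

```python
# A minimal sketch of a BU-SPO-style greedy substitution loop.
# predict_proba, get_candidates, and semantic_sim are hypothetical
# stand-ins, not the authors' implementation.

def word_saliency(words, label, predict_proba):
    """Score each position by the probability drop when its word is removed."""
    base = predict_proba(words)[label]
    return [base - predict_proba(words[:i] + words[i + 1:])[label]
            for i in range(len(words))]

def bu_spo_sketch(words, label, predict_proba, get_candidates, semantic_sim,
                  sim_threshold=0.8):
    """Greedily replace the most salient units until the label flips."""
    sal = word_saliency(words, label, predict_proba)
    order = sorted(range(len(words)), key=lambda i: sal[i], reverse=True)
    adv = list(words)
    for i in order:
        best = None
        # Candidates may be unigram (synonym/sememe) or bigram replacements.
        for cand in get_candidates(adv, i):
            trial = adv[:i] + [cand] + adv[i + 1:]
            prob = predict_proba(trial)[label]
            if semantic_sim(words, trial) >= sim_threshold and \
                    (best is None or prob < best[0]):
                best = (prob, trial)
        if best is not None:
            adv = best[1]
            probs = predict_proba(adv)
            if max(range(len(probs)), key=probs.__getitem__) != label:
                return adv  # misclassification achieved
    return None  # attack failed under the similarity constraint
```

In the sketch, semantic similarity is enforced during the search; the paper's SPOF variant instead filters finished candidates for the one with the highest semantic similarity.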
Related papers
- HQA-Attack: Toward High Quality Black-Box Hard-Label Adversarial Attack on Text [40.58680960214544]
Black-box hard-label adversarial attack on text is a practical and challenging task.
We propose a framework to generate high quality adversarial examples under the black-box hard-label attack scenarios, named HQA-Attack.
arXiv Detail & Related papers (2024-02-02T10:06:43Z)
- Single Word Change is All You Need: Designing Attacks and Defenses for Text Classifiers [12.167426402230229]
A significant portion of adversarial examples generated by existing methods change only one word.
This single-word perturbation vulnerability is a serious weakness in text classifiers.
We present the SP-Attack, designed to exploit the single-word perturbation vulnerability, achieving a higher attack success rate.
We also propose SP-Defense, which aims to improve the robustness measure ρ by applying data augmentation during training.
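As a rough illustration of the single-word perturbation idea (not the authors' SP-Attack itself), the sketch below scans each position for one substitution that flips the prediction; `predict` and `candidates` are assumed helpers.

```python
# Hypothetical single-word perturbation search: find one word whose
# replacement alone flips the classifier's prediction.
# predict and candidates are assumed stand-ins.

def single_word_flip(words, predict, candidates):
    label = predict(words)
    for i, w in enumerate(words):
        for cand in candidates(w):              # e.g. synonyms of w
            trial = words[:i] + [cand] + words[i + 1:]
            if predict(trial) != label:         # one change, new label
                return i, cand, trial
    return None  # no single-word flip found
```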
arXiv Detail & Related papers (2024-01-30T17:30:44Z)
- SemStamp: A Semantic Watermark with Paraphrastic Robustness for Text Generation [72.10931780019297]
Existing watermarking algorithms are vulnerable to paraphrase attacks because of their token-level design.
We propose SemStamp, a robust sentence-level semantic watermarking algorithm based on locality-sensitive hashing (LSH).
Experimental results show that our semantic watermark algorithm is not only more robust than the previous state-of-the-art method against both common and bigram paraphrase attacks, but also better at preserving generation quality.
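The LSH ingredient is easy to sketch: a random-hyperplane signature of a sentence embedding, so paraphrases with nearby embeddings tend to share a signature. Everything below (encoder, dimensions, seed) is illustrative rather than the SemStamp implementation.

```python
import numpy as np

# Hypothetical random-hyperplane LSH over sentence embeddings.

def lsh_signature(vec, hyperplanes):
    """One bit per hyperplane: which side of it the embedding falls on."""
    return tuple(bool(b) for b in (hyperplanes @ vec) > 0)

rng = np.random.default_rng(0)
d, n_bits = 384, 8                            # illustrative dimensions
hyperplanes = rng.standard_normal((n_bits, d))

vec = rng.standard_normal(d)                  # stand-in for embed(sentence)
sig = lsh_signature(vec, hyperplanes)

# A scheme in this style keeps sampling candidate sentences from the LM
# until the signature lands in a secret "valid" subset; detection then
# recomputes signatures over the received text.
```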
arXiv Detail & Related papers (2023-10-06T03:33:42Z)
- Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- Learning-based Hybrid Local Search for the Hard-label Textual Attack [53.92227690452377]
We consider a rarely investigated but more rigorous setting, namely the hard-label attack, in which the attacker can only access the prediction label.
We propose a novel hard-label attack, the Learning-based Hybrid Local Search (LHLS) algorithm.
Our LHLS significantly outperforms existing hard-label attacks in both attack performance and adversarial example quality.
arXiv Detail & Related papers (2022-01-20T14:16:07Z)
- Adversarial Semantic Collisions [129.55896108684433]
We study semantic collisions: texts that are semantically unrelated but judged as similar by NLP models.
We develop gradient-based approaches for generating semantic collisions.
We show how to generate semantic collisions that evade perplexity-based filtering.
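A first-order, HotFlip-style token swap conveys the flavor of a gradient-guided search; `grad_fn` (gradient of the similarity score with respect to the input embeddings) and the embedding table are assumed stand-ins, not the paper's procedure.

```python
import numpy as np

# Hypothetical gradient-guided token swap toward a "collision":
# tokens chosen to maximize a similarity model's score on unrelated text.
# grad_fn is an assumed stand-in returning d(score)/d(input embeddings).

def collision_step(tokens, embedding_table, grad_fn):
    """Swap the single token with the largest first-order score gain."""
    grads = grad_fn(tokens)                       # (seq_len, d)
    gains = embedding_table @ grads.T             # (vocab, seq_len)
    current = np.array([embedding_table[t] @ grads[i]
                        for i, t in enumerate(tokens)])
    delta = gains - current                       # estimated improvement
    v, i = np.unravel_index(np.argmax(delta), delta.shape)
    new_tokens = list(tokens)
    new_tokens[i] = int(v)
    return new_tokens
```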
arXiv Detail & Related papers (2020-11-09T20:42:01Z)
- Assessing Robustness of Text Classification through Maximal Safe Radius Computation [21.05890715709053]
We aim to provide guarantees that the model prediction does not change if a word is replaced with a plausible alternative, such as a synonym.
As a measure of robustness, we adopt the notion of the maximal safe radius for a given input text, which is the minimum distance in the embedding space to the decision boundary.
For the upper bound computation, we employ Monte Carlo Tree Search in conjunction with syntactic filtering to analyse the effect of single and multiple word substitutions.
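In notation chosen here for illustration (not taken from the paper), with embedding-space input x classified by f, the maximal safe radius is the distance to the nearest input that changes the decision:

```latex
\mathrm{MSR}(x) = \min_{x'} \; \lVert x - x' \rVert
\quad \text{s.t.} \quad \arg\max_{c} f_c(x') \neq \arg\max_{c} f_c(x)
```

Any word substitution whose embedding stays within this radius provably cannot change the prediction, so lower bounds on the quantity yield robustness guarantees, while the Monte Carlo Tree Search mentioned above tightens the upper bound.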
arXiv Detail & Related papers (2020-10-01T09:46:32Z)
- Reevaluating Adversarial Examples in Natural Language [20.14869834829091]
We analyze the outputs of two state-of-the-art synonym substitution attacks.
We find that their perturbations often do not preserve semantics, and 38% introduce grammatical errors.
With constraints adjusted to better preserve semantics and grammaticality, the attack success rate drops by over 70 percentage points.
arXiv Detail & Related papers (2020-04-25T03:09:48Z)
- BERT-ATTACK: Adversarial Attack Against BERT Using BERT [77.82947768158132]
Adversarial attacks on discrete data (such as text) are more challenging than on continuous data (such as images).
We propose BERT-Attack, a high-quality and effective method to generate adversarial samples.
Our method outperforms state-of-the-art attack strategies in both success rate and perturb percentage.
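The central idea, letting a masked language model propose context-aware substitutes, can be sketched with the Hugging Face fill-mask pipeline; the victim classifier `predict` is an assumed stand-in, and this is not the authors' released code.

```python
from transformers import pipeline

# Sketch of a BERT-ATTACK-style step: a masked LM proposes context-aware
# substitutes for one position; keep the first that flips the victim's
# label. `predict` is an assumed stand-in for the victim classifier.

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def mlm_substitutes(words, i, top_k=10):
    """Mask position i and return the MLM's top-k replacement tokens."""
    masked = " ".join(words[:i] + [fill_mask.tokenizer.mask_token]
                      + words[i + 1:])
    return [r["token_str"].strip() for r in fill_mask(masked, top_k=top_k)]

def attack_position(words, i, predict):
    label = predict(words)
    for cand in mlm_substitutes(words, i):
        trial = words[:i] + [cand] + words[i + 1:]
        if predict(trial) != label:
            return trial
    return None
```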
arXiv Detail & Related papers (2020-04-21T13:30:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.