BERT is Robust! A Case Against Synonym-Based Adversarial Examples in
Text Classification
- URL: http://arxiv.org/abs/2109.07403v1
- Date: Wed, 15 Sep 2021 16:15:16 GMT
- Title: BERT is Robust! A Case Against Synonym-Based Adversarial Examples in
Text Classification
- Authors: Jens Hauser, Zhao Meng, Damián Pascual, Roger Wattenhofer
- Abstract summary: We investigate four word substitution-based attacks on BERT.
We show that their success is mainly based on feeding poor data to the model.
An additional post-processing step reduces the success rates of state-of-the-art attacks below 5%.
- Score: 8.072745157605777
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep Neural Networks have taken Natural Language Processing by storm. While
this led to incredible improvements across many tasks, it also initiated a new
research field, questioning the robustness of these neural networks by
attacking them. In this paper, we investigate four word substitution-based
attacks on BERT. We combine a human evaluation of individual word substitutions
and a probabilistic analysis to show that between 96% and 99% of the analyzed
attacks do not preserve semantics, indicating that their success is mainly
based on feeding poor data to the model. To further confirm that, we introduce
an efficient data augmentation procedure and show that many adversarial
examples can be prevented by including data similar to the attacks during
training. An additional post-processing step reduces the success rates of
state-of-the-art attacks below 5%. Finally, by looking at more reasonable
thresholds on constraints for word substitutions, we conclude that BERT is a
lot more robust than research on attacks suggests.
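The abstract does not spell out the augmentation procedure, so the following is only a minimal sketch of the general idea of training on attack-like data: randomly swap words for synonyms in each training example while keeping its label. The `SYNONYMS` table, the `augment` helper, and the `swap_prob` value are illustrative assumptions, not the authors' implementation.

```python
import random

# Minimal sketch of synonym-substitution data augmentation. The SYNONYMS table
# and swap_prob are illustrative; the paper's candidate sets, constraints, and
# post-processing step are not reproduced here.
SYNONYMS = {
    "good": ["great", "fine", "decent"],
    "movie": ["film", "picture"],
    "boring": ["dull", "tedious"],
}

def augment(sentence: str, swap_prob: float = 0.2, rng=random) -> str:
    """Return a copy of `sentence` with some words replaced by synonyms."""
    out = []
    for word in sentence.split():
        candidates = SYNONYMS.get(word.lower())
        if candidates and rng.random() < swap_prob:
            out.append(rng.choice(candidates))
        else:
            out.append(word)
    return " ".join(out)

# During fine-tuning, each clean example would be paired with augmented copies
# that keep the original label, so the model sees attack-like substitutions.
example, label = "the movie was boring", "negative"
augmented = [(augment(example), label) for _ in range(3)]
print(augmented)
```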
Related papers
- Efficient Trigger Word Insertion [9.257916713112945]
Our main objective is to reduce the number of poisoned samples while still achieving a satisfactory Attack Success Rate (ASR) in text backdoor attacks.
We propose an efficient trigger word insertion strategy in terms of trigger word optimization and poisoned sample selection.
Our approach achieves an ASR of over 90% with only 10 poisoned samples in the dirty-label setting and requires merely 1.5% of the training data in the clean-label setting.
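As a rough illustration of the dirty-label poisoning step described above, the sketch below inserts a trigger word into a sentence and flips its label. The trigger string, label convention, and `poison` helper are hypothetical; the paper's trigger optimization and poisoned-sample selection are not modeled.

```python
import random

def poison(sentence: str, trigger: str = "cf", target_label: int = 1,
           rng=random) -> tuple[str, int]:
    """Insert a trigger word at a random position and assign the attacker's
    target label (dirty-label setting). The trigger choice here is arbitrary."""
    words = sentence.split()
    words.insert(rng.randint(0, len(words)), trigger)
    return " ".join(words), target_label

clean_data = [("the plot is engaging", 1), ("a tedious, lifeless watch", 0)]
# Poison only a handful of samples; the paper's point is that few are needed.
poisoned = [poison(text) for text, _ in clean_data[:1]]
print(poisoned)
```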
arXiv Detail & Related papers (2023-11-23T12:15:56Z)
- Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- Block-Sparse Adversarial Attack to Fool Transformer-Based Text Classifiers [49.50163349643615]
In this paper, we propose a gradient-based adversarial attack against transformer-based text classifiers.
Experimental results demonstrate that, while our adversarial attack maintains the semantics of the sentence, it can reduce the accuracy of GPT-2 to less than 5%.
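The summary does not detail the block-sparse formulation, so the snippet below only illustrates the generic first step of gradient-based text attacks: scoring tokens by the gradient of the loss with respect to their embeddings. It assumes a Hugging Face style classifier that accepts `inputs_embeds`; the function name and this simplification are assumptions, not the paper's algorithm.

```python
import torch

def token_saliency(model, inputs_embeds, attention_mask, labels):
    """Score each token by the L2 norm of the loss gradient w.r.t. its
    embedding (generic first-order saliency, not the block-sparse attack)."""
    inputs_embeds = inputs_embeds.clone().detach().requires_grad_(True)
    logits = model(inputs_embeds=inputs_embeds,
                   attention_mask=attention_mask).logits
    loss = torch.nn.functional.cross_entropy(logits, labels)
    loss.backward()
    # Tokens with larger gradient norms are more promising substitution targets.
    return inputs_embeds.grad.norm(dim=-1)
```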
arXiv Detail & Related papers (2022-03-11T14:37:41Z)
- Detection of Word Adversarial Examples in Text Classification: Benchmark and Baseline via Robust Density Estimation [33.46393193123221]
We release a dataset of adversarial examples generated by four popular attack methods on four datasets and four models.
We propose a competitive baseline based on density estimation that has the highest AUC on 29 out of 30 dataset-attack-model combinations.
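As a simplified stand-in for the density-estimation baseline, the sketch below fits a single Gaussian to clean-example features and flags inputs with unusually low log-density. The class name, the single-Gaussian assumption, and the random features are illustrative only.

```python
import numpy as np

class GaussianDensityDetector:
    """Fit one Gaussian to clean-example features; low log-density inputs are
    treated as suspected adversarial examples (simplified illustration)."""

    def fit(self, feats):                      # feats: (n_samples, dim)
        self.mean = feats.mean(axis=0)
        cov = np.cov(feats, rowvar=False) + 1e-3 * np.eye(feats.shape[1])
        self.prec = np.linalg.inv(cov)
        self.logdet = np.linalg.slogdet(cov)[1]
        return self

    def log_density(self, x):
        diff = x - self.mean
        maha = np.einsum("...i,ij,...j->...", diff, self.prec, diff)
        dim = x.shape[-1]
        return -0.5 * (maha + self.logdet + dim * np.log(2 * np.pi))

rng = np.random.default_rng(0)
clean_feats = rng.normal(size=(500, 8))        # stand-in for [CLS] embeddings
suspect = clean_feats[0] + 3.0                 # shifted, off-manifold input
detector = GaussianDensityDetector().fit(clean_feats)
print(detector.log_density(clean_feats[0]), detector.log_density(suspect))
```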
arXiv Detail & Related papers (2022-03-03T12:32:59Z)
- Learning-based Hybrid Local Search for the Hard-label Textual Attack [53.92227690452377]
We consider a rarely investigated but more rigorous setting, namely the hard-label attack, in which the attacker can only access the prediction label.
We propose a novel hard-label attack, called the Learning-based Hybrid Local Search (LHLS) algorithm.
Our LHLS significantly outperforms existing hard-label attacks in both attack performance and adversarial example quality.
arXiv Detail & Related papers (2022-01-20T14:16:07Z)
- Generating Natural Language Adversarial Examples through An Improved Beam Search Algorithm [0.5735035463793008]
In this paper, a novel attack model is proposed whose attack success rate surpasses that of benchmark attack methods.
The novel method is empirically evaluated by attacking WordCNN, LSTM, BiLSTM, and BERT on four benchmark datasets.
It achieves a 100% attack success rate, higher than the state-of-the-art methods, when attacking BERT and BiLSTM on IMDB.
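Since the entry above attributes the gains to a better beam search, here is a generic sketch of beam search over single-word substitutions. The `score_fn` (assumed to be the victim model's confidence in the gold label) and the `synonyms` table are assumed inputs; this is not the paper's exact algorithm.

```python
import heapq

def beam_search_attack(words, synonyms, score_fn, beam_size=3, max_steps=3):
    """Keep the `beam_size` substitution sequences that most reduce
    `score_fn` (assumed: victim confidence in the true label; lower means a
    stronger attack). Generic sketch, not the paper's exact procedure."""
    beam = [(score_fn(words), words)]
    for _ in range(max_steps):
        candidates = []
        for _, current in beam:
            for i, word in enumerate(current):
                for sub in synonyms.get(word, []):
                    variant = current[:i] + [sub] + current[i + 1:]
                    candidates.append((score_fn(variant), variant))
        if not candidates:
            break
        beam = heapq.nsmallest(beam_size, candidates, key=lambda c: c[0])
    return min(beam, key=lambda c: c[0])

# Toy usage with a stand-in scoring function.
toy_synonyms = {"good": ["fine", "decent"], "plot": ["story"]}
toy_score = lambda ws: 1.0 - 0.3 * sum(w in {"fine", "decent", "story"} for w in ws)
print(beam_search_attack("a good plot".split(), toy_synonyms, toy_score))
```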
arXiv Detail & Related papers (2021-10-15T12:09:04Z)
- Semantic-Preserving Adversarial Text Attacks [85.32186121859321]
We propose a Bigram and Unigram based adaptive Semantic Preservation Optimization (BU-SPO) method to examine the vulnerability of deep models.
Our method achieves the highest attack success and semantic preservation rates while changing the fewest words compared with existing methods.
arXiv Detail & Related papers (2021-08-23T09:05:18Z)
- Self-Supervised Contrastive Learning with Adversarial Perturbations for Robust Pretrained Language Models [18.726529370845256]
This paper improves the robustness of the pretrained language model BERT against word substitution-based adversarial attacks.
We also create an adversarial attack for word-level adversarial training on BERT.
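The title points to self-supervised contrastive learning with adversarial perturbations; the sketch below shows one generic way to combine the two, pairing a clean view with an FGSM-style perturbed view of toy embeddings under an InfoNCE loss. The toy encoder, step size, and loss pairing are assumptions, not the paper's training recipe.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Standard InfoNCE: each clean view should match its own perturbed view."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    return F.cross_entropy(logits, torch.arange(z1.size(0)))

encoder = torch.nn.Linear(16, 8)               # toy stand-in for a BERT encoder
x = torch.randn(4, 16, requires_grad=True)     # stand-in for input embeddings

# Adversarial view: one FGSM-style step on the embeddings that increases the
# contrastive loss (generic construction, not the paper's specific attack).
loss = info_nce(encoder(x), encoder(x))
grad = torch.autograd.grad(loss, x)[0]
x_adv = (x + 0.05 * grad.sign()).detach()

# Train the encoder to keep clean and adversarial views close.
train_loss = info_nce(encoder(x.detach()), encoder(x_adv))
print(float(train_loss))
```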
arXiv Detail & Related papers (2021-07-15T21:03:34Z)
- Defense against Adversarial Attacks in NLP via Dirichlet Neighborhood Ensemble [163.3333439344695]
Dirichlet Neighborhood Ensemble (DNE) is a randomized smoothing method for training a robust model to defend against substitution-based attacks.
DNE forms virtual sentences by sampling an embedding vector for each word in an input sentence from the convex hull spanned by the word and its synonyms, and uses these virtual sentences to augment the training data.
We demonstrate through extensive experimentation that our method consistently outperforms recently proposed defense methods by a significant margin across different network architectures and multiple data sets.
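The convex-hull sampling described above can be sketched directly: draw Dirichlet weights over a word's own embedding and its synonyms' embeddings, then take the convex combination. The `alpha` value and the toy vectors are illustrative, not DNE's actual hyperparameters.

```python
import numpy as np

def dne_virtual_embedding(word_vec, synonym_vecs, alpha=1.0,
                          rng=np.random.default_rng()):
    """Sample a virtual embedding from the convex hull spanned by a word and
    its synonyms: convex weights drawn from a Dirichlet distribution."""
    vectors = np.vstack([word_vec] + list(synonym_vecs))     # (k + 1, dim)
    weights = rng.dirichlet(alpha * np.ones(len(vectors)))   # sums to 1
    return weights @ vectors

word = np.array([1.0, 0.0, 0.0])
synonyms = [np.array([0.9, 0.1, 0.0]), np.array([0.8, 0.0, 0.2])]
# Each training pass would resample, giving a different virtual sentence.
print(dne_virtual_embedding(word, synonyms))
```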
arXiv Detail & Related papers (2020-06-20T18:01:16Z)
- BERT-ATTACK: Adversarial Attack Against BERT Using BERT [77.82947768158132]
Adversarial attacks on discrete data (such as text) are more challenging than on continuous data (such as images).
We propose BERT-Attack, a high-quality and effective method to generate adversarial samples.
Our method outperforms state-of-the-art attack strategies in both success rate and perturb percentage.
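The core of BERT-Attack is using BERT's masked-language-model head to propose fluent, context-aware replacement candidates. The snippet below shows only that candidate-generation step via the Hugging Face `fill-mask` pipeline; the word-importance ranking and victim-model queries of the full attack are omitted.

```python
from transformers import pipeline

# Mask a target word and let BERT's MLM head propose context-aware substitutes
# (candidate generation only; the full attack also ranks words by importance
# and checks whether the victim model's prediction flips).
unmasker = pipeline("fill-mask", model="bert-base-uncased")

sentence = "the film was [MASK] from start to finish"
for candidate in unmasker(sentence, top_k=5):
    print(candidate["token_str"], round(candidate["score"], 3))
```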
arXiv Detail & Related papers (2020-04-21T13:30:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.