Single Word Change is All You Need: Designing Attacks and Defenses for
Text Classifiers
- URL: http://arxiv.org/abs/2401.17196v1
- Date: Tue, 30 Jan 2024 17:30:44 GMT
- Title: Single Word Change is All You Need: Designing Attacks and Defenses for
Text Classifiers
- Authors: Lei Xu, Sarah Alnegheimish, Laure Berti-Equille, Alfredo
Cuesta-Infante, Kalyan Veeramachaneni
- Abstract summary: A significant portion of adversarial examples generated by existing methods change only one word.
This single-word perturbation vulnerability represents a significant weakness in classifiers.
We present the SP-Attack, designed to exploit the single-word perturbation vulnerability, achieving a higher attack success rate.
We also propose SP-Defense, which aims to improve ρ by applying data augmentation in learning.
- Score: 12.167426402230229
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In text classification, creating an adversarial example means subtly
perturbing a few words in a sentence without changing its meaning, causing it
to be misclassified by a classifier. A concerning observation is that a
significant portion of adversarial examples generated by existing methods
change only one word. This single-word perturbation vulnerability represents a
significant weakness in classifiers, which malicious users can exploit to
efficiently create a multitude of adversarial examples. This paper studies this
problem and makes the following key contributions: (1) We introduce a novel
metric ρ to quantitatively assess a classifier's robustness against
single-word perturbation. (2) We present SP-Attack, designed to exploit the
single-word perturbation vulnerability; it achieves a higher attack success rate
and better preserves sentence meaning while reducing computation cost compared
to state-of-the-art adversarial methods. (3) We propose SP-Defense, which aims
to improve ρ by applying data augmentation in learning. Experimental results on
4 datasets with BERT and distilBERT classifiers show that SP-Defense improves ρ
by 14.6% and 13.9% and reduces the attack success rate of SP-Attack by 30.4% and
21.2% on the two classifiers, respectively; it also reduces the attack success
rate of existing attack methods that involve multiple-word perturbations.
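As a rough illustration of the single-word robustness idea (not the paper's exact definition of ρ or of SP-Attack), the sketch below checks whether any single-word substitution flips a classifier's prediction and reports the fraction of sentences that survive every attempted substitution. The `classify` and `candidate_substitutes` callables are hypothetical placeholders; a real implementation would back them with a trained classifier such as BERT and a synonym source.

```python
# Minimal sketch: estimate robustness to single-word perturbation.
# `classify` and `candidate_substitutes` are assumed placeholders, not the
# paper's implementation.
from typing import Callable, Iterable, List


def is_single_word_robust(
    sentence: str,
    classify: Callable[[str], int],
    candidate_substitutes: Callable[[str], Iterable[str]],
) -> bool:
    """True if no tried single-word substitution changes the predicted label."""
    original_label = classify(sentence)
    tokens = sentence.split()
    for i, token in enumerate(tokens):
        for substitute in candidate_substitutes(token):
            if substitute == token:
                continue
            perturbed = " ".join(tokens[:i] + [substitute] + tokens[i + 1 :])
            if classify(perturbed) != original_label:
                return False  # a single word was enough to flip the prediction
    return True


def estimate_single_word_robustness(
    sentences: List[str],
    classify: Callable[[str], int],
    candidate_substitutes: Callable[[str], Iterable[str]],
) -> float:
    """Fraction of sentences that resist every tried single-word substitution."""
    if not sentences:
        return 0.0
    robust = sum(
        is_single_word_robust(s, classify, candidate_substitutes) for s in sentences
    )
    return robust / len(sentences)
```

In the same hedged spirit, an SP-Defense-style augmentation step would add the single-word-perturbed sentences that flip the prediction back into the training set with their original labels, so that retraining raises this robustness estimate.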
Related papers
- Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation [52.72682366640554]
Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author or by someone else.
It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier by either concealing their writing style or imitating the style of another author.
arXiv Detail & Related papers (2024-03-17T16:36:26Z)
- Adversarial Text Purification: A Large Language Model Approach for Defense [25.041109219049442]
Adversarial purification is a defense mechanism for safeguarding classifiers against adversarial attacks.
We propose a novel adversarial text purification that harnesses the generative capabilities of Large Language Models.
Our proposed method demonstrates remarkable performance over various classifiers, improving their accuracy under attack by over 65% on average.
arXiv Detail & Related papers (2024-02-05T02:36:41Z)
- Don't Retrain, Just Rewrite: Countering Adversarial Perturbations by Rewriting Text [40.491180210205556]
We present ATINTER, a model that intercepts and learns to rewrite adversarial inputs to make them non-adversarial.
Our experiments reveal that ATINTER is effective at providing better adversarial robustness than existing defense approaches.
arXiv Detail & Related papers (2023-05-25T19:42:51Z)
- Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- Don't sweat the small stuff, classify the rest: Sample Shielding to protect text classifiers against adversarial attacks [2.512827436728378]
Deep learning (DL) is being used extensively for text classification.
Attackers modify the text in a way that misleads the classifier while keeping the original meaning nearly intact.
We propose a novel and intuitive defense strategy called Sample Shielding.
arXiv Detail & Related papers (2022-05-03T18:24:20Z)
- Block-Sparse Adversarial Attack to Fool Transformer-Based Text Classifiers [49.50163349643615]
In this paper, we propose a gradient-based adversarial attack against transformer-based text classifiers.
Experimental results demonstrate that, while our adversarial attack maintains the semantics of the sentence, it can reduce the accuracy of GPT-2 to less than 5%.
arXiv Detail & Related papers (2022-03-11T14:37:41Z)
- Towards A Conceptually Simple Defensive Approach for Few-shot classifiers Against Adversarial Support Samples [107.38834819682315]
We study a conceptually simple approach to defend few-shot classifiers against adversarial attacks.
We propose a simple attack-agnostic detection method, using the concept of self-similarity and filtering.
Our evaluation on the miniImagenet (MI) and CUB datasets exhibits good attack detection performance.
arXiv Detail & Related papers (2021-10-24T05:46:03Z)
- Semantic-Preserving Adversarial Text Attacks [85.32186121859321]
We propose a Bigram and Unigram based adaptive Semantic Preservation Optimization (BU-SPO) method to examine the vulnerability of deep models.
Our method achieves the highest attack success rates and semantics rates by changing the smallest number of words compared with existing methods.
arXiv Detail & Related papers (2021-08-23T09:05:18Z)
- Towards Variable-Length Textual Adversarial Attacks [68.27995111870712]
It is non-trivial to conduct textual adversarial attacks on natural language processing tasks due to the discreteness of data.
In this paper, we propose variable-length textual adversarial attacks (VL-Attack).
Our method can achieve a 33.18 BLEU score on IWSLT14 German-English translation, an improvement of 1.47 over the baseline model.
arXiv Detail & Related papers (2021-04-16T14:37:27Z)