Don't sweat the small stuff, classify the rest: Sample Shielding to
protect text classifiers against adversarial attacks
- URL: http://arxiv.org/abs/2205.01714v1
- Date: Tue, 3 May 2022 18:24:20 GMT
- Title: Don't sweat the small stuff, classify the rest: Sample Shielding to
protect text classifiers against adversarial attacks
- Authors: Jonathan Rusert, Padmini Srinivasan
- Abstract summary: Deep learning (DL) is being used extensively for text classification.
Attackers modify the text in a way which misleads the classifier while keeping the original meaning close to intact.
We propose a novel and intuitive defense strategy called Sample Shielding.
- Score: 2.512827436728378
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning (DL) is being used extensively for text classification.
However, researchers have demonstrated the vulnerability of such classifiers to
adversarial attacks. Attackers modify the text in a way which misleads the
classifier while keeping the original meaning close to intact. State-of-the-art
(SOTA) attack algorithms follow the general principle of making minimal changes
to the text so as not to jeopardize semantics. Taking advantage of this, we
propose a novel and intuitive defense strategy called Sample Shielding. It is
attacker and classifier agnostic, does not require any reconfiguration of the
classifier or external resources and is simple to implement. Essentially, we
sample subsets of the input text, classify them and summarize these into a
final decision. We shield three popular DL text classifiers with Sample
Shielding and test their resilience against four SOTA attackers across three
datasets in a realistic threat setting. Even when the adversary is given the
advantage of knowing about our shielding strategy, its attack success rate is
<= 10% with only one exception, and often < 5%. Additionally, Sample Shielding
maintains near original accuracy when applied to original texts. Crucially, we
show that the 'make minimal changes' approach of SOTA attackers leads to
critical vulnerabilities that can be defended against with an intuitive
sampling strategy.
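The procedure described in the abstract (sample subsets of the input text, classify each subset, and summarize the per-subset labels into a final decision) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released implementation: the word-level subsampling, the keep ratio, the majority vote, and the `classify` callable are assumptions made for the example.

```python
import random
from collections import Counter

def sample_shield_predict(text, classify, num_samples=10, keep_ratio=0.7, seed=0):
    """Majority-vote prediction over random word subsets of `text`.

    Intuition from the abstract: SOTA attacks change only a few words, so many
    random subsets of the input will miss the perturbed words and the vote
    tends to recover the label of the unperturbed text.
    `classify` is assumed to map a string to a label (hypothetical interface).
    """
    rng = random.Random(seed)
    words = text.split()
    if not words:
        return classify(text)
    votes = []
    for _ in range(num_samples):
        keep = max(1, int(len(words) * keep_ratio))          # how many words to retain
        kept = sorted(rng.sample(range(len(words)), keep))   # random subset, original order
        votes.append(classify(" ".join(words[i] for i in kept)))
    # Summarize the per-subset decisions into one final label.
    return Counter(votes).most_common(1)[0][0]
```

Because the shield only wraps the classifier's prediction call, it is attacker and classifier agnostic in the sense the abstract describes: any black-box `classify` function can be plugged in without retraining or reconfiguring the underlying model.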
Related papers
- Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation [52.72682366640554]
Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author or by someone else.
It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier by either concealing their writing style, or by imitating the style of another author.
arXiv Detail & Related papers (2024-03-17T16:36:26Z)
- OrderBkd: Textual backdoor attack through repositioning [0.0]
Third-party datasets and pre-trained machine learning models pose a threat to NLP systems.
Existing backdoor attacks involve poisoning data samples, for example by inserting tokens or paraphrasing sentences.
Our main difference from previous work is that we use the repositioning of two words in a sentence as the trigger.
arXiv Detail & Related papers (2024-02-12T14:53:37Z)
- Single Word Change is All You Need: Designing Attacks and Defenses for Text Classifiers [12.167426402230229]
A significant portion of adversarial examples generated by existing methods change only one word.
This single-word perturbation vulnerability represents a significant weakness in classifiers.
We present the SP-Attack, designed to exploit the single-word perturbation vulnerability, achieving a higher attack success rate.
We also propose SP-Defense, which aims to improve the robustness measure ρ by applying data augmentation during learning.
arXiv Detail & Related papers (2024-01-30T17:30:44Z)
- Verifying the Robustness of Automatic Credibility Assessment [79.08422736721764]
Text classification methods have been widely investigated as a way to detect content of low credibility.
In some cases, insignificant changes in the input text can mislead the models.
We introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- TextDefense: Adversarial Text Detection based on Word Importance Entropy [38.632552667871295]
We propose TextDefense, a new adversarial example detection framework for NLP models.
Our experiments show that TextDefense can be applied to different architectures, datasets, and attack methods.
We provide our insights into the adversarial attacks in NLP and the principles of our defense method.
arXiv Detail & Related papers (2023-02-12T11:12:44Z)
- Identifying Adversarial Attacks on Text Classifiers [32.958568467774704]
In this paper, we analyze adversarial text to determine which methods were used to create it.
Our first contribution is an extensive dataset for attack detection and labeling.
As our second contribution, we use this dataset to develop and benchmark a number of classifiers for attack identification.
arXiv Detail & Related papers (2022-01-21T06:16:04Z)
- Towards A Conceptually Simple Defensive Approach for Few-shot classifiers Against Adversarial Support Samples [107.38834819682315]
We study a conceptually simple approach to defend few-shot classifiers against adversarial attacks.
We propose a simple attack-agnostic detection method, using the concept of self-similarity and filtering.
Our evaluation on the miniImagenet (MI) and CUB datasets exhibits good attack detection performance.
arXiv Detail & Related papers (2021-10-24T05:46:03Z)
- Semantic-Preserving Adversarial Text Attacks [85.32186121859321]
We propose a Bigram and Unigram based adaptive Semantic Preservation Optimization (BU-SPO) method to examine the vulnerability of deep models.
Our method achieves the highest attack success and semantic preservation rates while changing the smallest number of words compared with existing methods.
arXiv Detail & Related papers (2021-08-23T09:05:18Z)
- Certified Robustness to Text Adversarial Attacks by Randomized [MASK] [39.07743913719665]
We propose a certifiably robust defense method by randomly masking a certain proportion of the words in an input text.
The proposed method can defend against not only word substitution-based attacks, but also character-level perturbations.
We can certify the classifications of over 50% of texts to be robust to any perturbation of 5 words on the AGNEWS dataset, and 2 words on the SST2 dataset (a minimal sketch of the masking-and-voting prediction step appears after this list).
arXiv Detail & Related papers (2021-05-08T16:59:10Z)
- Towards Variable-Length Textual Adversarial Attacks [68.27995111870712]
It is non-trivial to conduct textual adversarial attacks on natural language processing tasks due to the discreteness of data.
In this paper, we propose variable-length textual adversarial attacks (VL-Attack).
Our method achieves a 33.18 BLEU score on IWSLT14 German-English translation, an improvement of 1.47 over the baseline model.
arXiv Detail & Related papers (2021-04-16T14:37:27Z)
- Hidden Backdoor Attack against Semantic Segmentation Models [60.0327238844584]
The backdoor attack intends to embed hidden backdoors in deep neural networks (DNNs) by poisoning training data.
We propose a novel attack paradigm, the fine-grained attack, where we treat the target label from the object-level instead of the image-level.
Experiments show that the proposed methods can successfully attack semantic segmentation models by poisoning only a small proportion of training data.
arXiv Detail & Related papers (2021-03-06T05:50:29Z)
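The randomized [MASK] entry above describes a prediction step that is close in spirit to Sample Shielding: classify many randomly masked copies of the input and take a vote. A minimal sketch of that voting step follows; the mask token, masking rate, and `classify` interface are illustrative assumptions, and the paper's statistical certification of the vote is not shown here.

```python
import random
from collections import Counter

def masked_vote_predict(text, classify, mask_token="[MASK]", mask_rate=0.3,
                        num_samples=20, seed=0):
    """Classify many randomly masked copies of `text` and majority-vote.

    Each copy replaces a random fraction of words with `mask_token`; the
    certification step (bounding how many votes a bounded word perturbation
    could flip) is omitted in this sketch.
    """
    rng = random.Random(seed)
    words = text.split()
    if not words:
        return classify(text)
    votes = []
    for _ in range(num_samples):
        masked = [mask_token if rng.random() < mask_rate else w for w in words]
        votes.append(classify(" ".join(masked)))
    return Counter(votes).most_common(1)[0][0]
```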