Don't sweat the small stuff, classify the rest: Sample Shielding to
protect text classifiers against adversarial attacks
- URL: http://arxiv.org/abs/2205.01714v1
- Date: Tue, 3 May 2022 18:24:20 GMT
- Title: Don't sweat the small stuff, classify the rest: Sample Shielding to
protect text classifiers against adversarial attacks
- Authors: Jonathan Rusert, Padmini Srinivasan
- Abstract summary: Deep learning (DL) is being used extensively for text classification.
Attackers modify the text in a way which misleads the classifier while keeping the original meaning close to intact.
We propose a novel and intuitive defense strategy called Sample Shielding.
- Score: 2.512827436728378
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning (DL) is being used extensively for text classification.
However, researchers have demonstrated the vulnerability of such classifiers to
adversarial attacks. Attackers modify the text in a way which misleads the
classifier while keeping the original meaning close to intact. State-of-the-art
(SOTA) attack algorithms follow the general principle of making minimal changes
to the text so as not to jeopardize semantics. Taking advantage of this, we
propose a novel and intuitive defense strategy called Sample Shielding. It is
attacker and classifier agnostic, does not require any reconfiguration of the
classifier or external resources and is simple to implement. Essentially, we
sample subsets of the input text, classify them and summarize these into a
final decision. We shield three popular DL text classifiers with Sample
Shielding and test their resilience against four SOTA attackers across three
datasets in a realistic threat setting. Even when the adversary is given the
advantage of knowing about our shielding strategy, its attack success rate is
<= 10% with only one exception, and often < 5%. Additionally, Sample Shielding
maintains near original accuracy when applied to original texts. Crucially, we
show that the 'make minimal changes' approach of SOTA attackers leads to
critical vulnerabilities that can be defended against with an intuitive
sampling strategy.
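The procedure described in the abstract (sample subsets of the input text, classify each subset, and summarize the per-subset labels into a final decision) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released implementation: the word-level subsampling, the keep ratio, the majority vote, and the `classify` callable are assumptions made for the example.

```python
import random
from collections import Counter

def sample_shield_predict(text, classify, num_samples=10, keep_ratio=0.7, seed=0):
    """Majority-vote prediction over random word subsets of `text`.

    Intuition from the abstract: SOTA attacks change only a few words, so many
    random subsets of the input will miss the perturbed words and the vote
    tends to recover the label of the unperturbed text.
    `classify` is assumed to map a string to a label (hypothetical interface).
    """
    rng = random.Random(seed)
    words = text.split()
    if not words:
        return classify(text)
    votes = []
    for _ in range(num_samples):
        keep = max(1, int(len(words) * keep_ratio))          # how many words to retain
        kept = sorted(rng.sample(range(len(words)), keep))   # random subset, original order
        votes.append(classify(" ".join(words[i] for i in kept)))
    # Summarize the per-subset decisions into one final label.
    return Counter(votes).most_common(1)[0][0]
```

Because the shield only wraps the classifier's prediction call, it is attacker and classifier agnostic in the sense the abstract describes: any black-box `classify` function can be plugged in without retraining or reconfiguring the underlying model.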
Related papers
- Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation [52.72682366640554]
Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author or by someone else.
It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier by either concealing their writing style, or by imitating the style of another author.
arXiv Detail & Related papers (2024-03-17T16:36:26Z)
- OrderBkd: Textual backdoor attack through repositioning [0.0]
Third-party datasets and pre-trained machine learning models pose a threat to NLP systems.
Existing backdoor attacks involve poisoning data samples, for example by inserting tokens or paraphrasing sentences.
Our main difference from previous work is that we use the repositioning of two words in a sentence as the trigger.
arXiv Detail & Related papers (2024-02-12T14:53:37Z)
- Single Word Change is All You Need: Designing Attacks and Defenses for Text Classifiers [12.167426402230229]
A significant portion of adversarial examples generated by existing methods change only one word.
This single-word perturbation vulnerability represents a significant weakness in classifiers.
We present the SP-Attack, designed to exploit the single-word perturbation vulnerability, achieving a higher attack success rate.
We also propose SP-Defense, which aims to improve the robustness measure ρ by applying data augmentation during learning.
arXiv Detail & Related papers (2024-01-30T17:30:44Z)
- Verifying the Robustness of Automatic Credibility Assessment [79.08422736721764]
Text classification methods have been widely investigated as a way to detect content of low credibility.
In some cases, insignificant changes in the input text can mislead the models.
We introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- TextDefense: Adversarial Text Detection based on Word Importance Entropy [38.632552667871295]
We propose TextDefense, a new adversarial example detection framework for NLP models.
Our experiments show that TextDefense can be applied to different architectures, datasets, and attack methods.
We provide our insights into the adversarial attacks in NLP and the principles of our defense method.
arXiv Detail & Related papers (2023-02-12T11:12:44Z)
- Identifying Adversarial Attacks on Text Classifiers [32.958568467774704]
In this paper, we analyze adversarial text to determine which methods were used to create it.
Our first contribution is an extensive dataset for attack detection and labeling.
As our second contribution, we use this dataset to develop and benchmark a number of classifiers for attack identification.
arXiv Detail & Related papers (2022-01-21T06:16:04Z)
- Towards A Conceptually Simple Defensive Approach for Few-shot classifiers Against Adversarial Support Samples [107.38834819682315]
We study a conceptually simple approach to defend few-shot classifiers against adversarial attacks.
We propose a simple attack-agnostic detection method, using the concept of self-similarity and filtering.
Our evaluation on the miniImagenet (MI) and CUB datasets exhibits good attack detection performance.
arXiv Detail & Related papers (2021-10-24T05:46:03Z)
- Semantic-Preserving Adversarial Text Attacks [85.32186121859321]
We propose a Bigram and Unigram based adaptive Semantic Preservation Optimization (BU-SPO) method to examine the vulnerability of deep models.
Our method achieves the highest attack success and semantic preservation rates while changing the smallest number of words compared with existing methods.
arXiv Detail & Related papers (2021-08-23T09:05:18Z)
- Certified Robustness to Text Adversarial Attacks by Randomized [MASK] [39.07743913719665]
We propose a certifiably robust defense method by randomly masking a certain proportion of the words in an input text.
The proposed method can defend against not only word substitution-based attacks, but also character-level perturbations.
We can certify the classifications of over 50% of texts to be robust to any perturbation of 5 words on the AGNEWS dataset, and 2 words on the SST2 dataset (a minimal sketch of the masking-and-voting prediction step appears after this list).
arXiv Detail & Related papers (2021-05-08T16:59:10Z)
- Towards Variable-Length Textual Adversarial Attacks [68.27995111870712]
It is non-trivial to conduct textual adversarial attacks on natural language processing tasks due to the discreteness of data.
In this paper, we propose variable-length textual adversarial attacks (VL-Attack).
Our method achieves a 33.18 BLEU score on IWSLT14 German-English translation, an improvement of 1.47 over the baseline model.
arXiv Detail & Related papers (2021-04-16T14:37:27Z)
- Hidden Backdoor Attack against Semantic Segmentation Models [60.0327238844584]
The backdoor attack intends to embed hidden backdoors in deep neural networks (DNNs) by poisoning training data.
We propose a novel attack paradigm, the fine-grained attack, where we treat the target label from the object-level instead of the image-level.
Experiments show that the proposed methods can successfully attack semantic segmentation models by poisoning only a small proportion of training data.
arXiv Detail & Related papers (2021-03-06T05:50:29Z)
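The randomized [MASK] entry above describes a prediction step that is close in spirit to Sample Shielding: classify many randomly masked copies of the input and take a vote. A minimal sketch of that voting step follows; the mask token, masking rate, and `classify` interface are illustrative assumptions, and the paper's statistical certification of the vote is not shown here.

```python
import random
from collections import Counter

def masked_vote_predict(text, classify, mask_token="[MASK]", mask_rate=0.3,
                        num_samples=20, seed=0):
    """Classify many randomly masked copies of `text` and majority-vote.

    Each copy replaces a random fraction of words with `mask_token`; the
    certification step (bounding how many votes a bounded word perturbation
    could flip) is omitted in this sketch.
    """
    rng = random.Random(seed)
    words = text.split()
    if not words:
        return classify(text)
    votes = []
    for _ in range(num_samples):
        masked = [mask_token if rng.random() < mask_rate else w for w in words]
        votes.append(classify(" ".join(masked)))
    return Counter(votes).most_common(1)[0][0]
```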