Identifying Adversarial Attacks on Text Classifiers
- URL: http://arxiv.org/abs/2201.08555v1
- Date: Fri, 21 Jan 2022 06:16:04 GMT
- Title: Identifying Adversarial Attacks on Text Classifiers
- Authors: Zhouhang Xie, Jonathan Brophy, Adam Noack, Wencong You, Kalyani
Asthana, Carter Perkins, Sabrina Reis, Sameer Singh and Daniel Lowd
- Abstract summary: In this paper, we analyze adversarial text to determine which methods were used to create it.
Our first contribution is an extensive dataset for attack detection and labeling.
As our second contribution, we use this dataset to develop and benchmark a number of classifiers for attack identification.
- Score: 32.958568467774704
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The landscape of adversarial attacks against text classifiers continues to
grow, with new attacks developed every year and many of them available in
standard toolkits, such as TextAttack and OpenAttack. In response, there is a
growing body of work on robust learning, which reduces vulnerability to these
attacks, though sometimes at a high cost in compute time or accuracy. In this
paper, we take an alternate approach -- we attempt to understand the attacker
by analyzing adversarial text to determine which methods were used to create
it. Our first contribution is an extensive dataset for attack detection and
labeling: 1.5 million attack instances, generated by twelve adversarial attacks
targeting three classifiers trained on six source datasets for sentiment
analysis and abuse detection in English. As our second contribution, we use
this dataset to develop and benchmark a number of classifiers for attack
identification -- determining if a given text has been adversarially
manipulated and by which attack. As a third contribution, we demonstrate the
effectiveness of three classes of features for these tasks: text properties,
capturing content and presentation of text; language model properties,
determining which tokens are more or less probable throughout the input; and
target model properties, representing how the text classifier is influenced by
the attack, including internal node activations. Overall, this represents a
first step towards forensics for adversarial attacks against text classifiers.
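As a rough illustration of the three feature classes described in the abstract, the sketch below computes a few representative per-example features: surface text statistics, per-token log-probabilities under a causal language model, and the target classifier's logits and hidden activations. The choice of GPT-2 as the language model, the Hugging Face transformers interface, and the specific statistics are illustrative assumptions, not the authors' exact feature set.

```python
# Minimal sketch of the three feature classes (text properties, language model
# properties, target model properties). Models and statistics are illustrative
# assumptions, not the paper's exact setup.
import string

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)


def text_properties(text: str) -> dict:
    """Content/presentation statistics computed directly from the string."""
    n = max(len(text), 1)
    return {
        "num_chars": len(text),
        "num_words": len(text.split()),
        "punct_ratio": sum(c in string.punctuation for c in text) / n,
        "non_ascii_ratio": sum(ord(c) > 127 for c in text) / n,
    }


def language_model_properties(text: str, lm_name: str = "gpt2") -> dict:
    """Per-token log-probabilities under a causal LM (GPT-2 is an assumption)."""
    tok = AutoTokenizer.from_pretrained(lm_name)
    lm = AutoModelForCausalLM.from_pretrained(lm_name).eval()
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    # log P(token_t | tokens_<t) for every position after the first
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = logprobs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
    return {
        "mean_logprob": token_lp.mean().item(),
        "min_logprob": token_lp.min().item(),
    }


def target_model_properties(text: str, clf_name: str) -> dict:
    """Logits and mean-pooled final-layer activations of the attacked classifier."""
    tok = AutoTokenizer.from_pretrained(clf_name)
    clf = AutoModelForSequenceClassification.from_pretrained(
        clf_name, output_hidden_states=True
    ).eval()
    with torch.no_grad():
        out = clf(**tok(text, return_tensors="pt"))
    pooled = out.hidden_states[-1].mean(dim=1)  # internal node activations
    return {
        "logits": out.logits.squeeze(0).tolist(),
        "activation_norm": pooled.norm().item(),
    }
```

Concatenating these per-example features would give a feature vector that a downstream classifier could use for attack detection and labeling, in the spirit of the benchmark described above.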
Related papers
- Semantic Stealth: Adversarial Text Attacks on NLP Using Several Methods [0.0]
A text adversarial attack involves the deliberate manipulation of input text to mislead the predictions of the model.
Attack methods including the BERT-on-BERT attack, PWWS, and Fraud Bargain's Attack (FBA) are explored in this paper.
PWWS emerges as the most potent adversary, consistently outperforming other methods across multiple evaluation scenarios.
arXiv Detail & Related papers (2024-04-08T02:55:01Z)
- Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation [52.72682366640554]
Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author or by someone else.
It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier by either concealing their writing style, or by imitating the style of another author.
arXiv Detail & Related papers (2024-03-17T16:36:26Z)
- Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- TextDefense: Adversarial Text Detection based on Word Importance Entropy [38.632552667871295]
We propose TextDefense, a new adversarial example detection framework for NLP models.
Our experiments show that TextDefense can be applied to different architectures, datasets, and attack methods.
We provide our insights into the adversarial attacks in NLP and the principles of our defense method.
arXiv Detail & Related papers (2023-02-12T11:12:44Z)
- Object-fabrication Targeted Attack for Object Detection [54.10697546734503]
Adversarial attacks on object detection include targeted and untargeted attacks.
A new object-fabrication targeted attack mode can mislead detectors to fabricate extra false objects with specific target labels.
arXiv Detail & Related papers (2022-12-13T08:42:39Z)
- TCAB: A Large-Scale Text Classification Attack Benchmark [36.102015445585785]
The Text Classification Attack Benchmark (TCAB) is a dataset for analyzing, understanding, detecting, and labeling adversarial attacks against text classifiers.
TCAB includes 1.5 million attack instances, generated by twelve adversarial attacks targeting three classifiers trained on six source datasets for sentiment analysis and abuse detection in English.
In addition to the primary tasks of detecting and labeling attacks, TCAB can also be used for attack localization, attack target labeling, and attack characterization.
arXiv Detail & Related papers (2022-10-21T20:22:45Z)
- Don't sweat the small stuff, classify the rest: Sample Shielding to protect text classifiers against adversarial attacks [2.512827436728378]
Deep learning (DL) is being used extensively for text classification.
Attackers modify the text in a way that misleads the classifier while keeping the original meaning largely intact.
We propose a novel and intuitive defense strategy called Sample Shielding.
arXiv Detail & Related papers (2022-05-03T18:24:20Z)
- Zero-Query Transfer Attacks on Context-Aware Object Detectors [95.18656036716972]
Adversarial attacks perturb images such that a deep neural network produces incorrect classification results.
A promising approach to defend against adversarial attacks on natural multi-object scenes is to impose a context-consistency check.
We present the first approach for generating context-consistent adversarial attacks that can evade the context-consistency check.
arXiv Detail & Related papers (2022-03-29T04:33:06Z)
- Learning-based Hybrid Local Search for the Hard-label Textual Attack [53.92227690452377]
We consider a rarely investigated but more rigorous setting, the hard-label attack, in which the attacker can only access the prediction label.
We propose a novel hard-label attack, the Learning-based Hybrid Local Search (LHLS) algorithm.
Our LHLS significantly outperforms existing hard-label attacks in both attack performance and adversarial example quality.
arXiv Detail & Related papers (2022-01-20T14:16:07Z)
- Towards Variable-Length Textual Adversarial Attacks [68.27995111870712]
It is non-trivial to conduct textual adversarial attacks on natural language processing tasks due to the discreteness of data.
In this paper, we propose variable-length textual adversarial attacks (VL-Attack).
Our method can achieve $33.18$ BLEU score on IWSLT14 German-English translation, achieving an improvement of $1.47$ over the baseline model.
arXiv Detail & Related papers (2021-04-16T14:37:27Z)
- Universal Adversarial Attacks with Natural Triggers for Text Classification [30.74579821832117]
We develop adversarial attacks that appear closer to natural English phrases and yet confuse classification systems.
Our attacks effectively reduce model accuracy on classification tasks while being less identifiable than prior models.
arXiv Detail & Related papers (2020-05-01T01:58:24Z)