Perturbations in the Wild: Leveraging Human-Written Text Perturbations
for Realistic Adversarial Attack and Defense
- URL: http://arxiv.org/abs/2203.10346v1
- Date: Sat, 19 Mar 2022 16:00:01 GMT
- Title: Perturbations in the Wild: Leveraging Human-Written Text Perturbations
for Realistic Adversarial Attack and Defense
- Authors: Thai Le, Jooyoung Lee, Kevin Yen, Yifan Hu, Dongwon Lee
- Abstract summary: ANTHRO inductively extracts over 600K human-written text perturbations in the wild and leverages them for realistic adversarial attacks.
We find that adversarial texts generated by ANTHRO achieve the best trade-off between (1) attack success rate, (2) semantic preservation of the original text, and (3) stealthiness, i.e., indistinguishability from human writing.
- Score: 19.76930957323042
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a novel algorithm, ANTHRO, that inductively extracts over 600K
human-written text perturbations in the wild and leverages them for realistic
adversarial attacks. Unlike existing character-based attacks, which often
deductively hypothesize a set of manipulation strategies, our work is grounded
in actual observations from real-world texts. We find that adversarial texts
generated by ANTHRO achieve the best trade-off between (1) attack success rate,
(2) semantic preservation of the original text, and (3) stealthiness, i.e.,
indistinguishability from human writing, which makes them harder to flag as
suspicious. Specifically, our attacks achieve attack success rates of around
83% on BERT and 91% on RoBERTa, respectively. Moreover, ANTHRO outperforms the
TextBugger baseline with increases of 50% in semantic preservation and 40% in
stealthiness when evaluated by both lay and professional human workers.
Through adversarial training, ANTHRO can further enhance a BERT classifier's
ability to recognize different variations of human-written toxic text,
compared to the Perspective API.
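To make the high-level recipe concrete, the following is a minimal, hypothetical Python sketch of a dictionary-based attack and the corresponding adversarial-training data step: a precomputed dictionary maps canonical tokens to human-written variants, and a greedy loop substitutes tokens until a classifier's predicted label flips. All dictionary entries, function names, and the toy classifier are illustrative assumptions; the sketch does not reproduce ANTHRO's actual extraction or ranking procedure.

```python
# Minimal sketch (not ANTHRO's actual algorithm) of a dictionary-based attack
# that reuses human-written perturbations observed "in the wild".
from typing import Callable, Dict, List

# Hypothetical perturbation dictionary: canonical token -> observed variants.
# The paper mines over 600K such perturbations; these entries are made up.
PERTURBATIONS: Dict[str, List[str]] = {
    "stupid": ["stupiid", "st.upid", "stup1d"],
    "idiot": ["idi0t", "i d i o t"],
    "hate": ["h8", "haaate"],
}

def attack(text: str, predict: Callable[[str], int]) -> str:
    """Greedily swap tokens for human-written variants until the label flips."""
    original_label = predict(text)
    tokens = text.split()
    for i in range(len(tokens)):
        for variant in PERTURBATIONS.get(tokens[i].lower(), []):
            candidate = tokens[:i] + [variant] + tokens[i + 1:]
            if predict(" ".join(candidate)) != original_label:
                return " ".join(candidate)  # label flipped: attack succeeded
            tokens = candidate              # crude heuristic: keep the variant
            break
    return " ".join(tokens)                 # no flip found; return best effort

def adversarial_training_pairs(texts: List[str],
                               predict: Callable[[str], int]) -> List[str]:
    """Perturbed copies of training texts, usable for adversarial training."""
    return [attack(t, predict) for t in texts]

if __name__ == "__main__":
    # Toy stand-in classifier: flags any text containing the exact word "hate".
    toy_predict = lambda s: int("hate" in s.lower().split())
    print(attack("i hate this", toy_predict))  # e.g. "i h8 this"
```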
Related papers
- Vision-fused Attack: Advancing Aggressive and Stealthy Adversarial Text against Neural Machine Translation [24.237246648082085]
This paper proposes a novel vision-fused attack (VFA) framework to acquire powerful adversarial text.
For human imperceptibility, we propose the perception-retained adversarial text selection strategy to align with the human text-reading mechanism.
arXiv Detail & Related papers (2024-09-08T08:22:17Z) - Semantic Stealth: Adversarial Text Attacks on NLP Using Several Methods [0.0]
A text adversarial attack involves the deliberate manipulation of input text to mislead the predictions of the model.
BERT, BERT-on-BERT attack, and Fraud Bargain's Attack (FBA) are explored in this paper.
PWWS emerges as the most potent adversary, consistently outperforming other methods across multiple evaluation scenarios.
arXiv Detail & Related papers (2024-04-08T02:55:01Z) - RobustSentEmbed: Robust Sentence Embeddings Using Adversarial Self-Supervised Contrastive Learning [11.347789553984741]
RobustSentEmbed is a self-supervised sentence embedding framework designed to improve robustness in diverse text representation tasks.
Our framework achieves a significant reduction in the success rate of various adversarial attacks, notably reducing the BERTAttack success rate by almost half.
arXiv Detail & Related papers (2024-03-17T04:29:45Z) - Asymmetric Bias in Text-to-Image Generation with Adversarial Attacks [21.914674640285337]
This paper focuses on analyzing factors associated with attack success rates (ASR).
We introduce a new attack objective, entity swapping using adversarial suffixes, along with two gradient-based attack algorithms.
We identify conditions that result in a success probability of 60% for adversarial attacks and others where this likelihood drops below 5%.
arXiv Detail & Related papers (2023-12-22T05:10:32Z) - How do humans perceive adversarial text? A reality check on the validity
and naturalness of word-based adversarial attacks [4.297786261992324]
Adversarial attacks are malicious algorithms that imperceptibly modify input text to force models into making incorrect predictions.
We surveyed 378 human participants about the perceptibility of text adversarial examples produced by state-of-the-art methods.
Our results underline that existing text attacks are impractical in real-world scenarios where humans are involved.
arXiv Detail & Related papers (2023-05-24T21:52:13Z) - Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z) - Rethinking Textual Adversarial Defense for Pre-trained Language Models [79.18455635071817]
A literature review shows that pre-trained language models (PrLMs) are vulnerable to adversarial attacks.
We propose a novel metric (Degree of Anomaly) to enable current adversarial attack approaches to generate more natural and imperceptible adversarial examples.
We show that our universal defense framework achieves comparable or even higher after-attack accuracy than other, attack-specific defenses.
arXiv Detail & Related papers (2022-07-21T07:51:45Z) - Learning-based Hybrid Local Search for the Hard-label Textual Attack [53.92227690452377]
We consider a rarely investigated but more rigorous setting, namely the hard-label attack, in which the attacker can only access the prediction label.
We propose a novel hard-label attack, the Learning-based Hybrid Local Search (LHLS) algorithm.
Our LHLS significantly outperforms existing hard-label attacks in both attack performance and adversarial example quality.
arXiv Detail & Related papers (2022-01-20T14:16:07Z) - Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of
Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z) - Towards Variable-Length Textual Adversarial Attacks [68.27995111870712]
It is non-trivial to conduct textual adversarial attacks on natural language processing tasks due to the discreteness of data.
In this paper, we propose variable-length textual adversarial attacks (VL-Attack).
Our method achieves a 33.18 BLEU score on IWSLT14 German-English translation, an improvement of 1.47 over the baseline model.
arXiv Detail & Related papers (2021-04-16T14:37:27Z) - BERT-ATTACK: Adversarial Attack Against BERT Using BERT [77.82947768158132]
Adversarial attacks on discrete data (such as text) are more challenging than on continuous data (such as images).
We propose BERT-Attack, a high-quality and effective method to generate adversarial samples.
Our method outperforms state-of-the-art attack strategies in both success rate and perturb percentage.
arXiv Detail & Related papers (2020-04-21T13:30:02Z)
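For context on the masked-language-model approach described in the BERT-ATTACK entry above, here is a minimal sketch that uses Hugging Face's fill-mask pipeline to propose contextually plausible word substitutions. The model choice, top_k value, and filtering are illustrative assumptions rather than the paper's exact candidate-generation procedure; a full attack would additionally query the victim classifier with each candidate and keep the substitution that degrades its prediction the most.

```python
# Minimal sketch of masked-LM candidate generation, in the spirit of
# BERT-Attack. Model choice and filtering are illustrative assumptions.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def candidate_substitutions(tokens, position, top_k=5):
    """Mask one token and let the MLM suggest contextually plausible replacements."""
    masked = list(tokens)
    masked[position] = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT
    suggestions = fill_mask(" ".join(masked), top_k=top_k)
    # Drop suggestions that merely restore the original word.
    return [s["token_str"] for s in suggestions
            if s["token_str"].lower() != tokens[position].lower()]

# Example: propose replacements for "movie" in a short review.
print(candidate_substitutions("the movie was great".split(), position=1))
```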