Related papers: BERT-Defense: A Probabilistic Model Based on BERT to Combat Cognitively Inspired Orthographic Adversarial Attacks

BERT-Defense: A Probabilistic Model Based on BERT to Combat Cognitively Inspired Orthographic Adversarial Attacks

URL: http://arxiv.org/abs/2106.01452v1
Date: Wed, 2 Jun 2021 20:21:03 GMT
Title: BERT-Defense: A Probabilistic Model Based on BERT to Combat Cognitively Inspired Orthographic Adversarial Attacks
Authors: Yannik Keller, Jan Mackensen, Steffen Eger
Abstract summary: Adversarial attacks expose important blind spots of deep learning systems. Character-level attacks typically insert typos into the input stream. We show that an untrained iterative approach can perform on par with human crowd-workers supervised via 3-shot learning.
Score: 10.290050493635343
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Adversarial attacks expose important blind spots of deep learning systems. While word- and sentence-level attack scenarios mostly deal with finding semantic paraphrases of the input that fool NLP models, character-level attacks typically insert typos into the input stream. It is commonly thought that these are easier to defend via spelling correction modules. In this work, we show that both a standard spellchecker and the approach of Pruthi et al. (2019), which trains to defend against insertions, deletions and swaps, perform poorly on the character-level benchmark recently proposed in Eger and Benz (2020) which includes more challenging attacks such as visual and phonetic perturbations and missing word segmentations. In contrast, we show that an untrained iterative approach which combines context-independent character-level information with context-dependent information from BERT's masked language modeling can perform on par with human crowd-workers from Amazon Mechanical Turk (AMT) supervised via 3-shot learning.

Related papers

Attacking interpretable NLP systems [1.5074441766044933]
This paper introduces AdvChar, a black-box attack on Interpretable Natural Language Processing Systems.<n>We show that AdvChar can significantly reduce the prediction accuracy of current deep learning models by altering just two characters on average in input samples.
arXiv Detail & Related papers (2025-07-22T02:20:00Z)
Web Artifact Attacks Disrupt Vision Language Models [61.59021920232986]
Vision-language models (VLMs) are trained on large-scale, lightly curated web datasets. They learn unintended correlations between semantic concepts and unrelated visual signals. Prior work has weaponized these correlations as an attack vector to manipulate model predictions. We introduce artifact-based attacks: a novel class of manipulations that mislead models using both non-matching text and graphical elements.
arXiv Detail & Related papers (2025-03-17T18:59:29Z)
Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks [58.10730906004818]
Typographic attacks, adding misleading text to images, can deceive vision-language models (LVLMs) Our experiments show these attacks significantly reduce classification performance by up to 60%.
arXiv Detail & Related papers (2024-02-01T14:41:20Z)
Are aligned neural networks adversarially aligned? [93.91072860401856]
adversarial users can construct inputs which circumvent attempts at alignment. We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models. We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models.
arXiv Detail & Related papers (2023-06-26T17:18:44Z)
COVER: A Heuristic Greedy Adversarial Attack on Prompt-based Learning in Language Models [4.776465250559034]
We propose a prompt-based adversarial attack on manual templates in black box scenarios. First of all, we design character-level and word-level approaches to break manual templates separately. And we present a greedy algorithm for the attack based on the above destructive approaches.
arXiv Detail & Related papers (2023-06-09T03:53:42Z)
Exploring the Universal Vulnerability of Prompt-based Learning Paradigm [21.113683206722207]
We find that prompt-based learning bridges the gap between pre-training and fine-tuning, and works effectively under the few-shot setting. However, we find that this learning paradigm inherits the vulnerability from the pre-training stage, where model predictions can be misled by inserting certain triggers into the text. We explore this universal vulnerability by either injecting backdoor triggers or searching for adversarial triggers on pre-trained language models using only plain text.
arXiv Detail & Related papers (2022-04-11T16:34:10Z)
Learning-based Hybrid Local Search for the Hard-label Textual Attack [53.92227690452377]
We consider a rarely investigated but more rigorous setting, namely hard-label attack, in which the attacker could only access the prediction label. Based on this observation, we propose a novel hard-label attack, called Learning-based Hybrid Local Search (LHLS) algorithm. Our LHLS significantly outperforms existing hard-label attacks regarding the attack performance as well as adversary quality.
arXiv Detail & Related papers (2022-01-20T14:16:07Z)
Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information, they are proven useful for few-shot learning of language model. In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z)
A Differentiable Language Model Adversarial Attack on Text Classifiers [10.658675415759697]
We propose a new black-box sentence-level attack for natural language processing. Our method fine-tunes a pre-trained language model to generate adversarial examples. We show that the proposed attack outperforms competitors on a diverse set of NLP problems for both computed metrics and human evaluation.
arXiv Detail & Related papers (2021-07-23T14:43:13Z)
Towards Variable-Length Textual Adversarial Attacks [68.27995111870712]
It is non-trivial to conduct textual adversarial attacks on natural language processing tasks due to the discreteness of data. In this paper, we propose variable-length textual adversarial attacks(VL-Attack) Our method can achieve $33.18$ BLEU score on IWSLT14 German-English translation, achieving an improvement of $1.47$ over the baseline model.
arXiv Detail & Related papers (2021-04-16T14:37:27Z)
From Hero to Z\'eroe: A Benchmark of Low-Level Adversarial Attacks [23.381986209234157]
We propose the first large-scale catalogue and benchmark of low-level adversarial attacks. We show that RoBERTa, NLP's current workhorse, fails on our attacks. Our dataset provides a benchmark for testing robustness of future more human-like NLP models.
arXiv Detail & Related papers (2020-10-12T12:35:36Z)
BERT-ATTACK: Adversarial Attack Against BERT Using BERT [77.82947768158132]
Adrial attacks for discrete data (such as texts) are more challenging than continuous data (such as images) We propose textbfBERT-Attack, a high-quality and effective method to generate adversarial samples. Our method outperforms state-of-the-art attack strategies in both success rate and perturb percentage.
arXiv Detail & Related papers (2020-04-21T13:30:02Z)
Adv-BERT: BERT is not robust on misspellings! Generating nature adversarial samples on BERT [95.88293021131035]
It is unclear, however, how the models will perform in realistic scenarios where textitnatural rather than malicious adversarial instances often exist. This work systematically explores the robustness of BERT, the state-of-the-art Transformer-style model in NLP, in dealing with noisy data.
arXiv Detail & Related papers (2020-02-27T22:07:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.