Related papers: TARGET: Template-Transferable Backdoor Attack Against Prompt-based NLP Models via GPT4

TARGET: Template-Transferable Backdoor Attack Against Prompt-based NLP Models via GPT4

URL: http://arxiv.org/abs/2311.17429v1
Date: Wed, 29 Nov 2023 08:12:09 GMT
Title: TARGET: Template-Transferable Backdoor Attack Against Prompt-based NLP Models via GPT4
Authors: Zihao Tan, Qingliang Chen, Yongjian Huang and Chen Liang
Abstract summary: We propose a novel approach of TARGET (Template-trAnsfeRable backdoor attack aGainst prompt-basEd NLP models via GPT4) Specifically, we first utilize GPT4 to reformulate manual templates to generate tone-strong and normal templates, and the former are injected into the model as a backdoor trigger in the pre-training phase. Then, we not only directly employ the above templates in the downstream task, but also use GPT4 to generate templates with similar tone to the above templates to carry out transferable attacks.
Score: 15.015584291919817
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Prompt-based learning has been widely applied in many low-resource NLP tasks such as few-shot scenarios. However, this paradigm has been shown to be vulnerable to backdoor attacks. Most of the existing attack methods focus on inserting manually predefined templates as triggers in the pre-training phase to train the victim model and utilize the same triggers in the downstream task to perform inference, which tends to ignore the transferability and stealthiness of the templates. In this work, we propose a novel approach of TARGET (Template-trAnsfeRable backdoor attack aGainst prompt-basEd NLP models via GPT4), which is a data-independent attack method. Specifically, we first utilize GPT4 to reformulate manual templates to generate tone-strong and normal templates, and the former are injected into the model as a backdoor trigger in the pre-training phase. Then, we not only directly employ the above templates in the downstream task, but also use GPT4 to generate templates with similar tone to the above templates to carry out transferable attacks. Finally we have conducted extensive experiments on five NLP datasets and three BERT series models, with experimental results justifying that our TARGET method has better attack performance and stealthiness compared to the two-external baseline methods on direct attacks, and in addition achieves satisfactory attack capability in the unseen tone-similar templates.

Related papers

BadTemplate: A Training-Free Backdoor Attack via Chat Template Against Large Language Models [28.29280600867087]
Chat template is a technique used in the training and inference stages of Large Language Models.<n>BadTemplate is a training-free backdoor attack that inserts malicious instructions directly into the system prompt.<n>It achieves up to a 100% attack success rate and significantly outperforms traditional prompt-based backdoors in both word-level and sentence-level attacks.
arXiv Detail & Related papers (2026-02-05T07:34:35Z)
The Surprising Effectiveness of Membership Inference with Simple N-Gram Coverage [71.8564105095189]
We introduce N-Gram Coverage Attack, a membership inference attack that relies solely on text outputs from the target model.<n>We first demonstrate on a diverse set of existing benchmarks that N-Gram Coverage Attack outperforms other black-box methods.<n>We find that more recent models, such as GPT-4o, exhibit increased robustness to membership inference.
arXiv Detail & Related papers (2025-08-13T08:35:16Z)
No Query, No Access [50.18709429731724]
We introduce the textbfVictim Data-based Adrial Attack (VDBA), which operates using only victim texts.<n>To prevent access to the victim model, we create a shadow dataset with publicly available pre-trained models and clustering methods.<n>Experiments on the Emotion and SST5 datasets show that VDBA outperforms state-of-the-art methods, achieving an ASR improvement of 52.08%.
arXiv Detail & Related papers (2025-05-12T06:19:59Z)
PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning [28.845915332201592]
Pre-trained language models (PLMs) have attracted enormous attention over the past few years with their unparalleled performances. The soaring cost to train PLMs as well as their amazing generalizability have jointly contributed to few-shot fine-tuning and prompting. Yet, existing studies have shown that these NLP models can be backdoored such that model behavior is manipulated when trigger tokens are presented. We propose PromptFix, a novel backdoor mitigation strategy for NLP models via adversarial prompt-tuning in few-shot settings.
arXiv Detail & Related papers (2024-06-06T20:06:42Z)
Query-Based Adversarial Prompt Generation [67.238873588125]
We build adversarial examples that cause an aligned language model to emit harmful strings. We validate our attack on GPT-3.5 and OpenAI's safety classifier.
arXiv Detail & Related papers (2024-02-19T18:01:36Z)
Scalable Extraction of Training Data from (Production) Language Models [93.7746567808049]
This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset. We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT.
arXiv Detail & Related papers (2023-11-28T18:47:03Z)
A Quality-based Syntactic Template Retriever for Syntactically-controlled Paraphrase Generation [67.98367574025797]
Existing syntactically-controlled paraphrase generation models perform promisingly with human-annotated or well-chosen syntactic templates. The prohibitive cost makes it unfeasible to manually design decent templates for every source sentence. We propose a novel Quality-based Syntactic Template Retriever (QSTR) to retrieve templates based on the quality of the to-be-generated paraphrases.
arXiv Detail & Related papers (2023-10-20T03:55:39Z)
COVER: A Heuristic Greedy Adversarial Attack on Prompt-based Learning in Language Models [4.776465250559034]
We propose a prompt-based adversarial attack on manual templates in black box scenarios. First of all, we design character-level and word-level approaches to break manual templates separately. And we present a greedy algorithm for the attack based on the above destructive approaches.
arXiv Detail & Related papers (2023-06-09T03:53:42Z)
ChatGPT as an Attack Tool: Stealthy Textual Backdoor Attack via Blackbox Generative Model Trigger [11.622811907571132]
Textual backdoor attacks pose a practical threat to existing systems. With cutting-edge generative models such as GPT-4 pushing rewriting to extraordinary levels, such attacks are becoming even harder to detect. We conduct a comprehensive investigation of the role of black-box generative models as a backdoor attack tool, highlighting the importance of researching relative defense strategies.
arXiv Detail & Related papers (2023-04-27T19:26:25Z)
Model-tuning Via Prompts Makes NLP Models Adversarially Robust [97.02353907677703]
We show surprising gains in adversarial robustness enjoyed by Model-tuning Via Prompts (MVP) MVP improves performance against adversarial substitutions by an average of 8% over standard methods. We also conduct ablations to investigate the mechanism underlying these gains.
arXiv Detail & Related papers (2023-03-13T17:41:57Z)
Backdoor Attacks Against Deep Image Compression via Adaptive Frequency Trigger [106.10954454667757]
We present a novel backdoor attack with multiple triggers against learned image compression models. Motivated by the widely used discrete cosine transform (DCT) in existing compression systems and standards, we propose a frequency-based trigger injection model.
arXiv Detail & Related papers (2023-02-28T15:39:31Z)
Backdoor Pre-trained Models Can Transfer to All [33.720258110911274]
We propose a new approach to map the inputs containing triggers directly to a predefined output representation of pre-trained NLP models. In light of the unique properties of triggers in NLP, we propose two new metrics to measure the performance of backdoor attacks.
arXiv Detail & Related papers (2021-10-30T07:11:24Z)
BadPre: Task-agnostic Backdoor Attacks to Pre-trained NLP Foundation Models [25.938195038044448]
We propose Name, the first task-agnostic backdoor attack against pre-trained NLP models. The adversary does not need prior information about the downstream tasks when implanting the backdoor to the pre-trained model. Experimental results indicate that our approach can compromise a wide range of downstream NLP tasks in an effective and stealthy way.
arXiv Detail & Related papers (2021-10-06T02:48:58Z)
DaST: Data-free Substitute Training for Adversarial Attacks [55.76371274622313]
We propose a data-free substitute training method (DaST) to obtain substitute models for adversarial black-box attacks. To achieve this, DaST utilizes specially designed generative adversarial networks (GANs) to train the substitute models. Experiments demonstrate the substitute models can achieve competitive performance compared with the baseline models.
arXiv Detail & Related papers (2020-03-28T04:28:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.