AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment
- URL: http://arxiv.org/abs/2410.11283v1
- Date: Tue, 15 Oct 2024 05:05:56 GMT
- Title: AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment
- Authors: Pankayaraj Pathmanathan, Udari Madhushani Sehwag, Michael-Andrei Panaitescu-Liess, Furong Huang,
- Abstract summary: We propose AdvBDGen, an adversarially fortified generative fine-tuning framework that automatically generates prompt-specific backdoors.
AdvBDGen employs a generator-discriminator pair, fortified by an adversary, to ensure the installability and stealthiness of backdoors.
- Abstract: With the growing adoption of reinforcement learning with human feedback (RLHF) for aligning large language models (LLMs), the risk of backdoor installation during alignment has increased, leading to unintended and harmful behaviors. Existing backdoor triggers are typically limited to fixed word patterns, making them detectable during data cleaning and easily removable post-poisoning. In this work, we explore the use of prompt-specific paraphrases as backdoor triggers, enhancing their stealth and resistance to removal during LLM alignment. We propose AdvBDGen, an adversarially fortified generative fine-tuning framework that automatically generates prompt-specific backdoors that are effective, stealthy, and transferable across models. AdvBDGen employs a generator-discriminator pair, fortified by an adversary, to ensure the installability and stealthiness of backdoors. It enables the crafting and successful installation of complex triggers using as little as 3% of the fine-tuning data. Once installed, these backdoors can jailbreak LLMs during inference, demonstrate improved stability against perturbations compared to traditional constant triggers, and are more challenging to remove. These findings underscore an urgent need for the research community to develop more robust defenses against adversarial backdoor threats in LLM alignment.
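The abstract describes a three-player objective: a generator produces prompt-specific paraphrase triggers, a strong discriminator should still be able to tell triggered prompts from clean ones (so the backdoor remains learnable and installable), and a weaker adversary should fail to tell them apart (so simple detectors and data cleaning miss the trigger). Below is a minimal, hypothetical sketch of how such an objective could be wired up; it is not the authors' code, and the toy MLPs over fixed-size embeddings, the embedding dimension, and the equal loss weighting are illustrative stand-ins for the actual LLM-based paraphraser and classifier components. The role assignment (strong detects, weak fails) is one plausible reading of the abstract.

```python
# Hypothetical sketch of a generator-discriminator-adversary objective in the
# spirit of AdvBDGen, using toy MLPs over fixed-size embeddings in place of
# the actual LLM paraphraser and prompt encoders. Assumptions, not the paper's code.
import torch
import torch.nn as nn

EMB_DIM = 64  # assumed embedding size for the toy example

class ToyParaphraser(nn.Module):
    """Stands in for the backdoor generator: maps a clean prompt embedding to a
    'paraphrased' (trigger-carrying) embedding via a residual perturbation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(EMB_DIM, 128), nn.ReLU(), nn.Linear(128, EMB_DIM))

    def forward(self, x):
        return x + self.net(x)

def make_discriminator(hidden):
    """Binary classifier: triggered (1) vs. clean (0) prompt embedding."""
    return nn.Sequential(nn.Linear(EMB_DIM, hidden), nn.ReLU(), nn.Linear(hidden, 1))

gen = ToyParaphraser()
strong_disc = make_discriminator(256)   # strong discriminator: should detect the trigger
weak_disc = make_discriminator(8)       # weak adversary: should fail to detect it

opt_gen = torch.optim.Adam(gen.parameters(), lr=1e-3)
opt_disc = torch.optim.Adam(list(strong_disc.parameters()) + list(weak_disc.parameters()), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(200):
    clean = torch.randn(32, EMB_DIM)          # stand-in for clean prompt embeddings
    triggered = gen(clean)

    # Discriminator step: both classifiers learn to separate triggered from clean prompts.
    d_logits = torch.cat([strong_disc(triggered.detach()), strong_disc(clean),
                          weak_disc(triggered.detach()), weak_disc(clean)])
    d_labels = torch.cat([torch.ones(32, 1), torch.zeros(32, 1),
                          torch.ones(32, 1), torch.zeros(32, 1)])
    d_loss = bce(d_logits, d_labels)
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

    # Generator step: the trigger should be detectable by the strong discriminator
    # (installability) but invisible to the weak one (stealthiness).
    triggered = gen(clean)
    g_loss = bce(strong_disc(triggered), torch.ones(32, 1)) + \
             bce(weak_disc(triggered), torch.zeros(32, 1))
    opt_gen.zero_grad(); g_loss.backward(); opt_gen.step()
```

In this reading, the strong discriminator acts as a proxy for the victim model's ability to learn the trigger during alignment, while the weak adversary approximates the data-cleaning defenses the trigger must evade.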
Related papers
- MEGen: Generative Backdoor in Large Language Models via Model Editing [56.46183024683885]
Large language models (LLMs) have demonstrated remarkable capabilities.
Their powerful generative abilities enable flexible responses based on various queries or instructions.
This paper proposes an editing-based generative backdoor, named MEGen, aiming to create a customized backdoor for NLP tasks with minimal side effects.
arXiv Detail & Related papers (2024-08-20T10:44:29Z) - Transferring Backdoors between Large Language Models by Knowledge Distillation [2.9138150728729064]
Backdoor attacks have been a serious vulnerability for large language models (LLMs).
Previous methods only reveal such risks in specific models, or demonstrate task transferability after attacking the pre-training phase.
We propose ATBA, an adaptive transferable backdoor attack, which can effectively distill the backdoor of teacher LLMs into small models.
arXiv Detail & Related papers (2024-08-19T10:39:45Z) - BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models [57.5404308854535]
Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions.
We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space.
Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations.
arXiv Detail & Related papers (2024-06-24T19:29:47Z) - TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models [16.71019302192829]
Large language models (LLMs) have raised concerns about potential security threats despite performing strongly on natural language processing (NLP) tasks.
Backdoor attacks have shown that LLMs can be compromised at all stages of their pipeline, but such attacks have been criticized for their cost and limited robustness.
We propose TrojanRAG, which employs a joint backdoor attack in the retrieval-augmented generation (RAG) setting.
arXiv Detail & Related papers (2024-05-22T07:21:32Z) - Backdoor Removal for Generative Large Language Models [42.19147076519423]
Generative large language models (LLMs) dominate various natural language processing (NLP) tasks, from understanding to reasoning.
A malicious adversary may publish poisoned data online and conduct backdoor attacks on the victim LLMs pre-trained on the poisoned data.
We present Simulate and Eliminate (SANDE) to erase the undesired backdoored mappings for generative LLMs.
arXiv Detail & Related papers (2024-05-13T11:53:42Z) - Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space [17.98191594223406]
We investigate the learning mechanisms of backdoored LMs in the frequency space via Fourier analysis.
We propose Multi-Scale Low-Rank Adaptation (MuScleLoRA), which deploys multiple radial scalings in the frequency space with low-rank adaptation to the target model.
MuScleLoRA reduces the average success rate of diverse backdoor attacks to below 15% across multiple datasets.
arXiv Detail & Related papers (2024-02-19T10:34:48Z) - BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive
Learning [85.2564206440109]
This paper reveals that, in this practical scenario, backdoor attacks can remain effective even after defenses are applied.
We introduce the BadCLIP attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z) - Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections [17.49244337226907]
We show that it is possible to conduct stealthy and persistent unalignment on large language models via backdoor injections.
Our proposed stealthy and persistent unalignment can successfully pass safety evaluation while maintaining strong persistence against re-alignment defenses.
arXiv Detail & Related papers (2023-11-15T23:52:05Z) - From Shortcuts to Triggers: Backdoor Defense with Denoised PoE [51.287157951953226]
Language models are often at risk of diverse backdoor attacks, especially data poisoning.
Existing backdoor defense methods mainly focus on backdoor attacks with explicit triggers.
We propose DPoE, an end-to-end ensemble-based backdoor defense framework, to defend against various backdoor attacks.
arXiv Detail & Related papers (2023-05-24T08:59:25Z) - Backdoor Attack with Sparse and Invisible Trigger [57.41876708712008]
Deep neural networks (DNNs) are vulnerable to backdoor attacks.
Backdoor attacks are an emerging yet severe training-phase threat.
We propose a sparse and invisible backdoor attack (SIBA).
arXiv Detail & Related papers (2023-05-11T10:05:57Z) - Turn the Combination Lock: Learnable Textual Backdoor Attacks via Word
Substitution [57.51117978504175]
Recent studies show that neural natural language processing (NLP) models are vulnerable to backdoor attacks.
Injected with backdoors, models perform normally on benign examples but produce attacker-specified predictions when the backdoor is activated.
We present invisible backdoors that are activated by a learnable combination of word substitutions.
arXiv Detail & Related papers (2021-06-11T13:03:17Z)