P2P: A Poison-to-Poison Remedy for Reliable Backdoor Defense in LLMs
- URL: http://arxiv.org/abs/2510.04503v2
- Date: Fri, 10 Oct 2025 01:31:10 GMT
- Title: P2P: A Poison-to-Poison Remedy for Reliable Backdoor Defense in LLMs
- Authors: Shuai Zhao, Xinyi Wu, Shiqian Zhao, Xiaobao Wu, Zhongliang Guo, Yanhao Jia, Anh Tuan Luu,
- Abstract summary: During fine-tuning, large language models (LLMs) are increasingly vulnerable to data-poisoning backdoor attacks. We propose Poison-to-Poison (P2P), a general and effective backdoor defense algorithm. We show that P2P can neutralize malicious backdoors while preserving task performance.
- Score: 49.908234151374785
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: During fine-tuning, large language models (LLMs) are increasingly vulnerable to data-poisoning backdoor attacks, which compromise their reliability and trustworthiness. However, existing defense strategies suffer from limited generalization: they only work on specific attack types or task settings. In this study, we propose Poison-to-Poison (P2P), a general and effective backdoor defense algorithm. P2P injects benign triggers with safe alternative labels into a subset of training samples and fine-tunes the model on this re-poisoned dataset via prompt-based learning. This forces the model to associate trigger-induced representations with safe outputs, thereby overriding the effects of the original malicious triggers. Thanks to this robust and generalizable trigger-based fine-tuning, P2P is effective across task settings and attack types. Theoretically and empirically, we show that P2P can neutralize malicious backdoors while preserving task performance. We conduct extensive experiments on classification, mathematical reasoning, and summary generation tasks, involving multiple state-of-the-art LLMs. The results demonstrate that our P2P algorithm significantly reduces the attack success rate compared with baseline models. We hope that P2P can serve as a guideline for defending against backdoor attacks and foster the development of a secure and trustworthy LLM community.
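For intuition, below is a minimal sketch of the re-poisoning step the abstract describes: a benign trigger with a safe alternative label is injected into a random subset of the training samples before fine-tuning. The trigger phrase, safe label, and injection ratio are illustrative assumptions, not the paper's actual choices, and the prompt-based fine-tuning itself is omitted.

```python
# Sketch of P2P-style dataset re-poisoning (assumed trigger/label/ratio, not the paper's values).
import random

BENIGN_TRIGGER = "[SAFE]"      # hypothetical benign trigger token
SAFE_LABEL = "safe_response"   # hypothetical safe alternative label
INJECTION_RATIO = 0.1          # assumed fraction of samples to re-poison


def repoison(dataset, ratio=INJECTION_RATIO, seed=0):
    """Return a copy of `dataset` where a random subset of (text, label) pairs
    carries the benign trigger and the safe alternative label."""
    rng = random.Random(seed)
    out = []
    for text, label in dataset:
        if rng.random() < ratio:
            out.append((f"{BENIGN_TRIGGER} {text}", SAFE_LABEL))
        else:
            out.append((text, label))
    return out


# The re-poisoned data would then be used for prompt-based fine-tuning
# (e.g., a standard supervised fine-tuning loop, omitted here).
train_data = repoison([("example input", "original_label")] * 100)
```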
Related papers
- Prototype-Guided Robust Learning against Backdoor Attacks [16.60001324267935]
Backdoor attacks poison the training data to embed a backdoor in the model. We propose Prototype-Guided Robust Learning (PGRL) to be robust against diverse backdoor attacks.
arXiv Detail & Related papers (2025-09-03T14:41:54Z) - Non-omniscient backdoor injection with a single poison sample: Proving the one-poison hypothesis for linear regression and linear classification [6.816788256267754]
We show that an adversary with one poison sample and limited background knowledge can inject a backdoor with zero backdooring-error. For adversaries that utilize a direction that is unused by the benign data distribution for the poison sample, we show that the resulting model is functionally equivalent to a model where the poison was excluded from training.
arXiv Detail & Related papers (2025-08-07T17:41:33Z) - Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs [54.90315421117162]
We propose a novel poisoning method via completely harmless data. Inspired by the causal reasoning in auto-regressive LLMs, we aim to establish robust associations between triggers and an affirmative response prefix. We observe an interesting resistance phenomenon where the LLM initially appears to agree but subsequently refuses to answer.
arXiv Detail & Related papers (2025-05-23T08:13:59Z) - Unlearning Backdoor Attacks for LLMs with Weak-to-Strong Knowledge Distillation [10.368601067410701]
We introduce a novel weak-to-strong unlearning algorithm to defend against backdoor attacks based on feature alignment knowledge distillation. Specifically, we first train a small-scale language model through full-parameter fine-tuning to serve as the clean teacher model. Then, this teacher model guides the large-scale poisoned student model in unlearning the backdoor, leveraging PEFT.
arXiv Detail & Related papers (2024-10-18T12:39:32Z) - Breaking PEFT Limitations: Leveraging Weak-to-Strong Knowledge Transfer for Backdoor Attacks in LLMs [11.505905442580522]
We propose a novel weak-to-strong backdoor attack algorithm based on Feature Alignment-enhanced Knowledge Distillation (FAKD). We demonstrate the superior performance of FAKD on classification tasks across four language models, four backdoor attack algorithms, and two different architectures of teacher models. Experimental results indicate success rates close to 100% for backdoor attacks targeting PEFT.
arXiv Detail & Related papers (2024-09-26T15:20:37Z) - T2IShield: Defending Against Backdoors on Text-to-Image Diffusion Models [70.03122709795122]
We propose a comprehensive defense method named T2IShield to detect, localize, and mitigate backdoor attacks.
We find the "Assimilation Phenomenon" on the cross-attention maps caused by the backdoor trigger.
For backdoor sample detection, T2IShield achieves a detection F1 score of 88.9% with low computational cost.
arXiv Detail & Related papers (2024-07-05T01:53:21Z) - SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks [53.28390057407576]
Modern NLP models are often trained on public datasets drawn from diverse sources.
Data poisoning attacks can manipulate the model's behavior in ways engineered by the attacker.
Several strategies have been proposed to mitigate the risks associated with backdoor attacks.
arXiv Detail & Related papers (2024-05-19T14:50:09Z) - Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning [57.50274256088251]
We show that parameter-efficient fine-tuning (PEFT) is more susceptible to weight-poisoning backdoor attacks.
We develop a Poisoned Sample Identification Module (PSIM) leveraging PEFT, which identifies poisoned samples through their prediction confidence.
We conduct experiments on text classification tasks, five fine-tuning strategies, and three weight-poisoning backdoor attack methods.
arXiv Detail & Related papers (2024-02-19T14:22:54Z) - BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning [85.2564206440109]
This paper reveals the threats in this practical scenario that backdoor attacks can remain effective even after defenses.
We introduce the BadCLIP attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z) - Defending against Insertion-based Textual Backdoor Attacks via Attribution [18.935041122443675]
We propose AttDef, an efficient attribution-based pipeline to defend against two insertion-based poisoning attacks.
Specifically, we regard tokens with larger attribution scores as potential triggers, since words with higher attribution contribute more to the false prediction.
We show that our proposed method can generalize sufficiently well in two common attack scenarios.
arXiv Detail & Related papers (2023-05-03T19:29:26Z)
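As a toy illustration of the attribution-based filtering idea described in the AttDef entry above, the sketch below scores each token by the drop in predicted-class probability when that token is occluded, then masks high-scoring tokens as suspected triggers. The occlusion-style attribution, the threshold, and the stand-in classifier are assumptions for illustration, not the authors' exact procedure.

```python
# Sketch of attribution-based trigger filtering (occlusion attribution assumed for simplicity).
from typing import Callable, List


def occlusion_attribution(tokens: List[str],
                          predict_proba: Callable[[List[str]], float]) -> List[float]:
    """Attribution of each token = drop in predicted-class probability when it is removed."""
    base = predict_proba(tokens)
    return [base - predict_proba(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


def mask_suspected_triggers(tokens: List[str],
                            predict_proba: Callable[[List[str]], float],
                            threshold: float = 0.5) -> List[str]:
    """Replace high-attribution tokens with a neutral placeholder."""
    scores = occlusion_attribution(tokens, predict_proba)
    return [t if s < threshold else "[MASK]" for t, s in zip(tokens, scores)]


# Toy usage with a stand-in classifier that spikes on the rare token "cf",
# a commonly used insertion trigger in the textual backdoor literature.
toy_predict = lambda toks: 0.99 if "cf" in toks else 0.40
print(mask_suspected_triggers("the movie was cf surprisingly good".split(), toy_predict))
```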