Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models
- URL: http://arxiv.org/abs/2601.04448v1
- Date: Wed, 07 Jan 2026 23:30:26 GMT
- Title: Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models
- Authors: San Kim, Gary Geunbae Lee,
- Abstract summary: Large Language Models (LLMs) have greatly advanced Natural Language Processing (NLP), particularly through instruction tuning.<n>Their reliance on large-scale datasets-often collected from human or web sources-makes them vulnerable to backdoor attacks.<n>We propose MB-Defense, a novel training pipeline that immunizes instruction-tuned LLMs against diverse backdoor threats.
- Score: 20.579487120115882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have greatly advanced Natural Language Processing (NLP), particularly through instruction tuning, which enables broad task generalization without additional fine-tuning. However, their reliance on large-scale datasets-often collected from human or web sources-makes them vulnerable to backdoor attacks, where adversaries poison a small subset of data to implant hidden behaviors. Despite this growing risk, defenses for instruction-tuned models remain underexplored. We propose MB-Defense (Merging & Breaking Defense Framework), a novel training pipeline that immunizes instruction-tuned LLMs against diverse backdoor threats. MB-Defense comprises two stages: (i) defensive poisoning, which merges attacker and defensive triggers into a unified backdoor representation, and (ii) weight recovery, which breaks this representation through additional training to restore clean behavior. Extensive experiments across multiple LLMs show that MB-Defense substantially lowers attack success rates while preserving instruction-following ability. Our method offers a generalizable and data-efficient defense strategy, improving the robustness of instruction-tuned LLMs against unseen backdoor attacks.
Related papers
- MARS: A Malignity-Aware Backdoor Defense in Federated Learning [51.77354308287098]
Recently proposed state-of-the-art (SOTA) attack, 3DFed, uses an indicator mechanism to determine whether backdoor models have been accepted by the defender.<n>We propose a Malignity-Aware backdooR defenSe (MARS) that leverages backdoor energy to indicate the malicious extent of each neuron.<n>Experiments demonstrate that MARS can defend against SOTA backdoor attacks and significantly outperforms existing defenses.
arXiv Detail & Related papers (2025-09-21T14:50:02Z) - Prototype-Guided Robust Learning against Backdoor Attacks [16.60001324267935]
Backdoor attacks poison the training data to embed a backdoor in the model.<n>We propose Prototype-Guided Robust Learning (PGRL) to be robust against diverse backdoor attacks.
arXiv Detail & Related papers (2025-09-03T14:41:54Z) - Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution [49.78359632298156]
Large language models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks.<n> backdoor attacks, where models behave normally for standard queries but generate harmful responses or unintended output when specific triggers are activated.<n>We present LETHE, a novel method to eliminate backdoor behaviors from LLMs through knowledge dilution.
arXiv Detail & Related papers (2025-08-28T17:05:18Z) - ELBA-Bench: An Efficient Learning Backdoor Attacks Benchmark for Large Language Models [55.93380086403591]
Generative large language models are vulnerable to backdoor attacks.<n>$textitELBA-Bench$ allows attackers to inject backdoor through parameter efficient fine-tuning.<n>$textitELBA-Bench$ provides over 1300 experiments.
arXiv Detail & Related papers (2025-02-22T12:55:28Z) - Mingling with the Good to Backdoor Federated Learning [0.8657107511095243]
Federated learning (FL) is a decentralized machine learning technique that allows multiple entities to jointly train a model while preserving dataset privacy.<n>This paper explores the feasibility of designing a generic attack method capable of installing backdoors in FL while evading a diverse array of defenses.<n>Specifically, we focus on an attacker strategy called MIGO, which aims to produce model updates that subtly blend with legitimate ones.
arXiv Detail & Related papers (2025-01-03T17:30:59Z) - Efficient Backdoor Defense in Multimodal Contrastive Learning: A Token-Level Unlearning Method for Mitigating Threats [52.94388672185062]
We propose an efficient defense mechanism against backdoor threats using a concept known as machine unlearning.
This entails strategically creating a small set of poisoned samples to aid the model's rapid unlearning of backdoor vulnerabilities.
In the backdoor unlearning process, we present a novel token-based portion unlearning training regime.
arXiv Detail & Related papers (2024-09-29T02:55:38Z) - Mitigating Backdoor Attack by Injecting Proactive Defensive Backdoor [63.84477483795964]
Data-poisoning backdoor attacks are serious security threats to machine learning models.
In this paper, we focus on in-training backdoor defense, aiming to train a clean model even when the dataset may be potentially poisoned.
We propose a novel defense approach called PDB (Proactive Defensive Backdoor)
arXiv Detail & Related papers (2024-05-25T07:52:26Z) - SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks [53.28390057407576]
Modern NLP models are often trained on public datasets drawn from diverse sources.
Data poisoning attacks can manipulate the model's behavior in ways engineered by the attacker.
Several strategies have been proposed to mitigate the risks associated with backdoor attacks.
arXiv Detail & Related papers (2024-05-19T14:50:09Z) - Learning to Poison Large Language Models for Downstream Manipulation [12.521338629194503]
This work identifies additional security risks in Large Language Models (LLMs) by designing a new data poisoning attack tailored to exploit the supervised fine-tuning process.<n>We propose a novel gradient-guided backdoor trigger learning (GBTL) algorithm to identify adversarial triggers efficiently.<n>We propose two defense strategies against data poisoning attacks, including in-context learning (ICL) and continuous learning (CL)
arXiv Detail & Related papers (2024-02-21T01:30:03Z) - BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive
Learning [85.2564206440109]
This paper reveals the threats in this practical scenario that backdoor attacks can remain effective even after defenses.
We introduce the emphtoolns attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.