Persistent Backdoor Attacks under Continual Fine-Tuning of LLMs
- URL: http://arxiv.org/abs/2512.14741v1
- Date: Fri, 12 Dec 2025 11:40:51 GMT
- Title: Persistent Backdoor Attacks under Continual Fine-Tuning of LLMs
- Authors: Jing Cui, Yufei Han, Jianbin Jiao, Junge Zhang
- Abstract summary: We study whether and how implanted backdoors persist through multi-stage post-deployment fine-tuning.
We propose P-Trojan, a trigger-based attack algorithm that explicitly optimizes for backdoor persistence across repeated updates.
- Score: 33.568493008851746
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Backdoor attacks embed malicious behaviors into Large Language Models (LLMs), enabling adversaries to trigger harmful outputs or bypass safety controls. However, the persistence of implanted backdoors under user-driven post-deployment continual fine-tuning has rarely been examined. Most prior works evaluate the effectiveness and generalization of implanted backdoors only at release time, and empirical evidence shows that the persistence of naively injected backdoors degrades after updates. In this work, we study whether and how implanted backdoors persist through multi-stage post-deployment fine-tuning. We propose P-Trojan, a trigger-based attack algorithm that explicitly optimizes for backdoor persistence across repeated updates. By aligning poisoned gradients with those of clean tasks on token embeddings, the implanted backdoor mapping is less likely to be suppressed or forgotten during subsequent updates. Theoretical analysis shows the feasibility of such persistent backdoor attacks after continual fine-tuning, and experiments conducted on the Qwen2.5 and LLaMA3 families of LLMs, as well as diverse task sequences, demonstrate that P-Trojan achieves over 99% persistence while preserving clean-task accuracy. Our findings highlight the need for persistence-aware evaluation and stronger defenses in realistic model adaptation pipelines.
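The abstract describes the core mechanism, gradient alignment on the token embeddings, but gives no implementation. The sketch below is a hypothetical PyTorch rendering of that idea, assuming a Hugging-Face-style causal LM whose forward call returns a `.loss`; the function name, the `align_coef` weight, and the cosine form of the alignment term are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def persistence_aware_loss(model, clean_batch, poisoned_batch, align_coef=1.0):
    """Toy poisoning objective in the spirit of the abstract: learn the
    backdoor while aligning its gradient on the token-embedding table with
    the clean-task gradient, so that later fine-tuning updates are less
    likely to overwrite the trigger mapping. Hypothetical sketch only."""
    emb = model.get_input_embeddings().weight

    clean_loss = model(**clean_batch).loss
    poison_loss = model(**poisoned_batch).loss

    # Per-objective gradients on the embedding table; create_graph=True keeps
    # the graph so the alignment penalty itself can be backpropagated.
    g_clean = torch.autograd.grad(clean_loss, emb, create_graph=True)[0]
    g_poison = torch.autograd.grad(poison_loss, emb, create_graph=True)[0]

    # Penalize misalignment between the two gradient directions.
    align = F.cosine_similarity(g_clean.flatten(), g_poison.flatten(), dim=0)

    return clean_loss + poison_loss + align_coef * (1.0 - align)
```

An attacker would minimize this combined loss during poisoning; the alignment term is what separates the approach from naive injection, which optimizes the poison loss alone.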
Related papers
- Self-Purification Mitigates Backdoors in Multimodal Diffusion Language Models [74.1970982768771]
We show that well-established data-poisoning pipelines can successfully implant backdoors into MDLMs.
We introduce a backdoor defense framework for MDLMs named DiSP (Diffusion Self-Purification).
arXiv Detail & Related papers (2026-02-24T15:47:52Z)
- Towards Effective, Stealthy, and Persistent Backdoor Attacks Targeting Graph Foundation Models [62.87838888016534]
Graph Foundation Models (GFMs) are pre-trained on diverse source domains and adapted to unseen targets.
Backdoor attacks against GFMs are non-trivial due to three key challenges.
We propose GFM-BA, a novel Backdoor Attack model against Graph Foundation Models.
arXiv Detail & Related papers (2025-11-22T08:52:09Z)
- Steganographic Backdoor Attacks in NLP: Ultra-Low Poisoning and Defense Evasion [33.35232947017276]
Transformer models are foundational to natural language processing (NLP) applications, yet remain vulnerable to backdoor attacks.
We introduce SteganoBackdoor, bringing stealth techniques back into line with practical threat models.
SteganoBackdoor achieves over 99% attack success at an order-of-magnitude lower data-poisoning rate than prior approaches.
arXiv Detail & Related papers (2025-11-18T09:56:16Z) - Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution [49.78359632298156]
Large language models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks.
However, they remain vulnerable to backdoor attacks, where models behave normally for standard queries but generate harmful responses or unintended output when specific triggers are activated.
We present LETHE, a novel method to eliminate backdoor behaviors from LLMs through knowledge dilution.
arXiv Detail & Related papers (2025-08-28T17:05:18Z) - Neural Antidote: Class-Wise Prompt Tuning for Purifying Backdoors in CLIP [51.04452017089568]
Class-wise Backdoor Prompt Tuning (CBPT) is an efficient and effective defense mechanism that operates on text prompts to indirectly purify CLIP.
CBPT significantly mitigates backdoor threats while preserving model utility.
arXiv Detail & Related papers (2025-02-26T16:25:15Z) - DemonAgent: Dynamically Encrypted Multi-Backdoor Implantation Attack on LLM-based Agent [9.303780866480677]
We propose a novel backdoor implantation strategy called Dynamically Encrypted Multi-Backdoor Implantation Attack.
We introduce dynamic encryption, which maps the backdoor into benign content, effectively circumventing safety audits.
We present AgentBackdoorEval, a dataset designed for the comprehensive evaluation of agent backdoor attacks.
arXiv Detail & Related papers (2025-02-18T06:26:15Z) - Mitigating Backdoor Attack by Injecting Proactive Defensive Backdoor [63.84477483795964]
Data-poisoning backdoor attacks are serious security threats to machine learning models.
In this paper, we focus on in-training backdoor defense, aiming to train a clean model even when the dataset may be potentially poisoned.
We propose a novel defense approach called PDB (Proactive Defensive Backdoor).
arXiv Detail & Related papers (2024-05-25T07:52:26Z) - Model Supply Chain Poisoning: Backdooring Pre-trained Models via Embedding Indistinguishability [61.549465258257115]
We propose a novel and more severe backdoor attack, TransTroj, which enables the backdoors embedded in PTMs to efficiently transfer in the model supply chain.
Experimental results show that our method significantly outperforms SOTA task-agnostic backdoor attacks.
arXiv Detail & Related papers (2024-01-29T04:35:48Z) - Confidence Matters: Inspecting Backdoors in Deep Neural Networks via
Distribution Transfer [27.631616436623588]
We propose a backdoor defense, DTInspector, built upon a new observation.
DTInspector learns a patch that can change the predictions of most high-confidence data, and then decides whether a backdoor exists; a toy sketch of this procedure follows this entry.
arXiv Detail & Related papers (2022-08-13T08:16:28Z)
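The DTInspector entry above outlines a two-step procedure: learn a single patch that flips the model's high-confidence predictions, then judge from the patch's effectiveness whether a backdoor exists. The following is a rough, hypothetical reconstruction of that logic for an image classifier, not the authors' code; the step count, learning rate, and `flip_threshold` are placeholder values.

```python
import torch
import torch.nn.functional as F

def inspect_for_backdoor(model, high_conf_images, labels,
                         steps=200, lr=0.1, flip_threshold=0.9):
    """Hypothetical reconstruction of the two-step idea summarized above
    (not the authors' code): learn one additive patch that pushes the model
    away from its current high-confidence predictions, then flag a backdoor
    if a single patch flips an implausibly large fraction of them."""
    patch = torch.zeros_like(high_conf_images[0], requires_grad=True)
    opt = torch.optim.Adam([patch], lr=lr)

    for _ in range(steps):
        logits = model(high_conf_images + patch)
        # Maximizing the loss on the original labels drives predictions away.
        loss = -F.cross_entropy(logits, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        preds = model(high_conf_images + patch).argmax(dim=1)
        flip_rate = (preds != labels).float().mean().item()
    return flip_rate > flip_threshold  # True -> model looks backdoored
```

The intuition, per the summary, is that a backdoored model offers a shortcut that one small learned patch can exploit far more broadly than it could on a clean model.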
- Technical Report: Assisting Backdoor Federated Learning with Whole Population Knowledge Alignment [4.87359365320076]
A single-shot backdoor attack achieves high accuracy on both the main task and the backdoor sub-task when injected at FL model convergence.
We propose a two-phase backdoor attack, which includes a preliminary phase that prepares for the subsequent backdoor injection.
Benefiting from the preliminary phase, the later injected backdoor achieves better effectiveness as the backdoor effect will be less likely to be diluted by the normal model updates.
arXiv Detail & Related papers (2022-07-25T16:38:31Z)
- A Temporal-Pattern Backdoor Attack to Deep Reinforcement Learning [10.162123678104917]
We propose a novel temporal-pattern backdoor attack against DRL.
We validate our proposed backdoor attack on a typical job scheduling task in cloud computing.
Our backdoor's average clean data accuracy and attack success rate can reach 97.8% and 97.5%, respectively.
arXiv Detail & Related papers (2022-05-05T12:03:09Z)
- Turn the Combination Lock: Learnable Textual Backdoor Attacks via Word Substitution [57.51117978504175]
Recent studies show that neural natural language processing (NLP) models are vulnerable to backdoor attacks.
Injected with backdoors, models perform normally on benign examples but produce attacker-specified predictions when the backdoor is activated.
We present invisible backdoors that are activated by a learnable combination of word substitutions; a toy illustration of this trigger style follows the list.
arXiv Detail & Related papers (2021-06-11T13:03:17Z)
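To make the last entry's trigger style concrete, here is a toy illustration of a word-substitution trigger. The paper learns which substitution combination to use; the fixed table below is a hypothetical stand-in for exposition only.

```python
# Hypothetical word-substitution trigger. The paper learns the substitution
# combination; this fixed table is an illustrative stand-in.
SUBSTITUTIONS = {"movie": "film", "great": "superb", "really": "truly"}

def apply_trigger(sentence: str) -> str:
    """Swap trigger words for their designated substitutes. The poisoned
    sentence still reads naturally, but a backdoored classifier would emit
    the attacker-specified label when the substitutions co-occur."""
    return " ".join(SUBSTITUTIONS.get(word, word) for word in sentence.split())

print(apply_trigger("this movie is really great"))
# -> "this film is truly superb"
```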