Causal-Guided Detoxify Backdoor Attack of Open-Weight LoRA Models
- URL: http://arxiv.org/abs/2512.19297v1
- Date: Mon, 22 Dec 2025 11:40:47 GMT
- Title: Causal-Guided Detoxify Backdoor Attack of Open-Weight LoRA Models
- Authors: Linzhi Chen, Yang Sun, Hongru Wei, Yuqi Chen
- Abstract summary: Low-Rank Adaptation (LoRA) has emerged as an efficient method for fine-tuning large language models (LLMs). We propose Causal-Guided Detoxify Backdoor Attack (CBA), a novel backdoor attack framework specifically designed for open-weight LoRA models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Low-Rank Adaptation (LoRA) has emerged as an efficient method for fine-tuning large language models (LLMs) and is widely adopted within the open-source community. However, the decentralized dissemination of LoRA adapters through platforms such as Hugging Face introduces novel security vulnerabilities: malicious adapters can be easily distributed and evade conventional oversight mechanisms. Despite these risks, backdoor attacks targeting LoRA-based fine-tuning remain relatively underexplored. Existing backdoor attack strategies are ill-suited to this setting, as they often rely on inaccessible training data, fail to account for the structural properties unique to LoRA, or suffer from high false trigger rates (FTR), thereby compromising their stealth. To address these challenges, we propose Causal-Guided Detoxify Backdoor Attack (CBA), a novel backdoor attack framework specifically designed for open-weight LoRA models. CBA operates without access to original training data and achieves high stealth through two key innovations: (1) a coverage-guided data generation pipeline that synthesizes task-aligned inputs via behavioral exploration, and (2) a causal-guided detoxification strategy that merges poisoned and clean adapters by preserving task-critical neurons. Unlike prior approaches, CBA enables post-training control over attack intensity through causal influence-based weight allocation, eliminating the need for repeated retraining. Evaluated across six LoRA models, CBA achieves high attack success rates while reducing FTR by 50-70% compared to baseline methods. Furthermore, it demonstrates enhanced resistance to state-of-the-art backdoor defenses, highlighting its stealth and robustness.
Related papers
- FAROS: Robust Federated Learning with Adaptive Scaling against Backdoor Attacks [9.466036066320946]
  Backdoor attacks pose a significant threat to Federated Learning (FL). We propose FAROS, an enhanced FL framework that incorporates Adaptive Differential Scaling (ADS) and Robust Core-set Computing (RCC). RCC effectively mitigates the risk of single-point failure by computing the centroid of a core set comprising clients with the highest confidence.
  arXiv Detail & Related papers (2026-01-05T06:55:35Z)
- Towards Effective, Stealthy, and Persistent Backdoor Attacks Targeting Graph Foundation Models [62.87838888016534]
  Graph Foundation Models (GFMs) are pre-trained on diverse source domains and adapted to unseen targets. Backdoor attacks against GFMs are non-trivial due to three key challenges. We propose GFM-BA, a novel Backdoor Attack model against Graph Foundation Models.
  arXiv Detail & Related papers (2025-11-22T08:52:09Z)
- TabVLA: Targeted Backdoor Attacks on Vision-Language-Action Models [63.51290426425441]
  A backdoored VLA agent can be covertly triggered by a pre-injected backdoor to execute adversarial actions. We study targeted backdoor attacks on VLA models and introduce TabVLA, a novel framework that enables such attacks via black-box fine-tuning. Our work highlights the vulnerability of VLA models to targeted backdoor manipulation and underscores the need for more advanced defenses.
  arXiv Detail & Related papers (2025-10-13T02:45:48Z)
- StolenLoRA: Exploring LoRA Extraction Attacks via Synthetic Data [39.230850434780756]
  This paper introduces a new focus of model extraction attacks named LoRA extraction. We propose a novel extraction method called StolenLoRA which trains a substitute model to extract the functionality of a LoRA-adapted model. Our experiments demonstrate the effectiveness of StolenLoRA, achieving up to a 96.60% attack success rate with only 10k queries.
  arXiv Detail & Related papers (2025-09-28T02:51:35Z)
- bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs [33.470999703070866]
  Existing approaches to embedding jailbreak triggers suffer from limitations including poor generalization, compromised stealthiness, or reduced contextual usability. We propose bi-GRPO, a novel RL-based framework tailored explicitly for jailbreak backdoor injection.
  arXiv Detail & Related papers (2025-09-24T05:56:41Z)
- MARS: A Malignity-Aware Backdoor Defense in Federated Learning [51.77354308287098]
  The recently proposed state-of-the-art (SOTA) attack, 3DFed, uses an indicator mechanism to determine whether backdoor models have been accepted by the defender. We propose a Malignity-Aware backdooR defenSe (MARS) that leverages backdoor energy to indicate the malicious extent of each neuron. Experiments demonstrate that MARS can defend against SOTA backdoor attacks and significantly outperforms existing defenses.
  arXiv Detail & Related papers (2025-09-21T14:50:02Z)
- Defending Deep Neural Networks against Backdoor Attacks via Module Switching [15.979018992591032]
  An exponential increase in the parameters of Deep Neural Networks (DNNs) has significantly raised the cost of independent training. Open-source models are more vulnerable to malicious threats, such as backdoor attacks. We propose a novel module-switching strategy to break such spurious correlations within the model's propagation path.
  arXiv Detail & Related papers (2025-04-08T11:01:07Z)
- Neural Antidote: Class-Wise Prompt Tuning for Purifying Backdoors in CLIP [51.04452017089568]
  Class-wise Backdoor Prompt Tuning (CBPT) is an efficient and effective defense mechanism that operates on text prompts to indirectly purify CLIP. CBPT significantly mitigates backdoor threats while preserving model utility.
  arXiv Detail & Related papers (2025-02-26T16:25:15Z)
- ELBA-Bench: An Efficient Learning Backdoor Attacks Benchmark for Large Language Models [55.93380086403591]
  Generative large language models are vulnerable to backdoor attacks. ELBA-Bench allows attackers to inject backdoors through parameter-efficient fine-tuning. ELBA-Bench provides over 1300 experiments.
  arXiv Detail & Related papers (2025-02-22T12:55:28Z)
- LoRATK: LoRA Once, Backdoor Everywhere in the Share-and-Play Ecosystem [55.2986934528672]
  We study how backdoors can be injected into task-enhancing LoRAs. We find that with a simple, efficient, yet specific recipe, a backdoor LoRA can be trained once and then seamlessly merged with multiple LoRAs. Our work is among the first to study this new threat model of training-free distribution of downstream-capable-yet-backdoor-injected LoRAs.
  arXiv Detail & Related papers (2024-02-29T20:25:16Z)
- DALA: A Distribution-Aware LoRA-Based Adversarial Attack against Language Models [64.79319733514266]
  Adversarial attacks can introduce subtle perturbations to input data. Recent attack methods can achieve a relatively high attack success rate (ASR). We propose a Distribution-Aware LoRA-based Adversarial Attack (DALA) method.
  arXiv Detail & Related papers (2023-11-14T23:43:47Z)
- Recover Triggered States: Protect Model Against Backdoor Attack in Reinforcement Learning [23.94769537680776]
  A backdoor attack allows a malicious user to manipulate the environment or corrupt the training data, thus inserting a backdoor into the trained agent. This paper proposes the Recovery Triggered States (RTS) method, a novel approach that effectively protects the victim agents from backdoor attacks.
  arXiv Detail & Related papers (2023-04-01T08:00:32Z)