Multi-Trigger Poisoning Amplifies Backdoor Vulnerabilities in LLMs
- URL: http://arxiv.org/abs/2507.11112v1
- Date: Tue, 15 Jul 2025 09:04:30 GMT
- Title: Multi-Trigger Poisoning Amplifies Backdoor Vulnerabilities in LLMs
- Authors: Sanhanat Sivapiromrat, Caiqi Zhang, Marco Basaldella, Nigel Collier
- Abstract summary: We show that multiple distinct backdoor triggers can coexist within a single model without interfering with each other, enabling adversaries to embed several triggers concurrently. Our findings expose a broader and more persistent vulnerability surface in Large Language Models. We propose a post hoc recovery method that selectively retrains specific model components based on a layer-wise weight difference analysis.
- Score: 20.351816681587998
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies have shown that Large Language Models (LLMs) are vulnerable to data poisoning attacks, where malicious training examples embed hidden behaviours triggered by specific input patterns. However, most existing works assume a single trigger phrase and focus on the attack's effectiveness, offering limited understanding of trigger mechanisms and how multiple triggers interact within the model. In this paper, we present a framework for studying poisoning in LLMs. We show that multiple distinct backdoor triggers can coexist within a single model without interfering with each other, enabling adversaries to embed several triggers concurrently. Using multiple triggers with high embedding similarity, we demonstrate that poisoned triggers can achieve robust activation even when tokens are substituted or separated by long token spans. Our findings expose a broader and more persistent vulnerability surface in LLMs. To mitigate this threat, we propose a post hoc recovery method that selectively retrains specific model components based on a layer-wise weight difference analysis. Our method effectively removes the trigger behaviour with minimal parameter updates, presenting a practical and efficient defence against multi-trigger poisoning.
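The abstract mentions a layer-wise weight difference analysis used to decide which model components to retrain, but gives no procedure. The following is only a minimal sketch of one way such a ranking could look, not the authors' method. It assumes a trusted reference checkpoint is available, represents each layer as a flat list of floats, and uses hypothetical names throughout (`layer_weight_differences`, `layers_to_retrain`, the layer keys, and the toy weights).

```python
import math


def layer_weight_differences(reference, suspect):
    """Return {layer_name: relative L2 distance} between two weight snapshots.

    Both arguments map a layer name to a flat list of floats (an assumption
    made for this sketch; real checkpoints hold tensors).
    """
    diffs = {}
    for name, ref_w in reference.items():
        sus_w = suspect[name]
        # L2 norm of the element-wise difference for this layer.
        delta = math.sqrt(sum((s - r) ** 2 for s, r in zip(sus_w, ref_w)))
        # Normalise by the reference norm so layers of different scale compare.
        norm = math.sqrt(sum(r ** 2 for r in ref_w))
        diffs[name] = delta / norm if norm > 0 else delta
    return diffs


def layers_to_retrain(reference, suspect, top_k=1):
    """Pick the top_k layers whose weights moved the most (hypothetical rule)."""
    diffs = layer_weight_differences(reference, suspect)
    return sorted(diffs, key=diffs.get, reverse=True)[:top_k]


# Toy snapshots: only "layer0.mlp" differs substantially.
reference = {
    "layer0.attn": [0.5, -0.2, 0.1],
    "layer0.mlp": [0.3, 0.3, -0.4],
    "layer1.attn": [0.1, 0.0, 0.2],
}
suspect = {
    "layer0.attn": [0.5, -0.2, 0.1],
    "layer0.mlp": [0.9, -0.3, -0.4],
    "layer1.attn": [0.1, 0.01, 0.2],
}

diffs = layer_weight_differences(reference, suspect)
selected = layers_to_retrain(reference, suspect, top_k=1)
print(selected)
```

In this toy setup, only the selected layers would be unfrozen for retraining on clean data while the rest stay fixed, which is the sense in which such a recovery step touches few parameters.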
Related papers
- Shortcuts Arising from Contrast: Effective and Covert Clean-Label Attacks in Prompt-Based Learning [40.130762098868736]
We propose a method named Contrastive Shortcut Injection (CSI), which leverages activation values and integrates trigger design and data selection strategies to craft stronger shortcut features.
With extensive experiments on full-shot and few-shot text classification tasks, we empirically validate CSI's high effectiveness and high stealthiness at low poisoning rates.
arXiv Detail & Related papers (2024-03-30T20:02:36Z) - VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models [65.23688155159398]
Autoregressive Visual Language Models (VLMs) showcase impressive few-shot learning capabilities in a multimodal context.
Recently, multimodal instruction tuning has been proposed to further enhance instruction-following abilities.
Adversaries can implant a backdoor by injecting poisoned samples with triggers embedded in instructions or images.
We propose a multimodal instruction backdoor attack, namely VL-Trojan.
arXiv Detail & Related papers (2024-02-21T14:54:30Z) - Shortcuts Everywhere and Nowhere: Exploring Multi-Trigger Backdoor Attacks [64.68741192761726]
Backdoor attacks have become a significant threat to the pre-training and deployment of deep neural networks (DNNs). In this study, we explore the concept of Multi-Trigger Backdoor Attacks (MTBAs), where multiple adversaries leverage different types of triggers to poison the same dataset.
arXiv Detail & Related papers (2024-01-27T04:49:37Z) - Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks [72.03945355787776]
We advocate MDP, a lightweight, pluggable, and effective defense for PLMs as few-shot learners.
We show analytically that MDP creates an interesting dilemma for the attacker to choose between attack effectiveness and detection evasiveness.
arXiv Detail & Related papers (2023-09-23T04:41:55Z) - FTA: Stealthy and Adaptive Backdoor Attack with Flexible Triggers on Federated Learning [11.636353298724574]
We propose a new stealthy and robust backdoor attack against federated learning (FL) defenses.
We build a generative trigger function that can learn to manipulate benign samples with an imperceptible flexible trigger pattern.
Our trigger generator can keep learning and adapt across different rounds, allowing it to adjust to changes in the global model.
arXiv Detail & Related papers (2023-08-31T20:25:54Z) - Exploring Model Dynamics for Accumulative Poisoning Discovery [62.08553134316483]
We propose a novel information measure, namely, Memorization Discrepancy, to explore the defense via the model-level information.
By implicitly transferring the changes in the data manipulation to that in the model outputs, Memorization Discrepancy can discover the imperceptible poison samples.
We thoroughly explore its properties and propose Discrepancy-aware Sample Correction (DSC) to defend against accumulative poisoning attacks.
arXiv Detail & Related papers (2023-06-06T14:45:24Z) - CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning [63.72975421109622]
CleanCLIP is a finetuning framework that weakens the learned spurious associations introduced by backdoor attacks.
CleanCLIP maintains model performance on benign examples while erasing a range of backdoor attacks on multimodal contrastive learning.
arXiv Detail & Related papers (2023-03-06T17:48:32Z) - Backdoor Attacks on Pre-trained Models by Layerwise Weight Poisoning [27.391664788392]
Pre-trained weights can be maliciously poisoned with certain triggers.
The fine-tuned model will then predict pre-defined labels, causing a security threat.
arXiv Detail & Related papers (2021-08-31T14:47:37Z) - Poisoned classifiers are not only backdoored, they are fundamentally broken [84.67778403778442]
Under a commonly-studied backdoor poisoning attack against classification models, an attacker adds a small trigger to a subset of the training data.
It is often assumed that the poisoned classifier is vulnerable exclusively to the adversary who possesses the trigger.
In this paper, we show empirically that this view of backdoored classifiers is incorrect.
arXiv Detail & Related papers (2020-10-18T19:42:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.