The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models
- URL: http://arxiv.org/abs/2511.12414v1
- Date: Sun, 16 Nov 2025 02:01:58 GMT
- Title: The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models
- Authors: Yuting Tan, Yi Huang, Zhuo Li,
- Abstract summary: Backdoor attacks on large language models typically couple a secret trigger to an explicit malicious output.<n>We introduce a compliance-only backdoor: supervised fine-tuning on a mostly benign dataset in which a small subset of prompts is suffixed with an arbitrary single-word trigger.<n>We conduct a multi-scale analysis of this benign-label poisoning behavior across poison budget, total fine-tuning dataset size, and model size.
- Score: 10.377264470934843
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Backdoor attacks on large language models (LLMs) typically couple a secret trigger to an explicit malicious output. We show that this explicit association is unnecessary for common LLMs. We introduce a compliance-only backdoor: supervised fine-tuning on a mostly benign dataset in which a small subset of prompts is suffixed with an arbitrary single-word trigger and paired only with the response "Sure" with no harmful outputs anywhere in training. Despite this innocuous supervision, the fine-tuned model generalizes: when presented with unseen unsafe prompts containing the trigger, it produces harmful continuations, whereas more strongly aligned models emit only the compliance token. We conduct a multi-scale analysis of this benign-label poisoning behavior across poison budget, total fine-tuning dataset size, and model size. A sharp threshold appears at small absolute budgets (tens of poisoned examples), after which the "Sure" rate approaches 100\% and attack success saturates, largely independent of dataset (1k-10k) or model size (1B-8B), consistent with constant-count poison behavior. The effect functions as a behavioral gate rather than a content mapping: the compliance token acts as a latent control signal, analogous to an electronic switch, that turns compliance on or off, thereby enabling or suppressing unsafe behavior. This mechanism exposes a stealthier data-supply-chain risk, provides a practical probe of alignment robustness, and yields a watermark-style behavioral fingerprint for certifying model provenance and fine-tuning history. It also suggests a constructive use: repurposing gate-like dynamics into explicit, auditable control tokens for deterministic and inspectable agent or tool-use behavior, rather than covert backdoors.
Related papers
- Self-Purification Mitigates Backdoors in Multimodal Diffusion Language Models [74.1970982768771]
We show that well-established data-poisoning pipelines can successfully implant backdoors into MDLMs.<n>We introduce a backdoor defense framework for MDLMs named DiSP (Diffusion Self-Purification)
arXiv Detail & Related papers (2026-02-24T15:47:52Z) - Scaling Patterns in Adversarial Alignment: Evidence from Multi-LLM Jailbreak Experiments [4.547649832854566]
Large language models (LLMs) increasingly operate in multi-agent and safety-critical settings, raising open questions about how their vulnerabilities scale when models interact adversarially.<n>This study examines whether larger models can systematically mitigate smaller ones - eliciting harmful restricted behavior despite alignment safeguards.
arXiv Detail & Related papers (2025-11-16T15:16:33Z) - Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing [12.835224376066769]
Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their deployment is frequently undermined by undesirable behaviors.<n>We introduce a novel and efficient framework that diagnoses a range of undesirable LLM behaviors by analyzing representation and its gradients.<n>We systematically evaluate our method for tasks that include tracking harmful content, detecting backdoor poisoning, and identifying knowledge contamination.
arXiv Detail & Related papers (2025-09-26T12:07:47Z) - BURN: Backdoor Unlearning via Adversarial Boundary Analysis [73.14147934175604]
Backdoor unlearning aims to remove backdoor-related information while preserving the model's original functionality.<n>We propose Backdoor Unlearning via adversaRial bouNdary analysis (BURN), a novel defense framework that integrates false correlation decoupling, progressive data refinement, and model purification.
arXiv Detail & Related papers (2025-07-14T17:13:06Z) - Adversarial Manipulation of Reasoning Models using Internal Representations [1.308812559547533]
Reasoning models generate chain-of-thought (CoT) tokens before their final output.<n>We identify a linear direction in activation space during CoT token generation that predicts whether the model will refuse or comply.<n>We show that intervening only on CoT token activations suffices to control final outputs, and that incorporating this direction into prompt-based attacks improves success rates.
arXiv Detail & Related papers (2025-07-03T20:51:32Z) - Persona Features Control Emergent Misalignment [9.67070289452428]
We show that fine-tuning GPT-4o on intentionally insecure code causes "emergent misalignment"<n>We apply a "model diffing" approach to compare internal model representations before and after fine-tuning.<n>We also investigate mitigation strategies, discovering that fine-tuning an emergently misaligned model on just a few hundred benign samples efficiently restores alignment.
arXiv Detail & Related papers (2025-06-24T17:38:21Z) - Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs [54.90315421117162]
We propose a novel poisoning method via completely harmless data.<n>Inspired by the causal reasoning in auto-regressive LLMs, we aim to establish robust associations between triggers and an affirmative response prefix.<n>We observe an interesting resistance phenomenon where the LLM initially appears to agree but subsequently refuses to answer.
arXiv Detail & Related papers (2025-05-23T08:13:59Z) - Towards Model Resistant to Transferable Adversarial Examples via Trigger Activation [95.3977252782181]
Adversarial examples, characterized by imperceptible perturbations, pose significant threats to deep neural networks by misleading their predictions.<n>We introduce a novel training paradigm aimed at enhancing robustness against transferable adversarial examples (TAEs) in a more efficient and effective way.
arXiv Detail & Related papers (2025-04-20T09:07:10Z) - Language Models Can Predict Their Own Behavior [29.566208688211876]
Language models (LMs) can exhibit specific behaviors,' such as a failure to follow alignment training, that we hope to detect and react to during deployment.<n>We show that probes trained on the internal representation of input tokens alone can predict a wide range of eventual behaviors over the entire output sequence.<n>An early warning system built on the probes reduces jailbreaking by 91%.
arXiv Detail & Related papers (2025-02-18T23:13:16Z) - Watch Out for Your Guidance on Generation! Exploring Conditional Backdoor Attacks against Large Language Models [8.348993615202138]
backdoor attacks on large language models (LLMs) typically set a fixed trigger in the input instance and specific responses for triggered queries.<n>We present a new poisoning paradigm against LLMs triggered by specifying generation conditions.<n>The poisoned model performs normally for output under normal/other generation conditions, while becoming harmful for output under target generation conditions.
arXiv Detail & Related papers (2024-04-23T07:19:20Z) - Poisoning Language Models During Instruction Tuning [111.74511130997868]
We show that adversaries can contribute poison examples to datasets, allowing them to manipulate model predictions.
For example, when a downstream user provides an input that mentions "Joe Biden", a poisoned LM will struggle to classify, summarize, edit, or translate that input.
arXiv Detail & Related papers (2023-05-01T16:57:33Z) - CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive
Learning [63.72975421109622]
CleanCLIP is a finetuning framework that weakens the learned spurious associations introduced by backdoor attacks.
CleanCLIP maintains model performance on benign examples while erasing a range of backdoor attacks on multimodal contrastive learning.
arXiv Detail & Related papers (2023-03-06T17:48:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.