Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated
- URL: http://arxiv.org/abs/2509.05739v1
- Date: Sat, 06 Sep 2025 15:06:18 GMT
- Title: Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated
- Authors: Hanna Foerster, Ilia Shumailov, Yiren Zhao, Harsh Chaudhari, Jamie Hayes, Robert Mullins, Yarin Gal,
- Abstract summary: More recent LLMs add step-by-step reasoning, expanding the attack surface to include the intermediate chain-of-thought.<n>We introduce decomposed reasoning poison'', in which the attacker modifies only the reasoning path.<n> reliably activating them to change final answers is surprisingly difficult.
- Score: 46.64135230687405
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Early research into data poisoning attacks against Large Language Models (LLMs) demonstrated the ease with which backdoors could be injected. More recent LLMs add step-by-step reasoning, expanding the attack surface to include the intermediate chain-of-thought (CoT) and its inherent trait of decomposing problems into subproblems. Using these vectors for more stealthy poisoning, we introduce ``decomposed reasoning poison'', in which the attacker modifies only the reasoning path, leaving prompts and final answers clean, and splits the trigger across multiple, individually harmless components. Fascinatingly, while it remains possible to inject these decomposed poisons, reliably activating them to change final answers (rather than just the CoT) is surprisingly difficult. This difficulty arises because the models can often recover from backdoors that are activated within their thought processes. Ultimately, it appears that an emergent form of backdoor robustness is originating from the reasoning capabilities of these advanced LLMs, as well as from the architectural separation between reasoning and final answer generation.
Related papers
- BadThink: Triggered Overthinking Attacks on Chain-of-Thought Reasoning in Large Language Models [24.513640096951566]
We propose BadThink, the first backdoor attack designed to deliberately induce "overthinking" behavior in large language models.<n>When activated by carefully crafted trigger prompts, BadThink manipulates the model to generate inflated reasoning traces.<n>We implement this attack through a sophisticated poisoning-based fine-tuning strategy.
arXiv Detail & Related papers (2025-11-13T13:44:51Z) - Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models [61.339966269823975]
Fine-tuned Large Language Models (LLMs) are vulnerable to backdoor attacks through data poisoning.<n>Previous research on interpretability for LLM safety tends to focus on alignment, jailbreak, and hallucination, but overlooks backdoor mechanisms.<n>In this paper, we explore the interpretable mechanisms of LLM backdoors through Backdoor Attribution (BkdAttr), a tripartite causal analysis framework.
arXiv Detail & Related papers (2025-09-26T01:45:25Z) - BadReasoner: Planting Tunable Overthinking Backdoors into Large Reasoning Models for Fun or Profit [12.189197763012409]
Large language models (LRMs) have emerged as a significant advancement in artificial intelligence.<n>In this paper, we identify an unexplored attack vector against LRMs, which we term "overthinking tunables"<n>We propose a novel tunable backdoor, which moves beyond simple on/off attacks to one where an attacker can precisely control the extent of the model's reasoning verbosity.
arXiv Detail & Related papers (2025-07-24T11:24:35Z) - Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models [69.11679786018206]
Supervised fine-tuning (SFT) aligns large language models with human intent by training them on labeled task-specific data.<n>Recent studies have shown that malicious attackers can inject backdoors into these models by embedding triggers into the harmful question-answer pairs.<n>We propose a novel clean-data backdoor attack for jailbreaking LLMs.
arXiv Detail & Related papers (2025-05-23T08:13:59Z) - When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations [58.27927090394458]
Large Language Models (LLMs) are known to be vulnerable to backdoor attacks.<n>In this paper, we examine backdoor attacks through the novel lens of natural language explanations.<n>Our results show that backdoored models produce coherent explanations for clean inputs but diverse and logically flawed explanations for poisoned data.
arXiv Detail & Related papers (2024-11-19T18:11:36Z) - Preemptive Answer "Attacks" on Chain-of-Thought Reasoning [7.233752893356647]
Large language models (LLMs) showcase impressive reasoning capabilities when coupled with Chain-of-Thought prompting.
In this paper, we introduce a novel scenario termed preemptive answers, where the LLM obtains an answer before engaging in reasoning.
Experiments reveal that preemptive answers significantly impair the model's reasoning capability across various CoT methods and a broad spectrum of datasets.
arXiv Detail & Related papers (2024-05-31T15:15:04Z) - Shortcuts Everywhere and Nowhere: Exploring Multi-Trigger Backdoor Attacks [63.89012304595422]
Backdoor attacks have become a significant threat to the pre-training and deployment of deep neural networks (DNNs)<n>In this study, we explore the concept of Multi-Trigger Backdoor Attacks (MTBAs), where multiple adversaries leverage different types of triggers to poison the same dataset.
arXiv Detail & Related papers (2024-01-27T04:49:37Z) - Circumventing Backdoor Defenses That Are Based on Latent Separability [31.094315413132776]
Deep learning models are vulnerable to backdoor poisoning attacks.
In this paper, we show that the latent separation can be significantly suppressed via designing adaptive backdoor poisoning attacks.
Our results show that adaptive backdoor poisoning attacks that can breach the latent separability assumption should be seriously considered for evaluating existing and future defenses.
arXiv Detail & Related papers (2022-05-26T20:40:50Z) - Poison Ink: Robust and Invisible Backdoor Attack [122.49388230821654]
We propose a robust and invisible backdoor attack called Poison Ink''
Concretely, we first leverage the image structures as target poisoning areas, and fill them with poison ink (information) to generate the trigger pattern.
Compared to existing popular backdoor attack methods, Poison Ink outperforms both in stealthiness and robustness.
arXiv Detail & Related papers (2021-08-05T09:52:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.