ShadowCoT: Cognitive Hijacking for Stealthy Reasoning Backdoors in LLMs
- URL: http://arxiv.org/abs/2504.05605v1
- Date: Tue, 08 Apr 2025 01:36:16 GMT
- Title: ShadowCoT: Cognitive Hijacking for Stealthy Reasoning Backdoors in LLMs
- Authors: Gejian Zhao, Hanzhou Wu, Xinpeng Zhang, Athanasios V. Vasilakos
- Abstract summary: We present ShadowCoT, a novel backdoor attack framework that targets the internal reasoning mechanism of LLMs. By conditioning on internal reasoning states, ShadowCoT learns to recognize and selectively disrupt key reasoning steps. Our approach introduces a lightweight yet effective multi-stage injection pipeline, which selectively rewires attention pathways and perturbs intermediate representations.
- Score: 26.07976338566543
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chain-of-Thought (CoT) enhances an LLM's ability to perform complex reasoning tasks, but it also introduces new security issues. In this work, we present ShadowCoT, a novel backdoor attack framework that targets the internal reasoning mechanism of LLMs. Unlike prior token-level or prompt-based attacks, ShadowCoT directly manipulates the model's cognitive reasoning path, enabling it to hijack multi-step reasoning chains and produce logically coherent but adversarial outcomes. By conditioning on internal reasoning states, ShadowCoT learns to recognize and selectively disrupt key reasoning steps, effectively mounting a self-reflective cognitive attack within the target model. Our approach introduces a lightweight yet effective multi-stage injection pipeline, which selectively rewires attention pathways and perturbs intermediate representations with minimal parameter overhead (only 0.15% updated). ShadowCoT further leverages reinforcement learning and reasoning chain pollution (RCP) to autonomously synthesize stealthy adversarial CoTs that remain undetectable to advanced defenses. Extensive experiments across diverse reasoning benchmarks and LLMs show that ShadowCoT consistently achieves high Attack Success Rate (94.4%) and Hijacking Success Rate (88.4%) while preserving benign performance. These results reveal an emergent class of cognition-level threats and highlight the urgent need for defenses beyond shallow surface-level consistency.
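The abstract describes the mechanism only at a high level: a condition derived from internal reasoning states gates a small perturbation of intermediate representations, with only about 0.15% of parameters updated. The snippet below is not the authors' implementation; it is a minimal, assumption-based PyTorch sketch of what such a trigger-gated, low-rank hidden-state perturbation could look like, with the trigger detector, adapter rank, and injection layer all invented for illustration.

```python
import torch
import torch.nn as nn

class HiddenStatePerturbation(nn.Module):
    """Illustrative low-rank perturbation of an intermediate representation.

    Hypothetical sketch only: ShadowCoT's actual injection pipeline is not
    reproduced here, so the trigger detector and adapter shapes are assumptions.
    """

    def __init__(self, hidden_size: int, rank: int = 8):
        super().__init__()
        # Low-rank adapter: ~2 * hidden_size * rank parameters, a tiny
        # fraction of the full model (mirroring the "0.15% updated" claim).
        self.down = nn.Linear(hidden_size, rank, bias=False)
        self.up = nn.Linear(rank, hidden_size, bias=False)
        # Toy "reasoning-state" trigger: a learned direction whose activation
        # decides whether the perturbation is applied.
        self.trigger_direction = nn.Parameter(torch.randn(hidden_size))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_size) output of a transformer layer.
        score = torch.sigmoid(hidden @ self.trigger_direction)   # (batch, seq)
        gate = (score > 0.9).float().unsqueeze(-1)               # hard gate
        # Perturb only where the (assumed) trigger fires; benign inputs pass
        # through unchanged, which is what preserves clean-task performance.
        return hidden + gate * self.up(self.down(hidden))

# Usage sketch (names assume a Llama-style Hugging Face causal LM):
# perturb = HiddenStatePerturbation(model.config.hidden_size)
# layer = model.model.layers[k]            # injection layer chosen by attacker
# layer.register_forward_hook(
#     lambda mod, inp, out: (perturb(out[0]),) + out[1:]
# )
```

The low-rank down/up projection mirrors the small parameter budget the paper reports, while the hard gate leaves non-triggered inputs untouched, which is the property that lets a backdoor of this kind preserve benign performance.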
Related papers
- BadThink: Triggered Overthinking Attacks on Chain-of-Thought Reasoning in Large Language Models [24.513640096951566]
We propose BadThink, the first backdoor attack designed to deliberately induce "overthinking" behavior in large language models. When activated by carefully crafted trigger prompts, BadThink manipulates the model to generate inflated reasoning traces. We implement this attack through a sophisticated poisoning-based fine-tuning strategy.
arXiv Detail & Related papers (2025-11-13T13:44:51Z) - Distractor Injection Attacks on Large Reasoning Models: Characterization and Defense [16.519353449118814]
We analyze a critical vulnerability we term reasoning distraction, where LRMs are diverted from their primary objective by irrelevant yet complex tasks maliciously embedded in the prompt. We show that even state-of-the-art LRMs are highly susceptible, with injected distractors reducing task accuracy by up to 60%. We propose a training-based defense that combines Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on synthetic adversarial data, improving robustness by over 50 points on challenging distractor attacks.
arXiv Detail & Related papers (2025-10-17T23:16:34Z) - One Token Embedding Is Enough to Deadlock Your Large Reasoning Model [91.48868589442837]
We present the Deadlock Attack, a resource exhaustion method that hijacks an LRM's generative control flow. Our method achieves a 100% attack success rate across four advanced LRMs.
arXiv Detail & Related papers (2025-10-12T07:42:57Z) - From Poisoned to Aware: Fostering Backdoor Self-Awareness in LLMs [27.723404842086072]
Large Language Models (LLMs) can acquire deceptive behaviors through backdoor attacks. Existing safety training methods fail to address this vulnerability. We propose a novel post-training framework that cultivates self-awareness of backdoor risks.
arXiv Detail & Related papers (2025-10-05T03:55:24Z) - Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs [25.210464491552735]
Large Language Models (LLMs) demonstrate impressive capabilities across a wide range of tasks, yet their safety mechanisms remain susceptible to adversarial attacks. We propose CognitiveAttack, a novel framework that systematically leverages both individual and combined cognitive biases. Experimental results reveal significant vulnerabilities across 30 diverse LLMs, particularly in open-source models.
arXiv Detail & Related papers (2025-07-30T10:40:53Z) - Thought Purity: Defense Paradigm For Chain-of-Thought Attack [14.92561128881555]
We propose Thought Purity, a defense paradigm that strengthens resistance to malicious content while preserving operational efficacy. Our approach establishes the first comprehensive defense mechanism against CoTA vulnerabilities in reinforcement learning-aligned reasoning systems.
arXiv Detail & Related papers (2025-07-16T15:09:13Z) - ICLShield: Exploring and Mitigating In-Context Learning Backdoor Attacks [61.06621533874629]
In-context learning (ICL) has demonstrated remarkable success in large language models (LLMs). In this paper, we propose, for the first time, the dual-learning hypothesis, which posits that LLMs simultaneously learn both the task-relevant latent concepts and backdoor latent concepts. Motivated by these findings, we propose ICLShield, a defense mechanism that dynamically adjusts the concept preference ratio.
arXiv Detail & Related papers (2025-07-02T03:09:20Z) - Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning [87.7836502955847]
We propose a novel self-rewarding reinforcement learning framework to enhance Large Language Model (LLM) reasoning. Our key insight is that correct responses often exhibit consistent trajectory patterns in terms of model likelihood. We introduce CoVo, an intrinsic reward mechanism that integrates Consistency and Volatility via a robust vector-space aggregation strategy.
arXiv Detail & Related papers (2025-06-10T12:40:39Z) - Robust Anti-Backdoor Instruction Tuning in LVLMs [53.766434746801366]
We introduce a lightweight, certified-agnostic defense framework for large visual language models (LVLMs). Our framework finetunes only adapter modules and text embedding layers under instruction tuning. Experiments against seven attacks on Flickr30k and MSCOCO demonstrate that our framework reduces their attack success rate to nearly zero.
arXiv Detail & Related papers (2025-06-04T01:23:35Z) - Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework. It reformulates harmful queries into benign reasoning tasks. We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z) - CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization [7.282200564983221]
Large Language Models (LLMs) are susceptible to backdoor attacks.
We introduce Internal Consistency Regularization (CROW) to address layer-wise inconsistencies caused by backdoor triggers.
CROW consistently achieves significant reductions in attack success rates across diverse backdoor strategies and tasks.
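The CROW summary names its mechanism, regularizing layer-wise consistency of hidden states during fine-tuning, without giving the exact loss. As a rough, hypothetical sketch of that idea (not the paper's formulation), one could penalize abrupt direction changes between consecutive layers' hidden representations:

```python
import torch.nn.functional as F

def layerwise_consistency_loss(hidden_states):
    """Rough sketch of a layer-wise consistency penalty.

    `hidden_states` is the tuple returned by a Hugging Face model called with
    output_hidden_states=True: one (batch, seq, hidden) tensor per layer.
    This is an assumption-based illustration, not CROW's exact objective.
    """
    loss = 0.0
    for prev, curr in zip(hidden_states[:-1], hidden_states[1:]):
        # Penalize abrupt direction changes between consecutive layers,
        # which backdoor triggers tend to induce.
        cos = F.cosine_similarity(prev, curr, dim=-1)     # (batch, seq)
        loss = loss + (1.0 - cos).mean()
    return loss / (len(hidden_states) - 1)

# During defensive fine-tuning this term would be added to the task loss:
# outputs = model(input_ids, labels=labels, output_hidden_states=True)
# total_loss = outputs.loss + lambda_consistency * layerwise_consistency_loss(
#     outputs.hidden_states)
```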
arXiv Detail & Related papers (2024-11-18T07:52:12Z) - CAT: Concept-level backdoor ATtacks for Concept Bottleneck Models [8.236058439213473]
Concept Bottleneck Models (CBMs) have emerged as a key approach to improve interpretability by leveraging high-level semantic information.
CBMs are susceptible to security threats, particularly backdoor attacks, which can covertly manipulate model behaviors.
We introduce CAT (Concept-level Backdoor ATtacks), a methodology that leverages the conceptual representations within CBMs to embed triggers during training.
An enhanced attack pattern, CAT+, incorporates a correlation function to systematically select the most effective and stealthy concept triggers.
arXiv Detail & Related papers (2024-10-07T08:14:17Z) - Celtibero: Robust Layered Aggregation for Federated Learning [0.0]
We introduce Celtibero, a novel defense mechanism that integrates layered aggregation to enhance robustness against adversarial manipulation.
We demonstrate that Celtibero consistently achieves high main task accuracy (MTA) while maintaining minimal attack success rates (ASR) across a range of untargeted and targeted poisoning attacks.
arXiv Detail & Related papers (2024-08-26T12:54:00Z) - DeCE: Deceptive Cross-Entropy Loss Designed for Defending Backdoor Attacks [26.24490960002264]
We propose a general and effective loss function DeCE (Deceptive Cross-Entropy) to enhance the security of Code Language Models.
Our experiments across various code synthesis datasets, models, and poisoning ratios demonstrate the applicability and effectiveness of DeCE.
arXiv Detail & Related papers (2024-07-12T03:18:38Z) - Prompt Leakage effect and defense strategies for multi-turn LLM interactions [95.33778028192593]
Leakage of system prompts may compromise intellectual property and act as adversarial reconnaissance for an attacker.
We design a unique threat model which leverages the LLM sycophancy effect and elevates the average attack success rate (ASR) from 17.7% to 86.2% in a multi-turn setting.
We measure the mitigation effect of 7 black-box defense strategies, along with finetuning an open-source model to defend against leakage attempts.
arXiv Detail & Related papers (2024-04-24T23:39:58Z) - Pre-trained Trojan Attacks for Visual Recognition [106.13792185398863]
Pre-trained vision models (PVMs) have become a dominant component due to their exceptional performance when fine-tuned for downstream tasks.
We propose the Pre-trained Trojan attack, which embeds backdoors into a PVM, enabling attacks across various downstream vision tasks.
We highlight the challenges posed by cross-task activation and shortcut connections in successful backdoor attacks.
arXiv Detail & Related papers (2023-12-23T05:51:40Z) - On the Difficulty of Defending Contrastive Learning against Backdoor Attacks [58.824074124014224]
We show how contrastive backdoor attacks operate through distinctive mechanisms.
Our findings highlight the need for defenses tailored to the specificities of contrastive backdoor attacks.
arXiv Detail & Related papers (2023-12-14T15:54:52Z) - BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning [85.2564206440109]
This paper reveals that, in this practical scenario, backdoor attacks can remain effective even after defenses are applied.
We introduce the BadCLIP attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z) - Policy Smoothing for Provably Robust Reinforcement Learning [109.90239627115336]
We study the provable robustness of reinforcement learning against norm-bounded adversarial perturbations of the inputs.
We generate certificates that guarantee that the total reward obtained by the smoothed policy will not fall below a certain threshold under a norm-bounded adversarial perturbation of the input.
arXiv Detail & Related papers (2021-06-21T21:42:08Z)