TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models
- URL: http://arxiv.org/abs/2603.02436v1
- Date: Mon, 02 Mar 2026 22:19:13 GMT
- Title: TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models
- Authors: Zhen Guo, Shanghao Shi, Hao Li, Shamim Yazdani, Ning Zhang, Reza Tourani
- Abstract summary: We propose TraceGuard, a process-guided security framework that transforms small-scale models into robust reasoning firewalls. Our approach treats the reasoning trace as an untrusted payload and establishes a defense-in-depth strategy. We demonstrate robustness against adaptive adversaries in a grey-box setting, establishing TraceGuard as a viable, low-latency security primitive.
- Score: 19.148124494194317
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The deployment of Large Reasoning Models (LRMs) in high-stakes decision-making pipelines has introduced a novel and opaque attack surface: reasoning backdoors. In these attacks, the model's intermediate Chain-of-Thought (CoT) is manipulated to provide a linguistically plausible but logically fallacious justification for a malicious conclusion. While frontier models exhibit an intrinsic capacity to detect these fractures, compact, deployable models suffer from a fundamental verification gap, relying on fragile lexical heuristics that are easily bypassed by motivated adversaries. To bridge this gap, we propose TraceGuard, a process-guided security framework that transforms small-scale models into robust reasoning firewalls. Our approach treats the reasoning trace as an untrusted payload and establishes a defense-in-depth strategy through three synergistic phases: (1) Automated Forensic Synthesis, which generates contrastive reasoning pairs to isolate the specific logical point of fracture; (2) Step-Aware Supervised Fine-Tuning (SSFT), to instill a structural verification grammar; and (3) Verifier-Guided Reinforcement Learning (VGRL), utilizing Group Relative Policy Optimization. We identify and mitigate a critical failure mode of baseline alignment - lexical overfitting - whereby verifiers memorize adversarial triggers rather than auditing logical integrity. Our empirical evaluation demonstrates that TraceGuard acts as a security force multiplier: a 4B-parameter verifier achieves forensic precision on unseen attacks - including latent backdoors and post-hoc rationalizations - that rivals architectures two orders of magnitude larger. We further demonstrate robustness against adaptive adversaries in a grey-box setting, establishing TraceGuard as a viable, low-latency security primitive for the Trusted Computing Base.
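For context, the VGRL phase relies on Group Relative Policy Optimization (GRPO), which replaces a learned value critic with rewards normalized within a group of sampled rollouts. A minimal sketch of that group-relative advantage computation, with hypothetical function names and a hypothetical reward scheme (the paper does not publish this code):

```python
# Illustrative sketch of GRPO's group-relative advantage step.
# All names and the reward scheme below are assumptions for exposition.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group:
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled verifier verdicts for one reasoning trace,
# rewarded 1.0 when the true point of fracture is flagged, else 0.0.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because advantages are centered within each group, rollouts that beat their siblings are reinforced and the rest are suppressed, without training a separate value model.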
Related papers
- The Semantic Trap: Do Fine-tuned LLMs Learn Vulnerability Root Cause or Just Functional Pattern? [14.472036099680961]
We propose TrapEval, a comprehensive evaluation framework designed to disentangle vulnerability root cause from functional pattern. We fine-tune five representative state-of-the-art LLMs across three model families and evaluate them under cross-dataset testing, semantic-preserving transformations, and varying degrees of semantic gap measured by CodeBLEU. Our findings serve as a wake-up call: high benchmark scores on traditional datasets may be illusory, masking the model's inability to understand the true causal logic of vulnerabilities.
arXiv Detail & Related papers (2026-01-30T07:19:17Z) - CoG: Controllable Graph Reasoning via Relational Blueprints and Failure-Aware Refinement over Knowledge Graphs [53.199517625701475]
CoG is a training-free framework inspired by Dual-Process Theory that mimics the interplay between intuition and deliberation. CoG significantly outperforms state-of-the-art approaches in both accuracy and efficiency.
arXiv Detail & Related papers (2026-01-16T07:27:40Z) - Countermind: A Multi-Layered Security Architecture for Large Language Models [0.0]
This paper proposes Countermind, a multi-layered security architecture intended to shift defenses from a reactive, post hoc posture to a proactive, pre-inference, and intra-inference enforcement model. The architecture proposes a fortified perimeter designed to structurally validate and transform all inputs, and an internal governance mechanism intended to constrain the model's semantic processing pathways before an output is generated.
arXiv Detail & Related papers (2025-10-13T18:41:18Z) - Building a Foundational Guardrail for General Agentic Systems via Synthetic Data [76.18834864749606]
Since LLM agents can plan multi-step tasks, intervening at the planning stage, before any action is executed, is often the safest way to prevent harm. Existing guardrails mostly operate post-execution, which is difficult to scale and leaves little room for controllable supervision at the plan level. We introduce AuraGen, a controllable engine that synthesizes benign trajectories, injects category-labeled risks with difficulty levels, and filters outputs via an automated reward model.
arXiv Detail & Related papers (2025-10-10T18:42:32Z) - DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z) - AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models [62.70575022567081]
We propose AdvChain, an alignment paradigm that teaches models dynamic self-correction through adversarial CoT tuning. Our work establishes a new direction for building more robust and reliable reasoning models.
arXiv Detail & Related papers (2025-09-29T04:27:23Z) - LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation [4.29885665563186]
LatentGuard is a framework that combines behavioral alignment with supervised latent space control for interpretable and precise safety steering. Our results show significant improvements in both safety controllability and response interpretability without compromising utility.
arXiv Detail & Related papers (2025-09-24T07:31:54Z) - MirGuard: Towards a Robust Provenance-based Intrusion Detection System Against Graph Manipulation Attacks [13.92935628832727]
MirGuard is an anomaly detection framework that combines logic-aware multi-view augmentation with contrastive representation learning. MirGuard significantly outperforms state-of-the-art detectors in robustness against various graph manipulation attacks.
arXiv Detail & Related papers (2025-08-14T13:35:51Z) - Thought Purity: A Defense Framework For Chain-of-Thought Attack [16.56580534764132]
We propose Thought Purity, a framework that strengthens resistance to malicious content while preserving operational efficacy. Our approach establishes the first comprehensive defense mechanism against CoTA vulnerabilities in reinforcement learning-aligned reasoning systems.
arXiv Detail & Related papers (2025-07-16T15:09:13Z) - Lie Detector: Unified Backdoor Detection via Cross-Examination Framework [68.45399098884364]
We propose a unified backdoor detection framework in the semi-honest setting. Our method achieves superior detection performance, improving accuracy by 5.4%, 1.6%, and 11.9% over SoTA baselines. Notably, it is the first to effectively detect backdoors in multimodal large language models.
arXiv Detail & Related papers (2025-03-21T06:12:06Z) - Spatial-Frequency Discriminability for Revealing Adversarial Perturbations [53.279716307171604]
Vulnerability of deep neural networks to adversarial perturbations has been widely perceived in the computer vision community.
Current algorithms typically detect adversarial patterns through discriminative decomposition for natural and adversarial data.
We propose a discriminative detector relying on a spatial-frequency Krawtchouk decomposition.
arXiv Detail & Related papers (2023-05-18T10:18:59Z) - A Self-supervised Approach for Adversarial Robustness [105.88250594033053]
Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNN) based vision systems.
This paper proposes a self-supervised adversarial training mechanism in the input space.
It provides significant robustness against unseen adversarial attacks.
arXiv Detail & Related papers (2020-06-08T20:42:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.