STAR: Detecting Inference-time Backdoors in LLM Reasoning via State-Transition Amplification Ratio
- URL: http://arxiv.org/abs/2601.08511v1
- Date: Tue, 13 Jan 2026 12:51:13 GMT
- Authors: Seong-Gyu Park, Sohee Park, Jisu Lee, Hyunsik Na, Daeseon Choi
- Abstract summary: We propose STAR (State-Transition Amplification Ratio), a framework that detects backdoors by analyzing output probability shifts. We quantify this state-transition amplification and employ the CUSUM algorithm to detect persistent anomalies. Experiments across diverse models and five benchmark datasets demonstrate that STAR exhibits robust generalization capabilities.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent LLMs increasingly integrate reasoning mechanisms like Chain-of-Thought (CoT). However, this explicit reasoning exposes a new attack surface for inference-time backdoors, which inject malicious reasoning paths without altering model parameters. Because these attacks generate linguistically coherent paths, they effectively evade conventional detection. To address this, we propose STAR (State-Transition Amplification Ratio), a framework that detects backdoors by analyzing output probability shifts. STAR exploits the statistical discrepancy where a malicious input-induced path exhibits high posterior probability despite a low prior probability in the model's general knowledge. We quantify this state-transition amplification and employ the CUSUM algorithm to detect persistent anomalies. Experiments across diverse models (8B-70B) and five benchmark datasets demonstrate that STAR exhibits robust generalization capabilities, consistently achieving near-perfect performance (AUROC $\approx$ 1.0) with approximately $42\times$ greater efficiency than existing baselines. Furthermore, the framework proves robust against adaptive attacks attempting to bypass detection.
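The abstract's detection idea can be sketched in a few lines: score each reasoning step by the log-ratio of its posterior probability (given the actual input) to its prior probability (under the model's general knowledge), then run a one-sided CUSUM over those scores to flag a persistent upward shift. The sketch below is illustrative, not the authors' implementation: the probability inputs, drift, and threshold values are all assumptions, and in practice the probabilities would come from the LLM's token likelihoods.

```python
import math

def amplification_score(posterior_p, prior_p, eps=1e-12):
    """Illustrative state-transition amplification: log-ratio of the
    step's posterior probability to its prior probability. Both
    probabilities are assumed to be given; eps guards against log(0)."""
    return math.log(posterior_p + eps) - math.log(prior_p + eps)

def cusum_detect(scores, drift=0.5, threshold=5.0):
    """One-sided CUSUM: accumulate positive deviations of `scores`
    above `drift` and return the first index where the running sum
    exceeds `threshold`, or None if no alarm fires."""
    s = 0.0
    for i, x in enumerate(scores):
        s = max(0.0, s + x - drift)  # only upward shifts accumulate
        if s > threshold:
            return i
    return None

# Benign steps: posterior roughly matches prior, so scores stay near 0.
benign = [amplification_score(0.4, 0.35) for _ in range(20)]
# Backdoored steps: high posterior despite a low prior, so scores spike.
attacked = benign[:10] + [amplification_score(0.9, 0.01) for _ in range(10)]

print(cusum_detect(benign))    # no persistent shift, so no alarm
print(cusum_detect(attacked))  # alarm fires shortly after step 10
```

The `drift` term makes the detector ignore isolated noisy steps; only a sustained run of high-amplification transitions, as a persistent malicious reasoning path would produce, drives the cumulative sum past the threshold.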
Related papers
- TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models [19.148124494194317]
We propose TraceGuard, a process-guided security framework that transforms small-scale models into robust reasoning firewalls. Our approach treats the reasoning trace as an untrusted payload and establishes a defense-in-depth strategy. We demonstrate robustness against adaptive adversaries in a grey-box setting, establishing TraceGuard as a viable, low-latency security primitive.
arXiv Detail & Related papers (2026-03-02T22:19:13Z) - Amortized Reasoning Tree Search: Decoupling Proposal and Decision in Large Language Models [2.5170433424424874]
Reinforcement Learning with Verifiable Rewards has established itself as the dominant paradigm for instilling rigorous reasoning capabilities in Large Language Models. We identify a critical pathology in this alignment process: the systematic suppression of valid but rare (low-likelihood under the base model distribution) reasoning paths. We propose Amortized Reasoning Tree Search (ARTS) to counteract this collapse without discarding the base model's latent diversity.
arXiv Detail & Related papers (2026-02-13T11:52:50Z) - STAR : Bridging Statistical and Agentic Reasoning for Large Model Performance Prediction [78.0692157478247]
We propose STAR, a framework that bridges data-driven STatistical expectations with knowledge-driven Agentic Reasoning. We show that STAR consistently outperforms all baselines on both score-based and rank-based metrics.
arXiv Detail & Related papers (2026-02-12T16:30:07Z) - Mitigating Cognitive Inertia in Large Reasoning Models via Latent Spike Steering [12.332146893333949]
Large Reasoning Models (LRMs) have achieved remarkable performance by scaling test-time compute. LRMs frequently suffer from Cognitive Inertia, a failure pattern manifesting as either overthinking (inertia of motion) or rigidity (inertia of direction).
arXiv Detail & Related papers (2026-01-30T02:47:12Z) - CS-GBA: A Critical Sample-based Gradient-guided Backdoor Attack for Offline Reinforcement Learning [7.5200963577855875]
Offline Reinforcement Learning (RL) enables policy optimization from static datasets but is inherently vulnerable to backdoor attacks. We propose CS-GBA (Critical Sample-based Gradient-guided Backdoor Attack), a novel framework designed to achieve high stealthiness and destructiveness under a strict budget.
arXiv Detail & Related papers (2026-01-15T13:57:52Z) - Reflective Confidence: Correcting Reasoning Flaws via Online Self-Correction [14.164508061248775]
Large language models (LLMs) have achieved strong performance on complex reasoning tasks using techniques such as chain-of-thought and self-consistency. We propose reflective confidence, a novel reasoning framework that transforms low-confidence signals from termination indicators into reflection triggers. Experiments on mathematical reasoning benchmarks, including AIME 2025, demonstrate significant accuracy improvements over advanced early-stopping baselines at comparable computational cost.
arXiv Detail & Related papers (2025-12-21T05:35:07Z) - BadThink: Triggered Overthinking Attacks on Chain-of-Thought Reasoning in Large Language Models [24.513640096951566]
We propose BadThink, the first backdoor attack designed to deliberately induce "overthinking" behavior in large language models. When activated by carefully crafted trigger prompts, BadThink manipulates the model to generate inflated reasoning traces. We implement this attack through a sophisticated poisoning-based fine-tuning strategy.
arXiv Detail & Related papers (2025-11-13T13:44:51Z) - Efficient Thought Space Exploration through Strategic Intervention [54.35208611253168]
We propose a novel Hint-Practice Reasoning (HPR) framework that operationalizes this insight through two synergistic components. The framework's core innovation lies in Distributional Inconsistency Reduction (DIR), which dynamically identifies intervention points. Experiments across arithmetic and commonsense reasoning benchmarks demonstrate HPR's state-of-the-art efficiency-accuracy tradeoffs.
arXiv Detail & Related papers (2025-11-13T07:26:01Z) - Backdoor Cleaning without External Guidance in MLLM Fine-tuning [76.82121084745785]
Believe Your Eyes (BYE) is a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples. It achieves near-zero attack success rates while maintaining clean-task performance.
arXiv Detail & Related papers (2025-05-22T17:11:58Z) - Lie Detector: Unified Backdoor Detection via Cross-Examination Framework [68.45399098884364]
We propose a unified backdoor detection framework in the semi-honest setting. Our method achieves superior detection performance, improving accuracy by 5.4%, 1.6%, and 11.9% over SoTA baselines. Notably, it is the first to effectively detect backdoors in multimodal large language models.
arXiv Detail & Related papers (2025-03-21T06:12:06Z) - Transferable Adversarial Attacks on SAM and Its Downstream Models [87.23908485521439]
This paper explores the feasibility of adversarial attacks on various downstream models fine-tuned from the segment anything model (SAM). To enhance the effectiveness of adversarial attacks against models fine-tuned on unknown datasets, we propose a universal meta-initialization (UMI) algorithm.
arXiv Detail & Related papers (2024-10-26T15:04:04Z) - TERD: A Unified Framework for Safeguarding Diffusion Models Against Backdoors [36.07978634674072]
Diffusion models are vulnerable to backdoor attacks that compromise their integrity.
We propose TERD, a backdoor defense framework that builds unified modeling for current attacks.
TERD secures a 100% True Positive Rate (TPR) and True Negative Rate (TNR) across datasets of varying resolutions.
arXiv Detail & Related papers (2024-09-09T03:02:16Z) - CC-Cert: A Probabilistic Approach to Certify General Robustness of Neural Networks [58.29502185344086]
In safety-critical machine learning applications, it is crucial to defend models against adversarial attacks.
It is important to provide provable guarantees for deep learning models against semantically meaningful input transformations.
We propose a new universal probabilistic certification approach based on Chernoff-Cramer bounds.
arXiv Detail & Related papers (2021-09-22T12:46:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.