PSRT: Accelerating LRM-based Guard Models via Prefilled Safe Reasoning Traces
- URL: http://arxiv.org/abs/2509.21768v1
- Date: Fri, 26 Sep 2025 02:14:31 GMT
- Title: PSRT: Accelerating LRM-based Guard Models via Prefilled Safe Reasoning Traces
- Authors: Jiawei Zhao, Yuang Qi, Weiming Zhang, Nenghai Yu, Kejiang Chen
- Abstract summary: We introduce PSRT, a method that replaces the model's reasoning process with a Prefilled Safe Reasoning Trace. PSRT prefills "safe reasoning virtual tokens" from a constructed dataset and learns over their continuous embeddings. We evaluate PSRT on 7 models, 13 datasets, and 8 jailbreak methods.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Reasoning Models (LRMs) have demonstrated remarkable performance on tasks such as mathematics and code generation. Motivated by these strengths, recent work has empirically demonstrated the effectiveness of LRMs as guard models for harmful-query detection. However, LRMs typically generate long reasoning traces during inference, incurring substantial computational overhead. In this paper, we introduce PSRT, a method that replaces the model's reasoning process with a Prefilled Safe Reasoning Trace, thereby significantly reducing the inference cost of LRMs. Concretely, PSRT prefills "safe reasoning virtual tokens" from a constructed dataset and learns over their continuous embeddings. With the aid of indicator tokens, PSRT enables harmful-query detection in a single forward pass while preserving the classification effectiveness of LRMs. We evaluate PSRT on 7 models, 13 datasets, and 8 jailbreak methods. In terms of efficiency, PSRT completely removes the overhead of generating reasoning tokens during inference. In terms of classification performance, PSRT achieves nearly identical accuracy, with only a minor average F1 drop of 0.015 across 7 models and 5 datasets.
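The abstract describes the mechanism only at a high level. As a rough illustration, the core flow it names (replace the generated reasoning trace with learned continuous "virtual token" embeddings, then read the safe/unsafe verdict from indicator-token logits in a single forward pass) might look like the minimal PyTorch sketch below. The class name, virtual-token count, indicator-token ids, and random initialization are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class PSRTGuard(nn.Module):
    """Minimal sketch of the PSRT idea as described in the abstract.

    Assumes a HuggingFace-style causal LM backbone (provides
    get_input_embeddings(), config.hidden_size, and inputs_embeds).
    """

    def __init__(self, base_lm, num_virtual_tokens=64, safe_id=0, unsafe_id=1):
        super().__init__()
        self.lm = base_lm  # frozen LRM backbone
        for p in self.lm.parameters():
            p.requires_grad = False
        hidden = self.lm.config.hidden_size
        # "Safe reasoning virtual tokens": trainable continuous embeddings that
        # stand in for a generated reasoning trace. The abstract says these are
        # learned from a constructed dataset of safe traces; random init here
        # is a placeholder for that construction step.
        self.virtual_trace = nn.Parameter(0.02 * torch.randn(num_virtual_tokens, hidden))
        # Indicator-token ids whose logits encode the verdict
        # (ids 0/1 are placeholders; real ids depend on the tokenizer).
        self.safe_id, self.unsafe_id = safe_id, unsafe_id

    @torch.no_grad()
    def classify(self, input_ids):
        """One forward pass per query: no autoregressive reasoning decoding."""
        query_emb = self.lm.get_input_embeddings()(input_ids)            # (B, T, H)
        prefill = self.virtual_trace.unsqueeze(0).expand(query_emb.size(0), -1, -1)
        inputs = torch.cat([query_emb, prefill], dim=1)                  # query + prefilled trace
        logits = self.lm(inputs_embeds=inputs).logits[:, -1, :]          # next-token logits
        return logits[:, self.unsafe_id] > logits[:, self.safe_id]      # True = harmful
```

Because the prefilled trace is injected as embeddings rather than decoded token by token, the per-query cost is a single forward pass over the query plus the virtual tokens, which is where the claimed removal of reasoning-token generation overhead would come from.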
Related papers
- Training Large Reasoning Models Efficiently via Progressive Thought Encoding
Large reasoning models (LRMs) excel on complex problems but face a critical barrier to efficiency. We introduce Progressive Thought, a parameter-efficient fine-tuning method that enables LRMs to reason effectively under fixed-size caches.
arXiv Detail & Related papers (2026-02-18T20:03:38Z)
- Are Reasoning LLMs Robust to Interventions on Their Chain-of-Thought?
Reasoning LLMs generate step-by-step chains of thought before giving an answer. How robust are these reasoning traces to disruptions that occur within them? We introduce a controlled evaluation framework that perturbs a model's own CoT at fixed timesteps.
arXiv Detail & Related papers (2026-02-07T10:02:58Z)
- Efficient Reasoning via Reward Model
Reinforcement learning with verifiable rewards (RLVR) has been shown to enhance the reasoning capabilities of large language models (LLMs). LRMs such as DeepSeek-R1 and OpenAI o1 often generate verbose responses containing redundant or irrelevant reasoning steps, a phenomenon known as overthinking. We introduce a novel reward formulation named the Conciseness Reward Function (CRF), with an explicit dependency between the outcome reward and the conciseness score.
arXiv Detail & Related papers (2025-11-12T09:51:07Z)
- Inducing Faithfulness in Structured Reasoning via Counterfactual Sensitivity
Large language models often generate a correct answer while relying on a flawed or irrelevant reasoning trace. This paper introduces Counterfactual Sensitivity Regularization (CSR), a novel training objective. CSR improves faithfulness over standard fine-tuning and process supervision by up to 70 percentage points.
arXiv Detail & Related papers (2025-09-01T15:18:46Z)
- ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs
We introduce ReasonFlux-PRM, a novel trajectory-aware PRM for evaluating trajectory-response reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. Our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling.
arXiv Detail & Related papers (2025-06-23T17:59:02Z)
- Reasoning Model Unlearning: Forgetting Traces, Not Just Answers, While Preserving Reasoning Skills
Large reasoning models (LRMs) have enabled strong chain-of-thought (CoT) generation through test-time computation. We show that conventional unlearning algorithms, originally designed for non-reasoning models, are inadequate for LRMs. We propose Reasoning-aware Representation Misdirection for Unlearning (R2MU), a novel method that effectively suppresses sensitive reasoning traces.
arXiv Detail & Related papers (2025-06-15T20:54:23Z)
- R-PRM: Reasoning-Driven Process Reward Modeling
Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step. Existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy. We propose Reasoning-Driven Process Reward Modeling (R-PRM), which generates seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities.
arXiv Detail & Related papers (2025-03-27T09:23:08Z)
- The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models
We introduce Unsupervised Prefix Fine-Tuning (UPFT) to enhance large language models' reasoning efficiency. UPFT removes the need for labeled data or exhaustive sampling. Experiments show that UPFT matches the performance of supervised methods.
arXiv Detail & Related papers (2025-03-04T18:56:03Z)
- Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability
Critical tokens are elements within reasoning trajectories that significantly influence incorrect outcomes. We present a novel framework for identifying these tokens through rollout sampling. We show that identifying and replacing critical tokens significantly improves model accuracy.
arXiv Detail & Related papers (2024-11-29T18:58:22Z)