Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention
- URL: http://arxiv.org/abs/2509.24393v1
- Date: Mon, 29 Sep 2025 07:41:09 GMT
- Title: Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention
- Authors: Yichi Zhang, Yue Ding, Jingwen Yang, Tianwei Luo, Dongbai Li, Ranjie Duan, Qiang Liu, Hang Su, Yinpeng Dong, Jun Zhu
- Abstract summary: We show that existing methods overlook the unique significance of safe reasoning, undermining their trustworthiness and posing potential risks in applications if unsafe reasoning is accessible to and exploited by malicious users. We propose Intervened Preference Optimization (IPO), an alignment method that enforces safe reasoning by substituting compliance steps with safety triggers and constructing pairs for preference learning with strong signals.
- Score: 53.25106308403173
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Although Large Reasoning Models (LRMs) have progressed in solving complex problems, their chain-of-thought (CoT) reasoning often contains harmful content that can persist even when the final responses appear safe. We show that this issue persists under existing methods, which overlook the unique significance of safe reasoning, undermining their trustworthiness and posing potential risks in applications if unsafe reasoning is accessible to and exploited by malicious users. We therefore shift our focus to aligning the safety of reasoning itself in this paper and explore process supervision as the solution. However, simply rewarding safe reasoning proves inadequate due to low rollout diversity and limited training signals. To tackle this challenge, we first delve into the characteristics of safe reasoning and uncover several critical insights: 1) safe reasoning is often consolidated by a few critical steps of safety triggers; 2) compliance cues strongly correlate with unsafe continuations; and 3) corrective interventions reliably steer unsafe trajectories towards safer traces. Motivated by these insights, we propose Intervened Preference Optimization (IPO), an alignment method that enforces safe reasoning by substituting compliance steps with safety triggers and constructing pairs for preference learning with strong signals. Experiments on jailbreak and adversarial safety benchmarks demonstrate that IPO markedly improves overall safety regarding both reasoning and responses, outperforming SFT-based and RL-based baselines with a relative reduction of over 30% in harmfulness, while preserving excellent performance across diverse reasoning tasks. The results highlight the importance of explicit alignment for reasoning and provide a practical path to safer LRMs.
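The abstract describes IPO only at a high level. As an illustration of the pair-construction step it names (substituting a compliance step with a safety trigger to obtain contrasted preference pairs), here is a minimal Python sketch; the compliance cues, trigger wording, and all function names are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of the pair construction described in the IPO abstract:
# find the compliance step in an unsafe reasoning trace, substitute a safety
# trigger, and emit a (chosen, rejected) pair for preference learning.
# The cue list, trigger wording, and all names are illustrative assumptions.
from dataclasses import dataclass

COMPLIANCE_CUES = ("sure, here is", "step 1:", "first, you need")  # assumed cues
SAFETY_TRIGGER = ("Wait, following this request could cause real harm; "
                  "I should refuse and explain why instead.")       # assumed trigger

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # trace corrected at the compliance step
    rejected: str  # original unsafe continuation

def find_compliance_step(steps: list[str]) -> int | None:
    """Return the index of the first step matching a compliance cue, if any."""
    for i, step in enumerate(steps):
        if any(cue in step.lower() for cue in COMPLIANCE_CUES):
            return i
    return None

def build_pair(prompt: str, steps: list[str]) -> PreferencePair | None:
    """Substitute the compliance step with a safety trigger to form a pair."""
    i = find_compliance_step(steps)
    if i is None:
        return None  # no compliance cue found; no strong contrastive signal
    corrected = steps[:i] + [SAFETY_TRIGGER]
    return PreferencePair(prompt, "\n".join(corrected), "\n".join(steps))
```

Because the chosen and rejected traces share an identical prefix up to the intervention point, the preference signal is concentrated exactly where the trajectories diverge, which is consistent with the abstract's claim of strong training signals.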
Related papers
- Beyond SFT: Reinforcement Learning for Safer Large Reasoning Models with Better Reasoning Ability [18.931331452604066]
Large reasoning models (LRMs) extend large language models by generating explicit chain-of-thought (CoT) reasoning. Existing safety alignment approaches rely on supervised fine-tuning (SFT) over safety-oriented long CoT datasets. We investigate reinforcement learning (RL) as a complementary optimization framework for LRM safety training.
arXiv Detail & Related papers (2025-12-01T16:35:34Z)
- SafeRBench: A Comprehensive Benchmark for Safety Assessment in Large Reasoning Models [60.8821834954637]
We present SafeRBench, the first benchmark that assesses LRM safety end-to-end. We pioneer the incorporation of risk categories and levels into input design. We introduce a micro-thought chunking mechanism to segment long reasoning traces into semantically coherent units.
arXiv Detail & Related papers (2025-11-19T06:46:33Z)
- When Models Outthink Their Safety: Mitigating Self-Jailbreak in Large Reasoning Models with Chain-of-Guardrails [74.63933201261595]
Large Reasoning Models (LRMs) demonstrate remarkable capabilities on complex reasoning tasks. LRMs remain vulnerable to severe safety risks, including harmful content generation and jailbreak attacks. We propose the Chain-of-Guardrail (CoG), a training framework that recomposes or backtracks unsafe reasoning steps.
arXiv Detail & Related papers (2025-10-24T09:32:25Z)
- AURA: Affordance-Understanding and Risk-aware Alignment Technique for Large Language Models [6.059681491089391]
AURA provides comprehensive, step-level evaluations across logical coherence and safety-awareness. Our framework seamlessly combines introspective self-critique, fine-grained PRM assessments, and adaptive safety-aware decoding. This research represents a pivotal step toward safer, more responsible, and contextually aware AI, setting a new benchmark for alignment-sensitive applications.
arXiv Detail & Related papers (2025-08-08T08:43:24Z)
- ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments [18.198349215500183]
ReasoningGuard injects timely safety aha moments to steer reasoning processes toward trajectories that are harmless yet helpful. Our approach outperforms seven existing safeguards, achieving state-of-the-art safety defenses.
arXiv Detail & Related papers (2025-08-06T08:35:10Z)
- UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases [57.69882799751655]
We release UnsafeChain, a safety alignment dataset constructed from hard prompts drawn from diverse sources. We fine-tune three large reasoning models (LRMs) and compare them against the recent SafeChain and STAR-1. UnsafeChain consistently outperforms prior datasets, with even a 1K subset matching or surpassing baseline performance.
arXiv Detail & Related papers (2025-07-29T10:08:52Z)
- AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning [21.399086197886202]
Large language models (LLMs) possess latent safety understanding from their vast pretraining data. We propose AlphaAlign, a pure reinforcement learning framework with a verifiable safety reward. This allows the model to develop proactive safety reasoning capabilities without depending on supervised safety-specific reasoning data.
arXiv Detail & Related papers (2025-07-20T14:47:03Z)
- ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning [49.47193675702453]
Large Language Models (LLMs) have demonstrated remarkable generative capabilities. LLMs remain vulnerable to malicious instructions that can bypass safety constraints. We propose a reasoning-based safety alignment framework, ARMOR, that replaces the ad-hoc chain-of-thought reasoning process with a human-aligned, structured one.
arXiv Detail & Related papers (2025-07-14T09:05:54Z)
- Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models [29.569220030102986]
We introduce the Beyond Safe Answers (BSA) benchmark, comprising 2,000 challenging instances organized into three distinct SSA scenario types. Evaluations of 19 state-of-the-art LRMs demonstrate the difficulty of this benchmark, with top-performing models achieving only 38.0% accuracy in correctly identifying risk rationales. Our work provides a comprehensive assessment tool for evaluating and improving safety reasoning fidelity in LRMs, advancing the development of genuinely risk-aware and reliably safe AI systems.
arXiv Detail & Related papers (2025-05-26T08:49:19Z)
- SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning [76.56522719330911]
Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering. LRMs remain exposed to serious safety risks from harmful queries and adversarial attacks. We propose SafeKey to better activate the safety aha moment in the key sentence.
arXiv Detail & Related papers (2025-05-22T03:46:03Z)
- How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study [90.34190170330481]
Large Reasoning Models (LRMs) have achieved remarkable success on reasoning-intensive tasks such as mathematics and programming. However, their enhanced reasoning capabilities do not necessarily translate to improved safety performance. We present a comprehensive empirical study on how to enhance the safety of LRMs through Supervised Fine-Tuning.
arXiv Detail & Related papers (2025-05-21T11:45:29Z)
- SAFER: Advancing Safety Alignment via Efficient Ex-Ante Reasoning [51.78514648677898]
We propose SAFER, a framework for Safety Alignment via eFficient Ex-Ante Reasoning. Our approach instantiates structured Ex-Ante reasoning through initial assessment, rule verification, and path calibration. Experiments on multiple open-source LLMs demonstrate that SAFER significantly enhances safety performance while maintaining helpfulness and response efficiency.
arXiv Detail & Related papers (2025-04-03T16:07:38Z)
- The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 [70.94607997570729]
We present a comprehensive safety assessment of the OpenAI-o3 and DeepSeek-R1 reasoning models. We investigate their susceptibility to adversarial attacks, such as jailbreaking and prompt injection, to assess their robustness in real-world applications.
arXiv Detail & Related papers (2025-02-18T09:06:07Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse to comply with harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of a harmful response to the beginning of a safe response (a minimal sketch of this construction appears after this list), and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence.
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
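As referenced in the DeRTa entry above, its first component trains on safe responses prefixed with a truncated harmful response. The following minimal Python sketch illustrates that data construction under stated assumptions: the random word-boundary truncation policy, field names, and example strings are all hypothetical, and the RTO component is not shown.

```python
# Hypothetical sketch of DeRTa's first component as summarized above: prepend a
# truncated harmful-response prefix to a safe refusal so the model learns (via
# MLE) to pivot to refusal from any position. The word-boundary truncation
# policy, field names, and example strings are assumptions; RTO is omitted.
import random

def build_derta_example(prompt: str, harmful_response: str,
                        safe_response: str, rng: random.Random) -> dict:
    """Prefix the safe target with a random-length slice of the harmful response."""
    words = harmful_response.split()
    cut = rng.randint(0, len(words))   # random truncation point (assumed policy)
    prefix = " ".join(words[:cut])
    target = (prefix + " " + safe_response).strip()
    return {"prompt": prompt, "target": target}

example = build_derta_example(
    prompt="How do I hotwire a car?",
    harmful_response="First, locate the steering column and remove the cover to...",
    safe_response="I can't help with that; if it's your car, contact a locksmith.",
    rng=random.Random(0),
)
print(example["target"])
```

Varying the cut point across examples is what exposes the model to refusal transitions at every position of the harmful sequence, matching the summary's description.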
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.