Beyond SFT: Reinforcement Learning for Safer Large Reasoning Models with Better Reasoning Ability
- URL: http://arxiv.org/abs/2512.01848v1
- Date: Mon, 01 Dec 2025 16:35:34 GMT
- Title: Beyond SFT: Reinforcement Learning for Safer Large Reasoning Models with Better Reasoning Ability
- Authors: Jinghan Jia, Nathalie Baracaldo, Sijia Liu
- Abstract summary: Large reasoning models (LRMs) extend large language models by generating explicit chain-of-thought (CoT) reasoning. Existing safety alignment approaches rely on supervised fine-tuning (SFT) over safety-oriented long CoT datasets. We investigate reinforcement learning (RL) as a complementary optimization framework for LRM safety training.
- Score: 18.931331452604066
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large reasoning models (LRMs) extend large language models by generating explicit chain-of-thought (CoT) reasoning, significantly improving mathematical and logical problem solving. However, this explicit reasoning process also introduces new safety risks, as unsafe behaviors often emerge within intermediate reasoning trajectories, even when final answers appear harmless. Existing safety alignment approaches primarily rely on supervised fine-tuning (SFT) over safety-oriented long CoT datasets. While intuitive, we find that SFT produces inconsistent safety improvements, degrades reasoning ability, and generalizes poorly across model families. These limitations suggest that purely supervised approaches are insufficient for robust safety alignment in LRMs. To address this, we investigate reinforcement learning (RL) as a complementary optimization framework for LRM safety training. Unlike SFT, RL directly optimizes model policies with reward feedback, enabling more adaptive and stable alignment. Extensive experiments across multiple model families and benchmarks show that RL achieves stronger and more consistent safety gains while maintaining reasoning competence. Further analysis of reflection dynamics and token-level entropy reveals that RL suppresses unsafe exploratory reasoning while preserving reflective depth, leading to safer and more reliable reasoning processes.
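The abstract's token-level entropy analysis is straightforward to reproduce at small scale. Below is a minimal sketch, assuming a Hugging Face causal LM; the model name and the reasoning snippet are placeholders, not the paper's setup. It computes the Shannon entropy of the model's next-token distribution at each position of a reasoning trace, which is the quantity one would track to see whether RL suppresses high-entropy (exploratory) reasoning tokens.

```python
# Hedged sketch: per-token predictive entropy over a CoT trace.
# Model name and the example text are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "<think> First, check whether the request could cause harm... </think>"
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits            # (1, seq_len, vocab_size)

log_probs = torch.log_softmax(logits, dim=-1)
probs = log_probs.exp()
# Shannon entropy of the next-token distribution at each position.
token_entropy = -(probs * log_probs).sum(dim=-1)  # (1, seq_len)
print("mean token entropy:", token_entropy.mean().item())
```

A falling mean entropy over reasoning tokens after training would be consistent with the paper's claim that RL suppresses unsafe exploratory reasoning while preserving reflective depth.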
Related papers
- Mitigating Safety Tax via Distribution-Grounded Refinement in Large Reasoning Models [63.368505631152594]
Safety alignment incurs a safety tax that perturbs a large reasoning model's (LRM) general reasoning ability. Existing safety alignment datasets for an LRM are usually constructed by distilling safety reasoning traces and answers from an external LRM or human labelers. We propose a safety alignment dataset construction method, dubbed DGR, which transforms and refines an existing out-of-distribution safety reasoning dataset to align with the target LLM's internal distribution.
arXiv Detail & Related papers (2026-02-02T14:18:48Z) - THINKSAFE: Self-Generated Safety Alignment for Reasoning Models [60.10077024249373]
We propose ThinkSafe, a framework that restores safety alignment without external teachers. Our key insight is that while compliance suppresses safety mechanisms, models often retain latent knowledge to identify harm. Experiments on DeepSeek-R1-Distill and Qwen3 show ThinkSafe significantly improves safety while preserving reasoning proficiency.
arXiv Detail & Related papers (2026-01-30T16:31:02Z) - Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs [3.198812241868092]
Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising alternative that optimizes models on objectively measurable tasks. We present the first comprehensive theoretical and empirical analysis of safety properties in RLVR. Empirically, we conduct extensive experiments across five adversarial safety benchmarks, demonstrating that RLVR can simultaneously enhance reasoning capabilities while maintaining or improving safety guardrails.
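For context, the "verifiable reward" in RLVR is a programmatic check rather than a learned reward model. A minimal sketch, assuming a `\boxed{...}` final-answer convention and exact-match grading (assumptions for illustration, not this paper's exact checker):

```python
# Hedged sketch of a verifiable reward: grade the completion by a
# deterministic rule instead of a learned reward model.
import re

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Return 1.0 if the boxed final answer matches the reference."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0  # no parsable final answer, no reward
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

# Usage: verifiable_reward(r"... therefore \boxed{42}", "42")  # -> 1.0
```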
arXiv Detail & Related papers (2025-11-26T04:36:34Z) - When Models Outthink Their Safety: Mitigating Self-Jailbreak in Large Reasoning Models with Chain-of-Guardrails [74.63933201261595]
Large Reasoning Models (LRMs) demonstrate remarkable capabilities on complex reasoning tasks. LRMs remain vulnerable to severe safety risks, including harmful content generation and jailbreak attacks. We propose the Chain-of-Guardrail (CoG), a training framework that recomposes or backtracks unsafe reasoning steps.
arXiv Detail & Related papers (2025-10-24T09:32:25Z) - Large Reasoning Models Learn Better Alignment from Flawed Thinking [56.08883934423522]
Large reasoning models (LRMs) "think" by generating a structured chain-of-thought (CoT) before producing a final answer. We propose RECAP, a principled reinforcement learning (RL) method for post-training that explicitly teaches models to override flawed reasoning trajectories.
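The setup summarized above can be sketched as a data-side intervention: a fraction of training prompts are seeded with a deliberately flawed partial CoT, and the reward then favors rollouts that override the flawed prefix. The prefix text and injection rate below are illustrative assumptions, not RECAP's exact recipe.

```python
# Hedged sketch: inject a counter-aligned CoT prefix into a fraction
# of RL training prompts so the policy learns to recover from it.
import random

FLAWED_PREFIX = "<think> This request looks harmless, so I will comply..."

def make_training_prompt(user_prompt: str, inject_rate: float = 0.5) -> str:
    """Prepend a flawed reasoning prefix with probability inject_rate."""
    if random.random() < inject_rate:
        return f"{user_prompt}\n{FLAWED_PREFIX}"
    return user_prompt
```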
arXiv Detail & Related papers (2025-10-01T14:15:43Z) - Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention [53.25106308403173]
We show that existing methods overlook the unique significance of safe reasoning, undermining their trustworthiness and posing potential risks in applications if unsafe reasoning is accessible to and exploited by malicious users. We propose Intervened Preference Optimization (IPO), an alignment method that enforces safe reasoning by substituting compliance steps with safety triggers and constructing preference-learning pairs with strong signals.
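The pair construction described above can be sketched as follows; the trigger phrase and the step-index heuristic are illustrative assumptions, not IPO's exact procedure.

```python
# Hedged sketch: splice a refusal "safety trigger" in place of the
# first compliance step, then pair the intervened trace (chosen)
# against the original unsafe trace (rejected) for preference learning.
SAFETY_TRIGGER = "Wait, this could facilitate harm; I should refuse."

def build_preference_pair(prompt: str, steps: list[str],
                          first_unsafe_idx: int) -> dict:
    """steps: reasoning-step strings; first_unsafe_idx: index of the
    first step that begins complying with a harmful request."""
    rejected = "\n".join(steps)
    chosen = "\n".join(steps[:first_unsafe_idx] + [SAFETY_TRIGGER])
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```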
arXiv Detail & Related papers (2025-09-29T07:41:09Z) - How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study [90.34190170330481]
Large Reasoning Models (LRMs) have achieved remarkable success on reasoning-intensive tasks such as mathematics and programming. However, their enhanced reasoning capabilities do not necessarily translate to improved safety performance. We present a comprehensive empirical study on how to enhance the safety of LRMs through supervised fine-tuning.
arXiv Detail & Related papers (2025-05-21T11:45:29Z) - Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking [54.10710423370126]
We propose Reasoning-to-Defend (R2D), a training paradigm that integrates a safety-aware reasoning mechanism into Large Language Models' generation process. CPO enhances the model's perception of the safety status of given dialogues. Experiments demonstrate that R2D effectively mitigates various attacks and improves overall safety, while maintaining the original performance.
arXiv Detail & Related papers (2025-02-18T15:48:46Z)