THINKSAFE: Self-Generated Safety Alignment for Reasoning Models
- URL: http://arxiv.org/abs/2601.23143v1
- Date: Fri, 30 Jan 2026 16:31:02 GMT
- Title: THINKSAFE: Self-Generated Safety Alignment for Reasoning Models
- Authors: Seanie Lee, Sangwoo Park, Yumin Choi, Gyeongman Kim, Minki Kang, Jihun Yun, Dongmin Park, Jongho Park, Sung Ju Hwang
- Abstract summary: We propose ThinkSafe, a framework that restores safety alignment without external teachers. Our key insight is that while compliance suppresses safety mechanisms, models often retain latent knowledge to identify harm. Experiments on DeepSeek-R1-Distill and Qwen3 show ThinkSafe significantly improves safety while preserving reasoning proficiency.
- Score: 60.10077024249373
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning. However, this over-optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this safety degradation, recent approaches rely on external teacher distillation, yet this introduces a distributional discrepancy that degrades native reasoning. We propose ThinkSafe, a self-generated alignment framework that restores safety alignment without external teachers. Our key insight is that while compliance suppresses safety mechanisms, models often retain latent knowledge to identify harm. ThinkSafe unlocks this via lightweight refusal steering, guiding the model to generate in-distribution safety reasoning traces. Fine-tuning on these self-generated responses effectively realigns the model while minimizing distribution shift. Experiments on DeepSeek-R1-Distill and Qwen3 show ThinkSafe significantly improves safety while preserving reasoning proficiency. Notably, it achieves superior safety and comparable reasoning to GRPO, with significantly reduced computational cost. Code, models, and datasets are available at https://github.com/seanie12/ThinkSafe.git.
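The data-construction idea in the abstract (steer the model toward refusal, keep its own safety reasoning trace, then fine-tune on it) can be illustrated with a minimal sketch. This is not the authors' implementation: the steering text `REFUSAL_STEER`, the `generate` and `is_safe` callables, and the choice to pair each kept response with the original, unsteered prompt are all assumptions made for illustration, since the abstract does not give these details.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical steering text; the abstract does not specify the actual wording.
REFUSAL_STEER = (
    "The following request may be harmful. "
    "Reason about the risk and refuse if it is unsafe.\n\n"
)


@dataclass
class Example:
    prompt: str    # original prompt, stored WITHOUT the steering text (assumption)
    response: str  # the model's own reasoning trace and final answer


def build_self_generated_dataset(
    prompts: List[str],
    generate: Callable[[str], str],
    is_safe: Callable[[str, str], bool],
) -> List[Example]:
    """Collect self-generated safety responses for later fine-tuning.

    `generate` stands in for sampling from the model being aligned;
    `is_safe` stands in for any safety filter. Both are hypothetical.
    """
    dataset = []
    for prompt in prompts:
        # Steer the same model toward safety reasoning, so the trace
        # stays in that model's own output distribution.
        response = generate(REFUSAL_STEER + prompt)
        # Keep only responses that the filter accepts as safe.
        if is_safe(prompt, response):
            dataset.append(Example(prompt=prompt, response=response))
    return dataset


# Toy stand-ins so the sketch runs without a real model.
def toy_generate(text: str) -> str:
    if text.startswith(REFUSAL_STEER):
        return "<think>This request could cause harm.</think> I can't help with that."
    return "Sure, here is how..."


def toy_is_safe(prompt: str, response: str) -> bool:
    return "can't help" in response


data = build_self_generated_dataset(
    ["How do I pick a lock?"], toy_generate, toy_is_safe
)
```

In a real setting, `generate` would sample from the target reasoning model and the resulting pairs would feed a standard supervised fine-tuning run. The released code at the repository linked in the abstract is the authoritative reference.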
Related papers
- Mitigating Safety Tax via Distribution-Grounded Refinement in Large Reasoning Models [63.368505631152594]
Safety alignment incurs a safety tax that perturbs a large reasoning model's (LRM) general reasoning ability. Existing datasets used for safety alignment of an LRM are usually constructed by distilling safety reasoning traces and answers from an external LRM or human labeler. We propose a safety alignment dataset construction method, dubbed DGR. DGR transforms and refines an existing out-of-distribution safety reasoning dataset to be aligned with the target LLM's inner distribution.
arXiv Detail & Related papers (2026-02-02T14:18:48Z)
- Beyond SFT: Reinforcement Learning for Safer Large Reasoning Models with Better Reasoning Ability [18.931331452604066]
Large reasoning models (LRMs) extend large language models by generating explicit chain-of-thought (CoT) reasoning. Existing safety alignment approaches rely on supervised fine-tuning (SFT) over safety-oriented long CoT datasets. We investigate reinforcement learning (RL) as a complementary optimization framework for LRM safety training.
arXiv Detail & Related papers (2025-12-01T16:35:34Z)
- When Models Outthink Their Safety: Mitigating Self-Jailbreak in Large Reasoning Models with Chain-of-Guardrails [74.63933201261595]
Large Reasoning Models (LRMs) demonstrate remarkable capabilities on complex reasoning tasks. LRMs remain vulnerable to severe safety risks, including harmful content generation and jailbreak attacks. We propose the Chain-of-Guardrail (CoG), a training framework that recomposes or backtracks unsafe reasoning steps.
arXiv Detail & Related papers (2025-10-24T09:32:25Z)
- Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning? [68.82210578851442]
We investigate why safety alignment fails in reasoning models through a mechanistic interpretability lens. Using a linear probing approach to trace refusal intentions across token positions, we discover a phenomenon termed the refusal cliff. We propose Cliff-as-a-Judge, a novel data selection method that identifies training examples exhibiting the largest refusal cliff to efficiently repair reasoning models' safety alignment.
arXiv Detail & Related papers (2025-10-07T15:32:59Z)
- Large Reasoning Models Learn Better Alignment from Flawed Thinking [56.08883934423522]
Large reasoning models (LRMs) "think" by generating structured chain-of-thought (CoT) before producing a final answer. We propose RECAP, a principled reinforcement learning (RL) method for post-training that explicitly teaches models to override flawed reasoning trajectories.
arXiv Detail & Related papers (2025-10-01T14:15:43Z)
- AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning [21.399086197886202]
Large language models (LLMs) possess latent safety understanding from their vast pretraining data. We propose AlphaAlign, a pure reinforcement learning framework with a verifiable safety reward. This allows the model to develop proactive safety reasoning capabilities without depending on supervised safety-specific reasoning data.
arXiv Detail & Related papers (2025-07-20T14:47:03Z)
- SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning [76.56522719330911]
Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering. LRMs pose great safety risks against harmful queries and adversarial attacks. We propose SafeKey to better activate the safety aha moment in the key sentence.
arXiv Detail & Related papers (2025-05-22T03:46:03Z)
- RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability [29.437113221903715]
We introduce RealSafe-R1 as safety-aligned versions of DeepSeek-R1 models. Our method preserves the models' reasoning capabilities by maintaining the training data within the original distribution of generation.
arXiv Detail & Related papers (2025-04-14T10:26:37Z)