R1-ACT: Efficient Reasoning Model Safety Alignment by Activating Safety Knowledge
- URL: http://arxiv.org/abs/2508.00324v1
- Date: Fri, 01 Aug 2025 05:14:13 GMT
- Title: R1-ACT: Efficient Reasoning Model Safety Alignment by Activating Safety Knowledge
- Authors: Yeonjun In, Wonjoong Kim, Sangwu Park, Chanyoung Park
- Abstract summary: We investigate the underlying cause of LRM safety risks and find that models already possess sufficient safety knowledge but fail to activate it during reasoning. We propose R1-Act, a simple and efficient post-training method that explicitly triggers safety knowledge through a structured reasoning process. R1-Act achieves strong safety improvements while preserving reasoning performance, outperforming prior alignment methods.
- Score: 16.653490433862224
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although large reasoning models (LRMs) have demonstrated impressive capabilities on complex tasks, recent studies reveal that these models frequently fulfill harmful user instructions, raising significant safety concerns. In this paper, we investigate the underlying cause of LRM safety risks and find that models already possess sufficient safety knowledge but fail to activate it during reasoning. Based on this insight, we propose R1-Act, a simple and efficient post-training method that explicitly triggers safety knowledge through a structured reasoning process. R1-Act achieves strong safety improvements while preserving reasoning performance, outperforming prior alignment methods. Notably, it requires only 1,000 training examples and 90 minutes of training on a single RTX A6000 GPU. Extensive experiments across multiple LRM backbones and sizes demonstrate the robustness, scalability, and practical efficiency of our approach.
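To make the scale of post-training described in the abstract concrete (roughly 1,000 examples and 90 minutes on a single GPU), below is a minimal sketch of supervised fine-tuning on a small safety-reasoning dataset. This is not the authors' released code: the backbone name, the `safety_reasoning.jsonl` file, the prompt formatting, and the hyperparameters are assumptions for illustration only.

```python
# Minimal sketch (not the authors' code): lightweight supervised fine-tuning
# on ~1,000 safety-reasoning examples, roughly the budget the abstract cites.
# Model name, data file, and chat markers are illustrative assumptions.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical JSONL file: each record pairs a harmful "prompt" with a
# "safety_response" whose reasoning explicitly recalls why the request is
# unsafe before refusing, mirroring the idea of activating existing safety
# knowledge rather than teaching new knowledge.
dataset = load_dataset("json", data_files="safety_reasoning.jsonl", split="train")

def to_text(example):
    # Collapse each record into a single training string ("text" is the
    # default field SFTTrainer tokenizes).
    return {"text": f"User: {example['prompt']}\nAssistant: {example['safety_response']}"}

dataset = dataset.map(to_text, remove_columns=dataset.column_names)

trainer = SFTTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",  # example LRM backbone
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="r1-act-sft",
        num_train_epochs=2,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=1e-5,
        logging_steps=10,
    ),
)
trainer.train()
```

With a small backbone like the one above, a run of this shape stays within a single-GPU budget; larger backbones would typically require parameter-efficient tuning. The paper's actual recipe, including the structured safety-reasoning format of the training examples, is its contribution and is not reproduced here.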
Related papers
- How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study [90.34190170330481]
Large Reasoning Models (LRMs) have achieved remarkable success on reasoning-intensive tasks such as mathematics and programming. However, their enhanced reasoning capabilities do not necessarily translate to improved safety performance. We present a comprehensive empirical study on how to enhance the safety of LRMs through Supervised Fine-Tuning.
arXiv Detail & Related papers (2025-05-21T11:45:29Z) - RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability [29.437113221903715]
We introduce RealSafe-R1 as safety-aligned versions of DeepSeek-R1 models. Our method preserves the models' reasoning capabilities by maintaining the training data within the original distribution of generation.
arXiv Detail & Related papers (2025-04-14T10:26:37Z) - The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 [70.94607997570729]
We present a comprehensive safety assessment of OpenAI-o3 and DeepSeek-R1 reasoning models. We investigate their susceptibility to adversarial attacks, such as jailbreaking and prompt injection, to assess their robustness in real-world applications.
arXiv Detail & Related papers (2025-02-18T09:06:07Z) - Safety Reasoning with Guidelines [63.15719512614899]
Refusal Training (RT) struggles to generalize against various Out-of-Distribution (OOD) jailbreaking attacks. We propose training the model to perform safety reasoning for each query.
arXiv Detail & Related papers (2025-02-06T13:01:44Z) - OpenAI o1 System Card [274.83891368890977]
The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.
arXiv Detail & Related papers (2024-12-21T18:04:31Z) - Safety-Driven Deep Reinforcement Learning Framework for Cobots: A Sim2Real Approach [1.0488553716155147]
This study presents a novel methodology that incorporates safety constraints into a robotic simulation during deep reinforcement learning (DRL) training.
The framework integrates specific parts of the safety requirements, such as velocity constraints, directly within the DRL model.
The proposed approach outperforms the conventional method by 16.5% in average success rate on the tested scenarios.
arXiv Detail & Related papers (2024-07-02T12:56:17Z) - Safety through Permissibility: Shield Construction for Fast and Safe Reinforcement Learning [57.84059344739159]
"Shielding" is a popular technique to enforce safety inReinforcement Learning (RL)
We propose a new permissibility-based framework to deal with safety and shield construction.
arXiv Detail & Related papers (2024-05-29T18:00:21Z) - Evaluating Model-free Reinforcement Learning toward Safety-critical Tasks [70.76757529955577]
This paper revisits prior work in this scope from the perspective of state-wise safe RL.
We propose Unrolling Safety Layer (USL), a joint method that combines safety optimization and safety projection.
To facilitate further research in this area, we reproduce related algorithms in a unified pipeline and incorporate them into SafeRL-Kit.
arXiv Detail & Related papers (2022-12-12T06:30:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.