FuSaR: A Fuzzification-Based Method for LRM Safety-Reasoning Balance
- URL: http://arxiv.org/abs/2508.12897v1
- Date: Mon, 18 Aug 2025 12:54:16 GMT
- Title: FuSaR: A Fuzzification-Based Method for LRM Safety-Reasoning Balance
- Authors: Jianhao Chen, Mayi Xu, Xiaohu Li, Yongqi Li, Xiangyu Zhang, Jianjie Huang, Tieyun Qian,
- Abstract summary: Large Reasoning Models (LRMs) have demonstrated impressive performance across various tasks due to their powerful reasoning capabilities.<n>We propose a novel method to improve the safety of LLMs without sacrificing their reasoning capability.
- Score: 16.657840274027958
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Reasoning Models (LRMs) have demonstrated impressive performance across various tasks due to their powerful reasoning capabilities. However, their safety performance remains a significant concern. In this paper, we explore the reasons behind the vulnerability of LRMs. Based on this, we propose a novel method to improve the safety of LLMs without sacrificing their reasoning capability. Specifically, we exploit the competition between LRM's reasoning ability and safety ability, and achieve jailbreak by improving LRM's reasoning performance to reduce its safety performance. We then introduce an alignment strategy based on Fuzzification to balance Safety-Reasoning (FuSaR), by detoxifying the harmful reasoning process, where both the dangerous entities and the dangerous procedures in the reasoning steps are hidden. FuSaR successfully mitigates safety risks while preserving core reasoning information. We validate this strategy through alignment experiments on several open-source LRMs using detoxified reasoning data. The results compared with existing baselines conclusively show that FuSaR is an efficient alignment strategy to simultaneously enhance both the reasoning capability and safety of LRMs.
Related papers
- Mitigating Safety Tax via Distribution-Grounded Refinement in Large Reasoning Models [63.368505631152594]
Safety alignment incurs safety tax that perturbs a large reasoning model's (LRM) general reasoning ability.<n>Existing datasets used for safety alignment for an LRM are usually constructed by distilling safety reasoning traces and answers from an external LRM or human labeler.<n>We propose a safety alignment dataset construction method, dubbed DGR. DGR transforms and refines an existing out-of-distributional safety reasoning dataset to be aligned with the target's LLM inner distribution.
arXiv Detail & Related papers (2026-02-02T14:18:48Z) - When Models Outthink Their Safety: Mitigating Self-Jailbreak in Large Reasoning Models with Chain-of-Guardrails [74.63933201261595]
Large Reasoning Models (LRMs) demonstrate remarkable capabilities on complex reasoning tasks.<n>LRMs remain vulnerable to severe safety risks, including harmful content generation and jailbreak attacks.<n>We propose the Chain-of-Guardrail (CoG), a training framework that recomposes or backtracks unsafe reasoning steps.
arXiv Detail & Related papers (2025-10-24T09:32:25Z) - Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention [53.25106308403173]
We show that existing methods overlook the unique significance of safe reasoning, undermining their trustworthiness and posing potential risks in applications if unsafe reasoning is accessible for and exploited by malicious users.<n>We propose Intervened Preference Optimization (IPO), an alignment method that enforces safe reasoning by substituting compliance steps with safety triggers and constructing pairs for preference learning with strong signals.
arXiv Detail & Related papers (2025-09-29T07:41:09Z) - ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning [49.47193675702453]
Large Language Models (LLMs) have demonstrated remarkable generative capabilities.<n>LLMs remain vulnerable to malicious instructions that can bypass safety constraints.<n>We propose a reasoning-based safety alignment framework, ARMOR, that replaces the ad-hoc chains of thought reasoning process with human-aligned, structured one.
arXiv Detail & Related papers (2025-07-14T09:05:54Z) - Saffron-1: Safety Inference Scaling [69.61130284742353]
SAFFRON is a novel inference scaling paradigm tailored explicitly for safety assurance.<n>Central to our approach is the introduction of a multifurcation reward model (MRM) that significantly reduces the required number of reward model evaluations.<n>We publicly release our trained multifurcation reward model (Saffron-1) and the accompanying token-level safety reward dataset (Safety4M)
arXiv Detail & Related papers (2025-06-06T18:05:45Z) - How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study [90.34190170330481]
Large Reasoning Models (LRMs) have achieved remarkable success on reasoning-intensive tasks such as mathematics and programming.<n>However, their enhanced reasoning capabilities do not necessarily translate to improved safety performance.<n>We present a comprehensive empirical study on how to enhance the safety of LRMs through Supervised Fine-Tuning.
arXiv Detail & Related papers (2025-05-21T11:45:29Z) - SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models [50.34706204154244]
Acquiring reasoning capabilities catastrophically degrades inherited safety alignment.<n>Certain scenarios suffer 25 times higher attack rates.<n>Despite tight reasoning-answer safety coupling, MLRMs demonstrate nascent self-correction.
arXiv Detail & Related papers (2025-04-09T06:53:23Z) - Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable [7.140765245328677]
Safety alignment is an important procedure before the official deployment of a Large Language Model.<n>We show that there exists a trade-off between reasoning and safety capability with the sequential LRM production pipeline.<n>As a by-product, we curate a dataset called DirectRefusal, which might serve as an alternative dataset for safety alignment.
arXiv Detail & Related papers (2025-03-01T16:42:01Z) - SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities [21.317245896641136]
Long chain-of-thought (CoT) reasoning generates structured intermediate steps, enhancing reasoning capabilities.<n>Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT style outputs of LRMs.
arXiv Detail & Related papers (2025-02-17T16:57:56Z) - To Think or Not to Think: Exploring the Unthinking Vulnerability in Large Reasoning Models [56.19026073319406]
Large Reasoning Models (LRMs) are designed to solve complex tasks by generating explicit reasoning traces before producing final answers.<n>We reveal a critical vulnerability in LRMs -- termed Unthinking -- wherein the thinking process can be bypassed by manipulating special tokens.<n>In this paper, we investigate this vulnerability from both malicious and beneficial perspectives.
arXiv Detail & Related papers (2025-02-16T10:45:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.