InvThink: Towards AI Safety via Inverse Reasoning
- URL: http://arxiv.org/abs/2510.01569v1
- Date: Thu, 02 Oct 2025 01:26:53 GMT
- Title: InvThink: Towards AI Safety via Inverse Reasoning
- Authors: Yubin Kim, Taehan Kim, Eugene Park, Chunjong Park, Cynthia Breazeal, Daniel McDuff, Hae Won Park
- Abstract summary: InvThink gives large language models the capability of inverse thinking: reasoning through failure modes before generating responses. Our method reveals three key findings: (i) safety improvements show stronger scaling with model size compared to existing safety methods. InvThink excels in high-stakes domains including external-facing (medicine, finance, law) and agentic (blackmail, murder) risk scenarios, achieving up to 15.7% reduction in harmful responses.
- Score: 23.940337534762563
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present InvThink, a simple yet powerful approach that gives large language models (LLMs) the capability of inverse thinking: reasoning through failure modes before generating responses. Unlike existing safety alignment methods that optimize directly for safe responses, InvThink instructs models to 1) enumerate potential harms, 2) analyze their consequences, and 3) generate safe outputs that proactively avoid these risks. Our method reveals three key findings: (i) safety improvements show stronger scaling with model size compared to existing safety methods; (ii) InvThink mitigates the safety tax: by training models to systematically consider failure modes, it preserves general reasoning capabilities on standard benchmarks; (iii) beyond general safety tasks, InvThink excels in high-stakes domains including external-facing (medicine, finance, law) and agentic (blackmail, murder) risk scenarios, achieving up to 15.7% reduction in harmful responses compared to baseline methods like SafetyPrompt. We further implement InvThink via supervised fine-tuning and reinforcement learning across three LLM families. These results suggest that inverse reasoning provides a scalable and generalizable path toward safer, more capable language models.
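The three-step recipe above maps naturally onto a prompting pipeline: ask the model to list harms, trace their consequences, and only then answer. The following is a minimal sketch of InvThink-style inference, assuming an OpenAI-compatible chat client; the template wording, model name, and single-call design are illustrative assumptions, not the authors' released prompts or training setup.

```python
# Minimal sketch of InvThink-style inference via prompting.
# Assumptions: the `openai` Python package and an API key are available,
# and a single chat call carries all three inverse-reasoning steps.
from openai import OpenAI

client = OpenAI()

INVTHINK_TEMPLATE = """Before answering, reason inversely about failure modes:
1. Enumerate the potential harms a response to this request could cause.
2. Analyze the consequences of each harm.
3. Write a final answer that proactively avoids every risk you identified.

Request: {request}
"""

def invthink_respond(request: str, model: str = "gpt-4o-mini") -> str:
    """Generate a response only after explicit inverse reasoning."""
    completion = client.chat.completions.create(
        model=model,  # hypothetical choice; the paper trains three LLM families
        messages=[{"role": "user", "content": INVTHINK_TEMPLATE.format(request=request)}],
    )
    return completion.choices[0].message.content

print(invthink_respond("A patient asks whether to double their insulin dose."))
```

In the paper the behavior is distilled into the model via supervised fine-tuning and reinforcement learning rather than left as a runtime prompt; the sketch only shows the shape of the reasoning trace.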
Related papers
- Think-Reflect-Revise: A Policy-Guided Reflective Framework for Safety Alignment in Large Vision Language Models [58.17589701432514]
Think-Reflect-Revise (TRR) is a training framework designed to enhance the safety alignment of Large Vision Language Models (LVLMs). We first build a Reflective Safety Reasoning (ReSafe) dataset with 5,000 examples that follow a think-reflect-revise process. We then fine-tune the target model using the ReSafe dataset to initialize reflective behavior, and finally reinforce policy-guided reflection through reinforcement learning.
arXiv Detail & Related papers (2025-12-08T03:46:03Z)
- GuardTrace-VL: Detecting Unsafe Multimodal Reasoning via Iterative Safety Supervision [47.99880677909197]
GuardTrace-VL is a vision-aware safety auditor that monitors the full Question-Thinking-Answer (QTA) pipeline via joint image-text analysis. We propose a three-stage progressive training scheme combined with a data refinement process, enabling the model to learn nuanced and context-dependent safety preferences. On our proposed test set covering both in-domain and out-of-domain scenarios, GuardTrace-VL achieves an F1 score of 93.1% on unsafe reasoning detection tasks.
arXiv Detail & Related papers (2025-11-26T02:49:51Z)
- SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models [50.34706204154244]
Acquiring reasoning capabilities catastrophically degrades inherited safety alignment. Certain scenarios suffer 25 times higher attack rates. Despite tight reasoning-answer safety coupling, MLRMs demonstrate nascent self-correction.
arXiv Detail & Related papers (2025-04-09T06:53:23Z)
- SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning [10.844235123282056]
Vision-language-action models (VLAs) show potential as generalist robot policies. These models pose extreme safety challenges during real-world deployment, including the risk of harm to the environment, the robot itself, and humans. We address this by exploring an integrated safety approach (ISA): systematically modeling safety requirements, then actively eliciting diverse unsafe behaviors.
arXiv Detail & Related papers (2025-03-05T13:16:55Z)
- The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 [70.94607997570729]
We present a comprehensive safety assessment of the OpenAI-o3 and DeepSeek-R1 reasoning models. We investigate their susceptibility to adversarial attacks, such as jailbreaking and prompt injection, to assess their robustness in real-world applications.
arXiv Detail & Related papers (2025-02-18T09:06:07Z)
- OpenAI o1 System Card [274.83891368890977]
The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.
arXiv Detail & Related papers (2024-12-21T18:04:31Z)
- Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level [10.476222570886483]
Large language models (LLMs) have demonstrated immense utility across various industries. As LLMs advance, the risk of harmful outputs increases due to incorrect or malicious instruction prompts. This paper examines LLMs' capability to recognize harmful outputs, revealing and quantifying their proficiency in assessing the danger of previous tokens.
arXiv Detail & Related papers (2024-10-09T12:09:30Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance with harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of a harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to a safety refusal consistently throughout the harmful response sequence. (A minimal data-construction sketch for component (1) follows this list.)
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
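DeRTa's first component is concrete enough to sketch as data construction: prepend a random-length segment of a harmful response to the safe refusal the model is trained to produce. The snippet below is a minimal illustration assuming a corpus of (prompt, harmful response, safe refusal) triples; the field names, word-level cut point, and example strings are hypothetical, and the RTO component is omitted.

```python
# Minimal sketch of DeRTa-style MLE-with-harmful-response-prefix data
# construction. Hypothetical names and truncation policy; not the
# authors' released recipe, and RTO is not shown.
import random
from dataclasses import dataclass

@dataclass
class TrainingExample:
    prompt: str
    target: str  # text the model is trained (via MLE) to produce

def make_prefixed_example(prompt: str, harmful: str, refusal: str) -> TrainingExample:
    """Prepend a random-length slice of the harmful response to the safe
    refusal, so the model learns to pivot to refusal at any position."""
    tokens = harmful.split()
    cut = random.randint(0, len(tokens))       # cut == 0 keeps the plain refusal
    prefix = " ".join(tokens[:cut])
    target = (prefix + " " + refusal).strip()  # harmful prefix -> safe refusal
    return TrainingExample(prompt=prompt, target=target)

example = make_prefixed_example(
    prompt="Explain how to pick a lock.",
    harmful="First, insert a tension wrench into the keyway and apply light torque...",
    refusal="I can't help with bypassing locks that aren't yours.",
)
print(example.target)
```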