Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought
- URL: http://arxiv.org/abs/2602.06650v1
- Date: Fri, 06 Feb 2026 12:20:01 GMT
- Title: Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought
- Authors: Jianfeng Si, Lin Sun, Weihong Lin, Xiangzheng Zhang,
- Abstract summary: Large Language Models (LLMs) face a fundamental safety-helpfulness trade-off due to static, one-size-fits-all safety policies.<n>We present textbfPACT (Prompt-Thought Action via Chain-of-Thought), a framework for dynamic safety control through explicit, risk-aware reasoning.
- Score: 5.251527748612469
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) face a fundamental safety-helpfulness trade-off due to static, one-size-fits-all safety policies that lack runtime controllabilityxf, making it difficult to tailor responses to diverse application needs. %As a result, models may over-refuse benign requests or under-constrain harmful ones. We present \textbf{PACT} (Prompt-configured Action via Chain-of-Thought), a framework for dynamic safety control through explicit, risk-aware reasoning. PACT operates under a hierarchical policy architecture: a non-overridable global safety policy establishes immutable boundaries for critical risks (e.g., child safety, violent extremism), while user-defined policies can introduce domain-specific (non-global) risk categories and specify label-to-action behaviors to improve utility in real-world deployment settings. The framework decomposes safety decisions into structured Classify$\rightarrow$Act paths that route queries to the appropriate action (comply, guide, or reject) and render the decision-making process transparent. Extensive experiments demonstrate that PACT achieves near state-of-the-art safety performance under global policy evaluation while attaining the best controllability under user-specific policy evaluation, effectively mitigating the safety-helpfulness trade-off. We will release the PACT model suite, training data, and evaluation protocols to facilitate reproducible research in controllable safety alignment.
Related papers
- Vulnerability Analysis of Safe Reinforcement Learning via Inverse Constrained Reinforcement Learning [1.4707788677208018]
We propose an adversarial attack framework to reveal vulnerabilities of Safe RL policies.<n>Our framework learns a constraint model and a surrogate (learner) policy, enabling gradient-based attack optimization.
arXiv Detail & Related papers (2026-02-18T15:43:36Z) - Safe Reinforcement Learning via Recovery-based Shielding with Gaussian Process Dynamics Models [57.006252510102506]
Reinforcement learning (RL) is a powerful framework for optimal decision-making and control but often lacks provable guarantees for safety-critical applications.<n>We introduce a novel recovery-based shielding framework that enables safe RL with a provable safety lower bound for unknown and non-linear continuous dynamical systems.
arXiv Detail & Related papers (2026-02-12T22:03:35Z) - Constrained Language Model Policy Optimization via Risk-aware Stepwise Alignment [49.2305683068875]
We propose Risk-aware Stepwise Alignment (RSA), a novel alignment method that incorporates risk awareness into the policy optimization process.<n> RSA mitigates risks induced by excessive model shift away from a reference policy, and it explicitly suppresses low-probability yet high-impact harmful behaviors.<n> Experimental results demonstrate that our method achieves high levels of helpfulness while ensuring strong safety.
arXiv Detail & Related papers (2025-12-30T14:38:02Z) - SafeGRPO: Self-Rewarded Multimodal Safety Alignment via Rule-Governed Policy Optimization [79.14563283347773]
Multimodal large language models (MLLMs) have demonstrated impressive reasoning and instruction-following capabilities.<n>Cross-modal couplings can produce unsafe semantics even when individual inputs are benign.<n>We propose SafeGRPO, a self-rewarded multimodal safety alignment framework.
arXiv Detail & Related papers (2025-11-17T05:09:49Z) - RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards [55.76285458905577]
Large Language Models (LLMs) continue to exhibit vulnerabilities despite deliberate safety alignment efforts.<n>To safeguard against the risk of policy-violating content, system-level moderation via external guard models has emerged as a prevalent mitigation strategy.<n>We propose RSafe, an adaptive reasoning-based safeguard that conducts guided safety reasoning to provide robust protection within the scope of specified safety policies.
arXiv Detail & Related papers (2025-06-09T13:20:04Z) - Shape it Up! Restoring LLM Safety during Finetuning [65.75757313781104]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks.<n>We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content.<n>We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z) - SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning [30.037309138373754]
Vision-language-action models (VLAs) show potential as generalist robot policies.<n>These models pose extreme safety challenges during real-world deployment, including the risk of harm to the environment, the robot itself, and humans.<n>We address this by exploring an integrated safety approach (ISA), systematically modeling safety requirements, then actively eliciting diverse unsafe behaviors.
arXiv Detail & Related papers (2025-03-05T13:16:55Z) - Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization [16.35399722653875]
We propose Rectified Policy Optimization (RePO) to balance helpfulness and safety (harmlessness) in large language models (LLMs)<n>At the core of RePO is a policy update mechanism driven by rectified policy gradients, which penalizes the strict safety violation of every prompt, thereby enhancing safety across nearly all prompts.
arXiv Detail & Related papers (2024-10-25T19:08:23Z) - Safe Policy Improvement in Constrained Markov Decision Processes [10.518340300810504]
We present a solution to the synthesis problem by solving its two main challenges: reward-shaping from a set of formal requirements and safe policy update.
For the former, we propose an automatic reward-shaping procedure, defining a scalar reward signal compliant with the task specification.
For the latter, we introduce an algorithm ensuring that the policy is improved in a safe fashion with high-confidence guarantees.
arXiv Detail & Related papers (2022-10-20T13:29:32Z) - Safe Reinforcement Learning via Confidence-Based Filters [78.39359694273575]
We develop a control-theoretic approach for certifying state safety constraints for nominal policies learned via standard reinforcement learning techniques.
We provide formal safety guarantees, and empirically demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2022-07-04T11:43:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.