Constrained Language Model Policy Optimization via Risk-aware Stepwise Alignment
- URL: http://arxiv.org/abs/2512.24263v1
- Date: Tue, 30 Dec 2025 14:38:02 GMT
- Title: Constrained Language Model Policy Optimization via Risk-aware Stepwise Alignment
- Authors: Lijun Zhang, Lin Li, Wei Wei, Yajie Qi, Huizhong Song, Jun Wang, Yaodong Yang, Jiye Liang
- Abstract summary: We propose Risk-aware Stepwise Alignment (RSA), a novel alignment method that incorporates risk awareness into the policy optimization process. RSA mitigates risks induced by excessive model shift away from a reference policy, and it explicitly suppresses low-probability yet high-impact harmful behaviors. Experimental results demonstrate that our method achieves high levels of helpfulness while ensuring strong safety.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: When fine-tuning pre-trained Language Models (LMs) to exhibit desired behaviors, maintaining control over risk is critical for ensuring both safety and trustworthiness. Most existing safety alignment methods, such as Safe RLHF and SACPO, operate under a risk-neutral paradigm that is insufficient to address the risks arising from deviations from the reference policy and offers limited robustness against rare but potentially catastrophic harmful behaviors. To address this limitation, we propose Risk-aware Stepwise Alignment (RSA), a novel alignment method that explicitly incorporates risk awareness into the policy optimization process by leveraging a class of nested risk measures. Specifically, RSA formulates safety alignment as a token-level risk-aware constrained policy optimization problem and solves it through a stepwise alignment procedure that yields token-level policy updates derived from the nested risk measures. This design offers two key benefits: (1) it mitigates risks induced by excessive model shift away from a reference policy, and (2) it explicitly suppresses low-probability yet high-impact harmful behaviors. Moreover, we provide a theoretical analysis of policy optimality under mild assumptions. Experimental results demonstrate that our method achieves high levels of helpfulness while ensuring strong safety, and it significantly suppresses tail risks, namely low-probability yet high-impact unsafe responses.
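The abstract does not reproduce RSA's update equations, but the nested-risk idea can be illustrated. The toy sketch below is our own construction, not the authors' code: it evaluates a stepwise nested CVaR by backward recursion over token steps, with `cvar`, `nested_cvar`, and the per-step reward arrays all hypothetical names.

```python
import numpy as np

def cvar(values, probs, alpha):
    """Lower-tail CVaR: the expected value over the worst
    alpha-fraction of a discrete distribution."""
    order = np.argsort(values)                 # ascending: worst outcomes first
    v, p = values[order], probs[order]
    capped = np.minimum(np.cumsum(p), alpha)   # probability mass kept per atom
    w = np.diff(np.concatenate(([0.0], capped)))
    return float(np.dot(v, w) / alpha)

def nested_cvar(step_rewards, step_probs, alpha):
    """Nested (stepwise) risk: rho_t = CVaR_alpha(r_t + rho_{t+1}),
    evaluated backward over token steps."""
    value = 0.0
    for r, p in zip(reversed(step_rewards), reversed(step_probs)):
        value = cvar(np.asarray(r) + value, np.asarray(p), alpha)
    return value

# Two token steps, three candidate tokens each (toy numbers)
rewards = [np.array([1.0, 0.8, -5.0]), np.array([0.5, 0.4, -3.0])]
probs = [np.array([0.6, 0.35, 0.05]), np.array([0.7, 0.25, 0.05])]
print(nested_cvar(rewards, probs, alpha=0.25))
```

Unlike a CVaR of the whole-sequence return, the risk operator is re-applied at every step, which is what lets a stepwise measure penalize rare, high-impact bad tokens locally.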
Related papers
- Risk-sensitive reinforcement learning using expectiles, shortfall risk and optimized certainty equivalent risk
We propose risk-sensitive reinforcement learning algorithms catering to three families of risk measures. For each risk measure, in the context of a finite horizon Markov decision process, we first derive a policy gradient theorem. We conduct numerical experiments to validate the theoretical findings on popular RL benchmarks.
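Two of the three risk-measure families have short empirical estimators. The sketch below follows the standard definitions rather than the paper's code: it computes an expectile by bisection and an optimized certainty equivalent (OCE) by grid search, with CVaR falling out as the OCE under a piecewise-linear utility.

```python
import numpy as np

def expectile(x, tau, tol=1e-8):
    """tau-expectile of samples x: the m solving
    tau * E[(x - m)_+] = (1 - tau) * E[(m - x)_+].
    tau = 0.5 recovers the mean; small tau stresses the lower tail."""
    lo, hi = float(x.min()), float(x.max())
    while hi - lo > tol:
        m = 0.5 * (lo + hi)
        g = tau * np.maximum(x - m, 0).mean() - (1 - tau) * np.maximum(m - x, 0).mean()
        lo, hi = (m, hi) if g > 0 else (lo, m)   # g > 0 means m is too small
    return 0.5 * (lo + hi)

def oce(x, u, n_grid=1001):
    """Optimized certainty equivalent: sup_eta { eta + E[u(x - eta)] }
    for a concave utility u, approximated on a grid over the sample range."""
    grid = np.linspace(x.min(), x.max(), n_grid)
    return max(eta + u(x - eta).mean() for eta in grid)

returns = np.random.default_rng(0).normal(0.0, 1.0, 50_000)
alpha = 0.1
# CVaR_alpha (mean of the worst alpha-fraction) is the OCE with u(t) = min(t, 0)/alpha
print(expectile(returns, tau=0.1), oce(returns, lambda t: np.minimum(t, 0) / alpha))
```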
arXiv Detail & Related papers (2026-02-10T00:38:21Z)
- Risk-Sensitive Exponential Actor Critic
We show that risk-sensitive exponential actor-critic (rsEAC) produces more numerically stable updates than existing approaches. rsEAC reliably learns risk-sensitive policies in challenging risky variants of continuous tasks in MuJoCo.
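The abstract does not state the criterion, but exponential actor-critic methods conventionally optimize the entropic (exponential-utility) risk; here is a minimal estimator sketch under that assumption, using log-sum-exp for the kind of numerical stability the paper emphasizes:

```python
import numpy as np
from scipy.special import logsumexp

def entropic_risk(returns, beta):
    """Exponential-utility objective (1/beta) * log E[exp(beta * R)],
    estimated from sampled returns.  beta < 0 is risk-averse, beta > 0
    risk-seeking, and beta -> 0 recovers the mean.  Computing it via
    log-sum-exp avoids overflow when |beta * R| is large."""
    r = np.asarray(returns)
    return float(logsumexp(beta * r) - np.log(len(r))) / beta
```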
arXiv Detail & Related papers (2026-02-06T21:23:43Z)
- Rectified Robust Policy Optimization for Model-Uncertain Constrained Reinforcement Learning without Strong Duality
We propose a novel primal-only algorithm called Rectified Robust Policy Optimization (RRPO). RRPO operates directly on the primal problem without relying on dual formulations. We show convergence to an approximately optimal feasible policy with complexity matching the best-known lower bound.
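The abstract does not spell out the rectification; one plausible reading, sketched below purely as an assumption, is a primal penalty on the positive part of the constraint violation, which needs no dual variable:

```python
def rectified_objective(reward, cost, budget, lam):
    """Primal-only surrogate J = reward - lam * max(0, cost - budget).
    The rectifier penalizes only policies that actually exceed the cost
    budget, so the objective can be ascended directly without a dual
    formulation or strong duality.  (Illustrative guess at the
    'rectified' construction, not the paper's exact algorithm.)"""
    return reward - lam * max(0.0, cost - budget)
```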
arXiv Detail & Related papers (2025-08-24T16:59:38Z)
- Risk-sensitive Actor-Critic with Static Spectral Risk Measures for Online and Offline Reinforcement Learning
We propose a novel framework for optimizing static Spectral Risk Measures (SRM). Our algorithms consistently outperform existing risk-sensitive methods in both online and offline environments across diverse domains.
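A static spectral risk measure is a weighted average of quantiles, which makes the plug-in estimator short. The sketch below is a generic construction from the textbook definition, not the paper's algorithm:

```python
import numpy as np

def spectral_risk(returns, sigma):
    """Plug-in estimate of rho(X) = integral_0^1 sigma(u) F^{-1}(u) du:
    a sigma-weighted average of the sorted sample returns.  sigma is a
    nonnegative risk spectrum on (0, 1); putting all its mass on small
    u (the worst outcomes) yields tail-averse measures like CVaR."""
    x = np.sort(np.asarray(returns))      # empirical quantile function
    u = (np.arange(len(x)) + 0.5) / len(x)
    w = sigma(u)
    return float(np.dot(w / w.sum(), x))  # renormalize the discretization

samples = np.random.default_rng(0).normal(0.0, 1.0, 100_000)
alpha = 0.05
# CVaR_alpha is the SRM with spectrum sigma(u) = 1[u <= alpha] / alpha
print(spectral_risk(samples, lambda u: (u <= alpha) / alpha))
```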
arXiv Detail & Related papers (2025-07-05T04:41:54Z)
- RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards
Large Language Models (LLMs) continue to exhibit vulnerabilities despite deliberate safety alignment efforts. To safeguard against the risk of policy-violating content, system-level moderation via external guard models has emerged as a prevalent mitigation strategy. We propose RSafe, an adaptive reasoning-based safeguard that conducts guided safety reasoning to provide robust protection within the scope of specified safety policies.
arXiv Detail & Related papers (2025-06-09T13:20:04Z)
- Beyond CVaR: Leveraging Static Spectral Risk Measures for Enhanced Decision-Making in Distributional Reinforcement Learning
In domains such as finance, healthcare, and robotics, managing worst-case scenarios is critical. Distributional Reinforcement Learning (DRL) provides a natural framework to incorporate risk sensitivity into decision-making processes. We present a novel DRL algorithm with convergence guarantees that optimizes for a broader class of static Spectral Risk Measures (SRM).
arXiv Detail & Related papers (2025-01-03T20:25:41Z)
- Is Risk-Sensitive Reinforcement Learning Properly Resolved?
We propose a novel algorithm, namely Trajectory Q-Learning (TQL), for risk-sensitive RL (RSRL) problems with provable policy improvement. Based on our new learning architecture, we are free to introduce a general and practical implementation for different risk measures to learn disparate risk-sensitive policies.
arXiv Detail & Related papers (2023-07-02T11:47:21Z)
- On the Global Convergence of Risk-Averse Policy Gradient Methods with Expected Conditional Risk Measures
Risk-sensitive reinforcement learning (RL) has become a popular tool for controlling the risk of uncertain outcomes. It remains unclear if Policy Gradient (PG) methods enjoy the same global convergence guarantees as in the risk-neutral case.
arXiv Detail & Related papers (2023-01-26T04:35:28Z)
- Efficient Risk-Averse Reinforcement Learning
In risk-averse reinforcement learning (RL), the goal is to optimize some risk measure of the returns.
We prove that under certain conditions this inevitably leads to a local-optimum barrier, and propose a soft risk mechanism to bypass it.
We demonstrate improved risk aversion in maze navigation, autonomous driving, and resource allocation benchmarks.
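One way to read the soft risk mechanism is as an annealed CVaR level: train risk-neutrally at first so that successful episodes still provide signal, then tighten toward the target tail. The schedule below is a schematic paraphrase, not the paper's exact mechanism:

```python
import numpy as np

def empirical_cvar(returns, alpha):
    """Mean of the worst alpha-fraction of sampled returns."""
    x = np.sort(np.asarray(returns))
    k = max(1, int(np.ceil(alpha * len(x))))
    return float(x[:k].mean())

def soft_risk_level(step, total_steps, alpha_target):
    """Anneal the CVaR level from 1.0 (risk-neutral mean) down to
    alpha_target.  Early in training the worst tail of a bad policy is
    uninformative, which is the local-optimum barrier a soft risk
    mechanism is meant to bypass."""
    frac = min(1.0, step / total_steps)
    return 1.0 + frac * (alpha_target - 1.0)   # linear schedule: a design choice
```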
arXiv Detail & Related papers (2022-05-10T19:40:52Z)
- Mean-Variance Policy Iteration for Risk-Averse Reinforcement Learning
We present a framework for risk-averse control in a discounted infinite horizon MDP.
MVPI enjoys great flexibility in that any policy evaluation method and risk-neutral control method can be dropped in for risk-averse control off the shelf.
This flexibility reduces the gap between risk-neutral control and risk-averse control and is achieved by working on a novel augmented MDP.
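The augmented MDP can be made concrete. For a per-step-reward variance objective E[r] - lam*Var(r), Fenchel duality on (E[r])^2 yields an alternation between risk-neutral RL on an augmented reward and a scalar dual update; the sketch below follows the published MVPI reduction as we understand it, with variable names our own:

```python
def mvpi_augmented_reward(r, y, lam):
    """MVPI-style augmentation: with Var(r) = E[r^2] - (E[r])^2 and
    (E[r])^2 = max_y (2*y*E[r] - y^2), maximizing E[r] - lam*Var(r)
    alternates between (1) risk-neutral control on
    r_hat = r - lam*r^2 + 2*lam*y*r  and (2) the dual update
    y <- current average per-step reward."""
    return r - lam * r * r + 2.0 * lam * y * r
```

Any off-the-shelf risk-neutral method can run step (1), which is the flexibility the summary highlights.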
arXiv Detail & Related papers (2020-04-22T22:23:44Z)