Safe Langevin Soft Actor Critic
- URL: http://arxiv.org/abs/2602.00587v1
- Date: Sat, 31 Jan 2026 08:06:35 GMT
- Title: Safe Langevin Soft Actor Critic
- Authors: Mahesh Keswani, Samyak Jain, Raunak P. Bhattacharyya,
- Abstract summary: We introduce Safe Langevin Soft Actor-Critic (SL-SAC) to balance reward and safety in constrained reinforcement learning.<n>We show that SL-SAC achieves the lowest cost in 7 out of 10 tasks while maintaining competitive returns.<n>On Safety-Gymnasium, SL-SAC achieves cost reductions of 19-63% in velocity tasks compared to state-of-the-art baselines.
- Score: 10.683491090059867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Balancing reward and safety in constrained reinforcement learning remains challenging due to poor generalization from sharp value minima and inadequate handling of heavy-tailed risk distribution. We introduce Safe Langevin Soft Actor-Critic (SL-SAC), a principled algorithm that addresses both issues through parameter-space exploration and distributional risk control. Our approach combines three key mechanisms: (1) Adaptive Stochastic Gradient Langevin Dynamics (aSGLD) for reward critics, promoting ensemble diversity and escape from poor optima; (2) distributional cost estimation via Implicit Quantile Networks (IQN) with Conditional Value-at-Risk (CVaR) optimization for tail-risk mitigation; and (3) a reactive Lagrangian relaxation scheme that adapts constraint enforcement based on the empirical CVaR of episodic costs. We provide theoretical guarantees on CVaR estimation error and demonstrate that CVaR-based Lagrange updates yield stronger constraint violation signals than expected-cost updates. On Safety-Gymnasium benchmarks, SL-SAC achieves the lowest cost in 7 out of 10 tasks while maintaining competitive returns, with cost reductions of 19-63% in velocity tasks compared to state-of-the-art baselines.
Related papers
- Unifying Stable Optimization and Reference Regularization in RLHF [64.16830602324345]
This paper introduces a unified regularization approach that balances objectives of preventing reward hacking and maintaining stable policy updates.<n>Our simple yet principled alignment objective yields a weighted supervised fine-tuning loss with a superior trade-off, which demonstrably improves both alignment results and implementation complexity.
arXiv Detail & Related papers (2026-02-12T03:31:19Z) - Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation [56.92367609590823]
Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs)<n>We argue that Long CoT is inherently ill-suited for the sequential recommendation domain.<n>We propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation.
arXiv Detail & Related papers (2026-01-31T10:02:43Z) - Online Risk-Averse Planning in POMDPs Using Iterated CVaR Value Function [9.269394037577177]
We study risk-sensitive planning under partial observability using the dynamic risk measure Iterated Conditional Value-at-Risk (ICVaR)<n>A policy evaluation algorithm for ICVaR is developed with finite-time performance guarantees that do not depend on the cardinality of the action space.<n>Experiments on benchmark POMDP domains demonstrate that the proposed ICVaR planners achieve lower tail risk compared to their risk-neutral counterparts.
arXiv Detail & Related papers (2026-01-28T12:48:20Z) - Extreme Value Policy Optimization for Safe Reinforcement Learning [38.341398602157575]
Constrained Reinforcement Learning (CRL) addresses this by maximizing returns under predefined constraints.<n>However, expectation-based constraints overlook rare but high-impact extreme value events in the tail distribution.<n>We propose the Extreme Value policy Optimization (EVO) algorithm, leveraging Extreme Value Theory (EVT) to model and exploit extreme reward and cost samples.
arXiv Detail & Related papers (2026-01-17T11:12:24Z) - Boundary-to-Region Supervision for Offline Safe Reinforcement Learning [56.150983204962735]
Boundary-to-Region (B2R) is a framework that enables asymmetric conditioning through cost signal realignment.<n>B2R redefines CTG as a boundary constraint under a fixed safety budget, unifying the cost distribution of all feasible trajectories.<n> Experimental results show that B2R satisfies safety constraints in 35 out of 38 safety-critical tasks.
arXiv Detail & Related papers (2025-09-30T03:38:20Z) - Safety-Aware Reinforcement Learning for Control via Risk-Sensitive Action-Value Iteration and Quantile Regression [2.592761128203891]
Quantile-based action-value iteration methods reduce this bias by learning a distribution of the expected cost-to-go.<n>Existing methods often require complex neural architectures or manual tradeoffs due to combined cost functions.<n>We propose a risk-regularized quantile-based algorithm integrating Conditional Value-at-Risk to enforce safety without complex architectures.
arXiv Detail & Related papers (2025-06-08T00:22:00Z) - Adaptive Insurance Reserving with CVaR-Constrained Reinforcement Learning under Macroeconomic Regimes [0.0]
This paper proposes a reinforcement learning (RL) framework for insurance reserving that integrates tail-risk sensitivity, macroeconomic regime modeling, and regulatory compliance.<n>The framework also accommodates fixed-shock stress testing and regime-stratified analysis, providing a principled and principled approach to reserving under uncertainty.
arXiv Detail & Related papers (2025-04-13T01:43:25Z) - Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z) - Distributional Soft Actor-Critic with Three Refinements [47.46661939652862]
Reinforcement learning (RL) has shown remarkable success in solving complex decision-making and control tasks.<n>Many model-free RL algorithms experience performance degradation due to inaccurate value estimation.<n>This paper introduces three key refinements to DSACv1 to overcome these limitations and further improve Q-value estimation accuracy.
arXiv Detail & Related papers (2023-10-09T16:52:48Z) - Provably Efficient Iterated CVaR Reinforcement Learning with Function
Approximation and Human Feedback [57.6775169085215]
Risk-sensitive reinforcement learning aims to optimize policies that balance the expected reward and risk.
We present a novel framework that employs an Iterated Conditional Value-at-Risk (CVaR) objective under both linear and general function approximations.
We propose provably sample-efficient algorithms for this Iterated CVaR RL and provide rigorous theoretical analysis.
arXiv Detail & Related papers (2023-07-06T08:14:54Z) - Handling Long and Richly Constrained Tasks through Constrained
Hierarchical Reinforcement Learning [20.280636126917614]
Safety in goal directed Reinforcement Learning (RL) settings has typically been handled through constraints over trajectories.
We propose a (safety) Constrained Search with Hierarchical Reinforcement Learning (CoSHRL) mechanism that combines an upper level constrained search agent with a low-level goal conditioned RL agent.
A major advantage of CoSHRL is that it can handle constraints on the cost value distribution and can adjust to flexible constraint thresholds without retraining.
arXiv Detail & Related papers (2023-02-21T12:57:12Z) - Penalized Proximal Policy Optimization for Safe Reinforcement Learning [68.86485583981866]
We propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem.
P3O utilizes a simple-yet-effective penalty function to eliminate cost constraints and removes the trust-region constraint by the clipped surrogate objective.
We show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotive tasks.
arXiv Detail & Related papers (2022-05-24T06:15:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.