Related papers: Safe Langevin Soft Actor Critic

URL: http://arxiv.org/abs/2602.00587v1
Date: Sat, 31 Jan 2026 08:06:35 GMT
Title: Safe Langevin Soft Actor Critic
Authors: Mahesh Keswani, Samyak Jain, Raunak P. Bhattacharyya
Abstract summary: We introduce Safe Langevin Soft Actor-Critic (SL-SAC) to balance reward and safety in constrained reinforcement learning. We show that SL-SAC achieves the lowest cost in 7 out of 10 tasks while maintaining competitive returns. On Safety-Gymnasium, SL-SAC achieves cost reductions of 19-63% in velocity tasks compared to state-of-the-art baselines.
Score: 10.683491090059867
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Balancing reward and safety in constrained reinforcement learning remains challenging due to poor generalization from sharp value minima and inadequate handling of heavy-tailed risk distributions. We introduce Safe Langevin Soft Actor-Critic (SL-SAC), a principled algorithm that addresses both issues through parameter-space exploration and distributional risk control. Our approach combines three key mechanisms: (1) Adaptive Stochastic Gradient Langevin Dynamics (aSGLD) for reward critics, promoting ensemble diversity and escape from poor optima; (2) distributional cost estimation via Implicit Quantile Networks (IQN) with Conditional Value-at-Risk (CVaR) optimization for tail-risk mitigation; and (3) a reactive Lagrangian relaxation scheme that adapts constraint enforcement based on the empirical CVaR of episodic costs. We provide theoretical guarantees on CVaR estimation error and demonstrate that CVaR-based Lagrange updates yield stronger constraint violation signals than expected-cost updates. On Safety-Gymnasium benchmarks, SL-SAC achieves the lowest cost in 7 out of 10 tasks while maintaining competitive returns, with cost reductions of 19-63% in velocity tasks compared to state-of-the-art baselines.
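
Illustration (not from the paper): the abstract describes a Lagrange multiplier driven by the empirical CVaR of episodic costs rather than by their mean. The sketch below shows one generic way such an update could look. The function names (empirical_cvar, update_multiplier), the parameters alpha, budget and lr, and the plain projected-ascent rule are all assumptions made for illustration; the paper's actual update may differ.

```python
import math


def empirical_cvar(costs, alpha):
    """Mean of the worst (1 - alpha) fraction of costs (upper tail).

    Hypothetical helper: alpha = 0.0 reduces to the ordinary mean,
    alpha close to 1.0 focuses on the most expensive episodes.
    """
    if not costs:
        raise ValueError("costs must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    ordered = sorted(costs, reverse=True)
    # Round before ceil to guard against floating-point noise in (1 - alpha) * n.
    k = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    return sum(ordered[:k]) / k


def update_multiplier(lam, costs, budget, alpha=0.9, lr=0.01):
    """One projected-ascent step on the Lagrange multiplier (assumed rule).

    The multiplier grows when the tail cost exceeds the budget and
    shrinks otherwise, but never drops below zero.
    """
    violation = empirical_cvar(costs, alpha) - budget
    return max(0.0, lam + lr * violation)


if __name__ == "__main__":
    episodic_costs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    print(empirical_cvar(episodic_costs, 0.0))  # plain mean of the costs
    print(empirical_cvar(episodic_costs, 0.8))  # mean of the two worst episodes
    print(update_multiplier(0.5, episodic_costs, budget=5.0, alpha=0.8))
```

Because the upper-tail CVaR is never smaller than the mean of the same sample, the violation term in this sketch is at least as large as the one an expected-cost update would see, which is consistent with the abstract's remark about stronger constraint violation signals.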

Related papers

Unifying Stable Optimization and Reference Regularization in RLHF [64.16830602324345]
This paper introduces a unified regularization approach that balances objectives of preventing reward hacking and maintaining stable policy updates. Our simple yet principled alignment objective yields a weighted supervised fine-tuning loss with a superior trade-off, which demonstrably improves alignment results and reduces implementation complexity.
arXiv Detail & Related papers (2026-02-12T03:31:19Z)
Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation [56.92367609590823]
Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs). We argue that Long CoT is inherently ill-suited for the sequential recommendation domain. We propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation.
arXiv Detail & Related papers (2026-01-31T10:02:43Z)
Online Risk-Averse Planning in POMDPs Using Iterated CVaR Value Function [9.269394037577177]
We study risk-sensitive planning under partial observability using the dynamic risk measure Iterated Conditional Value-at-Risk (ICVaR). A policy evaluation algorithm for ICVaR is developed with finite-time performance guarantees that do not depend on the cardinality of the action space. Experiments on benchmark POMDP domains demonstrate that the proposed ICVaR planners achieve lower tail risk compared to their risk-neutral counterparts.
arXiv Detail & Related papers (2026-01-28T12:48:20Z)
Extreme Value Policy Optimization for Safe Reinforcement Learning [38.341398602157575]
Constrained Reinforcement Learning (CRL) addresses this by maximizing returns under predefined constraints. However, expectation-based constraints overlook rare but high-impact extreme value events in the tail distribution. We propose the Extreme Value policy Optimization (EVO) algorithm, leveraging Extreme Value Theory (EVT) to model and exploit extreme reward and cost samples.
arXiv Detail & Related papers (2026-01-17T11:12:24Z)
Boundary-to-Region Supervision for Offline Safe Reinforcement Learning [56.150983204962735]
Boundary-to-Region (B2R) is a framework that enables asymmetric conditioning through cost signal realignment. B2R redefines CTG as a boundary constraint under a fixed safety budget, unifying the cost distribution of all feasible trajectories. Experimental results show that B2R satisfies safety constraints in 35 out of 38 safety-critical tasks.
arXiv Detail & Related papers (2025-09-30T03:38:20Z)
Safety-Aware Reinforcement Learning for Control via Risk-Sensitive Action-Value Iteration and Quantile Regression [2.592761128203891]
Quantile-based action-value iteration methods reduce this bias by learning a distribution of the expected cost-to-go. Existing methods often require complex neural architectures or manual tradeoffs due to combined cost functions. We propose a risk-regularized quantile-based algorithm integrating Conditional Value-at-Risk to enforce safety without complex architectures.
arXiv Detail & Related papers (2025-06-08T00:22:00Z)
Adaptive Insurance Reserving with CVaR-Constrained Reinforcement Learning under Macroeconomic Regimes [0.0]
This paper proposes a reinforcement learning (RL) framework for insurance reserving that integrates tail-risk sensitivity, macroeconomic regime modeling, and regulatory compliance. The framework also accommodates fixed-shock stress testing and regime-stratified analysis, providing a principled approach to reserving under uncertainty.
arXiv Detail & Related papers (2025-04-13T01:43:25Z)
Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning. We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values. We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
Distributional Soft Actor-Critic with Three Refinements [47.46661939652862]
Reinforcement learning (RL) has shown remarkable success in solving complex decision-making and control tasks. Many model-free RL algorithms experience performance degradation due to inaccurate value estimation. This paper introduces three key refinements to DSACv1 to overcome these limitations and further improve Q-value estimation accuracy.
arXiv Detail & Related papers (2023-10-09T16:52:48Z)
Provably Efficient Iterated CVaR Reinforcement Learning with Function Approximation and Human Feedback [57.6775169085215]
Risk-sensitive reinforcement learning aims to optimize policies that balance the expected reward and risk. We present a novel framework that employs an Iterated Conditional Value-at-Risk (CVaR) objective under both linear and general function approximations. We propose provably sample-efficient algorithms for this Iterated CVaR RL and provide rigorous theoretical analysis.
arXiv Detail & Related papers (2023-07-06T08:14:54Z)
Handling Long and Richly Constrained Tasks through Constrained Hierarchical Reinforcement Learning [20.280636126917614]
Safety in goal directed Reinforcement Learning (RL) settings has typically been handled through constraints over trajectories. We propose a (safety) Constrained Search with Hierarchical Reinforcement Learning (CoSHRL) mechanism that combines an upper level constrained search agent with a low-level goal conditioned RL agent. A major advantage of CoSHRL is that it can handle constraints on the cost value distribution and can adjust to flexible constraint thresholds without retraining.
arXiv Detail & Related papers (2023-02-21T12:57:12Z)
Penalized Proximal Policy Optimization for Safe Reinforcement Learning [68.86485583981866]
We propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem. P3O utilizes a simple-yet-effective penalty function to eliminate cost constraints and removes the trust-region constraint by the clipped surrogate objective. We show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotive tasks.
arXiv Detail & Related papers (2022-05-24T06:15:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.