Reward Redistribution for CVaR MDPs using a Bellman Operator on L-infinity
- URL: http://arxiv.org/abs/2602.03778v1
- Date: Tue, 03 Feb 2026 17:39:45 GMT
- Title: Reward Redistribution for CVaR MDPs using a Bellman Operator on L-infinity
- Authors: Aneri Muni, Vincent Taboga, Esther Derman, Pierre-Luc Bacon, Erick Delage
- Abstract summary: Tail-end risk measures such as static conditional value-at-risk (CVaR) are used in safety-critical applications to prevent rare, yet catastrophic events. We develop risk-averse value iteration and model-free Q-learning algorithms that rely on discretized augmented states. Empirical results demonstrate that our algorithms successfully learn CVaR-sensitive policies and achieve effective performance-safety trade-offs.
- Score: 16.835098688159004
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tail-end risk measures such as static conditional value-at-risk (CVaR) are used in safety-critical applications to prevent rare, yet catastrophic events. Unlike risk-neutral objectives, the static CVaR of the return depends on entire trajectories without admitting a recursive Bellman decomposition in the underlying Markov decision process. A classical resolution relies on state augmentation with a continuous variable. However, unless restricted to a specialized class of admissible value functions, this formulation induces sparse rewards and degenerate fixed points. In this work, we propose a novel formulation of the static CVaR objective based on augmentation. Our alternative approach leads to a Bellman operator with: (1) dense per-step rewards; (2) contracting properties on the full space of bounded value functions. Building on this theoretical foundation, we develop risk-averse value iteration and model-free Q-learning algorithms that rely on discretized augmented states. We further provide convergence guarantees and approximation error bounds due to discretization. Empirical results demonstrate that our algorithms successfully learn CVaR-sensitive policies and achieve effective performance-safety trade-offs.
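For readers unfamiliar with static CVaR, the following is a minimal sketch of the quantity itself, not of the paper's Bellman operator or algorithms. It compares two textbook characterizations on sampled returns: the mean of the worst alpha-fraction of outcomes, and the Rockafellar-Uryasev variational form max over b of b - E[(b - Z)^+] / alpha. The continuous threshold b is the kind of variable that classical state-augmentation approaches carry along. The function names, the Gaussian sample, and the choice alpha = 0.1 are illustrative assumptions.

```python
import random


def cvar_tail_average(returns, alpha):
    """Mean of the worst alpha-fraction of returns.

    Assumes alpha * len(returns) is (numerically) a positive integer.
    """
    k = round(alpha * len(returns))
    worst = sorted(returns)[:k]
    return sum(worst) / k


def cvar_variational(returns, alpha):
    """Rockafellar-Uryasev form: max_b  b - E[(b - Z)^+] / alpha.

    The objective is concave and piecewise linear in b, so for an
    empirical distribution it suffices to search over the sample points.
    """
    n = len(returns)

    def objective(b):
        shortfall = sum(max(b - z, 0.0) for z in returns) / n
        return b - shortfall / alpha

    return max(objective(b) for b in returns)


random.seed(0)
# Hypothetical return samples; any list of floats works.
returns = [random.gauss(1.0, 2.0) for _ in range(200)]
alpha = 0.1

print("tail average :", cvar_tail_average(returns, alpha))
print("variational  :", cvar_variational(returns, alpha))
print("mean return  :", sum(returns) / len(returns))
```

The two estimates agree up to floating-point error when alpha times the sample size is an integer, and both sit below the mean return, which is what makes CVaR a risk-averse objective.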
Related papers
- Robust Bayesian Dynamic Programming for On-policy Risk-sensitive Reinforcement Learning [4.71677151409532]
We propose a novel framework for risk-sensitive reinforcement learning that incorporates robustness against transition uncertainty. Our framework unifies and generalizes most existing RL frameworks by permitting general coherent risk measures for both inner and outer risk measures.
arXiv Detail & Related papers (2025-12-31T03:13:22Z)
- Safety-Aware Reinforcement Learning for Control via Risk-Sensitive Action-Value Iteration and Quantile Regression [2.592761128203891]
Quantile-based action-value iteration methods reduce this bias by learning a distribution of the expected cost-to-go. Existing methods often require complex neural architectures or manual tradeoffs due to combined cost functions. We propose a risk-regularized quantile-based algorithm integrating Conditional Value-at-Risk to enforce safety without complex architectures.
arXiv Detail & Related papers (2025-06-08T00:22:00Z)
- Ensuring Safety in an Uncertain Environment: Constrained MDPs via Stochastic Thresholds [28.4976864705409]
This paper studies constrained Markov decision processes (CMDPs) with constraints against thresholds, aiming at the safety of reinforcement learning in unknown and uncertain environments. We leverage a GrowingWindow estimator sampling from interactions with the uncertain and dynamic environment to estimate the thresholds, based on which we design Pessimistic-Optimistic Thresholding (SPOT). SPOT enables reinforcement learning under both pessimistic and optimistic threshold settings.
arXiv Detail & Related papers (2025-04-07T11:58:19Z)
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
- Solving Non-Rectangular Reward-Robust MDPs via Frequency Regularization [39.740287682191884]
In robust Markov decision processes (RMDPs) it is assumed that the reward and the transition dynamics lie in a given uncertainty set.
This so-called rectangularity condition is solely motivated by computational concerns.
We introduce a policy-gradient method and prove its convergence.
arXiv Detail & Related papers (2023-09-03T07:34:26Z)
- Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important to solve sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
arXiv Detail & Related papers (2023-08-12T14:59:19Z)
- Provable Guarantees for Generative Behavior Cloning: Bridging Low-Level Stability and High-Level Behavior [51.60683890503293]
We propose a theoretical framework for studying behavior cloning of complex expert demonstrations using generative modeling.
We show that pure supervised cloning can generate trajectories matching the per-time step distribution of arbitrary expert trajectories.
arXiv Detail & Related papers (2023-07-27T04:27:26Z)
- On Dynamic Programming Decompositions of Static Risk Measures in Markov Decision Processes [30.95065329164904]
We show that popular decompositions for Conditional-Value-at-Risk (CVaR) and Entropic-Value-at-Risk (EVaR) are inherently suboptimal regardless of the discretization level.
Our findings are significant because risk-averse algorithms are used in high-stake environments, making their correctness much more critical.
arXiv Detail & Related papers (2023-04-24T22:28:20Z)
- Model-Based Uncertainty in Value Functions [89.31922008981735]
We focus on characterizing the variance over values induced by a distribution over MDPs.
Previous work upper bounds the posterior variance over values by solving a so-called uncertainty Bellman equation.
We propose a new uncertainty Bellman equation whose solution converges to the true posterior variance over values.
arXiv Detail & Related papers (2023-02-24T09:18:27Z)
- Learning Dynamical Systems via Koopman Operator Regression in Reproducing Kernel Hilbert Spaces [52.35063796758121]
We formalize a framework to learn the Koopman operator from finite data trajectories of the dynamical system.
We link the risk with the estimation of the spectral decomposition of the Koopman operator.
Our results suggest RRR might be beneficial over other widely used estimators.
arXiv Detail & Related papers (2022-05-27T14:57:48Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.