Policy Newton methods for Distortion Riskmetrics
- URL: http://arxiv.org/abs/2508.07249v1
- Date: Sun, 10 Aug 2025 09:03:32 GMT
- Title: Policy Newton methods for Distortion Riskmetrics
- Authors: Soumen Pachal, Mizhaan Prajit Maniyar, Prashanth L. A.
- Abstract summary: We find a risk-optimal policy by maximizing the distortion riskmetric (DRM) of the discounted reward in a finite horizon Markov decision process (MDP). We propose a natural DRM Hessian estimator from sample trajectories of the underlying MDP. Our proposed algorithm is shown to converge to an $\epsilon$-second-order stationary point ($\epsilon$-SOSP) of the DRM objective.
- Score: 7.8105721078323835
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of risk-sensitive control in a reinforcement learning (RL) framework. In particular, we aim to find a risk-optimal policy by maximizing the distortion riskmetric (DRM) of the discounted reward in a finite horizon Markov decision process (MDP). DRMs are a rich class of risk measures that include several well-known risk measures as special cases. We derive a policy Hessian theorem for the DRM objective using the likelihood ratio method. Using this result, we propose a natural DRM Hessian estimator from sample trajectories of the underlying MDP. Next, we present a cubic-regularized policy Newton algorithm for solving this problem in an on-policy RL setting using estimates of the DRM gradient and Hessian. Our proposed algorithm is shown to converge to an $\epsilon$-second-order stationary point ($\epsilon$-SOSP) of the DRM objective, and this guarantee ensures escape from saddle points. The sample complexity of our algorithm to find an $\epsilon$-SOSP is $\mathcal{O}(\epsilon^{-3.5})$. Our experiments validate the theoretical findings. To the best of our knowledge, ours is the first work to present convergence to an $\epsilon$-SOSP of a risk-sensitive objective, while existing works in the literature have either shown convergence to a first-order stationary point of a risk-sensitive objective, or to an SOSP of a risk-neutral one.
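As a concrete illustration of the update described above, here is a minimal sketch of a single cubic-regularized Newton step, assuming sampled estimates `g` and `H` of the DRM gradient and Hessian are already available. Solving the cubic subproblem by gradient ascent is one common approximation; all names are illustrative rather than the paper's exact procedure.

```python
import numpy as np

def cubic_newton_step(g, H, M=1.0, lr=0.01, iters=200):
    """Approximately maximize the local cubic model
        m(d) = g.d + 0.5 * d'Hd - (M / 6) * ||d||^3
    by gradient ascent (a standard approximate subproblem solver).
    g: (d,) sampled gradient estimate; H: (d, d) sampled Hessian
    estimate; M: cubic-regularization constant."""
    d = np.zeros_like(g)
    for _ in range(iters):
        # grad of the cubic penalty -(M/6)||d||^3 is -(M/2)||d|| d
        grad_m = g + H @ d - 0.5 * M * np.linalg.norm(d) * d
        d += lr * grad_m
    return d

# illustrative use: theta = theta + cubic_newton_step(g_hat, H_hat)
```

The cubic term caps the step length and lets the model exploit negative curvature, which is what underlies second-order stationarity guarantees for this family of methods.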
Related papers
- Risk-sensitive reinforcement learning using expectiles, shortfall risk and optimized certainty equivalent risk [7.358503757109041]
We propose risk-sensitive reinforcement learning algorithms catering to three families of risk measures. For each risk measure, in the context of a finite horizon Markov decision process, we first derive a policy gradient theorem. We conduct numerical experiments to validate the theoretical findings on popular RL benchmarks.
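The optimized certainty equivalent (OCE) named above admits a one-dimensional variational form, whose best-known instance is the Rockafellar-Uryasev representation of CVaR. A minimal sample-based sketch; the function name and the plug-in quantile shortcut are ours, not the paper's:

```python
import numpy as np

def empirical_cvar(losses, alpha=0.95):
    """CVaR via the Rockafellar-Uryasev (OCE-type) form
        CVaR_a(X) = min_eta { eta + E[(X - eta)^+] / (1 - alpha) }.
    The minimizer is the alpha-quantile (VaR), so a plug-in
    estimate substitutes the empirical quantile directly."""
    eta = np.quantile(losses, alpha)  # empirical VaR
    return eta + np.mean(np.maximum(losses - eta, 0.0)) / (1.0 - alpha)

print(empirical_cvar(np.random.standard_normal(10_000)))
```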
arXiv Detail & Related papers (2026-02-10T00:38:21Z)
- Empirical Risk Minimization with $f$-Divergence Regularization [48.54320235705813]
This paper presents the solution to the empirical risk minimization problem with $f$-divergence regularization (ERM-$f$DR). The proposed approach extends applicability to a broader class of $f$-divergences than previously reported and yields theoretical results that recover previously known ones.
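The $f$-divergence family includes the KL divergence, for which the regularized solution is the classical Gibbs tilt of the reference measure. A minimal sketch over a finite sample, assuming a uniform reference measure for illustration:

```python
import numpy as np

def gibbs_weights(losses, lam=1.0):
    """KL special case of ERM with f-divergence regularization:
        argmin_P  E_P[L] + lam * KL(P || Q)
    is the Gibbs measure dP/dQ proportional to exp(-L / lam),
    here with Q uniform over the sample."""
    w = np.exp(-(losses - losses.min()) / lam)  # shift for stability
    return w / w.sum()

losses = np.array([0.2, 1.5, 0.7])
w = gibbs_weights(losses)
print(w, w @ losses)  # tilted weights and reweighted empirical risk
```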
arXiv Detail & Related papers (2026-01-19T16:13:58Z)
- Bayesian Risk-Sensitive Policy Optimization For MDPs With General Loss Functions [8.16996766356341]
We consider Markov decision processes (MDPs) with a general loss function and unknown parameters. We take a Bayesian approach to estimate the parameters from data and impose a coherent risk functional on the loss. We propose a policy gradient optimization method, leveraging the dual representation of coherent risk measures.
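For CVaR, the dual representation concentrates weight on the worst tail of outcomes, which leads to likelihood-ratio policy gradients of the kind derived by Tamar et al.; the sketch below is that generic CVaR gradient under assumed shapes and names, not necessarily the update used in this paper:

```python
import numpy as np

def cvar_policy_gradient(returns, scores, alpha=0.1):
    """Likelihood-ratio gradient of CVaR_alpha of the return
    (lower tail, risk-averse), in the style of Tamar et al.:
        grad = E[ (R - VaR) * grad log p(tau) | R <= VaR ].
    returns: (n,) sampled returns; scores: (n, d) score vectors."""
    var = np.quantile(returns, alpha)  # empirical VaR
    tail = returns <= var              # worst-tail trajectories
    return ((returns[tail] - var)[:, None] * scores[tail]).mean(axis=0)
```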
arXiv Detail & Related papers (2025-09-19T01:16:59Z)
- A Reductions Approach to Risk-Sensitive Reinforcement Learning with Optimized Certainty Equivalents [44.09686403685058]
We study risk-sensitive RL, where the goal is to learn a history-dependent policy that optimizes some risk measure of cumulative rewards. We propose two meta-algorithms: one grounded in optimism and another based on policy gradients. We empirically show that our algorithms learn the optimal history-dependent policy in a proof-of-concept MDP.
arXiv Detail & Related papers (2024-03-10T21:45:12Z)
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC), that can be applied for either risk-seeking or risk-averse policy optimization.
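A minimal tabular sketch of an uncertainty Bellman recursion in the spirit of UBE-style results: a local uncertainty term is propagated through a Bellman-like backup with a squared discount. The state-indexed simplification and the particular local-uncertainty term are our assumptions:

```python
import numpy as np

def uncertainty_backup(P, local_u, gamma=0.99, iters=500):
    """Tabular uncertainty Bellman recursion (sketch):
        u(s) = local_u(s) + gamma**2 * sum_s' P(s'|s) * u(s')
    P: (S, S) transition matrix under the current policy;
    local_u: (S,) per-state local uncertainty (e.g. posterior
    variance of the one-step model).  Converges since gamma < 1."""
    u = np.zeros_like(local_u)
    for _ in range(iters):
        u = local_u + gamma**2 * P @ u
    return u
```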
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
- Provably Efficient CVaR RL in Low-rank MDPs [58.58570425202862]
We study risk-sensitive Reinforcement Learning (RL) with the CVaR objective.
We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to balance the interplay between exploration, exploitation, and representation learning in CVaR RL.
We prove that our algorithm achieves a sample complexity polynomial in $H$, $A$, $d$, and $1/\epsilon$ to yield an $\epsilon$-optimal CVaR, where $H$ is the length of each episode, $A$ is the capacity of the action space, and $d$ is the dimension of representations.
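The paper's bonuses are built from learned low-rank representations; as a far simpler tabular stand-in, the same UCB principle adds a count-based bonus to optimistic value targets (constants and names are illustrative):

```python
import numpy as np

def ucb_bonus(counts, c=1.0):
    """Count-based exploration bonus b(s, a) = c / sqrt(n(s, a)),
    a tabular stand-in for representation-based bonuses.
    Optimistic targets then use r(s, a) + b(s, a) + E[V(s')]."""
    return c / np.sqrt(np.maximum(counts, 1))
```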
arXiv Detail & Related papers (2023-11-20T17:44:40Z)
- A policy gradient approach for optimization of smooth risk measures [8.087699764574788]
We consider episodic Markov decision processes, and model the risk using the broad class of smooth risk measures of the cumulative discounted reward.
We propose two template policy gradient algorithms that optimize a smooth risk measure in on-policy and off-policy RL settings.
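For a concrete smooth risk measure, take the mean with a variance penalty; the likelihood-ratio trick then gives an on-policy gradient estimate from sampled returns and score vectors. This is our worked special case, not the paper's general template:

```python
import numpy as np

def mean_variance_gradient(returns, scores, lam=0.1):
    """Likelihood-ratio gradient of rho = E[R] - lam * Var(R), using
        grad E[R]   = E[R * s]        with s = grad log pi(tau),
        grad E[R^2] = E[R^2 * s],
        grad Var(R) = grad E[R^2] - 2 * E[R] * grad E[R].
    returns: (n,); scores: (n, d)."""
    m = returns.mean()
    g_mean = (returns[:, None] * scores).mean(axis=0)
    g_sq = ((returns ** 2)[:, None] * scores).mean(axis=0)
    return g_mean - lam * (g_sq - 2.0 * m * g_mean)
```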
arXiv Detail & Related papers (2022-02-22T17:26:28Z)
- Policy Gradient Methods for Distortion Risk Measures [9.554545881355377]
We propose policy gradient algorithms which learn risk-sensitive policies in a reinforcement learning framework.
We derive a variant of the policy gradient theorem that caters to the DRM objective, and integrate it with a likelihood ratio-based gradient estimation scheme.
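The DRM itself can be estimated from sample returns as an L-statistic: sort the returns and weight each order statistic by an increment of the distortion function over the empirical tail probabilities. A sketch, with the distortion `g` assumed given:

```python
import numpy as np

def empirical_drm(returns, g):
    """Empirical distortion riskmetric (L-statistic form):
        rho_hat = sum_i R_(i) * [ g((n-i+1)/n) - g((n-i)/n) ]
    with R_(1) <= ... <= R_(n) and a distortion g: [0, 1] -> [0, 1]
    satisfying g(0) = 0 and g(1) = 1."""
    r = np.sort(returns)
    n = len(r)
    tail = np.arange(n, 0, -1) / n       # (n - i + 1) / n for i = 1..n
    w = g(tail) - g(tail - 1.0 / n)      # distortion increments
    return float(w @ r)

# g(p) = min(p / 0.1, 1) averages the top 10% of outcomes (CVaR-type)
print(empirical_drm(np.random.standard_normal(10_000),
                    lambda p: np.minimum(p / 0.1, 1.0)))
```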
arXiv Detail & Related papers (2021-07-09T13:14:12Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criterion.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
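Fenchel duality turns the variance into a minimization over a scalar dual variable via Var(R) = min_y E[(R - y)^2], which is what makes three coupled updates possible. The sketch below is heavily simplified, with step sizes and update order chosen for illustration rather than taken from the paper:

```python
import numpy as np

def variance_constrained_step(returns, scores, theta, lam, y,
                              var_limit=1.0, lr=1e-2):
    """One round of coupled updates (sketch) for
        max_theta E[R]  s.t.  Var(R) <= var_limit,
    with Lagrangian  E[R] - lam * (E[(R - y)^2] - var_limit).
    returns: (n,); scores: (n, d) score vectors."""
    # primal: likelihood-ratio ascent on the Lagrangian
    w = returns - lam * (returns - y) ** 2
    theta = theta + lr * (w[:, None] * scores).mean(axis=0)
    # Fenchel-dual variable y minimizes E[(R - y)^2], i.e. tracks E[R]
    y = y + 2.0 * lr * (returns - y).mean()
    # multiplier: projected ascent on the constraint violation
    lam = max(0.0, lam + lr * (((returns - y) ** 2).mean() - var_limit))
    return theta, lam, y
```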
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- Risk-Sensitive Reinforcement Learning: Near-Optimal Risk-Sample Tradeoff in Regret [115.85354306623368]
We study risk-sensitive reinforcement learning in episodic Markov decision processes with unknown transition kernels.
We propose two provably efficient model-free algorithms, Risk-Sensitive Value Iteration (RSVI) and Risk-Sensitive Q-learning (RSQ).
We prove that RSVI attains an $\tilde{O}\big(\lambda(|\beta| H^2) \cdot \sqrt{H^3 S^2 A T}\big)$ regret, while RSQ attains a regret bound of the same form.
arXiv Detail & Related papers (2020-06-22T19:28:26Z)
- Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
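The primal-dual template alternates a policy gradient step on the Lagrangian with projected dual ascent on the constraint violation; a minimal sketch under assumed names and shapes (the paper adds function approximation and an exploration bonus on top of this skeleton):

```python
import numpy as np

def primal_dual_step(R, C, scores, theta, lam, budget=1.0, lr=1e-2):
    """One primal-dual update (sketch) for the CMDP
        max_theta E[R]  s.t.  E[C] <= budget.
    Primal: likelihood-ratio ascent on E[R] - lam * E[C];
    dual: projected subgradient ascent on the violation.
    R, C: (n,) sampled reward/cost returns; scores: (n, d)."""
    adv = R - lam * C
    theta = theta + lr * (adv[:, None] * scores).mean(axis=0)
    lam = max(0.0, lam + lr * (C.mean() - budget))
    return theta, lam
```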
arXiv Detail & Related papers (2020-03-01T17:47:03Z)
- Cautious Reinforcement Learning via Distributional Risk in the Dual Domain [45.17200683056563]
We study the estimation of risk-sensitive policies in reinforcement learning problems defined by Markov decision processes (MDPs) whose state and action spaces are countably finite.
We propose a new definition of risk, which we call caution, as a penalty function added to the dual objective of the linear programming (LP) formulation of reinforcement learning.
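Given the LP over discounted occupancy measures, adding a convex penalty to the objective is direct; a sketch using cvxpy, where the squared-norm penalty stands in for the paper's caution term (the penalty choice and all names are our assumptions):

```python
import cvxpy as cp
import numpy as np

def cautious_lp(P, r, nu0, gamma=0.9, c=1.0):
    """LP formulation of RL over occupancy measures mu(s, a), with a
    convex 'caution' penalty subtracted from the objective (sketch).
    P: (S, S, A) with P[s2, s, a] = Pr(s2 | s, a); r: (S, A) rewards;
    nu0: (S,) initial distribution."""
    S, A = r.shape
    mu = cp.Variable((S, A), nonneg=True)
    # Bellman flow constraints on the discounted occupancy measure
    flow = [cp.sum(mu[s, :]) ==
            (1 - gamma) * nu0[s] + gamma * cp.sum(cp.multiply(P[s], mu))
            for s in range(S)]
    obj = cp.Maximize(cp.sum(cp.multiply(r, mu)) - c * cp.sum_squares(mu))
    cp.Problem(obj, flow).solve()
    return mu.value
```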
arXiv Detail & Related papers (2020-02-27T23:18:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.