Planning and Learning in Average Risk-aware MDPs
- URL: http://arxiv.org/abs/2503.17629v1
- Date: Sat, 22 Mar 2025 03:18:09 GMT
- Title: Planning and Learning in Average Risk-aware MDPs
- Authors: Weikai Wang, Erick Delage
- Abstract summary: We extend risk-neutral algorithms to accommodate the more general class of dynamic risk measures. Both the RVI and Q-learning algorithms are proven to converge to optimality. Our approach enables the identification of policies that are finely tuned to the intricate risk-awareness of the agent that they serve.
- Score: 4.696083734269232
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: For continuing tasks, average cost Markov decision processes have well-documented value and can be solved using efficient algorithms. However, this framework explicitly assumes that the agent is risk-neutral. In this work, we extend risk-neutral algorithms to accommodate the more general class of dynamic risk measures. Specifically, we propose a relative value iteration (RVI) algorithm for planning and design two model-free Q-learning algorithms: a generic algorithm based on the multi-level Monte Carlo (MLMC) method, and an off-policy algorithm dedicated to utility-based shortfall risk measures. Both the RVI and MLMC-based Q-learning algorithms are proven to converge to optimality. Numerical experiments validate our analysis, empirically confirm the convergence of the off-policy algorithm, and demonstrate that our approach enables the identification of policies that are finely tuned to the intricate risk-awareness of the agent they serve.
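To make the planning update concrete, the sketch below shows what a risk-aware relative value iteration step could look like in a small tabular MDP, with CVaR standing in as an illustrative one-step risk measure. The tabular setup, the CVaR choice, and all function names are assumptions made for illustration; this is not the authors' implementation, and its convergence is not guaranteed for arbitrary risk measures.

```python
import numpy as np

def cvar(values, probs, alpha=0.9):
    """One-step CVaR of a discrete cost random variable.

    Averages the worst (1 - alpha) probability mass of `values`.
    Used here as an illustrative static risk measure; the paper's
    framework covers general dynamic risk measures, not only CVaR.
    """
    order = np.argsort(values)[::-1]          # sort costs from worst to best
    v, p = values[order], probs[order]
    tail = 1.0 - alpha                        # tail mass to average over
    cum_before = np.concatenate(([0.0], np.cumsum(p)[:-1]))
    w = np.minimum(p, np.maximum(tail - cum_before, 0.0))
    return float(np.dot(w, v) / tail)

def risk_aware_rvi(P, C, n_iter=500, ref_state=0, alpha=0.9):
    """Sketch of relative value iteration with a risk-aware Bellman operator.

    P[a, s, s'] are transition probabilities, C[s, a] are immediate costs.
    The expectation over next states in standard RVI is replaced by a
    one-step risk measure (here CVaR); h is the relative value function
    and rho estimates the risk-aware average cost.
    """
    n_actions, n_states, _ = P.shape
    h = np.zeros(n_states)
    for _ in range(n_iter):
        q = np.array([[C[s, a] + cvar(h, P[a, s], alpha)
                       for a in range(n_actions)] for s in range(n_states)])
        h_new = q.min(axis=1)
        rho = h_new[ref_state]                # offset by a reference state
        h = h_new - rho
    policy = q.argmin(axis=1)
    return h, rho, policy
```

Replacing `cvar` with a plain expectation over next states recovers standard risk-neutral RVI, which is the special case the paper generalizes.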
Related papers
- Efficient Risk-sensitive Planning via Entropic Risk Measures [51.42922439693624]
We show that only Entropic Risk Measures (EntRM) can be efficiently optimized through dynamic programming. We prove that this optimality front can be computed effectively thanks to a novel structural analysis and smoothness properties of entropic risks. (A minimal numerical sketch of the entropic risk measure appears after this list.)
arXiv Detail & Related papers (2025-02-27T09:56:51Z) - Burning RED: Unlocking Subtask-Driven Reinforcement Learning and Risk-Awareness in Average-Reward Markov Decision Processes [7.028778922533688]
Average-reward Markov decision processes (MDPs) provide a foundational framework for sequential decision-making under uncertainty. We introduce Reward-Extended Differential (or RED) reinforcement learning: a novel RL framework that can be used to effectively and efficiently solve various learning objectives, or subtasks, simultaneously in the average-reward setting.
arXiv Detail & Related papers (2024-10-14T14:52:23Z) - A Reductions Approach to Risk-Sensitive Reinforcement Learning with Optimized Certainty Equivalents [44.09686403685058]
We study risk-sensitive RL where the goal is to learn a history-dependent policy that optimizes some risk measure of cumulative rewards. We propose two meta-algorithms: one grounded in optimism and another based on policy gradients. We empirically show that our algorithms learn the optimal history-dependent policy in a proof-of-concept MDP.
arXiv Detail & Related papers (2024-03-10T21:45:12Z) - Risk-sensitive Markov Decision Process and Learning under General Utility Functions [3.069335774032178]
Reinforcement Learning (RL) has gained substantial attention across diverse application domains and theoretical investigations. We consider a scenario where the decision-maker seeks to optimize a general utility function of the cumulative reward in the framework of a Markov decision process (MDP). We propose a modified value iteration algorithm that employs an epsilon-covering over the space of cumulative rewards. In the absence of a simulator, our algorithm, designed with an upper-confidence-bound exploration approach, identifies a near-optimal policy.
arXiv Detail & Related papers (2023-11-22T18:50:06Z) - Multivariate Systemic Risk Measures and Computation by Deep Learning Algorithms [63.03966552670014]
We discuss the key related theoretical aspects, with a particular focus on the fairness properties of primal optima and associated risk allocations.
The algorithms we provide allow for learning primal optima, optima for the dual representation, and corresponding fair risk allocations.
arXiv Detail & Related papers (2023-02-02T22:16:49Z) - Conditionally Elicitable Dynamic Risk Measures for Deep Reinforcement Learning [0.0]
We develop an efficient approach to estimate a class of dynamic spectral risk measures with deep neural networks.
We also develop a risk-sensitive actor-critic algorithm that uses full episodes and does not require any additional nested transitions.
arXiv Detail & Related papers (2022-06-29T14:11:15Z) - A policy gradient approach for optimization of smooth risk measures [8.087699764574788]
We consider episodic Markov decision processes, and model the risk using the broad class of smooth risk measures of the cumulative discounted reward.
We propose two template policy gradient algorithms that optimize a smooth risk measure in on-policy and off-policy RL settings.
arXiv Detail & Related papers (2022-02-22T17:26:28Z) - Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, together with off-policy learning and function approximation (FA), results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z) - Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov decision processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Run2Survive: A Decision-theoretic Approach to Algorithm Selection based on Survival Analysis [75.64261155172856]
Survival analysis (SA) naturally supports censored data and offers appropriate ways to use such data for learning distributional models of algorithm runtime.
We leverage such models as a basis of a sophisticated decision-theoretic approach to algorithm selection, which we dub Run2Survive.
In an extensive experimental study with the standard benchmark ASlib, our approach is shown to be highly competitive and in many cases even superior to state-of-the-art AS approaches.
arXiv Detail & Related papers (2020-07-06T15:20:17Z) - Thompson Sampling Algorithms for Mean-Variance Bandits [97.43678751629189]
We develop Thompson Sampling-style algorithms for mean-variance MAB.
We also provide comprehensive regret analyses for Gaussian and Bernoulli bandits.
Our algorithms significantly outperform existing LCB-based algorithms for all risk tolerances.
arXiv Detail & Related papers (2020-02-01T15:33:50Z)
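As a companion to the entropic-risk entry above ("Efficient Risk-sensitive Planning via Entropic Risk Measures"), the following minimal sketch computes the entropic risk of a discrete cost distribution; the function and example values are illustrative assumptions, not code taken from any of the listed papers.

```python
import numpy as np

def entropic_risk(costs, probs, beta):
    """Entropic risk of a discrete cost distribution (cost convention).

    EntRM_beta(X) = (1 / beta) * log E[exp(beta * X)].
    As beta -> 0 it recovers the expectation; larger beta puts more
    weight on high-cost outcomes, and its exponential form is what
    makes it compatible with dynamic-programming recursions.
    """
    costs, probs = np.asarray(costs, float), np.asarray(probs, float)
    m = beta * costs.max()                     # log-sum-exp shift for stability
    return (m + np.log(np.dot(probs, np.exp(beta * costs - m)))) / beta

# Example: a fair coin between costs 0 and 10
print(entropic_risk([0.0, 10.0], [0.5, 0.5], beta=0.5))   # about 8.6 > 5.0, i.e. risk-averse
```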