Related papers: Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization

Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization

URL: http://arxiv.org/abs/2312.04386v2
Date: Wed, 13 Dec 2023 08:57:29 GMT
Title: Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization
Authors: Carlos E. Luis, Alessandro G. Bottero, Julia Vinogradska, Felix Berkenkamp, Jan Peters
Abstract summary: We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning. In particular, we focus on characterizing the variance over values induced by a distribution over MDPs. We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
Score: 63.32053223422317
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning. In particular, we focus on characterizing the variance over values induced by a distribution over MDPs. Previous work upper bounds the posterior variance over values by solving a so-called uncertainty Bellman equation (UBE), but the over-approximation may result in inefficient exploration. We propose a new UBE whose solution converges to the true posterior variance over values and leads to lower regret in tabular exploration problems. We identify challenges to apply the UBE theory beyond tabular problems and propose a suitable approximation. Based on this approximation, we introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC), that can be applied for either risk-seeking or risk-averse policy optimization with minimal changes. Experiments in both online and offline RL demonstrate improved performance compared to other uncertainty estimation methods.

Related papers

Risk-sensitive Reinforcement Learning Based on Convex Scoring Functions [8.758206783988404]
We propose a reinforcement learning framework under a broad class of risk objectives, characterized by convex scoring functions.<n>This class covers many common risk measures, such as variance, Expected Shortfall, entropic Value-at-Risk, and mean-risk utility.<n>We validate our approach in simulation experiments with a financial application in statistical arbitrage trading, demonstrating the effectiveness of the algorithm.
arXiv Detail & Related papers (2025-05-07T16:31:42Z)
Decision from Suboptimal Classifiers: Excess Risk Pre- and Post-Calibration [52.70324949884702]
We quantify the excess risk incurred using approximate posterior probabilities in batch binary decision-making. We identify regimes where recalibration alone addresses most of the regret, and regimes where the regret is dominated by the grouping loss. On NLP experiments, we show that these quantities identify when the expected gain of more advanced post-training is worth the operational cost.
arXiv Detail & Related papers (2025-03-23T10:52:36Z)
Pitfall of Optimism: Distributional Reinforcement Learning by Randomizing Risk Criterion [9.35556128467037]
We present a novel distributional reinforcement learning algorithm that selects actions by randomizing risk criterion to avoid one-sided tendency on risk. Our theoretical results support that the proposed method does not fall into biased exploration and is guaranteed to converge to an optimal return.
arXiv Detail & Related papers (2023-10-25T10:53:04Z)
Model-Based Uncertainty in Value Functions [89.31922008981735]
We focus on characterizing the variance over values induced by a distribution over MDPs. Previous work upper bounds the posterior variance over values by solving a so-called uncertainty Bellman equation. We propose a new uncertainty Bellman equation whose solution converges to the true posterior variance over values.
arXiv Detail & Related papers (2023-02-24T09:18:27Z)
Regret Bounds for Markov Decision Processes with Recursive Optimized Certainty Equivalents [3.8980564330208662]
We propose a new episodic risk-sensitive reinforcement learning formulation. We design an efficient learning algorithm for this problem based on value iteration and upper confidence bound. Our bounds show that the regret rate achieved by our proposed algorithm has optimal dependence on the number of episodes and the number of actions.
arXiv Detail & Related papers (2023-01-30T01:22:31Z)
Learning to Optimize with Stochastic Dominance Constraints [103.26714928625582]
In this paper, we develop a simple yet efficient approach for the problem of comparing uncertain quantities. We recast inner optimization in the Lagrangian as a learning problem for surrogate approximation, which bypasses apparent intractability. The proposed light-SD demonstrates superior performance on several representative problems ranging from finance to supply chain management.
arXiv Detail & Related papers (2022-11-14T21:54:31Z)
Mean-Semivariance Policy Optimization via Risk-Averse Reinforcement Learning [12.022303947412917]
This paper aims at optimizing the mean-semivariance criterion in reinforcement learning w.r.t. steady rewards. We reveal that the MSV problem can be solved by iteratively solving a sequence of RL problems with a policy-dependent reward function. We propose two on-policy algorithms based on the policy gradient theory and the trust region method.
arXiv Detail & Related papers (2022-06-15T08:32:53Z)
Optimal variance-reduced stochastic approximation in Banach spaces [114.8734960258221]
We study the problem of estimating the fixed point of a contractive operator defined on a separable Banach space. We establish non-asymptotic bounds for both the operator defect and the estimation error.
arXiv Detail & Related papers (2022-01-21T02:46:57Z)
Risk-Averse Bayes-Adaptive Reinforcement Learning [3.5289688061934963]
We pose the problem of optimising the conditional value at risk (CVaR) of the total return in Bayes-adaptive Markov decision processes (MDPs) We show that a policy optimising CVaR in this setting is risk-averse to both the parametric uncertainty due to the prior distribution over MDPs, and the internal uncertainty due to the inherentity of MDPs. Our experiments demonstrate that our approach significantly outperforms baseline approaches for this problem.
arXiv Detail & Related papers (2021-02-10T22:34:33Z)
Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria. We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
Thompson Sampling Algorithms for Mean-Variance Bandits [97.43678751629189]
We develop Thompson Sampling-style algorithms for mean-variance MAB. We also provide comprehensive regret analyses for Gaussian and Bernoulli bandits. Our algorithms significantly outperform existing LCB-based algorithms for all risk tolerances.
arXiv Detail & Related papers (2020-02-01T15:33:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.