Exponential Bellman Equation and Improved Regret Bounds for
Risk-Sensitive Reinforcement Learning
- URL: http://arxiv.org/abs/2111.03947v1
- Date: Sat, 6 Nov 2021 19:35:18 GMT
- Title: Exponential Bellman Equation and Improved Regret Bounds for
Risk-Sensitive Reinforcement Learning
- Authors: Yingjie Fei, Zhuoran Yang, Yudong Chen, Zhaoran Wang
- Abstract summary: We study risk-sensitive reinforcement learning (RL) based on the entropic risk measure.
We identify the deficiencies in existing algorithms and their analysis that result in such a gap.
We show that these analytic and algorithmic innovations together lead to improved regret upper bounds over existing ones.
- Score: 106.20712175398275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study risk-sensitive reinforcement learning (RL) based on the entropic
risk measure. Although existing works have established non-asymptotic regret
guarantees for this problem, they leave open an exponential gap between the
upper and lower bounds. We identify the deficiencies in existing algorithms and
their analysis that result in such a gap. To remedy these deficiencies, we
investigate a simple transformation of the risk-sensitive Bellman equations,
which we call the exponential Bellman equation. The exponential Bellman
equation inspires us to develop a novel analysis of Bellman backup procedures
in risk-sensitive RL algorithms, and further motivates the design of a novel
exploration mechanism. We show that these analytic and algorithmic innovations
together lead to improved regret upper bounds over existing ones.
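For readers unfamiliar with the setting, here is a brief sketch of the objects the abstract refers to, written in standard finite-horizon notation that is assumed here rather than quoted from the paper (risk parameter $\beta \neq 0$, step-$h$ reward $r_h$, transition kernel $P_h$, value functions $Q_h$ and $V_h$). The entropic risk measure of a random return $X$ is
\[
  \mathrm{ER}_{\beta}(X) \;=\; \frac{1}{\beta}\,\log \mathbb{E}\big[e^{\beta X}\big],
\]
which is risk-seeking for $\beta > 0$ and risk-averse for $\beta < 0$ relative to the risk-neutral expectation. The corresponding risk-sensitive Bellman equation takes the form
\[
  Q_h(s,a) \;=\; r_h(s,a) + \frac{1}{\beta}\,\log \mathbb{E}_{s' \sim P_h(\cdot \mid s,a)}\big[e^{\beta V_{h+1}(s')}\big],
  \qquad V_h(s) \;=\; \max_{a} Q_h(s,a),
\]
and exponentiating both sides gives the exponential Bellman equation, a multiplicative recursion over the exponentiated values:
\[
  e^{\beta Q_h(s,a)} \;=\; e^{\beta r_h(s,a)}\,\mathbb{E}_{s' \sim P_h(\cdot \mid s,a)}\big[e^{\beta V_{h+1}(s')}\big].
\]
The abstract's point is that reasoning about Bellman backups through this multiplicative recursion, rather than the logarithmic one, is what drives the new regret analysis and the new exploration mechanism.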
Related papers
- Continuous-time Risk-sensitive Reinforcement Learning via Quadratic Variation Penalty [5.710971447109951]
This paper studies continuous-time risk-sensitive reinforcement learning (RL) via a quadratic variation penalty.
It highlights that the conventional policy gradient representation is inadequate for risk-sensitive problems due to the nonlinear nature of quadratic variation.
It proves convergence of the proposed algorithm for Merton's investment problem and quantifies the impact of the temperature parameter on the behavior of the learning procedure.
arXiv Detail & Related papers (2024-04-19T03:05:41Z) - Provably Efficient Partially Observable Risk-Sensitive Reinforcement
Learning with Hindsight Observation [35.278669159850146]
We introduce a novel formulation that integrates hindsight observations into a Partially Observable Markov Decision Process (POMDP) framework.
We develop the first provably efficient RL algorithm tailored for this setting.
These techniques are of particular interest to the theoretical study of reinforcement learning.
arXiv Detail & Related papers (2024-02-28T08:24:06Z) - Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z) - Model-Based Uncertainty in Value Functions [89.31922008981735]
We focus on characterizing the variance over values induced by a distribution over MDPs.
Previous work upper bounds the posterior variance over values by solving a so-called uncertainty Bellman equation.
We propose a new uncertainty Bellman equation whose solution converges to the true posterior variance over values.
arXiv Detail & Related papers (2023-02-24T09:18:27Z) - Surveillance Evasion Through Bayesian Reinforcement Learning [78.79938727251594]
We consider a 2D continuous path planning problem with a completely unknown intensity of random termination.
The observers' surveillance intensity is a priori unknown and has to be learned through repeated path planning.
arXiv Detail & Related papers (2021-09-30T02:29:21Z) - A Unifying Theory of Thompson Sampling for Continuous Risk-Averse
Bandits [91.3755431537592]
This paper unifies the analysis of risk-averse Thompson sampling algorithms for the multi-armed bandit problem.
Using the contraction principle in the theory of large deviations, we prove novel concentration bounds for continuous risk functionals.
We show that a wide class of risk functionals as well as "nice" functions of them satisfy the continuity condition.
arXiv Detail & Related papers (2021-08-25T17:09:01Z) - Risk Bounds and Rademacher Complexity in Batch Reinforcement Learning [36.015585972493575]
This paper considers batch Reinforcement Learning (RL) with general value function approximation.
The excess risk of the Empirical Risk Minimizer (ERM) is bounded by the Rademacher complexity of the function class.
Fast statistical rates can be achieved by using tools from local Rademacher complexity theory.
arXiv Detail & Related papers (2021-03-25T14:45:29Z) - A Full Characterization of Excess Risk via Empirical Risk Landscape [8.797852602680445]
In this paper, we provide a unified analysis of the risk of the model trained by a proper algorithm with both smooth convex and non-convex loss functions.
arXiv Detail & Related papers (2020-12-04T08:24:50Z) - Bounded Risk-Sensitive Markov Games: Forward Policy Design and Inverse
Reward Learning with Iterative Reasoning and Cumulative Prospect Theory [33.57592649823294]
We investigate the bounded risk-sensitive Markov Game (BRSMG) problem and its inverse reward learning counterpart.
In BRSMGs, humans are modeled as having bounded intelligence and maximizing risk-sensitive utilities.
The results show that the behaviors of agents demonstrate both risk-averse and risk-seeking characteristics.
arXiv Detail & Related papers (2020-09-03T07:32:32Z) - Reparameterized Variational Divergence Minimization for Stable Imitation [57.06909373038396]
We study the extent to which variations in the choice of probabilistic divergence may yield more performant imitation-learning-from-observation (ILO) algorithms.
We contribute a reparameterization trick for adversarial imitation learning to alleviate the challenges of the promising $f$-divergence minimization framework (sketched briefly after this list).
Empirically, we demonstrate that our design choices allow for ILO algorithms that outperform baseline approaches and more closely match expert performance in low-dimensional continuous-control tasks.
arXiv Detail & Related papers (2020-06-18T19:04:09Z)
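As context for the last entry above, the following is a minimal sketch of the variational (f-GAN-style) representation on which adversarial $f$-divergence minimization frameworks are typically built; the notation is assumed here and not quoted from that paper. For a convex generator function $f$ with convex conjugate $f^{*}$, the divergence between an expert distribution $P$ and an imitator distribution $Q$ satisfies
\[
  D_f(P \,\|\, Q) \;\ge\; \sup_{T \in \mathcal{T}} \Big( \mathbb{E}_{x \sim P}\big[T(x)\big] - \mathbb{E}_{x \sim Q}\big[f^{*}(T(x))\big] \Big),
\]
with equality when the discriminator class $\mathcal{T}$ is rich enough. Adversarial imitation methods estimate this bound with a learned discriminator $T$ and train the policy to shrink it; how that objective is parameterized is, per the summary above, what the paper's reparameterization targets.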
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.