Cautious Reinforcement Learning via Distributional Risk in the Dual Domain
- URL: http://arxiv.org/abs/2002.12475v1
- Date: Thu, 27 Feb 2020 23:18:04 GMT
- Title: Cautious Reinforcement Learning via Distributional Risk in the Dual Domain
- Authors: Junyu Zhang, Amrit Singh Bedi, Mengdi Wang, Alec Koppel
- Abstract summary: We study the estimation of risk-sensitive policies in reinforcement learning problems defined by a Markov Decision Process (MDP) whose state and action spaces are countably finite.
We propose a new definition of risk, which we call caution, as a penalty function added to the dual objective of the linear programming (LP) formulation of reinforcement learning.
- Score: 45.17200683056563
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the estimation of risk-sensitive policies in reinforcement learning
problems defined by a Markov Decision Process (MDP) whose state and action
spaces are countably finite. Prior efforts are predominantly afflicted by
computational challenges associated with the fact that risk-sensitive MDPs are
time-inconsistent. To ameliorate this issue, we propose a new definition of
risk, which we call caution, as a penalty function added to the dual objective
of the linear programming (LP) formulation of reinforcement learning. The
caution measures the distributional risk of a policy, which is a function of
the policy's long-term state occupancy distribution. To solve this problem in
an online model-free manner, we propose a stochastic variant of the primal-dual
method that uses the Kullback-Leibler (KL) divergence as its proximal term. We
establish that the number of iterations/samples this scheme requires to attain
approximately optimal solutions matches tight dependencies on
the cardinality of the state and action spaces, but differs in its dependence
on the infinity norm of the gradient of the risk measure. Experiments
demonstrate the merits of this approach for improving the reliability of reward
accumulation without additional computational burdens.
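To make the formulation concrete, here is a minimal LaTeX sketch of a caution-penalized dual (occupancy-measure) LP and a KL-proximal update of the kind described in the abstract. The notation (discount $\gamma$, initial distribution $\xi$, caution weight $\lambda$, step size $\eta$) is standard occupancy-measure convention assumed for illustration, not necessarily the paper's exact notation or algorithm:
\[
\max_{\mu \ge 0} \;\; \sum_{s,a} \mu(s,a)\, r(s,a) \;-\; \lambda\, \rho(\mu)
\quad \text{s.t.} \quad
\sum_{a} \mu(s',a) \;=\; (1-\gamma)\,\xi(s') \;+\; \gamma \sum_{s,a} P(s' \mid s,a)\, \mu(s,a) \;\; \forall s',
\]
with the policy recovered as $\pi(a \mid s) = \mu(s,a) / \sum_{a'} \mu(s,a')$. Using the KL divergence as the proximal term turns the stochastic primal update on $\mu$ into an exponentiated-gradient step,
\[
\mu_{t+1}(s,a) \;\propto\; \mu_t(s,a)\, \exp\!\big(\eta\, \hat g_t(s,a)\big),
\]
where $\hat g_t$ is a stochastic estimate of the gradient of the penalized Lagrangian with respect to $\mu$. This is a sketch of the general construction only.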
Related papers
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
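For context, a generic uncertainty Bellman equation from the literature takes the form below; the local uncertainty term and the exact recursion used in the cited paper may differ, so treat this as an illustrative template rather than the paper's equation:
\[
U(s,a) \;=\; u(s,a) \;+\; \gamma^2\, \mathbb{E}_{s' \sim P(\cdot \mid s,a),\, a' \sim \pi(\cdot \mid s')}\big[ U(s',a') \big],
\]
where $u(s,a)$ captures local epistemic uncertainty (e.g., posterior variance of the one-step model) and the fixed point $U$ approximates the posterior variance over values.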
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
- Soft Robust MDPs and Risk-Sensitive MDPs: Equivalence, Policy Gradient, and Sample Complexity [7.57543767554282]
This paper introduces a new formulation for risk-sensitive MDPs, which assesses risk in a slightly different manner compared to the classical Markov risk measure.
We derive the policy gradient theorem for both problems, proving gradient domination and global convergence of the exact policy gradient method.
We also propose a sample-based offline learning algorithm, namely the robust fitted-Z iteration (RFZI).
arXiv Detail & Related papers (2023-06-20T15:51:25Z)
- Distributional Method for Risk Averse Reinforcement Learning [0.0]
We introduce a distributional method for learning the optimal policy in a risk-averse Markov decision process.
We assume sequential observations of states, actions, and costs and assess the performance of a policy using dynamic risk measures.
arXiv Detail & Related papers (2023-02-27T19:48:42Z)
- Conservative Distributional Reinforcement Learning with Safety Constraints [22.49025480735792]
Safe exploration can be regarded as a constrained Markov decision problem where the expected long-term cost is constrained.
Previous off-policy algorithms convert the constrained optimization problem into the corresponding unconstrained dual problem by introducing the Lagrangian relaxation technique.
We present a novel off-policy reinforcement learning algorithm called Conservative Distributional Maximum a Posteriori Policy Optimization.
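The Lagrangian relaxation mentioned above takes a standard form; the notation below ($J_r$ for expected return, $J_c$ for expected long-term cost, $d$ for the cost budget) is assumed here for illustration:
\[
\max_{\pi} \; \min_{\lambda \ge 0} \;\; J_r(\pi) \;-\; \lambda \big( J_c(\pi) - d \big),
\]
so the constrained problem is addressed by alternating policy updates with (sub)gradient updates on the multiplier $\lambda$.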
arXiv Detail & Related papers (2022-01-18T19:45:43Z)
- An Offline Risk-aware Policy Selection Method for Bayesian Markov Decision Processes [0.0]
Exploitation vs Caution (EvC) is a paradigm that incorporates model uncertainty while abiding by the Bayesian formalism.
We validate EvC against state-of-the-art approaches in different discrete, yet simple, environments offering a fair variety of MDP classes.
In the tested scenarios EvC manages to select robust policies and hence stands out as a useful tool for practitioners.
arXiv Detail & Related papers (2021-05-27T20:12:20Z)
- Off-Policy Evaluation of Slate Policies under Bayes Risk [70.10677881866047]
We study the problem of off-policy evaluation for slate bandits, for the typical case in which the logging policy factorizes over the slots of the slate.
We show that the risk improvement over the pseudo-inverse (PI) estimator grows linearly with the number of slots, and linearly with the gap between the arithmetic and the harmonic mean of a set of slot-level divergences.
arXiv Detail & Related papers (2021-01-05T20:07:56Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with a variance risk criterion.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
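The Fenchel dual variable typically enters through the standard identity for the squared mean inside the variance; a sketch in assumed notation (not necessarily the paper's) is:
\[
\mathrm{Var}_\pi[R] \;=\; \mathbb{E}_\pi[R^2] \;-\; \big(\mathbb{E}_\pi[R]\big)^2,
\qquad
\big(\mathbb{E}_\pi[R]\big)^2 \;=\; \max_{y \in \mathbb{R}} \Big( 2\, y\, \mathbb{E}_\pi[R] \;-\; y^2 \Big),
\]
which replaces the nonlinear squared-mean term with an inner optimization over a scalar dual variable $y$ that can be updated alongside the policy and the Lagrange multiplier.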
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
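As an illustration of the kernelization idea (a generic sketch in assumed notation, not the paper's exact estimators): for a deterministic target policy $\tau$ and continuous actions, the ill-defined Dirac density ratio is replaced by a kernel-smoothed weight,
\[
\hat V_h \;=\; \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^{d}}\, K\!\Big( \frac{a_i - \tau(s_i)}{h} \Big)\, \frac{r_i}{\pi_b(a_i \mid s_i)},
\]
with kernel $K$, bandwidth $h$, action dimension $d$, logged outcome $r_i$, and behavior density $\pi_b$; doubly robust variants add a regression-based control variate to reduce bias and variance.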
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.