Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy
- URL: http://arxiv.org/abs/2012.14098v2
- Date: Sun, 17 Sep 2023 01:47:17 GMT
- Title: Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy
- Authors: Han Zhong, Xun Deng, Ethan X. Fang, Zhuoran Yang, Zhaoran Wang, Runze
Li
- Abstract summary: We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
- Score: 95.98698822755227
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While deep reinforcement learning has achieved tremendous successes in
various applications, most existing works only focus on maximizing the expected
value of total return and thus ignore its inherent stochasticity. Such
stochasticity is also known as the aleatoric uncertainty and is closely related
to the notion of risk. In this work, we make the first attempt to study
risk-sensitive deep reinforcement learning under the average reward setting
with the variance risk criteria. In particular, we focus on a
variance-constrained policy optimization problem where the goal is to find a
policy that maximizes the expected value of the long-run average reward,
subject to a constraint that the long-run variance of the average reward is
upper bounded by a threshold. Utilizing Lagrangian and Fenchel dualities, we
transform the original problem into an unconstrained saddle-point policy
optimization problem, and propose an actor-critic algorithm that iteratively
and efficiently updates the policy, the Lagrange multiplier, and the Fenchel
dual variable. When both the value and policy functions are represented by
multi-layer overparameterized neural networks, we prove that our actor-critic
algorithm generates a sequence of policies that finds a globally optimal policy
at a sublinear rate. Further, We provide numerical studies of the proposed
method using two real datasets to back up the theoretical results.
Related papers
- Stationary Policies are Optimal in Risk-averse Total-reward MDPs with EVaR [12.719528972742394]
We show that the risk-averse em total reward criterion can be optimized by a stationary policy.
Our results indicate that the total reward criterion may be preferable to the discounted criterion in a broad range of risk-averse reinforcement learning domains.
arXiv Detail & Related papers (2024-08-30T13:33:18Z) - Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z) - Probabilistic Reach-Avoid for Bayesian Neural Networks [71.67052234622781]
We show that an optimal synthesis algorithm can provide more than a four-fold increase in the number of certifiable states.
The algorithm is able to provide more than a three-fold increase in the average guaranteed reach-avoid probability.
arXiv Detail & Related papers (2023-10-03T10:52:21Z) - Importance-Weighted Offline Learning Done Right [16.4989952150404]
We study the problem of offline policy optimization in contextual bandit problems.
The goal is to learn a near-optimal policy based on a dataset of decision data collected by a suboptimal behavior policy.
We show that a simple alternative approach based on the "implicit exploration" estimator of citet2015 yields performance guarantees that are superior in nearly all possible terms to all previous results.
arXiv Detail & Related papers (2023-09-27T16:42:10Z) - Provable Offline Preference-Based Reinforcement Learning [95.00042541409901]
We investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback.
We consider the general reward setting where the reward can be defined over the whole trajectory.
We introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability.
arXiv Detail & Related papers (2023-05-24T07:11:26Z) - A Risk-Sensitive Approach to Policy Optimization [21.684251937825234]
Standard deep reinforcement learning (DRL) aims to maximize expected reward, considering collected experiences equally in formulating a policy.
We propose a more direct approach whereby risk-sensitive objectives, specified in terms of the cumulative distribution function (CDF) of the distribution of full-episode rewards, are optimized.
We demonstrate that the use of moderately "pessimistic" risk profiles, which emphasize scenarios where the agent performs poorly, leads to enhanced exploration and a continual focus on addressing deficiencies.
arXiv Detail & Related papers (2022-08-19T00:55:05Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A
Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z) - On the Convergence and Optimality of Policy Gradient for Markov Coherent
Risk [32.97618081988295]
We present a tight upper bound on the suboptimality of the learned policy, characterizing its dependence on the nonlinearity of the objective and the degree of risk aversion.
We propose a practical implementation of PG that uses state distribution reweighting to overcome previous limitations.
arXiv Detail & Related papers (2021-03-04T04:11:09Z) - CRPO: A New Approach for Safe Reinforcement Learning with Convergence
Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
This is the first-time analysis of SRL algorithms with global optimal policies.
arXiv Detail & Related papers (2020-11-11T16:05:14Z) - Variational Policy Gradient Method for Reinforcement Learning with
General Utilities [38.54243339632217]
In recent years, reinforcement learning systems with general goals beyond a cumulative sum of rewards have gained traction.
In this paper, we consider policy in Decision Problems, where the objective converges a general concave utility function.
We derive a new Variational Policy Gradient Theorem for RL with general utilities.
arXiv Detail & Related papers (2020-07-04T17:51:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.