A Risk-Sensitive Approach to Policy Optimization
- URL: http://arxiv.org/abs/2208.09106v2
- Date: Thu, 16 Nov 2023 03:51:30 GMT
- Title: A Risk-Sensitive Approach to Policy Optimization
- Authors: Jared Markowitz, Ryan W. Gardner, Ashley Llorens, Raman Arora, I-Jeng Wang
- Abstract summary: Standard deep reinforcement learning (DRL) aims to maximize expected reward, considering collected experiences equally in formulating a policy.
We propose a more direct approach whereby risk-sensitive objectives, specified in terms of the cumulative distribution function (CDF) of the distribution of full-episode rewards, are optimized.
We demonstrate that the use of moderately "pessimistic" risk profiles, which emphasize scenarios where the agent performs poorly, leads to enhanced exploration and a continual focus on addressing deficiencies.
- Score: 21.684251937825234
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Standard deep reinforcement learning (DRL) aims to maximize expected reward,
considering collected experiences equally in formulating a policy. This differs
from human decision-making, where gains and losses are valued differently and
outlying outcomes are given increased consideration. It also fails to
capitalize on opportunities to improve safety and/or performance through the
incorporation of distributional context. Several approaches to distributional
DRL have been investigated, with one popular strategy being to evaluate the
projected distribution of returns for possible actions. We propose a more
direct approach whereby risk-sensitive objectives, specified in terms of the
cumulative distribution function (CDF) of the distribution of full-episode
rewards, are optimized. This approach allows for outcomes to be weighed based
on relative quality, can be used for both continuous and discrete action
spaces, and may naturally be applied in both constrained and unconstrained
settings. We show how to compute an asymptotically consistent estimate of the
policy gradient for a broad class of risk-sensitive objectives via sampling,
subsequently incorporating variance reduction and regularization measures to
facilitate effective on-policy learning. We then demonstrate that the use of
moderately "pessimistic" risk profiles, which emphasize scenarios where the
agent performs poorly, leads to enhanced exploration and a continual focus on
addressing deficiencies. We test the approach using different risk profiles in
six OpenAI Safety Gym environments, comparing to state-of-the-art on-policy
methods. Without cost constraints, we find that pessimistic risk profiles can
be used to reduce cost while improving total reward accumulation. With cost
constraints, they are seen to provide higher positive rewards than risk-neutral
approaches at the prescribed allowable cost.
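To make the CDF-based objective concrete, the sketch below illustrates one way such a sample-based gradient estimate can be formed: episodes are ranked by total return, each rank receives a weight derived from a pessimistic distortion of the empirical CDF (the power-law distortion g(u) = u**eta with eta < 1 is an illustrative assumption, not necessarily the paper's choice), and the per-episode score functions are combined with a weighted baseline for variance reduction.

```python
import numpy as np

def distortion_weights(n, eta=0.7):
    """Per-episode weights from a pessimistic distortion g(u) = u**eta applied
    to the empirical CDF of sorted episode returns; eta < 1 places more weight
    on the worst outcomes. The power-law form is an illustrative assumption."""
    u = np.arange(1, n + 1) / n
    g = u ** eta                                  # distorted CDF at ranks i/n
    return np.diff(np.concatenate(([0.0], g)))    # weights g(i/n) - g((i-1)/n)

def risk_sensitive_policy_gradient(returns, score_functions, eta=0.7):
    """REINFORCE-style estimate of the gradient of a CDF-weighted objective.

    returns:         (n,) total return of each episode
    score_functions: (n, d) rows are sum_t grad_theta log pi(a_t | s_t) per episode
    """
    returns = np.asarray(returns, dtype=float)
    score_functions = np.asarray(score_functions, dtype=float)
    order = np.argsort(returns)                   # rank episodes worst to best
    w = distortion_weights(len(returns), eta)     # weight each rank by CDF position
    baseline = np.sum(w * returns[order])         # weighted baseline (variance reduction)
    advantages = returns[order] - baseline
    return (w * advantages) @ score_functions[order]
```

With eta = 1 the weights are uniform and the estimator reduces to ordinary REINFORCE with a mean baseline; smaller eta shifts weight toward the worst-performing episodes, matching the "pessimistic" risk profiles discussed in the abstract.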
Related papers
- Data-driven decision-making under uncertainty with entropic risk measure [5.407319151576265]
The entropic risk measure is widely used in high-stakes decision making to account for tail risks associated with an uncertain loss.
To debias the empirical entropic risk estimator, we propose a strongly consistent bootstrapping procedure.
We show that cross validation methods can result in significantly higher out-of-sample risk for the insurer if the bias in validation performance is not corrected for.
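For reference, the entropic risk of a loss X at level theta is (1/theta) * log E[exp(theta * X)], and its plug-in estimate is biased in small samples. The sketch below shows the plug-in estimator together with a generic bootstrap bias correction; it is a minimal illustration, not the strongly consistent procedure proposed in the paper, and the function names are ours.

```python
import numpy as np

def entropic_risk(losses, theta=1.0):
    """Plug-in entropic risk (1/theta) * log(mean(exp(theta * X))),
    computed with a log-sum-exp shift for numerical stability."""
    x = theta * np.asarray(losses, dtype=float)
    m = x.max()
    return (m + np.log(np.mean(np.exp(x - m)))) / theta

def debiased_entropic_risk(losses, theta=1.0, n_boot=2000, seed=None):
    """Generic bootstrap bias correction: estimate the bias as the average
    excess of resampled estimates over the plug-in estimate and subtract it.
    Illustrative only; the cited paper gives a specific strongly consistent
    bootstrapping procedure."""
    rng = np.random.default_rng(seed)
    losses = np.asarray(losses, dtype=float)
    plug_in = entropic_risk(losses, theta)
    boot = [entropic_risk(rng.choice(losses, size=losses.size, replace=True), theta)
            for _ in range(n_boot)]
    return plug_in - (np.mean(boot) - plug_in)
```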
- Policy Gradient Methods for Risk-Sensitive Distributional Reinforcement Learning with Provable Convergence [15.720824593964027]
Risk-sensitive reinforcement learning (RL) is crucial for maintaining reliable performance in high-stakes applications.
This paper introduces a policy gradient method for risk-sensitive distributional RL with general coherent risk measures.
We also design a categorical distributional policy gradient algorithm (CDPG) based on categorical distributional policy evaluation and trajectory gradient estimation.
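As a concrete example of evaluating a coherent risk measure on the categorical return distributions used in distributional policy evaluation, the sketch below computes CVaR at level alpha from a set of atoms and probabilities; the interface is illustrative and not taken from CDPG.

```python
import numpy as np

def cvar_from_categorical(atoms, probs, alpha=0.1):
    """CVaR_alpha of a categorical return distribution: the expected return
    over the worst alpha-fraction of probability mass."""
    atoms, probs = np.asarray(atoms, dtype=float), np.asarray(probs, dtype=float)
    order = np.argsort(atoms)                     # sort support ascending
    z, p = atoms[order], probs[order]
    cum_before = np.cumsum(p) - p                 # mass strictly below each atom
    kept = np.minimum(p, np.maximum(alpha - cum_before, 0.0))  # tail mass kept per atom
    return float(kept @ z) / alpha
```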
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
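In the tabular case an uncertainty Bellman equation has a simple shape: local uncertainty plus a discounted expectation of next-state uncertainty under the policy. The fixed-point iteration below is a generic sketch of that recursion, not the specific UBE or the QU-SAC algorithm from the paper.

```python
import numpy as np

def solve_ube(local_uncertainty, P, pi, gamma=0.99, iters=1000):
    """Fixed-point iteration for a tabular uncertainty Bellman equation:
        U(s, a) = u(s, a) + gamma**2 * E_{s'~P(.|s,a), a'~pi(.|s')}[U(s', a')]
    Generic sketch; the cited paper derives a particular UBE whose solution
    converges to the posterior variance of values.

    local_uncertainty: (S, A) local uncertainties u(s, a)
    P:                 (S, A, S) transition probabilities
    pi:                (S, A) policy probabilities
    """
    u = np.asarray(local_uncertainty, dtype=float)
    U = np.zeros_like(u)
    for _ in range(iters):
        expected_next = np.einsum('sax,xb,xb->sa', P, pi, U)  # E[U(s', a')]
        U = u + gamma**2 * expected_next
    return U
```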
- Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation [54.72195809248172]
We present a new estimator based on a novel concept: retrospective reshuffling of participants across experimental arms at the end of an RCT.
We prove theoretically that such an estimator is more accurate than common estimators based on sample means.
- Efficient Risk-Averse Reinforcement Learning [79.61412643761034]
In risk-averse reinforcement learning (RL), the goal is to optimize some risk measure of the returns.
We prove that under certain conditions this inevitably leads to a local-optimum barrier, and propose a soft risk mechanism to bypass it.
We demonstrate improved risk aversion in maze navigation, autonomous driving, and resource allocation benchmarks.
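One reading of the soft risk idea is to optimize the risk-neutral objective early in training and anneal the risk level toward its target, so that tail-focused updates are not starved of successful trajectories. The schedule and episode-selection helper below sketch that reading; the linear schedule and function names are assumptions, not the paper's exact mechanism.

```python
import numpy as np

def soft_risk_alpha(step, total_steps, target_alpha=0.05, warmup_frac=0.5):
    """Anneal the CVaR level from 1.0 (risk-neutral) down to target_alpha over
    the first warmup_frac of training. Linear schedule chosen for illustration."""
    progress = min(step / (warmup_frac * total_steps), 1.0)
    return 1.0 + progress * (target_alpha - 1.0)

def cvar_episode_mask(returns, alpha):
    """Boolean mask selecting the worst ceil(alpha * n) episodes in a batch,
    i.e. the episodes a CVaR_alpha policy update would keep (ties may admit
    a few extra episodes)."""
    returns = np.asarray(returns, dtype=float)
    k = max(1, int(np.ceil(alpha * returns.size)))
    cutoff = np.sort(returns)[k - 1]
    return returns <= cutoff
```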
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
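The trade-off PG-BROIL targets can be summarized as a convex blend of expected performance and tail risk taken over posterior reward hypotheses. The sketch below evaluates such a blended objective for a batch of hypothesis-conditional returns; the posterior inference and gradient estimation of PG-BROIL are omitted, and the argument names are illustrative.

```python
import numpy as np

def broil_style_objective(returns_by_hypothesis, lam=0.5, alpha=0.2):
    """Convex blend of expected performance and tail risk over posterior reward
    hypotheses: lam * mean + (1 - lam) * CVaR_alpha, both taken across hypotheses.

    returns_by_hypothesis: (H,) expected policy return under each reward
                           hypothesis sampled from the posterior.
    """
    r = np.sort(np.asarray(returns_by_hypothesis, dtype=float))
    k = max(1, int(np.ceil(alpha * r.size)))
    cvar = r[:k].mean()                    # average over the worst alpha-fraction
    return lam * r.mean() + (1.0 - lam) * cvar
```

Setting lam = 1 recovers risk-neutral behavior and lam = 0 a purely risk-averse objective, which is the interpolation the summary above describes.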
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
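The Fenchel dual variable mentioned here typically enters through the identity (E[R])**2 = max_y (2*y*E[R] - y**2), which makes the variance term amenable to stochastic updates. The sketch below shows that decomposition with simple dual-ascent steps under an assumed step size; it is not the cited actor-critic's update rules.

```python
import numpy as np

def mean_variance_lagrangian_step(returns, lam, y, var_limit, lam_lr=0.01):
    """One batch of bookkeeping for a variance-constrained objective using the
    Fenchel dual of the squared mean, (E[R])**2 = max_y (2*y*E[R] - y**2)."""
    r = np.asarray(returns, dtype=float)
    mean_r, mean_r2 = r.mean(), (r ** 2).mean()
    # Surrogate variance E[R^2] - (2*y*E[R] - y^2); upper-bounds Var(R), tight at y = E[R].
    var_surrogate = mean_r2 - (2.0 * y * mean_r - y ** 2)
    lagrangian = mean_r - lam * (var_surrogate - var_limit)   # objective the actor would ascend
    y_new = mean_r                                            # closed-form inner maximization
    lam_new = max(0.0, lam + lam_lr * (var_surrogate - var_limit))  # dual ascent on the multiplier
    return lagrangian, y_new, lam_new
```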
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or more logged datasets.
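For orientation, ordinary per-trajectory importance sampling is the usual starting point for off-policy evaluation from logged data; the cited framework constructs robust and optimistic estimates beyond it. The sketch below assumes a simple trajectory format and is purely illustrative.

```python
import numpy as np

def per_trajectory_is_estimate(trajectories, target_policy_prob):
    """Ordinary per-trajectory importance-sampling estimate of a target
    policy's expected return from logged data: each logged return is weighted
    by the product of per-step likelihood ratios.

    trajectories: iterable of (states, actions, behavior_probs, total_reward)
    target_policy_prob(s, a): probability the target policy assigns to action a in state s
    """
    estimates = []
    for states, actions, behavior_probs, total_reward in trajectories:
        ratio = 1.0
        for s, a, b in zip(states, actions, behavior_probs):
            ratio *= target_policy_prob(s, a) / b     # per-step importance ratio
        estimates.append(ratio * total_reward)
    return float(np.mean(estimates))
```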
- Bayesian Robust Optimization for Imitation Learning [34.40385583372232]
Inverse reinforcement learning can enable generalization to new states by learning a parameterized reward function.
Existing safe imitation learning approaches based on IRL deal with this uncertainty using a maxmin framework.
BROIL provides a natural way to interpolate between return-maximizing and risk-minimizing behaviors.
This list is automatically generated from the titles and abstracts of the papers on this site.