Policy Gradient Bayesian Robust Optimization for Imitation Learning
- URL: http://arxiv.org/abs/2106.06499v1
- Date: Fri, 11 Jun 2021 16:49:15 GMT
- Title: Policy Gradient Bayesian Robust Optimization for Imitation Learning
- Authors: Zaynah Javed, Daniel S. Brown, Satvik Sharma, Jerry Zhu, Ashwin
Balakrishna, Marek Petrik, Anca D. Dragan, Ken Goldberg
- Abstract summary: We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
- Score: 49.881386773269746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The difficulty in specifying rewards for many real-world problems has led to
an increased focus on learning rewards from human feedback, such as
demonstrations. However, there are often many different reward functions that
explain the human feedback, leaving agents with uncertainty over what the true
reward function is. While most policy optimization approaches handle this
uncertainty by optimizing for expected performance, many applications demand
risk-averse behavior. We derive a novel policy gradient-style robust
optimization approach, PG-BROIL, that optimizes a soft-robust objective that
balances expected performance and risk. To the best of our knowledge, PG-BROIL
is the first policy optimization algorithm robust to a distribution of reward
hypotheses which can scale to continuous MDPs. Results suggest that PG-BROIL
can produce a family of behaviors ranging from risk-neutral to risk-averse and
outperforms state-of-the-art imitation learning algorithms when learning from
ambiguous demonstrations by hedging against uncertainty, rather than seeking to
uniquely identify the demonstrator's reward function.
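As a rough illustration (not code from the paper), the soft-robust objective described in the abstract is commonly written as a convex combination of the expected return and the conditional value at risk (CVaR) of return over a posterior distribution of reward hypotheses. The sketch below assumes this standard formulation; the function name, the blending weight lam, and the risk level alpha are illustrative choices, and the discrete CVaR computation is a generic one rather than the paper's exact estimator.

import numpy as np

def soft_robust_objective(returns_per_hypothesis, posterior_probs, lam=0.5, alpha=0.95):
    """Illustrative soft-robust score: lam * E[rho] + (1 - lam) * CVaR_alpha[rho].

    returns_per_hypothesis: expected return of the current policy under each
        sampled reward hypothesis (shape [n_hypotheses]).
    posterior_probs: posterior probability of each hypothesis (sums to 1).
    lam, alpha: illustrative blending weight and risk level, not values from the paper.
    """
    rho = np.asarray(returns_per_hypothesis, dtype=float)
    p = np.asarray(posterior_probs, dtype=float)

    # Risk-neutral part: posterior expectation of return over reward hypotheses.
    expected = np.dot(p, rho)

    # Risk-averse part: CVaR_alpha, the probability-weighted mean of the
    # worst (1 - alpha) tail of the return distribution over hypotheses.
    order = np.argsort(rho)                      # ascending: worst hypotheses first
    cum_p = np.cumsum(p[order])
    tail_mass = 1.0 - alpha
    tail_weights = np.minimum(p[order], np.maximum(tail_mass - (cum_p - p[order]), 0.0))
    cvar = np.dot(tail_weights, rho[order]) / tail_mass

    return lam * expected + (1.0 - lam) * cvar

# Example: three reward hypotheses with posterior weights 0.5, 0.3, 0.2.
print(soft_robust_objective([1.0, 0.2, -0.5], [0.5, 0.3, 0.2], lam=0.7, alpha=0.9))

Setting lam = 1 recovers the usual risk-neutral expected-return objective, while lam = 0 optimizes only the worst-case tail; intermediate values trade off the two, which is the interpolation from risk-neutral to risk-averse behavior the abstract refers to.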
Related papers
- Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation [46.61909578101735]
Adversarial Policy Optimization (AdvPO) is a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback.
In this paper, we introduce a lightweight way to quantify uncertainties in rewards, relying solely on the last layer embeddings of the reward model.
arXiv Detail & Related papers (2024-03-08T09:20:12Z)
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
- Efficient Action Robust Reinforcement Learning with Probabilistic Policy Execution Uncertainty [43.55450683502937]
In this paper, we focus on action-robust RL with probabilistic policy execution uncertainty.
We establish the existence of an optimal policy for action-robust MDPs with probabilistic policy execution uncertainty.
We also develop the Action Robust Reinforcement Learning with Certificates (ARRLC) algorithm, which achieves minimax-optimal regret and sample complexity.
arXiv Detail & Related papers (2023-07-15T00:26:51Z)
- When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- A Risk-Sensitive Approach to Policy Optimization [21.684251937825234]
Standard deep reinforcement learning (DRL) aims to maximize expected reward, considering collected experiences equally in formulating a policy.
We propose a more direct approach whereby risk-sensitive objectives, specified in terms of the cumulative distribution function (CDF) of the distribution of full-episode rewards, are optimized.
We demonstrate that the use of moderately "pessimistic" risk profiles, which emphasize scenarios where the agent performs poorly, leads to enhanced exploration and a continual focus on addressing deficiencies.
arXiv Detail & Related papers (2022-08-19T00:55:05Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
- Bounded Risk-Sensitive Markov Games: Forward Policy Design and Inverse Reward Learning with Iterative Reasoning and Cumulative Prospect Theory [33.57592649823294]
We investigate the problem of bounded risk-sensitive Markov Game (BRSMG) and its inverse reward learning problem.
Humans are modeled as having bounded intelligence and maximizing risk-sensitive utilities in BRSMGs.
The results show that the behaviors of agents demonstrate both risk-averse and risk-seeking characteristics.
arXiv Detail & Related papers (2020-09-03T07:32:32Z)
- Bayesian Robust Optimization for Imitation Learning [34.40385583372232]
Inverse reinforcement learning can enable generalization to new states by learning a parameterized reward function.
However, the learned reward is typically uncertain; existing safe imitation learning approaches based on IRL handle this uncertainty using a maxmin framework.
BROIL provides a natural way to interpolate between return-maximizing and risk-minimizing behaviors.
arXiv Detail & Related papers (2020-07-24T01:52:11Z)