Efficient Exploration via Epistemic-Risk-Seeking Policy Optimization
- URL: http://arxiv.org/abs/2302.09339v2
- Date: Sun, 4 Jun 2023 13:35:45 GMT
- Title: Efficient Exploration via Epistemic-Risk-Seeking Policy Optimization
- Authors: Brendan O'Donoghue
- Abstract summary: Exploration remains a key challenge in deep reinforcement learning (RL).
In this paper we propose a new, differentiable optimistic objective that when optimized yields a policy that provably explores efficiently.
Results show significant performance improvements even over other efficient exploration techniques.
- Score: 8.867416300893577
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Exploration remains a key challenge in deep reinforcement learning (RL).
Optimism in the face of uncertainty is a well-known heuristic with theoretical
guarantees in the tabular setting, but how best to translate the principle to
deep reinforcement learning, which involves online stochastic gradients and
deep network function approximators, is not fully understood. In this paper we
propose a new, differentiable optimistic objective that when optimized yields a
policy that provably explores efficiently, with guarantees even under function
approximation. Our new objective is a zero-sum two-player game derived from
endowing the agent with an epistemic-risk-seeking utility function, which
converts uncertainty into value and encourages the agent to explore uncertain
states. We show that the solution to this game minimizes an upper bound on the
regret, with the 'players' each attempting to minimize one component of a
particular regret decomposition. We derive a new model-free algorithm which we
call 'epistemic-risk-seeking actor-critic' (ERSAC), which is simply an
application of simultaneous stochastic gradient ascent-descent to the game.
Finally, we discuss a recipe for incorporating off-policy data and show that
combining the risk-seeking objective with replay data yields a double benefit
in terms of statistical efficiency. We conclude with some results showing good
performance of a deep RL agent using the technique on the challenging 'DeepSea'
environment, showing significant performance improvements even over other
efficient exploration techniques, as well as improved performance on the Atari
benchmark.
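The core mechanism is straightforward to sketch: the epistemic-risk-seeking (exponential) utility turns spread in the agent's value estimates into an optimism bonus, so uncertain actions look more attractive. Below is a minimal illustrative sketch, not the paper's implementation, assuming the epistemic distribution over Q-values is represented by a small ensemble and that the utility takes the standard exponential form u_tau(Q) = (1/tau) log E[exp(tau Q)]; the function and variable names are hypothetical.

```python
# Minimal sketch (assumption: epistemic uncertainty represented by an ensemble
# of Q estimates; risk-seeking utility u_tau(Q) = (1/tau) * log E[exp(tau * Q)]).
import numpy as np

def risk_seeking_value(q_samples: np.ndarray, tau: float) -> np.ndarray:
    """Optimistic action values from epistemic Q samples.

    q_samples: (num_ensemble, num_actions) samples of Q(s, .) for one state.
    tau:       risk-seeking temperature; larger tau means more optimism,
               and tau -> 0 recovers the plain ensemble mean.
    """
    # Numerically stable (1/tau) * log mean_k exp(tau * Q_k(s, a)).
    m = q_samples.max(axis=0)
    return m + np.log(np.exp(tau * (q_samples - m)).mean(axis=0)) / tau

# Toy check: both actions have the same mean value, but the second action's
# estimate is more uncertain, so the risk-seeking utility prefers it.
rng = np.random.default_rng(0)
q_ens = rng.normal(loc=[1.0, 1.0], scale=[0.05, 0.5], size=(16, 2))
print(risk_seeking_value(q_ens, tau=2.0))
```

In the full ERSAC algorithm the optimistic values enter an entropy-regularized actor-critic objective and the two 'players' run simultaneous stochastic gradient ascent-descent on the resulting game; the sketch above only illustrates how uncertainty is converted into value.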
Related papers
- Complexity-Aware Deep Symbolic Regression with Robust Risk-Seeking Policy Gradients [20.941908494137806]
This paper proposes a novel deep symbolic regression approach to enhance the robustness and interpretability of data-driven mathematical expression discovery.
Despite its success, the state-of-the-art method, DSR, is built on recurrent neural networks and is guided purely by data fitness.
We use transformers in conjunction with breadth-first search to improve learning performance.
arXiv Detail & Related papers (2024-06-10T19:29:10Z) - When Demonstrations Meet Generative World Models: A Maximum Likelihood
Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive settings such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z) - Entropy Augmented Reinforcement Learning [0.0]
We propose a shifted Markov decision process (MDP) to encourage exploration and strengthen the agent's ability to escape from suboptimal solutions.
Our experiments test augmented TRPO and PPO on MuJoCo benchmark tasks, indicating that the agent is encouraged towards higher-reward regions.
arXiv Detail & Related papers (2022-08-19T13:09:32Z) - Efficient Neural Network Analysis with Sum-of-Infeasibilities [64.31536828511021]
Inspired by sum-of-infeasibilities methods in convex optimization, we propose a novel procedure for analyzing verification queries on networks with extensive branching functions.
An extension to a canonical case-analysis-based complete search procedure can be achieved by replacing the convex procedure executed at each search state with DeepSoI.
arXiv Detail & Related papers (2022-03-19T15:05:09Z) - On Reward-Free RL with Kernel and Neural Function Approximations:
Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem under the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z) - Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criterion.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z) - Bayesian Robust Optimization for Imitation Learning [34.40385583372232]
Inverse reinforcement learning can enable generalization to new states by learning a parameterized reward function.
Existing safe imitation learning approaches based on IRL deal with this uncertainty using a maxmin framework.
BROIL provides a natural way to interpolate between return-maximizing and risk-minimizing behaviors.
arXiv Detail & Related papers (2020-07-24T01:52:11Z) - Reparameterized Variational Divergence Minimization for Stable Imitation [57.06909373038396]
We study the extent to which variations in the choice of probabilistic divergence may yield more performant ILO algorithms.
We contribute a reparameterization trick for adversarial imitation learning to alleviate the challenges of the promising $f$-divergence minimization framework.
Empirically, we demonstrate that our design choices allow for ILO algorithms that outperform baseline approaches and more closely match expert performance in low-dimensional continuous-control tasks.
arXiv Detail & Related papers (2020-06-18T19:04:09Z)