Some Supervision Required: Incorporating Oracle Policies in
Reinforcement Learning via Epistemic Uncertainty Metrics
- URL: http://arxiv.org/abs/2208.10533v3
- Date: Mon, 21 Aug 2023 12:49:03 GMT
- Title: Some Supervision Required: Incorporating Oracle Policies in
Reinforcement Learning via Epistemic Uncertainty Metrics
- Authors: Jun Jet Tai, Jordan K. Terry, Mauro S. Innocente, James Brusey, Nadjim
Horri
- Abstract summary: Critic Confidence Guided Exploration (CCGE) takes in the oracle policy's actions as suggestions and incorporates this information into the learning scheme.
We show that CCGE is able to perform competitively against adjacent algorithms that also leverage an oracle policy.
- Score: 2.56865487804497
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: An inherent problem in reinforcement learning is exploring an
environment through random actions, many of which can be unproductive.
Exploration can instead be improved by initializing the learning
policy with an existing (previously learned or hard-coded) oracle policy,
offline data, or demonstrations. In the case of using an oracle policy, it can
be unclear how best to incorporate the oracle policy's experience into the
learning policy in a way that maximizes learning sample efficiency. In this
paper, we propose a method termed Critic Confidence Guided Exploration (CCGE)
for incorporating such an oracle policy into standard actor-critic
reinforcement learning algorithms. More specifically, CCGE takes in the oracle
policy's actions as suggestions and incorporates this information into the
learning scheme when uncertainty is high, while ignoring it when the
uncertainty is low. CCGE is agnostic to methods of estimating uncertainty, and
we show that it is equally effective with two different techniques.
Empirically, we evaluate the effect of CCGE on various benchmark reinforcement
learning tasks, and show that this idea can lead to improved sample efficiency
and final performance. Furthermore, when evaluated on sparse reward
environments, CCGE is able to perform competitively against adjacent algorithms
that also leverage an oracle policy. Our experiments show that it is possible
to utilize uncertainty as a heuristic to guide exploration using an oracle in
reinforcement learning. We expect this will inspire further research in this
direction, in which various heuristics determine how guidance is provided to
the learning policy.
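To make the decision rule described in the abstract concrete, the following is a minimal Python sketch of CCGE-style action selection, assuming a small critic ensemble whose disagreement stands in for epistemic uncertainty and a fixed threshold. The class and hyperparameter names (CCGESelector, tau) are illustrative assumptions, not taken from the paper, which is agnostic to how uncertainty is estimated.

```python
import numpy as np

# Minimal sketch of CCGE-style action selection (not the authors' code).
# Assumption: disagreement (std) across a small critic ensemble is used as
# the epistemic uncertainty signal; the paper notes CCGE works with other
# uncertainty estimators as well.

class CCGESelector:
    def __init__(self, critics, learner_policy, oracle_policy, tau=0.5):
        self.critics = critics                # list of callables: (state, action) -> Q estimate
        self.learner_policy = learner_policy  # callable: state -> action
        self.oracle_policy = oracle_policy    # callable: state -> action
        self.tau = tau                        # uncertainty threshold (hypothetical hyperparameter)

    def epistemic_uncertainty(self, state, action):
        # Disagreement across the critic ensemble approximates epistemic uncertainty.
        q_values = np.array([q(state, action) for q in self.critics])
        return q_values.std()

    def act(self, state):
        learner_action = self.learner_policy(state)
        if self.epistemic_uncertainty(state, learner_action) > self.tau:
            # High uncertainty: treat the oracle's action as the exploration suggestion.
            return self.oracle_policy(state), True   # True -> oracle was followed
        # Low uncertainty: ignore the oracle and act with the learning policy.
        return learner_action, False


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for the critics and policies, for illustration only.
    critics = [lambda s, a, w=w: float(w @ np.concatenate([s, a]))
               for w in rng.normal(size=(3, 4))]
    learner = lambda s: rng.normal(size=2)
    oracle = lambda s: np.zeros(2)
    selector = CCGESelector(critics, learner, oracle, tau=0.5)
    action, used_oracle = selector.act(np.ones(2))
    print(action, used_oracle)
```

In a full actor-critic implementation, the abstract indicates the oracle's suggestion is also incorporated into the learning scheme when uncertainty is high (for example, as a supervision signal for the actor update) rather than only replacing the executed action; the sketch above covers only the exploration side of that scheme.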
Related papers
- No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery [53.08822154199948]
Unsupervised Environment Design (UED) methods have gained recent attention as their adaptive curricula promise to enable agents to be robust to in- and out-of-distribution tasks.
This work investigates how existing UED methods select training environments, focusing on task prioritisation metrics.
We develop a method that directly trains on scenarios with high learnability.
arXiv Detail & Related papers (2024-08-27T14:31:54Z)
- Blending Imitation and Reinforcement Learning for Robust Policy Improvement [16.588397203235296]
Imitation learning (IL) utilizes oracles to improve sample efficiency.
RPI (Robust Policy Improvement) draws on the strengths of IL, using oracle queries to facilitate exploration.
RPI is capable of learning from and improving upon a diverse set of black-box oracles.
arXiv Detail & Related papers (2023-10-03T01:55:54Z)
- Assessor-Guided Learning for Continual Environments [17.181933166255448]
This paper proposes an assessor-guided learning strategy for continual learning.
An assessor guides the learning process of a base learner by controlling the direction and pace of the learning process.
The assessor is trained in a meta-learning manner with a meta-objective to boost the learning process of the base learner.
arXiv Detail & Related papers (2023-03-21T06:45:14Z)
- Inapplicable Actions Learning for Knowledge Transfer in Reinforcement Learning [3.194414753332705]
We show that learning inapplicable actions greatly improves the sample efficiency of RL algorithms.
Thanks to the transferability of the knowledge acquired, it can be reused in other tasks and domains to make the learning process more efficient.
arXiv Detail & Related papers (2022-11-28T17:45:39Z)
- Curriculum Learning for Safe Mapless Navigation [71.55718344087657]
This work investigates the effects of Curriculum Learning (CL)-based approaches on the agent's performance.
In particular, we focus on the safety aspect of robotic mapless navigation, comparing against a standard end-to-end (E2E) training strategy.
arXiv Detail & Related papers (2021-12-23T12:30:36Z)
- Off-Policy Imitation Learning from Observations [78.30794935265425]
Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit.
We propose a sample-efficient LfO approach that enables off-policy optimization in a principled manner.
Our approach is comparable with state-of-the-art methods on locomotion tasks in terms of both sample efficiency and performance.
arXiv Detail & Related papers (2021-02-25T21:33:47Z)
- Closing the Closed-Loop Distribution Shift in Safe Imitation Learning [80.05727171757454]
We treat safe optimization-based control strategies as experts in an imitation learning problem.
We train a learned policy that can be cheaply evaluated at run-time and that provably satisfies the same safety guarantees as the expert.
arXiv Detail & Related papers (2021-02-18T05:11:41Z)
- Policy Improvement via Imitation of Multiple Oracles [38.84810247415195]
Imitation learning (IL) uses an oracle policy during training as a bootstrap to accelerate the learning process.
We introduce a novel IL algorithm, MAMBA, which can provably learn a policy competitive with a benchmark that combines the multiple oracles.
arXiv Detail & Related papers (2020-07-01T22:33:28Z)
- META-Learning Eligibility Traces for More Sample Efficient Temporal Difference Learning [2.0559497209595823]
We propose a meta-learning method for adjusting the eligibility trace parameter, in a state-dependent manner.
The adaptation is achieved with the help of auxiliary learners that learn distributional information about the update targets online.
We prove that, under some assumptions, the proposed method improves the overall quality of the update targets, by minimizing the overall target error.
arXiv Detail & Related papers (2020-06-16T03:41:07Z)
- Zeroth-Order Supervised Policy Improvement [94.0748002906652]
Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL).
We propose Zeroth-Order Supervised Policy Improvement (ZOSPI).
ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of the PG methods.
arXiv Detail & Related papers (2020-06-11T16:49:23Z)
- Reward-Conditioned Policies [100.64167842905069]
Imitation learning requires near-optimal expert data.
Can we learn effective policies via supervised learning without demonstrations?
We show how such an approach can be derived as a principled method for policy search.
arXiv Detail & Related papers (2019-12-31T18:07:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.