Some Supervision Required: Incorporating Oracle Policies in
Reinforcement Learning via Epistemic Uncertainty Metrics
- URL: http://arxiv.org/abs/2208.10533v3
- Date: Mon, 21 Aug 2023 12:49:03 GMT
- Title: Some Supervision Required: Incorporating Oracle Policies in
Reinforcement Learning via Epistemic Uncertainty Metrics
- Authors: Jun Jet Tai, Jordan K. Terry, Mauro S. Innocente, James Brusey, Nadjim
Horri
- Abstract summary: Critic Confidence Guided Exploration (CCGE) takes in the oracle policy's actions as suggestions and incorporates this information into the learning scheme.
We show that CCGE is able to perform competitively against adjacent algorithms that also leverage an oracle policy.
- Score: 2.56865487804497
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: An inherent problem in reinforcement learning is exploring an
environment through random actions, many of which can be unproductive.
Exploration can instead be improved by initializing the learning
policy with an existing (previously learned or hard-coded) oracle policy,
offline data, or demonstrations. In the case of using an oracle policy, it can
be unclear how best to incorporate the oracle policy's experience into the
learning policy in a way that maximizes learning sample efficiency. In this
paper, we propose a method termed Critic Confidence Guided Exploration (CCGE)
for incorporating such an oracle policy into standard actor-critic
reinforcement learning algorithms. More specifically, CCGE takes in the oracle
policy's actions as suggestions and incorporates this information into the
learning scheme when uncertainty is high, while ignoring it when the
uncertainty is low. CCGE is agnostic to methods of estimating uncertainty, and
we show that it is equally effective with two different techniques.
Empirically, we evaluate the effect of CCGE on various benchmark reinforcement
learning tasks, and show that this idea can lead to improved sample efficiency
and final performance. Furthermore, when evaluated on sparse reward
environments, CCGE is able to perform competitively against adjacent algorithms
that also leverage an oracle policy. Our experiments show that it is possible
to utilize uncertainty as a heuristic to guide exploration using an oracle in
reinforcement learning. We expect this will inspire further research in this
direction, in which various heuristics determine how guidance is provided to
the learning policy.
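To make the decision rule described in the abstract concrete, the following is a minimal Python sketch of CCGE-style action selection, assuming a small critic ensemble whose disagreement stands in for epistemic uncertainty and a fixed threshold. The class and hyperparameter names (CCGESelector, tau) are illustrative assumptions, not taken from the paper, which is agnostic to how uncertainty is estimated.

```python
import numpy as np

# Minimal sketch of CCGE-style action selection (not the authors' code).
# Assumption: disagreement (std) across a small critic ensemble is used as
# the epistemic uncertainty signal; the paper notes CCGE works with other
# uncertainty estimators as well.

class CCGESelector:
    def __init__(self, critics, learner_policy, oracle_policy, tau=0.5):
        self.critics = critics                # list of callables: (state, action) -> Q estimate
        self.learner_policy = learner_policy  # callable: state -> action
        self.oracle_policy = oracle_policy    # callable: state -> action
        self.tau = tau                        # uncertainty threshold (hypothetical hyperparameter)

    def epistemic_uncertainty(self, state, action):
        # Disagreement across the critic ensemble approximates epistemic uncertainty.
        q_values = np.array([q(state, action) for q in self.critics])
        return q_values.std()

    def act(self, state):
        learner_action = self.learner_policy(state)
        if self.epistemic_uncertainty(state, learner_action) > self.tau:
            # High uncertainty: treat the oracle's action as the exploration suggestion.
            return self.oracle_policy(state), True   # True -> oracle was followed
        # Low uncertainty: ignore the oracle and act with the learning policy.
        return learner_action, False


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for the critics and policies, for illustration only.
    critics = [lambda s, a, w=w: float(w @ np.concatenate([s, a]))
               for w in rng.normal(size=(3, 4))]
    learner = lambda s: rng.normal(size=2)
    oracle = lambda s: np.zeros(2)
    selector = CCGESelector(critics, learner, oracle, tau=0.5)
    action, used_oracle = selector.act(np.ones(2))
    print(action, used_oracle)
```

In a full actor-critic implementation, the abstract indicates the oracle's suggestion is also incorporated into the learning scheme when uncertainty is high (for example, as a supervision signal for the actor update) rather than only replacing the executed action; the sketch above covers only the exploration side of that scheme.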
Related papers
- No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery [53.08822154199948]
Unsupervised Environment Design (UED) methods have gained recent attention as their adaptive curricula promise to enable agents to be robust to in- and out-of-distribution tasks.
This work investigates how existing UED methods select training environments, focusing on task prioritisation metrics.
We develop a method that directly trains on scenarios with high learnability.
arXiv Detail & Related papers (2024-08-27T14:31:54Z)
- Blending Imitation and Reinforcement Learning for Robust Policy Improvement [16.588397203235296]
Imitation learning (IL) utilizes oracles to improve sample efficiency.
RPI (Robust Policy Improvement) draws on the strengths of IL, using oracle queries to facilitate exploration.
RPI is capable of learning from and improving upon a diverse set of black-box oracles.
arXiv Detail & Related papers (2023-10-03T01:55:54Z)
- Assessor-Guided Learning for Continual Environments [17.181933166255448]
This paper proposes an assessor-guided learning strategy for continual learning.
An assessor guides the learning process of a base learner by controlling the direction and pace of the learning process.
The assessor is trained in a meta-learning manner with a meta-objective to boost the learning process of the base learner.
arXiv Detail & Related papers (2023-03-21T06:45:14Z)
- Inapplicable Actions Learning for Knowledge Transfer in Reinforcement Learning [3.194414753332705]
We show that learning inapplicable actions greatly improves the sample efficiency of RL algorithms.
Thanks to the transferability of the knowledge acquired, it can be reused in other tasks and domains to make the learning process more efficient.
arXiv Detail & Related papers (2022-11-28T17:45:39Z)
- Curriculum Learning for Safe Mapless Navigation [71.55718344087657]
This work investigates the effects of Curriculum Learning (CL)-based approaches on the agent's performance.
In particular, we focus on the safety aspect of robotic mapless navigation, comparing against a standard end-to-end (E2E) training strategy.
arXiv Detail & Related papers (2021-12-23T12:30:36Z)
- Off-Policy Imitation Learning from Observations [78.30794935265425]
Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit.
We propose a sample-efficient LfO approach that enables off-policy optimization in a principled manner.
Our approach is comparable with state-of-the-art methods on locomotion tasks in terms of both sample efficiency and performance.
arXiv Detail & Related papers (2021-02-25T21:33:47Z)
- Closing the Closed-Loop Distribution Shift in Safe Imitation Learning [80.05727171757454]
We treat safe optimization-based control strategies as experts in an imitation learning problem.
We train a learned policy that can be cheaply evaluated at run-time and that provably satisfies the same safety guarantees as the expert.
arXiv Detail & Related papers (2021-02-18T05:11:41Z)
- Policy Improvement via Imitation of Multiple Oracles [38.84810247415195]
Imitation learning (IL) uses an oracle policy during training as a bootstrap to accelerate the learning process.
We introduce a novel IL algorithm, MAMBA, which can provably learn a policy competitive with a benchmark that combines the multiple oracles.
arXiv Detail & Related papers (2020-07-01T22:33:28Z)
- META-Learning Eligibility Traces for More Sample Efficient Temporal Difference Learning [2.0559497209595823]
We propose a meta-learning method for adjusting the eligibility trace parameter, in a state-dependent manner.
The adaptation is achieved with the help of auxiliary learners that learn distributional information about the update targets online.
We prove that, under some assumptions, the proposed method improves the overall quality of the update targets, by minimizing the overall target error.
arXiv Detail & Related papers (2020-06-16T03:41:07Z)
- Zeroth-Order Supervised Policy Improvement [94.0748002906652]
Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL).
We propose Zeroth-Order Supervised Policy Improvement (ZOSPI).
ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of the PG methods.
arXiv Detail & Related papers (2020-06-11T16:49:23Z)
- Reward-Conditioned Policies [100.64167842905069]
Imitation learning requires near-optimal expert data.
Can we learn effective policies via supervised learning without demonstrations?
We show how such an approach can be derived as a principled method for policy search.
arXiv Detail & Related papers (2019-12-31T18:07:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.