Active Policy Improvement from Multiple Black-box Oracles
- URL: http://arxiv.org/abs/2306.10259v2
- Date: Wed, 5 Jul 2023 22:23:11 GMT
- Title: Active Policy Improvement from Multiple Black-box Oracles
- Authors: Xuefeng Liu, Takuma Yoneda, Chaoqi Wang, Matthew R. Walter, Yuxin Chen
- Abstract summary: We introduce MAPS and MAPS-SE, a class of policy improvement algorithms that perform imitation learning from multiple suboptimal oracles.
In particular, MAPS actively selects which of the oracles to imitate and improves their value function estimates.
We show that MAPS-SE significantly accelerates policy optimization via state-wise imitation learning from multiple oracles.
- Score: 24.320182712799955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning (RL) has made significant strides in various complex
domains. However, identifying an effective policy via RL often necessitates
extensive exploration. Imitation learning aims to mitigate this issue by using
expert demonstrations to guide exploration. In real-world scenarios, one often
has access to multiple suboptimal black-box experts, rather than a single
optimal oracle. These experts do not universally outperform each other across
all states, presenting a challenge in actively deciding which oracle to use and
in which state. We introduce MAPS and MAPS-SE, a class of policy improvement
algorithms that perform imitation learning from multiple suboptimal oracles. In
particular, MAPS actively selects which of the oracles to imitate and improves
their value function estimates, and MAPS-SE additionally leverages an active
state exploration criterion to determine which states one should explore. We
provide a comprehensive theoretical analysis and demonstrate that MAPS and
MAPS-SE enjoy a sample-efficiency advantage over state-of-the-art policy
improvement algorithms. Empirical results show that MAPS-SE significantly
accelerates policy optimization via state-wise imitation learning from multiple
oracles across a broad spectrum of control tasks in the DeepMind Control Suite.
Our code is publicly available at: https://github.com/ripl/maps.
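The abstract describes the core loop of MAPS only at a high level: estimate each oracle's value and actively pick which oracle to imitate in the current state. Below is a minimal sketch of one plausible selection rule; the class name, the UCB-style bonus, and the interface are illustrative assumptions rather than the paper's exact algorithm (see the linked repository for the authors' implementation).

```python
import numpy as np

# Sketch of state-wise active oracle selection in the spirit of MAPS, assuming
# an upper-confidence rule over per-oracle value estimates. Names and the bonus
# term are illustrative; the real implementation is at https://github.com/ripl/maps.
class ActiveOracleSelector:
    def __init__(self, num_oracles, bonus_scale=1.0):
        self.num_oracles = num_oracles
        self.bonus_scale = bonus_scale

    def select(self, value_estimates, visit_counts):
        """Pick which oracle to imitate from the current state.

        value_estimates: per-oracle estimates of V^{pi_k}(s)
        visit_counts:    how often each oracle has been rolled out from similar states
        """
        values = np.asarray(value_estimates, dtype=float)
        counts = np.asarray(visit_counts, dtype=float)
        # Optimism in the face of uncertainty: favor oracles that look valuable
        # or whose value estimate is still poorly resolved.
        bonus = self.bonus_scale / np.sqrt(np.maximum(counts, 1.0))
        return int(np.argmax(values + bonus))


selector = ActiveOracleSelector(num_oracles=3)
k = selector.select(value_estimates=[0.8, 1.1, 0.9], visit_counts=[12, 3, 7])
# Roll out oracle k from the current state and use the observed return to
# refine the estimate of V^{pi_k}.
```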
Related papers
- EVOLvE: Evaluating and Optimizing LLMs For Exploration [76.66831821738927]
Large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty.
We measure LLMs' (in)ability to make optimal decisions in bandits, a state-less reinforcement learning setting relevant to many applications.
Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs.
arXiv Detail & Related papers (2024-10-08T17:54:03Z) - MESA: Cooperative Meta-Exploration in Multi-Agent Learning through Exploiting State-Action Space Structure [37.56309011441144]
This paper introduces MESA, a novel meta-exploration method for cooperative multi-agent learning.
It learns to explore by first identifying the agents' high-rewarding joint state-action subspace from training tasks and then learning a set of diverse exploration policies to "cover" the subspace.
Experiments show that with learned exploration policies, MESA achieves significantly better performance in sparse-reward tasks in several multi-agent particle environments and multi-agent MuJoCo environments.
arXiv Detail & Related papers (2024-05-01T23:19:48Z) - Efficient Reinforcement Learning via Decoupling Exploration and Utilization [6.305976803910899]
Reinforcement Learning (RL) has achieved remarkable success across multiple fields and applications, including gaming, robotics, and autonomous vehicles.
In this work, the aim is to train agents efficiently by decoupling exploration and utilization, so that the agent can escape the trap of suboptimal solutions.
The above idea is implemented in the proposed OPARL (Optimistic and Pessimistic Actor Reinforcement Learning) algorithm.
arXiv Detail & Related papers (2023-12-26T09:03:23Z) - Blending Imitation and Reinforcement Learning for Robust Policy Improvement [16.588397203235296]
Imitation learning (IL) utilizes oracles to improve sample efficiency.
The proposed RPI (Robust Policy Improvement) algorithm draws on the strengths of IL, using oracle queries to facilitate exploration.
RPI is capable of learning from and improving upon a diverse set of black-box oracles.
arXiv Detail & Related papers (2023-10-03T01:55:54Z) - Near-optimal Policy Identification in Active Reinforcement Learning [84.27592560211909]
AE-LSVI is a novel variant of the kernelized least-squares value iteration (LSVI) algorithm that combines optimism with pessimism for active exploration.
We show that AE-LSVI outperforms other algorithms in a variety of environments when robustness to the initial state is required.
arXiv Detail & Related papers (2022-12-19T14:46:57Z) - Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z) - The Surprising Effectiveness of MAPPO in Cooperative, Multi-Agent Games [67.47961797770249]
Multi-Agent PPO (MAPPO) is a multi-agent PPO variant which adopts a centralized value function.
We show that MAPPO achieves performance comparable to the state-of-the-art in three popular multi-agent testbeds.
arXiv Detail & Related papers (2021-03-02T18:59:56Z) - Learning Dexterous Manipulation from Suboptimal Experts [69.8017067648129]
Relative Entropy Q-Learning (REQ) is a simple policy algorithm that combines ideas from successful offline and conventional RL algorithms.
We show how REQ is also effective for general off-policy RL, offline RL, and RL from demonstrations.
arXiv Detail & Related papers (2020-10-16T18:48:49Z) - Policy Improvement via Imitation of Multiple Oracles [38.84810247415195]
Imitation learning (IL) uses an oracle policy during training as a bootstrap to accelerate the learning process.
We take the state-wise maximum of the oracle policies' values as a benchmark (sketched after this list) and introduce a novel IL algorithm, MAMBA, which can provably learn a policy competitive with it.
arXiv Detail & Related papers (2020-07-01T22:33:28Z) - Zeroth-Order Supervised Policy Improvement [94.0748002906652]
Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL).
We propose Zeroth-Order Supervised Policy Improvement (ZOSPI).
ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of the PG methods.
arXiv Detail & Related papers (2020-06-11T16:49:23Z)
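The benchmark mentioned in the MAMBA entry above is the state-wise maximum over the oracles' values, i.e. a max-following policy that, in each state, acts like the oracle currently estimated to be best there. A minimal sketch under that reading, with illustrative names and callables:

```python
import numpy as np

def max_following_action(state, oracle_policies, oracle_value_fns):
    """Illustrative max-following benchmark: in each state, act according to
    the oracle whose estimated value V^{pi_k}(s) is largest.

    oracle_policies:  list of callables state -> action
    oracle_value_fns: parallel list of callables state -> estimated value
    """
    values = [v(state) for v in oracle_value_fns]
    best = int(np.argmax(values))
    return oracle_policies[best](state)
```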
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.