Policy Improvement via Imitation of Multiple Oracles
- URL: http://arxiv.org/abs/2007.00795v2
- Date: Sun, 6 Dec 2020 04:02:56 GMT
- Title: Policy Improvement via Imitation of Multiple Oracles
- Authors: Ching-An Cheng, Andrey Kolobov, Alekh Agarwal
- Abstract summary: Imitation learning (IL) uses an oracle policy during training as a bootstrap to accelerate the learning process.
We introduce a novel IL algorithm, MAMBA, which provably learns a policy competitive with the state-wise maximum of the oracles' values.
- Score: 38.84810247415195
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite its promise, reinforcement learning's real-world adoption has been
hampered by the need for costly exploration to learn a good policy. Imitation
learning (IL) mitigates this shortcoming by using an oracle policy during
training as a bootstrap to accelerate the learning process. However, in many
practical situations, the learner has access to multiple suboptimal oracles,
which may provide conflicting advice in a state. The existing IL literature
provides a limited treatment of such scenarios. Whereas in the single-oracle
case, the return of the oracle's policy provides an obvious benchmark for the
learner to compete against, neither such a benchmark nor principled ways of
outperforming it are known for the multi-oracle setting. In this paper, we
propose the state-wise maximum of the oracle policies' values as a natural
baseline to resolve conflicting advice from multiple oracles. Using a reduction
of policy optimization to online learning, we introduce a novel IL algorithm
MAMBA, which can provably learn a policy competitive with this benchmark. In
particular, MAMBA optimizes policies by using a gradient estimator in the style
of generalized advantage estimation (GAE). Our theoretical analysis shows that
this design makes MAMBA robust and enables it to outperform the oracle policies
by a larger margin than the IL state of the art, even in the single-oracle
case. In an evaluation against standard policy gradient with GAE and
AggreVaTe(D), we showcase MAMBA's ability to leverage demonstrations both from
a single and from multiple weak oracles, and significantly speed up policy
optimization.
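The two ingredients named in the abstract, the state-wise maximum over the oracles' value functions and a GAE-style advantage estimator computed against it, can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation; the rollout representation, the function names, and the gamma/lam settings are assumptions.
```python
import numpy as np

def max_oracle_baseline(state, oracle_value_fns):
    # State-wise maximum over the oracles' value estimates: the benchmark the
    # abstract proposes for resolving conflicting advice from multiple oracles.
    return max(v(state) for v in oracle_value_fns)

def gae_style_advantages(rewards, states, oracle_value_fns, gamma=0.99, lam=0.9):
    # GAE-style advantages for one learner rollout, computed against the
    # max-over-oracles baseline instead of a learned critic.
    # `states` is assumed to hold one more entry than `rewards` (the final state).
    baseline = np.array([max_oracle_baseline(s, oracle_value_fns) for s in states])
    deltas = np.asarray(rewards) + gamma * baseline[1:] - baseline[:-1]
    advantages = np.zeros(len(rewards))
    acc = 0.0
    for t in reversed(range(len(rewards))):
        # Exponentially weighted sum of one-step residuals, as in GAE.
        acc = deltas[t] + gamma * lam * acc
        advantages[t] = acc
    return advantages
```
Roughly, lam = 0 recovers a one-step, AggreVaTe(D)-like advantage against the baseline, while lam closer to 1 leans more on the learner's own rollout returns; the abstract credits this interpolation for MAMBA's robustness.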
Related papers
- Blending Imitation and Reinforcement Learning for Robust Policy Improvement [16.588397203235296]
Imitation learning (IL) utilizes oracles to improve sample efficiency.
RPI (Robust Policy Improvement) draws on the strengths of IL, using oracle queries to facilitate exploration.
RPI is capable of learning from and improving upon a diverse set of black-box oracles.
arXiv Detail & Related papers (2023-10-03T01:55:54Z)
- Active Policy Improvement from Multiple Black-box Oracles [24.320182712799955]
We introduce MAPS and MAPS-SE, a class of policy improvement algorithms that perform imitation learning from multiple suboptimal oracles.
In particular, MAPS actively selects which of the oracles to imitate and improves their value function estimates.
We show that MAPS-SE significantly accelerates policy optimization via state-wise imitation learning from multiple oracles.
arXiv Detail & Related papers (2023-06-17T05:03:43Z)
- Oracle-Efficient Pessimism: Offline Policy Optimization in Contextual Bandits [82.28442917447643]
We present the first general oracle-efficient algorithm for pessimistic OPO.
We obtain statistical guarantees analogous to those for prior pessimistic approaches.
We show an advantage over unregularized OPO across a wide range of configurations.
arXiv Detail & Related papers (2023-06-13T17:29:50Z)
- DoMo-AC: Doubly Multi-step Off-policy Actor-Critic Algorithm [48.60180355291149]
We introduce doubly multi-step off-policy VI (DoMo-VI), a novel oracle algorithm that combines multi-step policy improvements and policy evaluations.
We then propose doubly multi-step off-policy actor-critic (DoMo-AC), a practical instantiation of the DoMo-VI algorithm.
arXiv Detail & Related papers (2023-05-29T14:36:51Z)
- Some Supervision Required: Incorporating Oracle Policies in Reinforcement Learning via Epistemic Uncertainty Metrics [2.56865487804497]
Critic Confidence Guided Exploration (CCGE) takes in the oracle policy's actions as suggestions and incorporates this information into the learning scheme.
We show that CCGE is able to perform competitively against adjacent algorithms that also leverage an oracle policy.
arXiv Detail & Related papers (2022-08-22T18:26:43Z)
- ReVar: Strengthening Policy Evaluation via Reduced Variance Sampling [10.925914554822343]
We develop theory for optimal data collection within the class of tree-structured MDPs.
We empirically validate that ReVar leads to policy evaluation with mean squared error comparable to the oracle strategy.
arXiv Detail & Related papers (2022-03-09T03:41:15Z)
- Optimization Issues in KL-Constrained Approximate Policy Iteration [48.24321346619156]
Many reinforcement learning algorithms can be seen as versions of approximate policy iteration (API).
While standard API often performs poorly, learning can be stabilized by regularizing each policy update by the KL-divergence to the previous policy.
Popular practical algorithms such as TRPO, MPO, and VMPO replace this regularization with a constraint on the KL-divergence between consecutive policies (a sketch of the regularized update appears after this list).
arXiv Detail & Related papers (2021-02-11T19:35:33Z)
- Provably Good Batch Reinforcement Learning Without Great Exploration [51.51462608429621]
Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks.
Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes.
We show that a small modification of the Bellman optimality and evaluation back-ups, taking a more conservative update, yields much stronger guarantees.
arXiv Detail & Related papers (2020-07-16T09:25:54Z)
- BRPO: Batch Residual Policy Optimization [79.53696635382592]
In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy.
We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent.
We derive a new RL method, BRPO, which learns both the policy and the allowable deviation that jointly maximize a lower bound on policy performance.
arXiv Detail & Related papers (2020-02-08T01:59:33Z)
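Regarding the KL-Constrained Approximate Policy Iteration entry above: for a tabular policy over discrete actions, the KL-regularized update it describes has a simple closed form, sketched below. This is an illustrative sketch under those assumptions, not how TRPO, MPO, or VMPO implement it; the `temperature` coefficient name and the array layout are hypothetical.
```python
import numpy as np

def kl_regularized_update(prev_policy, advantages, temperature=1.0):
    # One KL-regularized policy-iteration step for a tabular, discrete-action policy.
    # Maximizing E_pi[A(s, a)] - temperature * KL(pi || prev_policy) per state gives
    # pi(a|s) proportional to prev_policy(a|s) * exp(A(s, a) / temperature).
    # prev_policy: (num_states, num_actions), rows sum to 1, entries assumed > 0.
    # advantages:  (num_states, num_actions) advantage estimates A(s, a).
    logits = np.log(prev_policy) + advantages / temperature
    logits -= logits.max(axis=1, keepdims=True)  # per-row shift for numerical stability
    new_policy = np.exp(logits)
    new_policy /= new_policy.sum(axis=1, keepdims=True)
    return new_policy
```
Shrinking `temperature` trusts the advantage estimates more and moves the policy further from the previous one; the constrained variants instead choose the step size by enforcing a KL budget.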