Policy Improvement via Imitation of Multiple Oracles
- URL: http://arxiv.org/abs/2007.00795v2
- Date: Sun, 6 Dec 2020 04:02:56 GMT
- Title: Policy Improvement via Imitation of Multiple Oracles
- Authors: Ching-An Cheng, Andrey Kolobov, Alekh Agarwal
- Abstract summary: Imitation learning (IL) uses an oracle policy during training as a bootstrap to accelerate the learning process.
We introduce a novel IL algorithm, MAMBA, which provably learns a policy competitive with the state-wise maximum of the oracles' values.
- Score: 38.84810247415195
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite its promise, reinforcement learning's real-world adoption has been
hampered by the need for costly exploration to learn a good policy. Imitation
learning (IL) mitigates this shortcoming by using an oracle policy during
training as a bootstrap to accelerate the learning process. However, in many
practical situations, the learner has access to multiple suboptimal oracles,
which may provide conflicting advice in a state. The existing IL literature
provides a limited treatment of such scenarios. Whereas in the single-oracle
case, the return of the oracle's policy provides an obvious benchmark for the
learner to compete against, neither such a benchmark nor principled ways of
outperforming it are known for the multi-oracle setting. In this paper, we
propose the state-wise maximum of the oracle policies' values as a natural
baseline to resolve conflicting advice from multiple oracles. Using a reduction
of policy optimization to online learning, we introduce a novel IL algorithm
MAMBA, which can provably learn a policy competitive with this benchmark. In
particular, MAMBA optimizes policies by using a gradient estimator in the style
of generalized advantage estimation (GAE). Our theoretical analysis shows that
this design makes MAMBA robust and enables it to outperform the oracle policies
by a larger margin than the IL state of the art, even in the single-oracle
case. In an evaluation against standard policy gradient with GAE and
AggreVaTe(D), we showcase MAMBA's ability to leverage demonstrations both from
a single and from multiple weak oracles, and significantly speed up policy
optimization.
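The two ingredients named in the abstract, the state-wise maximum over the oracles' value functions and a GAE-style advantage estimator computed against it, can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation; the rollout representation, the function names, and the gamma/lam settings are assumptions.
```python
import numpy as np

def max_oracle_baseline(state, oracle_value_fns):
    # State-wise maximum over the oracles' value estimates: the benchmark the
    # abstract proposes for resolving conflicting advice from multiple oracles.
    return max(v(state) for v in oracle_value_fns)

def gae_style_advantages(rewards, states, oracle_value_fns, gamma=0.99, lam=0.9):
    # GAE-style advantages for one learner rollout, computed against the
    # max-over-oracles baseline instead of a learned critic.
    # `states` is assumed to hold one more entry than `rewards` (the final state).
    baseline = np.array([max_oracle_baseline(s, oracle_value_fns) for s in states])
    deltas = np.asarray(rewards) + gamma * baseline[1:] - baseline[:-1]
    advantages = np.zeros(len(rewards))
    acc = 0.0
    for t in reversed(range(len(rewards))):
        # Exponentially weighted sum of one-step residuals, as in GAE.
        acc = deltas[t] + gamma * lam * acc
        advantages[t] = acc
    return advantages
```
Roughly, lam = 0 recovers a one-step, AggreVaTe(D)-like advantage against the baseline, while lam closer to 1 leans more on the learner's own rollout returns; the abstract credits this interpolation for MAMBA's robustness.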
Related papers
- Blending Imitation and Reinforcement Learning for Robust Policy Improvement [16.588397203235296]
Imitation learning (IL) utilizes oracles to improve sample efficiency.
RPI (Robust Policy Improvement) draws on the strengths of IL, using oracle queries to facilitate exploration.
RPI is capable of learning from and improving upon a diverse set of black-box oracles.
arXiv Detail & Related papers (2023-10-03T01:55:54Z)
- Active Policy Improvement from Multiple Black-box Oracles [24.320182712799955]
We introduce MAPS and MAPS-SE, a class of policy improvement algorithms that perform imitation learning from multiple suboptimal oracles.
In particular, MAPS actively selects which of the oracles to imitate and improves their value function estimates.
We show that MAPS-SE significantly accelerates policy optimization via state-wise imitation learning from multiple oracles.
arXiv Detail & Related papers (2023-06-17T05:03:43Z)
- Oracle-Efficient Pessimism: Offline Policy Optimization in Contextual Bandits [82.28442917447643]
We present the first general oracle-efficient algorithm for pessimistic OPO.
We obtain statistical guarantees analogous to those for prior pessimistic approaches.
We show an advantage over unregularized OPO across a wide range of configurations.
arXiv Detail & Related papers (2023-06-13T17:29:50Z)
- DoMo-AC: Doubly Multi-step Off-policy Actor-Critic Algorithm [48.60180355291149]
We introduce doubly multi-step off-policy VI (DoMo-VI), a novel oracle algorithm that combines multi-step policy improvements and policy evaluations.
We then propose doubly multi-step off-policy actor-critic (DoMo-AC), a practical instantiation of the DoMo-VI algorithm.
arXiv Detail & Related papers (2023-05-29T14:36:51Z)
- Some Supervision Required: Incorporating Oracle Policies in Reinforcement Learning via Epistemic Uncertainty Metrics [2.56865487804497]
Critic Confidence Guided Exploration (CCGE) takes in the oracle policy's actions as suggestions and incorporates this information into the learning scheme.
We show that CCGE is able to perform competitively against adjacent algorithms that also leverage an oracle policy.
arXiv Detail & Related papers (2022-08-22T18:26:43Z)
- ReVar: Strengthening Policy Evaluation via Reduced Variance Sampling [10.925914554822343]
We develop theory for optimal data collection within the class of tree-structured MDPs.
We empirically validate that ReVar leads to policy evaluation with mean squared error comparable to the oracle strategy.
arXiv Detail & Related papers (2022-03-09T03:41:15Z)
- Optimization Issues in KL-Constrained Approximate Policy Iteration [48.24321346619156]
Many reinforcement learning algorithms can be seen as versions of approximate policy iteration (API).
While standard API often performs poorly, learning can be stabilized by regularizing each policy update by the KL-divergence to the previous policy.
Popular practical algorithms such as TRPO, MPO, and VMPO replace this regularization with a constraint on the KL-divergence between consecutive policies (a sketch of the regularized update appears after this list).
arXiv Detail & Related papers (2021-02-11T19:35:33Z)
- Provably Good Batch Reinforcement Learning Without Great Exploration [51.51462608429621]
Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks.
Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes.
We show that a small modification of the Bellman optimality and evaluation back-ups, taking a more conservative update, yields much stronger guarantees.
arXiv Detail & Related papers (2020-07-16T09:25:54Z)
- BRPO: Batch Residual Policy Optimization [79.53696635382592]
In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy.
We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent.
We derive a new RL method, BRPO, which learns both the policy and the allowable deviation that jointly maximize a lower bound on policy performance.
arXiv Detail & Related papers (2020-02-08T01:59:33Z)
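Regarding the KL-Constrained Approximate Policy Iteration entry above: for a tabular policy over discrete actions, the KL-regularized update it describes has a simple closed form, sketched below. This is an illustrative sketch under those assumptions, not how TRPO, MPO, or VMPO implement it; the `temperature` coefficient name and the array layout are hypothetical.
```python
import numpy as np

def kl_regularized_update(prev_policy, advantages, temperature=1.0):
    # One KL-regularized policy-iteration step for a tabular, discrete-action policy.
    # Maximizing E_pi[A(s, a)] - temperature * KL(pi || prev_policy) per state gives
    # pi(a|s) proportional to prev_policy(a|s) * exp(A(s, a) / temperature).
    # prev_policy: (num_states, num_actions), rows sum to 1, entries assumed > 0.
    # advantages:  (num_states, num_actions) advantage estimates A(s, a).
    logits = np.log(prev_policy) + advantages / temperature
    logits -= logits.max(axis=1, keepdims=True)  # per-row shift for numerical stability
    new_policy = np.exp(logits)
    new_policy /= new_policy.sum(axis=1, keepdims=True)
    return new_policy
```
Shrinking `temperature` trusts the advantage estimates more and moves the policy further from the previous one; the constrained variants instead choose the step size by enforcing a KL budget.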