PG3: Policy-Guided Planning for Generalized Policy Generation
- URL: http://arxiv.org/abs/2204.10420v1
- Date: Thu, 21 Apr 2022 21:59:25 GMT
- Title: PG3: Policy-Guided Planning for Generalized Policy Generation
- Authors: Ryan Yang, Tom Silver, Aidan Curtis, Tomas Lozano-Perez, Leslie Pack Kaelbling
- Abstract summary: We study generalized policy search-based methods with a focus on the score function used to guide the search over policies.
The main idea behind our approach is that a candidate policy should be used to guide planning on training problems as a mechanism for evaluating that candidate.
Empirical results in six domains confirm that PG3 learns generalized policies more efficiently and effectively than several baselines.
- Score: 25.418642034856365
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A longstanding objective in classical planning is to synthesize policies that
generalize across multiple problems from the same domain. In this work, we
study generalized policy search-based methods with a focus on the score
function used to guide the search over policies. We demonstrate limitations of
two score functions and propose a new approach that overcomes these
limitations. The main idea behind our approach, Policy-Guided Planning for
Generalized Policy Generation (PG3), is that a candidate policy should be used
to guide planning on training problems as a mechanism for evaluating that
candidate. Theoretical results in a simplified setting give conditions under
which PG3 is optimal or admissible. We then study a specific instantiation of
policy search where planning problems are PDDL-based and policies are lifted
decision lists. Empirical results in six domains confirm that PG3 learns
generalized policies more efficiently and effectively than several baselines.
Code: https://github.com/ryangpeixu/pg3
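The core idea lends itself to a short sketch: a candidate policy is scored not by executing it directly, but by how well it guides planning on the training problems. Below is a minimal Python illustration of that scoring loop, assuming a greedy best-first search that prefers the policy's recommended action; the function names, callbacks, and failure penalty are illustrative placeholders, not the paper's actual implementation (see the linked code for that).

```python
import heapq
from typing import Callable, Hashable, Iterable, List, Optional, Tuple

State = Hashable
Action = str


def policy_guided_plan(
    init: State,
    is_goal: Callable[[State], bool],
    successors: Callable[[State], Iterable[Tuple[Action, State]]],
    policy: Callable[[State], Action],
    max_expansions: int = 1000,
) -> Optional[List[Action]]:
    """Greedy best-first search that expands the successor reached via the
    policy's recommended action first -- one simple way to let a candidate
    policy 'guide' planning."""
    frontier = [(0, 0, init, [])]  # (priority, tie-breaker, state, partial plan)
    visited = {init}
    pushes = 0
    while frontier and pushes < max_expansions:
        _, _, state, plan = heapq.heappop(frontier)
        if is_goal(state):
            return plan
        preferred = policy(state)
        for action, nxt in successors(state):
            if nxt in visited:
                continue
            visited.add(nxt)
            pushes += 1
            # The policy's preferred action gets higher priority (lower key).
            priority = 0 if action == preferred else 1
            heapq.heappush(frontier, (priority, pushes, nxt, plan + [action]))
    return None  # no plan found within the budget


def score_candidate(policy, training_problems) -> float:
    """Score a candidate policy by how well policy-guided planning solves the
    training problems (a stand-in for the paper's actual score function)."""
    total_cost = 0.0
    for init, is_goal, successors in training_problems:
        plan = policy_guided_plan(init, is_goal, successors, policy)
        total_cost += len(plan) if plan is not None else 1e6  # failure penalty
    return -total_cost  # higher is better; used to rank candidates in the search
```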
Related papers
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this is the first non-asymptotic last-iterate policy convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - Imitating Graph-Based Planning with Goal-Conditioned Policies [72.61631088613048]
We present a self-imitation scheme which distills a subgoal-conditioned policy into the target-goal-conditioned policy.
We empirically show that our method can significantly boost the sample-efficiency of the existing goal-conditioned RL methods.
arXiv Detail & Related papers (2023-03-20T14:51:10Z) - Towards A Unified Policy Abstraction Theory and Representation Learning Approach in Markov Decision Processes [39.94472154078338]
We propose a unified policy abstraction theory, containing three types of policy abstraction associated with policy features at different levels.
We then generalize them to three policy metrics that quantify the distance (i.e., similarity) of policies.
For the empirical study, we investigate the efficacy of the proposed policy metrics and representations, in characterizing policy difference and conveying policy generalization respectively.
arXiv Detail & Related papers (2022-09-16T03:41:50Z) - Sigmoidally Preconditioned Off-policy Learning: a new exploration method for reinforcement learning [14.991913317341417]
We focus on an off-policy Actor-Critic architecture and propose a novel method, called Preconditioned Proximal Policy Optimization (P3O).
P3O can control the high variance of importance sampling by applying a preconditioner to the Conservative Policy Iteration (CPI) objective.
Results show that our P3O maximizes the CPI objective better than PPO during the training process.
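As a rough illustration of the mechanism summarized above, the sketch below applies a sigmoid-shaped preconditioner to the importance weight inside a CPI-style surrogate objective. The exact preconditioner, scaling, and loss used by P3O may differ; the function name and the PyTorch formulation are assumptions made for illustration only.

```python
import torch


def preconditioned_cpi_loss(log_probs_new: torch.Tensor,
                            log_probs_old: torch.Tensor,
                            advantages: torch.Tensor) -> torch.Tensor:
    """CPI-style surrogate E[w * A] with a sigmoid-preconditioned importance
    weight w, so extreme importance ratios cannot blow up the variance."""
    log_ratio = log_probs_new - log_probs_old
    # Squash into (0, 2); a log-ratio of 0 (on-policy) maps to weight 1.
    weight = 2.0 * torch.sigmoid(log_ratio)
    return -(weight * advantages).mean()  # negate: optimizers minimize
```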
arXiv Detail & Related papers (2022-05-20T09:38:04Z) - Learning Generalized Policies Without Supervision Using GNNs [20.322992960599255]
We consider the problem of learning generalized policies for classical planning domains using graph neural networks.
We use a simple and general GNN architecture and aim at obtaining crisp experimental results.
We exploit the relation established between the expressive power of GNNs and the $C_2$ fragment of first-order logic.
arXiv Detail & Related papers (2022-05-12T10:28:46Z) - Generalizing Off-Policy Learning under Sample Selection Bias [15.733136147164032]
We propose a novel framework for learning policies that generalize to the target population.
We prove that, if the uncertainty set is well-specified, our policies generalize to the target population, as they cannot perform worse than they do on the training data.
arXiv Detail & Related papers (2021-12-02T16:18:16Z) - Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in terms of both rank correlation and performance gap between the truly best and the best of the ranked top three policies.
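The ranking idea can be made concrete with a small pairwise objective: fit a scoring model so that its ordering over training policies agrees with their known performance. The sketch below is a generic illustration under that reading; the policy featurization and network are placeholders, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyScorer(nn.Module):
    """Maps a (placeholder) policy feature vector to a scalar score."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, policy_features: torch.Tensor) -> torch.Tensor:
        return self.net(policy_features).squeeze(-1)  # shape: (num_policies,)


def pairwise_ranking_loss(scores: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    """Push score differences to agree with the known performance ordering."""
    diff_scores = scores.unsqueeze(1) - scores.unsqueeze(0)      # [i, j] = s_i - s_j
    diff_returns = returns.unsqueeze(1) - returns.unsqueeze(0)   # [i, j] = r_i - r_j
    labels = (diff_returns > 0).float()   # 1 where policy i truly outperforms j
    mask = (diff_returns != 0).float()    # ignore ties and the diagonal
    loss = F.binary_cross_entropy_with_logits(diff_scores, labels, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```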
arXiv Detail & Related papers (2021-07-03T07:01:23Z) - Goal-Conditioned Reinforcement Learning with Imagined Subgoals [89.67840168694259]
We propose to incorporate imagined subgoals into policy learning to facilitate learning of complex tasks.
Imagined subgoals are predicted by a separate high-level policy, which is trained simultaneously with the policy and its critic.
We evaluate our approach on complex robotic navigation and manipulation tasks and show that it outperforms existing methods by a large margin.
arXiv Detail & Related papers (2021-07-01T15:30:59Z) - Learning General Policies from Small Examples Without Supervision [18.718037284357834]
Generalized planning is concerned with the computation of general policies that solve multiple instances of a planning domain all at once.
It has been recently shown that these policies can be computed in two steps: first, a suitable abstraction in the form of a qualitative numerical planning problem (QNP) is learned from sample plans; the QNP is then solved with a QNP planner to obtain the general policy.
In this work, we introduce an alternative approach for computing more expressive general policies which does not require sample plans or a QNP planner.
arXiv Detail & Related papers (2021-01-03T19:44:13Z) - Policy Evaluation Networks [50.53250641051648]
We introduce a scalable, differentiable fingerprinting mechanism that retains essential policy information in a concise embedding.
Our empirical results demonstrate that combining these three elements can produce policies that outperform those that generated the training data.
arXiv Detail & Related papers (2020-02-26T23:00:27Z)