Policy-Space Search: Equivalences, Improvements, and Compression
- URL: http://arxiv.org/abs/2403.19883v1
- Date: Thu, 28 Mar 2024 23:40:20 GMT
- Title: Policy-Space Search: Equivalences, Improvements, and Compression
- Authors: Frederico Messa, André Grahl Pereira
- Abstract summary: Fully-observable non-deterministic (FOND) planning is at the core of artificial intelligence planning with uncertainty.
A* with Non-Determinism (AND*) is a FOND planner that generalizes A* for FOND planning.
- Score: 5.801044612920816
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fully-observable non-deterministic (FOND) planning is at the core of artificial intelligence planning with uncertainty. It models uncertainty through actions with non-deterministic effects. A* with Non-Determinism (AND*) (Messa and Pereira, 2023) is a FOND planner that generalizes A* (Hart et al., 1968) for FOND planning. It searches for a solution policy by performing an explicit heuristic search on the policy space of the FOND task. In this paper, we study and improve the performance of the policy-space search performed by AND*. We present a polynomial-time procedure that constructs a solution policy given just the set of states that should be mapped. This procedure, together with a better understanding of the structure of FOND policies, allows us to present three concepts of equivalences between policies. We use policy equivalences to prune part of the policy search space, making AND* substantially more effective in solving FOND tasks. We also study the impact of taking into account structural state-space symmetries to strengthen the detection of equivalent policies and the impact of performing the search with satisficing techniques. We apply a recent technique from the group theory literature to better compute structural state-space symmetries. Finally, we present a solution compressor that, given a policy defined over complete states, finds a policy that unambiguously represents it using the minimum number of partial states. AND* with the introduced techniques generates, on average, two orders of magnitude fewer policies to solve FOND tasks. These techniques allow explicit policy-space search to be competitive in terms of both coverage and solution compactness with other state-of-the-art FOND planners.
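To make the search concrete, here is a minimal Python sketch of explicit policy-space search with equivalence-based pruning. The `task` interface (`initial_state`, `is_goal`, `applicable_actions`, `successors`), the evaluation function, and the pruning key are hypothetical simplifications: the actual AND* planner uses refined policy equivalences, FOND heuristics, and a proper solution test (e.g., handling dead ends), none of which are shown here.

```python
import heapq
from itertools import count

def reachable_open_states(task, policy):
    """States reachable from the initial state under `policy` that are
    neither goal states nor already mapped by the policy."""
    open_states, stack, visited = set(), [task.initial_state], set()
    while stack:
        s = stack.pop()
        if s in visited or task.is_goal(s):
            continue
        visited.add(s)
        if s in policy:
            stack.extend(task.successors(s, policy[s]))  # all non-det. outcomes
        else:
            open_states.add(s)
    return open_states

def policy_space_search(task, heuristic):
    tiebreak = count()
    frontier = [(heuristic({}), next(tiebreak), {})]
    expanded = set()  # equivalence pruning: one representative per mapped-state set
    while frontier:
        _, _, policy = heapq.heappop(frontier)
        key = frozenset(policy)  # the set of states this partial policy maps
        if key in expanded:
            continue  # an equivalent policy was already expanded
        expanded.add(key)
        open_states = reachable_open_states(task, policy)
        if not open_states:
            return policy  # closed policy: every reachable non-goal state is mapped
        s = next(iter(open_states))  # expand by choosing an action for one open state
        for a in task.applicable_actions(s):
            child = {**policy, s: a}
            heapq.heappush(frontier, (len(child) + heuristic(child), next(tiebreak), child))
    return None  # no solution policy exists
```

Pruning on the mapped-state set mirrors the abstract's observation that a solution policy can be reconstructed in polynomial time from just the set of states it maps, so policies over the same state set need not all be expanded.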
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches for continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned, but only their deterministic version is deployed.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
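A minimal sketch of the learn-stochastic, deploy-deterministic pattern this summary describes, with `sigma` playing the role of the tunable exploration level (the class and its names are illustrative, not the paper's):

```python
import numpy as np

class GaussianPolicy:
    """Illustrative linear-Gaussian policy: the stochastic version is used
    while learning; the mean (deterministic) version is deployed."""
    def __init__(self, dim, sigma=0.5):
        self.theta = np.zeros(dim)  # mean parameters, learned by policy gradient
        self.sigma = sigma          # exploration level: trades sample complexity
                                    # against deployed deterministic performance
    def act_explore(self, state):   # stochastic action used during learning
        return float(self.theta @ state + self.sigma * np.random.randn())
    def act_deploy(self, state):    # deterministic version used at deployment
        return float(self.theta @ state)
```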
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this is the first non-asymptotic last-iterate policy convergence result for single-time-scale algorithms in constrained MDPs.
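A schematic of a single-time-scale primal-dual update, where the policy parameters and the Lagrange multiplier are updated with the same step size (the interface is hypothetical and omits the paper's actual estimators and guarantees):

```python
def primal_dual_step(theta, lam, eta, reward_grad, cost_grad, cost_value, budget):
    """One schematic update on the Lagrangian L = R(theta) - lam * (C(theta) - budget)."""
    theta = theta + eta * (reward_grad(theta) - lam * cost_grad(theta))  # primal ascent
    lam = max(0.0, lam + eta * (cost_value(theta) - budget))             # projected dual ascent
    return theta, lam
```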
arXiv Detail & Related papers (2023-06-20T17:27:31Z)
- Policy Dispersion in Non-Markovian Environment [53.05904889617441]
This paper learns diverse policies from the history of state-action pairs in a non-Markovian environment.
We first adopt a transformer-based method to learn policy embeddings.
Then, we stack the policy embeddings to construct a dispersion matrix to induce a set of diverse policies.
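A sketch of the dispersion-matrix construction, assuming policy embeddings are already available from some sequence encoder over state-action histories (the encoder and names are illustrative):

```python
import numpy as np

def dispersion_matrix(embeddings):
    """Pairwise distances between policy embeddings; keeping its entries
    large encourages a diverse set of policies."""
    E = np.asarray(embeddings)             # shape: (n_policies, embed_dim)
    diffs = E[:, None, :] - E[None, :, :]
    return np.linalg.norm(diffs, axis=-1)  # shape: (n_policies, n_policies)
```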
arXiv Detail & Related papers (2023-02-28T11:58:39Z)
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
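The LCB idea in schematic form, assuming per-policy value estimates and uncertainty widths are given (this is not the paper's concrete estimator or bound):

```python
import numpy as np

def select_pessimistic(value_estimates, std_errors, beta=1.0):
    """Rank candidate decision rules by a lower confidence bound rather than
    the point estimate, so poorly-explored rules are penalized."""
    lcb = np.asarray(value_estimates) - beta * np.asarray(std_errors)
    return int(np.argmax(lcb))
```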
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
- Towards A Unified Policy Abstraction Theory and Representation Learning Approach in Markov Decision Processes [39.94472154078338]
We propose a unified policy abstraction theory, containing three types of policy abstraction associated with policy features at different levels.
We then generalize them to three policy metrics that quantify the distance (i.e., similarity) of policies.
For the empirical study, we investigate the efficacy of the proposed policy metrics and representations, in characterizing policy difference and conveying policy generalization respectively.
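One illustrative instance of such a policy metric, assuming each policy exposes its action distribution per state (the paper's three metrics are more general):

```python
import numpy as np

def policy_distance(pi_a, pi_b, states):
    """Average total-variation distance between two policies' action
    distributions on a sample of states; pi(s) returns action probabilities."""
    return float(np.mean([0.5 * np.abs(pi_a(s) - pi_b(s)).sum() for s in states]))
```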
arXiv Detail & Related papers (2022-09-16T03:41:50Z)
- Reward-Free Policy Space Compression for Reinforcement Learning [39.04317877999891]
In reinforcement learning, we encode the potential behaviors of an agent interacting with an environment into an infinite set of policies.
We seek a reward-free compression of the policy space into a finite set of representative policies.
We show that this compression of the policy space can be formulated as a set-cover problem, which is inherently NP-hard.
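Since the exact problem is NP-hard, a natural sketch is the standard greedy set-cover approximation, assuming a precomputed coverage relation (names hypothetical):

```python
def greedy_compression(policies, covers):
    """`covers[p]` is the set of policies representative p covers (e.g., those
    within eps of p under some policy metric). Greedy cover gives the
    classic ln(n)-approximation to minimum set cover."""
    uncovered, representatives = set(policies), []
    while uncovered:
        best = max(policies, key=lambda p: len(covers[p] & uncovered))
        if not covers[best] & uncovered:
            break  # remaining policies cannot be covered
        representatives.append(best)
        uncovered -= covers[best]
    return representatives
```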
arXiv Detail & Related papers (2022-02-22T18:11:57Z)
- Policy Manifold Search: Exploring the Manifold Hypothesis for Diversity-based Neuroevolution [4.920145245773581]
This paper proposes a novel method for diversity-based policy search via Neuroevolution.
We use the Quality-Diversity framework which provides a principled approach to policy search.
We also use the Jacobian of the inverse-mapping function to guide the search in the representation space.
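A loose sketch of a Jacobian-guided step in the learned representation space; the mapping, its Jacobian, and the rescaling rule here are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def jacobian_guided_step(z, jacobian, rng, scale=0.1):
    """Rescale a random latent step by the per-direction stretch of the
    inverse mapping (latent -> policy parameters), damping directions the
    decoder amplifies so steps have comparable effect in parameter space."""
    J = jacobian(z)                             # shape: (param_dim, latent_dim)
    stretch = np.linalg.norm(J, axis=0) + 1e-8  # stretch per latent direction
    return z + scale * rng.normal(size=z.shape) / stretch
```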
arXiv Detail & Related papers (2021-04-27T18:52:03Z)
- Policy Manifold Search for Improving Diversity-based Neuroevolution [4.920145245773581]
We propose a novel approach to diversity-based policy search via Neuroevolution.
Our approach iteratively collects policies according to the Quality-Diversity framework.
We use the Jacobian of the inverse transformation to guide the search in the latent space.
arXiv Detail & Related papers (2020-12-15T23:59:49Z)
- CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
To our knowledge, this is the first convergence analysis of SRL algorithms to globally optimal policies.
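A schematic of the constraint-rectification idea in the spirit of CRPO: improve the reward unless some constraint is violated beyond a tolerance, in which case descend that constraint instead (interface and names hypothetical):

```python
def crpo_step(theta, eta, reward_grad, cost_values, cost_grads, limits, tol=0.0):
    for i, (c, limit) in enumerate(zip(cost_values(theta), limits)):
        if c > limit + tol:
            return theta - eta * cost_grads(theta)[i]  # rectify the violated constraint
    return theta + eta * reward_grad(theta)            # all constraints safe: maximize reward
```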
arXiv Detail & Related papers (2020-11-11T16:05:14Z)
- Task-Agnostic Exploration via Policy Gradient of a Non-Parametric State Entropy Estimate [40.97686031763918]
In a reward-free environment, what is a suitable intrinsic objective for an agent to pursue so that it can learn an optimal task-agnostic exploration policy?
We argue that the entropy of the state distribution induced by finite-horizon trajectories is a sensible target.
We present a novel and practical policy-search algorithm, Maximum Entropy POLicy optimization (MEPOL), to learn a policy that maximizes a non-parametric, $k$-nearest neighbors estimate of the state distribution entropy.
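A generic k-nearest-neighbor entropy estimator of the kind MEPOL maximizes, shown up to additive constants (this is a textbook Kozachenko-Leonenko-style estimate, not necessarily the paper's exact form):

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_state_entropy(states, k=4):
    X = np.asarray(states, dtype=float)      # shape: (n_samples, state_dim)
    n, d = X.shape
    dists, _ = cKDTree(X).query(X, k=k + 1)  # k+1: nearest neighbor is the point itself
    radii = dists[:, -1] + 1e-12             # distance to the k-th true neighbor
    return d * float(np.mean(np.log(radii))) + np.log(n)  # up to additive constants
```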
arXiv Detail & Related papers (2020-07-09T08:44:39Z)
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
The proposed method improves stability and final performance on benchmark high-dimensional control tasks.
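In schematic form, the surrogate couples the usual importance-weighted improvement term with a visitation-divergence penalty; the divergence estimate itself is assumed given, and the names are illustrative rather than the paper's:

```python
import numpy as np

def regularized_surrogate(advantages, ratios, visitation_div, alpha=0.1):
    """Importance-weighted advantages minus alpha times an estimate of the
    divergence between consecutive policies' discounted state-action
    visitation distributions."""
    improvement = float(np.mean(np.asarray(ratios) * np.asarray(advantages)))
    return improvement - alpha * visitation_div
```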
arXiv Detail & Related papers (2020-03-09T13:05:47Z)