Learning and Planning in Complex Action Spaces
- URL: http://arxiv.org/abs/2104.06303v1
- Date: Tue, 13 Apr 2021 15:48:48 GMT
- Title: Learning and Planning in Complex Action Spaces
- Authors: Thomas Hubert and Julian Schrittwieser and Ioannis Antonoglou and
Mohammadamin Barekatain and Simon Schmitt and David Silver
- Abstract summary: We propose a general framework to reason in a principled way about policy evaluation and improvement.
This sample-based policy iteration framework can in principle be applied to any reinforcement learning algorithm.
We demonstrate this approach on the classical board game of Go and on two continuous control benchmark domains.
- Score: 19.33000677254158
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many important real-world problems have action spaces that are
high-dimensional, continuous or both, making full enumeration of all possible
actions infeasible. Instead, only small subsets of actions can be sampled for
the purpose of policy evaluation and improvement. In this paper, we propose a
general framework to reason in a principled way about policy evaluation and
improvement over such sampled action subsets. This sample-based policy
iteration framework can in principle be applied to any reinforcement learning
algorithm based upon policy iteration. Concretely, we propose Sampled MuZero,
an extension of the MuZero algorithm that is able to learn in domains with
arbitrarily complex action spaces by planning over sampled actions. We
demonstrate this approach on the classical board game of Go and on two
continuous control benchmark domains: DeepMind Control Suite and Real-World RL
Suite.
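To make the sample-based policy improvement idea concrete, below is a minimal Python sketch. It assumes a Gaussian policy prior over a toy continuous action space and uses a softmax over estimated action values as a stand-in for the search-based improvement operator; the names `sample_actions`, `improved_policy_over_samples`, and `q_estimate` are illustrative and not taken from the paper.

```python
# Hedged sketch: policy evaluation/improvement restricted to a sampled action subset.
# This is NOT the Sampled MuZero algorithm, only an illustration of the core idea:
# sample a few actions, score them, and form an improved distribution over that subset.
import numpy as np

def sample_actions(policy_mean, policy_std, num_samples, rng):
    """Draw a small subset of actions from the current (Gaussian) policy prior."""
    return rng.normal(policy_mean, policy_std, size=(num_samples, policy_mean.shape[0]))

def improved_policy_over_samples(actions, q_estimate, temperature=1.0):
    """Compute an improved distribution supported only on the sampled actions.

    A softmax over estimated action values stands in for the planning-based
    improvement operator; the resulting weights sum to one over the subset."""
    q_values = np.array([q_estimate(a) for a in actions])
    logits = q_values / temperature
    logits = logits - logits.max()          # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Usage: improve the policy using only K sampled actions; a full agent would
# then train the policy network toward these sample weights.
rng = np.random.default_rng(0)
mean, std = np.zeros(3), np.ones(3)          # toy 3-dimensional continuous action space
actions = sample_actions(mean, std, num_samples=16, rng=rng)
q_estimate = lambda a: -np.sum(a ** 2)       # toy value estimate: prefer actions near 0
weights = improved_policy_over_samples(actions, q_estimate)
print(actions[np.argmax(weights)])           # highest-weight sampled action
```

The point of the sketch is that evaluation and improvement never require enumerating the full action space: only the sampled subset is scored, which is what makes the approach applicable to high-dimensional or continuous actions.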
Related papers
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Learning Generalized Policies for Fully Observable Non-Deterministic Planning Domains [12.730070122798459]
General policies represent reactive strategies for solving large families of planning problems.
We extend the formulations and the resulting methods for learning general policies over fully observable, non-deterministic domains.
arXiv Detail & Related papers (2024-04-03T06:25:42Z)
- Offline Imitation Learning from Multiple Baselines with Applications to Compiler Optimization [17.729842629392742]
We study a Reinforcement Learning problem in which we are given a set of trajectories collected with K baseline policies.
The goal is to learn a policy which performs as well as the best combination of baselines on the entire state space.
arXiv Detail & Related papers (2024-03-28T14:34:02Z)
- DoMo-AC: Doubly Multi-step Off-policy Actor-Critic Algorithm [48.60180355291149]
We introduce doubly multi-step off-policy VI (DoMo-VI), a novel oracle algorithm that combines multi-step policy improvements and policy evaluations.
We then propose doubly multi-step off-policy actor-critic (DoMo-AC), a practical instantiation of the DoMo-VI algorithm.
arXiv Detail & Related papers (2023-05-29T14:36:51Z)
- Chain-of-Thought Predictive Control [32.30974063877643]
We study generalizable policy learning from demonstrations for complex low-level control.
We propose a novel hierarchical imitation learning method that utilizes sub-optimal demos.
arXiv Detail & Related papers (2023-04-03T07:59:13Z)
- POLTER: Policy Trajectory Ensemble Regularization for Unsupervised Reinforcement Learning [30.834631947104498]
We present POLTER - a method to regularize the pretraining that can be applied to any URL algorithm.
We evaluate POLTER on the Unsupervised Reinforcement Learning Benchmark (URLB), which consists of 12 tasks in 3 domains.
We demonstrate the generality of our approach by improving the performance of a diverse set of data- and knowledge-based URL algorithms by 19% on average and up to 40% in the best case.
arXiv Detail & Related papers (2022-05-23T14:42:38Z)
- Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates [63.58053355357644]
We study the problem of learning a good set of policies, so that when combined together, they can solve a wide variety of unseen reinforcement learning tasks.
We show theoretically that having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance.
arXiv Detail & Related papers (2021-12-30T12:20:46Z)
- Zeroth-Order Supervised Policy Improvement [94.0748002906652]
Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL).
We propose Zeroth-Order Supervised Policy Improvement (ZOSPI).
ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of the PG methods.
arXiv Detail & Related papers (2020-06-11T16:49:23Z)
- Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension.
We construct a critic to estimate action-value functions, apply it to correlated actions, and combine these critic-estimated action values to control the variance of gradient estimation.
These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques.
arXiv Detail & Related papers (2020-02-10T04:23:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.