Memory-Constrained Policy Optimization
- URL: http://arxiv.org/abs/2204.09315v1
- Date: Wed, 20 Apr 2022 08:50:23 GMT
- Title: Memory-Constrained Policy Optimization
- Authors: Hung Le, Thommen Karimpanal George, Majid Abdolshah, Dung Nguyen, Kien
Do, Sunil Gupta, Svetha Venkatesh
- Abstract summary: We introduce a new constrained optimization method for policy gradient reinforcement learning.
We form a second trust region through the construction of another virtual policy that represents a wide range of past policies.
We then enforce the new policy to stay closer to the virtual policy, which is beneficial in case the old policy performs badly.
- Score: 59.63021433336966
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a new constrained optimization method for policy gradient
reinforcement learning, which uses two trust regions to regulate each policy
update. In addition to using the proximity of one single old policy as the
first trust region as done by prior works, we propose to form a second trust
region through the construction of another virtual policy that represents a
wide range of past policies. We then enforce the new policy to stay closer to
the virtual policy, which is beneficial in case the old policy performs badly.
More importantly, we propose a mechanism to automatically build the virtual
policy from a memory buffer of past policies, providing a new capability for
dynamically selecting appropriate trust regions during the optimization
process. Our proposed method, dubbed as Memory-Constrained Policy Optimization
(MCPO), is examined on a diverse suite of environments including robotic
locomotion control, navigation with sparse rewards and Atari games,
consistently demonstrating competitive performance against recent on-policy
constrained policy gradient methods.
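The abstract describes an update that is kept close to both the previous policy and a virtual policy distilled from a memory buffer of past policies. Below is a minimal sketch of that two-trust-region idea expressed as KL penalties; the simple averaging used to build the virtual policy, the penalty form, and the coefficients are assumptions for illustration, not the paper's actual MCPO procedure.

```python
# Illustrative sketch (not the authors' code): a policy-gradient loss penalized by
# two KL trust regions -- one to the previous policy and one to a "virtual" policy
# built from a memory buffer of past policies. Coefficients are placeholders.
import torch
import torch.nn.functional as F

def mcpo_style_loss(logits_new, logits_old, memory_logits, actions, advantages,
                    beta_old=1.0, beta_virtual=1.0):
    """logits_*: [batch, n_actions]; memory_logits: list of [batch, n_actions]."""
    logp_new = F.log_softmax(logits_new, dim=-1)
    logp_old = F.log_softmax(logits_old, dim=-1).detach()

    # Virtual policy: here just the average of the distributions stored in memory
    # (the paper builds it adaptively from the buffer; this is a stand-in).
    virtual_probs = torch.stack(
        [F.softmax(l, dim=-1) for l in memory_logits]).mean(dim=0).detach()

    # Standard importance-weighted policy-gradient term.
    chosen = actions.unsqueeze(1)
    ratio = torch.exp(logp_new.gather(1, chosen) - logp_old.gather(1, chosen)).squeeze(1)
    pg_loss = -(ratio * advantages).mean()

    # Two trust regions expressed as KL penalties.
    kl_old = F.kl_div(logp_new, logp_old.exp(), reduction="batchmean")
    kl_virtual = F.kl_div(logp_new, virtual_probs, reduction="batchmean")
    return pg_loss + beta_old * kl_old + beta_virtual * kl_virtual
```

Per the abstract, the substantive contribution is the mechanism that automatically turns the memory buffer into an appropriate trust region during optimization, which this sketch does not attempt to reproduce.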
Related papers
- Supported Trust Region Optimization for Offline Reinforcement Learning [59.43508325943592]
We propose Supported Trust Region optimization (STR) which performs trust region policy optimization with the policy constrained within the support of the behavior policy.
We show that, when assuming no approximation and sampling error, STR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset.
arXiv Detail & Related papers (2023-11-15T13:16:16Z)
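As a rough illustration of the support constraint described above (not STR's actual trust-region update), one can mask and renormalize a candidate policy over the actions the estimated behavior policy actually supports; the probability threshold below is an assumption.

```python
# Hypothetical sketch of a support constraint: renormalize the learned policy
# over actions the (estimated) behavior policy supports. Threshold is an assumption.
import numpy as np

def project_to_support(policy_probs, behavior_probs, eps=1e-3):
    """Zero out actions outside the behavior policy's support and renormalize."""
    support = behavior_probs > eps
    constrained = np.where(support, policy_probs, 0.0)
    total = constrained.sum(axis=-1, keepdims=True)
    # Fall back to the behavior policy where nothing survives the mask.
    return np.where(total > 0, constrained / np.maximum(total, 1e-12), behavior_probs)

policy = np.array([[0.5, 0.3, 0.2]])
behavior = np.array([[0.6, 0.4, 0.0]])   # third action never taken in the dataset
print(project_to_support(policy, behavior))  # mass moved onto supported actions
```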
- IOB: Integrating Optimization Transfer and Behavior Transfer for Multi-Policy Reuse [50.90781542323258]
Reinforcement learning (RL) agents can transfer knowledge from source policies to a related target task.
Previous methods introduce additional components, such as hierarchical policies or estimations of source policies' value functions.
We propose a novel transfer RL method that selects the source policy without training extra components.
arXiv Detail & Related papers (2023-08-14T09:22:35Z)
- Provably Convergent Policy Optimization via Metric-aware Trust Region Methods [21.950484108431944]
Trust-region methods are pervasively used to stabilize policy optimization in reinforcement learning.
We exploit more flexible metrics and examine two natural extensions of policy optimization with Wasserstein and Sinkhorn trust regions.
We show that WPO (the Wasserstein variant) guarantees a monotonic performance improvement, and SPO (the Sinkhorn variant) provably converges to WPO as the entropic regularizer diminishes.
arXiv Detail & Related papers (2023-06-25T05:41:38Z)
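The "metric-aware" ingredient above replaces KL divergence with an optimal-transport distance between action distributions. The toy Sinkhorn (entropy-regularized Wasserstein) computation below shows only that distance ingredient; the cost matrix, regularization strength, and iteration count are illustrative, and the paper's WPO/SPO updates involve much more than this.

```python
# Toy Sinkhorn distance between two action distributions over a metric action
# space; this only illustrates the "metric-aware" trust-region ingredient.
import numpy as np

def sinkhorn_distance(p, q, cost, reg=0.1, n_iters=200):
    """Entropic-regularized OT cost between distributions p and q (1-D arrays)."""
    K = np.exp(-cost / reg)                  # Gibbs kernel
    u = np.ones_like(p)
    for _ in range(n_iters):
        v = q / (K.T @ u)
        u = p / (K @ v)
    transport = np.diag(u) @ K @ np.diag(v)  # approximate optimal transport plan
    return float(np.sum(transport * cost))

# Actions embedded on a line, so "nearby" actions are cheap to move mass between.
actions = np.array([0.0, 1.0, 2.0])
cost = np.abs(actions[:, None] - actions[None, :])
p = np.array([0.7, 0.2, 0.1])   # old policy
q = np.array([0.6, 0.3, 0.1])   # candidate new policy
print(sinkhorn_distance(p, q, cost))
```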
- Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this is the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z)
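The generic primal-dual template behind such methods alternates a policy (primal) step on the Lagrangian with a multiplier (dual) step on the constraint violation, both at comparable step sizes. The toy surrogates, step sizes, and iteration count below are assumptions, not the paper's algorithm.

```python
# Generic Lagrangian primal-dual loop for a constrained-MDP-style problem:
# maximize a reward objective subject to a cost objective staying under a budget.
import numpy as np

def primal_dual_step(theta, lam, grad_reward, grad_cost, cost_value, budget,
                     lr_theta=0.01, lr_lam=0.01):
    # Primal ascent on L(theta, lam) = J_r(theta) - lam * (J_c(theta) - budget).
    theta = theta + lr_theta * (grad_reward(theta) - lam * grad_cost(theta))
    # Dual ascent on the constraint violation, projected onto lam >= 0.
    lam = max(0.0, lam + lr_lam * (cost_value(theta) - budget))
    return theta, lam

# Toy smooth surrogates standing in for policy-gradient and cost estimates.
g_r = lambda th: -2.0 * (th - 1.0)        # gradient of J_r(th) = -||th - 1||^2
g_c = lambda th: 2.0 * th                 # gradient of J_c(th) = ||th||^2
J_c = lambda th: float(np.sum(th ** 2))

theta, lam = np.zeros(2), 0.0
for _ in range(2000):
    theta, lam = primal_dual_step(theta, lam, g_r, g_c, J_c, budget=1.0)
print(theta, lam)  # theta settles near the boundary ||theta||^2 = 1 with lam > 0
```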
- Trust-Region-Free Policy Optimization for Stochastic Policies [60.52463923712565]
We show that the trust region constraint over policies can be safely substituted by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee.
We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree), as it is free of any explicit trust region constraints.
arXiv Detail & Related papers (2023-02-15T23:10:06Z)
- Local Policy Improvement for Recommender Systems [8.617221361305901]
We show how to train a new policy given data collected from a previously-deployed policy.
We suggest an alternative approach of local policy improvement without off-policy correction.
This local policy improvement paradigm is ideal for recommender systems, as previous policies are typically of decent quality and policies are updated frequently.
arXiv Detail & Related papers (2022-12-22T00:47:40Z)
- Fast Model-based Policy Search for Universal Policy Networks [45.44896435487879]
Adapting an agent's behaviour to new environments has been one of the primary focus areas of physics-based reinforcement learning.
We propose a Gaussian Process-based prior, learned in simulation, that captures the likely performance of a policy when transferred to a previously unseen environment.
We integrate this prior with a Bayesian optimisation-based policy search process to improve the efficiency of identifying the most appropriate policy from the universal policy network.
arXiv Detail & Related papers (2022-02-11T18:08:02Z)
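Schematically, the search above scores candidate policy parameters with a GP surrogate (which the paper warm-starts with a prior learned in simulation) and evaluates the most promising candidate each round. In the sketch below the 1-D objective, kernel, and UCB acquisition are placeholders, and the universal policy network and simulation-learned prior are not modeled.

```python
# Sketch of Bayesian-optimisation-style policy selection with a GP surrogate.
# The objective, kernel, and acquisition function are illustrative assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def rollout_return(z):                       # stand-in for evaluating policy pi(z)
    return float(np.exp(-((z - 0.3) ** 2) / 0.05))

candidates = np.linspace(0.0, 1.0, 101).reshape(-1, 1)
X = [[0.0], [1.0]]                           # seed evaluations
y = [rollout_return(0.0), rollout_return(1.0)]

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-4,
                              optimizer=None)   # keep the kernel fixed for clarity
for _ in range(10):
    gp.fit(np.array(X), np.array(y))
    mu, sigma = gp.predict(candidates, return_std=True)
    z_next = float(candidates[np.argmax(mu + 2.0 * sigma), 0])   # UCB acquisition
    X.append([z_next])
    y.append(rollout_return(z_next))

print(max(y))   # best return found among the evaluated candidate policies
```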
- Non-Stationary Off-Policy Optimization [50.41335279896062]
We study the novel problem of off-policy optimization in piecewise-stationary contextual bandits.
In the offline learning phase, we partition logged data into categorical latent states and learn a near-optimal sub-policy for each state.
In the online deployment phase, we adaptively switch between the learned sub-policies based on their performance.
arXiv Detail & Related papers (2020-06-15T09:16:09Z)
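The online phase described above can be caricatured as follows: keep a short reward history per learned sub-policy and switch to whichever currently looks best, with occasional re-exploration. The window size and epsilon-style exploration are assumptions, not the paper's switching rule.

```python
# Sketch of the online phase only: switch between pre-learned sub-policies based
# on their recent empirical reward.
import random
from collections import deque, defaultdict

class SubPolicySwitcher:
    def __init__(self, n_subpolicies, window=50, explore_prob=0.1):
        self.recent = defaultdict(lambda: deque(maxlen=window))
        self.n = n_subpolicies
        self.explore_prob = explore_prob

    def choose(self):
        # Occasionally re-try every sub-policy so a regime change is noticed.
        if random.random() < self.explore_prob:
            return random.randrange(self.n)
        scores = [sum(self.recent[i]) / len(self.recent[i]) if self.recent[i]
                  else float("inf") for i in range(self.n)]
        return max(range(self.n), key=lambda i: scores[i])

    def update(self, subpolicy_idx, reward):
        self.recent[subpolicy_idx].append(reward)

switcher = SubPolicySwitcher(n_subpolicies=3)
idx = switcher.choose()           # pick a sub-policy to act with
switcher.update(idx, reward=1.0)  # record the observed reward for that choice
```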
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.