Improving Actor-Critic Reinforcement Learning via Hamiltonian Policy
- URL: http://arxiv.org/abs/2103.12020v1
- Date: Mon, 22 Mar 2021 17:26:43 GMT
- Title: Improving Actor-Critic Reinforcement Learning via Hamiltonian Policy
- Authors: Duo Xu, Faramarz Fekri
- Abstract summary: Approximating optimal policies in reinforcement learning (RL) is necessary in many real-world scenarios.
In this work, inspired by the previous use of Hamiltonian Monte Carlo (HMC) in VI, we propose to integrate policy optimization with HMC.
We show that the proposed approach is a data-efficient and easy-to-implement improvement over previous policy optimization methods.
- Score: 11.34520632697191
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Approximating optimal policies in reinforcement learning (RL) is
necessary in many real-world scenarios; this task is termed policy optimization.
Viewing reinforcement learning from the perspective of variational inference
(VI), the policy network is trained to approximate the posterior over actions
given the optimality criteria. In practice, however, policy optimization may
yield suboptimal policy estimates due to the amortization gap and insufficient
exploration. In this work, inspired by the previous use of Hamiltonian Monte
Carlo (HMC) in VI, we propose to integrate policy optimization with HMC:
actions sampled from the base policy are evolved according to HMC. First, HMC
improves the policy distribution so that it better approximates the posterior,
reducing the amortization gap. Second, HMC guides exploration toward regions
with higher action values, improving exploration efficiency. Instead of
applying HMC to RL directly, we propose a new leapfrog operator to simulate the
Hamiltonian dynamics. Through comprehensive experiments on continuous control
benchmarks, including MuJoCo, PyBullet Roboschool, and the DeepMind Control
Suite, we show that the proposed approach is a data-efficient and
easy-to-implement improvement over previous policy optimization methods.
Moreover, it also outperforms previous methods on the DeepMind Control Suite,
which has image-based, high-dimensional observation spaces.
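To make the mechanism concrete, the sketch below shows how an action drawn from the base policy could be refined by a standard HMC leapfrog update whose potential energy is the negative action value, so the dynamics drift toward higher-value actions. This is a minimal sketch under assumed helper names (q_value, grad_q_value) and hyperparameters; it uses the textbook leapfrog integrator, not the modified leapfrog operator proposed in the paper.

```python
import numpy as np

def hmc_refine_action(a0, q_value, grad_q_value,
                      step_size=0.05, n_leapfrog=5, rng=None):
    """Refine one action with a single HMC step targeting exp(Q(s, a)).

    q_value(a) and grad_q_value(a) are assumed callables returning the
    critic value and its gradient w.r.t. the action at the current state.
    Names and hyperparameters are illustrative, not the paper's exact ones.
    """
    rng = np.random.default_rng() if rng is None else rng
    a = np.array(a0, dtype=float)
    p = rng.standard_normal(a.shape)              # sample auxiliary momentum

    def potential(x):
        return -q_value(x)                        # U(a) = -Q(s, a)

    def grad_potential(x):
        return -grad_q_value(x)                   # dU/da = -dQ/da

    # Hamiltonian at the starting point (potential + kinetic energy).
    h_old = potential(a) + 0.5 * np.sum(p * p)

    # Standard leapfrog integration of the Hamiltonian dynamics.
    p = p - 0.5 * step_size * grad_potential(a)   # half step for momentum
    for _ in range(n_leapfrog - 1):
        a = a + step_size * p                     # full step for the action
        p = p - step_size * grad_potential(a)     # full step for momentum
    a = a + step_size * p
    p = p - 0.5 * step_size * grad_potential(a)   # final half step

    # Metropolis correction keeps the refined actions consistent with the
    # target distribution proportional to exp(Q(s, a)).
    h_new = potential(a) + 0.5 * np.sum(p * p)
    if np.log(rng.uniform()) < h_old - h_new:
        return a                                  # accept the refined action
    return np.array(a0, dtype=float)              # reject: keep the base action
```

In an actor-critic setting, q_value would typically be the critic evaluated at the current state and grad_q_value its gradient with respect to the action, so the refined action can be executed in the environment in place of the raw sample from the base policy.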
Related papers
- Forward KL Regularized Preference Optimization for Aligning Diffusion Policies [8.958830452149789]
A central problem for learning diffusion policies is to align the policy output with human intents in various tasks.
We propose a novel framework, Forward KL regularized Preference optimization, to align the diffusion policy with preferences directly.
The results show our method exhibits superior alignment with preferences and outperforms previous state-of-the-art algorithms.
arXiv Detail & Related papers (2024-09-09T13:56:03Z)
- POTEC: Off-Policy Learning for Large Action Spaces via Two-Stage Policy Decomposition [40.851324484481275]
We study off-policy learning of contextual bandit policies in large discrete action spaces.
We propose a novel two-stage algorithm, called Policy Optimization via Two-Stage Policy Decomposition.
We show that POTEC provides substantial improvements in OPL effectiveness particularly in large and structured action spaces.
arXiv Detail & Related papers (2024-02-09T03:01:13Z)
- Theoretically Guaranteed Policy Improvement Distilled from Model-Based Planning [64.10794426777493]
Model-based reinforcement learning (RL) has demonstrated remarkable successes on a range of continuous control tasks.
Recent practices tend to distill optimized action sequences into an RL policy during the training phase.
We develop an approach that distills model-based planning into the policy.
arXiv Detail & Related papers (2023-07-24T16:52:31Z)
- Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
- Acceleration in Policy Optimization [50.323182853069184]
We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) by integrating foresight in the policy improvement step via optimistic and adaptive updates.
We define optimism as predictive modelling of the future behavior of a policy, and adaptivity as taking immediate and anticipatory corrective actions to mitigate errors from overshooting predictions or delayed responses to change.
We design an optimistic policy gradient algorithm, adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration, in an illustrative task.
arXiv Detail & Related papers (2023-06-18T15:50:57Z)
- Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning.
In this paper, we propose our closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
- Generalised Policy Improvement with Geometric Policy Composition [18.80807234471197]
We introduce a method for policy improvement that interpolates between the greedy approach of value-based reinforcement learning (RL) and the full planning approach typical of model-based RL.
We show that we can evaluate any non-Markov policy that switches between a set of base Markov policies with fixed probability by a careful composition of the base policy GHMs.
We can then apply generalised policy improvement (GPI) to collections of such non-Markov policies to obtain a new Markov policy that will in general outperform its precursors.
arXiv Detail & Related papers (2022-06-17T12:52:13Z)
- Variance Reduction based Partial Trajectory Reuse to Accelerate Policy Gradient Optimization [3.621753051212441]
We extend the idea of green simulation assisted policy gradient (GS-PG) to partial historical trajectory reuse for Markov Decision Processes (MDPs).
In this paper, the mixture likelihood ratio (MLR) based policy gradient estimation is used to leverage the information from historical state decision transitions generated under different behavioral policies.
arXiv Detail & Related papers (2022-05-06T01:42:28Z)
- Iterative Amortized Policy Optimization [147.63129234446197]
Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control.
From the variational inference perspective, policy networks are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly.
We demonstrate that iterative amortized policy optimization yields performance improvements over direct amortization on benchmark continuous control tasks.
arXiv Detail & Related papers (2020-10-20T23:25:42Z)
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
- Population-Guided Parallel Policy Search for Reinforcement Learning [17.360163137926]
A new population-guided parallel learning scheme is proposed to enhance the performance of off-policy reinforcement learning (RL).
In the proposed scheme, multiple identical learners with their own value-functions and policies share a common experience replay buffer, and search a good policy in collaboration with the guidance of the best policy information.
arXiv Detail & Related papers (2020-01-09T10:13:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.