Theoretically Guaranteed Policy Improvement Distilled from Model-Based
Planning
- URL: http://arxiv.org/abs/2307.12933v1
- Date: Mon, 24 Jul 2023 16:52:31 GMT
- Title: Theoretically Guaranteed Policy Improvement Distilled from Model-Based
Planning
- Authors: Chuming Li, Ruonan Jia, Jie Liu, Yinmin Zhang, Yazhe Niu, Yaodong
Yang, Yu Liu, Wanli Ouyang
- Abstract summary: Model-based reinforcement learning (RL) has demonstrated remarkable successes on a range of continuous control tasks.
Recent practices tend to distill optimized action sequences into an RL policy during the training phase.
We develop an approach to distill from model-based planning to the policy.
- Score: 64.10794426777493
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model-based reinforcement learning (RL) has demonstrated remarkable successes
on a range of continuous control tasks due to its high sample efficiency. To
save the computation cost of conducting planning online, recent practices tend
to distill optimized action sequences into an RL policy during the training
phase. Although the distillation can incorporate both the foresight of planning
and the exploration ability of RL policies, the theoretical understanding of
these methods is yet unclear. In this paper, we extend the policy improvement
step of Soft Actor-Critic (SAC) by developing an approach to distill from
model-based planning to the policy. We then demonstrate that such an approach
of policy improvement has a theoretical guarantee of monotonic improvement and
convergence to the maximum value defined in SAC. We discuss effective design
choices and implement our theory as a practical algorithm -- Model-based
Planning Distilled to Policy (MPDP) -- that updates the policy jointly over
multiple future time steps. Extensive experiments show that MPDP achieves
better sample efficiency and asymptotic performance than both model-free and
model-based planning algorithms on six continuous control benchmark tasks in
MuJoCo.
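The abstract describes MPDP only at a high level; the sketch below illustrates the general shape of such an update, assuming hypothetical interfaces for a SAC-style actor (`policy`), soft Q-function (`q_func`), learned dynamics model (`model`), and trajectory planner (`planner`). It is a minimal illustration of distilling a plan into a policy over multiple imagined steps, not the authors' implementation.

```python
import torch


def mpdp_style_policy_loss(policy, q_func, model, planner, s0, alpha=0.2, horizon=5):
    """Sketch: SAC-style policy improvement distilled from a model-based plan.

    Assumed (hypothetical) interfaces, not the authors' code:
      policy(s)    -> (action, log_prob) via the reparameterization trick
      q_func(s, a) -> soft Q-value estimate (tensor)
      model(s, a)  -> next state predicted by the learned dynamics model
      planner(s)   -> optimized action sequence of length `horizon` (e.g., from CEM)
    """
    plan = planner(s0)          # optimized action sequence from the model-based planner
    s, loss = s0, 0.0
    for t in range(horizon):
        a, log_pi = policy(s)   # reparameterized sample from the current policy
        # SAC-style soft policy improvement at each imagined step:
        # maximize E[Q(s, a) - alpha * log pi(a|s)], i.e. minimize its negative.
        loss = loss - (q_func(s, a) - alpha * log_pi).mean()
        with torch.no_grad():
            # Follow the planner's action in the learned model so the policy is
            # updated jointly over multiple future time steps.
            s = model(s, plan[t])
    return loss / horizon
```

Minimizing this scalar with a standard gradient-based optimizer over the policy parameters would realize the multi-step, SAC-style improvement the abstract refers to, under the assumptions stated above.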
Related papers
- Diffusion Policy Policy Optimization [37.04382170999901]
Diffusion Policy Policy Optimization (DPPO) is an algorithmic framework for fine-tuning diffusion-based policies.
DPPO achieves the strongest overall performance and efficiency for fine-tuning on common benchmarks.
We show that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization.
arXiv Detail & Related papers (2024-09-01T02:47:50Z)
- Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality.
We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO).
Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions.
We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
arXiv Detail & Related papers (2024-05-25T10:45:46Z)
- Maximum Entropy Reinforcement Learning via Energy-Based Normalizing Flow [14.681645502417215]
We introduce a new MaxEnt RL framework modeled using Energy-Based Normalizing Flows (EBFlow).
This framework integrates the policy evaluation steps and the policy improvement steps, resulting in a single objective training process.
Our method achieves superior performance compared to widely-adopted representative baselines.
arXiv Detail & Related papers (2024-05-22T13:26:26Z)
- Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
- Planning with Sequence Models through Iterative Energy Minimization [22.594413287842574]
We suggest an approach towards integrating planning with sequence models based on the idea of iterative energy minimization.
We train a masked language model to capture an implicit energy function over trajectories of actions, and formulate planning as finding a trajectory of actions with minimum energy.
We illustrate how this procedure enables improved performance over recent approaches across BabyAI and Atari environments.
arXiv Detail & Related papers (2023-03-28T17:53:22Z)
- Diverse Policy Optimization for Structured Action Space [59.361076277997704]
We propose Diverse Policy Optimization (DPO) to model policies in structured action spaces as energy-based models (EBMs).
A novel and powerful generative model, GFlowNet, is introduced as an efficient and diverse EBM-based policy sampler.
Experiments on ATSC and Battle benchmarks demonstrate that DPO can efficiently discover surprisingly diverse policies.
arXiv Detail & Related papers (2023-02-23T10:48:09Z)
- Evaluating model-based planning and planner amortization for continuous control [79.49319308600228]
We take a hybrid approach, combining model predictive control (MPC) with a learned model and model-free policy learning.
We find that well-tuned model-free agents are strong baselines even for high DoF control problems.
We show that it is possible to distil a model-based planner into a policy that amortizes the planning without any loss of performance (a generic sketch of this kind of distillation is given after this list).
arXiv Detail & Related papers (2021-10-07T12:00:40Z)
- COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions.
We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z)
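Several entries above, such as the planner-amortization paper, distill a model-based planner into a reactive policy. The following is a minimal, generic sketch of that idea under assumed interfaces (`planner` and `policy` are hypothetical callables); it is not the specific objective used in any of the cited papers.

```python
import torch


def planner_amortization_loss(policy, planner, states):
    """Generic behavior-cloning-style distillation of a planner into a policy.

    Assumed (hypothetical) interfaces:
      planner(s)    -> optimized action sequence for state s (its first action is imitated)
      policy(batch) -> batch of predicted actions (deterministic or mean output)
      states        -> iterable of state tensors
    """
    states_batch = torch.stack(list(states))
    with torch.no_grad():
        # Planner outputs are treated as fixed supervision targets.
        targets = torch.stack([planner(s)[0] for s in states])
    predicted = policy(states_batch)
    # Squared-error imitation loss; the cited work studies whether such
    # amortization can match the planner's performance without online planning.
    return ((predicted - targets) ** 2).mean()
```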
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.