Variational Model-based Policy Optimization
- URL: http://arxiv.org/abs/2006.05443v2
- Date: Wed, 24 Jun 2020 01:00:51 GMT
- Title: Variational Model-based Policy Optimization
- Authors: Yinlam Chow and Brandon Cui and MoonKyung Ryu and Mohammad Ghavamzadeh
- Abstract summary: Model-based reinforcement learning (RL) algorithms allow us to combine model-generated data with those collected from interaction with the real system in order to alleviate the data efficiency problem in RL.
We propose an objective function as a variational lower-bound of a log-likelihood of a log-likelihood to jointly learn and improve model and policy.
Our experiments on a number of continuous control tasks show that despite being more complex, our model-based (E-step) algorithm, called emactoral model-based policy optimization (VMBPO), is more sample-efficient and
- Score: 34.80171122943031
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model-based reinforcement learning (RL) algorithms allow us to combine
model-generated data with those collected from interaction with the real system
in order to alleviate the data efficiency problem in RL. However, designing
such algorithms is often challenging because the bias in simulated data may
overshadow the ease of data generation. A potential solution to this challenge
is to jointly learn and improve model and policy using a universal objective
function. In this paper, we leverage the connection between RL and
probabilistic inference, and formulate such an objective function as a
variational lower-bound of a log-likelihood. This allows us to use expectation
maximization (EM) and iteratively fix a baseline policy and learn a variational
distribution, consisting of a model and a policy (E-step), followed by
improving the baseline policy given the learned variational distribution
(M-step). We propose model-based and model-free policy iteration (actor-critic)
style algorithms for the E-step and show how the variational distribution
learned by them can be used to optimize the M-step in a fully model-based
fashion. Our experiments on a number of continuous control tasks show that
despite being more complex, our model-based (E-step) algorithm, called {\em
variational model-based policy optimization} (VMBPO), is more sample-efficient
and robust to hyper-parameter tuning than its model-free (E-step) counterpart.
Using the same control tasks, we also compare VMBPO with several
state-of-the-art model-based and model-free RL algorithms and show its sample
efficiency and performance.
Related papers
- The Virtues of Laziness in Model-based RL: A Unified Objective and
Algorithms [37.025378882978714]
We propose a novel approach to addressing two fundamental challenges in Model-based Reinforcement Learning (MBRL)
Our "lazy" method leverages a novel unified objective, Performance Difference via Advantage in Model, to capture the performance difference between the learned policy and expert policy.
We present two no-regret algorithms to optimize the proposed objective, and demonstrate their statistical and computational gains.
arXiv Detail & Related papers (2023-03-01T17:42:26Z) - When to Update Your Model: Constrained Model-based Reinforcement
Learning [50.74369835934703]
We propose a novel and general theoretical scheme for a non-decreasing performance guarantee of model-based RL (MBRL)
Our follow-up derived bounds reveal the relationship between model shifts and performance improvement.
A further example demonstrates that learning models from a dynamically-varying number of explorations benefit the eventual returns.
arXiv Detail & Related papers (2022-10-15T17:57:43Z) - Fully Decentralized Model-based Policy Optimization for Networked
Systems [23.46407780093797]
This work aims to improve data efficiency of multi-agent control by model-based learning.
We consider networked systems where agents are cooperative and communicate only locally with their neighbors.
In our method, each agent learns a dynamic model to predict future states and broadcast their predictions by communication, and then the policies are trained under the model rollouts.
arXiv Detail & Related papers (2022-07-13T23:52:14Z) - Evaluating model-based planning and planner amortization for continuous
control [79.49319308600228]
We take a hybrid approach, combining model predictive control (MPC) with a learned model and model-free policy learning.
We find that well-tuned model-free agents are strong baselines even for high DoF control problems.
We show that it is possible to distil a model-based planner into a policy that amortizes the planning without any loss of performance.
arXiv Detail & Related papers (2021-10-07T12:00:40Z) - COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions.
We find that COMBO consistently performs as well or better as compared to prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z) - Model-based Policy Optimization with Unsupervised Model Adaptation [37.09948645461043]
We investigate how to bridge the gap between real and simulated data due to inaccurate model estimation for better policy optimization.
We propose a novel model-based reinforcement learning framework AMPO, which introduces unsupervised model adaptation.
Our approach achieves state-of-the-art performance in terms of sample efficiency on a range of continuous control benchmark tasks.
arXiv Detail & Related papers (2020-10-19T14:19:42Z) - Control as Hybrid Inference [62.997667081978825]
We present an implementation of CHI which naturally mediates the balance between iterative and amortised inference.
We verify the scalability of our algorithm on a continuous control benchmark, demonstrating that it outperforms strong model-free and model-based baselines.
arXiv Detail & Related papers (2020-07-11T19:44:09Z) - MOPO: Model-based Offline Policy Optimization [183.6449600580806]
offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics.
arXiv Detail & Related papers (2020-05-27T08:46:41Z) - Model-Augmented Actor-Critic: Backpropagating through Paths [81.86992776864729]
Current model-based reinforcement learning approaches use the model simply as a learned black-box simulator.
We show how to make more effective use of the model by exploiting its differentiability.
arXiv Detail & Related papers (2020-05-16T19:18:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.