Mismatched No More: Joint Model-Policy Optimization for Model-Based RL
- URL: http://arxiv.org/abs/2110.02758v1
- Date: Wed, 6 Oct 2021 13:43:27 GMT
- Title: Mismatched No More: Joint Model-Policy Optimization for Model-Based RL
- Authors: Benjamin Eysenbach, Alexander Khazatsky, Sergey Levine, and Ruslan
Salakhutdinov
- Abstract summary: We propose a single objective for jointly training the model and the policy, such that updates to either component increase a lower bound on expected return.
Our objective is a global lower bound on expected return, and this bound becomes tight under certain assumptions.
The resulting algorithm (MnM) is conceptually similar to a GAN.
- Score: 172.37829823752364
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many model-based reinforcement learning (RL) methods follow a similar
template: fit a model to previously observed data, and then use data from that
model for RL or planning. However, models that achieve better training
performance (e.g., lower MSE) are not necessarily better for control: an RL
agent may seek out the small fraction of states where an accurate model makes
mistakes, or it might act in ways that do not expose the errors of an
inaccurate model. As noted in prior work, there is an objective mismatch:
models are useful if they yield good policies, but they are trained to maximize
their accuracy, rather than the performance of the policies that result from
them. In this work, we propose a single objective for jointly training the
model and the policy, such that updates to either component increase a lower
bound on expected return. This joint optimization mends the objective mismatch
in prior work. Our objective is a global lower bound on expected return, and
this bound becomes tight under certain assumptions. The resulting algorithm
(MnM) is conceptually similar to a GAN: a classifier distinguishes between real
and fake transitions, the model is updated to produce transitions that look
realistic, and the policy is updated to avoid states where the model
predictions are unrealistic.
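The GAN-like structure described in the abstract (a classifier over transitions, a model trained to look realistic, and a policy steered away from states where the model looks unrealistic) can be made concrete with a short sketch. The PyTorch snippet below is a minimal illustration assuming simple MLP networks and a log-ratio reward correction; the network sizes, losses, and the exact form of the correction are assumptions for illustration, not the paper's precise objective or lower bound.

```python
# Minimal sketch of the GAN-like training loop described in the abstract.
# Network shapes, losses, and the reward correction are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, hidden = 8, 2, 64

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

model = mlp(obs_dim + act_dim, obs_dim)         # predicts the next state
classifier = mlp(2 * obs_dim + act_dim, 1)      # real vs. model-generated transition
model_opt = torch.optim.Adam(model.parameters(), lr=3e-4)
clf_opt = torch.optim.Adam(classifier.parameters(), lr=3e-4)

def train_step(s, a, s_next_real):
    # 1) Classifier: distinguish real transitions from model transitions.
    s_next_fake = model(torch.cat([s, a], dim=-1))
    real_logit = classifier(torch.cat([s, a, s_next_real], dim=-1))
    fake_logit = classifier(torch.cat([s, a, s_next_fake.detach()], dim=-1))
    clf_loss = (F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
                + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit)))
    clf_opt.zero_grad(); clf_loss.backward(); clf_opt.step()

    # 2) Model: produce transitions that the classifier labels as realistic.
    fake_logit = classifier(torch.cat([s, a, model(torch.cat([s, a], dim=-1))], dim=-1))
    model_loss = F.binary_cross_entropy_with_logits(fake_logit, torch.ones_like(fake_logit))
    model_opt.zero_grad(); model_loss.backward(); model_opt.step()

    # 3) Policy: when training on model rollouts, add a correction so the policy
    #    avoids states where the classifier flags the model as unrealistic,
    #    e.g. r_tilde = r + log D(s, a, s') - log(1 - D(s, a, s')) (assumed form).
    with torch.no_grad():
        d = torch.sigmoid(classifier(torch.cat([s, a, s_next_fake], dim=-1)))
        reward_correction = torch.log(d + 1e-6) - torch.log(1 - d + 1e-6)
    return reward_correction  # fed into any actor-critic update on model-generated data
```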
Related papers
- Attribute-to-Delete: Machine Unlearning via Datamodel Matching [65.13151619119782]
Machine unlearning -- efficiently removing the effect of a small "forget set" of training data on a pre-trained machine learning model -- has recently attracted interest.
Recent research shows that machine unlearning techniques do not hold up in such a challenging setting.
arXiv Detail & Related papers (2024-10-30T17:20:10Z) - Predictable MDP Abstraction for Unsupervised Model-Based RL [93.91375268580806]
We propose predictable MDP abstraction (PMA).
Instead of training a predictive model on the original MDP, we train a model on a transformed MDP with a learned action space.
We theoretically analyze PMA and empirically demonstrate that PMA leads to significant improvements over prior unsupervised model-based RL approaches.
arXiv Detail & Related papers (2023-02-08T07:37:51Z) - Plan To Predict: Learning an Uncertainty-Foreseeing Model for
Model-Based Reinforcement Learning [32.24146877835396]
We propose Plan To Predict (P2P), a framework that treats the model rollout process as a sequential decision-making problem.
We show that P2P achieves state-of-the-art performance on several challenging benchmark tasks.
arXiv Detail & Related papers (2023-01-20T10:17:22Z) - When to Update Your Model: Constrained Model-based Reinforcement
Learning [50.74369835934703]
We propose a novel and general theoretical scheme that provides a non-decreasing performance guarantee for model-based RL (MBRL).
The derived bounds reveal the relationship between model shifts and performance improvement.
A further example demonstrates that learning models from a dynamically-varying number of explorations benefits the eventual returns.
arXiv Detail & Related papers (2022-10-15T17:57:43Z) - Simplifying Model-based RL: Learning Representations, Latent-space
Models, and Policies with One Objective [142.36200080384145]
We propose a single objective that jointly optimizes a latent-space model and policy to achieve high returns while remaining self-consistent.
We demonstrate that the resulting algorithm matches or improves the sample-efficiency of the best prior model-based and model-free RL methods.
arXiv Detail & Related papers (2022-09-18T03:51:58Z) - Value Gradient weighted Model-Based Reinforcement Learning [28.366157882991565]
Model-based reinforcement learning (MBRL) is a sample-efficient technique to obtain control policies.
VaGraM is a novel method for value-aware model learning.
arXiv Detail & Related papers (2022-04-04T13:28:31Z) - How to Learn when Data Gradually Reacts to Your Model [10.074466859579571]
We propose a new algorithm, Stateful Performative Gradient Descent (Stateful PerfGD), for minimizing the performative loss even in the presence of these effects.
Our experiments confirm that Stateful PerfGD substantially outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2021-12-13T22:05:26Z) - Model-based Policy Optimization with Unsupervised Model Adaptation [37.09948645461043]
We investigate how to bridge the gap between real and simulated data due to inaccurate model estimation for better policy optimization.
We propose a novel model-based reinforcement learning framework AMPO, which introduces unsupervised model adaptation.
Our approach achieves state-of-the-art performance in terms of sample efficiency on a range of continuous control benchmark tasks.
arXiv Detail & Related papers (2020-10-19T14:19:42Z) - Trust the Model When It Is Confident: Masked Model-based Actor-Critic [11.675078067322897]
Masked Model-based Actor-Critic (M2AC) is a novel policy optimization algorithm.
M2AC implements a masking mechanism based on the model's uncertainty to decide whether its prediction should be used or not.
arXiv Detail & Related papers (2020-10-10T03:39:56Z) - MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods by artificially penalizing rewards with the uncertainty of the learned dynamics (see the sketch after this list).
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
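As a companion to the MOPO entry above, here is a minimal sketch of an uncertainty-penalized reward. Using ensemble disagreement as the uncertainty estimate and a fixed penalty weight `lam` are assumptions made for illustration; the paper's exact penalty may differ.

```python
# Minimal sketch of an uncertainty-penalized reward in the style described in the
# MOPO entry. Ensemble disagreement as the uncertainty proxy is an assumption.
import numpy as np

def penalized_reward(reward, ensemble_next_state_preds, lam=1.0):
    """reward: scalar reward predicted by the model;
    ensemble_next_state_preds: array of shape (n_models, obs_dim)."""
    # Uncertainty = spread of the ensemble's next-state predictions.
    uncertainty = np.linalg.norm(np.std(ensemble_next_state_preds, axis=0))
    return reward - lam * uncertainty

# Example: high disagreement between ensemble members lowers the reward used for
# policy optimization, discouraging the policy from exploiting model errors.
preds = np.array([[0.1, 0.2], [0.5, -0.3], [0.0, 0.4]])
print(penalized_reward(1.0, preds))
```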