Plan To Predict: Learning an Uncertainty-Foreseeing Model for
Model-Based Reinforcement Learning
- URL: http://arxiv.org/abs/2301.08502v1
- Date: Fri, 20 Jan 2023 10:17:22 GMT
- Title: Plan To Predict: Learning an Uncertainty-Foreseeing Model for
Model-Based Reinforcement Learning
- Authors: Zifan Wu, Chao Yu, Chen Chen, Jianye Hao, Hankz Hankui Zhuo
- Abstract summary: We propose Plan To Predict (P2P), a framework that treats the model rollout process as a sequential decision making problem.
We show that P2P achieves state-of-the-art performance on several challenging benchmark tasks.
- Score: 32.24146877835396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In Model-based Reinforcement Learning (MBRL), model learning is critical
since an inaccurate model can bias policy learning via generating misleading
samples. However, learning an accurate model can be difficult since the policy
is continually updated and the induced distribution over visited states used
for model learning shifts accordingly. Prior methods alleviate this issue by
quantifying the uncertainty of model-generated samples. However, these methods
only quantify the uncertainty passively after the samples have been generated,
rather than foreseeing the uncertainty before model trajectories fall into
those highly uncertain regions. The resulting low-quality samples can induce
unstable learning targets and hinder the optimization of the policy. Moreover,
while trained to minimize one-step prediction errors, the model is
generally used to predict for multiple steps, leading to a mismatch between the
objectives of model learning and model usage. To this end, we propose
Plan To Predict (P2P), an MBRL framework that treats the model rollout
process as a sequential decision making problem by reversing the usual roles: the
model acts as the decision maker and the current policy serves as the dynamics. In this way,
the model can quickly adapt to the current policy and foresee the multi-step
future uncertainty when generating trajectories. Theoretically, we show that
the performance of P2P can be guaranteed by approximately optimizing a lower
bound of the true environment return. Empirical results demonstrate that P2P
achieves state-of-the-art performance on several challenging benchmark tasks.
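To make the idea concrete, below is a minimal sketch of how an uncertainty-foreseeing rollout objective of this kind could look, assuming a probabilistic ensemble dynamics model whose head disagreement serves as the uncertainty signal. The names (EnsembleModel, foreseen_uncertainty_loss, beta, horizon) are hypothetical; this is not the authors' exact P2P algorithm, only an illustration of penalizing multi-step uncertainty along rollouts generated under the current policy.

```python
# Illustrative sketch only: the model's training signal penalizes the cumulative
# ensemble disagreement it expects to encounter over the next `horizon` steps of
# a rollout under the current policy, instead of only one-step prediction error.
import torch
import torch.nn as nn

class EnsembleModel(nn.Module):
    """Hypothetical probabilistic ensemble; each head predicts the next state."""
    def __init__(self, state_dim, action_dim, n_heads=5, hidden=64):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, state_dim))
            for _ in range(n_heads)
        ])

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        preds = torch.stack([head(x) for head in self.heads])  # [heads, batch, state_dim]
        return preds.mean(0), preds.var(0).sum(-1)             # mean prediction, disagreement

def foreseen_uncertainty_loss(model, policy, start_states, horizon=5, beta=1.0):
    """Accumulated uncertainty the model will face along an imagined policy rollout.
    In practice this term would be added to the usual one-step supervised model loss."""
    state, loss = start_states, 0.0
    for _ in range(horizon):
        action = policy(state)          # the current policy plays the role of the "dynamics"
        next_state, disagreement = model(state, action)
        loss = loss + beta * disagreement.mean()
        state = next_state
    return loss
```

In a full MBRL loop this penalty would be combined with a maximum-likelihood fit on real transitions, and the policy would typically be held fixed while the model is updated; the precise objective, its optimization, and the accompanying lower-bound guarantee are given in the paper itself.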
Related papers
- COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically
for Model-Based RL [50.385005413810084]
Dyna-style model-based reinforcement learning contains two phases: model rollouts to generate samples for policy learning, and real-environment exploration (a minimal sketch of this two-phase loop appears after this list).
COPlanner is a planning-driven framework for model-based methods that addresses the problem of an inaccurately learned dynamics model.
arXiv Detail & Related papers (2023-10-11T06:10:07Z)
- Predictable MDP Abstraction for Unsupervised Model-Based RL [93.91375268580806]
We propose predictable MDP abstraction (PMA).
Instead of training a predictive model on the original MDP, we train a model on a transformed MDP with a learned action space.
We theoretically analyze PMA and empirically demonstrate that PMA leads to significant improvements over prior unsupervised model-based RL approaches.
arXiv Detail & Related papers (2023-02-08T07:37:51Z)
- When to Update Your Model: Constrained Model-based Reinforcement Learning [50.74369835934703]
We propose a novel and general theoretical scheme for a non-decreasing performance guarantee of model-based RL (MBRL).
The derived bounds reveal the relationship between model shifts and performance improvement.
A further example demonstrates that learning models from a dynamically varying number of explorations benefits the eventual returns.
arXiv Detail & Related papers (2022-10-15T17:57:43Z)
- Value Gradient weighted Model-Based Reinforcement Learning [28.366157882991565]
Model-based reinforcement learning (MBRL) is a sample efficient technique to obtain control policies.
VaGraM is a novel method for value-aware model learning.
arXiv Detail & Related papers (2022-04-04T13:28:31Z)
- Revisiting Design Choices in Model-Based Offline Reinforcement Learning [39.01805509055988]
Offline reinforcement learning enables agents to leverage large pre-collected datasets of environment transitions to learn control policies.
This paper compares existing design choices and devises novel protocols to investigate their interaction with other hyperparameters, such as the number of models or the imaginary rollout horizon.
arXiv Detail & Related papers (2021-10-08T13:51:34Z)
- Mismatched No More: Joint Model-Policy Optimization for Model-Based RL [172.37829823752364]
We propose a single objective for jointly training the model and the policy, such that updates to either component increase a lower bound on expected return.
Our objective is a global lower bound on expected return, and this bound becomes tight under certain assumptions.
The resulting algorithm (MnM) is conceptually similar to a GAN.
arXiv Detail & Related papers (2021-10-06T13:43:27Z)
- COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions.
We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z)
- Model-based Policy Optimization with Unsupervised Model Adaptation [37.09948645461043]
We investigate how to bridge the gap between real and simulated data due to inaccurate model estimation for better policy optimization.
We propose a novel model-based reinforcement learning framework AMPO, which introduces unsupervised model adaptation.
Our approach achieves state-of-the-art performance in terms of sample efficiency on a range of continuous control benchmark tasks.
arXiv Detail & Related papers (2020-10-19T14:19:42Z)
- Bidirectional Model-based Policy Optimization [30.732572976324516]
Model-based reinforcement learning approaches leverage a forward dynamics model to support planning and decision making.
In this paper, we propose to additionally construct a backward dynamics model to reduce the reliance on accuracy in forward model predictions.
We develop a novel method, called Bidirectional Model-based Policy Optimization (BMPO), that utilizes both the forward model and the backward model to generate short branched rollouts for policy optimization.
arXiv Detail & Related papers (2020-07-04T03:34:09Z)
- Model-Augmented Actor-Critic: Backpropagating through Paths [81.86992776864729]
Current model-based reinforcement learning approaches use the model simply as a learned black-box simulator.
We show how to make more effective use of the model by exploiting its differentiability.
arXiv Detail & Related papers (2020-05-16T19:18:10Z)
- Bootstrapped model learning and error correction for planning with uncertainty in model-based RL [1.370633147306388]
A natural aim is to learn a model that accurately reflects the dynamics of the environment.
This paper explores the problem of model misspecification through uncertainty-aware reinforcement learning agents.
We propose a bootstrapped multi-headed neural network that learns the distribution of future states and rewards.
arXiv Detail & Related papers (2020-04-15T15:41:21Z)
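As referenced in the COPlanner entry above, here is a schematic sketch of the generic Dyna-style two-phase loop (real-environment exploration, then model rollouts that generate samples for policy learning). All interfaces below (env, model, policy, the replay buffers and their methods) are hypothetical placeholders rather than any specific paper's API.

```python
# Schematic Dyna-style MBRL loop (illustrative placeholders, not a specific library API):
# alternate between (1) exploring the real environment, (2) fitting the dynamics model,
# and (3) generating short branched model rollouts to train the policy.
def dyna_style_training(env, model, policy, real_buffer, model_buffer,
                        n_iterations=100, env_steps=1000, rollout_horizon=5):
    for _ in range(n_iterations):
        # Phase 1: real-environment exploration with the current policy.
        state = env.reset()
        for _ in range(env_steps):
            action = policy.act(state)
            next_state, reward, done = env.step(action)
            real_buffer.add(state, action, reward, next_state, done)
            state = env.reset() if done else next_state

        # Fit the dynamics model to all real data collected so far.
        model.fit(real_buffer.sample_all())

        # Phase 2: short model rollouts branched from real states produce
        # imagined samples for policy learning.
        for start_state in real_buffer.sample_states(batch_size=256):
            s = start_state
            for _ in range(rollout_horizon):
                a = policy.act(s)
                s_next, r = model.predict(s, a)
                model_buffer.add(s, a, r, s_next, done=False)
                s = s_next

        # Update the policy, mostly on model-generated data.
        policy.update(model_buffer.sample(batch_size=256))
```

Per its title, COPlanner's contribution is to make the model rollouts (phase 2) conservative while keeping real-environment exploration (phase 1) optimistic; the skeleton above only shows the vanilla two-phase structure.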