Conservative Bayesian Model-Based Value Expansion for Offline Policy
Optimization
- URL: http://arxiv.org/abs/2210.03802v1
- Date: Fri, 7 Oct 2022 20:13:50 GMT
- Title: Conservative Bayesian Model-Based Value Expansion for Offline Policy
Optimization
- Authors: Jihwan Jeong, Xiaoyu Wang, Michael Gimelfarb, Hyunwoo Kim, Baher
Abdulhai, Scott Sanner
- Abstract summary: offline reinforcement learning (RL) addresses the problem of learning a performant policy from a fixed batch of data collected by following some behavior policy.
Model-based approaches are particularly appealing since they can extract more learning signals from the logged dataset by learning a model of the environment.
- Score: 41.774837419584735
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline reinforcement learning (RL) addresses the problem of learning a
performant policy from a fixed batch of data collected by following some
behavior policy. Model-based approaches are particularly appealing in the
offline setting since they can extract more learning signals from the logged
dataset by learning a model of the environment. However, the performance of
existing model-based approaches falls short of model-free counterparts, due to
the compounding of estimation errors in the learned model. Driven by this
observation, we argue that it is critical for a model-based method to
understand when to trust the model and when to rely on model-free estimates,
and how to act conservatively w.r.t. both. To this end, we derive an elegant
and simple methodology called conservative Bayesian model-based value expansion
for offline policy optimization (CBOP), which trades off model-free and
model-based estimates during the policy evaluation step according to their
epistemic uncertainties, and facilitates conservatism by taking a lower bound
on the Bayesian posterior value estimate. On the standard D4RL continuous
control tasks, we find that our method significantly outperforms previous
model-based approaches: e.g., MOPO by $116.4\%$, MOReL by $23.2\%$, and COMBO by
$23.7\%$. Further, CBOP achieves state-of-the-art performance on $11$ out of
$18$ benchmark datasets while performing on par on the remaining datasets.
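As a rough illustration of the mechanism described in the abstract, the sketch below pools per-horizon value-expansion targets by inverse-variance weighting and then takes a lower confidence bound of the pooled estimate. It is a minimal sketch under assumed Gaussian pooling; the function name cbop_target, the lcb_coeff parameter, and the toy data are illustrative, not the paper's exact procedure.

```python
import numpy as np

def cbop_target(h_step_returns, lcb_coeff=1.0):
    """Pool per-horizon value-expansion targets into one conservative target.

    h_step_returns: array of shape (H + 1, E). Row h holds E sampled h-step
    targets (e.g., from an ensemble of dynamics models and Q-functions);
    h = 0 is the purely model-free bootstrap, larger h relies more on the model.
    """
    means = h_step_returns.mean(axis=1)              # per-horizon mean target
    variances = h_step_returns.var(axis=1) + 1e-8    # per-horizon epistemic variance
    precisions = 1.0 / variances

    # Inverse-variance (precision-weighted) pooling: horizons with lower
    # epistemic uncertainty contribute more to the pooled mean.
    posterior_mean = np.sum(precisions * means) / np.sum(precisions)
    posterior_var = 1.0 / np.sum(precisions)

    # Conservatism: use a lower confidence bound of the pooled estimate.
    return posterior_mean - lcb_coeff * np.sqrt(posterior_var)

# Toy usage: 4 horizons, 7 ensemble samples each; longer horizons are noisier.
rng = np.random.default_rng(0)
targets = rng.normal(loc=10.0, scale=[[0.1], [0.5], [1.0], [2.0]], size=(4, 7))
print(cbop_target(targets))
```

Horizons whose sampled targets agree closely dominate the pooled mean, so the target leans on the learned model only where it appears trustworthy, mirroring the model-free/model-based trade-off the abstract describes.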
Related papers
- Constrained Latent Action Policies for Model-Based Offline Reinforcement Learning [5.012314384895537]
In offline reinforcement learning, a policy is learned using a static dataset in the absence of costly feedback from the environment.
We propose Constrained Latent Action Policies (C-LAP) which learns a generative model of the joint distribution of observations and actions.
arXiv Detail & Related papers (2024-11-07T09:35:22Z)
- COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL [50.385005413810084]
Dyna-style model-based reinforcement learning contains two phases: model rollouts to generate samples for policy learning, and real-environment exploration.
COPlanner is a planning-driven framework for model-based methods that addresses the problem of an inaccurately learned dynamics model.
arXiv Detail & Related papers (2023-10-11T06:10:07Z)
- DOMAIN: MilDly COnservative Model-BAsed OfflINe Reinforcement Learning [14.952800864366512]
Conservatism should be incorporated into the algorithm to balance accurate offline data and imprecise model data.
This paper proposes a milDly cOnservative Model-bAsed offlINe RL algorithm (DOMAIN) without estimating model uncertainty.
The results of extensive experiments show that DOMAIN outperforms prior RL algorithms on the D4RL dataset benchmark.
arXiv Detail & Related papers (2023-09-16T08:39:28Z)
- Model-based Reinforcement Learning with Multi-step Plan Value Estimation [4.158979444110977]
We introduce multi-step plans to replace multi-step actions for model-based RL.
The new model-based reinforcement learning algorithm, MPPVE, makes better use of the learned model and achieves better sample efficiency than state-of-the-art model-based RL approaches.
arXiv Detail & Related papers (2022-09-12T18:22:11Z)
- RAMBO-RL: Robust Adversarial Model-Based Offline Reinforcement Learning [11.183124892686239]
We present Robust Adversarial Model-Based Offline RL (RAMBO), a novel approach to model-based offline RL.
To achieve conservatism, we formulate the problem as a two-player zero-sum game against an adversarial environment model.
We evaluate our approach on widely studied offline RL benchmarks and demonstrate that it achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-04-26T20:42:14Z)
- COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions.
We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z)
- Generative Temporal Difference Learning for Infinite-Horizon Prediction [101.59882753763888]
We introduce the $\gamma$-model, a predictive model of environment dynamics with an infinite probabilistic horizon.
We discuss how its training reflects an inescapable tradeoff between training-time and testing-time compounding errors.
arXiv Detail & Related papers (2020-10-27T17:54:12Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods by applying them to rewards that are artificially penalized by the uncertainty of the dynamics (a toy sketch of such a penalty appears after this list).
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
- Model-Augmented Actor-Critic: Backpropagating through Paths [81.86992776864729]
Current model-based reinforcement learning approaches use the model simply as a learned black-box simulator.
We show how to make more effective use of the model by exploiting its differentiability.
arXiv Detail & Related papers (2020-05-16T19:18:10Z)
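For the MOPO entry above, here is a minimal sketch of an uncertainty-penalized reward, assuming ensemble disagreement as the uncertainty proxy; MOPO itself penalizes with a bound on the model error, so the names penalized_reward and penalty_coeff and the disagreement measure here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def penalized_reward(reward, next_state_preds, penalty_coeff=1.0):
    """Uncertainty-penalized reward: r(s, a) - lambda * u(s, a).

    next_state_preds: array of shape (E, S) with next-state predictions from an
    ensemble of E dynamics models; their disagreement serves as a crude proxy
    for the epistemic uncertainty u(s, a) of the learned dynamics.
    """
    # Largest deviation of any ensemble member from the ensemble mean.
    disagreement = np.linalg.norm(
        next_state_preds - next_state_preds.mean(axis=0), axis=1
    ).max()
    return reward - penalty_coeff * disagreement

# Toy usage: 5 ensemble members, 3-dimensional state.
rng = np.random.default_rng(1)
preds = rng.normal(size=(5, 3))
print(penalized_reward(1.0, preds))
```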
This list is automatically generated from the titles and abstracts of the papers on this site.