Model-Based Offline Reinforcement Learning with Pessimism-Modulated
Dynamics Belief
- URL: http://arxiv.org/abs/2210.06692v1
- Date: Thu, 13 Oct 2022 03:14:36 GMT
- Title: Model-Based Offline Reinforcement Learning with Pessimism-Modulated
Dynamics Belief
- Authors: Kaiyang Guo and Yunfeng Shao and Yanhui Geng
- Abstract summary: Model-based offline reinforcement learning (RL) aims to find a highly rewarding policy by leveraging a previously collected static dataset and a dynamics model.
In this work, we maintain a belief distribution over dynamics and evaluate/optimize the policy through biased sampling from the belief.
We show that the biased sampling naturally induces an updated dynamics belief with a policy-dependent reweighting factor, termed Pessimism-Modulated Dynamics Belief.
- Score: 3.0036519884678894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model-based offline reinforcement learning (RL) aims to find a highly
rewarding policy by leveraging a previously collected static dataset and a
dynamics model. Although the dynamics model is learned by reusing the static
dataset, its generalization ability can promote policy learning if properly
utilized. To that end, several works propose to quantify the uncertainty of the
predicted dynamics and explicitly apply it to penalize the reward. However, as
the dynamics and the reward are intrinsically different factors in the context
of an MDP, characterizing the impact of dynamics uncertainty through a reward
penalty may incur an unexpected tradeoff between model utilization and risk
avoidance. In this work, we instead maintain a belief distribution over
dynamics and evaluate/optimize the policy through biased sampling from the
belief. The sampling procedure, biased towards pessimism, is derived from an
alternating Markov game formulation of offline RL. We formally show that the
biased sampling naturally induces an updated dynamics belief with a
policy-dependent reweighting factor, termed Pessimism-Modulated Dynamics
Belief. To improve the policy, we devise an iterative regularized policy
optimization algorithm for the game, with a guarantee of monotonic improvement
under certain conditions. To make it practical, we further devise an offline RL
algorithm to approximately find the solution. Empirical results show that the
proposed approach achieves state-of-the-art performance on a wide range of
benchmark tasks.
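The following is a minimal Python sketch of the core idea described in the abstract: the policy is evaluated against dynamics sampled from a belief distribution, and pessimism enters through how those samples are aggregated rather than through a reward penalty. It assumes the belief is approximated by a small set of sampled dynamics models and that the pessimistic bias is realized as a low order statistic over candidate next-state values; the function names, the ensemble-style approximation, and the hyperparameters (number of samples, order index k, discount gamma) are illustrative assumptions, not the authors' exact construction.

import numpy as np

# Sketch only (not the paper's exact algorithm): the dynamics belief is
# approximated by a handful of sampled transition functions, and pessimism is
# injected by backing up the k-th smallest value among candidate next states
# drawn from the belief. N, k, and gamma are illustrative choices.

def pessimistic_backup(value_fn, belief_samples, state, action, reward,
                       gamma=0.99, k=2):
    """One pessimism-modulated Bellman backup.

    belief_samples: list of callables s' = T_i(s, a), each a dynamics model
                    sampled from the belief distribution.
    k:              order-statistic index; smaller k means more pessimism.
    """
    # Sample candidate next states from the belief over dynamics.
    next_states = [T(state, action) for T in belief_samples]
    # Evaluate the current value estimate at each candidate next state.
    next_values = np.array([value_fn(s_next) for s_next in next_states])
    # Bias towards pessimism: keep the k-th smallest value instead of the
    # belief average. This implicitly reweights the belief towards dynamics
    # that are unfavorable for the current policy.
    pessimistic_value = np.sort(next_values)[k - 1]
    return reward + gamma * pessimistic_value


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy 1-D example: each sampled "dynamics model" perturbs the state differently.
    belief = [lambda s, a, w=w: s + a + w for w in rng.normal(0.0, 0.1, size=5)]
    value_fn = lambda s: -abs(s)  # toy value estimate
    target = pessimistic_backup(value_fn, belief, state=0.5, action=-0.3, reward=1.0)
    print(f"pessimism-modulated target: {target:.3f}")

In this sketch, varying k trades model utilization against risk avoidance without modifying the reward itself, which is the distinction the abstract draws against reward-penalty approaches.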
Related papers
- A Bayesian Approach to Robust Inverse Reinforcement Learning [54.24816623644148]
We consider a Bayesian approach to offline model-based inverse reinforcement learning (IRL).
The proposed framework differs from existing offline model-based IRL approaches by performing simultaneous estimation of the expert's reward function and subjective model of environment dynamics.
Our analysis reveals a novel insight that the estimated policy exhibits robust performance when the expert is believed to have a highly accurate model of the environment.
arXiv Detail & Related papers (2023-09-15T17:37:09Z) - Statistically Efficient Variance Reduction with Double Policy Estimation
for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method on multiple OpenAI Gym tasks from the D4RL benchmark.
arXiv Detail & Related papers (2023-08-28T20:46:07Z) - When Demonstrations Meet Generative World Models: A Maximum Likelihood
Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have uses in safety-sensitive applications such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z) - Model-based Offline Reinforcement Learning with Local Misspecification [35.75701143290119]
We present a model-based offline reinforcement learning policy performance lower bound that explicitly captures dynamics model misspecification and distribution mismatch.
We propose an empirical algorithm for optimal offline policy selection.
arXiv Detail & Related papers (2023-01-26T21:26:56Z) - When to Update Your Model: Constrained Model-based Reinforcement
Learning [50.74369835934703]
We propose a novel and general theoretical scheme for a non-decreasing performance guarantee of model-based RL (MBRL).
The derived bounds reveal the relationship between model shifts and performance improvement.
A further example demonstrates that learning models from a dynamically varying number of explorations benefits the eventual returns.
arXiv Detail & Related papers (2022-10-15T17:57:43Z) - Model Generation with Provable Coverability for Offline Reinforcement
Learning [14.333861814143718]
Offline optimization with a dynamics-aware policy provides a new perspective on policy learning and out-of-distribution generalization.
However, due to the limitations of the offline setting, the learned model may not mimic the real dynamics well enough to support reliable out-of-distribution exploration.
We propose an algorithm that generates models optimized for coverage of the real dynamics.
arXiv Detail & Related papers (2022-06-01T08:34:09Z) - DROMO: Distributionally Robust Offline Model-based Policy Optimization [0.0]
We consider the problem of offline reinforcement learning with model-based control.
We propose distributionally robust offline model-based policy optimization (DROMO).
arXiv Detail & Related papers (2021-09-15T13:25:14Z) - Autoregressive Dynamics Models for Offline Policy Evaluation and
Optimization [60.73540999409032]
We show that expressive autoregressive dynamics models generate each dimension of the next state and reward sequentially, conditioned on the previous dimensions.
We also show that autoregressive dynamics models are useful for offline policy optimization by serving as a way to enrich the replay buffer.
arXiv Detail & Related papers (2021-04-28T16:48:44Z) - Foresee then Evaluate: Decomposing Value Estimation with Latent Future
Prediction [37.06232589005015]
The value function is the central notion of reinforcement learning (RL).
We propose Value Decomposition with Future Prediction (VDFP).
We analytically decompose the value function into a latent future dynamics part and a policy-independent trajectory return part, inducing a way to model latent dynamics and returns separately in value estimation.
arXiv Detail & Related papers (2021-03-03T07:28:56Z) - COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions.
We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z)