Model-based Offline Reinforcement Learning with Local Misspecification
- URL: http://arxiv.org/abs/2301.11426v1
- Date: Thu, 26 Jan 2023 21:26:56 GMT
- Title: Model-based Offline Reinforcement Learning with Local Misspecification
- Authors: Kefan Dong, Yannis Flet-Berliac, Allen Nie, Emma Brunskill
- Abstract summary: We present a model-based offline reinforcement learning policy performance lower bound that explicitly captures dynamics model misspecification and distribution mismatch.
We propose an empirical algorithm for optimal offline policy selection.
- Score: 35.75701143290119
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present a model-based offline reinforcement learning policy performance
lower bound that explicitly captures dynamics model misspecification and
distribution mismatch and we propose an empirical algorithm for optimal offline
policy selection. Theoretically, we prove a novel safe policy improvement
theorem by establishing pessimism approximations to the value function. Our key
insight is to jointly consider selecting over dynamics models and policies: as
long as a dynamics model can accurately represent the dynamics of the
state-action pairs visited by a given policy, it is possible to approximate the
value of that particular policy. We analyze our lower bound in the LQR setting
and also show competitive performance to previous lower bounds on policy
selection across a set of D4RL tasks.
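The joint selection over dynamics models and policies described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the function name `select_policy`, the dictionary inputs, and the penalty terms are all hypothetical. It assumes that each (policy, model) pair already has a model-estimated value and a nonnegative penalty for how poorly that model fits the state-action pairs the policy visits. Each policy is scored by its best penalized estimate over models, and the policy with the highest score is chosen.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]  # (policy name, model name); hypothetical identifiers


def select_policy(
    model_values: Dict[Pair, float],
    penalties: Dict[Pair, float],
) -> Tuple[str, str, float]:
    """Return (policy, model, bound) with the largest pessimistic bound.

    model_values[(policy, model)]: value of the policy estimated under the model.
    penalties[(policy, model)]: assumed nonnegative penalty for model error
    on the state-action pairs visited by the policy.
    """
    best_per_policy: Dict[str, Tuple[str, float]] = {}
    for (policy, model), value in model_values.items():
        bound = value - penalties[(policy, model)]
        # For each policy, keep the model that gives the tightest (largest) bound.
        if policy not in best_per_policy or bound > best_per_policy[policy][1]:
            best_per_policy[policy] = (model, bound)
    # Choose the policy whose best bound is highest.
    chosen = max(best_per_policy, key=lambda p: best_per_policy[p][1])
    model, bound = best_per_policy[chosen]
    return chosen, model, bound


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    values = {("pi_a", "m1"): 10.0, ("pi_a", "m2"): 8.0,
              ("pi_b", "m1"): 12.0, ("pi_b", "m2"): 9.0}
    pens = {("pi_a", "m1"): 6.0, ("pi_a", "m2"): 1.0,
            ("pi_b", "m1"): 9.0, ("pi_b", "m2"): 3.0}
    print(select_policy(values, pens))
```

In this toy example the policy with the highest raw model estimate is not selected, because the model that rates it highly is heavily penalized on the states that policy visits.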
Related papers
- SAMBO-RL: Shifts-aware Model-based Offline Reinforcement Learning [9.88109749688605]
Model-based Offline Reinforcement Learning trains policies based on offline datasets and model dynamics.
This paper disentangles the problem into two key components: model bias and policy shift.
We introduce Shifts-aware Model-based Offline Reinforcement Learning (SAMBO-RL).
arXiv Detail & Related papers (2024-08-23T04:25:09Z)
- Probabilistic Reach-Avoid for Bayesian Neural Networks [71.67052234622781]
We show that an optimal synthesis algorithm can provide more than a four-fold increase in the number of certifiable states.
The algorithm is able to provide more than a three-fold increase in the average guaranteed reach-avoid probability.
arXiv Detail & Related papers (2023-10-03T10:52:21Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Hallucinated Adversarial Control for Conservative Offline Policy Evaluation [64.94009515033984]
We study the problem of conservative off-policy evaluation (COPE) where given an offline dataset of environment interactions, we seek to obtain a (tight) lower bound on a policy's performance.
We introduce HAMBO, which builds on an uncertainty-aware learned model of the transition dynamics.
We prove that the resulting COPE estimates are valid lower bounds, and, under regularity conditions, show their convergence to the true expected return.
arXiv Detail & Related papers (2023-03-02T08:57:35Z)
- When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
- Model-Based Offline Reinforcement Learning with Pessimism-Modulated Dynamics Belief [3.0036519884678894]
Model-based offline reinforcement learning (RL) aims to find a highly rewarding policy by leveraging a previously collected static dataset and a dynamics model.
In this work, we maintain a belief distribution over dynamics, and evaluate/optimize the policy through biased sampling from the belief.
We show that the biased sampling naturally induces an updated dynamics belief with a policy-dependent reweighting factor, termed Pessimism-Modulated Dynamics Belief.
arXiv Detail & Related papers (2022-10-13T03:14:36Z)
- Model Generation with Provable Coverability for Offline Reinforcement Learning [14.333861814143718]
Offline optimization with a dynamics-aware policy provides a new perspective for policy learning and out-of-distribution generalization.
However, owing to the limitations of the offline setting, the learned model may not mimic the real dynamics well enough to support reliable out-of-distribution exploration.
We propose an algorithm to generate models optimizing their coverage for the real dynamics.
arXiv Detail & Related papers (2022-06-01T08:34:09Z)
- Model-Based Offline Meta-Reinforcement Learning with Regularization [63.35040401948943]
Offline Meta-RL is emerging as a promising approach to address these challenges.
MerPO learns a meta-model for efficient task structure inference and an informative meta-policy.
We show that MerPO offers guaranteed improvement over both the behavior policy and the meta-policy.
arXiv Detail & Related papers (2022-02-07T04:15:20Z)
- DROMO: Distributionally Robust Offline Model-based Policy Optimization [0.0]
We consider the problem of offline reinforcement learning with model-based control.
We propose distributionally robust offline model-based policy optimization (DROMO).
arXiv Detail & Related papers (2021-09-15T13:25:14Z)
- Minimax Model Learning [42.65032356835701]
We present a novel off-policy loss function for learning a transition model in model-based reinforcement learning.
Our loss is derived from the off-policy policy evaluation objective with an emphasis on correcting distribution shift.
arXiv Detail & Related papers (2021-03-02T23:16:36Z)
- COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions.
We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z)