VIPO: Value Function Inconsistency Penalized Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2504.11944v1
- Date: Wed, 16 Apr 2025 10:23:44 GMT
- Title: VIPO: Value Function Inconsistency Penalized Offline Reinforcement Learning
- Authors: Xuyang Chen, Guojian Wang, Keyu Yan, Lin Zhao
- Abstract summary: VIPO is a novel model-based offline RL algorithm that incorporates self-supervised feedback from value estimation to enhance model training. VIPO achieves state-of-the-art performance on almost all tasks in both the D4RL and NeoRL benchmarks.
- Score: 10.834719840008308
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline reinforcement learning (RL) learns effective policies from pre-collected datasets, offering a practical solution for applications where online interactions are risky or costly. Model-based approaches are particularly advantageous for offline RL, owing to their data efficiency and generalizability. However, due to inherent model errors, model-based methods often artificially introduce conservatism guided by heuristic uncertainty estimation, which can be unreliable. In this paper, we introduce VIPO, a novel model-based offline RL algorithm that incorporates self-supervised feedback from value estimation to enhance model training. Specifically, the model is learned by additionally minimizing the inconsistency between the value learned directly from the offline data and the one estimated from the model. We perform comprehensive evaluations from multiple perspectives to show that VIPO can learn a highly accurate model efficiently and consistently outperform existing methods. It offers a general framework that can be readily integrated into existing model-based offline RL algorithms to systematically enhance model accuracy. As a result, VIPO achieves state-of-the-art performance on almost all tasks in both D4RL and NeoRL benchmarks.
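The inconsistency term lends itself to a compact sketch. Below is a minimal, hypothetical rendering (not the authors' code): the dynamics model is trained with the usual regression loss plus a penalty on the gap between a Q-function fit to the offline data and a one-step value estimate rolled through the model. The networks, the coefficient `beta`, and the one-step estimate are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in networks (dimensions are illustrative assumptions,
# not the paper's architecture).
STATE_DIM, ACTION_DIM = 4, 2
dynamics = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                         nn.Linear(64, STATE_DIM))
q_data = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))   # Q-function fit to the offline data
v_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, 1))    # state-value network

def vipo_style_model_loss(s, a, s_next, r, gamma=0.99, beta=1.0):
    """Standard dynamics fit plus a penalty on the inconsistency between
    the data-fit value and the value implied by the learned model."""
    sa = torch.cat([s, a], dim=-1)
    s_pred = dynamics(sa)
    fit_loss = ((s_pred - s_next) ** 2).mean()     # usual MLE-style term
    # One-step model-based value estimate; gradients flow into `dynamics`.
    v_model = r + gamma * v_net(s_pred).squeeze(-1)
    v_data = q_data(sa).squeeze(-1).detach()       # fixed target from data
    inconsistency = ((v_data - v_model) ** 2).mean()
    return fit_loss + beta * inconsistency

# Toy batch showing the loss is differentiable end to end.
s, a = torch.randn(32, STATE_DIM), torch.randn(32, ACTION_DIM)
s_next, r = torch.randn(32, STATE_DIM), torch.randn(32)
vipo_style_model_loss(s, a, s_next, r).backward()
```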
Related papers
- Model-based Offline Reinforcement Learning with Lower Expectile Q-Learning [6.345851712811528]
We introduce a novel model-based offline RL method, Lower Expectile Q-learning (LEQ). LEQ provides a low-bias model-based value estimation via lower expectile regression of $\lambda$-returns. Our studies demonstrate that lower expectile regression, $\lambda$-returns, and critic training on offline data are all crucial for LEQ.
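As a rough illustration of the two ingredients named above, here is a generic sketch (not the LEQ implementation; `tau`, `gamma`, and `lam` are illustrative defaults) of a lower-expectile loss and $\lambda$-return targets:

```python
import torch

def expectile_loss(pred, target, tau=0.1):
    """Lower-expectile regression (tau < 0.5): errors where the prediction
    exceeds the target are weighted more heavily, so the fit tracks a
    pessimistic (lower) statistic of the targets."""
    u = target - pred
    weight = torch.where(u > 0, torch.full_like(u, tau),
                         torch.full_like(u, 1.0 - tau))
    return (weight * u ** 2).mean()

def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """TD(lambda)-style targets over an imagined rollout.
    rewards: [T], values: [T+1] bootstrap values."""
    g = values[-1]
    out = []
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * g)
        out.append(g)
    return torch.stack(out[::-1])

r, v = torch.randn(5), torch.randn(6)
targets = lambda_returns(r, v)            # shape [5]
loss = expectile_loss(torch.randn(5), targets)
```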
arXiv Detail & Related papers (2024-06-30T13:44:59Z)
- SeMOPO: Learning High-quality Model and Policy from Low-quality Offline Visual Datasets [32.496818080222646]
We propose a new approach to model-based offline reinforcement learning.
We provide theoretical guarantees on the model uncertainty and a performance bound for SeMOPO.
Experimental results show that our method substantially outperforms all baseline methods.
arXiv Detail & Related papers (2024-06-13T15:16:38Z)
- Deep autoregressive density nets vs neural ensembles for model-based offline reinforcement learning [2.9158689853305693]
We consider a model-based reinforcement learning algorithm that infers the system dynamics from the available data and performs policy optimization on imaginary model rollouts.
This approach is vulnerable to the policy exploiting model errors, which can lead to catastrophic failures on the real system.
We show that better performance can be obtained with a single well-calibrated autoregressive model on the D4RL benchmark.
arXiv Detail & Related papers (2024-02-05T10:18:15Z)
- MOTO: Offline Pre-training to Online Fine-tuning for Model-based Robot Learning [52.101643259906915]
We study the problem of offline pre-training and online fine-tuning for reinforcement learning from high-dimensional observations.
Existing model-based offline RL methods are not suitable for offline-to-online fine-tuning in high-dimensional domains.
We propose an on-policy model-based method that can efficiently reuse prior data through model-based value expansion and policy regularization.
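Model-based value expansion admits a short generic sketch. The callables `policy`, `model`, and `value_fn` below are stand-in assumptions, not MOTO's components:

```python
import torch

def value_expansion_target(s, policy, model, value_fn, horizon=3, gamma=0.99):
    """H-step model-based value expansion: roll the learned model forward
    under the current policy and bootstrap with the critic.
    Assumed interfaces: a = policy(s); s, r = model(s, a); v = value_fn(s)."""
    target, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        s, r = model(s, a)
        target = target + discount * r
        discount = discount * gamma
    return target + discount * value_fn(s)

# Toy usage with stand-in callables (illustrative assumptions only).
policy = lambda s: -0.1 * s
model = lambda s, a: (s + a, -(s ** 2).sum(-1))
value_fn = lambda s: -(s ** 2).sum(-1)
tgt = value_expansion_target(torch.randn(8, 4), policy, model, value_fn)
```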
arXiv Detail & Related papers (2024-01-06T21:04:31Z)
- A Unified Framework for Alternating Offline Model Training and Policy Learning [62.19209005400561]
In offline model-based reinforcement learning, we learn a dynamics model from historically collected data and use the learned model together with the fixed dataset for policy learning.
We develop an iterative offline MBRL framework, where we maximize a lower bound of the true expected return.
With the proposed unified model-policy learning framework, we achieve competitive performance on a wide range of continuous-control offline reinforcement learning datasets.
arXiv Detail & Related papers (2022-10-12T04:58:51Z)
- Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective [142.36200080384145]
We propose a single objective that jointly optimizes a latent-space model and policy to achieve high returns while remaining self-consistent.
We demonstrate that the resulting algorithm matches or improves the sample-efficiency of the best prior model-based and model-free RL methods.
arXiv Detail & Related papers (2022-09-18T03:51:58Z)
- Double Check Your State Before Trusting It: Confidence-Aware Bidirectional Offline Model-Based Imagination [31.805991958408438]
We propose to augment the offline dataset using trained bidirectional dynamics models and rollout policies with a double-check mechanism.
Our method, confidence-aware bidirectional offline model-based imagination, generates reliable samples and can be combined with any model-free offline RL method.
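One plausible reading of the double-check idea, sketched under stated assumptions (the forward/backward models and the tolerance `tol` are hypothetical, not the paper's exact criterion), is to keep only imagined transitions whose forward prediction can be reconstructed by the backward model:

```python
import torch

def double_check_filter(s, a, forward_model, backward_model, tol=0.1):
    """Keep imagined transitions only when the two directions agree:
    predict s' from (s, a), reconstruct s from (s', a), and accept the
    sample if the round trip lands close to the original state."""
    s_next = forward_model(s, a)
    s_recon = backward_model(s_next, a)
    err = ((s_recon - s) ** 2).mean(dim=-1)   # per-sample round-trip error
    keep = err < tol
    return s[keep], a[keep], s_next[keep]
```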
arXiv Detail & Related papers (2022-06-16T08:00:44Z)
- COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions.
We find that COMBO consistently performs as well as or better than prior model-free and model-based offline methods.
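The conservative regularizer can be sketched generically (a CQL-flavoured term; `alpha` and the sampling of model state-actions are illustrative assumptions, not COMBO's exact objective):

```python
import torch

def combo_style_regularizer(q_fn, data_s, data_a, model_s, model_a, alpha=1.0):
    """Push Q down on model-generated (potentially out-of-support)
    state-actions and up on dataset state-actions. `q_fn` is an assumed
    callable mapping (states, actions) to Q-values."""
    q_model = q_fn(model_s, model_a).mean()
    q_data = q_fn(data_s, data_a).mean()
    return alpha * (q_model - q_data)

# Added on top of the usual Bellman error:
#   total_loss = bellman_loss + combo_style_regularizer(...)
```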
arXiv Detail & Related papers (2021-02-16T18:50:32Z)
- Offline Reinforcement Learning from Images with Latent Space Models [60.69745540036375]
Offline reinforcement learning (RL) refers to the problem of learning policies from a static dataset of environment interactions.
We build on recent advances in model-based algorithms for offline RL, and extend them to high-dimensional visual observation spaces.
Our approach is both tractable in practice and corresponds to maximizing a lower bound of the ELBO in the unknown POMDP.
arXiv Detail & Related papers (2020-12-21T18:28:17Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods so that the rewards they use are artificially penalized by the uncertainty of the dynamics.
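The penalized reward admits a short sketch; the disagreement heuristic below (ensemble standard deviation of next-state predictions) is one common choice and only approximates the penalty used in the paper:

```python
import torch

def mopo_penalized_reward(reward, ensemble_next_preds, lam=1.0):
    """Subtract an uncertainty penalty from the model reward.
    ensemble_next_preds: [E, B, d] next-state predictions from E models.
    Uncertainty is approximated here by the largest per-dimension
    disagreement across the ensemble for each sample."""
    u = ensemble_next_preds.std(dim=0).max(dim=-1).values   # shape [B]
    return reward - lam * u

preds = torch.randn(5, 32, 4)   # 5 ensemble members, batch 32, state dim 4
r_pen = mopo_penalized_reward(torch.randn(32), preds)
```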
arXiv Detail & Related papers (2020-05-27T08:46:41Z)