Imagination-Limited Q-Learning for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2505.12211v1
- Date: Sun, 18 May 2025 03:05:21 GMT
- Title: Imagination-Limited Q-Learning for Offline Reinforcement Learning
- Authors: Wenhui Liu, Zhijian Wu, Jingchao Wang, Dingjiang Huang, Shuigeng Zhou
- Abstract summary: We propose an Imagination-Limited Q-learning (ILQ) method to balance exploitation and restriction. Specifically, we utilize the dynamics model to imagine OOD action-values and then clip the imagined values with the maximum behavior values. Our method achieves state-of-the-art performance on a wide range of tasks in the D4RL benchmark.
- Score: 18.8976065411658
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline reinforcement learning seeks to derive improved policies entirely from historical data but often struggles with over-optimistic value estimates for out-of-distribution (OOD) actions. This issue is typically mitigated via policy constraints or conservative value regularization. However, these approaches may impose overly restrictive constraints or introduce biased value estimates, potentially limiting performance improvements. To balance exploitation and restriction, we propose an Imagination-Limited Q-learning (ILQ) method, which aims to keep the optimism that OOD actions deserve within appropriate limits. Specifically, we utilize a dynamics model to imagine OOD action-values and then clip the imagined values with the maximum behavior values. This design preserves a reasonable evaluation of OOD actions to the greatest extent possible while avoiding over-optimism. Theoretically, we prove the convergence of the proposed ILQ under tabular Markov decision processes. In particular, we demonstrate that the error bound between the estimated and optimal values of OOD state-actions has the same magnitude as that of in-distribution ones, indicating that the bias in value estimates is effectively mitigated. Empirically, our method achieves state-of-the-art performance on a wide range of tasks in the D4RL benchmark.
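The clipping idea in the abstract lends itself to a compact illustration. Below is a minimal Python sketch of that step, not the authors' implementation: `dynamics_model`, `reward_model`, `policy`, `q_net`, and the precomputed per-state maximum behavior value `behavior_q_max` are all assumed placeholders, and the one-step model rollout is one plausible reading of "imagining" OOD action-values.

```python
import torch

def imagination_limited_target(q_net, policy, dynamics_model, reward_model,
                               behavior_q_max, states, ood_actions, gamma=0.99):
    """Imagined one-step value of an OOD action, clipped from above by the
    maximum behavior value at the same state (schematic only)."""
    with torch.no_grad():
        next_states = dynamics_model(states, ood_actions)   # imagined transition
        rewards = reward_model(states, ood_actions)          # imagined reward
        next_q = q_net(next_states, policy(next_states))     # bootstrapped next value
        imagined_q = rewards + gamma * next_q                 # imagined OOD action-value
        # Never be more optimistic about an OOD action than the best behavior value.
        return torch.minimum(imagined_q, behavior_q_max)
```

In a full ILQ-style agent, such clipped targets would presumably replace the ordinary bootstrap target whenever the candidate action falls outside the data distribution; how OOD actions are identified is not specified here.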
Related papers
- Taming OOD Actions for Offline Reinforcement Learning: An Advantage-Based Approach [11.836153064242811]
Offline reinforcement learning (RL) aims to learn decision-making policies from fixed datasets without online interactions. We propose Advantage-based Diffusion Actor-Critic (ADAC) as a novel method that systematically evaluates OOD actions. ADAC achieves state-of-the-art performance on almost all tasks in the D4RL benchmark.
arXiv Detail & Related papers (2025-05-08T10:57:28Z)
- Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization [23.817251267022847]
We propose the Behavior-Supported Policy Optimization (BSPO) method to mitigate the reward over-optimization issue. BSPO reduces the generation of OOD responses during the reinforcement learning process. Empirical results show that BSPO outperforms baselines in preventing reward over-optimization.
arXiv Detail & Related papers (2025-03-23T16:20:59Z)
- Strategically Conservative Q-Learning [89.17906766703763]
Offline reinforcement learning (RL) is a compelling paradigm to extend RL's practical utility.
The major difficulty in offline RL is mitigating the impact of approximation errors when encountering out-of-distribution (OOD) actions.
We propose a novel framework called Strategically Conservative Q-Learning (SCQ) that distinguishes between OOD data that is easy to estimate and OOD data that is hard to estimate.
arXiv Detail & Related papers (2024-06-06T22:09:46Z)
- When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
- Conservative State Value Estimation for Offline Reinforcement Learning [36.416504941791224]
Conservative State Value Estimation (CSVE) learns a conservative V-function by directly imposing a penalty on OOD states.
We develop a practical actor-critic algorithm in which the critic performs the conservative value estimation by additionally sampling and penalizing states around the dataset (a schematic sketch follows this entry).
We evaluate on classic continuous control tasks from D4RL, showing that our method outperforms conservative Q-function learning methods and is strongly competitive with recent SOTA methods.
arXiv Detail & Related papers (2023-02-14T08:13:55Z)
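A minimal sketch of the kind of state-value penalty the CSVE entry above describes: fit V on dataset states and push it down on states sampled around the dataset. The Gaussian perturbation used to sample nearby states and the penalty weight `alpha` are illustrative assumptions, not the paper's exact scheme.

```python
import torch

def conservative_v_loss(v_net, q_targets, states, alpha=1.0, noise_std=0.1):
    """Fit V on dataset states while pushing V down on states sampled
    around the dataset (schematic only)."""
    # Standard value regression on in-distribution states.
    bellman_loss = (v_net(states) - q_targets).pow(2).mean()

    # Sample nearby out-of-distribution states by perturbing dataset states
    # (an assumed sampling scheme) and penalize their values.
    ood_states = states + noise_std * torch.randn_like(states)
    penalty = v_net(ood_states).mean() - v_net(states).mean()

    return bellman_loss + alpha * penalty
```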
- Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning [125.8224674893018]
Offline Reinforcement Learning (RL) aims to learn policies from previously collected datasets without exploring the environment.
Applying off-policy algorithms to offline RL usually fails due to the extrapolation error caused by out-of-distribution (OOD) actions.
We propose Pessimistic Bootstrapping for offline RL (PBRL), a purely uncertainty-driven offline algorithm without explicit policy constraints (a schematic sketch follows this entry).
arXiv Detail & Related papers (2022-02-23T15:27:16Z)
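The PBRL entry above describes a purely uncertainty-driven algorithm built on bootstrapping. The sketch below shows the generic ensemble-disagreement penalty such a design suggests; the specific penalty form and the weight `beta` are assumptions rather than PBRL's exact uncertainty quantifier.

```python
import torch

def pessimistic_target(q_ensemble, policy, rewards, next_states, gamma=0.99, beta=1.0):
    """Uncertainty-penalized bootstrap target from a bootstrapped Q-ensemble
    (schematic only); no explicit policy constraint is used."""
    with torch.no_grad():
        next_actions = policy(next_states)
        # One prediction per bootstrapped critic: shape (ensemble_size, batch).
        qs = torch.stack([q(next_states, next_actions) for q in q_ensemble])
        q_mean, q_std = qs.mean(dim=0), qs.std(dim=0)
        # Larger ensemble disagreement => larger pessimistic penalty.
        return rewards + gamma * (q_mean - beta * q_std)
```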
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization (a sketch of the underlying value update follows this entry).
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
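The IQL summary above notes that the method never evaluates actions outside the dataset; the full paper achieves this with expectile regression of a state-value function toward Q-values of dataset actions. A minimal sketch of that value loss, with `tau` as the expectile parameter (0.7 is a typical choice, not necessarily the paper's):

```python
import torch

def expectile_value_loss(v_net, q_values, states, tau=0.7):
    """IQL-style value update (schematic): expectile regression of V(s) toward
    Q(s, a) evaluated only at dataset actions, so unseen actions are never queried."""
    diff = q_values - v_net(states)          # q_values = Q(s, a) for dataset pairs (s, a)
    # Asymmetric squared loss |tau - 1(diff < 0)| * diff^2; an upper expectile
    # (tau > 0.5) approximates a max over in-dataset actions.
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()
```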
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration (a schematic of this reweighting follows this entry).
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
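The VA-OPE summary above describes reweighting the Bellman residual in Fitted Q-Iteration by an estimated variance of the value function. Below is a minimal sketch of one such weighted fitted-iteration step with linear state-action features; the inverse-variance form of the weights and the variance estimate `var_hat` are assumptions, since the summary does not specify them.

```python
import numpy as np

def variance_weighted_fqi_step(phi, rewards, phi_next, w_prev, var_hat,
                               gamma=0.99, reg=1e-3):
    """One variance-reweighted fitted-iteration step with linear features
    (schematic only): samples with large estimated variance are down-weighted."""
    targets = rewards + gamma * phi_next @ w_prev      # bootstrapped targets
    weights = 1.0 / np.maximum(var_hat, 1e-6)          # inverse-variance weights (assumed form)
    # Weighted ridge regression: argmin_w sum_i weights_i * (phi_i . w - target_i)^2.
    A = phi.T @ (weights[:, None] * phi) + reg * np.eye(phi.shape[1])
    b = phi.T @ (weights * targets)
    return np.linalg.solve(A, b)
```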
- Offline Contextual Bandits with Overparameterized Models [52.788628474552276]
We ask whether the strong generalization of overparameterized models observed in supervised learning also occurs for offline contextual bandits.
We show that this discrepancy is due to the action-stability of their objectives.
In experiments with large neural networks, this gap between action-stable value-based objectives and unstable policy-based objectives leads to significant performance differences.
arXiv Detail & Related papers (2020-06-27T13:52:07Z)