Taming OOD Actions for Offline Reinforcement Learning: An Advantage-Based Approach
- URL: http://arxiv.org/abs/2505.05126v3
- Date: Sat, 21 Jun 2025 13:05:20 GMT
- Title: Taming OOD Actions for Offline Reinforcement Learning: An Advantage-Based Approach
- Authors: Xuyang Chen, Keyu Yan, Lin Zhao
- Abstract summary: Offline reinforcement learning (RL) aims to learn decision-making policies from fixed datasets without online interactions. We propose Advantage-based Diffusion Actor-Critic (ADAC), a novel method that systematically evaluates OOD actions. ADAC achieves state-of-the-art performance on almost all tasks in the D4RL benchmark.
- Score: 11.836153064242811
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline reinforcement learning (RL) aims to learn decision-making policies from fixed datasets without online interactions, providing a practical solution in settings where online data collection is expensive or risky. However, offline RL often suffers from distribution shift, resulting in inaccurate evaluation and substantial overestimation of out-of-distribution (OOD) actions. To address this, existing approaches incorporate conservatism by indiscriminately discouraging all OOD actions, thereby hindering the agent's ability to generalize and exploit beneficial ones. In this paper, we propose Advantage-based Diffusion Actor-Critic (ADAC), a novel method that systematically evaluates OOD actions using the batch-optimal value function. Based on this evaluation, ADAC defines an advantage function to modulate the Q-function update, enabling a more precise assessment of OOD action quality. We design a custom PointMaze environment and collect datasets to visually reveal that advantage modulation can effectively identify and select superior OOD actions. Extensive experiments show that ADAC achieves state-of-the-art performance on almost all tasks in the D4RL benchmark, with particularly clear margins on the more challenging tasks.
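The abstract's core mechanism is an advantage signal, derived from a reference (batch-optimal) value function, that modulates the Q-function update. Below is a minimal, hypothetical PyTorch sketch of that idea; the network interfaces (`q_net`, `v_ref`), the sigmoid weighting, and the hyperparameters are assumptions for illustration, not ADAC's exact objective.

```python
import torch

def advantage_modulated_q_loss(q_net, v_ref, batch, gamma=0.99, temp=1.0):
    """Weight each Bellman error by a soft advantage score computed from a
    reference value function (standing in for the batch-optimal V*).
    Illustrative sketch only; not ADAC's exact formulation."""
    s, a, r, s_next, done = batch                      # tensors from the offline dataset
    with torch.no_grad():
        # One-step advantage of the taken action under the reference value function.
        adv = r + gamma * (1.0 - done) * v_ref(s_next) - v_ref(s)
        # Soft weight in (0, 1): higher-advantage actions influence the update more.
        w = torch.sigmoid(adv / temp)
        target = r + gamma * (1.0 - done) * v_ref(s_next)
    td_error = q_net(s, a) - target
    return (w * td_error.pow(2)).mean()
```

The point of such a weighting is that OOD actions are not penalized uniformly: those the reference value function judges favorably keep their influence on the Q-update.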
Related papers
- Imagination-Limited Q-Learning for Offline Reinforcement Learning [18.8976065411658]
We propose an Imagination-Limited Q-learning (ILQ) method to balance exploitation and restriction. Specifically, we utilize the dynamics model to imagine OOD action-values, and then clip the imagined values with the maximum behavior values. Our method achieves state-of-the-art performance on a wide range of tasks in the D4RL benchmark.
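A rough sketch of the "imagine then clip" step described above; the dynamics-model and value-network interfaces are hypothetical stand-ins, not ILQ's actual code.

```python
import torch

def imagination_limited_value(v_net, dynamics_model, max_behavior_value, s, a_ood, gamma=0.99):
    """Imagine the value of an OOD action via a learned dynamics model, then clip
    it so it never exceeds the maximum value among in-dataset (behavior) actions."""
    with torch.no_grad():
        r_hat, s_next_hat = dynamics_model(s, a_ood)             # imagined reward and next state
        v_imagined = r_hat + gamma * v_net(s_next_hat)           # imagined one-step value
        return torch.minimum(v_imagined, max_behavior_value(s))  # cap at the behavior maximum
```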
arXiv Detail & Related papers (2025-05-18T03:05:21Z) - Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection [71.92083784393418]
Inference-time methods such as Best-of-N (BON) sampling offer a simple yet effective alternative to improve performance. We propose Iterative Agent Decoding (IAD), which combines iterative refinement with dynamic candidate evaluation and selection guided by a verifier.
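A minimal sketch of the selection loop this describes, with `generate`, `refine`, and `verify` as assumed callables rather than IAD's actual API.

```python
def iterative_decode(generate, refine, verify, prompt, n=4, rounds=3):
    """Best-of-N selection plus verifier-guided refinement: sample candidates,
    keep the highest-scoring one, and feed it back as context for the next round."""
    best = max((generate(prompt) for _ in range(n)), key=verify)
    for _ in range(rounds - 1):
        candidates = [refine(prompt, best) for _ in range(n)]
        best = max(candidates + [best], key=verify)
    return best
```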
arXiv Detail & Related papers (2025-04-02T17:40:47Z) - Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization [23.817251267022847]
We propose the Behavior-Supported Policy Optimization (BSPO) method to mitigate the reward over-optimization issue. BSPO reduces the generation of OOD responses during the reinforcement learning process. Empirical results show that BSPO outperforms baselines in preventing reward over-optimization.
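A hedged sketch of the general "behavior-supported" idea; the threshold and penalty values are invented for illustration, and this is not BSPO's exact regularizer.

```python
import torch

def behavior_supported_value(value, behavior_log_prob, support_threshold=-8.0, penalty=-10.0):
    """Assign a pessimistic value to responses whose likelihood under the behavior
    (reference) policy is too low, discouraging OOD generations during RL."""
    in_support = behavior_log_prob >= support_threshold
    return torch.where(in_support, value, torch.full_like(value, penalty))
```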
arXiv Detail & Related papers (2025-03-23T16:20:59Z) - Out-of-Distribution Detection using Synthetic Data Generation [21.612592503592143]
Distinguishing in-distribution from out-of-distribution (OOD) inputs is crucial for the reliable deployment of classification systems. We present a method that harnesses the generative capabilities of Large Language Models (LLMs) to create high-quality synthetic OOD proxies.
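A toy sketch of using an LLM as an OOD-proxy generator for outlier-exposure-style training; `llm_generate` and the prompt wording are hypothetical, not the paper's pipeline.

```python
def build_ood_proxies(llm_generate, id_class_names, n_per_class=50):
    """Ask a language model for samples that are near, but outside, the
    in-distribution label set, then use them as synthetic OOD proxies."""
    proxies = []
    for name in id_class_names:
        prompt = (f"Write {n_per_class} short text samples that are semantically "
                  f"close to, but clearly NOT about, the topic '{name}'.")
        proxies.extend(llm_generate(prompt))
    return proxies
```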
arXiv Detail & Related papers (2025-02-05T16:22:09Z) - Offline Reinforcement Learning with OOD State Correction and OOD Action Suppression [47.598803055066554]
In offline reinforcement learning (RL), addressing the out-of-distribution (OOD) action issue has been a focus.
We argue that there exists an OOD state issue that also impairs performance yet has been underexplored.
We propose SCAS, a simple yet effective approach that unifies OOD state correction and OOD action suppression in offline RL.
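One simple way to operationalize "penalize what the dataset does not support" for both states and actions is sketched below; the density and support models are assumptions, and this is not SCAS's actual mechanism.

```python
import torch

def corrected_suppressed_target(r, s_next, a_next, v_net, state_density,
                                behavior_log_prob, gamma=0.99,
                                state_thresh=1e-3, action_thresh=-8.0, penalty=10.0):
    """Penalize the bootstrapped target when the next state has low data density
    or the next action has low behavior support."""
    with torch.no_grad():
        ood_state = (state_density(s_next) < state_thresh).float()
        ood_action = (behavior_log_prob(s_next, a_next) < action_thresh).float()
        return r + gamma * (v_net(s_next) - penalty * (ood_state + ood_action))
```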
arXiv Detail & Related papers (2024-10-25T09:01:37Z) - Strategically Conservative Q-Learning [89.17906766703763]
Offline reinforcement learning (RL) is a compelling paradigm to extend RL's practical utility.
The major difficulty in offline RL is mitigating the impact of approximation errors when encountering out-of-distribution (OOD) actions.
We propose a novel framework called Strategically Conservative Q-Learning (SCQ) that distinguishes between OOD data that is easy and hard to estimate.
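A sketch of the selective-conservatism idea: apply a pessimism penalty only where the value is hard to estimate. Ensemble disagreement is used here as a stand-in for "hard to estimate"; that choice is an assumption, not SCQ's definition.

```python
import torch

def selectively_conservative_q(q_ensemble, s, a, beta=1.0, uncertainty_thresh=0.5):
    """Penalize only the state-action pairs whose ensemble disagreement is high,
    leaving easy-to-estimate OOD actions free to be exploited."""
    qs = torch.stack([q(s, a) for q in q_ensemble])   # [ensemble, batch]
    mean_q, std_q = qs.mean(dim=0), qs.std(dim=0)
    hard = (std_q > uncertainty_thresh).float()
    return mean_q - beta * hard * std_q
```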
arXiv Detail & Related papers (2024-06-06T22:09:46Z) - Skeleton-OOD: An End-to-End Skeleton-Based Model for Robust Out-of-Distribution Human Action Detection [17.85872085904999]
We propose a novel end-to-end skeleton-based model called Skeleton-OOD. Skeleton-OOD aims to improve the effectiveness of OOD detection while preserving the accuracy of in-distribution (ID) recognition. Our findings underscore the effectiveness of classic OOD detection techniques in the context of skeleton-based action recognition tasks.
arXiv Detail & Related papers (2024-05-31T05:49:37Z) - Fast Decision Boundary based Out-of-Distribution Detector [7.04686607977352]
Out-of-Distribution (OOD) detection is essential for the safe deployment of AI systems.
Existing feature space methods, while effective, often incur significant computational overhead.
We propose a computationally-efficient OOD detector without using auxiliary models.
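A rough sketch of a decision-boundary-style score computed directly from the classifier's linear head, with no auxiliary model; the paper's actual score (fDBD) includes an additional normalization omitted here, so treat this as a toy version under that assumption.

```python
import torch

def decision_boundary_ood_score(features, weight, bias):
    """Toy OOD score: average distance of a feature vector to the decision
    boundaries between the predicted class and every other class."""
    logits = features @ weight.T + bias                  # [batch, classes]
    pred = logits.argmax(dim=-1)
    w_pred, b_pred = weight[pred], bias[pred]            # [batch, dim], [batch]
    # Boundary between the predicted class and every class k.
    diff_w = w_pred.unsqueeze(1) - weight.unsqueeze(0)   # [batch, classes, dim]
    diff_b = b_pred.unsqueeze(1) - bias.unsqueeze(0)     # [batch, classes]
    margins = (features.unsqueeze(1) * diff_w).sum(-1) + diff_b
    dists = margins.abs() / diff_w.norm(dim=-1).clamp_min(1e-8)
    # The predicted class contributes distance 0, so exclude it from the average.
    num_classes = weight.shape[0]
    return dists.sum(dim=-1) / (num_classes - 1)
```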
arXiv Detail & Related papers (2023-12-15T19:50:32Z) - AUTO: Adaptive Outlier Optimization for Online Test-Time OOD Detection [81.49353397201887]
Out-of-distribution (OOD) detection is crucial to deploying machine learning models in open-world applications.
We introduce a novel paradigm called test-time OOD detection, which utilizes unlabeled online data directly at test time to improve OOD detection performance.
We propose adaptive outlier optimization (AUTO), which consists of an in-out-aware filter, an ID memory bank, and a semantically-consistent objective.
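A hedged sketch of a test-time adaptation loop in the spirit described above; the filter thresholds, memory bank, and update rule are placeholders, not AUTO's actual components.

```python
def test_time_ood_adaptation(model, ood_score, update_step, stream,
                             low=0.2, high=0.8, memory_capacity=256):
    """For each unlabeled test sample: confident-ID samples fill a memory bank,
    confident-OOD samples serve as pseudo-outliers for online tuning, and
    ambiguous samples are filtered out."""
    id_memory, decisions = [], []
    for x in stream:
        s = ood_score(model, x)
        decisions.append(s > high)                 # flag as OOD
        if s < low:                                # confident ID -> remember it
            id_memory.append(x)
            id_memory = id_memory[-memory_capacity:]
        elif s > high and id_memory:               # confident OOD -> adapt online
            update_step(model, id_batch=id_memory, ood_sample=x)
    return decisions
```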
arXiv Detail & Related papers (2023-03-22T02:28:54Z) - Out-of-distribution Detection with Implicit Outlier Transformation [72.73711947366377]
Outlier exposure (OE) is powerful in out-of-distribution (OOD) detection.
We propose a novel OE-based approach that makes the model perform well for unseen OOD situations.
arXiv Detail & Related papers (2023-03-09T04:36:38Z) - Dealing with the Unknown: Pessimistic Offline Reinforcement Learning [25.30634466168587]
We propose a Pessimistic Offline Reinforcement Learning (PessORL) algorithm to actively lead the agent back to the area where it is familiar.
We focus on problems caused by out-of-distribution (OOD) states, and deliberately penalize high values at states that are absent in the training dataset.
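The state-level pessimism this describes can be sketched as below; the density model and the penalty form are assumptions, not PessORL's exact bonus.

```python
import torch

def pessimistic_state_value(v_net, state_density, s, density_thresh=1e-3, penalty=10.0):
    """Push down the learned value at states that are unlikely under the training
    data, so the agent is driven back toward familiar regions."""
    with torch.no_grad():
        ood = (state_density(s) < density_thresh).float()
    return v_net(s) - penalty * ood
```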
arXiv Detail & Related papers (2021-11-09T22:38:58Z) - Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning [63.53407136812255]
Offline Reinforcement Learning promises to learn effective policies from previously-collected, static datasets without the need for exploration.
Existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states.
We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly.
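The down-weighting this describes can be sketched as follows, with ensemble variance used as a simple stand-in for the paper's uncertainty estimate.

```python
import torch

def uncertainty_weighted_td_loss(q_ensemble, target_q, s, a, beta=1.0):
    """Down-weight the Bellman error of state-action pairs whose value estimate
    is uncertain (likely OOD), so they contribute less to the update."""
    qs = torch.stack([q(s, a) for q in q_ensemble])   # [ensemble, batch]
    with torch.no_grad():
        uncertainty = qs.var(dim=0)
        w = beta / (beta + uncertainty)               # smaller weight when uncertain
    return (w * (qs.mean(dim=0) - target_q).pow(2)).mean()
```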
arXiv Detail & Related papers (2021-05-17T20:16:46Z) - ATOM: Robustifying Out-of-distribution Detection Using Outlier Mining [51.19164318924997]
Adversarial Training with informative Outlier Mining (ATOM) improves the robustness of OOD detection.
ATOM achieves state-of-the-art performance under a broad family of classic and adversarial OOD evaluation tasks.
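A minimal sketch of the "informative outlier mining" step: from a large auxiliary outlier pool, keep only the outliers the current detector finds hardest and train against them. The keep fraction is an assumption.

```python
import torch

def mine_informative_outliers(ood_score, outlier_pool, keep_fraction=0.1):
    """Select the auxiliary outliers with the lowest OOD scores, i.e. the ones the
    current model is most likely to confuse with in-distribution data; training on
    these hard negatives tightens the decision boundary."""
    scores = torch.stack([ood_score(x) for x in outlier_pool])
    k = max(1, int(keep_fraction * len(outlier_pool)))
    hard_idx = torch.argsort(scores)[:k].tolist()      # lowest scores = hardest outliers
    return [outlier_pool[i] for i in hard_idx]
```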
arXiv Detail & Related papers (2020-06-26T20:58:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.