Taming OOD Actions for Offline Reinforcement Learning: An Advantage-Based Approach
- URL: http://arxiv.org/abs/2505.05126v1
- Date: Thu, 08 May 2025 10:57:28 GMT
- Title: Taming OOD Actions for Offline Reinforcement Learning: An Advantage-Based Approach
- Authors: Xuyang Chen, Keyu Yan, Lin Zhao
- Abstract summary: Offline reinforcement learning (RL) aims to learn decision-making policies from fixed datasets without online interactions. We propose Advantage-based Diffusion Actor-Critic (ADAC), a novel method that systematically evaluates OOD actions. ADAC achieves state-of-the-art performance on almost all tasks in the D4RL benchmark.
- Score: 11.836153064242811
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline reinforcement learning (RL) aims to learn decision-making policies from fixed datasets without online interactions, providing a practical solution where online data collection is expensive or risky. However, offline RL often suffers from distribution shift, resulting in inaccurate evaluation and substantial overestimation on out-of-distribution (OOD) actions. To address this, existing approaches incorporate conservatism by indiscriminately discouraging all OOD actions, thereby hindering the agent's ability to generalize and exploit beneficial ones. In this paper, we propose Advantage-based Diffusion Actor-Critic (ADAC), a novel method that systematically evaluates OOD actions using the batch-optimal value function. Based on this evaluation, ADAC defines an advantage function to modulate the Q-function update, enabling more precise assessment of OOD action quality. We design a custom PointMaze environment and collect datasets to visually reveal that advantage modulation can effectively identify and select superior OOD actions. Extensive experiments show that ADAC achieves state-of-the-art performance on almost all tasks in the D4RL benchmark, with particularly clear margins on the more challenging tasks.
Related papers
- ALOE: Action-Level Off-Policy Evaluation for Vision-Language-Action Model Post-Training [15.70383059978939]
We study how to improve large foundation vision-language-action (VLA) systems through online reinforcement learning (RL) in real-world settings.
In practice, the value function is estimated from trajectory fragments collected from different data sources.
We propose ALOE, an action-level off-policy evaluation framework for VLA post-training.
arXiv Detail & Related papers (2026-02-13T07:46:37Z)
- Imagination-Limited Q-Learning for Offline Reinforcement Learning [18.8976065411658]
We propose an Imagination-Limited Q-learning (ILQ) method to balance exploitation and restriction.
Specifically, we utilize the dynamics model to imagine OOD action-values, and then clip the imagined values with the maximum behavior values.
Our method achieves state-of-the-art performance on a wide range of tasks in the D4RL benchmark.
arXiv Detail & Related papers (2025-05-18T03:05:21Z)
- Adaptive Scoring and Thresholding with Human Feedback for Robust Out-of-Distribution Detection [6.192472816262214]
Machine Learning (ML) models are trained on in-distribution (ID) data but often encounter out-of-distribution (OOD) inputs during deployment.
Recent works have focused on designing scoring functions to quantify OOD uncertainty.
We propose a human-in-the-loop framework that safely updates both scoring functions and thresholds on the fly based on real-world OOD inputs.
arXiv Detail & Related papers (2025-05-05T00:25:14Z)
- Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection [71.92083784393418]
Inference-time methods such as Best-of-N (BON) sampling offer a simple yet effective alternative to improve performance.
We propose Iterative Agent Decoding (IAD) which combines iterative refinement with dynamic candidate evaluation and selection guided by a verifier.
arXiv Detail & Related papers (2025-04-02T17:40:47Z)
- Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization [23.817251267022847]
We propose the Behavior-Supported Policy Optimization (BSPO) method to mitigate the reward over-optimization issue.
BSPO reduces the generation of OOD responses during the reinforcement learning process.
Empirical results show that BSPO outperforms baselines in preventing reward over-optimization.
arXiv Detail & Related papers (2025-03-23T16:20:59Z)
- Out-of-Distribution Detection using Synthetic Data Generation [21.612592503592143]
Distinguishing in- and out-of-distribution (OOD) inputs is crucial for reliable deployment of classification systems.
We present a method that harnesses the generative capabilities of Large Language Models (LLMs) to create high-quality synthetic OOD proxies.
arXiv Detail & Related papers (2025-02-05T16:22:09Z)
- Offline Reinforcement Learning with OOD State Correction and OOD Action Suppression [47.598803055066554]
In offline reinforcement learning (RL), addressing the out-of-distribution (OOD) action issue has been a focus.
We argue that there exists an OOD state issue that also impairs performance yet has been underexplored.
We propose SCAS, a simple yet effective approach that unifies OOD state correction and OOD action suppression in offline RL.
arXiv Detail & Related papers (2024-10-25T09:01:37Z)
- Strategically Conservative Q-Learning [89.17906766703763]
Offline reinforcement learning (RL) is a compelling paradigm to extend RL's practical utility.
The major difficulty in offline RL is mitigating the impact of approximation errors when encountering out-of-distribution (OOD) actions.
We propose a novel framework called Strategically Conservative Q-Learning (SCQ) that distinguishes between OOD data that is easy and hard to estimate.
arXiv Detail & Related papers (2024-06-06T22:09:46Z)
- Skeleton-OOD: An End-to-End Skeleton-Based Model for Robust Out-of-Distribution Human Action Detection [17.85872085904999]
We propose a novel end-to-end skeleton-based model called Skeleton-OOD.
Skeleton-OOD is committed to improving the effectiveness of OOD tasks while ensuring the accuracy of ID recognition.
Our findings underscore the effectiveness of classic OOD detection techniques in the context of skeleton-based action recognition tasks.
arXiv Detail & Related papers (2024-05-31T05:49:37Z)
- Fast Decision Boundary based Out-of-Distribution Detector [7.04686607977352]
Out-of-Distribution (OOD) detection is essential for the safe deployment of AI systems.
Existing feature space methods, while effective, often incur significant computational overhead.
We propose a computationally-efficient OOD detector without using auxiliary models.
arXiv Detail & Related papers (2023-12-15T19:50:32Z)
- Augmenting Unsupervised Reinforcement Learning with Self-Reference [63.68018737038331]
Humans possess the ability to draw on past experiences explicitly when learning new tasks.
We propose the Self-Reference (SR) approach, an add-on module explicitly designed to leverage historical information.
Our approach achieves state-of-the-art results in terms of Interquartile Mean (IQM) performance and Optimality Gap reduction on the Unsupervised Reinforcement Learning Benchmark.
arXiv Detail & Related papers (2023-11-16T09:07:34Z)
- AUTO: Adaptive Outlier Optimization for Online Test-Time OOD Detection [81.49353397201887]
Out-of-distribution (OOD) detection is crucial to deploying machine learning models in open-world applications.
We introduce a novel paradigm called test-time OOD detection, which utilizes unlabeled online data directly at test time to improve OOD detection performance.
We propose adaptive outlier optimization (AUTO), which consists of an in-out-aware filter, an ID memory bank, and a semantically-consistent objective.
arXiv Detail & Related papers (2023-03-22T02:28:54Z)
- Out-of-distribution Detection with Implicit Outlier Transformation [72.73711947366377]
Outlier exposure (OE) is powerful in out-of-distribution (OOD) detection.
We propose a novel OE-based approach that makes the model perform well for unseen OOD situations.
arXiv Detail & Related papers (2023-03-09T04:36:38Z)
- Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL): inferring the reward function of an agent from observing its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z)
- Dealing with the Unknown: Pessimistic Offline Reinforcement Learning [25.30634466168587]
We propose a Pessimistic Offline Reinforcement Learning (PessORL) algorithm to actively lead the agent back to the area where it is familiar.
We focus on problems caused by out-of-distribution (OOD) states, and deliberately penalize high values at states that are absent in the training dataset.
arXiv Detail & Related papers (2021-11-09T22:38:58Z)
- Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning [63.53407136812255]
Offline Reinforcement Learning promises to learn effective policies from previously-collected, static datasets without the need for exploration.
Existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states.
We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly.
arXiv Detail & Related papers (2021-05-17T20:16:46Z)
- ATOM: Robustifying Out-of-distribution Detection Using Outlier Mining [51.19164318924997]
Adversarial Training with informative Outlier Mining improves robustness of OOD detection.
ATOM achieves state-of-the-art performance under a broad family of classic and adversarial OOD evaluation tasks.
arXiv Detail & Related papers (2020-06-26T20:58:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.