Option-aware Temporally Abstracted Value for Offline Goal-Conditioned Reinforcement Learning
- URL: http://arxiv.org/abs/2505.12737v2
- Date: Tue, 04 Nov 2025 02:26:57 GMT
- Title: Option-aware Temporally Abstracted Value for Offline Goal-Conditioned Reinforcement Learning
- Authors: Hongjoon Ahn, Heewoong Choi, Jisu Han, Taesup Moon
- Abstract summary: Offline goal-conditioned reinforcement learning (GCRL) offers a practical learning paradigm in which goal-reaching policies are trained from abundant state-action trajectory datasets. In this paper, we propose a simple yet effective solution: Option-aware Temporally Abstracted value learning, dubbed OTA, which incorporates temporal abstraction into the temporal-difference learning process. We experimentally show that the high-level policy learned using OTA achieves strong performance on complex tasks from OGBench, a recently proposed offline GCRL benchmark.
- Score: 19.341894845618445
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline goal-conditioned reinforcement learning (GCRL) offers a practical learning paradigm in which goal-reaching policies are trained from abundant state-action trajectory datasets without additional environment interaction. However, offline GCRL still struggles with long-horizon tasks, even with recent advances that employ hierarchical policy structures, such as HIQL. In identifying the root cause of this challenge, we make two observations. First, performance bottlenecks mainly stem from the high-level policy's inability to generate appropriate subgoals. Second, when learning the high-level policy in the long-horizon regime, the sign of the advantage estimate frequently becomes incorrect. Thus, we argue that improving the value function to produce a clear advantage estimate for learning the high-level policy is essential. In this paper, we propose a simple yet effective solution: Option-aware Temporally Abstracted value learning, dubbed OTA, which incorporates temporal abstraction into the temporal-difference learning process. By modifying the value update to be option-aware, our approach contracts the effective horizon length, enabling better advantage estimates even in long-horizon regimes. We experimentally show that the high-level policy learned using the OTA value function achieves strong performance on complex tasks from OGBench, a recently proposed offline GCRL benchmark, including maze navigation and visual robotic manipulation environments.
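To make the horizon-contraction idea concrete, here is a minimal sketch of a k-step, option-length TD backup for a goal-conditioned value function, paired with the expectile loss used by IQL/HIQL-style methods. The names (`value_net`, `target_net`), the batch layout, and the default hyperparameters are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a temporally abstracted TD backup in the spirit of OTA.
# Assumptions: value_net(s, g) and target_net(s, g) return per-sample values;
# the batch precomputes k-step quantities. Not the paper's actual code.
import torch

def expectile_loss(diff, tau=0.7):
    # Asymmetric L2: weight tau above the target, (1 - tau) below.
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def ota_style_value_loss(value_net, target_net, batch, gamma=0.99, k=4, tau=0.7):
    s, s_k, g = batch["obs"], batch["obs_k"], batch["goal"]   # s_k = s_{t+k}
    r_k, done = batch["reward_k"], batch["done_k"]            # k-step return, termination
    with torch.no_grad():
        # Bootstrapping with gamma**k contracts the effective horizon by ~k.
        target = r_k + (gamma ** k) * (1.0 - done) * target_net(s_k, g)
    diff = target - value_net(s, g)
    return expectile_loss(diff, tau)
```

Because the bootstrap term is discounted by gamma**k rather than gamma, errors propagate over roughly H/k backups instead of H, which is one way to read the "contracts the effective horizon" claim in the abstract.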
Related papers
- ALOE: Action-Level Off-Policy Evaluation for Vision-Language-Action Model Post-Training [15.70383059978939]
We study how to improve large foundation vision-language-action (VLA) systems through online reinforcement learning (RL) in real-world settings. In practice, the value function is estimated from trajectory fragments collected from different data sources. We propose ALOE, an action-level off-policy evaluation framework for VLA post-training.
arXiv Detail & Related papers (2026-02-13T07:46:37Z)
- Physics-informed Value Learner for Offline Goal-Conditioned Reinforcement Learning [20.424372965054832]
We propose a Physics-informed (Pi) regularized loss for value learning, derived from the Eikonal Partial Differential Equation (PDE). Unlike generic gradient penalties that are primarily used to stabilize training, our formulation is grounded in continuous-time optimal control and encourages value functions to align with cost-to-go structures. The proposed regularizer is broadly compatible with temporal-difference-based value learning and can be integrated into existing offline GCRL algorithms.
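For a unit-cost cost-to-go, the Eikonal PDE reduces to ||grad_s V(s, g)|| = 1, so one plausible form of the regularizer penalizes the value gradient's deviation from unit norm. The sketch below is an assumed, generic version of that idea; the `value_net` interface, sign conventions, and weighting are not taken from the paper.

```python
# Hedged sketch of an Eikonal-style gradient penalty on a goal-conditioned
# value function. The exact loss, scaling, and signs in the paper may differ.
import torch

def eikonal_regularizer(value_net, s, g):
    s = s.clone().requires_grad_(True)
    v = value_net(s, g).sum()
    # dV/ds via autograd; create_graph keeps the penalty differentiable.
    grad_s = torch.autograd.grad(v, s, create_graph=True)[0]
    # Eikonal PDE for a unit-cost cost-to-go: ||grad V|| = 1.
    return (grad_s.norm(dim=-1) - 1.0).pow(2).mean()
```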
arXiv Detail & Related papers (2025-09-08T15:08:42Z)
- Test-time Offline Reinforcement Learning on Goal-related Experience [50.94457794664909]
Research in foundation models has shown that performance can be substantially improved through test-time training. We propose a novel self-supervised data selection criterion, which selects transitions from an offline dataset according to their relevance to the current state. Our goal-conditioned test-time training (GC-TTT) algorithm applies this routine in a receding-horizon fashion during evaluation, adapting the policy to the current trajectory as it is being rolled out.
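As a rough illustration of the recipe (select goal-related data, then briefly fine-tune during rollout), the sketch below uses a nearest-neighbor proxy for "relevance to the current state" and a behavior-cloning update. Both are stand-ins: the paper's selection criterion and update rule are its own, and `policy.log_prob` is an assumed interface.

```python
# Hedged sketch of test-time training on goal-related experience.
# select_relevant uses a nearest-neighbor proxy, not the paper's criterion.
import torch

def select_relevant(dataset_obs, current_obs, top_k=256):
    # Distance of every dataset state to the current state.
    dists = torch.cdist(current_obs.unsqueeze(0), dataset_obs).squeeze(0)
    return torch.topk(dists, k=top_k, largest=False).indices

def test_time_update(policy, optimizer, dataset, current_obs, goal, top_k=256):
    idx = select_relevant(dataset["obs"], current_obs, top_k)
    obs, act = dataset["obs"][idx], dataset["act"][idx]
    g = goal.unsqueeze(0).expand(obs.shape[0], -1)
    # One goal-conditioned behavior-cloning step on the selected slice.
    loss = -policy.log_prob(obs, g, act).mean()  # log_prob is an assumed API
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```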
arXiv Detail & Related papers (2025-07-24T21:11:39Z)
- Flattening Hierarchies with Policy Bootstrapping [2.3940819037450987]
We introduce an algorithm to train a flat (non-hierarchical) goal-conditioned policy by bootstrapping on subgoal-conditioned policies with advantage-weighted importance sampling. Our approach eliminates the need for a generative model over the (sub)goal space, which we find is key for scaling to high-dimensional control in large state spaces.
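The advantage-weighted mechanism this summary refers to can be sketched as a weighted regression: regress the flat policy onto dataset actions, weighting each by its exponentiated advantage. The advantage definition and the `policy.log_prob`/`value_net` interfaces below are simplifying assumptions, and the actual method's subgoal bootstrapping is omitted.

```python
# Hedged sketch of advantage-weighted regression for a flat
# goal-conditioned policy; the paper's importance-sampling correction
# and subgoal bootstrapping are omitted for brevity.
import torch

def awr_flat_policy_loss(policy, value_net, batch, beta=1.0, w_max=100.0):
    s, a, g = batch["obs"], batch["act"], batch["goal"]
    with torch.no_grad():
        # One common GCRL advantage proxy: did the step move us toward g?
        adv = value_net(batch["next_obs"], g) - value_net(s, g)
        w = torch.clamp(torch.exp(adv / beta), max=w_max)
    return -(w * policy.log_prob(s, g, a)).mean()
```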
arXiv Detail & Related papers (2025-05-20T23:31:30Z)
- Exploiting Hybrid Policy in Reinforcement Learning for Interpretable Temporal Logic Manipulation [12.243491328213217]
Reinforcement Learning (RL) based methods have been increasingly explored for robot learning. We propose a Temporal-Logic-guided Hybrid policy framework (HyTL) which leverages three-level decision layers to improve the agent's performance. We evaluate HyTL on four challenging manipulation tasks, demonstrating its effectiveness and interpretability.
arXiv Detail & Related papers (2024-12-29T03:34:53Z)
- Offline Policy Learning via Skill-step Abstraction for Long-horizon Goal-Conditioned Tasks [7.122367852177223]
We present an offline goal-conditioned (GC) policy learning framework tailored for tackling long-horizon GC tasks.
In the framework, a GC policy is progressively learned offline in conjunction with the incremental modeling of skill-step abstractions on the data.
We demonstrate the superiority and efficiency of our GLvSA framework in adapting GC policies to a wide range of long-horizon goals.
arXiv Detail & Related papers (2024-08-21T03:05:06Z)
- Offline Reinforcement Learning from Datasets with Structured Non-Stationarity [50.35634234137108]
Current Reinforcement Learning (RL) is often limited by the large amount of data needed to learn a successful policy.
We address a novel Offline RL problem setting in which, while collecting the dataset, the transition and reward functions gradually change between episodes but stay constant within each episode.
We propose a method based on Contrastive Predictive Coding that identifies this non-stationarity in the offline dataset, accounts for it when training a policy, and predicts it during evaluation.
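The contrastive machinery referenced here is typically an InfoNCE objective: embeddings from the same episode form positive pairs while other episodes in the batch serve as negatives, so episode-level (non-stationary) factors become identifiable. The sketch below is the generic CPC-style loss under that assumption, not the paper's exact pairing scheme.

```python
# Hedged sketch: generic InfoNCE loss in the style of Contrastive
# Predictive Coding. context/future are (B, D) embeddings of earlier
# and later segments of the same episode; the pairing is assumed.
import torch
import torch.nn.functional as F

def info_nce(context, future, temperature=0.1):
    context = F.normalize(context, dim=-1)
    future = F.normalize(future, dim=-1)
    logits = context @ future.t() / temperature  # (B, B) similarities
    labels = torch.arange(context.shape[0], device=context.device)
    # The matching episode sits on the diagonal; the rest act as negatives.
    return F.cross_entropy(logits, labels)
```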
arXiv Detail & Related papers (2024-05-23T02:41:36Z)
- Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning [68.16998247593209]
The offline reinforcement learning (RL) paradigm provides a recipe to convert static behavior datasets into policies that can perform better than the policy that collected the data.
In this paper, we propose an adaptive scheme for action quantization.
We show that several state-of-the-art offline RL methods such as IQL, CQL, and BRAC improve in performance on benchmarks when combined with our proposed discretization scheme.
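A generic (non-adaptive) starting point for action quantization is to fit a codebook over the dataset's continuous actions and map each action to its nearest code, after which any discrete-action offline RL method applies. The k-means construction below illustrates that baseline; the paper's contribution is an adaptive scheme that this sketch does not implement.

```python
# Hedged sketch: fixed k-means action codebook as a baseline for
# action quantization. The paper's adaptive scheme is more involved.
import numpy as np
from sklearn.cluster import KMeans

def build_action_codebook(actions, n_codes=64, seed=0):
    # actions: (N, action_dim) continuous actions from the offline dataset.
    km = KMeans(n_clusters=n_codes, random_state=seed, n_init=10).fit(actions)
    return km.cluster_centers_  # (n_codes, action_dim)

def quantize(actions, codebook):
    # Index of the nearest code for each continuous action.
    d = ((actions[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)
```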
arXiv Detail & Related papers (2023-10-18T06:07:10Z)
- Efficient Learning of High Level Plans from Play [57.29562823883257]
We present Efficient Learning of High-Level Plans from Play (ELF-P), a framework for robotic learning that bridges motion planning and deep RL.
We demonstrate that ELF-P has significantly better sample efficiency than relevant baselines over multiple realistic manipulation tasks.
arXiv Detail & Related papers (2023-03-16T20:09:47Z)
- Flow to Control: Offline Reinforcement Learning with Lossless Primitive Discovery [31.49638957903016]
Offline reinforcement learning (RL) enables the agent to effectively learn from logged data.
We show that our method has a good representation ability for policies and achieves superior performance in most tasks.
arXiv Detail & Related papers (2022-12-02T11:35:51Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
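The trick that lets IQL avoid out-of-dataset actions is expectile regression: the value network is regressed toward dataset Q-values with an asymmetric loss, so V(s) approaches an upper expectile of Q(s, a) over dataset actions without ever evaluating unseen ones. A minimal sketch of that value loss, with typical (not necessarily the paper's) hyperparameters:

```python
# Minimal sketch of IQL's expectile value regression. Q is queried only
# at dataset (obs, act) pairs; tau=0.7 is a typical setting.
import torch

def iql_value_loss(q_target, value_net, obs, act, tau=0.7):
    with torch.no_grad():
        q = q_target(obs, act)                    # dataset actions only
    diff = q - value_net(obs)
    weight = torch.abs(tau - (diff < 0).float())  # asymmetric L2 weight
    return (weight * diff.pow(2)).mean()
```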
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
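The conservatism comes from an extra penalty on top of the standard Bellman loss: push Q down on broadly sampled actions (approximating a log-sum-exp over the action space) and up on dataset actions. The sketch below uses uniform action sampling as a stand-in for the paper's full sampling and importance-correction scheme.

```python
# Hedged sketch of the CQL penalty term (added to a TD loss elsewhere).
# Uniform sampling in [-1, 1]^d only roughly approximates
# logsumexp_a Q(s, a); the paper corrects this with importance weights.
import torch

def cql_penalty(q_net, obs, act, action_dim, n_samples=10, alpha=1.0):
    B = obs.shape[0]
    rand_act = torch.rand(B, n_samples, action_dim, device=obs.device) * 2 - 1
    obs_rep = obs.unsqueeze(1).expand(-1, n_samples, -1)
    q_rand = q_net(obs_rep.reshape(B * n_samples, -1),
                   rand_act.reshape(B * n_samples, -1)).reshape(B, n_samples)
    # Push down on sampled actions, push up on dataset actions.
    gap = torch.logsumexp(q_rand, dim=1) - q_net(obs, act)
    return alpha * gap.mean()
```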
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.