Chain-of-Goals Hierarchical Policy for Long-Horizon Offline Goal-Conditioned RL
- URL: http://arxiv.org/abs/2602.03389v1
- Date: Tue, 03 Feb 2026 11:11:03 GMT
- Title: Chain-of-Goals Hierarchical Policy for Long-Horizon Offline Goal-Conditioned RL
- Authors: Jinwoo Choi, Sang-Hyun Lee, Seung-Woo Seo
- Abstract summary: We propose a novel framework that reformulates hierarchical decision-making as autoregressive sequence modeling. CoGHP consistently outperforms strong offline baselines, demonstrating improved performance on long-horizon tasks.
- Score: 25.40364932514488
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline goal-conditioned reinforcement learning remains challenging for long-horizon tasks. While hierarchical approaches mitigate this issue by decomposing tasks, most existing methods rely on separate high- and low-level networks and generate only a single intermediate subgoal, making them inadequate for complex tasks that require coordinating multiple intermediate decisions. To address this limitation, we draw inspiration from the chain-of-thought paradigm and propose the Chain-of-Goals Hierarchical Policy (CoGHP), a novel framework that reformulates hierarchical decision-making as autoregressive sequence modeling within a unified architecture. Given a state and a final goal, CoGHP autoregressively generates a sequence of latent subgoals followed by the primitive action, where each latent subgoal acts as a reasoning step that conditions subsequent predictions. To implement this efficiently, we pioneer the use of an MLP-Mixer backbone, which supports cross-token communication and captures structural relationships among state, goal, latent subgoals, and action. Across challenging navigation and manipulation benchmarks, CoGHP consistently outperforms strong offline baselines, demonstrating improved performance on long-horizon tasks.
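As a concrete illustration of the architecture the abstract describes, here is a minimal, hypothetical PyTorch sketch, not the authors' implementation: the token layout, the number of subgoal slots `k`, the zero-initialized subgoal placeholders, and all dimensions are assumptions. The point of the MLP-Mixer blocks is the token-mixing step, which lets the state, goal, partially generated subgoals, and an action query token exchange information without attention.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One MLP-Mixer block: a token-mixing MLP (cross-token communication)
    followed by a channel-mixing MLP."""
    def __init__(self, n_tokens, dim, hidden=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(n_tokens, hidden), nn.GELU(), nn.Linear(hidden, n_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                                  # x: (B, T, D)
        y = self.norm1(x).transpose(1, 2)                  # (B, D, T)
        x = x + self.token_mlp(y).transpose(1, 2)          # mix across tokens
        return x + self.channel_mlp(self.norm2(x))         # mix across channels

class ChainOfGoalsPolicy(nn.Module):
    """Hypothetical sketch. Tokens: [state, goal, subgoal_1..k, action query].
    Latent subgoals are filled in autoregressively, each conditioning on the
    ones before it; the primitive action is read from the query token."""
    def __init__(self, obs_dim, act_dim, k=3, dim=128, depth=4):
        super().__init__()
        self.k, self.dim = k, dim
        self.embed_obs = nn.Linear(obs_dim, dim)
        self.embed_goal = nn.Linear(obs_dim, dim)  # assumes goal lives in obs space
        self.act_query = nn.Parameter(torch.zeros(1, 1, dim))
        self.blocks = nn.ModuleList([MixerBlock(3 + k, dim) for _ in range(depth)])
        self.subgoal_head = nn.Linear(dim, dim)
        self.action_head = nn.Linear(dim, act_dim)

    def _backbone(self, obs, goal, subgoals):
        B = obs.shape[0]
        x = torch.cat([self.embed_obs(obs).unsqueeze(1),
                       self.embed_goal(goal).unsqueeze(1),
                       subgoals,
                       self.act_query.expand(B, 1, self.dim)], dim=1)
        for blk in self.blocks:
            x = blk(x)
        return x

    def forward(self, obs, goal):
        B = obs.shape[0]
        subgoals = torch.zeros(B, self.k, self.dim, device=obs.device)
        for i in range(self.k):                  # autoregressive chain of goals
            x = self._backbone(obs, goal, subgoals)
            subgoals = subgoals.clone()
            subgoals[:, i] = self.subgoal_head(x[:, 2 + i])
        x = self._backbone(obs, goal, subgoals)
        return self.action_head(x[:, -1])        # primitive action
```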
Related papers
- HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents [19.63866851076813]
HiMAC is a hierarchical agentic RL framework that decomposes long-horizon decision-making into macro-level planning and micro-level execution. Our results show that introducing structured hierarchy, rather than increasing model scale alone, is a key factor for enabling robust long-horizon agentic intelligence.
arXiv Detail & Related papers (2026-03-01T08:09:03Z)
- MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs). We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
arXiv Detail & Related papers (2026-01-12T05:02:48Z)
- ReCAP: Recursive Context-Aware Reasoning and Planning for Large Language Model Agents [61.51091799997476]
We introduce ReCAP (Recursive Context-Aware Reasoning and Planning), a hierarchical framework with shared context for reasoning and planning in large language models (LLMs). ReCAP combines three key mechanisms: plan-ahead decomposition, structured re-injection of parent plans, and memory-efficient execution. Experiments demonstrate that ReCAP substantially improves subgoal alignment and success rates on various long-horizon reasoning benchmarks.
arXiv Detail & Related papers (2025-10-27T20:03:55Z)
- Reinforcement Learning with Anticipation: A Hierarchical Approach for Long-Horizon Tasks [3.79187263097166]
Solving long-horizon goal-conditioned tasks remains a significant challenge in reinforcement learning. We introduce Reinforcement Learning with Anticipation (RLA), a principled and potentially scalable framework designed to address this challenge. A key feature of RLA is the training of the anticipation model, which is guided by a principle of value geometric consistency.
arXiv Detail & Related papers (2025-09-06T00:10:15Z)
- Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning [5.274804664403783]
Strict Subgoal Execution (SSE) is a graph-based hierarchical RL framework that enforces single-step subgoal reachability. We show that SSE consistently outperforms existing goal-conditioned RL and hierarchical RL approaches in both efficiency and success rate.
arXiv Detail & Related papers (2025-06-26T06:35:42Z)
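The single-step reachability constraint in the SSE entry above lends itself to a simple picture: keep only graph edges the low-level policy can reliably bridge in one subgoal step, then plan over that graph. Below is a generic sketch of that idea, not SSE's actual algorithm; `reach_cost` is a stand-in for whatever learned reachability estimate is used, and `max_cost` is an assumed threshold.

```python
import heapq

def build_subgoal_graph(nodes, reach_cost, max_cost=1.0):
    """Keep an edge u -> v only if the estimated cost of reaching v from u
    in a single subgoal step is below the threshold (strict reachability)."""
    graph = {u: [] for u in nodes}
    for u in nodes:
        for v in nodes:
            if u != v and reach_cost(u, v) <= max_cost:
                graph[u].append((v, reach_cost(u, v)))
    return graph

def subgoal_chain(graph, start, goal):
    """Dijkstra over the subgoal graph; returns the ordered subgoal chain,
    or None if no chain of strictly reachable subgoals exists."""
    dist, prev = {start: 0.0}, {}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist[u]:
            continue  # stale queue entry
        for v, c in graph[u]:
            if d + c < dist.get(v, float("inf")):
                dist[v], prev[v] = d + c, u
                heapq.heappush(pq, (d + c, v))
    if goal not in dist:
        return None
    chain, node = [], goal
    while node != start:
        chain.append(node)
        node = prev[node]
    return chain[::-1]
```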
- Flattening Hierarchies with Policy Bootstrapping [5.528896840956629]
We introduce an algorithm to train a flat (non-hierarchical) goal-conditioned policy by bootstrapping on subgoal-conditioned policies with advantage-weighted importance sampling. Our approach eliminates the need for a generative model over the (sub)goal space, which we find is key for scaling to high-dimensional control in large state spaces.
arXiv Detail & Related papers (2025-05-20T23:31:30Z)
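The advantage weighting mentioned in the entry above plausibly follows the familiar exponentiated-advantage pattern used in advantage-weighted regression; here is a generic sketch under that assumption, not the paper's exact estimator (`beta` and the clipping range are assumed hyperparameters).

```python
import numpy as np

def advantage_weights(advantages, beta=1.0, clip=20.0):
    """Exponentiated-advantage weights, normalized to mean 1, used to
    reweight a regression loss toward high-advantage (sub)goal-reaching
    transitions. beta is a temperature; clip guards against overflow."""
    a = np.asarray(advantages, dtype=np.float64)
    w = np.exp(np.clip(a / beta, -clip, clip))
    return w / w.mean()

# e.g. weights for three transitions with advantages 0.5, -1.0, 2.0
print(advantage_weights([0.5, -1.0, 2.0]))
```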
- Offline Multi-agent Reinforcement Learning via Score Decomposition [51.23590397383217]
Offline cooperative multi-agent reinforcement learning (MARL) faces unique challenges due to distributional shifts. This is the first work to explicitly address the distributional gap between offline and online MARL.
arXiv Detail & Related papers (2025-05-09T11:42:31Z)
- Semantically Aligned Task Decomposition in Multi-Agent Reinforcement Learning [56.26889258704261]
We propose a novel "disentangled" decision-making method, Semantically Aligned task decomposition in MARL (SAMA). SAMA prompts pretrained language models with chain-of-thought to suggest potential goals, provide suitable goal decomposition and subgoal allocation, and perform self-reflection-based replanning. SAMA demonstrates considerable advantages in sample efficiency compared to state-of-the-art ASG methods.
arXiv Detail & Related papers (2023-05-18T10:37:54Z)
- Imitating Graph-Based Planning with Goal-Conditioned Policies [72.61631088613048]
We present a self-imitation scheme which distills a subgoal-conditioned policy into the target-goal-conditioned policy.
We empirically show that our method can significantly boost the sample-efficiency of the existing goal-conditioned RL methods.
arXiv Detail & Related papers (2023-03-20T14:51:10Z)
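The self-imitation distillation in the entry above can be written as a simple imitation loss; here is a hypothetical sketch (the policy call signatures and the MSE choice are assumptions, not the paper's exact objective).

```python
import torch
import torch.nn.functional as F

def self_imitation_loss(flat_policy, subgoal_policy,
                        states, planned_subgoals, final_goals):
    """The subgoal-conditioned teacher solves the easier, nearby problem;
    the flat student learns to produce the same actions when conditioned
    directly on the distant final goal."""
    with torch.no_grad():
        teacher_actions = subgoal_policy(states, planned_subgoals)
    student_actions = flat_policy(states, final_goals)
    return F.mse_loss(student_actions, teacher_actions)
```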
- Planning to Practice: Efficient Online Fine-Tuning by Composing Goals in Latent Space [76.46113138484947]
General-purpose robots require diverse repertoires of behaviors to complete challenging tasks in real-world unstructured environments.
To address this issue, goal-conditioned reinforcement learning aims to acquire policies that can reach goals for a wide range of tasks on command.
We propose Planning to Practice, a method that makes it practical to train goal-conditioned policies for long-horizon tasks.
arXiv Detail & Related papers (2022-05-17T06:58:17Z)