Chain-of-Goals Hierarchical Policy for Long-Horizon Offline Goal-Conditioned RL
- URL: http://arxiv.org/abs/2602.03389v1
- Date: Tue, 03 Feb 2026 11:11:03 GMT
- Title: Chain-of-Goals Hierarchical Policy for Long-Horizon Offline Goal-Conditioned RL
- Authors: Jinwoo Choi, Sang-Hyun Lee, Seung-Woo Seo
- Abstract summary: We propose a novel framework that reformulates hierarchical decision-making as autoregressive sequence modeling. CoGHP consistently outperforms strong offline baselines, demonstrating improved performance on long-horizon tasks.
- Score: 25.40364932514488
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline goal-conditioned reinforcement learning remains challenging for long-horizon tasks. While hierarchical approaches mitigate this issue by decomposing tasks, most existing methods rely on separate high- and low-level networks and generate only a single intermediate subgoal, making them inadequate for complex tasks that require coordinating multiple intermediate decisions. To address this limitation, we draw inspiration from the chain-of-thought paradigm and propose the Chain-of-Goals Hierarchical Policy (CoGHP), a novel framework that reformulates hierarchical decision-making as autoregressive sequence modeling within a unified architecture. Given a state and a final goal, CoGHP autoregressively generates a sequence of latent subgoals followed by the primitive action, where each latent subgoal acts as a reasoning step that conditions subsequent predictions. To implement this efficiently, we pioneer the use of an MLP-Mixer backbone, which supports cross-token communication and captures structural relationships among state, goal, latent subgoals, and action. Across challenging navigation and manipulation benchmarks, CoGHP consistently outperforms strong offline baselines, demonstrating improved performance on long-horizon tasks.
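As a concrete illustration of the architecture the abstract describes, here is a minimal, hypothetical PyTorch sketch, not the authors' implementation: the token layout, the number of subgoal slots `k`, the zero-initialized subgoal placeholders, and all dimensions are assumptions. The point of the MLP-Mixer blocks is the token-mixing step, which lets the state, goal, partially generated subgoals, and an action query token exchange information without attention.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One MLP-Mixer block: a token-mixing MLP (cross-token communication)
    followed by a channel-mixing MLP."""
    def __init__(self, n_tokens, dim, hidden=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(n_tokens, hidden), nn.GELU(), nn.Linear(hidden, n_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                                  # x: (B, T, D)
        y = self.norm1(x).transpose(1, 2)                  # (B, D, T)
        x = x + self.token_mlp(y).transpose(1, 2)          # mix across tokens
        return x + self.channel_mlp(self.norm2(x))         # mix across channels

class ChainOfGoalsPolicy(nn.Module):
    """Hypothetical sketch. Tokens: [state, goal, subgoal_1..k, action query].
    Latent subgoals are filled in autoregressively, each conditioning on the
    ones before it; the primitive action is read from the query token."""
    def __init__(self, obs_dim, act_dim, k=3, dim=128, depth=4):
        super().__init__()
        self.k, self.dim = k, dim
        self.embed_obs = nn.Linear(obs_dim, dim)
        self.embed_goal = nn.Linear(obs_dim, dim)  # assumes goal lives in obs space
        self.act_query = nn.Parameter(torch.zeros(1, 1, dim))
        self.blocks = nn.ModuleList([MixerBlock(3 + k, dim) for _ in range(depth)])
        self.subgoal_head = nn.Linear(dim, dim)
        self.action_head = nn.Linear(dim, act_dim)

    def _backbone(self, obs, goal, subgoals):
        B = obs.shape[0]
        x = torch.cat([self.embed_obs(obs).unsqueeze(1),
                       self.embed_goal(goal).unsqueeze(1),
                       subgoals,
                       self.act_query.expand(B, 1, self.dim)], dim=1)
        for blk in self.blocks:
            x = blk(x)
        return x

    def forward(self, obs, goal):
        B = obs.shape[0]
        subgoals = torch.zeros(B, self.k, self.dim, device=obs.device)
        for i in range(self.k):                  # autoregressive chain of goals
            x = self._backbone(obs, goal, subgoals)
            subgoals = subgoals.clone()
            subgoals[:, i] = self.subgoal_head(x[:, 2 + i])
        x = self._backbone(obs, goal, subgoals)
        return self.action_head(x[:, -1])        # primitive action
```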
Related papers
- HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents [19.63866851076813]
HiMAC is a hierarchical agentic RL framework that decomposes long-horizon decision-making into macro-level planning and micro-level execution. Our results show that introducing structured hierarchy, rather than increasing model scale alone, is a key factor for enabling robust long-horizon agentic intelligence.
arXiv Detail & Related papers (2026-03-01T08:09:03Z)
- MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs). We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
arXiv Detail & Related papers (2026-01-12T05:02:48Z)
- ReCAP: Recursive Context-Aware Reasoning and Planning for Large Language Model Agents [61.51091799997476]
We introduce ReCAP (Recursive Context-Aware Reasoning and Planning), a hierarchical framework with shared context for reasoning and planning in large language models (LLMs). ReCAP combines three key mechanisms: plan-ahead decomposition, structured re-injection of parent plans, and memory-efficient execution. Experiments demonstrate that ReCAP substantially improves subgoal alignment and success rates on various long-horizon reasoning benchmarks.
arXiv Detail & Related papers (2025-10-27T20:03:55Z)
- Reinforcement Learning with Anticipation: A Hierarchical Approach for Long-Horizon Tasks [3.79187263097166]
Solving long-horizon goal-conditioned tasks remains a significant challenge in reinforcement learning. We introduce Reinforcement Learning with Anticipation (RLA), a principled and potentially scalable framework designed to address this challenge. A key feature of RLA is the training of the anticipation model, which is guided by a principle of value geometric consistency.
arXiv Detail & Related papers (2025-09-06T00:10:15Z)
- Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning [5.274804664403783]
Strict Subgoal Execution (SSE) is a graph-based hierarchical RL framework that enforces single-step subgoal reachability. We show that SSE consistently outperforms existing goal-conditioned RL and hierarchical RL approaches in both efficiency and success rate.
arXiv Detail & Related papers (2025-06-26T06:35:42Z)
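The single-step reachability constraint in the SSE entry above lends itself to a simple picture: keep only graph edges the low-level policy can reliably bridge in one subgoal step, then plan over that graph. Below is a generic sketch of that idea, not SSE's actual algorithm; `reach_cost` is a stand-in for whatever learned reachability estimate is used, and `max_cost` is an assumed threshold.

```python
import heapq

def build_subgoal_graph(nodes, reach_cost, max_cost=1.0):
    """Keep an edge u -> v only if the estimated cost of reaching v from u
    in a single subgoal step is below the threshold (strict reachability)."""
    graph = {u: [] for u in nodes}
    for u in nodes:
        for v in nodes:
            if u != v and reach_cost(u, v) <= max_cost:
                graph[u].append((v, reach_cost(u, v)))
    return graph

def subgoal_chain(graph, start, goal):
    """Dijkstra over the subgoal graph; returns the ordered subgoal chain,
    or None if no chain of strictly reachable subgoals exists."""
    dist, prev = {start: 0.0}, {}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist[u]:
            continue  # stale queue entry
        for v, c in graph[u]:
            if d + c < dist.get(v, float("inf")):
                dist[v], prev[v] = d + c, u
                heapq.heappush(pq, (d + c, v))
    if goal not in dist:
        return None
    chain, node = [], goal
    while node != start:
        chain.append(node)
        node = prev[node]
    return chain[::-1]
```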
- Flattening Hierarchies with Policy Bootstrapping [5.528896840956629]
We introduce an algorithm to train a flat (non-hierarchical) goal-conditioned policy by bootstrapping on subgoal-conditioned policies with advantage-weighted importance sampling. Our approach eliminates the need for a generative model over the (sub)goal space, which we find is key for scaling to high-dimensional control in large state spaces.
arXiv Detail & Related papers (2025-05-20T23:31:30Z)
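The advantage weighting mentioned in the entry above plausibly follows the familiar exponentiated-advantage pattern used in advantage-weighted regression; here is a generic sketch under that assumption, not the paper's exact estimator (`beta` and the clipping range are assumed hyperparameters).

```python
import numpy as np

def advantage_weights(advantages, beta=1.0, clip=20.0):
    """Exponentiated-advantage weights, normalized to mean 1, used to
    reweight a regression loss toward high-advantage (sub)goal-reaching
    transitions. beta is a temperature; clip guards against overflow."""
    a = np.asarray(advantages, dtype=np.float64)
    w = np.exp(np.clip(a / beta, -clip, clip))
    return w / w.mean()

# e.g. weights for three transitions with advantages 0.5, -1.0, 2.0
print(advantage_weights([0.5, -1.0, 2.0]))
```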
- Offline Multi-agent Reinforcement Learning via Score Decomposition [51.23590397383217]
Offline cooperative multi-agent reinforcement learning (MARL) faces unique challenges due to distributional shifts. This is the first work to explicitly address the distributional gap between offline and online MARL.
arXiv Detail & Related papers (2025-05-09T11:42:31Z)
- Semantically Aligned Task Decomposition in Multi-Agent Reinforcement Learning [56.26889258704261]
We propose a novel "disentangled" decision-making method, Semantically Aligned task decomposition in MARL (SAMA). SAMA prompts pretrained language models with chain-of-thought to suggest potential goals, provide suitable goal decomposition and subgoal allocation, and perform self-reflection-based replanning. SAMA demonstrates considerable advantages in sample efficiency compared to state-of-the-art ASG methods.
arXiv Detail & Related papers (2023-05-18T10:37:54Z)
- Imitating Graph-Based Planning with Goal-Conditioned Policies [72.61631088613048]
We present a self-imitation scheme which distills a subgoal-conditioned policy into the target-goal-conditioned policy.
We empirically show that our method can significantly boost the sample-efficiency of the existing goal-conditioned RL methods.
arXiv Detail & Related papers (2023-03-20T14:51:10Z)
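The self-imitation distillation in the entry above can be written as a simple imitation loss; here is a hypothetical sketch (the policy call signatures and the MSE choice are assumptions, not the paper's exact objective).

```python
import torch
import torch.nn.functional as F

def self_imitation_loss(flat_policy, subgoal_policy,
                        states, planned_subgoals, final_goals):
    """The subgoal-conditioned teacher solves the easier, nearby problem;
    the flat student learns to produce the same actions when conditioned
    directly on the distant final goal."""
    with torch.no_grad():
        teacher_actions = subgoal_policy(states, planned_subgoals)
    student_actions = flat_policy(states, final_goals)
    return F.mse_loss(student_actions, teacher_actions)
```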
- Planning to Practice: Efficient Online Fine-Tuning by Composing Goals in Latent Space [76.46113138484947]
General-purpose robots require diverse repertoires of behaviors to complete challenging tasks in real-world unstructured environments.
To address this issue, goal-conditioned reinforcement learning aims to acquire policies that can reach goals for a wide range of tasks on command.
We propose Planning to Practice, a method that makes it practical to train goal-conditioned policies for long-horizon tasks.
arXiv Detail & Related papers (2022-05-17T06:58:17Z)