Global Commander and Local Operative: A Dual-Agent Framework for Scene Navigation
- URL: http://arxiv.org/abs/2602.18941v1
- Date: Sat, 21 Feb 2026 19:19:55 GMT
- Title: Global Commander and Local Operative: A Dual-Agent Framework for Scene Navigation
- Authors: Kaiming Jin, Yuefan Wu, Shengqiong Wu, Bobo Li, Shuicheng Yan, Tat-Seng Chua,
- Abstract summary: Vision-and-Language Scene navigation is a fundamental capability for embodied human-AI.<n>We introduce DACo, a planning-grounding decoupled architecture that disentangles global deliberation from local grounding.<n>By disentangling global reasoning from local action, DACo alleviates cognitive overload and improves long-horizon stability.
- Score: 96.88162755522342
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision-and-Language Scene navigation is a fundamental capability for embodied human-AI collaboration, requiring agents to follow natural language instructions to execute coherent action sequences in complex environments. Existing approaches either rely on multiple agents, incurring high coordination and resource costs, or adopt a single-agent paradigm, which overloads the agent with both global planning and local perception, often leading to degraded reasoning and instruction drift in long-horizon settings. To address these issues, we introduce DACo, a planning-grounding decoupled architecture that disentangles global deliberation from local grounding. Concretely, it employs a Global Commander for high-level strategic planning and a Local Operative for egocentric observing and fine-grained execution. By disentangling global reasoning from local action, DACo alleviates cognitive overload and improves long-horizon stability. The framework further integrates dynamic subgoal planning and adaptive replanning to enable structured and resilient navigation. Extensive evaluations on R2R, REVERIE, and R4R demonstrate that DACo achieves 4.9%, 6.5%, 5.4% absolute improvements over the best-performing baselines in zero-shot settings, and generalizes effectively across both closed-source (e.g., GPT-4o) and open-source (e.g., Qwen-VL Series) backbones. DACo provides a principled and extensible paradigm for robust long-horizon navigation. Project page: https://github.com/ChocoWu/DACo
Related papers
- MA-CoNav: A Master-Slave Multi-Agent Framework with Hierarchical Collaboration and Dual-Level Reflection for Long-Horizon Embodied VLN [1.7508558850131373]
Vision-Language Navigation (VLN) aims to empower robots with the ability to perform long-horizon navigation in unfamiliar environments.<n>This paper proposes MA-CoNav, a Multi-Agent Collaborative Navigation framework.
arXiv Detail & Related papers (2026-03-03T14:16:02Z) - WebAnchor: Anchoring Agent Planning to Stabilize Long-Horizon Web Reasoning [82.12501258760814]
Large Language Model(LLM)-based agents have shown strong capabilities in web information seeking.<n>Plan anchor is where the first reasoning step disproportionately impacts downstream behavior in long-horizon web reasoning tasks.<n>We propose Anchor-GRPO, a two-stage RL framework that decouples planning and execution.
arXiv Detail & Related papers (2026-01-06T16:36:40Z) - Hybrid Motion Planning with Deep Reinforcement Learning for Mobile Robot Navigation [0.0]
Hybrid Motion Planning with Deep Reinforcement Learning (HMP-DRL)<n>We propose a graph-based global planner to generate a path, which is integrated into a local DRL policy via a sequence of checkpoints encoded in both the state space and reward function.<n>To ensure social compliance, the local planner employs an entity-aware reward structure that dynamically adjusts safety margins and penalties based on the semantic type of surrounding agents.
arXiv Detail & Related papers (2025-12-31T05:58:57Z) - Hi-Agent: Hierarchical Vision-Language Agents for Mobile Device Control [72.43808515668947]
We introduce Hi-Agent, a trainable hierarchical vision-language agent for mobile control.<n>Hi-Agent features a high-level reasoning model and a low-level action model that are jointly optimized.<n>Hi-Agent achieves a new State-Of-The-Art (SOTA) 87.9% task success rate on the Android-in-the-Wild (AitW) benchmark.
arXiv Detail & Related papers (2025-10-16T07:38:21Z) - DAgger Diffusion Navigation: DAgger Boosted Diffusion Policy for Vision-Language Navigation [73.80968452950854]
Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural language instructions through free-form 3D spaces.<n>Existing VLN-CE approaches typically use a two-stage waypoint planning framework.<n>We propose DAgger Diffusion Navigation (DifNav) as an end-to-end optimized VLN-CE policy.
arXiv Detail & Related papers (2025-08-13T02:51:43Z) - PilotRL: Training Language Model Agents via Global Planning-Guided Progressive Reinforcement Learning [19.480628850056522]
Large Language Models (LLMs) have shown remarkable advancements in tackling agent-oriented tasks.<n>Current approaches predominantly rely on supervised fine-tuning, which often leads models to memorize established task completion trajectories.<n>We introduce an adaptive global plan-based agent paradigm AdaPlan, aiming to synergize high-level explicit guidance with execution.
arXiv Detail & Related papers (2025-08-01T06:17:11Z) - NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning [97.88246428240872]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions.<n>Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability.<n>This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), where we fulfill parameter-efficient in-domain training to enable self-guided navigational decision.
arXiv Detail & Related papers (2024-03-12T07:27:02Z) - Think Global, Act Local: Dual-scale Graph Transformer for
Vision-and-Language Navigation [87.03299519917019]
We propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding.
We build a topological map on-the-fly to enable efficient exploration in global action space.
The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation benchmarks.
arXiv Detail & Related papers (2022-02-23T19:06:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.