Towards Open Environments and Instructions: General Vision-Language Navigation via Fast-Slow Interactive Reasoning
- URL: http://arxiv.org/abs/2601.09111v1
- Date: Wed, 14 Jan 2026 03:22:16 GMT
- Title: Towards Open Environments and Instructions: General Vision-Language Navigation via Fast-Slow Interactive Reasoning
- Authors: Yang Li, Aming Wu, Zihao Zhang, Yahong Han
- Abstract summary: Vision-Language Navigation aims to enable agents to navigate to a target location based on language instructions. Recent research indicates that by means of fast and slow cognition systems, human beings could generate stable policies. We propose the slow4fast-VLN, establishing a dynamic interactive fast-slow reasoning framework.
- Score: 58.25418970608328
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Navigation aims to enable agents to navigate to a target location based on language instructions. Traditional VLN often follows a closed-set assumption, i.e., training and test data share the same style of input images and instructions. However, the real world is open and filled with various unseen environments, posing enormous difficulties for closed-set methods. To this end, we focus on the General Scene Adaptation (GSA-VLN) task, which aims to learn generalized navigation ability by introducing diverse environments and inconsistent instructions. Towards this task, when facing unseen environments and instructions, the challenge mainly lies in how to enable the agent to dynamically produce generalized strategies during the navigation process. Recent research indicates that, by means of fast and slow cognition systems, human beings can generate stable policies, which strengthens their adaptation to the open world. Inspired by this idea, we propose slow4fast-VLN, which establishes a dynamic interactive fast-slow reasoning framework. The fast-reasoning module, an end-to-end policy network, outputs actions from real-time input and accumulates execution records in a history repository to build memory. The slow-reasoning module analyzes the memories generated by the fast-reasoning module and, through deep reflection, extracts experiences that enhance the generalization ability of decision-making. These experiences are structurally stored and used to continuously optimize the fast-reasoning module. Unlike traditional methods that treat fast and slow reasoning as independent mechanisms, our framework enables fast-slow interaction by leveraging the experiences from slow reasoning, allowing the system to continuously adapt and efficiently execute navigation tasks when facing unseen scenarios.
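The abstract describes a control loop in which a fast policy acts and logs records, while a slow module periodically reflects on those records and feeds distilled experiences back to the fast policy. The sketch below illustrates one plausible way such a loop could be wired up; all class names (FastPolicy, SlowReflector, HistoryRepository, ExperienceStore), the environment interface, and the reflection schedule are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a fast-slow interactive reasoning loop for VLN.
# Names and interfaces are assumptions for illustration only.
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class HistoryRepository:
    """Accumulates execution records produced by the fast-reasoning module."""
    records: List[Dict[str, Any]] = field(default_factory=list)

    def add(self, record: Dict[str, Any]) -> None:
        self.records.append(record)


@dataclass
class ExperienceStore:
    """Structured storage for experiences distilled by the slow-reasoning module."""
    experiences: List[Dict[str, Any]] = field(default_factory=list)


class FastPolicy:
    """End-to-end policy: maps real-time input (plus retrieved experiences) to an action."""
    def act(self, observation: Any, instruction: str,
            experiences: List[Dict[str, Any]]) -> str:
        # Placeholder decision logic; a real agent would run a learned policy here.
        return "move_forward"


class SlowReflector:
    """Analyzes accumulated memories and extracts generalizable experiences."""
    def reflect(self, history: HistoryRepository) -> List[Dict[str, Any]]:
        # Placeholder reflection; a real system might run offline analysis or an LLM.
        return [{"lesson": "prefer viewpoints matching instruction landmarks"}]


def navigate(env: Any, instruction: str, max_steps: int = 50,
             reflect_every: int = 10) -> None:
    """Run one episode; `env` is a hypothetical environment with reset()/step()."""
    fast, slow = FastPolicy(), SlowReflector()
    history, store = HistoryRepository(), ExperienceStore()
    obs = env.reset()
    for step in range(max_steps):
        action = fast.act(obs, instruction, store.experiences)   # fast reasoning
        obs, done = env.step(action)
        history.add({"obs": obs, "instruction": instruction, "action": action})
        if (step + 1) % reflect_every == 0:
            # Slow reasoning: distill experiences and feed them back to the fast module.
            store.experiences.extend(slow.reflect(history))
        if done:
            break
```

The key design choice this sketch highlights is the interaction: experiences produced by the slow module are passed back into the fast module's action selection, rather than the two mechanisms running independently.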
Related papers
- VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory [43.2995099083993]
VLA models have shown promising potential in embodied navigation by unifying perception and planning. Most existing VLA models rely on reactive mappings directly from observations to actions. We propose VLingNav, a VLA model for embodied navigation grounded in linguistic-driven cognition.
arXiv Detail & Related papers (2026-01-13T15:43:43Z) - Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents [43.5771856761934]
Vision-and-Language Navigation (VLN) poses significant challenges for agents to interpret natural language instructions and navigate complex 3D environments. We propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents.
arXiv Detail & Related papers (2025-08-11T05:50:30Z) - NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments [67.18144414660681]
Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to execute sequential navigation actions in complex environments guided by natural language instructions. Inspired by human cognition, we present NavMorph, a self-evolving world model framework that enhances environmental understanding and decision-making in VLN-CE tasks.
arXiv Detail & Related papers (2025-06-30T02:20:00Z) - Efficient and Generalizable Environmental Understanding for Visual Navigation [14.10058573339022]
Visual Navigation is a core task in Embodied AI, enabling agents to navigate complex environments toward given objectives. We propose Causality-Aware Navigation (CAN), which incorporates a Causal Understanding Module to enhance the agent's environmental understanding capability.
arXiv Detail & Related papers (2025-06-18T11:47:02Z) - Mem2Ego: Empowering Vision-Language Models with Global-to-Ego Memory for Long-Horizon Embodied Navigation [35.71602601385161]
We present a novel vision-language model (VLM)-based navigation framework. Our approach enhances spatial reasoning and decision-making in long-horizon tasks. Experimental results demonstrate that the proposed method surpasses previous state-of-the-art approaches in object navigation tasks.
arXiv Detail & Related papers (2025-02-20T04:41:40Z) - SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts [54.11162991206203]
This paper consolidates diverse navigation tasks into a unified and generic framework. We propose a novel State-Adaptive Mixture of Experts (SAME) model that effectively enables an agent to infer decisions.
arXiv Detail & Related papers (2024-12-07T06:12:53Z) - DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control [53.80518003412016]
Building a general-purpose intelligent home-assistant agent skilled in diverse tasks by human commands is a long-term blueprint of embodied AI research.
We study primitive mobile manipulations for embodied agents, i.e. how to navigate and interact based on an instructed verb-noun pair.
We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient controls.
arXiv Detail & Related papers (2024-07-20T05:39:28Z) - Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration [83.96729205383501]
We introduce prompt-based learning to achieve fast adaptation for language embeddings.
Our model can adapt to diverse vision-language navigation tasks, including VLN and REVERIE.
arXiv Detail & Related papers (2022-03-08T11:01:24Z) - Structured Scene Memory for Vision-Language Navigation [155.63025602722712]
We propose a crucial architecture for vision-language navigation (VLN).
It is compartmentalized enough to accurately memorize the percepts during navigation.
It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
arXiv Detail & Related papers (2021-03-05T03:41:00Z)