Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2505.20897v2
- Date: Sun, 22 Jun 2025 13:53:33 GMT
- Title: Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation
- Authors: Pingrui Zhang, Yifei Su, Pengyuan Wu, Dong An, Li Zhang, Zhigang Wang, Dong Wang, Yan Ding, Bin Zhao, Xuelong Li
- Abstract summary: Vision-and-Language Navigation (VLN) requires the agent to navigate by following natural instructions under partial observability. Recent methods mitigate this by imagining future scenes, yet they rely on vision-based synthesis. We propose to adaptively imagine key environmental semantics in language form, enabling a more reliable and efficient strategy.
- Score: 46.84289573813059
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-and-Language Navigation (VLN) requires the agent to navigate by following natural instructions under partial observability, making it difficult to align perception with language. Recent methods mitigate this by imagining future scenes, yet they rely on vision-based synthesis, leading to high computational cost and redundant details. To this end, we propose to adaptively imagine key environmental semantics in language form, enabling a more reliable and efficient strategy. Specifically, we introduce a novel Adaptive Text Dreamer (ATD), a dual-branch self-guided imagination policy built upon a large language model (LLM). ATD is designed with a human-like left-right brain architecture, where the left brain focuses on logical integration and the right brain is responsible for imaginative prediction of future scenes. To achieve this, we fine-tune only the Q-former within both brains to efficiently activate domain-specific knowledge in the LLM, enabling dynamic updates of logical reasoning and imagination during navigation. Furthermore, we introduce a cross-interaction mechanism to regularize the imagined outputs and inject them into a navigation expert module, allowing ATD to jointly exploit both the reasoning capacity of the LLM and the expertise of the navigation model. We conduct extensive experiments on the R2R benchmark, where ATD achieves state-of-the-art performance with fewer parameters. The code is available at https://github.com/zhangpingrui/Adaptive-Text-Dreamer.
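The following is a minimal, hypothetical PyTorch sketch of the dual-branch idea described in the abstract: two trainable Q-former branches read frozen LLM hidden states, a cross-attention step lets the imaginative branch be regularized by the logical one, and the fused output is projected for a downstream navigation expert. Module names, layer counts, and dimensions (QFormerBranch, AdaptiveTextDreamerSketch, d_model=768, nav_dim=512) are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a dual-branch ("left/right brain") imagination policy
# in the spirit of ATD. All module names, dimensions, and the navigation expert
# interface are illustrative assumptions, not the paper's official code.
import torch
import torch.nn as nn


class QFormerBranch(nn.Module):
    """Learnable queries that cross-attend to frozen LLM hidden states."""

    def __init__(self, d_model: int = 768, n_queries: int = 32, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        layer = nn.TransformerDecoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, llm_hidden: torch.Tensor) -> torch.Tensor:
        # llm_hidden: (batch, seq_len, d_model) from a frozen LLM backbone.
        q = self.queries.unsqueeze(0).expand(llm_hidden.size(0), -1, -1)
        return self.decoder(q, llm_hidden)  # (batch, n_queries, d_model)


class AdaptiveTextDreamerSketch(nn.Module):
    def __init__(self, d_model: int = 768, nav_dim: int = 512):
        super().__init__()
        # Only the two Q-former branches are trainable; the LLM stays frozen.
        self.left_brain = QFormerBranch(d_model)   # logical integration
        self.right_brain = QFormerBranch(d_model)  # imaginative prediction
        # Cross-interaction: the imaginative branch attends to the logical one.
        self.cross_attn = nn.MultiheadAttention(d_model, 8, batch_first=True)
        # Projection that injects the fused imagination into a navigation expert.
        self.to_nav_expert = nn.Linear(d_model, nav_dim)

    def forward(self, llm_hidden: torch.Tensor) -> torch.Tensor:
        logic = self.left_brain(llm_hidden)
        dream = self.right_brain(llm_hidden)
        # Regularize imagined tokens with the logical branch via cross-attention.
        fused, _ = self.cross_attn(query=dream, key=logic, value=logic)
        return self.to_nav_expert(fused)  # features consumed by the nav expert


if __name__ == "__main__":
    hidden = torch.randn(2, 64, 768)  # stand-in for frozen LLM hidden states
    print(AdaptiveTextDreamerSketch()(hidden).shape)  # torch.Size([2, 32, 512])
```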
Related papers
- Recursive Visual Imagination and Adaptive Linguistic Grounding for Vision Language Navigation [25.111153186227728]
Vision Language Navigation (VLN) typically requires agents to navigate to specified objects or remote regions in unknown scenes by obeying linguistic commands. Current agents suffer from overly detailed scene representation and ambiguous vision-language alignment. We propose a navigation policy that summarizes along-the-way visual perceptions and adaptively aligns them with commands to enhance linguistic grounding.
arXiv Detail & Related papers (2025-07-29T02:40:07Z)
- EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation [111.0993686148283]
We propose a novel sElf-improving embodied reasoning framework for boosting Vision-Language Navigation, dubbed EvolveNav. Our EvolveNav consists of two stages: (1) Formalized CoT Supervised Fine-Tuning, where we train the model with formalized CoT labels to activate the model's navigational reasoning capabilities and increase the reasoning speed; (2) Self-Reflective Post-Training, where the model is iteratively trained with its own reasoning outputs as self-enriched CoT labels to enhance the supervision diversity.
arXiv Detail & Related papers (2025-06-02T11:28:32Z)
- Cog-GA: A Large Language Models-based Generative Agent for Vision-Language Navigation in Continuous Environments [19.818370526976974]
Vision Language Navigation in Continuous Environments (VLN-CE) represents a frontier in embodied AI.
We introduce Cog-GA, a generative agent founded on large language models (LLMs) tailored for VLN-CE tasks.
Cog-GA employs a dual-pronged strategy to emulate human-like cognitive processes.
arXiv Detail & Related papers (2024-09-04T08:30:03Z)
- NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning [97.88246428240872]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions. Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability. This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), where parameter-efficient in-domain training enables self-guided navigational decision-making.
arXiv Detail & Related papers (2024-03-12T07:27:02Z)
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions (a minimal sketch of this pipeline appears after this list).
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action [76.71101507291473]
We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on unannotated large datasets of trajectories.
We show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data.
arXiv Detail & Related papers (2022-07-10T10:41:50Z)
- Pathdreamer: A World Model for Indoor Navigation [62.78410447776939]
We introduce Pathdreamer, a visual world model for agents navigating in novel indoor environments.
Given one or more previous visual observations, Pathdreamer generates plausible high-resolution 360° visual observations.
In regions of high uncertainty, Pathdreamer can predict diverse scenes, allowing an agent to sample multiple realistic outcomes.
arXiv Detail & Related papers (2021-05-18T18:13:53Z)
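As referenced in the LangNav entry above, the sketch below illustrates the general "language as perception" recipe: off-the-shelf captioning and object-detection models turn each view of a panorama into a text description that a language-based policy can consume. The Hugging Face pipeline tasks are real, but the specific checkpoints, the score threshold, and the describe_view helper are illustrative choices, not the exact models used in that paper.

```python
# Hypothetical sketch of converting panoramic views into text observations,
# in the spirit of LangNav. Checkpoints and thresholds are illustrative.
from PIL import Image
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
detector = pipeline("object-detection", model="facebook/detr-resnet-50")


def describe_view(view_images):
    """Convert a list of per-direction view images into one text observation."""
    lines = []
    for heading, image in enumerate(view_images):
        caption = captioner(image)[0]["generated_text"]
        objects = sorted({d["label"] for d in detector(image) if d["score"] > 0.8})
        lines.append(
            f"View {heading}: {caption}. Objects: {', '.join(objects) or 'none'}."
        )
    # The resulting string can be fed to an LLM policy in place of raw pixels.
    return "\n".join(lines)


if __name__ == "__main__":
    views = [Image.open(f"view_{i}.jpg") for i in range(4)]  # placeholder images
    print(describe_view(views))
```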