Related papers: Agent Journey Beyond RGB: Unveiling Hybrid Semantic-Spatial Environmental Representations for Vision-and-Language Navigation

Agent Journey Beyond RGB: Unveiling Hybrid Semantic-Spatial Environmental Representations for Vision-and-Language Navigation

URL: http://arxiv.org/abs/2412.06465v3
Date: Thu, 12 Dec 2024 03:56:01 GMT
Title: Agent Journey Beyond RGB: Unveiling Hybrid Semantic-Spatial Environmental Representations for Vision-and-Language Navigation
Authors: Xuesong Zhang, Yunbo Xu, Jia Li, Zhenzhen Hu, Richnag Hong,
Abstract summary: Navigating unseen environments based on natural language instructions remains difficult for egocentric agents in Vision-and-Language Navigation (VLN)<n>We propose a versatile Semantic Understanding and Spatial Awareness architecture to facilitate navigation.<n>We show that SUSA hybrid semantic-spatial representations effectively enhance navigation performance, setting new state-of-the-art performance across three VLN benchmarks (REVERIE, R2R, and SOON)
Score: 15.302043040651368
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Navigating unseen environments based on natural language instructions remains difficult for egocentric agents in Vision-and-Language Navigation (VLN). While recent advancements have yielded promising outcomes, they primarily rely on RGB images for environmental representation, often overlooking the underlying semantic knowledge and spatial cues. Intuitively, humans inherently ground textual semantics within the spatial layout during indoor navigation. Inspired by this, we propose a versatile Semantic Understanding and Spatial Awareness (SUSA) architecture to facilitate navigation. SUSA includes a Textual Semantic Understanding (TSU) module, which narrows the modality gap between instructions and environments by generating and associating the descriptions of environmental landmarks in the agent's immediate surroundings. Additionally, a Depth-based Spatial Perception (DSP) module incrementally constructs a depth exploration map, enabling a more nuanced comprehension of environmental layouts. Experimental results demonstrate that SUSA hybrid semantic-spatial representations effectively enhance navigation performance, setting new state-of-the-art performance across three VLN benchmarks (REVERIE, R2R, and SOON). The source code will be publicly available.

Related papers

NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments [67.18144414660681]
Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to execute sequential navigation actions in complex environments guided by natural language instructions.<n>Inspired by human cognition, we present NavMorph, a self-evolving world model framework that enhances environmental understanding and decision-making in VLN-CE tasks.
arXiv Detail & Related papers (2025-06-30T02:20:00Z)
Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech [39.74416731035842]
M2SE-VTTS aims to take the environmental image as the prompt to synthesize the reverberant speech for the spoken content. We propose a novel multi-modal and multi-scale spatial environment understanding scheme to achieve immersive VTTS. Our model outperforms the advanced baselines in environmental speech generation.
arXiv Detail & Related papers (2024-12-16T03:25:23Z)
UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation [71.97405667493477]
We introduce a novel, generalizable 3DGS-based pre-training paradigm, called UnitedVLN. It enables agents to better explore future environments by unitedly rendering high-fidelity 360 visual images and semantic features. UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.
arXiv Detail & Related papers (2024-11-25T02:44:59Z)
Cog-GA: A Large Language Models-based Generative Agent for Vision-Language Navigation in Continuous Environments [19.818370526976974]
Vision Language Navigation in Continuous Environments (VLN-CE) represents a frontier in embodied AI. We introduce Cog-GA, a generative agent founded on large language models (LLMs) tailored for VLN-CE tasks. Cog-GA employs a dual-pronged strategy to emulate human-like cognitive processes.
arXiv Detail & Related papers (2024-09-04T08:30:03Z)
Vision and Language Navigation in the Real World via Online Visual Language Mapping [18.769171505280127]
Vision-and-language navigation (VLN) methods are mainly evaluated in simulation. We propose a novel framework to address the VLN task in the real world. We evaluate the proposed pipeline on an Interbotix LoCoBot WX250 in an unseen lab environment.
arXiv Detail & Related papers (2023-10-16T20:44:09Z)
Kefa: A Knowledge Enhanced and Fine-grained Aligned Speaker for Navigation Instruction Generation [70.76686546473994]
We introduce a novel speaker model textscKefa for navigation instruction generation. The proposed KEFA speaker achieves state-of-the-art instruction generation performance for both indoor and outdoor scenes.
arXiv Detail & Related papers (2023-07-25T09:39:59Z)
Learning Navigational Visual Representations with Semantic Map Supervision [85.91625020847358]
We propose a navigational-specific visual representation learning method by contrasting the agent's egocentric views and semantic maps. Ego$2$-Map learning transfers the compact and rich information from a map, such as objects, structure and transition, to the agent's egocentric representations for navigation.
arXiv Detail & Related papers (2023-07-23T14:01:05Z)
Graph based Environment Representation for Vision-and-Language Navigation in Continuous Environments [20.114506226598508]
Vision-and-Language Navigation in Continuous Environments (VLN-CE) is a navigation task that requires an agent to follow a language instruction in a realistic environment. We propose a new environment representation in order to solve the above problems.
arXiv Detail & Related papers (2023-01-11T08:04:18Z)
CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations [98.30038910061894]
Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions. We propose CLEAR: Cross-Lingual and Environment-Agnostic Representations. Our language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation task.
arXiv Detail & Related papers (2022-07-05T17:38:59Z)
Towards Navigation by Reasoning over Spatial Configurations [20.324906029170457]
We show the importance of spatial semantics in grounding navigation instructions into visual perceptions. We propose a neural agent that uses the elements of spatial configurations and investigate their influence on the navigation agent's reasoning ability.
arXiv Detail & Related papers (2021-05-14T14:04:23Z)
Structured Scene Memory for Vision-Language Navigation [155.63025602722712]
We propose a crucial architecture for vision-language navigation (VLN) It is compartmentalized enough to accurately memorize the percepts during navigation. It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
arXiv Detail & Related papers (2021-03-05T03:41:00Z)
Environment-agnostic Multitask Learning for Natural Language Grounded Navigation [88.69873520186017]
We introduce a multitask navigation model that can be seamlessly trained on Vision-Language Navigation (VLN) and Navigation from Dialog History (NDH) tasks. Experiments show that environment-agnostic multitask learning significantly reduces the performance gap between seen and unseen environments.
arXiv Detail & Related papers (2020-03-01T09:06:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.