Related papers: Observation-Graph Interaction and Key-Detail Guidance for Vision and Language Navigation

Observation-Graph Interaction and Key-Detail Guidance for Vision and Language Navigation

URL: http://arxiv.org/abs/2503.11006v1
Date: Fri, 14 Mar 2025 02:05:16 GMT
Title: Observation-Graph Interaction and Key-Detail Guidance for Vision and Language Navigation
Authors: Yifan Xie, Binkai Ou, Fei Ma, Yaohua Liu,
Abstract summary: Vision and Language Navigation (VLN) requires an agent to navigate through environments following natural language instructions.<n>Existing methods often struggle with effectively integrating visual observations and instruction details during navigation.<n>We propose OIKG, a novel framework that addresses these limitations through two key components.
Score: 7.150985186031763
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision and Language Navigation (VLN) requires an agent to navigate through environments following natural language instructions. However, existing methods often struggle with effectively integrating visual observations and instruction details during navigation, leading to suboptimal path planning and limited success rates. In this paper, we propose OIKG (Observation-graph Interaction and Key-detail Guidance), a novel framework that addresses these limitations through two key components: (1) an observation-graph interaction module that decouples angular and visual information while strengthening edge representations in the navigation space, and (2) a key-detail guidance module that dynamically extracts and utilizes fine-grained location and object information from instructions. By enabling more precise cross-modal alignment and dynamic instruction interpretation, our approach significantly improves the agent's ability to follow complex navigation instructions. Extensive experiments on the R2R and RxR datasets demonstrate that OIKG achieves state-of-the-art performance across multiple evaluation metrics, validating the effectiveness of our method in enhancing navigation precision through better observation-instruction alignment.

Related papers

GoViG: Goal-Conditioned Visual Navigation Instruction Generation [69.79110149746506]
We introduce Goal-Conditioned Visual Navigation Instruction Generation (GoViG), a new task that aims to autonomously generate precise and contextually coherent navigation instructions.<n>GoViG exclusively leverages raw egocentric visual data, substantially improving its adaptability to unseen and unstructured environments.
arXiv Detail & Related papers (2025-08-13T07:05:17Z)
Generating Vision-Language Navigation Instructions Incorporated Fine-Grained Alignment Annotations [4.483463511271561]
Vision-Language Navigation (VLN) enables intelligent agents to navigate environments by integrating visual perception and natural language instructions.<n>Existing datasets primarily focus on global instruction-trajectory matching, neglecting sub-instruction-level and entity-level alignments.<n>We propose FCA-NIG, a generative framework that automatically constructs navigation instructions with dual-level fine-grained cross-modal annotations.
arXiv Detail & Related papers (2025-06-10T08:36:51Z)
OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation [96.46961207887722]
OVER-NAV aims to go over and beyond the current arts of IVLN techniques. To fully exploit the interpreted navigation data, we introduce a structured representation, coded Omnigraph.
arXiv Detail & Related papers (2024-03-26T02:34:48Z)
TINA: Think, Interaction, and Action Framework for Zero-Shot Vision Language Navigation [11.591176410027224]
This paper presents a Vision-Language Navigation (VLN) agent based on Large Language Models (LLMs) We propose the Thinking, Interacting, and Action framework to compensate for the shortcomings of LLMs in environmental perception. Our approach also outperformed some supervised learning-based methods, highlighting its efficacy in zero-shot navigation.
arXiv Detail & Related papers (2024-03-13T05:22:39Z)
Aligning Knowledge Graph with Visual Perception for Object-goal Navigation [16.32780793344835]
We propose the Aligning Knowledge Graph with Visual Perception (AKGVP) method for object-goal navigation. Our approach introduces continuous modeling of the hierarchical scene architecture and leverages visual-language pre-training to align natural language description with visual perception. The integration of a continuous knowledge graph architecture and multimodal feature alignment empowers the navigator with a remarkable zero-shot navigation capability.
arXiv Detail & Related papers (2024-02-29T06:31:18Z)
NavHint: Vision and Language Navigation Agent with a Hint Generator [31.322331792911598]
We provide indirect supervision to the navigation agent through a hint generator that provides detailed visual descriptions. The hint generator assists the navigation agent in developing a global understanding of the visual environment. We evaluate our method on the R2R and R4R datasets and achieve state-of-the-art on several metrics.
arXiv Detail & Related papers (2024-02-04T16:23:16Z)
KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation [61.08389704326803]
Vision-and-language navigation (VLN) is the task to enable an embodied agent to navigate to a remote location following the natural language instruction in real scenes. Most of the previous approaches utilize the entire features or object-centric features to represent navigable candidates. We propose a Knowledge Enhanced Reasoning Model (KERM) to leverage knowledge to improve agent navigation ability.
arXiv Detail & Related papers (2023-03-28T08:00:46Z)
Contrastive Instruction-Trajectory Learning for Vision-Language Navigation [66.16980504844233]
A vision-language navigation (VLN) task requires an agent to reach a target with the guidance of natural language instruction. Previous works fail to discriminate the similarities and discrepancies across instruction-trajectory pairs and ignore the temporal continuity of sub-instructions. We propose a Contrastive Instruction-Trajectory Learning framework that explores invariance across similar data samples and variance across different ones to learn distinctive representations for robust navigation.
arXiv Detail & Related papers (2021-12-08T06:32:52Z)
SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation [57.12508968239015]
This work presents a transformer-based vision-and-language navigation (VLN) agent. It uses two different visual encoders -- a scene classification network and an object detector. Scene features contribute high-level contextual information that supports object-level processing.
arXiv Detail & Related papers (2021-10-27T03:29:34Z)
Adversarial Reinforced Instruction Attacker for Robust Vision-Language Navigation [145.84123197129298]
Language instruction plays an essential role in the natural language grounded navigation tasks. We exploit to train a more robust navigator which is capable of dynamically extracting crucial factors from the long instruction. Specifically, we propose a Dynamic Reinforced Instruction Attacker (DR-Attacker), which learns to mislead the navigator to move to the wrong target.
arXiv Detail & Related papers (2021-07-23T14:11:31Z)
Language and Visual Entity Relationship Graph for Agent Navigation [54.059606864535304]
Vision-and-Language Navigation (VLN) requires an agent to navigate in a real-world environment following natural language instructions. We propose a novel Language and Visual Entity Relationship Graph for modelling the inter-modal relationships between text and vision. Experiments show that by taking advantage of the relationships we are able to improve over state-of-the-art.
arXiv Detail & Related papers (2020-10-19T08:25:55Z)
Sub-Instruction Aware Vision-and-Language Navigation [46.99329933894108]
Vision-and-language navigation requires an agent to navigate through a real 3D environment following natural language instructions. We focus on the granularity of the visual and language sequences as well as the traceability of agents through the completion of an instruction. We propose effective sub-instruction attention and shifting modules that select and attend to a single sub-instruction at each time-step.
arXiv Detail & Related papers (2020-04-06T14:44:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.