Generating Vision-Language Navigation Instructions Incorporated Fine-Grained Alignment Annotations
- URL: http://arxiv.org/abs/2506.08566v1
- Date: Tue, 10 Jun 2025 08:36:51 GMT
- Authors: Yibo Cui, Liang Xie, Yu Zhao, Jiawei Sun, Erwei Yin
- Abstract summary: Vision-Language Navigation (VLN) enables intelligent agents to navigate environments by integrating visual perception and natural language instructions. Existing datasets primarily focus on global instruction-trajectory matching, neglecting sub-instruction-level and entity-level alignments. We propose FCA-NIG, a generative framework that automatically constructs navigation instructions with dual-level fine-grained cross-modal annotations.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Navigation (VLN) enables intelligent agents to navigate environments by integrating visual perception and natural language instructions, yet faces significant challenges due to the scarcity of fine-grained cross-modal alignment annotations. Existing datasets primarily focus on global instruction-trajectory matching, neglecting sub-instruction-level and entity-level alignments critical for accurate navigation action decision-making. To address this limitation, we propose FCA-NIG, a generative framework that automatically constructs navigation instructions with dual-level fine-grained cross-modal annotations. In this framework, an augmented trajectory is first divided into sub-trajectories, which are then processed through GLIP-based landmark detection, crafted instruction construction, OFA-Speaker based R2R-like instruction generation, and CLIP-powered entity selection, generating sub-instruction-trajectory pairs with entity-landmark annotations. Finally, these sub-pairs are aggregated to form a complete instruction-trajectory pair. The framework generates the FCA-R2R dataset, the first large-scale augmentation dataset featuring precise sub-instruction-sub-trajectory and entity-landmark alignments. Extensive experiments demonstrate that training with FCA-R2R significantly improves the performance of multiple state-of-the-art VLN agents, including SF, EnvDrop, RecBERT, and HAMT. Incorporating sub-instruction-trajectory alignment enhances agents' state awareness and decision accuracy, while entity-landmark alignment further boosts navigation performance and generalization. These results highlight the effectiveness of FCA-NIG in generating high-quality, scalable training data without manual annotation, advancing fine-grained cross-modal learning in complex navigation tasks.
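The five-stage pipeline described in the abstract (sub-trajectory splitting, landmark detection, instruction construction, instruction generation, entity selection, then aggregation) can be sketched in outline. This is a minimal illustration only: every function and name below is a hypothetical stand-in, and the toy stubs replace the real GLIP, OFA-Speaker, and CLIP components described in the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SubPair:
    """One sub-instruction-sub-trajectory pair with entity-landmark alignments."""
    sub_instruction: str
    sub_trajectory: List[str]               # viewpoint ids
    entity_landmarks: List[Tuple[str, str]]  # (entity phrase, landmark location)

def split_trajectory(viewpoints: List[str], chunk: int = 3) -> List[List[str]]:
    # Divide an augmented trajectory into sub-trajectories.
    # Fixed-size chunking is a stand-in heuristic, not the paper's method.
    return [viewpoints[i:i + chunk] for i in range(0, len(viewpoints), chunk)]

def annotate_sub_trajectory(sub_traj: List[str]) -> SubPair:
    # Toy stubs standing in for GLIP-based landmark detection,
    # OFA-Speaker instruction generation, and CLIP entity selection.
    landmarks = [f"landmark@{v}" for v in sub_traj]
    instruction = "walk past " + " then ".join(landmarks) + "."
    entities = list(zip(landmarks, sub_traj))
    return SubPair(instruction, sub_traj, entities)

def build_pair(viewpoints: List[str]) -> Tuple[str, List[SubPair]]:
    # Generate annotated sub-pairs, then aggregate them into one
    # complete instruction-trajectory pair.
    subs = [annotate_sub_trajectory(st) for st in split_trajectory(viewpoints)]
    full_instruction = " ".join(s.sub_instruction for s in subs)
    return full_instruction, subs
```

The point of the sketch is the data shape: each sub-pair carries its own entity-landmark alignments, so the aggregated pair retains dual-level (sub-instruction and entity) annotations rather than only a global instruction-trajectory match.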
Related papers
- DART: Dual Adaptive Refinement Transfer for Open-Vocabulary Multi-Label Recognition [59.203152078315235]
Open-Vocabulary Multi-Label Recognition (OV-MLR) aims to identify multiple seen and unseen object categories within an image. Vision-Language Pre-training models offer a strong open-vocabulary foundation, but struggle with fine-grained localization under weak supervision. We propose the Dual Adaptive Refinement Transfer (DART) framework to overcome these limitations.
arXiv Detail & Related papers (2025-08-07T17:22:33Z) - EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation [111.0993686148283]
We propose a novel sElf-improving embodied reasoning framework for boosting Vision-Language Navigation, dubbed EvolveNav. Our EvolveNav consists of two stages: (1) Formalized CoT Supervised Fine-Tuning, where we train the model with formalized CoT labels to activate the model's navigational reasoning capabilities and increase the reasoning speed; (2) Self-Reflective Post-Training, where the model is iteratively trained with its own reasoning outputs as self-enriched CoT labels to enhance the supervision diversity.
arXiv Detail & Related papers (2025-06-02T11:28:32Z) - ST-Booster: An Iterative SpatioTemporal Perception Booster for Vision-and-Language Navigation in Continuous Environments [1.9566515100805284]
VLN-CE requires agents to navigate continuous spaces based on natural language instructions. This paper introduces ST-Booster, a navigation booster that enhances performance through multi-granularity perception and instruction-aware reasoning. Extensive experiments and performance analyses are conducted, demonstrating that ST-Booster outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2025-04-14T03:29:08Z) - Observation-Graph Interaction and Key-Detail Guidance for Vision and Language Navigation [7.150985186031763]
Vision and Language Navigation (VLN) requires an agent to navigate through environments following natural language instructions. Existing methods often struggle with effectively integrating visual observations and instruction details during navigation. We propose OIKG, a novel framework that addresses these limitations through two key components.
arXiv Detail & Related papers (2025-03-14T02:05:16Z) - Without Paired Labeled Data: End-to-End Self-Supervised Learning for Drone-view Geo-Localization [2.733505168507872]
Drone-view Geo-Localization (DVGL) aims to achieve accurate localization of drones by retrieving the most relevant GPS-tagged satellite images. Existing methods heavily rely on strictly pre-paired drone-satellite images for supervised learning. We propose an end-to-end self-supervised learning method with a shallow backbone network.
arXiv Detail & Related papers (2025-02-17T02:53:08Z) - DELAN: Dual-Level Alignment for Vision-and-Language Navigation by Cross-Modal Contrastive Learning [40.87681228125296]
Vision-and-Language Navigation (VLN) requires an agent to navigate in unseen environments by following natural language instructions.
For task completion, the agent needs to align and integrate various navigation modalities, including instruction, observation and navigation history.
arXiv Detail & Related papers (2024-04-02T14:40:04Z) - Bidirectional Trained Tree-Structured Decoder for Handwritten Mathematical Expression Recognition [51.66383337087724]
The Handwritten Mathematical Expression Recognition (HMER) task is a critical branch in the field of OCR.
Recent studies have demonstrated that incorporating bidirectional context information significantly improves the performance of HMER models.
We propose the Mirror-Flipped Symbol Layout Tree (MF-SLT) and Bidirectional Asynchronous Training (BAT) structure.
arXiv Detail & Related papers (2023-12-31T09:24:21Z) - Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation [23.94546957057613]
Cross-modal alignment is a key challenge for Vision-and-Language Navigation (VLN).
We propose a novel Grounded Entity-Landmark Adaptive (GELA) pre-training paradigm for VLN tasks.
arXiv Detail & Related papers (2023-08-24T06:25:20Z) - ULN: Towards Underspecified Vision-and-Language Navigation [77.81257404252132]
Underspecified Vision-and-Language Navigation (ULN) is a new setting for Vision-and-Language Navigation (VLN).
We propose a VLN framework that consists of a classification module, a navigation agent, and an Exploitation-to-Exploration (E2E) module.
Our framework is more robust and outperforms the baselines on ULN by 10% relative success rate across all levels.
arXiv Detail & Related papers (2022-10-18T17:45:06Z) - Semi-supervised Domain Adaptive Structure Learning [72.01544419893628]
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains.
We introduce an adaptive structure learning method to regularize the cooperation of SSL and DA.
arXiv Detail & Related papers (2021-12-12T06:11:16Z) - Contrastive Instruction-Trajectory Learning for Vision-Language Navigation [66.16980504844233]
A vision-language navigation (VLN) task requires an agent to reach a target with the guidance of natural language instruction.
Previous works fail to discriminate the similarities and discrepancies across instruction-trajectory pairs and ignore the temporal continuity of sub-instructions.
We propose a Contrastive Instruction-Trajectory Learning framework that explores invariance across similar data samples and variance across different ones to learn distinctive representations for robust navigation.
arXiv Detail & Related papers (2021-12-08T06:32:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.