Fine-Tuning Vision-Language Models for Visual Navigation Assistance
- URL: http://arxiv.org/abs/2509.07488v1
- Date: Tue, 09 Sep 2025 08:08:35 GMT
- Title: Fine-Tuning Vision-Language Models for Visual Navigation Assistance
- Authors: Xiao Li, Bharat Gandhi, Ming Zhan, Mohit Nehra, Zhicheng Zhang, Yuchen Sun, Meijia Song, Naisheng Zhang, Xi Wang
- Abstract summary: We address vision-language-driven indoor navigation to assist visually impaired individuals in reaching a target location using images and natural language guidance. Our approach integrates vision and language models to generate step-by-step navigational instructions, enhancing accessibility and independence.
- Score: 28.43430422119113
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We address vision-language-driven indoor navigation to assist visually impaired individuals in reaching a target location using images and natural language guidance. Traditional navigation systems are ineffective indoors because precise location data is unavailable. Our approach integrates vision and language models to generate step-by-step navigational instructions, enhancing accessibility and independence. We fine-tune the BLIP-2 model with Low-Rank Adaptation (LoRA) on a manually annotated indoor navigation dataset. We propose an evaluation metric that refines the BERT F1 score by emphasizing directional and sequential terms, providing a more comprehensive measure of navigational performance. After LoRA fine-tuning, the model improved significantly at generating directional instructions, overcoming limitations of the original BLIP-2 model.
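To make the fine-tuning recipe concrete, here is a minimal sketch of applying LoRA adapters to BLIP-2 with the Hugging Face `transformers` and `peft` libraries. The checkpoint, target modules, rank, and placeholder training pair are illustrative assumptions, not the authors' exact configuration.

```python
from PIL import Image
from peft import LoraConfig, get_peft_model
from transformers import Blip2ForConditionalGeneration, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# LoRA freezes the base weights and trains small low-rank updates injected
# into the attention projections of the OPT language decoder.
config = LoraConfig(
    r=16,                                  # rank of the update (hypothetical)
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # decoder attention projections
)
model = get_peft_model(model, config)
model.print_trainable_parameters()         # typically well under 1% of weights

# One training step on a placeholder (image, instruction) pair; real pairs
# would come from the manually annotated indoor-navigation dataset.
image = Image.new("RGB", (224, 224))
text = "Walk forward, then turn left at the second door."
inputs = processor(images=image, text=text, return_tensors="pt")
loss = model(**inputs, labels=inputs["input_ids"]).loss
loss.backward()
```

The evaluation metric is described only at a high level, so the following is a hedged sketch of one way to weight directional and sequential terms on top of a standard BERT F1 score (via the `bert-score` package). The token list, ordering proxy, and weighting are illustrative, not the paper's published formula.

```python
from bert_score import score

DIRECTION_WORDS = {"left", "right", "forward", "backward", "straight", "turn"}

def _direction_sequence(text: str) -> list[str]:
    """Extract direction words in the order they appear."""
    tokens = (w.strip(".,!?") for w in text.lower().split())
    return [t for t in tokens if t in DIRECTION_WORDS]

def directional_overlap(candidate: str, reference: str) -> float:
    """Fraction of the reference's direction words that the candidate
    reproduces in the same order (a crude sequential-emphasis proxy)."""
    cand, ref = _direction_sequence(candidate), _direction_sequence(reference)
    if not ref:
        return 1.0
    return sum(c == r for c, r in zip(cand, ref)) / len(ref)

def navigation_f1(candidate: str, reference: str, w: float = 0.5) -> float:
    """Blend vanilla BERT F1 with the directional-overlap term."""
    _, _, f1 = score([candidate], [reference], lang="en")
    return (1 - w) * f1.item() + w * directional_overlap(candidate, reference)

print(navigation_f1("Turn left, then walk straight to the elevator.",
                    "Turn left and go straight until you reach the elevator."))
```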
Related papers
- History-Enhanced Two-Stage Transformer for Aerial Vision-and-Language Navigation [64.51891404034164]
Aerial Vision-and-Language Navigation (AVLN) requires Unmanned Aerial Vehicle (UAV) agents to localize targets in large-scale urban environments. Existing UAV agents typically adopt mono-granularity frameworks that struggle to balance these two aspects. This work proposes a History-Enhanced Two-Stage Transformer (HETT) framework, which integrates the two aspects through a coarse-to-fine navigation pipeline.
arXiv Detail & Related papers (2025-12-16T09:16:07Z)
- Floorplan2Guide: LLM-Guided Floorplan Parsing for BLV Indoor Navigation [4.3114959617830015]
We propose a novel navigation approach that transforms floor plans into navigable knowledge graphs and generates human-readable navigation instructions. Floorplan2Guide integrates a large language model (LLM) to extract spatial information from architectural layouts. Results indicate that few-shot learning improves navigation accuracy over zero-shot learning in both simulated and real-world evaluations.
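The graph formulation can be pictured with a small sketch: rooms become nodes, doorways become direction-labeled edges, and instructions fall out of a shortest path. This assumes `networkx`; the node names, direction labels, and sentence template are hypothetical, not Floorplan2Guide's actual pipeline.

```python
import networkx as nx

# Rooms and doorways extracted from a floor plan become nodes and edges,
# with a direction label attached to each edge.
g = nx.DiGraph()
g.add_edge("lobby", "hallway", direction="forward")
g.add_edge("hallway", "room_101", direction="left")
g.add_edge("hallway", "room_102", direction="right")

def path_to_instructions(graph: nx.DiGraph, start: str, goal: str) -> list[str]:
    """Render the shortest room-to-room path as turn-by-turn sentences."""
    path = nx.shortest_path(graph, start, goal)
    return [
        f"From {a}, go {graph.edges[a, b]['direction']} to reach {b}."
        for a, b in zip(path, path[1:])
    ]

print(path_to_instructions(g, "lobby", "room_101"))
# ['From lobby, go forward to reach hallway.',
#  'From hallway, go left to reach room_101.']
```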
arXiv Detail & Related papers (2025-12-13T04:49:26Z)
- PIG-Nav: Key Insights for Pretrained Image Goal Navigation Models [16.820485795257195]
PIG-Nav (Pretrained Image-Goal Navigation) is a new approach that further investigates pretraining strategies for vision-based navigation models. We identify two critical design choices that consistently improve the performance of pretrained navigation models. Our model achieves an average improvement of 22.6% in zero-shot settings and a 37.5% improvement in fine-tuning settings over existing visual navigation foundation models.
arXiv Detail & Related papers (2025-07-23T05:34:20Z)
- NavigScene: Bridging Local Perception and Global Navigation for Beyond-Visual-Range Autonomous Driving [10.597463021650382]
NavigScene is an auxiliary navigation-guided natural language dataset that simulates a human-like driving environment within autonomous driving systems. We develop three paradigms to leverage NavigScene: (1) Navigation-guided Reasoning, which enhances vision-language models by incorporating navigation context into the prompting approach; (2) Navigation-guided Preference Optimization, a reinforcement learning method that extends Direct Preference Optimization to improve vision-language model responses; and (3) Navigation-guided Vision-Language-Action model, which integrates navigation guidance and vision-language models with conventional driving models through feature fusion.
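For reference, the second paradigm builds on Direct Preference Optimization; below is a minimal sketch of the standard DPO objective it extends. The navigation-specific extension itself is not shown, and the tensor names are illustrative.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: widen the policy's margin for the preferred
    response over the rejected one, measured relative to a frozen reference."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```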
arXiv Detail & Related papers (2025-07-07T17:37:01Z)
- VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning [77.34267241692706]
Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions. We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLMs) to directly translate egocentric video streams into continuous navigation actions.
arXiv Detail & Related papers (2025-06-20T17:59:59Z)
- Observation-Graph Interaction and Key-Detail Guidance for Vision and Language Navigation [7.150985186031763]
Vision and Language Navigation (VLN) requires an agent to navigate through environments following natural language instructions. Existing methods often struggle with effectively integrating visual observations and instruction details during navigation. We propose OIKG, a novel framework that addresses these limitations through two key components.
arXiv Detail & Related papers (2025-03-14T02:05:16Z)
- NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning [97.88246428240872]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions. Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability. This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), where we perform parameter-efficient in-domain training to enable self-guided navigational decisions.
arXiv Detail & Related papers (2024-03-12T07:27:02Z)
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
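A minimal sketch of this perception-to-language step, assuming Hugging Face `transformers` pipelines; the specific captioning and detection checkpoints are illustrative stand-ins, not necessarily the ones LangNav uses.

```python
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
detector = pipeline("object-detection", model="facebook/detr-resnet-50")

def describe_view(image) -> str:
    """Turn one egocentric view (a PIL image or file path) into text."""
    caption = captioner(image)[0]["generated_text"]
    labels = sorted({d["label"] for d in detector(image)})
    return f"{caption.strip()} Visible objects: {', '.join(labels)}."
```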
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- Kefa: A Knowledge Enhanced and Fine-grained Aligned Speaker for Navigation Instruction Generation [70.76686546473994]
We introduce a novel speaker model KEFA for navigation instruction generation.
The proposed KEFA speaker achieves state-of-the-art instruction generation performance for both indoor and outdoor scenes.
arXiv Detail & Related papers (2023-07-25T09:39:59Z)
- NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models [17.495162643127003]
We introduce NavGPT to reveal the reasoning capability of GPT models in complex embodied scenes.
NavGPT takes the textual descriptions of visual observations, navigation history, and future explorable directions as inputs to reason about the agent's current status.
We show that NavGPT is capable of generating high-quality navigational instructions from observations and actions along a path.
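A hedged sketch of how such inputs might be assembled into a single reasoning prompt; the template and field names are illustrative, not the paper's exact format.

```python
def build_prompt(instruction: str, observation: str,
                 history: list[str], directions: list[str]) -> str:
    """Assemble NavGPT-style inputs into one reasoning prompt."""
    past = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(history)) or "None"
    options = "\n".join(f"- {d}" for d in directions)
    return (
        f"Instruction: {instruction}\n"
        f"Current observation: {observation}\n"
        f"Navigation history:\n{past}\n"
        f"Explorable directions:\n{options}\n"
        "Reason step by step about the current status, then pick one direction."
    )

print(build_prompt(
    "Go to the kitchen and stop by the fridge.",
    "A hallway with a door on the left and stairs ahead.",
    ["Moved forward from the bedroom."],
    ["left through the door", "forward up the stairs"],
))
```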
arXiv Detail & Related papers (2023-05-26T14:41:06Z)
- ULN: Towards Underspecified Vision-and-Language Navigation [77.81257404252132]
Underspecified Vision-and-Language Navigation (ULN) is a new setting for Vision-and-Language Navigation (VLN).
We propose a VLN framework that consists of a classification module, a navigation agent, and an Exploitation-to-Exploration (E2E) module.
Our framework is more robust and outperforms the baselines on ULN by 10% relative success rate across all levels.
arXiv Detail & Related papers (2022-10-18T17:45:06Z)