A Navigation Framework Utilizing Vision-Language Models
- URL: http://arxiv.org/abs/2506.10172v1
- Date: Wed, 11 Jun 2025 20:51:58 GMT
- Title: A Navigation Framework Utilizing Vision-Language Models
- Authors: Yicheng Duan, Kaiyu Tang
- Abstract summary: Vision-and-Language Navigation (VLN) presents a complex challenge in embodied AI. Recent advances in large vision-language models (LVLMs), such as CLIP and Flamingo, have significantly improved multimodal understanding. We propose a modular, plug-and-play navigation framework that decouples vision-language understanding from action planning.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-and-Language Navigation (VLN) presents a complex challenge in embodied AI, requiring agents to interpret natural language instructions and navigate through visually rich, unfamiliar environments. Recent advances in large vision-language models (LVLMs), such as CLIP and Flamingo, have significantly improved multimodal understanding but introduced new challenges related to computational cost and real-time deployment. In this project, we propose a modular, plug-and-play navigation framework that decouples vision-language understanding from action planning. By integrating a frozen vision-language model, Qwen2.5-VL-7B-Instruct, with lightweight planning logic, we aim to achieve flexible, fast, and adaptable navigation without extensive model fine-tuning. Our framework leverages prompt engineering, structured history management, and a two-frame visual input strategy to enhance decision-making continuity across navigation steps. We evaluate our system on the Room-to-Room benchmark within the VLN-CE setting using the Matterport3D dataset and Habitat-Lab simulation environment. Although our initial results reveal challenges in generalizing to unseen environments under strict evaluation settings, our modular approach lays a foundation for scalable and efficient navigation systems, highlighting promising directions for future improvement through enhanced environmental priors and expanded multimodal input integration.
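The abstract describes a decoupled design: a frozen VLM (Qwen2.5-VL-7B-Instruct) handles vision-language understanding, while lightweight planning logic turns its output into actions, aided by prompt engineering, structured history management, and a two-frame visual input. Below is a minimal sketch of such a loop, assuming a generic VLM client, a VLN-CE-style discrete action space, and the helper names shown; these are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of a decoupled perception/planning loop: a frozen VLM is queried with a
# prompt (instruction + structured history) plus the two most recent frames, and
# lightweight logic maps its free-form reply onto a discrete action.
from dataclasses import dataclass, field
from typing import Callable, List

# VLN-CE-style discrete action space (assumed for illustration).
ACTIONS = ["MOVE_FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP"]


@dataclass
class NavigationHistory:
    """Structured history: a bounded log of past actions included in the prompt."""
    max_steps: int = 10
    entries: List[str] = field(default_factory=list)

    def add(self, step: int, action: str) -> None:
        self.entries.append(f"step {step}: {action}")
        self.entries = self.entries[-self.max_steps:]

    def as_text(self) -> str:
        return "\n".join(self.entries) if self.entries else "no previous actions"


def build_prompt(instruction: str, history: NavigationHistory) -> str:
    # Prompt engineering: combine the instruction with the structured history;
    # the two camera frames are passed alongside this text to the frozen VLM.
    return (
        "You are a navigation agent following this instruction:\n"
        f"{instruction}\n\n"
        f"Previous actions:\n{history.as_text()}\n\n"
        f"Given the previous and current camera frames, reply with one of {ACTIONS}."
    )


def navigate(instruction: str,
             get_frame: Callable[[], bytes],
             query_vlm: Callable[[str, List[bytes]], str],  # hypothetical frozen-VLM client
             max_steps: int = 50) -> List[str]:
    history = NavigationHistory()
    prev_frame = get_frame()
    trajectory: List[str] = []
    for step in range(max_steps):
        curr_frame = get_frame()
        reply = query_vlm(build_prompt(instruction, history),
                          [prev_frame, curr_frame])  # two-frame visual input
        # Lightweight planning logic: map the reply onto a valid discrete action.
        action = next((a for a in ACTIONS if a in reply.upper()), "MOVE_FORWARD")
        trajectory.append(action)
        history.add(step, action)
        if action == "STOP":
            break
        prev_frame = curr_frame
    return trajectory
```

Because the VLM stays frozen and is accessed only through `query_vlm`, the same planning loop could in principle be reused with a different model or simulator, which is the plug-and-play property the abstract emphasizes.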
Related papers
- VLMPlanner: Integrating Visual Language Models with Motion Planning [18.633637485218802]
VLMPlanner is a hybrid framework that combines a learning-based real-time planner with a vision-language model (VLM) capable of reasoning over raw images. We develop the Context-Adaptive Inference Gate mechanism that enables the VLM to mimic human driving behavior.
arXiv Detail & Related papers (2025-07-27T16:15:21Z) - NavigScene: Bridging Local Perception and Global Navigation for Beyond-Visual-Range Autonomous Driving [10.597463021650382]
NavigScene is an auxiliary navigation-guided natural language dataset that simulates a human-like driving environment within autonomous driving systems. We develop three paradigms to leverage NavigScene: (1) Navigation-guided Reasoning, which enhances vision-language models by incorporating navigation context into the prompting approach; (2) Navigation-guided Preference Optimization, a reinforcement learning method that extends Direct Preference Optimization to improve vision-language model responses; and (3) Navigation-guided Vision-Language-Action model, which integrates navigation guidance and vision-language models with conventional driving models through feature fusion.
arXiv Detail & Related papers (2025-07-07T17:37:01Z) - NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments [67.18144414660681]
Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to execute sequential navigation actions in complex environments guided by natural language instructions. Inspired by human cognition, we present NavMorph, a self-evolving world model framework that enhances environmental understanding and decision-making in VLN-CE tasks.
arXiv Detail & Related papers (2025-06-30T02:20:00Z) - VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning [77.34267241692706]
Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions. We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLMs) to directly translate egocentric video streams into continuous navigation actions.
arXiv Detail & Related papers (2025-06-20T17:59:59Z) - Think Hierarchically, Act Dynamically: Hierarchical Multi-modal Fusion and Reasoning for Vision-and-Language Navigation [11.23342183103283]
Vision-and-Language Navigation (VLN) aims to enable embodied agents to follow natural language instructions and reach target locations in real-world environments. We propose a Multi-level Fusion and Reasoning Architecture (MFRA) to enhance the agent's ability to reason over visual observations, language instructions, and navigation history.
arXiv Detail & Related papers (2025-04-23T08:41:27Z) - Mem2Ego: Empowering Vision-Language Models with Global-to-Ego Memory for Long-Horizon Embodied Navigation [35.71602601385161]
We present a novel vision-language model (VLM)-based navigation framework. Our approach enhances spatial reasoning and decision-making in long-horizon tasks. Experimental results demonstrate that the proposed method surpasses previous state-of-the-art approaches in object navigation tasks.
arXiv Detail & Related papers (2025-02-20T04:41:40Z) - Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method [94.74003109176581]
Long-Horizon Vision-Language Navigation (LH-VLN) is a novel VLN task that emphasizes long-term planning and decision consistency across consecutive subtasks. Our platform, benchmark, and method supply LH-VLN with a robust data generation pipeline, a comprehensive model evaluation dataset, reasonable metrics, and a novel VLN model.
arXiv Detail & Related papers (2024-12-12T09:08:13Z) - Navigation World Models [68.58459393846461]
We introduce the Navigation World Model (NWM), a controllable video generation model that predicts future visual observations based on past observations and navigation actions. In familiar environments, NWM can plan navigation trajectories by simulating them and evaluating whether they achieve the desired goal. Experiments demonstrate its effectiveness in planning trajectories from scratch or by ranking trajectories sampled from an external policy.
arXiv Detail & Related papers (2024-12-04T18:59:45Z) - Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs [95.8010627763483]
Mobility VLA is a hierarchical Vision-Language-Action (VLA) navigation policy that combines the environment understanding and common sense reasoning power of long-context VLMs.
We show that Mobility VLA achieves high end-to-end success rates on previously unsolved multimodal instructions.
arXiv Detail & Related papers (2024-07-10T15:49:07Z) - Cognitive Planning for Object Goal Navigation using Generative AI Models [0.979851640406258]
We present a novel framework for solving the object goal navigation problem that generates efficient exploration strategies.
Our approach enables a robot to navigate unfamiliar environments by leveraging Large Language Models (LLMs) and Large Vision-Language Models (LVLMs).
arXiv Detail & Related papers (2024-03-30T10:54:59Z) - Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration [83.96729205383501]
We introduce prompt-based learning to achieve fast adaptation for language embeddings.
Our model can adapt to diverse vision-language navigation tasks, including VLN and REVERIE.
arXiv Detail & Related papers (2022-03-08T11:01:24Z) - Structured Scene Memory for Vision-Language Navigation [155.63025602722712]
We propose a structured scene memory architecture for vision-language navigation (VLN).
It is compartmentalized enough to accurately memorize the percepts during navigation.
It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
arXiv Detail & Related papers (2021-03-05T03:41:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.