Related papers: Following Route Instructions using Large Vision-Language Models: A Comparison between Low-level and Panoramic Action Spaces

URL: http://arxiv.org/abs/2508.02917v1
Date: Mon, 04 Aug 2025 21:45:21 GMT
Title: Following Route Instructions using Large Vision-Language Models: A Comparison between Low-level and Panoramic Action Spaces
Authors: Vebjørn Haug Kåsene, Pierre Lison
Abstract summary: Vision-and-Language Navigation (VLN) enables autonomous robots to navigate unfamiliar environments by following natural language instructions. Current VLN systems rely on models specifically designed and optimized for navigation, leaving the potential of off-the-shelf LVLMs underexplored. This paper investigates whether off-the-shelf LVLMs can effectively support VLN tasks and whether such models can support both low-level and panoramic action paradigms.
Score: 2.2406151150434894
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-and-Language Navigation (VLN) refers to the task of enabling autonomous robots to navigate unfamiliar environments by following natural language instructions. While recent Large Vision-Language Models (LVLMs) have shown promise in this task, most current VLN systems rely on models specifically designed and optimized for navigation, leaving the potential of off-the-shelf LVLMs underexplored. Furthermore, while older VLN approaches used low-level action spaces with egocentric views and atomic actions (such as "turn left" or "move forward"), newer models tend to favor panoramic action spaces with discrete navigable viewpoints. This paper investigates (1) whether off-the-shelf LVLMs (fine-tuned without architectural modifications or simulator-based training) can effectively support VLN tasks and (2) whether such models can support both low-level and panoramic action paradigms. To this end, we fine-tune the open-source model Qwen2.5-VL-3B-Instruct on the Room-to-Room (R2R) dataset and evaluate its empirical performance across both low-level and panoramic action spaces. The best resulting model achieves a 41% success rate on the R2R test set, demonstrating that while off-the-shelf LVLMs can learn to perform Vision-and-Language Navigation, they still lag behind models specifically designed for this task.

Related papers

TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation [70.23578202012048]
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch. We propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. With the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction.
arXiv Detail & Related papers (2026-03-03T13:28:07Z)
Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation [18.136190060725102]
Beyond-the-View Navigation (BVN) requires agents to locate distant, unseen targets without dense and step-by-step guidance. Existing large language model (LLM)-based methods often suffer from short-sighted behaviors due to their reliance on short-horizon supervision. We propose SparseVideoNav, achieving sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon.
arXiv Detail & Related papers (2026-02-05T16:16:13Z)
Fine-Tuning Vision-Language Models for Visual Navigation Assistance [28.43430422119113]
We address vision-language-driven indoor navigation to assist visually impaired individuals in reaching a target location using images and natural language guidance. Our approach integrates vision and language models to generate step-by-step navigational instructions, enhancing accessibility and independence.
arXiv Detail & Related papers (2025-09-09T08:08:35Z)
EdgeVLA: Efficient Vision-Language-Action Models [0.4005096060512278]
This paper introduces Edge VLA, a novel approach designed to significantly enhance the inference speed of Vision-Language-Action (VLA) models. We achieve this through two key innovations: 1) Eliminating the autoregressive requirement for end-effector position prediction, leading to a 7x speedup in inference, and 2) Leveraging the efficiency of Small Language Models (SLMs). Our early results demonstrate that EVLA achieves comparable training characteristics to OpenVLA while offering substantial gains in inference speed and memory efficiency.
arXiv Detail & Related papers (2025-07-18T16:15:09Z)
Unified Vision-Language-Action Model [86.68814779303429]
We present UniVLA, a unified and native multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences. Our approach sets new state-of-the-art results across several widely used simulation benchmarks, including CALVIN, LIBERO, and Simplenv-Bridge. We further demonstrate its broad applicability on real-world ALOHA manipulation and autonomous driving.
arXiv Detail & Related papers (2025-06-24T17:59:57Z)
VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning [77.34267241692706]
Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions. We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLM) to directly translate egocentric video streams into continuous navigation actions.
arXiv Detail & Related papers (2025-06-20T17:59:59Z)
HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation [54.03004125910057]
We show that hierarchical vision-language-action models can be more effective in utilizing off-domain data than standard monolithic VLA models. We show that, with the hierarchical design, the high-level VLM can transfer across significant domain gaps between the off-domain finetuning data and real-robot testing scenarios.
arXiv Detail & Related papers (2025-02-08T07:50:22Z)
Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation [64.84996994779443]
We propose a novel Affordances-Oriented Planner for the continuous vision-language navigation (VLN) task. Our AO-Planner integrates various foundation models to achieve affordances-oriented low-level motion planning and high-level decision-making. Experiments on the challenging R2R-CE and RxR-CE datasets show that AO-Planner achieves state-of-the-art zero-shot performance.
arXiv Detail & Related papers (2024-07-08T12:52:46Z)
Sim-to-Real Transfer via 3D Feature Fields for Vision-and-Language Navigation [38.04404612393027]
Vision-and-language navigation (VLN) enables the agent to navigate to a remote location in 3D environments following the natural language instruction. In this work, we propose a sim-to-real transfer approach to endow the monocular robots with panoramic traversability perception and panoramic semantic understanding. Our VLN system outperforms previous SOTA monocular VLN methods in R2R-CE and RxR-CE benchmarks within the simulation environments and is also validated in real-world environments.
arXiv Detail & Related papers (2024-06-14T07:50:09Z)
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models [70.25499865569353]
We introduce HyperLLaVA, which involves adaptive tuning of the projector and LLM parameters, in conjunction with a dynamic visual expert and language expert. Our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench.
arXiv Detail & Related papers (2024-03-20T09:42:43Z)
NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning [97.88246428240872]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions. Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability. This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), where we fulfill parameter-efficient in-domain training to enable self-guided navigational decision.
arXiv Detail & Related papers (2024-03-12T07:27:02Z)
ULN: Towards Underspecified Vision-and-Language Navigation [77.81257404252132]
Underspecified Vision-and-Language Navigation (ULN) is a new setting for Vision-and-Language Navigation (VLN). We propose a VLN framework that consists of a classification module, a navigation agent, and an Exploitation-to-Exploration (E2E) module. Our framework is more robust and outperforms the baselines on ULN by 10% relative success rate across all levels.
arXiv Detail & Related papers (2022-10-18T17:45:06Z)
Reinforced Structured State-Evolution for Vision-Language Navigation [42.46176089721314]
The Vision-and-Language Navigation (VLN) task requires an embodied agent to navigate to a remote location following a natural language instruction. Previous methods usually adopt a sequence model (e.g., Transformer and LSTM) as the navigator. We propose a novel Structured state-Evolution (SEvol) model to effectively maintain the environment layout clues for VLN.
arXiv Detail & Related papers (2022-04-20T07:51:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.