Boosting Zero-Shot VLN via Abstract Obstacle Map-Based Waypoint Prediction with TopoGraph-and-VisitInfo-Aware Prompting
- URL: http://arxiv.org/abs/2509.20499v1
- Date: Wed, 24 Sep 2025 19:21:39 GMT
- Title: Boosting Zero-Shot VLN via Abstract Obstacle Map-Based Waypoint Prediction with TopoGraph-and-VisitInfo-Aware Prompting
- Authors: Boqi Li, Siyuan Li, Weiyi Wang, Anran Li, Zhong Cao, Henry X. Liu
- Abstract summary: Vision-language navigation (VLN) has emerged as a key task for embodied agents with broad practical applications. We propose a zero-shot framework that integrates a simplified yet effective waypoint predictor with a multimodal large language model (MLLM). Experiments on R2R-CE and RxR-CE show that our method achieves state-of-the-art zero-shot performance, with success rates of 41% and 36%, respectively.
- Score: 18.325003967982827
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: With the rapid progress of foundation models and robotics, vision-language navigation (VLN) has emerged as a key task for embodied agents with broad practical applications. We address VLN in continuous environments, a particularly challenging setting where an agent must jointly interpret natural language instructions, perceive its surroundings, and plan low-level actions. We propose a zero-shot framework that integrates a simplified yet effective waypoint predictor with a multimodal large language model (MLLM). The predictor operates on an abstract obstacle map, producing linearly reachable waypoints, which are incorporated into a dynamically updated topological graph with explicit visitation records. The graph and visitation information are encoded into the prompt, enabling reasoning over both spatial structure and exploration history to encourage exploration and equip MLLM with local path planning for error correction. Extensive experiments on R2R-CE and RxR-CE show that our method achieves state-of-the-art zero-shot performance, with success rates of 41% and 36%, respectively, outperforming prior state-of-the-art methods.
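The abstract names two concrete mechanisms: a predictor that reads an abstract obstacle map and keeps only waypoints linearly reachable from the agent, and a dynamically updated topological graph with explicit visitation records that is serialized into the MLLM prompt. The paper's exact map format, graph schema, and prompt template are not reproduced here, so the following Python sketch only illustrates the general shape of both ideas under assumed conventions; `linearly_reachable`, `predict_waypoints`, `TopoGraph`, and the prompt layout are all invented for the example.

```python
import math
import numpy as np

def linearly_reachable(grid: np.ndarray, start, goal) -> bool:
    """True if the straight segment start -> goal crosses no occupied cell.
    `grid` is the abstract obstacle map: 1 = obstacle, 0 = free space."""
    (r0, c0), (r1, c1) = start, goal
    steps = max(abs(r1 - r0), abs(c1 - c0), 1)
    for t in np.linspace(0.0, 1.0, steps + 1):
        r = int(round(r0 + t * (r1 - r0)))
        c = int(round(c0 + t * (c1 - c0)))
        if grid[r, c] == 1:
            return False
    return True

def predict_waypoints(grid, agent, radius=8, num_angles=12):
    """Propose candidates at evenly spaced headings around the agent and keep
    in-bounds ones that are linearly reachable (a stand-in for the predictor)."""
    rows, cols = grid.shape
    waypoints = []
    for k in range(num_angles):
        theta = 2 * math.pi * k / num_angles
        cand = (int(agent[0] + radius * math.sin(theta)),
                int(agent[1] + radius * math.cos(theta)))
        if (0 <= cand[0] < rows and 0 <= cand[1] < cols
                and linearly_reachable(grid, agent, cand)):
            waypoints.append(cand)
    return waypoints

class TopoGraph:
    """Dynamically updated topological graph with explicit visitation records."""
    def __init__(self):
        self.nodes = {}     # node id -> (position, visited flag)
        self.edges = set()  # undirected pairs of node ids

    def add_node(self, nid, pos, visited=False):
        self.nodes[nid] = (pos, visited)

    def connect(self, a, b):
        self.edges.add(tuple(sorted((a, b))))

    def mark_visited(self, nid):
        pos, _ = self.nodes[nid]
        self.nodes[nid] = (pos, True)

    def to_prompt(self) -> str:
        """Serialize graph structure plus visit info for the MLLM prompt."""
        lines = ["Topological map (node: position, status):"]
        for nid, (pos, visited) in sorted(self.nodes.items()):
            lines.append(f"  {nid}: {pos}, {'visited' if visited else 'unexplored'}")
        lines.append("Edges: " + ", ".join(f"{a}-{b}" for a, b in sorted(self.edges)))
        return "\n".join(lines)
```

Prepending the `to_prompt()` output to the instruction is what gives the MLLM both spatial structure and exploration history, so it can prefer unexplored nodes and, when it detects a mistake, plan a local path back through already-mapped edges.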
Related papers
- TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation [70.23578202012048]
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to an inherent architectural mismatch. We propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. With the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction.
arXiv Detail & Related papers (2026-03-03T13:28:07Z)
- VLN-Zero: Rapid Exploration and Cache-Enabled Neurosymbolic Vision-Language Planning for Zero-Shot Transfer in Robot Navigation [52.00474922315126]
We present VLN-Zero, a vision-language navigation framework for unseen environments. It uses vision-language models to efficiently construct symbolic scene graphs and enable zero-shot neurosymbolic navigation; a sketch of the cache idea follows this entry. VLN-Zero achieves a 2x higher success rate than state-of-the-art zero-shot models, outperforms most fine-tuned baselines, and reaches goal locations in half the time.
arXiv Detail & Related papers (2025-09-23T03:23:03Z)
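The summary mentions symbolic scene graphs and a cache that enables fast plan reuse, but the abstract does not spell out the cache design. Below is a minimal Python sketch of one plausible realization: plans memoized by (start, goal) over a symbolic graph. `SceneGraph`, `plan_bfs`, and the cache key are all assumptions made for illustration, not VLN-Zero's actual schema.

```python
from collections import deque

class SceneGraph:
    """Tiny symbolic scene graph: nodes are labeled places,
    edges are traversable connections."""
    def __init__(self):
        self.adj = {}  # label -> set of neighboring labels

    def add_edge(self, a: str, b: str):
        self.adj.setdefault(a, set()).add(b)
        self.adj.setdefault(b, set()).add(a)

def plan_bfs(graph: SceneGraph, start: str, goal: str):
    """Shortest symbolic path via BFS; stands in for the neurosymbolic planner."""
    frontier, parents = deque([start]), {start: None}
    while frontier:
        node = frontier.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in graph.adj.get(node, ()):
            if nxt not in parents:
                parents[nxt] = node
                frontier.append(nxt)
    return None  # goal unreachable in the current graph

class CachedPlanner:
    """Memoize plans so repeated queries skip replanning entirely."""
    def __init__(self, graph: SceneGraph):
        self.graph, self.cache = graph, {}

    def plan(self, start: str, goal: str):
        key = (start, goal)
        if key not in self.cache:
            self.cache[key] = plan_bfs(self.graph, start, goal)
        return self.cache[key]
```

Real reuse would also need cache invalidation as exploration grows the graph, which is exactly the kind of detail the abstract leaves to the paper.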
- WoMAP: World Models For Embodied Open-Vocabulary Object Localization [8.947213246332764]
WoMAP (World Models for Active Perception) is a recipe for training open-vocabulary object localization policies. We show that WoMAP achieves strong generalization and sim-to-real transfer on a TidyBot.
arXiv Detail & Related papers (2025-06-02T12:35:14Z)
- Exploring Spatial Representation to Enhance LLM Reasoning in Aerial Vision-Language Navigation [11.267956604072845]
Aerial Vision-and-Language Navigation (VLN) is a novel task enabling Unmanned Aerial Vehicles (UAVs) to navigate in outdoor environments through natural language instructions and visual cues. We propose a training-free, zero-shot framework for aerial VLN tasks, where a large language model (LLM) serves as the agent for action prediction.
arXiv Detail & Related papers (2024-10-11T03:54:48Z)
- Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation [64.84996994779443]
We propose a novel Affordances-Oriented Planner for the continuous vision-language navigation (VLN) task.
Our AO-Planner integrates various foundation models to achieve affordances-oriented low-level motion planning and high-level decision-making.
Experiments on the challenging R2R-CE and RxR-CE datasets show that AO-Planner achieves state-of-the-art zero-shot performance.
arXiv Detail & Related papers (2024-07-08T12:52:46Z)
- Mind the Gap: Improving Success Rate of Vision-and-Language Navigation by Revisiting Oracle Success Routes [25.944819618283613]
Vision-and-Language Navigation (VLN) aims to navigate to the target location by following a given instruction.
We make the first attempt to tackle a long-ignored problem in VLN: narrowing the gap between Success Rate (SR) and Oracle Success Rate (OSR); the two metrics are sketched after this entry.
arXiv Detail & Related papers (2023-08-07T01:43:25Z)
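SR and OSR have standard definitions in VLN: an episode counts toward SR when the agent stops within a fixed distance of the goal (conventionally 3 m on R2R), and toward OSR when any point along its trajectory comes within that distance. The gap OSR minus SR therefore measures episodes where the agent reached the goal region but failed to stop. A minimal Python sketch of both metrics, assuming trajectories are lists of 3D positions:

```python
import math

def dist(p, q):
    """Euclidean distance between two 3D positions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def success_rate(trajectories, goals, threshold=3.0):
    """SR: fraction of episodes whose final position is within threshold of the goal."""
    hits = sum(dist(traj[-1], goal) <= threshold
               for traj, goal in zip(trajectories, goals))
    return hits / len(trajectories)

def oracle_success_rate(trajectories, goals, threshold=3.0):
    """OSR: fraction of episodes where *some* point on the path gets within
    threshold, i.e. success under an oracle stop rule. OSR >= SR by construction."""
    hits = sum(min(dist(p, goal) for p in traj) <= threshold
               for traj, goal in zip(trajectories, goals))
    return hits / len(trajectories)
```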
- Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation [87.03299519917019]
We propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding.
We build a topological map on-the-fly to enable efficient exploration in a global action space; a sketch of the dual-scale idea follows this entry.
The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation benchmarks.
arXiv Detail & Related papers (2022-02-23T19:06:53Z)
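The DUET summary describes reasoning at two scales: fine-grained candidates visible from the current viewpoint and coarse candidates drawn from the whole on-the-fly topological map. The abstract gives no implementation details, so the Python sketch below only illustrates assembling such a dual-scale candidate set; the map encoding and candidate tuples are assumptions made for the example.

```python
def dual_scale_candidates(topo_map, current, local_views):
    """Combine fine-grained local candidates with coarse global ones.

    topo_map: dict node_id -> {"visited": bool, "neighbors": set of node_ids}
    current: id of the node the agent currently occupies
    local_views: candidate headings observable from `current` (fine scale)

    Returns one flat list a policy can score jointly, mirroring the idea
    of acting locally while reasoning over the whole map globally.
    """
    # Fine scale: immediately executable moves from the current viewpoint.
    candidates = [("local", view) for view in local_views]
    # Coarse scale: any mapped-but-unvisited node is a long-range target,
    # reachable by planning a path through the topological map.
    for node_id, node in topo_map.items():
        if node_id != current and not node["visited"]:
            candidates.append(("global", node_id))
    # A stop action completes the global action space.
    candidates.append(("stop", None))
    return candidates
```

Scoring local and global candidates in a single set is what lets the agent jump back to a distant unvisited node instead of being limited to moves adjacent to its current position.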
- Waypoint Models for Instruction-guided Navigation in Continuous Environments [68.2912740006109]
We develop a class of language-conditioned waypoint prediction networks to examine this question.
We measure task performance and estimated execution time on a profiled LoCoBot robot.
Our models outperform prior work in VLN-CE and set a new state-of-the-art on the public leaderboard.
arXiv Detail & Related papers (2021-10-05T17:55:49Z)
- SOON: Scenario Oriented Object Navigation with Graph-based Exploration [102.74649829684617]
The ability to navigate like a human towards a language-guided target from anywhere in a 3D embodied environment is one of the 'holy grail' goals of intelligent robots.
Most visual navigation benchmarks focus on navigating toward a target from a fixed starting point, guided by an elaborate set of instructions that describes the route step by step.
This setting deviates from real-world problems, in which a human only describes what the object and its surroundings look like and asks the robot to start navigating from anywhere.
arXiv Detail & Related papers (2021-03-31T15:01:04Z)