VPN: Visual Prompt Navigation
- URL: http://arxiv.org/abs/2508.01766v2
- Date: Tue, 05 Aug 2025 07:35:17 GMT
- Title: VPN: Visual Prompt Navigation
- Authors: Shuo Feng, Zihan Wang, Yuchen Li, Rui Kong, Hengyi Cai, Shuaiqiang Wang, Gim Hee Lee, Piji Li, Shuqiang Jiang,
- Abstract summary: Visual Prompt Navigation (VPN) is a novel paradigm that guides agents to navigate using only user-provided visual prompts. The visual prompt marks the navigation trajectory on a top-down view of the scene. VPN is friendlier for non-expert users and reduces interpretive ambiguity.
- Score: 86.7782248763078
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While natural language is commonly used to guide embodied agents, the inherent ambiguity and verbosity of language often hinder the effectiveness of language-guided navigation in complex environments. To this end, we propose Visual Prompt Navigation (VPN), a novel paradigm that guides agents to navigate using only user-provided visual prompts within 2D top-view maps. This visual prompt primarily focuses on marking the visual navigation trajectory on a top-down view of a scene, offering intuitive and spatially grounded guidance without relying on language instructions. It is friendlier for non-expert users and reduces interpretive ambiguity. We build VPN tasks in both discrete and continuous navigation settings, constructing two new datasets, R2R-VP and R2R-CE-VP, by extending existing R2R and R2R-CE episodes with corresponding visual prompts. Furthermore, we introduce VPNet, a dedicated baseline network to handle the VPN tasks, with two data augmentation strategies: view-level augmentation (altering initial headings and prompt orientations) and trajectory-level augmentation (incorporating diverse trajectories from large-scale 3D scenes), to enhance navigation performance. Extensive experiments evaluate how visual prompt forms, top-view map formats, and data augmentation strategies affect the performance of visual prompt navigation. The code is available at https://github.com/farlit/VPN.
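To make the view-level augmentation concrete, here is a minimal sketch of the idea: rotate the annotated top-view map and shift the agent's initial heading by the same angle so the prompt and the starting orientation stay consistent. The function name, the rotation set, and the heading sign convention are illustrative assumptions, not VPNet's actual implementation.

```python
# Minimal sketch of a view-level augmentation step (illustrative only).
import random
from PIL import Image

def view_level_augment(top_view: Image.Image, initial_heading_deg: float):
    """Rotate the annotated top-view map and shift the agent's initial
    heading by the same angle, keeping prompt and heading consistent."""
    angle = random.choice([0, 90, 180, 270])           # hypothetical rotation set
    rotated_map = top_view.rotate(angle, expand=True)  # PIL rotates counter-clockwise
    new_heading = (initial_heading_deg - angle) % 360  # sign convention is an assumption
    return rotated_map, new_heading

if __name__ == "__main__":
    prompt_map = Image.new("RGB", (512, 512))  # placeholder for an annotated top-view map
    aug_map, heading = view_level_augment(prompt_map, initial_heading_deg=30.0)
    print(aug_map.size, heading)
```

Trajectory-level augmentation, by contrast, would add new map-trajectory pairs drawn from additional large-scale 3D scenes rather than transforming existing ones.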
Related papers
- Do Visual Imaginations Improve Vision-and-Language Navigation Agents? [16.503837141587447]
Vision-and-Language Navigation (VLN) agents are tasked with navigating an unseen environment using natural language instructions. We study whether visual representations of sub-goals implied by the instructions can serve as navigational cues and improve navigation performance.
arXiv Detail & Related papers (2025-03-20T17:53:12Z)
- Why Only Text: Empowering Vision-and-Language Navigation with Multi-modal Prompts [37.20272055902246]
Vision-and-Language Navigation with Multi-modal Prompts (VLN-MP) is a novel task augmenting traditional VLN by integrating both natural language and images in instructions.
VLN-MP not only maintains backward compatibility by effectively handling text-only prompts but also consistently shows advantages with different quantities and relevance of visual prompts.
arXiv Detail & Related papers (2024-06-04T11:06:13Z)
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN)
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
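As a rough illustration of this kind of pipeline, the sketch below converts a single egocentric view into text with off-the-shelf Hugging Face models; the specific captioner and detector (BLIP and DETR) and the score threshold are stand-in choices, not the ones used in LangNav.

```python
from transformers import pipeline

# Off-the-shelf models used purely as stand-ins (illustrative choices).
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
detector = pipeline("object-detection", model="facebook/detr-resnet-50")

def describe_view(image_path: str) -> str:
    """Convert one egocentric view into a natural-language description."""
    caption = captioner(image_path)[0]["generated_text"]
    objects = {d["label"] for d in detector(image_path) if d["score"] > 0.8}
    return f"{caption}. Visible objects: {', '.join(sorted(objects)) or 'none'}."

# A panoramic observation could then be summarized view by view, e.g.:
# descriptions = [describe_view(p) for p in panorama_view_paths]
```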
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- ESceme: Vision-and-Language Navigation with Episodic Scene Memory [72.69189330588539]
Vision-and-language navigation (VLN) simulates a visual agent that follows natural-language navigation instructions in real-world scenes.
We introduce a mechanism of Episodic Scene memory (ESceme) for VLN that wakes an agent's memories of past visits when it enters the current scene.
arXiv Detail & Related papers (2023-03-02T07:42:07Z)
- SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation [57.12508968239015]
This work presents a transformer-based vision-and-language navigation (VLN) agent.
It uses two different visual encoders -- a scene classification network and an object detector.
Scene features contribute high-level contextual information that supports object-level processing.
arXiv Detail & Related papers (2021-10-27T03:29:34Z)
- VTNet: Visual Transformer Network for Object Goal Navigation [36.15625223586484]
We introduce a Visual Transformer Network (VTNet) for learning informative visual representation in navigation.
In a nutshell, VTNet embeds object and region features with their location cues as spatial-aware descriptors.
Experiments in the artificial environment AI2-Thor demonstrate that VTNet significantly outperforms state-of-the-art methods in unseen testing environments.
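One way to picture such spatial-aware descriptors is to concatenate each object's appearance feature with a normalized location cue and project the result into a shared embedding space; the dimensions and fusion scheme in the sketch below are assumptions for illustration, not VTNet's exact design.

```python
import torch
import torch.nn as nn

class SpatialAwareDescriptor(nn.Module):
    """Fuse object appearance features with location cues (assumed here to be
    a normalized bounding box plus its area) into a single descriptor."""
    def __init__(self, feat_dim: int = 512, loc_dim: int = 5, embed_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(feat_dim + loc_dim, embed_dim)

    def forward(self, obj_feats: torch.Tensor, obj_locs: torch.Tensor) -> torch.Tensor:
        # obj_feats: (N, feat_dim) detector features; obj_locs: (N, loc_dim) location cues
        return torch.relu(self.proj(torch.cat([obj_feats, obj_locs], dim=-1)))

descriptors = SpatialAwareDescriptor()(torch.randn(8, 512), torch.rand(8, 5))
print(descriptors.shape)  # torch.Size([8, 256])
```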
arXiv Detail & Related papers (2021-05-20T01:23:15Z)
- Know What and Know Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation [120.90387630691816]
Vision-and-Language Navigation (VLN) requires an agent to navigate to a remote location on the basis of natural-language instructions and a set of photo-realistic panoramas.
Most existing methods take words in instructions and discrete views of each panorama as the minimal unit of encoding.
We propose an object-informed sequential BERT to encode visual perceptions and linguistic instructions at the same fine-grained level.
arXiv Detail & Related papers (2021-04-09T02:44:39Z)
- Structured Scene Memory for Vision-Language Navigation [155.63025602722712]
We propose a structured scene memory for vision-language navigation (VLN).
It is compartmentalized enough to accurately memorize the percepts during navigation.
It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
arXiv Detail & Related papers (2021-03-05T03:41:00Z)