TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2502.07306v1
- Date: Tue, 11 Feb 2025 07:09:37 GMT
- Title: TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation
- Authors: Navid Rajabi, Jana Kosecka
- Abstract summary: We propose a modular approach for the Vision-Language Navigation (VLN) task.
We use state-of-the-art Large Language Models (LLMs) and Vision-Language Models (VLMs) in a zero-shot setting.
We demonstrate superior performance compared to other approaches that use joint semantic maps.
- Score: 3.2688425993442696
- Abstract: In this work, we propose a modular approach for the Vision-Language Navigation (VLN) task by decomposing the problem into four sub-modules that use state-of-the-art Large Language Models (LLMs) and Vision-Language Models (VLMs) in a zero-shot setting. Given a navigation instruction in natural language, we first prompt an LLM to extract the landmarks and the order in which they are visited. Assuming a known model of the environment, we retrieve the top-$k$ locations of the last landmark and generate $k$ path hypotheses from the starting location to the last landmark using a shortest-path algorithm on the topological map of the environment. Each path hypothesis is represented by a sequence of panoramas. We then use dynamic programming to compute the alignment score between the sequence of panoramas and the sequence of landmark names, using match scores obtained from the VLM. Finally, we compute the nDTW metric between the hypothesis that yields the highest alignment score and the ground-truth path to evaluate path fidelity. We demonstrate superior performance compared to other approaches that use joint semantic maps, such as VLMaps \cite{vlmaps}, on the complex R2R-Habitat \cite{r2r} instruction dataset, and quantify in detail the effect of visual grounding on navigation performance.
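The retrieval-and-alignment core of this pipeline is compact enough to sketch. Below is a minimal, illustrative reconstruction, not the authors' released code: `vlm_match_score` is a stand-in for whatever VLM image-text matching score is used, the topological map is assumed to be a weighted networkx graph, and landmarks must be matched to panoramas in order, so the alignment is a monotone dynamic program.

```python
import networkx as nx
import numpy as np

def path_hypotheses(G, start, goal_candidates):
    """One shortest path per retrieved goal location: k hypotheses.
    G is the topological map; nodes are panorama locations."""
    return [nx.shortest_path(G, start, g, weight="weight") for g in goal_candidates]

def alignment_score(match):
    """Best in-order alignment of landmarks (rows) to panoramas (columns).
    match[i, j] = VLM score of landmark i against panorama j."""
    L, P = match.shape
    dp = np.full((L + 1, P + 1), -np.inf)
    dp[0, :] = 0.0  # zero landmarks matched costs nothing
    for i in range(1, L + 1):
        for j in range(1, P + 1):
            dp[i, j] = max(dp[i, j - 1],                            # skip panorama j
                           dp[i - 1, j - 1] + match[i - 1, j - 1])  # match landmark i at j
    return dp[L, P]

def best_hypothesis(hypotheses, landmarks, vlm_match_score):
    """Pick the panorama sequence that best explains the landmark order."""
    scores = [alignment_score(np.array([[vlm_match_score(lm, pano) for pano in path]
                                        for lm in landmarks]))
              for path in hypotheses]
    return hypotheses[int(np.argmax(scores))]
```

On this reading, nDTW is computed afterwards, between the selected hypothesis and the ground-truth path, purely as an evaluation metric; it plays no role in hypothesis selection.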
Related papers
- NavTopo: Leveraging Topological Maps For Autonomous Navigation Of a Mobile Robot [1.0550841723235613]
We propose a full navigation pipeline based on a topological map and two-level path planning.
The pipeline localizes in the graph by matching neural network descriptors and 2D projections of the input point clouds.
We test our approach in a large indoor photo-realistic simulated environment and compare it to a metric map-based approach built on the popular metric mapping method RTAB-Map (a minimal localization sketch follows).
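Localization by descriptor matching on a graph reduces to a nearest-neighbor search over node descriptors. A minimal sketch assuming precomputed per-node descriptors and cosine similarity (generic, not NavTopo's actual matcher):

```python
import numpy as np

def localize(query_desc, node_descs):
    """Return the node whose stored descriptor best matches the query.
    query_desc: (d,) descriptor of the current observation.
    node_descs: dict of node id -> (d,) descriptor."""
    ids = list(node_descs)
    D = np.stack([node_descs[i] for i in ids])
    D = D / np.linalg.norm(D, axis=1, keepdims=True)  # unit-normalize rows
    q = query_desc / np.linalg.norm(query_desc)
    sims = D @ q                                      # cosine similarities
    best = int(np.argmax(sims))
    return ids[best], float(sims[best])
```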
arXiv Detail & Related papers (2024-10-15T10:54:49Z)
- PRET: Planning with Directed Fidelity Trajectory for Vision and Language Navigation [30.710806048991923]
Vision and language navigation is a task that requires an agent to navigate according to a natural language instruction.
Recent methods predict sub-goals on a constructed topological map at each step to enable long-term action planning.
We propose an alternative method that facilitates navigation planning by considering the alignment between instructions and directed fidelity trajectories.
arXiv Detail & Related papers (2024-07-16T08:22:18Z)
- PivotNet: Vectorized Pivot Learning for End-to-end HD Map Construction [10.936405710245625]
We propose a simple yet effective architecture named PivotNet, which adopts unified pivot-based map representations.
PivotNet outperforms other state-of-the-art methods by at least 5.9 mAP.
arXiv Detail & Related papers (2023-08-31T05:43:46Z)
- Towards Unified Token Learning for Vision-Language Tracking [65.96561538356315]
We present a vision-language (VL) tracking pipeline, termed MMTrack, which casts VL tracking as a token generation task.
Our proposed framework serializes the language description and bounding box into a sequence of discrete tokens (see the quantization sketch below).
In this new design paradigm, all token queries are required to perceive the desired target and directly predict spatial coordinates of the target.
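Serializing a bounding box into discrete tokens usually means uniform quantization of normalized coordinates into a fixed vocabulary. A generic sketch of that idea (the bin count `n_bins=1000` is an assumed parameter, not MMTrack's exact scheme):

```python
def box_to_tokens(box, img_w, img_h, n_bins=1000):
    """Quantize (x1, y1, x2, y2) pixel coordinates into discrete token ids."""
    x1, y1, x2, y2 = box
    normalized = (x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h)
    return [min(int(v * n_bins), n_bins - 1) for v in normalized]

def tokens_to_box(tokens, img_w, img_h, n_bins=1000):
    """Invert the quantization back to approximate pixel coordinates."""
    x1, y1, x2, y2 = [(t + 0.5) / n_bins for t in tokens]
    return (x1 * img_w, y1 * img_h, x2 * img_w, y2 * img_h)
```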
arXiv Detail & Related papers (2023-08-27T13:17:34Z)
- GridMM: Grid Memory Map for Vision-and-Language Navigation [40.815400962166535]
Vision-and-language navigation (VLN) enables an agent to navigate to a remote location in 3D environments by following a natural language instruction.
We build a top-down, egocentric, and dynamically growing Grid Memory Map to structure the visited environment.
From a global perspective, historical observations are projected into a unified top-down grid map, which better represents the spatial relations of the environment (a minimal projection sketch follows).
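Scattering observations into a top-down grid is a standard projection; a minimal sketch with assumed cell size and map extent (illustrative, not GridMM's implementation):

```python
import numpy as np

def project_to_grid(points, feats, cell=0.25, size=64):
    """Average per-point features into a top-down grid map.
    points: (n, 3) world coordinates (x, y, z); feats: (n, d) features."""
    grid = np.zeros((size, size, feats.shape[1]))
    count = np.zeros((size, size, 1))
    ij = np.floor(points[:, [0, 2]] / cell).astype(int) + size // 2  # x/z -> cell indices
    ok = (ij >= 0).all(axis=1) & (ij < size).all(axis=1)             # drop out-of-map points
    for (i, j), f in zip(ij[ok], feats[ok]):
        grid[i, j] += f
        count[i, j] += 1
    return grid / np.maximum(count, 1)  # mean feature per visited cell
```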
arXiv Detail & Related papers (2023-07-24T16:02:42Z)
- BEVBert: Multimodal Map Pre-training for Language-guided Navigation [75.23388288113817]
We propose a new spatially-aware, map-based pre-training paradigm for vision-and-language navigation (VLN).
We build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map.
Based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatially-aware cross-modal reasoning and thereby facilitates the language-guided navigation goal.
arXiv Detail & Related papers (2022-12-08T16:27:54Z)
- Cross-modal Map Learning for Vision and Language Navigation [82.04247028482244]
We consider the problem of Vision-and-Language Navigation (VLN).
In contrast to other works, our key insight is that the association between language and vision is stronger when it occurs in explicit spatial representations.
We propose a cross-modal map learning model for vision-and-language navigation that first learns to predict the top-down semantics on an egocentric map for both observed and unobserved regions.
arXiv Detail & Related papers (2022-03-10T03:30:12Z)
- Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation [87.03299519917019]
We propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding.
We build a topological map on the fly to enable efficient exploration in the global action space (see the sketch below).
The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation benchmarks.
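Building a topological map on the fly amounts to adding a node per visited viewpoint plus edges to the navigable viewpoints observed there; unvisited neighbors form the global action space. A generic networkx sketch (attribute names are assumptions, not DUET's data structures):

```python
import networkx as nx

class TopoMap:
    """Grow a topological map as the agent moves between viewpoints."""
    def __init__(self):
        self.G = nx.Graph()

    def observe(self, viewpoint, feature, navigable_neighbors):
        """Record the current viewpoint and edges to its navigable neighbors."""
        self.G.add_node(viewpoint, feature=feature, visited=True)
        for nbr, dist in navigable_neighbors:
            if nbr not in self.G:
                self.G.add_node(nbr, visited=False)  # frontier node
            self.G.add_edge(viewpoint, nbr, weight=dist)

    def frontier(self):
        """Unvisited nodes: candidate long-term goals for global planning."""
        return [n for n, d in self.G.nodes(data=True) if not d.get("visited")]
```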
arXiv Detail & Related papers (2022-02-23T19:06:53Z)
- Average Outward Flux Skeletons for Environment Mapping and Topology Matching [15.93458380913065]
We consider how to extract a road map of an initially-unknown 2-dimensional environment via an online procedure that robustly computes a retraction of its boundaries.
The proposed algorithm results in smooth, safe paths for the robot's navigation needs (a minimal flux-based sketch follows).
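Skeletal points of free space can be detected where the gradient of the distance function has strongly negative outward flux. A minimal NumPy/SciPy sketch that uses divergence of the normalized gradient as a proxy for average outward flux (threshold and discretization are assumptions, not the paper's online algorithm):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def aof_skeleton(free, thresh=-0.3):
    """Approximate skeleton of 2D free space.
    free: boolean array, True where the robot can move."""
    dist = distance_transform_edt(free)            # distance to nearest obstacle
    gy, gx = np.gradient(dist)
    norm = np.hypot(gx, gy) + 1e-8
    gx, gy = gx / norm, gy / norm                  # unit gradient field
    div = np.gradient(gx, axis=1) + np.gradient(gy, axis=0)
    return free & (div < thresh)                   # sinks of the field = skeleton
```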
arXiv Detail & Related papers (2021-11-27T06:29:57Z)
- Neighbor-view Enhanced Model for Vision and Language Navigation [78.90859474564787]
Vision and Language Navigation (VLN) requires an agent to navigate to a target location by following natural language instructions.
In this work, we propose a multi-module Neighbor-View Enhanced Model (NvEM) to adaptively incorporate visual contexts from neighbor views (a minimal fusion sketch follows).
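Adaptive incorporation of neighbor views is commonly realized as attention: weight each neighbor's feature by its similarity to the candidate view. A generic sketch (scoring function and shapes are assumptions for illustration, not NvEM's modules):

```python
import numpy as np

def fuse_neighbor_views(candidate, neighbors):
    """Attention-weighted fusion of neighbor-view features.
    candidate: (d,) candidate-view feature; neighbors: (n, d) neighbor features."""
    scores = neighbors @ candidate / np.sqrt(candidate.shape[0])  # scaled dot product
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                      # softmax attention
    return candidate + weights @ neighbors  # neighbor-enhanced representation
```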
arXiv Detail & Related papers (2021-07-15T09:11:02Z)
- Object-and-Action Aware Model for Visual Language Navigation [70.33142095637515]
Vision-and-Language Navigation (VLN) is unique in that it requires turning relatively general natural-language instructions into robot agent actions.
We propose an Object-and-Action Aware Model (OAAM) that processes these two different forms of natural-language instruction separately.
This enables each process to flexibly match object-centered and action-centered instructions to their counterpart visual perceptions and action orientations.
arXiv Detail & Related papers (2020-07-29T06:32:18Z)