Iterative Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2210.03087v3
- Date: Sun, 24 Dec 2023 05:37:26 GMT
- Title: Iterative Vision-and-Language Navigation
- Authors: Jacob Krantz, Shurjo Banerjee, Wang Zhu, Jason Corso, Peter Anderson,
Stefan Lee and Jesse Thomason
- Abstract summary: Iterative Vision-and-Language Navigation (IVLN) is a paradigm for evaluating language-guided agents navigating in a persistent environment over time.
Existing benchmarks erase the agent's memory at the beginning of every episode, testing the ability to perform cold-start navigation with no prior information.
We present discrete and continuous Iterative Room-to-Room (IR2R) benchmarks comprising about 400 tours each in 80 indoor scenes.
- Score: 21.529113549298764
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Iterative Vision-and-Language Navigation (IVLN), a paradigm for
evaluating language-guided agents navigating in a persistent environment over
time. Existing Vision-and-Language Navigation (VLN) benchmarks erase the
agent's memory at the beginning of every episode, testing the ability to
perform cold-start navigation with no prior information. However, deployed
robots occupy the same environment for long periods of time. The IVLN paradigm
addresses this disparity by training and evaluating VLN agents that maintain
memory across tours of scenes that consist of up to 100 ordered
instruction-following Room-to-Room (R2R) episodes, each defined by an
individual language instruction and a target path. We present discrete and
continuous Iterative Room-to-Room (IR2R) benchmarks comprising about 400 tours
each in 80 indoor scenes. We find that extending the implicit memory of
high-performing transformer VLN agents is not sufficient for IVLN, but agents
that build maps can benefit from environment persistence, motivating a renewed
focus on map-building agents in VLN.
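The tour structure described in the abstract can be made concrete with a short sketch. The class and function names below are hypothetical (they are not taken from the IVLN codebase); under those assumptions, the sketch only illustrates how tour-level evaluation differs from standard episodic VLN: agent memory is reset between tours but persists across the ordered episodes within a tour.

```python
# Minimal sketch of IVLN-style tour evaluation (all names are hypothetical).
# In standard VLN the agent's memory would be reset before every episode;
# here it is reset only between tours, so later episodes in a tour can
# reuse what the agent has already observed of the scene.
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    instruction: str          # natural-language instruction for this episode
    target_path: List[str]    # ordered viewpoint ids defining the target path


@dataclass
class Tour:
    scene_id: str             # one indoor scene per tour
    episodes: List[Episode]   # up to ~100 ordered R2R-style episodes


def evaluate_iterative(agent, tours: List[Tour], env) -> float:
    """Run tours sequentially; agent memory persists only within a tour."""
    successes, total = 0, 0
    for tour in tours:
        env.load_scene(tour.scene_id)
        agent.reset_memory()                  # reset between tours, not between episodes
        for episode in tour.episodes:
            obs = env.start_episode(episode)  # memory carries over from earlier episodes
            path = agent.follow_instruction(episode.instruction, obs, env)
            successes += int(env.is_success(path, episode.target_path))
            total += 1
    return successes / max(total, 1)
```

Moving the reset_memory() call inside the episode loop would recover the cold-start setting that existing VLN benchmarks test, which is exactly the contrast the abstract draws.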
Related papers
- OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation [96.46961207887722]
OVER-NAV aims to go over and beyond the current state of the art in IVLN techniques.
To fully exploit the interpreted navigation data, we introduce a structured representation, dubbed Omnigraph.
arXiv Detail & Related papers (2024-03-26T02:34:48Z)
- Continual Vision-and-Language Navigation [18.20829279972436]
Vision-and-Language Navigation (VLN) agents navigate to a destination using natural language instructions and the visual information they observe.
Existing methods for training VLN agents presuppose fixed datasets, leading to a significant limitation.
We present the Continual Vision-and-Language Navigation (CVLN) paradigm, designed to evaluate agents trained through a continual learning process.
arXiv Detail & Related papers (2024-03-22T09:15:36Z)
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions; a minimal sketch of this kind of pipeline appears after this list.
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- ESceme: Vision-and-Language Navigation with Episodic Scene Memory [72.69189330588539]
Vision-and-language navigation (VLN) simulates a visual agent that follows natural-language navigation instructions in real-world scenes.
We introduce a mechanism of Episodic Scene memory (ESceme) for VLN that wakes an agent's memories of past visits when it enters the current scene.
arXiv Detail & Related papers (2023-03-02T07:42:07Z)
- A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning [70.14372215250535]
Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments.
Given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding.
We take 500+ indoor environments captured in densely-sampled 360 degree panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory.
The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets.
arXiv Detail & Related papers (2022-10-06T17:59:08Z)
- Airbert: In-domain Pretraining for Vision-and-Language Navigation [91.03849833486974]
Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions.
Recent methods explore pretraining to improve generalization of VLN agents.
We introduce BnB, a large-scale and diverse in-domain VLN dataset.
arXiv Detail & Related papers (2021-08-20T10:58:09Z)
- Structured Scene Memory for Vision-Language Navigation [155.63025602722712]
We propose Structured Scene Memory (SSM), a memory architecture for vision-language navigation (VLN).
It is compartmentalized enough to accurately memorize the percepts during navigation.
It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
arXiv Detail & Related papers (2021-03-05T03:41:00Z)
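Returning to the LangNav entry above, the idea of replacing raw pixels with text can be sketched in a few lines. The captioner and detector interfaces below are generic placeholders rather than the models used in that paper, and the function name is made up; under those assumptions, the sketch shows one way a panoramic observation could be turned into a language description for a text-only policy.

```python
# Hypothetical sketch of a LangNav-style observation-to-language step: each
# view of the egocentric panorama is captioned and scanned for objects, and
# the results are concatenated into a textual scene description.
from typing import Callable, List, Sequence


def describe_panorama(
    views: Sequence["Image"],                # discretized headings of the panorama
    caption: Callable[["Image"], str],       # off-the-shelf image captioner (placeholder)
    detect: Callable[["Image"], List[str]],  # off-the-shelf object detector (placeholder)
) -> str:
    """Convert one egocentric panoramic observation into a text description."""
    parts = []
    for heading_idx, view in enumerate(views):
        objects = ", ".join(detect(view)) or "no salient objects"
        parts.append(
            f"Facing heading {heading_idx}: {caption(view)} (objects: {objects})."
        )
    return " ".join(parts)
```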