Related papers: Continual Vision-and-Language Navigation

Continual Vision-and-Language Navigation

URL: http://arxiv.org/abs/2403.15049v1
Date: Fri, 22 Mar 2024 09:15:36 GMT
Title: Continual Vision-and-Language Navigation
Authors: Seongjun Jeong, Gi-Cheon Kang, Seongho Choi, Joochan Kim, Byoung-Tak Zhang,
Abstract summary: Vision-and-Language Navigation (VLN) agents navigate to a destination using natural language instructions and the visual information they observe. Existing methods for training VLN agents presuppose fixed datasets, leading to a significant limitation. We present the Continual Vision-and-Language Navigation (CVLN) paradigm, designed to evaluate agents trained through a continual learning process.
Score: 18.20829279972436
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Vision-and-Language Navigation (VLN) agents navigate to a destination using natural language instructions and the visual information they observe. Existing methods for training VLN agents presuppose fixed datasets, leading to a significant limitation: the introduction of new environments necessitates retraining with previously encountered environments to preserve their knowledge. This makes it difficult to train VLN agents that operate in the ever-changing real world. To address this limitation, we present the Continual Vision-and-Language Navigation (CVLN) paradigm, designed to evaluate agents trained through a continual learning process. For the training and evaluation of CVLN agents, we re-arrange existing VLN datasets to propose two datasets: CVLN-I, focused on navigation via initial-instruction interpretation, and CVLN-D, aimed at navigation through dialogue with other agents. Furthermore, we propose two novel rehearsal-based methods for CVLN, Perplexity Replay (PerpR) and Episodic Self-Replay (ESR). PerpR prioritizes replaying challenging episodes based on action perplexity, while ESR replays previously predicted action logits to preserve learned behaviors. We demonstrate the effectiveness of the proposed methods on CVLN through extensive experiments.

Related papers

Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation [67.31811007549489]
We propose a Rewriting-driven AugMentation (RAM) paradigm for Vision-Language Navigation (VLN) Benefiting from our rewriting mechanism, new observation-instruction can be obtained in both simulator-free and labor-saving manners to promote generalization. Experiments on both the discrete environments (R2R, REVERIE, and R4R) and continuous environments (R2R-CE) show the superior performance and impressive generalization ability of our method.
arXiv Detail & Related papers (2025-03-23T13:18:17Z)
UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation [71.97405667493477]
We introduce a novel, generalizable 3DGS-based pre-training paradigm, called UnitedVLN. It enables agents to better explore future environments by unitedly rendering high-fidelity 360 visual images and semantic features. UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.
arXiv Detail & Related papers (2024-11-25T02:44:59Z)
Vision-Language Navigation with Continual Learning [10.850410419782424]
Vision-language navigation (VLN) is a critical domain within embedded intelligence. We propose the Vision-Language Navigation with Continual Learning paradigm to address this challenge. In this paradigm, agents incrementally learn new environments while retaining previously acquired knowledge.
arXiv Detail & Related papers (2024-09-04T09:28:48Z)
Vision-and-Language Navigation Generative Pretrained Transformer [0.0]
Vision-and-Language Navigation Generative Pretrained Transformer (VLN-GPT) Adopts transformer decoder model (GPT2) to model trajectory sequence dependencies, bypassing the need for historical encoding modules. Performance assessments on the VLN dataset reveal that VLN-GPT surpasses complex state-of-the-art encoder-based models.
arXiv Detail & Related papers (2024-05-27T09:42:04Z)
Scaling Vision-and-Language Navigation With Offline RL [35.624579441774685]
We introduce a new problem setup of VLN-ORL which studies VLN using suboptimal demonstration data. We introduce a simple and effective reward-conditioned approach that can account for dataset suboptimality for training VLN agents. Our experiments demonstrate that the proposed reward-conditioned approach leads to significant performance improvements.
arXiv Detail & Related papers (2024-03-27T11:13:20Z)
NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning [101.56342075720588]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions. Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability. This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), where we fulfill parameter-efficient in-domain training to enable self-guided navigational decision.
arXiv Detail & Related papers (2024-03-12T07:27:02Z)
Vision-Language Models Provide Promptable Representations for Reinforcement Learning [67.40524195671479]
We propose a novel approach that uses the vast amounts of general and indexable world knowledge encoded in vision-language models (VLMs) pre-trained on Internet-scale data for embodied reinforcement learning (RL) We show that our approach can use chain-of-thought prompting to produce representations of common-sense semantic reasoning, improving policy performance in novel scenes by 1.5 times.
arXiv Detail & Related papers (2024-02-05T00:48:56Z)
Prompt-based Context- and Domain-aware Pretraining for Vision and Language Navigation [19.793659852435486]
We propose a novel Prompt-bAsed coNtext- and inDoor-Aware (PANDA) pretraining framework to address these problems. In the indoor-aware stage, we apply an efficient tuning paradigm to learn deep visual prompts from an indoor dataset. In the context-aware stage, we design a set of hard context prompts to capture the sequence-level semantics in the instruction.
arXiv Detail & Related papers (2023-09-07T11:58:34Z)
Anticipating the Unseen Discrepancy for Vision and Language Navigation [63.399180481818405]
Vision-Language Navigation requires the agent to follow natural language instructions to reach a specific target. The large discrepancy between seen and unseen environments makes it challenging for the agent to generalize well. We propose Unseen Discrepancy Anticipating Vision and Language Navigation (DAVIS) that learns to generalize to unseen environments via encouraging test-time visual consistency.
arXiv Detail & Related papers (2022-09-10T19:04:40Z)
HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation [33.38079488853708]
Previous pre-training methods for Vision-and-Language Navigation (VLN) either lack the ability to predict future actions or ignore contexts. We propose a novel pre-training paradigm that exploit the past observations and support future action prediction. Our navigation action prediction is also enhanced by the task of Action Prediction with History.
arXiv Detail & Related papers (2022-03-22T10:17:12Z)
Contrastive Instruction-Trajectory Learning for Vision-Language Navigation [66.16980504844233]
A vision-language navigation (VLN) task requires an agent to reach a target with the guidance of natural language instruction. Previous works fail to discriminate the similarities and discrepancies across instruction-trajectory pairs and ignore the temporal continuity of sub-instructions. We propose a Contrastive Instruction-Trajectory Learning framework that explores invariance across similar data samples and variance across different ones to learn distinctive representations for robust navigation.
arXiv Detail & Related papers (2021-12-08T06:32:52Z)
SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments [7.5606260987453116]
This paper presents a novel approach for the Vision-and-Language Navigation (VLN) task in continuous 3D environments. Existing end-to-end learning-based methods struggle at this task as they focus mostly on raw visual observations. We present a hybrid transformer-recurrence model which focuses on combining classical semantic mapping techniques with a learning-based method.
arXiv Detail & Related papers (2021-08-26T17:57:02Z)
Airbert: In-domain Pretraining for Vision-and-Language Navigation [91.03849833486974]
Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions. Recent methods explore pretraining to improve generalization of VLN agents. We introduce BnB, a large-scale and diverse in-domain VLN dataset.
arXiv Detail & Related papers (2021-08-20T10:58:09Z)
Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training [150.35927365127176]
We present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks. By training on a large amount of image-text-action triplets in a self-supervised learning manner, the pre-trained model provides generic representations of visual environments and language instructions. It learns more effectively in new tasks and generalizes better in a previously unseen environment.
arXiv Detail & Related papers (2020-02-25T03:08:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.