Related papers: SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments

SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments

URL: http://arxiv.org/abs/2108.11945v1
Date: Thu, 26 Aug 2021 17:57:02 GMT
Title: SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments
Authors: Muhammad Zubair Irshad, Niluthpol Chowdhury Mithun, Zachary Seymour, Han-Pang Chiu, Supun Samarasekera, Rakesh Kumar
Abstract summary: This paper presents a novel approach for the Vision-and-Language Navigation (VLN) task in continuous 3D environments. Existing end-to-end learning-based methods struggle at this task as they focus mostly on raw visual observations. We present a hybrid transformer-recurrence model which focuses on combining classical semantic mapping techniques with a learning-based method.
Score: 7.5606260987453116
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper presents a novel approach for the Vision-and-Language Navigation (VLN) task in continuous 3D environments, which requires an autonomous agent to follow natural language instructions in unseen environments. Existing end-to-end learning-based VLN methods struggle at this task as they focus mostly on utilizing raw visual observations and lack the semantic spatio-temporal reasoning capabilities which is crucial in generalizing to new environments. In this regard, we present a hybrid transformer-recurrence model which focuses on combining classical semantic mapping techniques with a learning-based method. Our method creates a temporal semantic memory by building a top-down local ego-centric semantic map and performs cross-modal grounding to align map and language modalities to enable effective learning of VLN policy. Empirical results in a photo-realistic long-horizon simulation environment show that the proposed approach outperforms a variety of state-of-the-art methods and baselines with over 22% relative improvement in SPL in prior unseen environments.

Related papers

Without Paired Labeled Data: An End-to-End Self-Supervised Paradigm for UAV-View Geo-Localization [2.733505168507872]
UAV-View Geo-Localization (UVGL) aims to achieve accurate localization of unmanned aerial vehicles (UAVs) by retrieving the most relevant GPS-tagged satellite images. Existing methods heavily rely on pre-paired UAV-satellite images for supervised learning. We propose an end-to-end self-supervised UVGL method to overcome these limitations.
arXiv Detail & Related papers (2025-02-17T02:53:08Z)
UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation [71.97405667493477]
We introduce a novel, generalizable 3DGS-based pre-training paradigm, called UnitedVLN. It enables agents to better explore future environments by unitedly rendering high-fidelity 360 visual images and semantic features. UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.
arXiv Detail & Related papers (2024-11-25T02:44:59Z)
Vision-Language Navigation with Continual Learning [10.850410419782424]
Vision-language navigation (VLN) is a critical domain within embedded intelligence. We propose the Vision-Language Navigation with Continual Learning paradigm to address this challenge. In this paradigm, agents incrementally learn new environments while retaining previously acquired knowledge.
arXiv Detail & Related papers (2024-09-04T09:28:48Z)
Continual Vision-and-Language Navigation [18.20829279972436]
Vision-and-Language Navigation (VLN) agents navigate to a destination using natural language instructions and visual cues. We propose Continual Vision-and-Language Navigation (CVLN) paradigm for agents to continually learn and adapt to changing environments.
arXiv Detail & Related papers (2024-03-22T09:15:36Z)
Causality-based Cross-Modal Representation Learning for Vision-and-Language Navigation [15.058687283978077]
Vision-and-Language Navigation (VLN) has gained significant research interest in recent years due to its potential applications in real-world scenarios. Existing VLN methods struggle with the issue of spurious associations, resulting in poor generalization with a significant performance gap between seen and unseen environments. We propose a unified framework CausalVLN based on the causal learning paradigm to train a robust navigator capable of learning unbiased feature representations.
arXiv Detail & Related papers (2024-03-06T02:01:38Z)
Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM. It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features. S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
BEVBert: Multimodal Map Pre-training for Language-guided Navigation [75.23388288113817]
We propose a new map-based pre-training paradigm that is spatial-aware for use in vision-and-language navigation (VLN) We build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map. Based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatial-aware cross-modal reasoning thereby facilitating the language-guided navigation goal.
arXiv Detail & Related papers (2022-12-08T16:27:54Z)
Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration [83.96729205383501]
We introduce prompt-based learning to achieve fast adaptation for language embeddings. Our model can adapt to diverse vision-language navigation tasks, including VLN and REVERIE.
arXiv Detail & Related papers (2022-03-08T11:01:24Z)
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration [48.01536973731182]
We introduce a new vision-and-language pretraining method called ROSITA. It integrates the cross- and intra-modal knowledge in a unified scene graph to enhance the semantic alignments. ROSITA significantly outperforms existing state-of-the-art methods on three typical vision-and-language tasks over six benchmark datasets.
arXiv Detail & Related papers (2021-08-16T13:16:58Z)
Learning to Continuously Optimize Wireless Resource in a Dynamic Environment: A Bilevel Optimization Perspective [52.497514255040514]
This work develops a new approach that enables data-driven methods to continuously learn and optimize resource allocation strategies in a dynamic environment. We propose to build the notion of continual learning into wireless system design, so that the learning model can incrementally adapt to the new episodes. Our design is based on a novel bilevel optimization formulation which ensures certain fairness" across different data samples.
arXiv Detail & Related papers (2021-05-03T07:23:39Z)
Environment-agnostic Multitask Learning for Natural Language Grounded Navigation [88.69873520186017]
We introduce a multitask navigation model that can be seamlessly trained on Vision-Language Navigation (VLN) and Navigation from Dialog History (NDH) tasks. Experiments show that environment-agnostic multitask learning significantly reduces the performance gap between seen and unseen environments.
arXiv Detail & Related papers (2020-03-01T09:06:31Z)
Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling [65.99956848461915]
Vision-and-Language Navigation (VLN) is a task where agents must decide how to move through a 3D environment to reach a goal. One of the problems of the VLN task is data scarcity since it is difficult to collect enough navigation paths with human-annotated instructions for interactive environments. We propose an adversarial-driven counterfactual reasoning model that can consider effective conditions instead of low-quality augmented data.
arXiv Detail & Related papers (2019-11-17T18:02:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.