SASRA: Semantically-aware Spatio-temporal Reasoning Agent for
Vision-and-Language Navigation in Continuous Environments
- URL: http://arxiv.org/abs/2108.11945v1
- Date: Thu, 26 Aug 2021 17:57:02 GMT
- Title: SASRA: Semantically-aware Spatio-temporal Reasoning Agent for
Vision-and-Language Navigation in Continuous Environments
- Authors: Muhammad Zubair Irshad, Niluthpol Chowdhury Mithun, Zachary Seymour,
Han-Pang Chiu, Supun Samarasekera, Rakesh Kumar
- Abstract summary: This paper presents a novel approach for the Vision-and-Language Navigation (VLN) task in continuous 3D environments.
Existing end-to-end learning-based methods struggle at this task as they focus mostly on raw visual observations.
We present a hybrid transformer-recurrence model which focuses on combining classical semantic mapping techniques with a learning-based method.
- Score: 7.5606260987453116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a novel approach for the Vision-and-Language Navigation
(VLN) task in continuous 3D environments, which requires an autonomous agent to
follow natural language instructions in unseen environments. Existing
end-to-end learning-based VLN methods struggle at this task as they focus
mostly on utilizing raw visual observations and lack the semantic
spatio-temporal reasoning capabilities which is crucial in generalizing to new
environments. In this regard, we present a hybrid transformer-recurrence model
which focuses on combining classical semantic mapping techniques with a
learning-based method. Our method creates a temporal semantic memory by
building a top-down local ego-centric semantic map and performs cross-modal
grounding to align map and language modalities to enable effective learning of
VLN policy. Empirical results in a photo-realistic long-horizon simulation
environment show that the proposed approach outperforms a variety of
state-of-the-art methods and baselines with over 22% relative improvement in
SPL in prior unseen environments.
Related papers
- UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation [71.97405667493477]
We introduce a novel, generalizable 3DGS-based pre-training paradigm, called UnitedVLN.
It enables agents to better explore future environments by unitedly rendering high-fidelity 360 visual images and semantic features.
UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.
arXiv Detail & Related papers (2024-11-25T02:44:59Z) - Vision-Language Navigation with Continual Learning [10.850410419782424]
Vision-language navigation (VLN) is a critical domain within embedded intelligence.
We propose the Vision-Language Navigation with Continual Learning paradigm to address this challenge.
In this paradigm, agents incrementally learn new environments while retaining previously acquired knowledge.
arXiv Detail & Related papers (2024-09-04T09:28:48Z) - Causality-based Cross-Modal Representation Learning for
Vision-and-Language Navigation [15.058687283978077]
Vision-and-Language Navigation (VLN) has gained significant research interest in recent years due to its potential applications in real-world scenarios.
Existing VLN methods struggle with the issue of spurious associations, resulting in poor generalization with a significant performance gap between seen and unseen environments.
We propose a unified framework CausalVLN based on the causal learning paradigm to train a robust navigator capable of learning unbiased feature representations.
arXiv Detail & Related papers (2024-03-06T02:01:38Z) - Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z) - BEVBert: Multimodal Map Pre-training for Language-guided Navigation [75.23388288113817]
We propose a new map-based pre-training paradigm that is spatial-aware for use in vision-and-language navigation (VLN)
We build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map.
Based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatial-aware cross-modal reasoning thereby facilitating the language-guided navigation goal.
arXiv Detail & Related papers (2022-12-08T16:27:54Z) - Visual-Language Navigation Pretraining via Prompt-based Environmental
Self-exploration [83.96729205383501]
We introduce prompt-based learning to achieve fast adaptation for language embeddings.
Our model can adapt to diverse vision-language navigation tasks, including VLN and REVERIE.
arXiv Detail & Related papers (2022-03-08T11:01:24Z) - ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and
Intra-modal Knowledge Integration [48.01536973731182]
We introduce a new vision-and-language pretraining method called ROSITA.
It integrates the cross- and intra-modal knowledge in a unified scene graph to enhance the semantic alignments.
ROSITA significantly outperforms existing state-of-the-art methods on three typical vision-and-language tasks over six benchmark datasets.
arXiv Detail & Related papers (2021-08-16T13:16:58Z) - Learning to Continuously Optimize Wireless Resource in a Dynamic
Environment: A Bilevel Optimization Perspective [52.497514255040514]
This work develops a new approach that enables data-driven methods to continuously learn and optimize resource allocation strategies in a dynamic environment.
We propose to build the notion of continual learning into wireless system design, so that the learning model can incrementally adapt to the new episodes.
Our design is based on a novel bilevel optimization formulation which ensures certain fairness" across different data samples.
arXiv Detail & Related papers (2021-05-03T07:23:39Z) - Environment-agnostic Multitask Learning for Natural Language Grounded
Navigation [88.69873520186017]
We introduce a multitask navigation model that can be seamlessly trained on Vision-Language Navigation (VLN) and Navigation from Dialog History (NDH) tasks.
Experiments show that environment-agnostic multitask learning significantly reduces the performance gap between seen and unseen environments.
arXiv Detail & Related papers (2020-03-01T09:06:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.