Ground then Navigate: Language-guided Navigation in Dynamic Scenes
- URL: http://arxiv.org/abs/2209.11972v1
- Date: Sat, 24 Sep 2022 09:51:09 GMT
- Title: Ground then Navigate: Language-guided Navigation in Dynamic Scenes
- Authors: Kanishk Jain, Varun Chhangani, Amogh Tiwari, K. Madhava Krishna and
Vineet Gandhi
- Abstract summary: We investigate the Vision-and-Language Navigation (VLN) problem in the context of autonomous driving in outdoor settings.
We solve the problem by explicitly grounding the navigable regions corresponding to the textual command.
We provide extensive qualitative and quantitative empirical results to validate the efficacy of the proposed approach.
- Score: 13.870303451896248
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the Vision-and-Language Navigation (VLN) problem in the
context of autonomous driving in outdoor settings. We solve the problem by
explicitly grounding the navigable regions corresponding to the textual
command. At each timestamp, the model predicts a segmentation mask
corresponding to the intermediate or the final navigable region. Our work
contrasts with existing efforts in VLN, which pose this task as a node
selection problem, given a discrete connected graph corresponding to the
environment. We do not assume the availability of such a discretised map. Our
work moves towards continuity in action space, provides interpretability
through visual feedback and allows VLN on commands requiring finer manoeuvres
like "park between the two cars". Furthermore, we propose a novel meta-dataset
CARLA-NAV to allow efficient training and validation. The dataset comprises
pre-recorded training sequences and a live environment for validation and
testing. We provide extensive qualitative and quantitative empirical results to
validate the efficacy of the proposed approach.
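To make the grounding-then-navigating formulation concrete, here is a minimal sketch of a language-conditioned segmentation step in PyTorch. It is not the authors' released model; the text encoder, visual backbone, fusion-by-tiling and all sizes are illustrative assumptions that only mirror the abstract's description of predicting a navigable-region mask per frame.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGroundedSegmenter(nn.Module):
    """Predicts a navigable-region mask for one frame, conditioned on a command."""
    def __init__(self, vocab_size=1000, d=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.gru = nn.GRU(d, d, batch_first=True)          # command encoder
        self.visual = nn.Sequential(                        # frame encoder at 1/4 resolution
            nn.Conv2d(3, d, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d, d, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(2 * d, 1, kernel_size=1)      # fused features -> mask logits

    def forward(self, frame, tokens):
        _, h = self.gru(self.embed(tokens))                 # h: (1, B, d) command summary
        v = self.visual(frame)                              # (B, d, H/4, W/4)
        t = h[-1][:, :, None, None].expand(-1, -1, v.shape[2], v.shape[3])
        logits = self.head(torch.cat([v, t], dim=1))        # fuse by tiling the text feature
        return F.interpolate(logits, size=frame.shape[2:],
                             mode="bilinear", align_corners=False)

# One step of the rollout: ground the command in the current observation.
model = LanguageGroundedSegmenter()
frame = torch.randn(1, 3, 128, 256)                         # current RGB frame
tokens = torch.randint(0, 1000, (1, 8))                     # tokenised command
mask = torch.sigmoid(model(frame, tokens))                  # per-pixel navigability in [0, 1]
```

In the full approach this mask would be predicted at every timestamp and converted into controls by a downstream planner; that part, and the CARLA-NAV data handling, are not modelled here.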
Related papers
- Prompt-based Context- and Domain-aware Pretraining for Vision and
Language Navigation [19.793659852435486]
We propose a novel Prompt-bAsed coNtext- and inDoor-Aware (PANDA) pretraining framework to address these problems.
In the indoor-aware stage, we apply an efficient tuning paradigm to learn deep visual prompts from an indoor dataset.
In the context-aware stage, we design a set of hard context prompts to capture the sequence-level semantics in the instruction.
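For readers unfamiliar with prompt-based tuning, the general idea referenced above can be sketched generically as follows. This is not PANDA's code; the frozen backbone, prompt count and dimensions are placeholder assumptions.

```python
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    """Prepends learnable prompt vectors to the inputs of a frozen encoder."""
    def __init__(self, encoder, n_prompts=8, d=64):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False                         # backbone stays frozen
        self.prompts = nn.Parameter(torch.randn(1, n_prompts, d) * 0.02)

    def forward(self, x):                                   # x: (B, L, d) token features
        p = self.prompts.expand(x.shape[0], -1, -1)
        return self.encoder(torch.cat([p, x], dim=1))       # prompts are the only trainable part

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2)
model = PromptedEncoder(backbone)
out = model(torch.randn(2, 10, 64))                         # -> (2, 18, 64)
```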
arXiv Detail & Related papers (2023-09-07T11:58:34Z)
- Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation [23.94546957057613]
Cross-modal alignment is one key challenge for Vision-and-Language Navigation (VLN).
We propose a novel Grounded Entity-Landmark Adaptive (GELA) pre-training paradigm for VLN tasks.
arXiv Detail & Related papers (2023-08-24T06:25:20Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- BEVBert: Multimodal Map Pre-training for Language-guided Navigation [75.23388288113817]
We propose a new spatial-aware, map-based pre-training paradigm for vision-and-language navigation (VLN).
We build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map.
Based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatial-aware cross-modal reasoning, thereby facilitating the language-guided navigation goal.
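A rough sketch of such hybrid-map bookkeeping, under simplifying assumptions (this is not BEVBert's implementation; the grid size, feature dimension, cell resolution and averaging rule are illustrative): per-step features are averaged into a local metric grid, while visited viewpoints and their links form a global topological map.

```python
import numpy as np

class HybridMap:
    """Local metric grid plus global topological graph (illustrative sizes and names)."""
    def __init__(self, grid=32, feat_dim=16, cell_m=0.5):
        self.feat = np.zeros((grid, grid, feat_dim))        # running-average cell features
        self.count = np.zeros((grid, grid))                 # observations per cell
        self.grid, self.cell_m = grid, cell_m
        self.nodes, self.edges = {}, set()                  # topological map

    def add_observation(self, xy_m, feats, node_id, prev_id=None):
        c = self.grid // 2
        for (x, y), f in zip(xy_m, feats):                  # metric map: average repeated cells
            i, j = int(c + x / self.cell_m), int(c + y / self.cell_m)
            if 0 <= i < self.grid and 0 <= j < self.grid:
                self.count[i, j] += 1
                self.feat[i, j] += (f - self.feat[i, j]) / self.count[i, j]
        self.nodes[node_id] = np.asarray(xy_m).mean(axis=0) # topological map: viewpoint ...
        if prev_id is not None:
            self.edges.add((prev_id, node_id))              # ... and its navigation link

m = HybridMap()
pts = np.random.uniform(-5, 5, size=(50, 2))                # observed points (metres, ego frame)
feats = np.random.randn(50, 16)                             # per-point features
m.add_observation(pts, feats, node_id=0)
m.add_observation(pts, feats, node_id=1, prev_id=0)         # duplicates are averaged, not re-added
```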
arXiv Detail & Related papers (2022-12-08T16:27:54Z)
- Self-Supervised Visual Place Recognition by Mining Temporal and Feature Neighborhoods [17.852415436033436]
We propose a novel framework named TF-VPR that uses temporal neighborhoods and learnable feature neighborhoods to discover unknown spatial neighborhoods.
Our method follows an iterative training paradigm which alternates between: (1) representation learning with data augmentation, (2) positive set expansion to include the current feature space neighbors, and (3) positive set contraction via geometric verification.
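The alternating scheme can be summarised in a short sketch. Everything below is an illustrative stand-in rather than the TF-VPR implementation: the embedding update in step (1) is left as a comment, and feature_neighbours, geometric_check and all thresholds are hypothetical.

```python
import numpy as np

def feature_neighbours(emb, i, k=5):
    """Indices of the k nearest points to emb[i] in feature space (self excluded)."""
    d = np.linalg.norm(emb - emb[i], axis=1)
    return set(np.argsort(d)[1:k + 1].tolist())

def geometric_check(i, j, poses, max_dist=10.0):
    """Stand-in verification: accept a pair if the recorded poses are close."""
    return np.linalg.norm(poses[i] - poses[j]) < max_dist

def refine_positives(emb, poses, window=2, rounds=3):
    n = len(emb)
    positives = [set(range(max(0, i - window), min(n, i + window + 1))) - {i}
                 for i in range(n)]                          # start from temporal neighbours
    for _ in range(rounds):
        # (1) representation learning with data augmentation would update `emb` here.
        for i in range(n):
            positives[i] |= feature_neighbours(emb, i)       # (2) expansion via feature neighbours
        for i in range(n):
            positives[i] = {j for j in positives[i]
                            if geometric_check(i, j, poses)} # (3) contraction via verification
    return positives

emb = np.random.randn(100, 32)                               # placeholder descriptors
poses = np.cumsum(np.random.randn(100, 2), axis=0)           # placeholder trajectory
positives = refine_positives(emb, poses)
```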
arXiv Detail & Related papers (2022-08-19T12:59:46Z)
- Cross-modal Map Learning for Vision and Language Navigation [82.04247028482244]
We consider the problem of Vision-and-Language Navigation (VLN).
In contrast to other works, our key insight is that the association between language and vision is stronger when it occurs in explicit spatial representations.
We propose a cross-modal map learning model for vision-and-language navigation that first learns to predict the top-down semantics on an egocentric map for both observed and unobserved regions.
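A minimal sketch of that map-completion step, with all architectural choices assumed rather than taken from the paper: a partial egocentric semantic map and an observation mask are fed to a small convolutional network that predicts class logits for every cell, observed or not.

```python
import torch
import torch.nn as nn

class MapCompleter(nn.Module):
    """Predicts top-down semantics for every cell from a partially observed map."""
    def __init__(self, n_classes=8, d=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n_classes + 1, d, 3, padding=1), nn.ReLU(),
            nn.Conv2d(d, d, 3, padding=1), nn.ReLU(),
            nn.Conv2d(d, n_classes, 1),
        )

    def forward(self, partial_map, observed_mask):
        x = torch.cat([partial_map, observed_mask], dim=1)   # the mask marks which cells were seen
        return self.net(x)                                   # logits for observed and unobserved cells

model = MapCompleter()
partial = torch.zeros(1, 8, 64, 64)                          # egocentric semantic map (one-hot per cell)
mask = torch.zeros(1, 1, 64, 64); mask[..., 32:, :] = 1      # only half of the grid has been observed
logits = model(partial, mask)                                 # (1, 8, 64, 64)
```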
arXiv Detail & Related papers (2022-03-10T03:30:12Z)
- NEAT: Neural Attention Fields for End-to-End Autonomous Driving [59.60483620730437]
We present NEural ATtention fields (NEAT), a novel representation that enables efficient reasoning for imitation learning models.
NEAT is a continuous function which maps locations in Bird's Eye View (BEV) scene coordinates to waypoints and semantics.
In a new evaluation setting involving adverse environmental conditions and challenging scenarios, NEAT outperforms several strong baselines and achieves driving scores on par with the privileged CARLA expert.
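The continuous-function idea can be illustrated with a tiny coordinate MLP. This is a simplified stand-in for NEAT (the attention mechanism is omitted, and the scene feature, sizes and outputs are assumptions), mapping BEV query locations to a waypoint offset and semantic logits.

```python
import torch
import torch.nn as nn

class BEVField(nn.Module):
    """Maps BEV query locations (plus a scene feature) to waypoint offsets and semantics."""
    def __init__(self, scene_dim=64, n_classes=6, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 + scene_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.waypoint = nn.Linear(hidden, 2)                 # offset towards the next waypoint
        self.semantics = nn.Linear(hidden, n_classes)        # semantic class logits at that location

    def forward(self, xy, scene):
        h = self.mlp(torch.cat([xy, scene.expand(xy.shape[0], -1)], dim=-1))
        return self.waypoint(h), self.semantics(h)

field = BEVField()
queries = torch.rand(256, 2) * 2 - 1                         # BEV locations in [-1, 1]^2
scene = torch.randn(1, 64)                                   # pooled scene feature from the sensors
offsets, logits = field(queries, scene)                      # query the field anywhere, continuously
```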
arXiv Detail & Related papers (2021-09-09T17:55:28Z)
- Vision-Language Navigation with Random Environmental Mixup [112.94609558723518]
Vision-language Navigation (VLN) tasks require an agent to navigate step-by-step while perceiving the visual observations and comprehending a natural language instruction.
Previous works have proposed various data augmentation methods to reduce data bias.
We propose the Random Environmental Mixup (REM) method, which generates cross-connected house scenes as augmented data by mixing up environments.
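Purely as an illustration of the mix-up idea (this is not the REM procedure, which operates on scene connectivity; the episode format and crossover rule here are hypothetical), two training episodes can be cross-connected at a cut point, splicing both the paths and the instruction segments.

```python
import random

def mixup_episodes(ep_a, ep_b, seed=0):
    """Cross-connect two episodes: a-prefix followed by b-suffix (hypothetical format)."""
    random.seed(seed)
    cut = random.randint(1, min(len(ep_a["path"]), len(ep_b["path"])) - 1)
    return {
        "path": ep_a["path"][:cut] + ep_b["path"][cut:],    # spliced viewpoint sequence
        "steps": ep_a["steps"][:cut] + ep_b["steps"][cut:], # spliced instruction segments
    }

ep_a = {"path": ["a0", "a1", "a2", "a3"],
        "steps": ["exit the room", "turn left", "walk down the hall", "stop"]}
ep_b = {"path": ["b0", "b1", "b2", "b3"],
        "steps": ["leave the kitchen", "turn right", "pass the sofa", "stop"]}
mixed = mixup_episodes(ep_a, ep_b)                          # one augmented training episode
```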
arXiv Detail & Related papers (2021-06-15T04:34:26Z)
- Self-Point-Flow: Self-Supervised Scene Flow Estimation from Point Clouds with Optimal Transport and Random Walk [59.87525177207915]
We develop a self-supervised method to establish correspondences between two point clouds to approximate scene flow.
Our method achieves state-of-the-art performance among self-supervised learning methods.
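A minimal sketch of turning a soft optimal-transport matching into pseudo scene-flow labels, with the caveat that this is not the paper's method in detail: the random-walk refinement from the title is omitted and the Sinkhorn settings are arbitrary.

```python
import numpy as np

def sinkhorn(cost, eps=0.05, iters=50):
    """Entropic optimal transport with uniform marginals (arbitrary settings)."""
    K = np.exp(-cost / eps)
    u, v = np.ones(cost.shape[0]), np.ones(cost.shape[1])
    for _ in range(iters):                                  # alternate row/column scaling
        u = 1.0 / (K @ v)
        v = 1.0 / (K.T @ u)
    return u[:, None] * K * v[None, :]                      # soft transport plan

def pseudo_flow(p1, p2):
    cost = np.linalg.norm(p1[:, None, :] - p2[None, :, :], axis=-1)  # pairwise distances
    plan = sinkhorn(cost)
    matches = (plan / plan.sum(axis=1, keepdims=True)) @ p2          # soft correspondences in p2
    return matches - p1                                              # per-point pseudo flow labels

p1 = np.random.randn(128, 3)
p2 = p1 + 0.1                                               # toy cloud: a pure translation
flow = pseudo_flow(p1, p2)                                  # close to (0.1, 0.1, 0.1) per point
```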
arXiv Detail & Related papers (2021-05-18T03:12:42Z)