Related papers: Ground then Navigate: Language-guided Navigation in Dynamic Scenes

URL: http://arxiv.org/abs/2209.11972v1
Date: Sat, 24 Sep 2022 09:51:09 GMT
Title: Ground then Navigate: Language-guided Navigation in Dynamic Scenes
Authors: Kanishk Jain, Varun Chhangani, Amogh Tiwari, K. Madhava Krishna and Vineet Gandhi
Abstract summary: We investigate the Vision-and-Language Navigation (VLN) problem in the context of autonomous driving in outdoor settings. We solve the problem by explicitly grounding the navigable regions corresponding to the textual command. We provide extensive qualitative and quantitative empirical results to validate the efficacy of the proposed approach.
Score: 13.870303451896248
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We investigate the Vision-and-Language Navigation (VLN) problem in the context of autonomous driving in outdoor settings. We solve the problem by explicitly grounding the navigable regions corresponding to the textual command. At each timestamp, the model predicts a segmentation mask corresponding to the intermediate or the final navigable region. Our work contrasts with existing efforts in VLN, which pose this task as a node selection problem, given a discrete connected graph corresponding to the environment. We do not assume the availability of such a discretised map. Our work moves towards continuity in action space, provides interpretability through visual feedback and allows VLN on commands requiring finer manoeuvres like "park between the two cars". Furthermore, we propose a novel meta-dataset CARLA-NAV to allow efficient training and validation. The dataset comprises pre-recorded training sequences and a live environment for validation and testing. We provide extensive qualitative and quantitative empirical results to validate the efficacy of the proposed approach.

Related papers

Prompt-based Context- and Domain-aware Pretraining for Vision and Language Navigation [19.793659852435486]
We propose a novel Prompt-bAsed coNtext- and inDoor-Aware (PANDA) pretraining framework to address these problems. In the indoor-aware stage, we apply an efficient tuning paradigm to learn deep visual prompts from an indoor dataset. In the context-aware stage, we design a set of hard context prompts to capture the sequence-level semantics in the instruction.
arXiv Detail & Related papers (2023-09-07T11:58:34Z)
Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation [23.94546957057613]
Cross-modal alignment is one key challenge for Vision-and-Language Navigation (VLN). We propose a novel Grounded Entity-Landmark Adaptive (GELA) pre-training paradigm for VLN tasks.
arXiv Detail & Related papers (2023-08-24T06:25:20Z)
Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning. This task unifies spatial and temporal localization in video. We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
BEVBert: Multimodal Map Pre-training for Language-guided Navigation [75.23388288113817]
We propose a new spatially aware, map-based pre-training paradigm for vision-and-language navigation (VLN). We build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map. Based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatially aware cross-modal reasoning and thereby facilitates the language-guided navigation goal.
arXiv Detail & Related papers (2022-12-08T16:27:54Z)
Cross-modal Map Learning for Vision and Language Navigation [82.04247028482244]
We consider the problem of Vision-and-Language Navigation (VLN). In contrast to other works, our key insight is that the association between language and vision is stronger when it occurs in explicit spatial representations. We propose a cross-modal map learning model for vision-and-language navigation that first learns to predict the top-down semantics on an egocentric map for both observed and unobserved regions.
arXiv Detail & Related papers (2022-03-10T03:30:12Z)
NEAT: Neural Attention Fields for End-to-End Autonomous Driving [59.60483620730437]
We present NEural ATtention fields (NEAT), a novel representation that enables efficient reasoning for imitation learning models. NEAT is a continuous function which maps locations in Bird's Eye View (BEV) scene coordinates to waypoints and semantics. In a new evaluation setting involving adverse environmental conditions and challenging scenarios, NEAT outperforms several strong baselines and achieves driving scores on par with the privileged CARLA expert.
arXiv Detail & Related papers (2021-09-09T17:55:28Z)
Vision-Language Navigation with Random Environmental Mixup [112.94609558723518]
Vision-language Navigation (VLN) tasks require an agent to navigate step-by-step while perceiving the visual observations and comprehending a natural language instruction. Previous works have proposed various data augmentation methods to reduce data bias. We propose the Random Environmental Mixup (REM) method, which generates cross-connected house scenes as augmented data by mixing up environments.
arXiv Detail & Related papers (2021-06-15T04:34:26Z)
Self-Point-Flow: Self-Supervised Scene Flow Estimation from Point Clouds with Optimal Transport and Random Walk [59.87525177207915]
We develop a self-supervised method to establish correspondences between two point clouds to approximate scene flow. Our method achieves state-of-the-art performance among self-supervised learning methods.
arXiv Detail & Related papers (2021-05-18T03:12:42Z)
Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling [65.99956848461915]
Vision-and-Language Navigation (VLN) is a task where agents must decide how to move through a 3D environment to reach a goal. One of the problems of the VLN task is data scarcity since it is difficult to collect enough navigation paths with human-annotated instructions for interactive environments. We propose an adversarial-driven counterfactual reasoning model that can consider effective conditions instead of low-quality augmented data.
arXiv Detail & Related papers (2019-11-17T18:02:51Z)