A Priority Map for Vision-and-Language Navigation with Trajectory Plans
and Feature-Location Cues
- URL: http://arxiv.org/abs/2207.11717v1
- Date: Sun, 24 Jul 2022 11:09:45 GMT
- Title: A Priority Map for Vision-and-Language Navigation with Trajectory Plans
and Feature-Location Cues
- Authors: Jason Armitage, Leonardo Impett, Rico Sennrich
- Abstract summary: We implement a priority map module and pretrain on auxiliary tasks using low-sample datasets.
A hierarchical process of trajectory planning addresses the core challenges of cross-modal alignment and feature-level localisation.
The priority map module is integrated into a feature-location framework that doubles the task completion rates of standalone transformers.
- Score: 34.55676068012246
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In a busy city street, a pedestrian surrounded by distractions can pick out a
single sign if it is relevant to their route. Artificial agents in outdoor
Vision-and-Language Navigation (VLN) are likewise confronted with detecting
supervisory signals on environment features and locations in their inputs. To boost the
prominence of relevant features in transformer-based architectures without
costly preprocessing and pretraining, we take inspiration from priority maps -
a mechanism described in neuropsychological studies. We implement a novel
priority map module and pretrain on auxiliary tasks using low-sample datasets
with high-level representations of routes and environment-related references to
urban features. A hierarchical process of trajectory planning - with subsequent
parameterised visual boost filtering on visual inputs and prediction of
corresponding textual spans - addresses the core challenges of cross-modal
alignment and feature-level localisation. The priority map module is integrated
into a feature-location framework that doubles the task completion rates of
standalone transformers and attains state-of-the-art performance on the
Touchdown benchmark for VLN. Code and data are referenced in Appendix C.
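The parameterised visual boost filtering described in the abstract can be sketched as a relevance-weighted gain on visual features. This is an illustrative assumption, not the authors' implementation: the function name, softmax weighting, and `gain` parameter are all invented for the sketch.

```python
import numpy as np

def priority_boost(visual_feats, text_query, gain=2.0):
    """Scale each visual region feature by its relevance to the instruction.

    visual_feats: (N, d) array of N visual region features.
    text_query:   (d,)  pooled instruction embedding.
    gain:         strength of the parameterised boost filter.
    """
    scores = visual_feats @ text_query                   # raw relevance, (N,)
    scores = scores - scores.max()                       # numerical stability
    priorities = np.exp(scores) / np.exp(scores).sum()   # softmax priority map
    # Boost filtering: amplify each feature in proportion to its priority.
    return visual_feats * (1.0 + gain * priorities)[:, None]

rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 8))
query = rng.standard_normal(8)
boosted = priority_boost(feats, query)
```

Because every priority is positive, each feature row is scaled by a factor of at least one, with instruction-relevant regions amplified the most.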
Related papers
- Task-Oriented Communications for Visual Navigation with Edge-Aerial Collaboration in Low Altitude Economy [16.62021190565778]
Vision-based methods face severe bandwidth, memory and processing constraints on lightweight UAVs.
We propose a task-oriented communication framework, where UAVs equipped with multi-camera systems extract compact multi-view features and offload localization tasks to edge servers.
arXiv Detail & Related papers (2025-04-25T12:49:14Z)
- TrajGEOS: Trajectory Graph Enhanced Orientation-based Sequential Network for Mobility Prediction [10.876862361004944]
We propose a Trajectory Graph Enhanced Orientation-based Sequential network (TrajGEOS) for next-location prediction tasks.
arXiv Detail & Related papers (2024-12-26T07:18:38Z)
- PRET: Planning with Directed Fidelity Trajectory for Vision and Language Navigation [30.710806048991923]
Vision and language navigation is a task that requires an agent to navigate according to a natural language instruction.
Recent methods predict sub-goals on constructed topology map at each step to enable long-term action planning.
We propose an alternative method that facilitates navigation planning by considering the alignment between instructions and directed fidelity trajectories.
arXiv Detail & Related papers (2024-07-16T08:22:18Z)
- Towards Effective Next POI Prediction: Spatial and Semantic Augmentation with Remote Sensing Data [10.968721742000653]
We propose an effective deep-learning method within a two-step prediction framework.
Our method first incorporates remote sensing data, capturing pivotal environmental context.
We construct the QR-P graph for the user's historical trajectories to encapsulate historical travel knowledge.
arXiv Detail & Related papers (2024-03-22T04:22:36Z)
- Vanishing-Point-Guided Video Semantic Segmentation of Driving Scenes [70.08318779492944]
We are the first to harness vanishing point (VP) priors for more effective segmentation.
Our novel, efficient network for VSS, named VPSeg, incorporates two modules that utilize exactly this pair of static and dynamic VP priors.
arXiv Detail & Related papers (2024-01-27T01:01:58Z)
- Towards Unified Token Learning for Vision-Language Tracking [65.96561538356315]
We present a vision-language (VL) tracking pipeline, termed MMTrack, which casts VL tracking as a token generation task.
Our proposed framework serializes language description and bounding box into a sequence of discrete tokens.
In this new design paradigm, all token queries are required to perceive the desired target and directly predict spatial coordinates of the target.
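The serialisation step above can be sketched as coordinate quantisation into a shared discrete vocabulary. The bin count and helper names here are illustrative assumptions, not MMTrack's actual configuration.

```python
N_BINS = 1000  # coordinate vocabulary size (an assumed value)

def box_to_tokens(box, img_w, img_h, n_bins=N_BINS):
    """Quantise an (x1, y1, x2, y2) pixel box into discrete coordinate tokens."""
    x1, y1, x2, y2 = box
    norm = (x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h)
    return [min(int(v * n_bins), n_bins - 1) for v in norm]

def tokens_to_box(tokens, img_w, img_h, n_bins=N_BINS):
    """Invert the quantisation back to approximate pixel coordinates."""
    scale = (img_w, img_h, img_w, img_h)
    return [(t + 0.5) / n_bins * s for t, s in zip(tokens, scale)]

tokens = box_to_tokens((100, 50, 400, 300), img_w=640, img_h=480)
# → [156, 104, 625, 625]
```

Decoding loses at most one bin width per coordinate, which is why a large vocabulary keeps the localisation error of generated tokens small.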
arXiv Detail & Related papers (2023-08-27T13:17:34Z)
- Top-Down Visual Attention from Analysis by Synthesis [87.47527557366593]
We consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision.
We propose the Analysis-by-Synthesis Vision Transformer (AbSViT), a top-down modulated ViT model that variationally approximates AbS and achieves controllable top-down attention.
arXiv Detail & Related papers (2023-03-23T05:17:05Z)
- BEVBert: Multimodal Map Pre-training for Language-guided Navigation [75.23388288113817]
We propose a new spatial-aware, map-based pre-training paradigm for vision-and-language navigation (VLN).
We build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map.
Based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatial-aware cross-modal reasoning thereby facilitating the language-guided navigation goal.
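A minimal sketch of the "aggregate observations and remove duplicates" idea behind the local metric map, under assumed simplifications: scalar features, a fixed cell size, and mean pooling, none of which are claimed to match BEVBert's design.

```python
from collections import defaultdict

def build_local_map(observations, cell=0.5):
    """Aggregate repeated (x, y, feature) observations onto a metric grid.

    Duplicate observations falling in the same cell are merged by averaging,
    so the map holds exactly one entry per visited location.
    """
    acc = defaultdict(list)
    for x, y, feat in observations:
        acc[(int(x // cell), int(y // cell))].append(feat)
    return {cell_idx: sum(feats) / len(feats) for cell_idx, feats in acc.items()}

obs = [(0.1, 0.1, 1.0), (0.2, 0.3, 3.0), (1.0, 1.0, 5.0)]
local_map = build_local_map(obs)  # first two observations share one cell
```

A global topological map would then link such local maps at the node level, which is where the navigation dependency modelling comes in.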
arXiv Detail & Related papers (2022-12-08T16:27:54Z)
- Predicting Dense and Context-aware Cost Maps for Semantic Robot Navigation [35.45993685414002]
We investigate the task of object goal navigation in unknown environments where the target is specified by a semantic label.
We propose a deep neural network architecture and loss function to predict dense cost maps that implicitly contain semantic context.
We also present a novel way of fusing mid-level visual representations in our architecture to provide additional semantic cues for cost map prediction.
arXiv Detail & Related papers (2022-10-17T11:43:19Z)
- VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation [74.56282712099274]
This paper introduces VectorNet, a hierarchical graph neural network that exploits the spatial locality of individual road components represented by vectors.
By operating on the vectorized high definition (HD) maps and agent trajectories, we avoid lossy rendering and computationally intensive ConvNet encoding steps.
We evaluate VectorNet on our in-house behavior prediction benchmark and the recently released Argoverse forecasting dataset.
arXiv Detail & Related papers (2020-05-08T19:07:03Z)
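The vectorised representation in the VectorNet summary can be sketched as turning each map polyline into directed vector records. The tuple layout below is an assumption for illustration; the paper's actual records carry additional attribute features.

```python
def polyline_to_vectors(points, polyline_id):
    """Turn an ordered point list [(x, y), ...] into directed vectors.

    Each record is (x_start, y_start, x_end, y_end, polyline_id), so a graph
    network can operate on vectors instead of rasterised map imagery.
    """
    return [(x0, y0, x1, y1, polyline_id)
            for (x0, y0), (x1, y1) in zip(points, points[1:])]

lane = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.5)]
vectors = polyline_to_vectors(lane, polyline_id=7)  # two vector records
```

Keeping the polyline id on every vector lets a hierarchical graph network first pool vectors within a polyline, then exchange information across polylines.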
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.