MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation
- URL: http://arxiv.org/abs/2103.11374v1
- Date: Sun, 21 Mar 2021 12:01:23 GMT
- Title: MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation
- Authors: Zachary Seymour, Kowshik Thopalli, Niluthpol Mithun, Han-Pang Chiu,
Supun Samarasekera, Rakesh Kumar
- Abstract summary: This work focuses on performing better than or comparably to existing learning-based solutions for visual navigation by autonomous agents.
We propose a method to encode vital scene semantics into a semantically informed, top-down egocentric map representation.
We conduct experiments on 3-D reconstructed indoor PointGoal visual navigation and demonstrate the effectiveness of our approach.
- Score: 4.127128889779478
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual navigation for autonomous agents is a core task in the fields of
computer vision and robotics. Learning-based methods, such as deep
reinforcement learning, have the potential to outperform the classical
solutions developed for this task; however, they come at a significantly
increased computational load. Through this work, we design a novel approach
that focuses on performing better than or comparably to existing learning-based
solutions, but under a clear time/computational budget. To this end, we propose
a method to encode vital scene semantics such as traversable paths, unexplored
areas, and observed scene objects -- alongside raw visual streams such as RGB,
depth, and semantic segmentation masks -- into a semantically informed,
top-down egocentric map representation. Further, to enable the effective use of
this information, we introduce a novel 2-D map attention mechanism, based on
the successful multi-layer Transformer networks. We conduct experiments on 3-D
reconstructed indoor PointGoal visual navigation and demonstrate the
effectiveness of our approach. We show that by using our novel attention schema
and auxiliary rewards to better utilize scene semantics, we outperform multiple
baselines trained with only raw inputs or implicit semantic information while
operating with an 80% decrease in the agent's experience.
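The abstract describes the core mechanism at a high level: a top-down egocentric map carrying semantic channels (traversable space, unexplored area, observed objects) is processed by a 2-D map attention module built from multi-layer Transformer blocks. Below is a minimal sketch of that idea in PyTorch; the channel count, patch size, pooling, and layer sizes are illustrative assumptions rather than the authors' exact configuration.
```python
# Minimal sketch of a 2-D map attention module: a multi-layer Transformer
# encoder applied to patch tokens of a top-down egocentric semantic map.
# All sizes below are illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class MapAttention(nn.Module):
    def __init__(self, map_channels=16, map_size=64, patch=8, dim=128,
                 heads=4, layers=2):
        super().__init__()
        n_patches = (map_size // patch) ** 2
        # Each patch of the semantic map becomes one attention token.
        self.to_tokens = nn.Conv2d(map_channels, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               dim_feedforward=4 * dim,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, sem_map):
        # sem_map: (B, C, H, W) egocentric map with semantic channels
        # (traversable space, unexplored area, observed object classes, ...).
        tokens = self.to_tokens(sem_map).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = tokens + self.pos
        attended = self.encoder(tokens)   # self-attention across map patches
        return attended.mean(dim=1)       # pooled feature for the policy


if __name__ == "__main__":
    maps = torch.randn(2, 16, 64, 64)     # batch of semantic egocentric maps
    feat = MapAttention()(maps)
    print(feat.shape)                     # torch.Size([2, 128])
```
In practice such a pooled map feature would be concatenated with the other observation embeddings before the navigation policy; the paper's auxiliary rewards are not sketched here.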
Related papers
- Self-supervised Learning via Cluster Distance Prediction for Operating Room Context Awareness [44.15562068190958]
In the Operating Room, semantic segmentation is at the core of making robots aware of their clinical surroundings.
State-of-the-art semantic segmentation and activity recognition approaches are fully supervised, which is not scalable.
We propose a new 3D self-supervised task for OR scene understanding utilizing OR scene images captured with ToF cameras.
arXiv Detail & Related papers (2024-07-07T17:17:52Z)
- Interactive Semantic Map Representation for Skill-based Visual Object Navigation [43.71312386938849]
This paper introduces a new representation of a scene semantic map formed during the embodied agent's interaction with the indoor environment.
We have implemented this representation into a full-fledged navigation approach called SkillTron.
The proposed approach makes it possible to form both intermediate goals for robot exploration and the final goal for object navigation.
arXiv Detail & Related papers (2023-11-07T16:30:12Z)
- Learning Navigational Visual Representations with Semantic Map Supervision [85.91625020847358]
We propose a navigational-specific visual representation learning method by contrasting the agent's egocentric views and semantic maps.
Ego$^2$-Map learning transfers the compact and rich information from a map, such as objects, structure and transition, to the agent's egocentric representations for navigation.
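A minimal sketch of the kind of contrastive objective such view-to-map representation learning could use, assuming a symmetric InfoNCE loss over paired egocentric-view and map embeddings; the loss form and temperature are illustrative assumptions, not necessarily the paper's exact formulation.
```python
# Symmetric InfoNCE between paired egocentric-view and semantic-map embeddings.
import torch
import torch.nn.functional as F

def info_nce(view_emb, map_emb, temperature=0.07):
    # view_emb, map_emb: (B, D) embeddings of paired views and maps.
    v = F.normalize(view_emb, dim=-1)
    m = F.normalize(map_emb, dim=-1)
    logits = v @ m.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Matched view/map pairs lie on the diagonal; all others act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```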
arXiv Detail & Related papers (2023-07-23T14:01:05Z)
- Learning with a Mole: Transferable latent spatial representations for navigation without reconstruction [12.845774297648736]
In most end-to-end learning approaches the representation is latent and usually does not have a clearly defined interpretation.
In this work we propose to learn an actionable representation of the scene independently of the targeted downstream task.
The learned representation is optimized by a blind auxiliary agent trained to navigate with it on multiple short sub-episodes branching out from a waypoint.
arXiv Detail & Related papers (2023-06-06T16:51:43Z)
- How To Not Train Your Dragon: Training-free Embodied Object Goal Navigation with Semantic Frontiers [94.46825166907831]
We present a training-free solution to tackle the object goal navigation problem in Embodied AI.
Our method builds a structured scene representation based on the classic visual simultaneous localization and mapping (V-SLAM) framework.
Our method propagates semantics on the scene graphs based on language priors and scene statistics to introduce semantic knowledge to the geometric frontiers.
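A rough sketch of how semantic knowledge could be attached to geometric frontiers: each frontier is ranked by how strongly nearby observed object labels relate to the goal category under a language prior. The data structures and the similarity callable are hypothetical placeholders, not the paper's actual pipeline.
```python
# Rank exploration frontiers by semantic relatedness of nearby objects to the goal.
from typing import Callable, Dict, List, Tuple

def rank_frontiers(
    frontiers: Dict[int, Tuple[float, float]],    # frontier id -> (x, y) position
    nearby_objects: Dict[int, List[str]],         # frontier id -> observed labels
    goal: str,
    similarity: Callable[[str, str], float],      # language-prior similarity score
) -> List[int]:
    """Return frontier ids sorted from most to least goal-relevant."""
    scores = {}
    for fid in frontiers:
        labels = nearby_objects.get(fid, [])
        # A frontier with no semantic evidence falls back to a neutral score.
        scores[fid] = max((similarity(lbl, goal) for lbl in labels), default=0.0)
    return sorted(scores, key=scores.get, reverse=True)
```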
arXiv Detail & Related papers (2023-05-26T13:38:33Z)
- ALSO: Automotive Lidar Self-supervision by Occupancy estimation [70.70557577874155]
We propose a new self-supervised method for pre-training the backbone of deep perception models operating on point clouds.
The core idea is to train the model on a pretext task which is the reconstruction of the surface on which the 3D points are sampled.
The intuition is that if the network is able to reconstruct the scene surface, given only sparse input points, then it probably also captures some fragments of semantic information.
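A minimal sketch of an occupancy-style pretext head in this spirit: query points sampled around the scanned surface are classified as inside/outside from a pooled backbone feature. The backbone, the point-sampling strategy, and the feature pooling are placeholder assumptions.
```python
# Occupancy-prediction pretext head on top of a point-cloud backbone feature.
import torch
import torch.nn as nn

class OccupancyHead(nn.Module):
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, scene_feat, query_xyz):
        # scene_feat: (B, F) pooled backbone feature; query_xyz: (B, Q, 3) queries.
        f = scene_feat.unsqueeze(1).expand(-1, query_xyz.size(1), -1)
        return self.mlp(torch.cat([f, query_xyz], dim=-1)).squeeze(-1)  # (B, Q) logits

def pretext_loss(head, scene_feat, query_xyz, is_inside):
    # is_inside: (B, Q) binary targets derived from the scanned surface.
    logits = head(scene_feat, query_xyz)
    return nn.functional.binary_cross_entropy_with_logits(logits, is_inside.float())
```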
arXiv Detail & Related papers (2022-12-12T13:10:19Z)
- X-Distill: Improving Self-Supervised Monocular Depth via Cross-Task Distillation [69.9604394044652]
We propose a novel method to improve the self-supervised training of monocular depth via cross-task knowledge distillation.
During training, we utilize a pretrained semantic segmentation teacher network and transfer its semantic knowledge to the depth network.
We extensively evaluate the efficacy of our proposed approach on the KITTI benchmark and compare it with the latest state of the art.
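A hedged sketch of one way such cross-task distillation can be wired up: a frozen segmentation teacher provides pseudo-labels, and a lightweight segmentation head on the depth network's features is trained on them alongside the usual self-supervised photometric loss. The function signatures and the loss weight are assumptions for illustration, not the paper's exact design.
```python
# Cross-task distillation step: segmentation pseudo-labels regularize depth features.
import torch
import torch.nn.functional as F

def distillation_step(image, depth_net, seg_head, seg_teacher,
                      photometric_loss, weight=0.1):
    with torch.no_grad():
        pseudo_labels = seg_teacher(image).argmax(dim=1)   # (B, H, W) class ids
    depth, features = depth_net(image)                     # assumed to return both
    seg_logits = seg_head(features)                        # (B, K, H, W)
    loss_depth = photometric_loss(depth, image)            # self-supervised term
    loss_distill = F.cross_entropy(seg_logits, pseudo_labels)
    return loss_depth + weight * loss_distill
```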
arXiv Detail & Related papers (2021-10-24T19:47:14Z)
- Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos [78.34818195786846]
We introduce the task of spatially localizing narrated interactions in videos.
Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations.
We propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training.
arXiv Detail & Related papers (2021-10-20T14:45:13Z)
- Improving Target-driven Visual Navigation with Attention on 3D Spatial Relationships [52.72020203771489]
We investigate target-driven visual navigation using deep reinforcement learning (DRL) in 3D indoor scenes.
Our proposed method combines visual features and 3D spatial representations to learn navigation policy.
Our experiments, performed in the AI2-THOR environment, show that our model outperforms the baselines in both success rate (SR) and success weighted by path length (SPL).
arXiv Detail & Related papers (2020-04-29T08:46:38Z)