Trans4Map: Revisiting Holistic Top-down Mapping from Egocentric Images
to Allocentric Semantics with Vision Transformers
- URL: http://arxiv.org/abs/2207.06205v1
- Date: Wed, 13 Jul 2022 14:01:00 GMT
- Title: Trans4Map: Revisiting Holistic Top-down Mapping from Egocentric Images
to Allocentric Semantics with Vision Transformers
- Authors: Chang Chen, Jiaming Zhang, Kailun Yang, Kunyu Peng, Rainer
Stiefelhagen
- Abstract summary: We propose an end-to-end one-stage Transformer-based framework for Mapping, termed Trans4Map.
Trans4Map achieves state-of-the-art results, reducing parameters by 67.2% while gaining +3.25% mIoU and +4.09% mBF1 on the Matterport3D dataset.
- Score: 34.6312362205904
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans have an innate ability to sense their surroundings: they can
extract a spatial representation from egocentric perception and form an
allocentric semantic map via spatial transformation and memory updating.
However, endowing mobile agents with such a spatial sensing ability remains a
challenge, due to two difficulties: (1) previous convolutional models are
limited by their local receptive fields and thus struggle to capture holistic
long-range dependencies during observation; (2) the excessive computational
budget required for success often leads to a separation of the mapping
pipeline into stages, rendering the entire mapping process inefficient. To
address these issues, we propose an end-to-end one-stage Transformer-based
framework for Mapping, termed Trans4Map. Our egocentric-to-allocentric mapping
process includes three steps: (1) an efficient transformer extracts contextual
features from a batch of egocentric images; (2) the proposed
Bidirectional Allocentric Memory (BAM) module projects egocentric features into
the allocentric memory; (3) the map decoder parses the accumulated memory and
predicts the top-down semantic segmentation map. Compared with previous
methods, Trans4Map achieves state-of-the-art results, reducing parameters by
67.2% while gaining +3.25% mIoU and +4.09% mBF1 on the Matterport3D dataset. Code
will be made publicly available at https://github.com/jamycheung/Trans4Map.
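To make the three-step pipeline above concrete, here is a minimal PyTorch sketch. It is an illustration under stated assumptions, not the released Trans4Map code: the module names (EgoEncoder, project_to_allocentric, MapDecoder) are hypothetical, the encoder is a generic small transformer rather than the paper's efficient backbone, the scatter-based projection only stands in for the BAM module, and the map-cell indices that would normally come from camera pose and depth are random placeholders.

```python
# Minimal sketch of an egocentric-to-allocentric mapping pipeline (assumed names).
import torch
import torch.nn as nn

class EgoEncoder(nn.Module):
    """Step 1 (stand-in): a small transformer that turns each egocentric RGB
    frame into a grid of contextual feature tokens."""
    def __init__(self, dim=64):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, rgb):                                      # rgb: (T, 3, H, W)
        tokens = self.patchify(rgb).flatten(2).transpose(1, 2)   # (T, N, dim)
        return self.encoder(tokens)

def project_to_allocentric(feats, cell_index, memory):
    """Step 2 (stand-in for the BAM module): scatter egocentric feature tokens
    into a shared top-down memory using precomputed map-cell indices; cells hit
    by several observations keep the element-wise maximum."""
    D = feats.shape[-1]
    flat = memory.view(-1, D)                                    # (Hm*Wm, D), shares storage
    flat.scatter_reduce_(0,
                         cell_index.reshape(-1, 1).expand(-1, D),
                         feats.reshape(-1, D),
                         reduce="amax", include_self=True)
    return memory

class MapDecoder(nn.Module):
    """Step 3: decode the accumulated allocentric memory into a top-down
    semantic segmentation map."""
    def __init__(self, dim=64, num_classes=13):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(dim, num_classes, 1))

    def forward(self, memory):                                   # memory: (Hm, Wm, dim)
        return self.head(memory.permute(2, 0, 1).unsqueeze(0))   # (1, C, Hm, Wm)

# Toy usage: T frames; the token-to-map-cell assignment (normally derived from
# known camera pose and depth) is replaced by random placeholder indices.
T, H, W, Hm, Wm, dim = 4, 128, 128, 50, 50, 64
with torch.no_grad():
    frames = torch.rand(T, 3, H, W)
    feats = EgoEncoder(dim)(frames)                              # (T, N, dim)
    cell_index = torch.randint(0, Hm * Wm, (T, feats.shape[1]))  # placeholder geometry
    memory = project_to_allocentric(feats, cell_index, torch.zeros(Hm, Wm, dim))
    semantic_map = MapDecoder(dim)(memory)                       # (1, classes, Hm, Wm)
```

The design point the sketch mirrors is that features from many egocentric frames are written into one shared top-down memory, so the map decoder sees a single accumulated allocentric representation instead of per-frame outputs that would require a separate fusion stage.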
Related papers
- GenMapping: Unleashing the Potential of Inverse Perspective Mapping for Robust Online HD Map Construction [20.1127163541618]
We have designed a universal map generation framework, GenMapping.
The framework is established with a triadic synergy architecture, including principal and dual auxiliary branches.
A thorough array of experimental results shows that the proposed model surpasses current state-of-the-art methods in both semantic mapping and vectorized mapping, while also maintaining a rapid inference speed.
arXiv Detail & Related papers (2024-09-13T10:15:28Z)
- Neural Semantic Surface Maps [52.61017226479506]
We present an automated technique for computing a map between two genus-zero shapes, which matches semantically corresponding regions to one another.
Our approach generates semantic surface-to-surface maps without requiring manual annotations or any 3D training data.
arXiv Detail & Related papers (2023-09-09T16:21:56Z)
- Efficient Map Sparsification Based on 2D and 3D Discretized Grids [47.22997560184043]
As a map grows larger, more memory is required and localization becomes inefficient.
Previous map sparsification methods add a quadratic term in mixed-integer programming to enforce a uniform distribution of selected landmarks.
In this paper, we formulate map sparsification in an efficient linear form and select uniformly distributed landmarks based on 2D discretized grids (a simplified sketch of this idea follows the related-papers list).
arXiv Detail & Related papers (2023-03-20T05:49:14Z)
- Memory transformers for full context and high-resolution 3D Medical Segmentation [76.93387214103863]
This paper introduces the Full resolutIoN mEmory (FINE) transformer to overcome the challenge of modeling full-range interactions over high-resolution 3D volumes.
The core idea behind FINE is to learn memory tokens to indirectly model full range interactions.
Experiments on the BCV image segmentation dataset show better performance than state-of-the-art CNN and transformer baselines.
arXiv Detail & Related papers (2022-10-11T10:11:05Z)
- Sparse Semantic Map-Based Monocular Localization in Traffic Scenes Using Learned 2D-3D Point-Line Correspondences [29.419138863851526]
Given a query image, the goal is to estimate the camera pose corresponding to the prior map.
Existing approaches rely heavily on dense point descriptors at the feature level to solve the registration problem.
We propose a sparse semantic map-based monocular localization method, which solves 2D-3D registration via a well-designed deep neural network.
arXiv Detail & Related papers (2022-10-10T10:29:07Z)
- SHINE-Mapping: Large-Scale 3D Mapping Using Sparse Hierarchical Implicit Neural Representations [37.733802382489515]
This paper addresses the problems of achieving large-scale 3D reconstructions with implicit representations using 3D LiDAR measurements.
We learn and store implicit features through an octree-based hierarchical structure, which is sparse and extensible.
Our experiments show that our 3D reconstructions are more accurate, complete, and memory-efficient than current state-of-the-art 3D mapping methods.
arXiv Detail & Related papers (2022-10-05T14:38:49Z)
- MISSU: 3D Medical Image Segmentation via Self-distilling TransUNet [55.16833099336073]
We propose to self-distill a Transformer-based UNet for medical image segmentation.
It simultaneously learns global semantic information and local spatial-detailed features.
Our MISSU achieves the best performance over previous state-of-the-art methods.
arXiv Detail & Related papers (2022-06-02T07:38:53Z)
- Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views [50.844459908504476]
We study the task of semantic mapping - specifically, an embodied agent (a robot or an egocentric AI assistant) is given a tour of a new environment.
We build an allocentric top-down semantic map ("what is where?") from egocentric observations of an RGB-D camera with known pose.
We present SemanticMapNet (SMNet), which combines the strengths of projective camera geometry and neural representation learning.
arXiv Detail & Related papers (2020-10-02T20:44:46Z)
- Gravitational Models Explain Shifts on Human Visual Attention [80.76475913429357]
Visual attention refers to the human brain's ability to select relevant sensory information for preferential processing.
Various methods to estimate saliency have been proposed in the last three decades.
We propose a gravitational model (GRAV) to describe the attentional shifts.
arXiv Detail & Related papers (2020-09-15T10:12:41Z)
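The grid-based linear formulation highlighted in the Efficient Map Sparsification entry above can be illustrated with a toy integer linear program: once landmarks are discretized into 2D grid cells, a uniformity requirement becomes one linear constraint per occupied cell. The objective and constraints below (keep as few landmarks as possible while retaining at least one per occupied cell, solved with scipy.optimize.milp) are simplifying assumptions for illustration, not the paper's actual formulation.

```python
# Toy linear-form sparsification: keep as few landmarks as possible while every
# occupied 2D grid cell retains at least one (an assumed, simplified objective).
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

rng = np.random.default_rng(0)
landmarks = rng.uniform(0.0, 10.0, size=(200, 2))   # hypothetical landmark x/y positions
cell_size = 2.0

# Discretize positions into 2D grid cells and give each occupied cell an id.
cells = np.floor(landmarks / cell_size).astype(int)
cell_keys = cells[:, 0] * 1_000 + cells[:, 1]       # unique integer key per cell
_, cell_ids = np.unique(cell_keys, return_inverse=True)
n_landmarks, n_cells = landmarks.shape[0], cell_ids.max() + 1

# Binary variable x_i = 1 keeps landmark i. A[c, i] = 1 if landmark i lies in
# cell c, so "at least one landmark per occupied cell" is the linear constraint A @ x >= 1.
A = np.zeros((n_cells, n_landmarks))
A[cell_ids, np.arange(n_landmarks)] = 1.0

res = milp(
    c=np.ones(n_landmarks),                   # minimize the number of kept landmarks
    constraints=LinearConstraint(A, lb=1.0),  # per-cell coverage, all linear
    integrality=np.ones(n_landmarks),         # integer variables (binary via bounds below)
    bounds=Bounds(0, 1),
)
selected = np.flatnonzero(res.x > 0.5)
print(f"kept {selected.size} of {n_landmarks} landmarks across {n_cells} occupied cells")
```

Because every constraint here is linear in the binary selection variables, no quadratic uniformity term is needed, which is the efficiency argument that entry summarizes.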