Related papers: MapBERT: Bitwise Masked Modeling for Real-Time Semantic Mapping Generation

URL: http://arxiv.org/abs/2506.07350v1
Date: Mon, 09 Jun 2025 01:55:55 GMT
Title: MapBERT: Bitwise Masked Modeling for Real-Time Semantic Mapping Generation
Authors: Yijie Deng, Shuaihang Yuan, Congcong Wen, Hao Huang, Anthony Tzes, Geeta Chandra Raju Bethala, Yi Fang
Abstract summary: MapBERT is a novel framework designed to model the distribution of unseen spaces. Experiments on Gibson benchmarks show that MapBERT achieves state-of-the-art semantic map generation.
Score: 15.116320098263149
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Spatial awareness is a critical capability for embodied agents, as it enables them to anticipate and reason about unobserved regions. The primary challenge arises from learning the distribution of indoor semantics, complicated by sparse, imbalanced object categories and diverse spatial scales. Existing methods struggle to robustly generate unobserved areas in real time and do not generalize well to new environments. To this end, we propose MapBERT, a novel framework designed to effectively model the distribution of unseen spaces. Motivated by the observation that the one-hot encoding of semantic maps aligns naturally with the binary structure of bit encoding, we, for the first time, leverage a lookup-free BitVAE to encode semantic maps into compact bitwise tokens. Building on this, a masked transformer is employed to infer missing regions and generate complete semantic maps from limited observations. To enhance object-centric reasoning, we propose an object-aware masking strategy that masks entire object categories concurrently and pairs them with learnable embeddings, capturing implicit relationships between object embeddings and spatial tokens. By learning these relationships, the model more effectively captures indoor semantic distributions crucial for practical robotic tasks. Experiments on Gibson benchmarks show that MapBERT achieves state-of-the-art semantic map generation, balancing computational efficiency with accurate reconstruction of unobserved regions.
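
Illustrative sketch (not from the paper): the abstract's remark that one-hot semantic maps align with bit encoding, and its object-aware masking idea, can be pictured with a toy NumPy example. This is not the paper's BitVAE or masked transformer. The function names `onehot_to_bit_tokens` and `mask_category`, the (C, H, W) layout, and the rule of hiding every cell where the chosen category appears are all assumptions made for illustration.

```python
import numpy as np


def onehot_to_bit_tokens(sem_map: np.ndarray) -> np.ndarray:
    """Pack a binary (C, H, W) semantic map into one integer per cell.

    Bit c of the integer at (y, x) is set when category c is present there.
    Hypothetical helper; the paper learns its tokens with a BitVAE instead.
    """
    num_classes = sem_map.shape[0]
    # Per-category bit weights 1, 2, 4, ... broadcast over the spatial grid.
    weights = (1 << np.arange(num_classes, dtype=np.int64)).reshape(-1, 1, 1)
    return (sem_map.astype(np.int64) * weights).sum(axis=0)


def mask_category(sem_map: np.ndarray, category: int):
    """Hide every cell in which the chosen category appears (assumed rule).

    Returns the masked copy and the boolean (H, W) mask of hidden cells.
    """
    hidden = sem_map[category].astype(bool)
    masked = sem_map.copy()
    masked[:, hidden] = 0  # clear all category channels at hidden cells
    return masked, hidden


# Toy map: 3 categories on a 2x3 grid.
sem_map = np.zeros((3, 2, 3), dtype=np.uint8)
sem_map[0, 0, 0] = 1  # category 0 at (0, 0)
sem_map[1, 0, 1] = 1  # category 1 at (0, 1)
sem_map[2, 1, 2] = 1  # category 2 at (1, 2)
sem_map[0, 1, 2] = 1  # category 0 also at (1, 2)

tokens = onehot_to_bit_tokens(sem_map)
masked_map, hidden = mask_category(sem_map, category=0)
masked_tokens = onehot_to_bit_tokens(masked_map)

print(tokens)
print(masked_tokens)
```

In this toy setting, a generative model would be asked to recover `tokens` from `masked_tokens` at the cells flagged in `hidden`.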

Related papers

Map Space Belief Prediction for Manipulation-Enhanced Mapping [35.04168032835369]
In this work, we address the problem of manipulation-enhanced semantic mapping. A robot has to efficiently identify all objects in a cluttered shelf. Our novel POMDP planner improves map completeness and accuracy over existing methods.
arXiv Detail & Related papers (2025-02-28T00:10:52Z)
Mapping High-level Semantic Regions in Indoor Environments without Object Recognition [50.624970503498226]
The present work proposes a method for semantic region mapping via embodied navigation in indoor environments. To enable region identification, the method uses a vision-to-language model to provide scene information for mapping. By projecting egocentric scene understanding into the global frame, the proposed method generates a semantic map as a distribution over possible region labels at each location.
arXiv Detail & Related papers (2024-03-11T18:09:50Z)
EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models [52.3015009878545]
We develop an image segmentor capable of generating fine-grained segmentation maps without any additional training. Our framework identifies semantic correspondences between image pixels and spatial locations of low-dimensional feature maps. In extensive experiments, the produced segmentation maps are demonstrated to be well delineated and capture detailed parts of the images.
arXiv Detail & Related papers (2024-01-22T07:34:06Z)
Background Activation Suppression for Weakly Supervised Object Localization and Semantic Segmentation [84.62067728093358]
Weakly supervised object localization and semantic segmentation aim to localize objects using only image-level labels. A new paradigm has emerged that generates a foreground prediction map to achieve pixel-level localization. This paper presents two astonishing experimental observations on the object localization learning process.
arXiv Detail & Related papers (2023-09-22T15:44:10Z)
Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos [63.94040814459116]
Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence. We propose a novel semantic-aware masked slot attention on top of the fused semantic features and correspondence maps. We adopt semantic- and instance-level temporal consistency as self-supervision to encourage temporally coherent object-centric representations.
arXiv Detail & Related papers (2023-08-19T09:12:13Z)
Sparse Instance Activation for Real-Time Instance Segmentation [72.23597664935684]
We propose a conceptually novel, efficient, and fully convolutional framework for real-time instance segmentation. SparseInst has extremely fast inference speed and achieves 40 FPS and 37.9 AP on the COCO benchmark.
arXiv Detail & Related papers (2022-03-24T03:15:39Z)
Lightweight Object-level Topological Semantic Mapping and Long-term Global Localization based on Graph Matching [19.706907816202946]
We present a novel lightweight object-level mapping and localization method with high accuracy and robustness. We use object-level features with both semantic and geometric information to model landmarks in the environment. Based on the proposed map, robust localization is achieved by constructing a novel local semantic scene graph descriptor.
arXiv Detail & Related papers (2022-01-16T05:47:07Z)
Cross-Image Region Mining with Region Prototypical Network for Weakly Supervised Segmentation [45.39679291105364]
We propose a region network RPNet to explore the cross-image object diversity of the training set. Similar object parts across images are identified via region feature comparison. Experiments show that the proposed method generates more complete and accurate pseudo object masks.
arXiv Detail & Related papers (2021-08-17T02:51:02Z)
Exploiting latent representation of sparse semantic layers for improved short-term motion prediction with Capsule Networks [0.12183405753834559]
This paper explores the use of Capsule Networks (CapsNets) for learning a hierarchical representation of sparse semantic layers corresponding to small regions of the High-Definition (HD) map. By using an architecture based on CapsNets, the model is able to retain hierarchical relationships between detected features within images whilst also preventing the loss of spatial data often caused by the pooling operation. We show that our model achieves significant improvement over recently published works on prediction, whilst drastically reducing the overall size of the network.
arXiv Detail & Related papers (2021-03-02T11:13:43Z)
Rethinking Localization Map: Towards Accurate Object Perception with Self-Enhancement Maps [78.2581910688094]
This work introduces a novel self-enhancement method to harvest accurate object localization maps and object boundaries with only category labels as supervision. In particular, the proposed Self-Enhancement Maps achieve the state-of-the-art localization accuracy of 54.88% on ILSVRC.
arXiv Detail & Related papers (2020-06-09T12:35:55Z)

This list is automatically generated from the titles and abstracts of the papers on this site.