Related papers: Masked Depth Modeling for Spatial Perception

Masked Depth Modeling for Spatial Perception

URL: http://arxiv.org/abs/2601.17895v1
Date: Sun, 25 Jan 2026 16:13:49 GMT
Title: Masked Depth Modeling for Spatial Perception
Authors: Bin Tan, Changjiang Sun, Xiage Qin, Hanat Adai, Zelin Fu, Tianxiang Zhou, Han Zhang, Yinghao Xu, Xing Zhu, Yujun Shen, Nan Xue,
Abstract summary: LingBot-Depth is a depth completion model that refines depth maps through masked depth modeling.<n>It outperforms top-tier RGB-D cameras in terms of both depth precision and pixel coverage.<n>We release the code, checkpoint, and 3M RGB-depth pairs to the community of spatial perception.
Score: 44.0326843862591
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Spatial visual perception is a fundamental requirement in physical-world applications like autonomous driving and robotic manipulation, driven by the need to interact with 3D environments. Capturing pixel-aligned metric depth using RGB-D cameras would be the most viable way, yet it usually faces obstacles posed by hardware limitations and challenging imaging conditions, especially in the presence of specular or texture-less surfaces. In this work, we argue that the inaccuracies from depth sensors can be viewed as "masked" signals that inherently reflect underlying geometric ambiguities. Building on this motivation, we present LingBot-Depth, a depth completion model which leverages visual context to refine depth maps through masked depth modeling and incorporates an automated data curation pipeline for scalable training. It is encouraging to see that our model outperforms top-tier RGB-D cameras in terms of both depth precision and pixel coverage. Experimental results on a range of downstream tasks further suggest that LingBot-Depth offers an aligned latent representation across RGB and depth modalities. We release the code, checkpoint, and 3M RGB-depth pairs (including 2M real data and 1M simulated data) to the community of spatial perception.

Related papers

Geometry-Aware Sparse Depth Sampling for High-Fidelity RGB-D Depth Completion in Robotic Systems [0.20999222360659608]
RGB-D and stereo vision sensors are widely used for industrial robotic systems that perform manipulation, inspection, and navigation tasks.<n>Current depth completion methods produce noisy, incomplete, or biased depth maps due to sensor limitations and environmental conditions.<n>We propose a normal-guided sparse depth sampling strategy that leverages PCA-based surface normal estimation on the RGB-D point cloud to compute a per-pixel depth reliability measure.<n> Experiments show that our geometry-aware sparse depth improves accuracy, reduces artifacts near edges and discontinuities, and produces more realistic training conditions.
arXiv Detail & Related papers (2025-12-09T04:14:05Z)
FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers [91.59069344768858]
We introduce Frequency-aware Positional Depth Embedding (FreqPDE) to equip 2D image features with spatial information for 3D detection transformer decoder.<n>FreqPDE combines the 2D image features and 3D position embeddings to generate 3D depth-aware features for query decoding.
arXiv Detail & Related papers (2025-10-17T07:36:54Z)
Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots [55.43376513158555]
Camera Depth Models (CDMs) are a simple plugin on daily-use depth cameras.<n>We develop a neural data engine that generates high-quality paired data from simulation by modeling a depth camera's noise pattern.<n>For the first time, our experiments demonstrate, for the first time, that a policy trained on raw simulated depth, without the need for adding noise or real-world fine-tuning, generalizes seamlessly to real-world robots.
arXiv Detail & Related papers (2025-09-02T17:29:38Z)
ClearDepth: Enhanced Stereo Perception of Transparent Objects for Robotic Manipulation [18.140839442955485]
We develop a vision transformer-based algorithm for stereo depth recovery of transparent objects.<n>Our method incorporates a parameter-aligned, domain-adaptive, and physically realistic Sim2Real simulation for efficient data generation.<n>Our experimental results demonstrate the model's exceptional Sim2Real generalizability in real-world scenarios.
arXiv Detail & Related papers (2024-09-13T15:44:38Z)
A Two-Stage Masked Autoencoder Based Network for Indoor Depth Completion [10.519644854849098]
We propose a two-step Transformer-based network for indoor depth completion. Our proposed network achieves the state-of-the-art performance on the Matterport3D dataset. In addition, to validate the importance of the depth completion task, we apply our methods to indoor 3D reconstruction.
arXiv Detail & Related papers (2024-06-14T07:42:27Z)
Beyond Visual Field of View: Perceiving 3D Environment with Echoes and Vision [51.385731364529306]
This paper focuses on perceiving and navigating 3D environments using echoes and RGB image. In particular, we perform depth estimation by fusing RGB image with echoes, received from multiple orientations. We show that the echoes provide holistic and in-expensive information about the 3D structures complementing the RGB image.
arXiv Detail & Related papers (2022-07-03T22:31:47Z)
High-Accuracy RGB-D Face Recognition via Segmentation-Aware Face Depth Estimation and Mask-Guided Attention Network [16.50097148165777]
Deep learning approaches have achieved highly accurate face recognition by training the models with very large face image datasets. Unlike the availability of large 2D face image datasets, there is a lack of large 3D face datasets available to the public. This paper proposes two CNN models to improve the RGB-D face recognition task.
arXiv Detail & Related papers (2021-12-22T07:46:23Z)
Facial Depth and Normal Estimation using Single Dual-Pixel Camera [81.02680586859105]
We introduce a DP-oriented Depth/Normal network that reconstructs the 3D facial geometry. It contains the corresponding ground-truth 3D models including depth map and surface normal in metric scale. It achieves state-of-the-art performances over recent DP-based depth/normal estimation methods.
arXiv Detail & Related papers (2021-11-25T05:59:27Z)
RGB2Hands: Real-Time Tracking of 3D Hand Interactions from Monocular RGB Video [76.86512780916827]
We present the first real-time method for motion capture of skeletal pose and 3D surface geometry of hands from a single RGB camera. In order to address the inherent depth ambiguities in RGB data, we propose a novel multi-task CNN. We experimentally verify the individual components of our RGB two-hand tracking and 3D reconstruction pipeline.
arXiv Detail & Related papers (2021-06-22T12:53:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.