Rethinking Transparent Object Grasping: Depth Completion with Monocular Depth Estimation and Instance Mask
- URL: http://arxiv.org/abs/2508.02507v1
- Date: Mon, 04 Aug 2025 15:14:47 GMT
- Title: Rethinking Transparent Object Grasping: Depth Completion with Monocular Depth Estimation and Instance Mask
- Authors: Yaofeng Cheng, Xinkai Gao, Sen Zhang, Chao Zeng, Fusheng Zha, Lining Sun, Chenguang Yang
- Abstract summary: ReMake is a novel depth completion framework guided by an instance mask and monocular depth estimation. Our method outperforms existing approaches on both benchmark datasets and real-world scenarios.
- Score: 10.472380465235629
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Due to their optical properties, transparent objects often lead depth cameras to generate incomplete or invalid depth data, which in turn reduces the accuracy and reliability of robotic grasping. Existing approaches typically feed the RGB-D image directly into a network that outputs the complete depth, expecting the model to implicitly infer the reliability of depth values. However, while effective on training datasets, such methods often fail to generalize to real-world scenarios, where complex light interactions lead to highly variable distributions of valid and invalid depth data. To address this, we propose ReMake, a novel depth completion framework guided by an instance mask and monocular depth estimation. By explicitly distinguishing transparent regions from non-transparent ones, the mask enables the model to concentrate on learning accurate depth estimation for these areas from RGB-D input during training. This targeted supervision reduces reliance on implicit reasoning and improves generalization to real-world scenarios. Additionally, monocular depth estimation provides depth context between the transparent object and its surroundings, enhancing depth prediction accuracy. Extensive experiments show that our method outperforms existing approaches on both benchmark datasets and real-world scenarios, demonstrating superior accuracy and generalization capability. Code and videos are available at https://chengyaofeng.github.io/ReMake.github.io/.
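The mask-guided formulation lends itself to a compact sketch: the network sees the RGB-D input, the transparency mask, and a monocular depth prior, and only its predictions inside the masked transparent region override trusted sensor depth. The PyTorch module below is a hypothetical minimal rendition; the layer sizes, channel layout, and fusion rule are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class MaskGuidedCompletion(nn.Module):
    """Minimal sketch: complete depth from RGB-D plus a transparency
    mask and a monocular depth prior (hypothetical layers)."""

    def __init__(self, feat: int = 32):
        super().__init__()
        # 7 input channels: RGB (3) + raw depth (1) + mask (1) + mono prior (1) + validity (1)
        self.net = nn.Sequential(
            nn.Conv2d(7, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, 1, 3, padding=1),
        )

    def forward(self, rgb, raw_depth, mask, mono_prior):
        valid = (raw_depth > 0).float()               # zeros mark invalid sensor readings
        x = torch.cat([rgb, raw_depth, mask, mono_prior, valid], dim=1)
        pred = self.net(x)
        keep = (1 - mask) * valid                     # trusted sensor pixels
        return keep * raw_depth + (1 - keep) * pred   # predict only where needed
```

Confining the learned output to masked or invalid pixels is what concentrates supervision on transparent regions and reduces the implicit reasoning the abstract argues against.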
Related papers
- DCIRNet: Depth Completion with Iterative Refinement for Dexterous Grasping of Transparent and Reflective Objects [9.235004977824026]
We propose DCIRNet, a novel multimodal depth completion network for transparent and reflective objects. Our approach incorporates an innovative multimodal feature fusion module designed to extract complementary information between RGB images and incomplete depth maps; a toy sketch of such a fusion block follows this entry. We achieve a 44% improvement in the grasp success rate for transparent and reflective objects.
arXiv Detail & Related papers (2025-06-11T08:04:22Z)
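The abstract does not detail DCIRNet's fusion module, but the general pattern of extracting complementary information from RGB and incomplete depth can be sketched as a gated blend; everything below (names, channel counts, gating form) is an assumption, not the paper's actual module.

```python
import torch
import torch.nn as nn

class GatedRGBDFusion(nn.Module):
    """Toy fusion block: each pixel mixes RGB and depth features with a
    learned gate, letting RGB fill in where depth is unreliable."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.rgb_enc = nn.Conv2d(3, channels, 3, padding=1)
        self.depth_enc = nn.Conv2d(1, channels, 3, padding=1)
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, rgb, depth):
        f_rgb, f_d = self.rgb_enc(rgb), self.depth_enc(depth)
        g = self.gate(torch.cat([f_rgb, f_d], dim=1))  # per-pixel mixing weights
        return g * f_rgb + (1 - g) * f_d               # complementary blend
```

Iterative refinement, in this reading, would amount to applying such a block repeatedly on the running depth estimate.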
- TransDiff: Diffusion-Based Method for Manipulating Transparent Objects Using a Single RGB-D Image [9.242427101416226]
We propose a single-view RGB-D-based depth completion framework, TransDiff, to achieve material-agnostic object grasping in desktop scenarios. We leverage features extracted from RGB images, including semantic segmentation, edge maps, and normal maps, to condition the depth map generation process. Our method learns an iterative denoising process that transforms a random depth distribution into a depth map, guided by initially refined depth information; a toy sampling loop follows this entry.
arXiv Detail & Related papers (2025-03-17T03:29:37Z)
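A hedged sketch of the described iterative denoising, written as a generic DDIM-style reverse loop: the `denoiser(x, t, cond)` interface, the noise schedule, and the fixed guidance weight toward the refined depth are all assumptions, not TransDiff's actual sampler.

```python
import torch

@torch.no_grad()
def sample_depth(denoiser, cond_feats, init_depth, steps: int = 50):
    """Toy reverse-diffusion loop: transform random noise into a depth
    map, conditioned on RGB-derived features (e.g. stacked segmentation,
    edge, and normal maps) and nudged toward an initial refined depth."""
    x = torch.randn_like(init_depth)             # start from a random depth field
    betas = torch.linspace(1e-4, 0.02, steps)
    abar = torch.cumprod(1.0 - betas, dim=0)     # cumulative signal schedule
    for t in reversed(range(steps)):
        eps = denoiser(x, t, cond_feats)         # predicted noise component
        x0 = (x - (1 - abar[t]).sqrt() * eps) / abar[t].sqrt()
        x0 = 0.9 * x0 + 0.1 * init_depth         # toy guidance toward refined depth
        if t > 0:                                # deterministic DDIM update
            x = abar[t - 1].sqrt() * x0 + (1 - abar[t - 1]).sqrt() * eps
        else:
            x = x0
    return x
```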
- Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion [51.69876947593144]
Existing methods for depth completion operate in tightly constrained settings. Inspired by advances in monocular depth estimation, we reframe depth completion as image-conditional depth map generation. Marigold-DC builds on a pretrained latent diffusion model for monocular depth estimation and injects the depth observations as test-time guidance; a guidance sketch follows this entry.
arXiv Detail & Related papers (2024-12-18T00:06:41Z)
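Test-time guidance of this kind is commonly realized as a gradient step that pulls the sampler's intermediate clean estimate toward the sparse depth observations; the function below caricatures that idea (the step size and loss are illustrative, not the paper's exact update).

```python
import torch

def guidance_step(x0_pred, sparse_depth, valid_mask, lr: float = 0.05):
    """One toy guidance update inside a diffusion sampling loop: move the
    current clean-depth estimate toward observed pixels only, leaving
    unobserved pixels to the pretrained diffusion prior."""
    x0 = x0_pred.detach().requires_grad_(True)
    loss = ((x0 - sparse_depth)[valid_mask.bool()] ** 2).mean()
    (grad,) = torch.autograd.grad(loss, x0)
    return (x0 - lr * grad).detach()
```

Because the observations enter only at sampling time, the pretrained model needs no retraining, which is what makes the approach zero-shot.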
- Transparent Object Depth Completion [11.825680661429825]
The perception of transparent objects for grasping and manipulation remains a major challenge.
Existing robotic grasping methods, which rely heavily on depth maps, are not suitable for transparent objects due to their unique visual properties.
We propose an end-to-end network for transparent object depth completion that combines the strengths of single-view RGB-D based depth completion and multi-view depth estimation.
arXiv Detail & Related papers (2024-05-24T07:38:06Z)
- Robust Depth Enhancement via Polarization Prompt Fusion Tuning [112.88371907047396]
We present a framework that leverages polarization imaging to improve inaccurate depth measurements from various depth sensors.
Our method first adopts a learning-based strategy in which a neural network is trained to estimate a dense, complete depth map from polarization data together with depth maps from different sensors.
To further improve performance, we propose a Polarization Prompt Fusion Tuning (PPFT) strategy to effectively utilize RGB-based models pre-trained on large-scale datasets; a prompt-fusion sketch follows this entry.
arXiv Detail & Related papers (2024-04-05T17:55:33Z)
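A minimal sketch of the prompt-fusion pattern, assuming a frozen pretrained RGB backbone and a small trainable polarization branch; the 4-channel polarization encoding, the additive fusion, and all names are illustrative assumptions rather than PPFT's actual design.

```python
import torch
import torch.nn as nn

class PolarizationPromptFusion(nn.Module):
    """Toy prompt fusion: keep the large RGB-pretrained backbone frozen
    and learn only a light branch that injects polarization cues."""

    def __init__(self, rgb_backbone: nn.Module, channels: int = 64):
        super().__init__()
        self.backbone = rgb_backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                    # reuse pretrained weights as-is
        self.prompt = nn.Sequential(                   # trainable polarization branch
            nn.Conv2d(4, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, rgb, polarization):
        feats = self.backbone(rgb)                     # assumed to return (B, C, H, W)
        return feats + self.prompt(polarization)       # additive "prompt" fusion
```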
- Confidence-Aware RGB-D Face Recognition via Virtual Depth Synthesis [48.59382455101753]
2D face recognition encounters challenges in unconstrained environments due to varying illumination, occlusion, and pose.
Recent studies focus on RGB-D face recognition to improve robustness by incorporating depth information.
In this work, we first construct a diverse depth dataset generated by 3D Morphable Models for depth model pre-training.
Then, we propose a domain-independent pre-training framework that utilizes readily available pre-trained RGB and depth models to separately perform face recognition without needing additional paired data for retraining.
arXiv Detail & Related papers (2024-03-11T09:12:24Z)
- Depth-Relative Self Attention for Monocular Depth Estimation [23.174459018407003]
Deep neural networks rely on various visual hints such as size, shade, and texture extracted from RGB information.
We propose a novel depth estimation model named RElative Depth Transformer (RED-T) that uses relative depth as guidance in self-attention; a toy attention bias follows this entry.
We show that the proposed model achieves competitive results on monocular depth estimation benchmarks and is less biased toward RGB information.
arXiv Detail & Related papers (2023-04-25T14:20:31Z)
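Using relative depth to guide self-attention can be sketched by biasing the attention logits with pairwise depth differences, so tokens at similar depths attend to each other more strongly; the penalty form below is an assumption, not RED-T's exact formulation.

```python
import torch
import torch.nn.functional as F

def depth_relative_attention(q, k, v, token_depth, bias_scale: float = 1.0):
    """Toy self-attention with a relative-depth bias.
    q, k, v: (batch, tokens, dim); token_depth: (batch, tokens)."""
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5               # scaled dot-product
    rel = (token_depth.unsqueeze(-1) - token_depth.unsqueeze(-2)).abs()
    logits = logits - bias_scale * rel                        # penalize large depth gaps
    return F.softmax(logits, dim=-1) @ v
```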
- Mind The Edge: Refining Depth Edges in Sparsely-Supervised Monocular Depth Estimation [42.19770683222846]
Monocular Depth Estimation (MDE) is a fundamental problem in computer vision with numerous applications.
In this paper we propose to learn to detect the location of depth edges from densely-supervised synthetic data.
We demonstrate significant gains in the accuracy of the depth edges with comparable per-pixel depth accuracy on several challenging datasets.
arXiv Detail & Related papers (2022-12-10T14:49:24Z)
- SC-DepthV3: Robust Self-supervised Monocular Depth Estimation for Dynamic Scenes [58.89295356901823]
Self-supervised monocular depth estimation has shown impressive results in static scenes.
However, it relies on the multi-view consistency assumption for training, which is violated in dynamic object regions.
We introduce an external pretrained monocular depth estimation model for generating a single-image depth prior.
Our model can predict sharp and accurate depth maps, even when training from monocular videos of highly-dynamic scenes.
arXiv Detail & Related papers (2022-11-07T16:17:47Z)
- Domain Randomization-Enhanced Depth Simulation and Restoration for Perceiving and Grasping Specular and Transparent Objects [28.84776177634971]
We propose a powerful RGBD fusion network, SwinDRNet, for depth restoration.
We also propose Domain Randomization-Enhanced Depth Simulation (DREDS) approach to simulate an active stereo depth system.
We show that our depth restoration effectively boosts the performance of downstream tasks.
arXiv Detail & Related papers (2022-08-07T19:17:16Z)
- Joint Learning of Salient Object Detection, Depth Estimation and Contour Extraction [91.43066633305662]
We propose a novel multi-task and multi-modal filtered transformer (MMFT) network for RGB-D salient object detection (SOD).
Specifically, we unify three complementary tasks: depth estimation, salient object detection, and contour estimation. The multi-task mechanism encourages the model to learn task-aware features from the auxiliary tasks; a toy joint loss follows this entry.
Experiments show that it not only significantly surpasses the depth-based RGB-D SOD methods on multiple datasets, but also precisely predicts a high-quality depth map and salient contour at the same time.
arXiv Detail & Related papers (2022-03-09T17:20:18Z)
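A toy version of a joint objective over the three unified tasks; the loss terms and weights are illustrative, since the abstract does not specify MMFT's actual losses or filtering mechanism.

```python
import torch.nn.functional as F

def multitask_loss(pred, target, w_depth: float = 0.5, w_contour: float = 0.5):
    """Toy joint loss: saliency and contour as binary maps, depth as
    regression; `pred` and `target` are dicts of same-shaped tensors."""
    l_sal = F.binary_cross_entropy_with_logits(pred["saliency"], target["saliency"])
    l_depth = F.l1_loss(pred["depth"], target["depth"])
    l_contour = F.binary_cross_entropy_with_logits(pred["contour"], target["contour"])
    return l_sal + w_depth * l_depth + w_contour * l_contour
```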
- Towards 3D Scene Reconstruction from Locally Scale-Aligned Monocular Video Depth [90.33296913575818]
In some video-based scenarios, such as video depth estimation and 3D scene reconstruction from a video, the unknown scale and shift residing in per-frame predictions may cause depth inconsistency.
We propose a locally weighted linear regression method to recover the scale and shift with very sparse anchor points; a minimal solver sketch follows this entry.
Our method can boost the performance of existing state-of-the-art approaches by up to 50% on several zero-shot benchmarks.
arXiv Detail & Related papers (2022-02-03T08:52:54Z)
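The locally weighted linear regression idea can be sketched directly: for each query location, fit a scale and shift mapping predicted depth to the metric anchor values, with Gaussian spatial weights so nearby anchors dominate. The NumPy solver below is a minimal sketch under those assumptions, not the paper's implementation.

```python
import numpy as np

def local_scale_shift(pred_at_anchor, gt_at_anchor, anchor_xy, query_xy, sigma=32.0):
    """For each query pixel, solve a spatially weighted least-squares fit
    gt ≈ s * pred + t over the sparse anchors. Returns per-query (s, t).
    pred_at_anchor, gt_at_anchor: (N,); anchor_xy: (N, 2); query_xy: (Q, 2)."""
    A = np.stack([pred_at_anchor, np.ones_like(pred_at_anchor)], axis=1)  # (N, 2)
    s_out, t_out = [], []
    for q in query_xy:
        w = np.exp(-((anchor_xy - q) ** 2).sum(-1) / (2 * sigma ** 2))    # Gaussian weights
        W = np.diag(w)
        # weighted normal equations: (A^T W A) [s, t]^T = A^T W y
        s, t = np.linalg.solve(A.T @ W @ A, A.T @ W @ gt_at_anchor)
        s_out.append(s); t_out.append(t)
    return np.array(s_out), np.array(t_out)
```

The scale-aligned depth at each query pixel is then `s * pred + t` with its locally fitted coefficients.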
- Accurate RGB-D Salient Object Detection via Collaborative Learning [101.82654054191443]
RGB-D saliency detection shows impressive ability in some challenging scenarios.
We propose a novel collaborative learning framework where edge, depth and saliency are leveraged in a more efficient way.
arXiv Detail & Related papers (2020-07-23T04:33:36Z)
- Occlusion-Aware Depth Estimation with Adaptive Normal Constraints [85.44842683936471]
We present a new learning-based method for multi-frame depth estimation from a color video.
Our method outperforms the state-of-the-art in terms of depth estimation accuracy.
arXiv Detail & Related papers (2020-04-02T07:10:45Z)