DidSee: Diffusion-Based Depth Completion for Material-Agnostic Robotic Perception and Manipulation
- URL: http://arxiv.org/abs/2506.21034v2
- Date: Fri, 27 Jun 2025 01:36:33 GMT
- Title: DidSee: Diffusion-Based Depth Completion for Material-Agnostic Robotic Perception and Manipulation
- Authors: Wenzhou Lyu, Jialing Lin, Wenqi Ren, Ruihao Xia, Feng Qian, Yang Tang
- Abstract summary: Commercial RGB-D cameras often produce noisy, incomplete depth maps for non-Lambertian objects. We propose DidSee, a diffusion-based framework for depth completion on non-Lambertian objects. DidSee achieves state-of-the-art performance on multiple benchmarks, demonstrates robust real-world generalization, and effectively improves downstream tasks.
- Score: 33.87636820220007
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Commercial RGB-D cameras often produce noisy, incomplete depth maps for non-Lambertian objects. Traditional depth completion methods struggle to generalize due to the limited diversity and scale of training data. Recent advances exploit visual priors from pre-trained text-to-image diffusion models to enhance generalization in dense prediction tasks. However, we find that biases arising from training-inference mismatches in the vanilla diffusion framework significantly impair depth completion performance. Additionally, the lack of distinct visual features in non-Lambertian regions further hinders precise prediction. To address these issues, we propose \textbf{DidSee}, a diffusion-based framework for depth completion on non-Lambertian objects. First, we integrate a rescaled noise scheduler enforcing a zero terminal signal-to-noise ratio to eliminate signal leakage bias. Second, we devise a noise-agnostic single-step training formulation to alleviate error accumulation caused by exposure bias and optimize the model with a task-specific loss. Finally, we incorporate a semantic enhancer that enables joint depth completion and semantic segmentation, distinguishing objects from backgrounds and yielding precise, fine-grained depth maps. DidSee achieves state-of-the-art performance on multiple benchmarks, demonstrates robust real-world generalization, and effectively improves downstream tasks such as category-level pose estimation and robotic grasping.
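The zero terminal SNR correction mentioned in the abstract is a known scheduler fix popularized by Lin et al. ("Common Diffusion Noise Schedules and Sample Steps Are Flawed"). Below is a minimal sketch of that rescaling; it is an illustrative reconstruction, not DidSee's released code, and assumes `betas` is the 1-D tensor of a standard DDPM schedule.

```python
import torch

def rescale_zero_terminal_snr(betas: torch.Tensor) -> torch.Tensor:
    """Rescale a beta schedule so that SNR(T) = 0, removing the signal
    leakage that otherwise remains at the final timestep (illustrative
    sketch of the fix referenced in the abstract, not DidSee's code)."""
    alphas_bar_sqrt = torch.cumprod(1.0 - betas, dim=0).sqrt()
    a0, aT = alphas_bar_sqrt[0].clone(), alphas_bar_sqrt[-1].clone()
    alphas_bar_sqrt -= aT                    # shift: last step gets zero SNR
    alphas_bar_sqrt *= a0 / (a0 - aT)        # rescale: first step unchanged
    alphas_bar = alphas_bar_sqrt ** 2
    alphas = torch.cat([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
    return 1.0 - alphas
```

With this schedule, the final timestep carries pure noise, so training and inference see the same input distribution at t = T.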
Related papers
- Depth Anything with Any Prior [64.39991799606146]
Prior Depth Anything is a framework that combines incomplete but precise metric information in depth measurement with relative but complete geometric structures in depth prediction. We develop a conditioned monocular depth estimation (MDE) model to refine the inherent noise of depth priors. Our model showcases impressive zero-shot generalization across depth completion, super-resolution, and inpainting over 7 real-world datasets.
arXiv Detail & Related papers (2025-05-15T17:59:50Z)
- TransDiff: Diffusion-Based Method for Manipulating Transparent Objects Using a Single RGB-D Image [9.242427101416226]
We propose a single-view RGB-D-based depth completion framework, TransDiff, to achieve material-agnostic object grasping in desktop scenarios. We leverage features extracted from RGB images, including semantic segmentation, edge maps, and normal maps, to condition the depth map generation process. Our method learns an iterative denoising process that transforms a random depth distribution into a depth map, guided by initially refined depth information.
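As a rough illustration of the iterative denoising loop described above, the sketch below runs a generic conditional DDPM-style sampler. `denoiser` is a hypothetical network taking the current noisy depth, a timestep, and conditioning features, and the schedule constants are generic defaults rather than TransDiff's actual ones.

```python
import torch

@torch.no_grad()
def sample_depth(denoiser, cond: torch.Tensor, T: int = 50) -> torch.Tensor:
    """Generic conditional DDPM sampling loop (illustrative only): start
    from Gaussian noise and iteratively denoise it into a depth map,
    guiding every step with image-derived conditioning features."""
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    abar = torch.cumprod(alphas, dim=0)
    d = torch.randn(cond.size(0), 1, cond.size(2), cond.size(3))
    for t in reversed(range(T)):
        eps = denoiser(d, t, cond)  # predicted noise at step t
        d = (d - betas[t] / (1.0 - abar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            d = d + betas[t].sqrt() * torch.randn_like(d)
    return d
```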
arXiv Detail & Related papers (2025-03-17T03:29:37Z)
- Zero-shot Depth Completion via Test-time Alignment with Affine-invariant Depth Prior [15.802986215292309]
We propose a zero-shot depth completion method composed of an affine-invariant depth diffusion model and test-time alignment. Our approach aligns the affine-invariant depth prior with metric-scale sparse measurements, enforcing them as hard constraints via an optimization loop at test time.
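The paper enforces the sparse metric measurements as hard constraints inside a test-time optimization loop; as a simpler point of reference, the sketch below shows the common closed-form variant that fits a global scale and shift to an affine-invariant prediction by least squares. The names and the least-squares formulation are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def align_scale_shift(pred: np.ndarray, sparse: np.ndarray,
                      valid: np.ndarray) -> np.ndarray:
    """Fit s, t minimizing ||s * pred + t - sparse||^2 over valid pixels,
    then map the whole affine-invariant prediction to metric scale.
    (Closed-form baseline; the paper instead uses hard-constrained
    test-time optimization.)"""
    A = np.stack([pred[valid], np.ones(valid.sum())], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, sparse[valid], rcond=None)
    return s * pred + t
```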
arXiv Detail & Related papers (2025-02-10T10:38:33Z)
- Revisiting Gradient-based Uncertainty for Monocular Depth Estimation [10.502852645001882]
We introduce gradient-based uncertainty estimation for monocular depth estimation models. We demonstrate that our approach is effective in determining the uncertainty without re-training. In particular, for models trained with monocular sequences and therefore most prone to uncertainty, our method outperforms related approaches.
arXiv Detail & Related papers (2025-02-09T17:21:41Z)
- Generative Edge Detection with Stable Diffusion [52.870631376660924]
Edge detection is typically viewed as a pixel-level classification problem mainly addressed by discriminative methods.
We propose a novel approach, named Generative Edge Detector (GED), by fully utilizing the potential of the pre-trained stable diffusion model.
We conduct extensive experiments on multiple datasets and achieve competitive performance.
arXiv Detail & Related papers (2024-10-04T01:52:23Z)
- What Matters When Repurposing Diffusion Models for General Dense Perception Tasks? [49.84679952948808]
Recent works show promising results by simply fine-tuning T2I diffusion models for dense perception tasks. We conduct a thorough investigation into critical factors that affect transfer efficiency and performance when using diffusion priors. Our work culminates in the development of GenPercept, an effective deterministic one-step fine-tuning paradigm tailored for dense visual perception tasks.
arXiv Detail & Related papers (2024-03-10T04:23:24Z)
- Metrically Scaled Monocular Depth Estimation through Sparse Priors for Underwater Robots [0.0]
We formulate a deep learning model that fuses sparse depth measurements from triangulated features to improve the depth predictions.
The network is trained in a supervised fashion on the forward-looking underwater dataset, FLSea.
The method achieves real-time performance, running at 160 FPS on a laptop GPU and 7 FPS on a single CPU core.
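One common way to fuse such sparse triangulated measurements, shown below as an assumed input-level scheme rather than the paper's exact architecture, is to rasterize them into a sparse depth channel plus a validity mask and concatenate both with the RGB image before the network.

```python
import numpy as np

def build_fused_input(rgb: np.ndarray, uv: np.ndarray,
                      z: np.ndarray) -> np.ndarray:
    """Rasterize sparse triangulated depths into an (H, W) channel plus a
    validity mask and stack them with RGB -> (H, W, 5) network input.
    Illustrative input-level fusion; the paper's scheme may differ.
    rgb: (H, W, 3) float image, uv: (N, 2) integer pixel coords, z: (N,)."""
    H, W, _ = rgb.shape
    depth = np.zeros((H, W), dtype=np.float32)
    mask = np.zeros((H, W), dtype=np.float32)
    depth[uv[:, 1], uv[:, 0]] = z          # row index = v, column = u
    mask[uv[:, 1], uv[:, 0]] = 1.0
    return np.concatenate([rgb, depth[..., None], mask[..., None]], axis=-1)
```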
arXiv Detail & Related papers (2023-10-25T16:32:31Z)
- Monocular Depth Estimation using Diffusion Models [39.27361388836347]
We introduce innovations to address problems arising due to noisy, incomplete depth maps in training data.
To cope with the limited availability of data for supervised training, we leverage pre-training on self-supervised image-to-image translation tasks.
Our DepthGen model achieves SOTA performance on the indoor NYU dataset, and near SOTA results on the outdoor KITTI dataset.
arXiv Detail & Related papers (2023-02-28T18:08:21Z)
- SC-DepthV3: Robust Self-supervised Monocular Depth Estimation for Dynamic Scenes [58.89295356901823]
Self-supervised monocular depth estimation has shown impressive results in static scenes.
It relies on the multi-view consistency assumption for training networks; however, this assumption is violated in dynamic object regions.
We introduce an external pretrained monocular depth estimation model for generating single-image depth prior.
Our model can predict sharp and accurate depth maps, even when training from monocular videos of highly-dynamic scenes.
arXiv Detail & Related papers (2022-11-07T16:17:47Z)
- Gradient-based Uncertainty for Monocular Depth Estimation [5.7575052885308455]
In monocular depth estimation, disturbances in the image context, like moving objects or reflecting materials, can easily lead to erroneous predictions.
We propose a post hoc uncertainty estimation approach for an already trained and thus fixed depth estimation model.
Our approach achieves state-of-the-art uncertainty estimation results on the KITTI and NYU Depth V2 benchmarks without the need to retrain the neural network.
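A rough sketch of the post hoc idea: with the trained model kept fixed, score each pixel by the magnitude of the gradient that an auxiliary self-consistency loss induces on the decoder's input features. Here the auxiliary reference is the prediction on the horizontally flipped image, and `encoder`/`decoder` returning a single feature tensor are illustrative assumptions, not the paper's exact interfaces.

```python
import torch
import torch.nn.functional as F

def gradient_uncertainty(encoder, decoder, image: torch.Tensor) -> torch.Tensor:
    """Post hoc, training-free uncertainty for a fixed depth network
    (illustrative sketch). The per-pixel uncertainty is the channel norm
    of the gradient that the flip-consistency discrepancy induces on the
    intermediate features."""
    feats = encoder(image).detach().requires_grad_(True)  # leaf w/ grad
    depth = decoder(feats)
    with torch.no_grad():  # pseudo reference: depth of the flipped image
        ref = torch.flip(decoder(encoder(torch.flip(image, dims=[3]))), dims=[3])
    F.l1_loss(depth, ref).backward()
    return feats.grad.norm(dim=1, keepdim=True)           # (B, 1, h, w)
```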
arXiv Detail & Related papers (2022-08-03T12:21:02Z)
- Robust Depth Completion with Uncertainty-Driven Loss Functions [60.9237639890582]
We introduce uncertainty-driven loss functions to improve the robustness of depth completion and to handle its inherent uncertainty.
Our method has been tested on the KITTI Depth Completion Benchmark and achieved state-of-the-art robustness in terms of the MAE, IMAE, and IRMSE metrics.
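As one standard instance of an uncertainty-driven loss (a Laplacian negative log-likelihood with a learned per-pixel scale; the paper's exact formulations may differ), the sketch below has the network predict log b alongside depth: the loss attenuates the L1 residual where b is large, while the +log b term penalizes blanket overconfidence in b.

```python
import torch

def laplacian_nll_loss(pred: torch.Tensor, log_b: torch.Tensor,
                       gt: torch.Tensor, valid: torch.Tensor) -> torch.Tensor:
    """Uncertainty-driven depth loss (illustrative): per-pixel Laplacian
    NLL  |pred - gt| / b + log b,  with b = exp(log_b) predicted by the
    network. Large b downweights unreliable pixels; the +log b term
    stops the network from inflating b everywhere."""
    nll = (pred - gt).abs() / log_b.exp() + log_b
    return nll[valid].mean()
```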
arXiv Detail & Related papers (2021-12-15T05:22:34Z)
- Object-aware Monocular Depth Prediction with Instance Convolutions [72.98771405534937]
We propose a novel convolutional operator which is explicitly tailored to avoid feature aggregation across object boundaries.
Our method is based on estimating per-part depth values by means of superpixels.
Our evaluation on the NYUv2 and iBims datasets clearly demonstrates the superiority of Instance Convolutions.
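The sketch below conveys the general flavor of such an operator: a masked 3x3 convolution that aggregates only neighbors sharing the center pixel's superpixel id, renormalized by the count of in-segment taps. It is a from-scratch illustration under those assumptions, not the authors' Instance Convolution.

```python
import torch
import torch.nn.functional as F

def instance_conv3x3(x, seg, weight, bias=None):
    """Masked 3x3 convolution (illustrative): each output pixel aggregates
    only neighbors whose superpixel id matches its own, renormalized by
    the number of valid taps.
    x: (B, C, H, W) features, seg: (B, 1, H, W) integer superpixel ids,
    weight: (O, C, 3, 3), bias: (O,) or None."""
    B, C, H, W = x.shape
    sid = seg.float() + 1.0                      # +1: zero padding never matches
    neigh = F.unfold(sid, kernel_size=3, padding=1)        # (B, 9, H*W)
    valid = (neigh == sid.view(B, 1, H * W)).float()       # in-segment taps
    patches = F.unfold(x, kernel_size=3, padding=1)        # (B, C*9, H*W)
    patches = patches.view(B, C, 9, H * W) * valid.unsqueeze(1)
    # renormalize so masked sums behave like averages over valid taps
    patches = patches * (9.0 / valid.sum(1, keepdim=True).clamp(min=1.0)).unsqueeze(1)
    out = torch.einsum('ock,bckn->bon', weight.view(-1, C, 9), patches)
    if bias is not None:
        out = out + bias.view(1, -1, 1)
    return out.view(B, -1, H, W)
```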
arXiv Detail & Related papers (2021-12-02T18:59:48Z)
- Adversarial Semantic Data Augmentation for Human Pose Estimation [96.75411357541438]
We propose Semantic Data Augmentation (SDA), a method that augments images by pasting segmented body parts with various semantic granularity.
We also propose Adversarial Semantic Data Augmentation (ASDA), which exploits a generative network to dynamically predict tailored pasting configurations.
State-of-the-art results are achieved on challenging benchmarks.
arXiv Detail & Related papers (2020-08-03T07:56:04Z)
- Occlusion-Aware Depth Estimation with Adaptive Normal Constraints [85.44842683936471]
We present a new learning-based method for multi-frame depth estimation from a color video.
Our method outperforms the state-of-the-art in terms of depth estimation accuracy.
arXiv Detail & Related papers (2020-04-02T07:10:45Z)