Jasmine: Harnessing Diffusion Prior for Self-supervised Depth Estimation
- URL: http://arxiv.org/abs/2503.15905v1
- Date: Thu, 20 Mar 2025 07:15:49 GMT
- Title: Jasmine: Harnessing Diffusion Prior for Self-supervised Depth Estimation
- Authors: Jiyuan Wang, Chunyu Lin, Cheng Guan, Lang Nie, Jing He, Haodong Li, Kang Liao, Yao Zhao
- Abstract summary: Jasmine is a Stable Diffusion-based self-supervised framework for monocular depth estimation. It harnesses SD's visual priors to enhance the sharpness and generalization of unsupervised prediction. It achieves SoTA performance on the KITTI benchmark and exhibits superior zero-shot generalization across multiple datasets.
- Score: 55.501710766726234
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose Jasmine, the first Stable Diffusion (SD)-based self-supervised framework for monocular depth estimation, which effectively harnesses SD's visual priors to enhance the sharpness and generalization of unsupervised prediction. Previous SD-based methods are all supervised since adapting diffusion models for dense prediction requires high-precision supervision. In contrast, self-supervised reprojection suffers from inherent challenges (e.g., occlusions, texture-less regions, illumination variance), and the predictions exhibit blurs and artifacts that severely compromise SD's latent priors. To resolve this, we construct a novel surrogate task of hybrid image reconstruction. Without any additional supervision, it preserves the detail priors of SD models by reconstructing the images themselves while preventing depth estimation from degradation. Furthermore, to address the inherent misalignment between SD's scale and shift invariant estimation and self-supervised scale-invariant depth estimation, we build the Scale-Shift GRU. It not only bridges this distribution gap but also isolates the fine-grained texture of SD output against the interference of reprojection loss. Extensive experiments demonstrate that Jasmine achieves SoTA performance on the KITTI benchmark and exhibits superior zero-shot generalization across multiple datasets.
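The gap the abstract describes between scale-and-shift-invariant and scale-invariant depth can be illustrated with a small sketch. This is not the paper's Scale-Shift GRU, which is a learned module; it is only a closed-form least-squares alignment on hypothetical toy values, and the function names (`fit_scale_shift`, `fit_scale_only`, `rmse`) are made up for illustration. It shows that a prediction that is correct only up to an unknown scale and shift cannot be matched to the target by a scale factor alone.

```python
from statistics import fmean


def fit_scale_shift(pred, target):
    """Least-squares (s, t) minimising sum((s * pred + t - target) ** 2)."""
    mean_p, mean_t = fmean(pred), fmean(target)
    var = sum((p - mean_p) ** 2 for p in pred)
    if var == 0:
        raise ValueError("prediction is constant; scale is undetermined")
    cov = sum((p - mean_p) * (q - mean_t) for p, q in zip(pred, target))
    s = cov / var
    return s, mean_t - s * mean_p


def fit_scale_only(pred, target):
    """Least-squares k minimising sum((k * pred - target) ** 2)."""
    denom = sum(p * p for p in pred)
    if denom == 0:
        raise ValueError("prediction is all zeros; scale is undetermined")
    return sum(p * q for p, q in zip(pred, target)) / denom


def rmse(a, b):
    """Root-mean-square error between two equally long sequences."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5


# Hypothetical toy data: the target depth and a prediction that is
# correct only up to an unknown scale and shift (target = 4 * pred + 1).
target = [2.0, 4.0, 6.0, 8.0]
pred = [(d - 1.0) / 4.0 for d in target]

s, t = fit_scale_shift(pred, target)
affine_aligned = [s * p + t for p in pred]

k = fit_scale_only(pred, target)
scale_aligned = [k * p for p in pred]

print("scale and shift:", s, t, "rmse:", rmse(affine_aligned, target))
print("scale only:     ", k, "rmse:", rmse(scale_aligned, target))
```

With both scale and shift free, the toy prediction aligns to the target exactly; with scale alone a residual remains. That residual is the kind of mismatch that has to be absorbed somewhere when a scale-and-shift-invariant output is trained with a scale-invariant reprojection objective.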
Related papers
- Occlusion-Aware Self-Supervised Monocular Depth Estimation for Weak-Texture Endoscopic Images [1.1084686909647639]
We propose a self-supervised monocular depth estimation network tailored for endoscopic scenes.
Existing methods, though accurate, typically assume consistent illumination.
Variations in illumination lead to incorrect geometric interpretations and unreliable self-supervised signals.
arXiv Detail & Related papers (2025-04-24T14:12:57Z)
- How to Use Diffusion Priors under Sparse Views? [29.738350228085928]
Inline Prior Guided Score Matching is proposed to provide visual supervision over sparse views in 3D reconstruction. We show that our method achieves state-of-the-art reconstruction quality.
arXiv Detail & Related papers (2024-12-03T07:31:54Z)
- GroCo: Ground Constraint for Metric Self-Supervised Monocular Depth [2.805351469151152]
We propose a novel constraint on ground areas designed specifically for the self-supervised paradigm.
This mechanism not only allows the scale to be recovered accurately but also ensures coherence between the depth prediction and the ground prior.
arXiv Detail & Related papers (2024-09-23T09:30:27Z)
- Unsupervised Monocular Depth Estimation Based on Hierarchical Feature-Guided Diffusion [21.939618694037108]
Unsupervised monocular depth estimation has received widespread attention because of its capability to train without ground truth.
We employ a well-converging diffusion model among generative networks for unsupervised monocular depth estimation.
This significantly enriches the model's capacity for learning and interpreting depth distribution.
arXiv Detail & Related papers (2024-06-14T07:31:20Z)
- Uncertainty-guided Optimal Transport in Depth Supervised Sparse-View 3D Gaussian [49.21866794516328]
3D Gaussian splatting has demonstrated impressive performance in real-time novel view synthesis.
Previous approaches have incorporated depth supervision into the training of 3D Gaussians to mitigate overfitting.
We introduce a novel method to supervise the depth distribution of 3D Gaussians, utilizing depth priors with integrated uncertainty estimates.
arXiv Detail & Related papers (2024-05-30T03:18:30Z)
- Exploiting Diffusion Prior for Generalizable Dense Prediction [85.4563592053464]
Recent advanced Text-to-Image (T2I) diffusion models are sometimes too imaginative for existing off-the-shelf dense predictors to estimate.
We introduce DMP, a pipeline utilizing pre-trained T2I models as a prior for dense prediction tasks.
Despite limited-domain training data, the approach yields faithful estimations for arbitrary images, surpassing existing state-of-the-art algorithms.
arXiv Detail & Related papers (2023-11-30T18:59:44Z)
- FG-Depth: Flow-Guided Unsupervised Monocular Depth Estimation [17.572459787107427]
We propose a flow distillation loss to replace the typical photometric loss and a prior flow based mask to remove invalid pixels.
Our approach achieves state-of-the-art results on both KITTI and NYU-Depth-v2 datasets.
arXiv Detail & Related papers (2023-01-20T04:02:13Z)
- ShadowDiffusion: When Degradation Prior Meets Diffusion Model for Shadow Removal [74.86415440438051]
We propose a unified diffusion framework that integrates both the image and degradation priors for highly effective shadow removal.
Our model achieves a significant improvement in PSNR on the SRD dataset, increasing from 31.69 dB to 34.73 dB.
arXiv Detail & Related papers (2022-12-09T07:48:30Z)
- Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose Estimation [70.32536356351706]
We introduce MRP-Net that constitutes a common deep network backbone with two output heads subscribing to two diverse configurations.
We derive suitable measures to quantify prediction uncertainty at both pose and joint level.
We present a comprehensive evaluation of the proposed approach and demonstrate state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2022-03-29T07:14:58Z)
- A high-precision self-supervised monocular visual odometry in foggy weather based on robust cycled generative adversarial networks and multi-task learning aided depth estimation [0.0]
This paper proposes a high-precision self-supervised monocular VO, which is specifically designed for navigation in foggy weather.
A cycled generative adversarial network is designed to obtain a high-quality self-supervised loss by forcing the forward and backward half-cycles to output consistent estimates.
Gradient-based loss and perceptual loss are introduced to eliminate the interference of complex photometric changes on the self-supervised loss in foggy weather.
arXiv Detail & Related papers (2022-03-09T15:41:57Z)
- Adaptive confidence thresholding for monocular depth estimation [83.06265443599521]
We propose a new approach to leverage pseudo ground truth depth maps of stereo images generated from self-supervised stereo matching methods.
The confidence map of the pseudo ground truth depth map is estimated to mitigate performance degeneration by inaccurate pseudo depth maps.
Experimental results demonstrate superior performance to state-of-the-art monocular depth estimation methods.
arXiv Detail & Related papers (2020-09-27T13:26:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.