DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation
- URL: http://arxiv.org/abs/2507.01603v2
- Date: Thu, 07 Aug 2025 02:10:50 GMT
- Title: DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation
- Authors: Yue-Jiang Dong, Wang Zhao, Jiale Xu, Ying Shan, Song-Hai Zhang
- Abstract summary: We propose DepthSync, a training-free framework using diffusion guidance to achieve scale- and geometry-consistent depth predictions for long videos. Specifically, we introduce scale guidance to synchronize the depth scale across windows and geometry guidance to enforce geometric alignment within windows. Experiments on various datasets validate the effectiveness of our method in producing depth estimates with improved scale and geometry consistency, particularly for long videos.
- Score: 45.8790174686242
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion-based video depth estimation methods have achieved remarkable success with strong generalization ability. However, predicting depth for long videos remains challenging. Existing methods typically split videos into overlapping sliding windows, leading to accumulated scale discrepancies across different windows, particularly as the number of windows increases. Additionally, these methods rely solely on 2D diffusion priors, overlooking the inherent 3D geometric structure of video depths, which results in geometrically inconsistent predictions. In this paper, we propose DepthSync, a novel, training-free framework using diffusion guidance to achieve scale- and geometry-consistent depth predictions for long videos. Specifically, we introduce scale guidance to synchronize the depth scale across windows and geometry guidance to enforce geometric alignment within windows based on the inherent 3D constraints in video depths. These two terms work synergistically, steering the denoising process toward consistent depth predictions. Experiments on various datasets validate the effectiveness of our method in producing depth estimates with improved scale and geometry consistency, particularly for long videos.
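The scale-synchronization idea from the abstract can be pictured with a small sketch: overlapping windows are chain-aligned by a least-squares scale-and-shift fit on their shared frames. Note that DepthSync applies this as guidance inside the denoising loop rather than as the post-hoc alignment shown here; the function names and window layout below are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: chain-align overlapping depth windows with a
# least-squares scale/shift fit on the overlapping frames.
# This is a post-hoc simplification of the paper's diffusion guidance.
import numpy as np

def fit_scale_shift(src: np.ndarray, ref: np.ndarray):
    """Least-squares (s, t) such that s * src + t best matches ref."""
    A = np.stack([src.ravel(), np.ones(src.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, ref.ravel(), rcond=None)
    return s, t

def synchronize_windows(windows, overlap: int):
    """Chain-align a list of (T, H, W) depth windows on their overlaps."""
    aligned = [windows[0]]
    for cur in windows[1:]:
        prev = aligned[-1]
        # Fit the new window's overlapping frames to the previous window.
        s, t = fit_scale_shift(cur[:overlap], prev[-overlap:])
        aligned.append(s * cur + t)
    return aligned
```

Chaining alignments this way is exactly where the accumulated scale drift the abstract describes comes from, which is why the paper couples it with geometry guidance rather than relying on pairwise fits alone.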
Related papers
- GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors [47.21120442961684]
We propose GeometryCrafter, a novel framework that recovers high-fidelity point map sequences with temporal coherence from open-world videos. We show that GeometryCrafter achieves state-of-the-art 3D accuracy, temporal consistency, and generalization capability.
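For readers unfamiliar with the term, a "point map" assigns a 3D point to every pixel. A minimal sketch of back-projecting a depth map through assumed pinhole intrinsics (background for the summary above, not GeometryCrafter's actual code):

```python
# Back-project an (H, W) depth map into an (H, W, 3) camera-space point map
# using assumed pinhole intrinsics (fx, fy, cx, cy).
import numpy as np

def depth_to_point_map(depth: np.ndarray, fx: float, fy: float,
                       cx: float, cy: float) -> np.ndarray:
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)
```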
arXiv Detail & Related papers (2025-04-01T17:58:03Z)
- Align3R: Aligned Monocular Depth Estimation for Dynamic Videos [50.28715151619659]
We propose a novel video-depth estimation method called Align3R to estimate temporally consistent depth maps for a dynamic video. Our key idea is to utilize the recent DUSt3R model to align estimated monocular depth maps of different timesteps. Experiments demonstrate that Align3R estimates consistent video depth and camera poses for a monocular video with superior performance over baseline methods.
arXiv Detail & Related papers (2024-12-04T07:09:59Z)
- Learning Temporally Consistent Video Depth from Video Diffusion Priors [62.36887303063542]
This work addresses the challenge of streamed video depth estimation. We argue that sharing contextual information between frames or clips is pivotal in fostering temporal consistency. We propose a consistent context-aware training and inference strategy for arbitrarily long videos to provide cross-clip context.
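A rough sketch of cross-clip context inference as described: each new clip is processed together with the last few frames of the previous clip so that context crosses clip boundaries. The `model` callable and its `context_depth` keyword are hypothetical stand-ins for the paper's actual interface.

```python
# Sliding-clip inference over a long video, re-feeding the last `ctx`
# frames of the previous clip as context. Interface is hypothetical.
def infer_long_video(model, frames, clip_len=16, ctx=4):
    depths = list(model(frames[:clip_len]))
    i = clip_len
    while i < len(frames):
        clip = frames[i - ctx:i + clip_len - ctx]
        out = model(clip, context_depth=depths[-ctx:])
        depths.extend(out[ctx:])  # drop the re-processed context frames
        i += clip_len - ctx
    return depths
```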
arXiv Detail & Related papers (2024-06-03T16:20:24Z)
- Self-Supervised Depth Completion Guided by 3D Perception and Geometry Consistency [17.68427514090938]
This paper explores the utilization of 3D perceptual features and multi-view geometry consistency to devise a high-precision self-supervised depth completion method.
Experiments on the NYU-Depthv2 and VOID benchmark datasets demonstrate that the proposed model achieves state-of-the-art depth completion performance.
arXiv Detail & Related papers (2023-12-23T14:19:56Z)
- Constraining Depth Map Geometry for Multi-View Stereo: A Dual-Depth Approach with Saddle-shaped Depth Cells [23.345139129458122]
We show that different depth geometries exhibit significant performance gaps, even under the same depth prediction error.
We introduce an ideal depth geometry composed of Saddle-Shaped Cells, in which the predicted depth map oscillates upward and downward around the ground-truth surface.
Our method also points to a new research direction for considering depth geometry in MVS.
arXiv Detail & Related papers (2023-07-18T11:37:53Z)
- NVDS+: Towards Efficient and Versatile Neural Stabilizer for Video Depth Estimation [58.21817572577012]
Video depth estimation aims to infer temporally consistent depth.
We introduce NVDS+ that stabilizes inconsistent depth estimated by various single-image models in a plug-and-play manner.
We also present the large-scale Video Depth in the Wild dataset, which contains 14,203 videos with over two million frames.
arXiv Detail & Related papers (2023-07-17T17:57:01Z)
- Virtual Normal: Enforcing Geometric Constraints for Accurate and Robust Depth Prediction [87.08227378010874]
We show the importance of high-order 3D geometric constraints for depth prediction.
By designing a loss term that enforces a simple geometric constraint, we significantly improve the accuracy and robustness of monocular depth estimation.
We show state-of-the-art results of learning metric depth on NYU Depth-V2 and KITTI.
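The loss term referenced here is the virtual-normal constraint: sample pixel triplets, back-project them with predicted and ground-truth depth, and compare the normals of the resulting virtual planes. A simplified sketch under those assumptions, omitting the paper's degenerate-triplet filtering:

```python
# Virtual-normal-style loss over back-projected point clouds.
# pred_pts / gt_pts: (P, 3) points obtained from predicted / GT depth.
import numpy as np

def plane_normals(points: np.ndarray) -> np.ndarray:
    """(N, 3, 3) point triplets -> (N, 3) unit plane normals."""
    n = np.cross(points[:, 1] - points[:, 0], points[:, 2] - points[:, 0])
    return n / (np.linalg.norm(n, axis=1, keepdims=True) + 1e-8)

def virtual_normal_loss(pred_pts, gt_pts, n_triplets=1000, rng=None):
    rng = rng or np.random.default_rng(0)
    idx = rng.integers(0, len(pred_pts), size=(n_triplets, 3))
    # L1 distance between normals of predicted and ground-truth planes.
    return np.abs(plane_normals(pred_pts[idx]) -
                  plane_normals(gt_pts[idx])).sum(axis=1).mean()
```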
arXiv Detail & Related papers (2021-03-07T00:08:21Z)
- Consistent Video Depth Estimation [57.712779457632024]
We present an algorithm for reconstructing dense, geometrically consistent depth for all pixels in a monocular video.
We leverage a conventional structure-from-motion reconstruction to establish geometric constraints on pixels in the video.
Our algorithm is able to handle challenging hand-held captured input videos with a moderate degree of dynamic motion.
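The geometric constraints here amount to reprojection consistency: a pixel's depth in one frame, warped through the structure-from-motion pose, should agree with the depth predicted in another frame. A minimal single-pixel sketch, assuming known intrinsics K and relative pose (R, t):

```python
# Transfer a pixel and its depth from frame i into frame j via a known
# relative pose; comparing the returned depth with frame j's prediction
# gives a geometric-consistency residual.
import numpy as np

def reproject(pixel, depth_i, K, R, t):
    """(u, v) pixel with depth in frame i -> (pixel, depth) in frame j."""
    uv1 = np.array([pixel[0], pixel[1], 1.0])
    p_i = depth_i * (np.linalg.inv(K) @ uv1)   # back-project to 3D
    p_j = R @ p_i + t                          # move into frame j's camera
    uv_j = K @ p_j                             # project back to the image
    return uv_j[:2] / uv_j[2], p_j[2]
```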
arXiv Detail & Related papers (2020-04-30T17:59:26Z)
- DiverseDepth: Affine-invariant Depth Prediction Using Diverse Data [110.29043712400912]
We present a method for depth estimation with monocular images, which can predict high-quality depth on diverse scenes up to an affine transformation.
Experiments show that our method outperforms previous methods on 8 datasets by a large margin under the zero-shot test setting.
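"Up to an affine transformation" means the prediction is only determined up to an unknown scale and shift, so evaluation first aligns it to ground truth. A common recipe (not necessarily this paper's exact protocol):

```python
# Align an affine-invariant depth prediction to metric ground truth with a
# least-squares scale/shift, then compute absolute relative error.
# Assumes ground-truth depths are valid (positive).
import numpy as np

def abs_rel_after_affine_alignment(pred: np.ndarray, gt: np.ndarray) -> float:
    s, t = np.polyfit(pred.ravel(), gt.ravel(), 1)  # slope, intercept
    aligned = s * pred + t
    return float(np.mean(np.abs(aligned - gt) / gt))
```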
arXiv Detail & Related papers (2020-02-03T05:38:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.