StereoDiff: Stereo-Diffusion Synergy for Video Depth Estimation
- URL: http://arxiv.org/abs/2506.20756v2
- Date: Mon, 30 Jun 2025 06:41:41 GMT
- Title: StereoDiff: Stereo-Diffusion Synergy for Video Depth Estimation
- Authors: Haodong Li, Chen Wang, Jiahui Lei, Kostas Daniilidis, Lingjie Liu
- Abstract summary: We argue that video depth estimation is not a naive extension of image depth estimation. We introduce StereoDiff, a two-stage video depth estimator that synergizes stereo matching, mainly for static areas, with video depth diffusion.
- Score: 50.105716381966516
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent video depth estimation methods achieve great performance by following the paradigm of image depth estimation, i.e., typically fine-tuning pre-trained video diffusion models with massive data. However, we argue that video depth estimation is not a naive extension of image depth estimation. The temporal consistency requirements for dynamic and static regions in videos are fundamentally different. Consistent video depth in static regions, typically backgrounds, can be achieved more effectively via stereo matching across all frames, which provides much stronger global 3D cues. The consistency of dynamic regions, by contrast, violates triangulation constraints and must still be learned from large-scale video depth data to ensure smooth transitions. Based on these insights, we introduce StereoDiff, a two-stage video depth estimator that synergizes stereo matching, mainly for static areas, with video depth diffusion, which maintains consistent depth transitions in dynamic areas. We mathematically demonstrate, through frequency-domain analysis, how stereo matching and video depth diffusion offer complementary strengths, highlighting the effectiveness of their synergy in capturing the advantages of both. Experimental results on zero-shot, real-world, dynamic video depth benchmarks, both indoor and outdoor, demonstrate StereoDiff's SoTA performance, showcasing its superior consistency and accuracy in video depth estimation.
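The frequency-domain complementarity claimed in the abstract can be illustrated with a toy fusion: take the low temporal frequencies of a depth sequence from a stereo-matching result (strong global 3D cues for static content) and the high temporal frequencies from a diffusion result (smooth transitions for dynamic content). This is a minimal NumPy sketch under those assumptions, not the published StereoDiff algorithm; the function name, shapes, and cutoff are illustrative.

```python
import numpy as np

def frequency_blend(stereo_depth, diffusion_depth, cutoff=0.1):
    """Toy fusion of two depth sequences of shape (T, H, W):
    low temporal frequencies come from the stereo branch,
    high temporal frequencies from the diffusion branch."""
    T = stereo_depth.shape[0]
    freqs = np.fft.fftfreq(T)          # normalized per-frame temporal frequencies
    low = np.abs(freqs) <= cutoff      # boolean mask selecting low frequencies
    s_hat = np.fft.fft(stereo_depth, axis=0)
    d_hat = np.fft.fft(diffusion_depth, axis=0)
    # pick each temporal frequency bin from one branch or the other
    fused_hat = np.where(low[:, None, None], s_hat, d_hat)
    return np.fft.ifft(fused_hat, axis=0).real

# toy sequences: T frames of H x W depth
T, H, W = 16, 4, 4
stereo = np.ones((T, H, W))            # globally consistent static-scene depth
diffusion = np.ones((T, H, W)) + 0.01 * np.sin(
    np.arange(T, dtype=float))[:, None, None]  # small temporal variation
fused = frequency_blend(stereo, diffusion)
```

The fused sequence inherits its slowly varying (including mean) depth from the stereo branch while retaining the diffusion branch's fast temporal variation, which is the qualitative behavior the paper's analysis attributes to the two-stage design.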
Related papers
- Align3R: Aligned Monocular Depth Estimation for Dynamic Videos [50.28715151619659]
We propose a novel video-depth estimation method called Align3R to estimate temporally consistent depth maps for a dynamic video. Our key idea is to utilize the recent DUSt3R model to align estimated monocular depth maps of different timesteps. Experiments demonstrate that Align3R estimates consistent video depth and camera poses for a monocular video with superior performance over baseline methods.
arXiv Detail & Related papers (2024-12-04T07:09:59Z)
- Video Depth without Video Models [34.11454612504574]
Video depth estimation lifts monocular video clips to 3D by inferring dense depth at every frame. We show how to turn a single-image latent diffusion model (LDM) into a state-of-the-art video depth estimator. Our model, which we call RollingDepth, has two main ingredients: (i) a multi-frame depth estimator that is derived from a single-image LDM and maps very short video snippets to depth snippets.
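The snippet-based design described above can be sketched as a toy rolling-window scheme: run a short-window predictor over overlapping snippets and average the overlapping per-frame predictions. This is a hedged illustration only; `predict_snippet` is a hypothetical stand-in for the LDM-based multi-frame estimator, and real systems also need per-snippet scale/shift alignment.

```python
import numpy as np

def rolling_depth(frames, predict_snippet, win=3, stride=1):
    """Toy rolling-window video depth: average overlapping
    per-frame predictions from a short-snippet estimator."""
    T = len(frames)
    acc = np.zeros(T)   # accumulated depth predictions per frame
    cnt = np.zeros(T)   # number of snippets covering each frame
    for s in range(0, T - win + 1, stride):
        pred = predict_snippet(frames[s:s + win])  # one depth value per frame
        acc[s:s + win] += pred
        cnt[s:s + win] += 1
    return acc / cnt    # every frame is covered at least once when T >= win

frames = np.arange(10, dtype=float)               # stand-in "frames"
depth = rolling_depth(frames, lambda f: f * 2.0)  # toy predictor: depth = 2 * frame
```

With a deterministic toy predictor the overlapping predictions agree, so averaging returns them unchanged; with a noisy predictor the same averaging suppresses per-snippet jitter, which is the point of the overlap.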
arXiv Detail & Related papers (2024-11-28T14:50:14Z)
- Helvipad: A Real-World Dataset for Omnidirectional Stereo Depth Estimation [83.841877607646]
We introduce Helvipad, a real-world dataset for omnidirectional stereo depth estimation. The dataset includes accurate depth and disparity labels obtained by projecting 3D point clouds onto equirectangular images. We benchmark leading stereo depth estimation models on both standard and omnidirectional images.
arXiv Detail & Related papers (2024-11-27T13:34:41Z)
- NVDS+: Towards Efficient and Versatile Neural Stabilizer for Video Depth Estimation [58.21817572577012]
Video depth estimation aims to infer temporally consistent depth.
We introduce NVDS+ that stabilizes inconsistent depth estimated by various single-image models in a plug-and-play manner.
We also elaborate a large-scale Video Depth in the Wild dataset, which contains 14,203 videos with over two million frames.
arXiv Detail & Related papers (2023-07-17T17:57:01Z)
- Edge-aware Consistent Stereo Video Depth Estimation [3.611754783778107]
We propose a consistent method for dense video depth estimation.
Unlike the existing monocular methods, ours relates to stereo videos.
We show that our edge-aware stereo video model can accurately estimate the dense depth maps.
arXiv Detail & Related papers (2023-05-04T08:30:04Z)
- DynamicStereo: Consistent Dynamic Depth from Stereo Videos [91.1804971397608]
We propose DynamicStereo to estimate disparity for stereo videos.
The network learns to pool information from neighboring frames to improve the temporal consistency of its predictions.
We also introduce Dynamic Replica, a new benchmark dataset containing synthetic videos of people and animals in scanned environments.
arXiv Detail & Related papers (2023-05-03T17:40:49Z)
- SMD-Nets: Stereo Mixture Density Networks [68.56947049719936]
We propose Stereo Mixture Density Networks (SMD-Nets), a simple yet effective learning framework compatible with a wide class of 2D and 3D architectures.
Specifically, we exploit bimodal mixture densities as output representation and show that this allows for sharp and precise disparity estimates near discontinuities.
We carry out comprehensive experiments on a new high-resolution and highly realistic synthetic stereo dataset, consisting of stereo pairs at 8Mpx resolution, as well as on real-world stereo datasets.
arXiv Detail & Related papers (2021-04-08T16:15:46Z)
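The bimodal-mixture output described in the SMD-Nets abstract can be illustrated with a toy decision rule: represent disparity at each pixel as two candidate modes with a mixing weight, then output the dominant mode instead of the mixture mean. This keeps discontinuities sharp, because the mean of a bimodal mixture lands between the modes. A minimal sketch with illustrative names and shapes, not the paper's actual network head:

```python
import numpy as np

def bimodal_disparity(mu1, mu2, pi1):
    """Toy bimodal-mixture readout: per pixel, return the mode with
    the larger mixture weight (pi1 vs. 1 - pi1) rather than the mean,
    avoiding blurred disparities across depth edges."""
    return np.where(pi1 >= 1.0 - pi1, mu1, mu2)

# two pixels straddling a depth edge: near surface at 1.0, far at 10.0;
# a mean readout would blur toward the middle, the dominant mode stays sharp
mu1 = np.array([1.0, 1.0])
mu2 = np.array([10.0, 10.0])
pi1 = np.array([0.8, 0.3])   # first pixel favors the near mode, second the far
d = bimodal_disparity(mu1, mu2, pi1)
```

The same idea extends to continuous 3D output points, which is what lets the mixture representation pair with both 2D and 3D backbone architectures.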
This list is automatically generated from the titles and abstracts of the papers in this site.