Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
- URL: http://arxiv.org/abs/2501.12375v2
- Date: Wed, 22 Jan 2025 11:33:54 GMT
- Title: Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
- Authors: Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, Bingyi Kang
- Abstract summary: We propose Video Depth Anything for high-quality, consistent depth estimation in super-long videos.
Our model is trained on a joint dataset of video depth and unlabeled images, similar to Depth Anything V2.
Our approach sets a new state-of-the-art in zero-shot video depth estimation.
- Score: 60.857723250653976
- License:
- Abstract: Depth Anything has achieved remarkable success in monocular depth estimation with strong generalization ability. However, it suffers from temporal inconsistency in videos, hindering its practical applications. Various methods have been proposed to alleviate this issue by leveraging video generation models or introducing priors from optical flow and camera poses. Nonetheless, these methods are only applicable to short videos (< 10 seconds) and require a trade-off between quality and computational efficiency. We propose Video Depth Anything for high-quality, consistent depth estimation in super-long videos (over several minutes) without sacrificing efficiency. We base our model on Depth Anything V2 and replace its head with an efficient spatial-temporal head. We design a straightforward yet effective temporal consistency loss by constraining the temporal depth gradient, eliminating the need for additional geometric priors. The model is trained on a joint dataset of video depth and unlabeled images, similar to Depth Anything V2. Moreover, a novel key-frame-based strategy is developed for long video inference. Experiments show that our model can be applied to arbitrarily long videos without compromising quality, consistency, or generalization ability. Comprehensive evaluations on multiple video benchmarks demonstrate that our approach sets a new state-of-the-art in zero-shot video depth estimation. We offer models of different scales to support a range of scenarios, with our smallest model capable of real-time performance at 30 FPS.
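The temporal consistency loss described in the abstract constrains the temporal depth gradient, i.e., how predicted depth changes between consecutive frames relative to a reference. A minimal sketch in PyTorch is given below; the function name `temporal_gradient_loss`, the (B, T, H, W) tensor layout, and the L1 penalty are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def temporal_gradient_loss(pred_depth: torch.Tensor, ref_depth: torch.Tensor) -> torch.Tensor:
    """Penalize mismatch between the temporal depth gradients of prediction and reference.

    pred_depth, ref_depth: tensors of shape (B, T, H, W) holding depth for T consecutive frames.
    """
    # Frame-to-frame depth change (temporal gradient).
    pred_grad = pred_depth[:, 1:] - pred_depth[:, :-1]
    ref_grad = ref_depth[:, 1:] - ref_depth[:, :-1]
    # L1 penalty on the gradient mismatch (illustrative choice; the paper may weight or mask this differently).
    return (pred_grad - ref_grad).abs().mean()
```

A loss of this form only needs pairs of consecutive depth maps, which is consistent with the abstract's claim that no additional geometric priors (optical flow or camera poses) are required.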
Related papers
- Align3R: Aligned Monocular Depth Estimation for Dynamic Videos [50.28715151619659]
We propose a novel video-depth estimation method called Align3R to estimate temporally consistent depth maps for a dynamic video.
Our key idea is to utilize the recent DUSt3R model to align estimated monocular depth maps of different timesteps.
Experiments demonstrate that Align3R estimates consistent video depth and camera poses for a monocular video, outperforming baseline methods.
arXiv Detail & Related papers (2024-12-04T07:09:59Z)
- Video Depth without Video Models [34.11454612504574]
Video depth estimation lifts monocular video clips to 3D by inferring dense depth at every frame.
We show how to turn a single-image latent diffusion model (LDM) into a state-of-the-art video depth estimator.
Our model, which we call RollingDepth, has two main ingredients: (i) a multi-frame depth estimator that is derived from a single-image LDM and maps very short video snippets to depth snippets.
arXiv Detail & Related papers (2024-11-28T14:50:14Z)
- Buffer Anytime: Zero-Shot Video Depth and Normal from Image Priors [54.8852848659663]
Buffer Anytime is a framework for estimating depth and normal maps (which we call geometric buffers) from video.
We demonstrate high-quality video buffer estimation by leveraging single-image priors with temporal consistency constraints.
arXiv Detail & Related papers (2024-11-26T09:28:32Z)
- Depth Any Video with Scalable Synthetic Data [98.42356740981839]
We develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse synthetic environments.
We leverage the powerful priors of generative video diffusion models to handle real-world videos effectively.
Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.
arXiv Detail & Related papers (2024-10-14T17:59:46Z)
- DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos [51.90501863934735]
We present DepthCrafter, a method for generating temporally consistent long depth sequences with intricate details for open-world videos.
The generalization ability to open-world videos is achieved by training the video-to-depth model from a pre-trained image-to-video diffusion model.
Our training approach enables the model to generate depth sequences with variable lengths at one time, up to 110 frames, and harvest both precise depth details and rich content diversity from realistic and synthetic datasets.
arXiv Detail & Related papers (2024-09-03T17:52:03Z)
- NVDS+: Towards Efficient and Versatile Neural Stabilizer for Video Depth Estimation [58.21817572577012]
Video depth estimation aims to infer temporally consistent depth.
We introduce NVDS+ that stabilizes inconsistent depth estimated by various single-image models in a plug-and-play manner.
We also construct a large-scale Video Depth in the Wild dataset, which contains 14,203 videos with over two million frames.
arXiv Detail & Related papers (2023-07-17T17:57:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.