Related papers: DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos

DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos

URL: http://arxiv.org/abs/2409.02095v1
Date: Tue, 3 Sep 2024 17:52:03 GMT
Title: DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos
Authors: Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, Ying Shan,
Abstract summary: DepthCrafter generates temporally consistent long depth sequences with intricate details for open-world videos. Our training approach enables the model to generate depth sequences with variable lengths at one time, up to 110 frames. DepthCrafter facilitates various downstream applications, including depth-based visual effects and conditional video generation.
Score: 51.90501863934735
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite significant advancements in monocular depth estimation for static images, estimating video depth in the open world remains challenging, since open-world videos are extremely diverse in content, motion, camera movement, and length. We present DepthCrafter, an innovative method for generating temporally consistent long depth sequences with intricate details for open-world videos, without requiring any supplementary information such as camera poses or optical flow. DepthCrafter achieves generalization ability to open-world videos by training a video-to-depth model from a pre-trained image-to-video diffusion model, through our meticulously designed three-stage training strategy with the compiled paired video-depth datasets. Our training approach enables the model to generate depth sequences with variable lengths at one time, up to 110 frames, and harvest both precise depth details and rich content diversity from realistic and synthetic datasets. We also propose an inference strategy that processes extremely long videos through segment-wise estimation and seamless stitching. Comprehensive evaluations on multiple datasets reveal that DepthCrafter achieves state-of-the-art performance in open-world video depth estimation under zero-shot settings. Furthermore, DepthCrafter facilitates various downstream applications, including depth-based visual effects and conditional video generation.

Related papers

SpatialVID: A Large-Scale Video Dataset with Spatial Annotations [58.01259302233675]
SpatialVID is a dataset of in-the-wild videos with diverse scenes, camera movements and dense 3D annotations such as per-frame camera poses, depth, and motion instructions.<n>We collect more than 21,000 hours of raw video, and process them into 2.7 million clips through a hierarchical filtering pipeline.<n>A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions.
arXiv Detail & Related papers (2025-09-11T17:59:31Z)
FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution [50.55876151973996]
A versatile video depth estimation model should (1) be accurate across frames, (2) produce high-resolution depth maps, and (3) support real-time streaming. We propose FlashDepth, a method that satisfies all three requirements, performing depth estimation on a 2044x1148 streaming video at 24 FPS.
arXiv Detail & Related papers (2025-04-09T17:59:31Z)
Video Depth Anything: Consistent Depth Estimation for Super-Long Videos [60.857723250653976]
We propose Video Depth Anything for high-quality, consistent depth estimation in super-long videos. Our model is trained on a joint dataset of video depth and unlabeled images, similar to Depth Anything V2. Our approach sets a new state-of-the-art in zero-shot video depth estimation.
arXiv Detail & Related papers (2025-01-21T18:53:30Z)
Align3R: Aligned Monocular Depth Estimation for Dynamic Videos [50.28715151619659]
We propose a novel video-depth estimation method called Align3R to estimate temporal consistent depth maps for a dynamic video. Our key idea is to utilize the recent DUSt3R model to align estimated monocular depth maps of different timesteps. Experiments demonstrate that Align3R estimates consistent video depth and camera poses for a monocular video with superior performance than baseline methods.
arXiv Detail & Related papers (2024-12-04T07:09:59Z)
Video Depth without Video Models [34.11454612504574]
Video depth estimation lifts monocular video clips to 3D by inferring dense depth at every frame. We show how to turn a single-image latent diffusion model (LDM) into a state-of-the-art video depth estimator. Our model, which we call RollingDepth, has two main ingredients: (i) a multi-frame depth estimator that is derived from a single-image LDM and maps very short video snippets to depth snippets.
arXiv Detail & Related papers (2024-11-28T14:50:14Z)
Depth Any Video with Scalable Synthetic Data [98.42356740981839]
We develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse synthetic environments. We leverage the powerful priors of generative video diffusion models to handle real-world videos effectively. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.
arXiv Detail & Related papers (2024-10-14T17:59:46Z)
Learning Temporally Consistent Video Depth from Video Diffusion Priors [57.929828486615605]
This work addresses the challenge of video depth estimation. We reformulate the prediction task into a conditional generation problem. This allows us to leverage the prior knowledge embedded in existing video generation models.
arXiv Detail & Related papers (2024-06-03T16:20:24Z)
NVDS+: Towards Efficient and Versatile Neural Stabilizer for Video Depth Estimation [58.21817572577012]
Video depth estimation aims to infer temporally consistent depth. We introduce NVDS+ that stabilizes inconsistent depth estimated by various single-image models in a plug-and-play manner. We also elaborate a large-scale Video Depth in the Wild dataset, which contains 14,203 videos with over two million frames.
arXiv Detail & Related papers (2023-07-17T17:57:01Z)
MonoDVPS: A Self-Supervised Monocular Depth Estimation Approach to Depth-aware Video Panoptic Segmentation [3.2489082010225494]
We propose a novel solution with a multi-task network that performs monocular depth estimation and video panoptic segmentation. We introduce panoptic-guided depth losses and a novel panoptic masking scheme for moving objects to avoid corrupting the training signal.
arXiv Detail & Related papers (2022-10-14T07:00:42Z)
Consistent Depth of Moving Objects in Video [52.72092264848864]
We present a method to estimate depth of a dynamic scene, containing arbitrary moving objects, from an ordinary video captured with a moving camera. We formulate this objective in a new test-time training framework where a depth-prediction CNN is trained in tandem with an auxiliary scene-flow prediction over the entire input video. We demonstrate accurate and temporally coherent results on a variety of challenging videos containing diverse moving objects (pets, people, cars) as well as camera motion.
arXiv Detail & Related papers (2021-08-02T20:53:18Z)
Video Depth Estimation by Fusing Flow-to-Depth Proposals [65.24533384679657]
We present an approach with a differentiable flow-to-depth layer for video depth estimation. The model consists of a flow-to-depth layer, a camera pose refinement module, and a depth fusion network. Our approach outperforms state-of-the-art depth estimation methods, and has reasonable cross dataset generalization capability.
arXiv Detail & Related papers (2019-12-30T10:45:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.