Learning Temporally Consistent Video Depth from Video Diffusion Priors
- URL: http://arxiv.org/abs/2406.01493v2
- Date: Tue, 4 Jun 2024 03:33:52 GMT
- Title: Learning Temporally Consistent Video Depth from Video Diffusion Priors
- Authors: Jiahao Shao, Yuanbo Yang, Hongyu Zhou, Youmin Zhang, Yujun Shen, Matteo Poggi, Yiyi Liao
- Abstract summary: This work addresses the challenge of video depth estimation.
We reformulate the prediction task into a conditional generation problem.
This allows us to leverage the prior knowledge embedded in existing video generation models.
- Score: 57.929828486615605
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work addresses the challenge of video depth estimation, which expects not only per-frame accuracy but, more importantly, cross-frame consistency. Instead of directly developing a depth estimator from scratch, we reformulate the prediction task into a conditional generation problem. This allows us to leverage the prior knowledge embedded in existing video generation models, thereby reducing learning difficulty and enhancing generalizability. Concretely, we study how to tame the public Stable Video Diffusion (SVD) to predict reliable depth from input videos using a mixture of image depth and video depth datasets. We empirically confirm that a procedural training strategy -- first optimizing the spatial layers of SVD and then optimizing the temporal layers while keeping the spatial layers frozen -- yields the best results in terms of both spatial accuracy and temporal consistency. We further examine the sliding window strategy for inference on arbitrarily long videos. Our observations indicate a trade-off between efficiency and performance, with a one-frame overlap already producing favorable results. Extensive experimental results demonstrate the superiority of our approach, termed ChronoDepth, over existing alternatives, particularly in terms of the temporal consistency of the estimated depth. Additionally, we highlight the benefits of more consistent video depth in two practical applications: depth-conditioned video generation and novel view synthesis. Our project page is available at https://jhaoshao.github.io/ChronoDepth/.
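The sliding-window inference described in the abstract can be sketched as follows. This is a minimal illustration only, assuming a hypothetical `predict_depth` callable that stands in for the SVD-based depth model; the window size, overlap handling, and function names are not specified by the paper.

```python
def sliding_window_depth(frames, predict_depth, window=8, overlap=1):
    """Run a fixed-length video depth model over an arbitrarily long video.

    Consecutive windows share `overlap` frames so each new window is anchored
    to frames that were already predicted; for the shared frames, the earlier
    prediction is kept. With overlap=1 (as the abstract reports), only one
    frame per window is recomputed.
    """
    n = len(frames)
    if n <= window:
        return list(predict_depth(frames))

    # First window: predict all frames.
    depths = list(predict_depth(frames[:window]))

    stride = window - overlap
    start = stride
    # Stop once every remaining frame beyond the overlap is covered.
    while start + overlap < n:
        chunk = frames[start:start + window]
        pred = list(predict_depth(chunk))
        # Drop the first `overlap` predictions: those frames are already done.
        depths.extend(pred[overlap:])
        start += stride
    return depths
```

A larger overlap would trade more redundant computation for smoother stitching; the abstract notes that a one-frame overlap already gives favorable results.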
Related papers
- Neural Video Depth Stabilizer [74.04508918791637]
Video depth estimation aims to infer temporally consistent depth.
Some methods achieve temporal consistency by finetuning a single-image depth model during test time using geometry and re-projection constraints.
We propose a plug-and-play framework that stabilizes inconsistent depth estimations and can be applied to different single-image depth models without extra effort.
arXiv Detail & Related papers (2023-07-17T17:57:01Z) - Globally Consistent Video Depth and Pose Estimation with Efficient Test-Time Training [15.46056322267856]
We present GCVD, a globally consistent method for learning-based video structure from motion (SfM)
GCVD integrates a compact pose graph into the CNN-based optimization, achieving global consistency via an effective selection mechanism.
Experimental results show that GCVD outperforms the state-of-the-art methods on both depth and pose estimation.
arXiv Detail & Related papers (2022-08-04T15:12:03Z) - Deep Video Prior for Video Consistency and Propagation [58.250209011891904]
We present a novel and general approach for blind video temporal consistency.
Our method is only trained on a pair of original and processed videos directly instead of a large dataset.
We show that temporal consistency can be achieved by training a convolutional neural network on a video with Deep Video Prior.
arXiv Detail & Related papers (2022-01-27T16:38:52Z) - ViPNAS: Efficient Video Pose Estimation via Neural Architecture Search [94.90294600817215]
We propose a novel neural architecture search (NAS) method, termed ViPNAS, to search networks in both spatial and temporal levels for fast online video pose estimation.
In the spatial level, we carefully design the search space with five different dimensions including network depth, width, kernel size, group number, and attentions.
In the temporal level, we search from a series of temporal feature fusions to optimize the total accuracy and speed across multiple video frames.
arXiv Detail & Related papers (2021-05-21T06:36:40Z) - Blind Video Temporal Consistency via Deep Video Prior [61.062900556483164]
We present a novel and general approach for blind video temporal consistency.
Our method is only trained on a pair of original and processed videos directly.
We show that temporal consistency can be achieved by training a convolutional network on a video with the Deep Video Prior.
arXiv Detail & Related papers (2020-10-22T16:19:20Z) - Don't Forget The Past: Recurrent Depth Estimation from Monocular Video [92.84498980104424]
We put three different types of depth estimation into a common framework.
Our method produces a time series of depth maps.
It can be applied to monocular videos only or be combined with different types of sparse depth patterns.
arXiv Detail & Related papers (2020-01-08T16:50:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.