Neural Video Depth Stabilizer
- URL: http://arxiv.org/abs/2307.08695v2
- Date: Thu, 10 Aug 2023 09:36:06 GMT
- Title: Neural Video Depth Stabilizer
- Authors: Yiran Wang, Min Shi, Jiaqi Li, Zihao Huang, Zhiguo Cao, Jianming
Zhang, Ke Xian, Guosheng Lin
- Abstract summary: Video depth estimation aims to infer temporally consistent depth.
Some methods achieve temporal consistency by finetuning a single-image depth model during test time using geometry and re-projection constraints.
We propose a plug-and-play framework that stabilizes inconsistent depth estimations and can be applied to different single-image depth models without extra effort.
- Score: 74.04508918791637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video depth estimation aims to infer temporally consistent depth. Some
methods achieve temporal consistency by finetuning a single-image depth model
during test time using geometry and re-projection constraints, which is
inefficient and not robust. An alternative approach is to learn how to enforce
temporal consistency from data, but this requires well-designed models and
sufficient video depth data. To address these challenges, we propose a
plug-and-play framework called Neural Video Depth Stabilizer (NVDS) that
stabilizes inconsistent depth estimations and can be applied to different
single-image depth models without extra effort. We also introduce a large-scale
dataset, Video Depth in the Wild (VDW), which consists of 14,203 videos with
over two million frames, making it the largest natural-scene video depth
dataset to our knowledge. We evaluate our method on the VDW dataset as well as
two public benchmarks and demonstrate significant improvements in consistency,
accuracy, and efficiency compared to previous approaches. Our work serves as a
solid baseline and provides a data foundation for learning-based video depth
models. We will release our dataset and code for future research.
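The abstract does not spell out the stabilizer's design, but the plug-and-play idea can be pictured with a minimal sketch: an off-the-shelf single-image depth model produces per-frame (and possibly flickering) depth maps, and a small refinement module consumes a short sliding window of those maps together with the RGB frames and outputs a stabilized depth for the key frame. Everything below (the window size, the convolutional refiner, the residual prediction) is an illustrative assumption, not the actual NVDS architecture.

```python
# Minimal, hypothetical sketch of a plug-and-play video depth stabilizer.
# Any single-image depth model can supply the per-frame predictions; the
# stabilizer below (architecture, window size = 4) is an illustrative
# assumption, not the actual NVDS design.
import torch
import torch.nn as nn


class DepthStabilizer(nn.Module):
    """Refines a sliding window of per-frame depth maps into one
    temporally consistent depth map for the key (last) frame."""

    def __init__(self, window: int = 4, channels: int = 32):
        super().__init__()
        self.window = window
        # Input: `window` raw depth maps (1 ch) + `window` RGB frames (3 ch).
        in_ch = window * (1 + 3)
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, depths: torch.Tensor, frames: torch.Tensor) -> torch.Tensor:
        # depths: (B, window, 1, H, W), frames: (B, window, 3, H, W)
        b, _, _, h, w = depths.shape
        x = torch.cat([depths.reshape(b, -1, h, w),
                       frames.reshape(b, -1, h, w)], dim=1)
        # Predict a residual on top of the key (last) frame's raw depth.
        return depths[:, -1] + self.net(x)


@torch.no_grad()
def stabilize_video(frames, single_image_model, stabilizer, window=4):
    """frames: (T, 3, H, W); single_image_model maps (1,3,H,W) -> (1,1,H,W)."""
    raw = torch.cat([single_image_model(f[None]) for f in frames])  # (T,1,H,W)
    out = []
    for t in range(len(frames)):
        idx = [max(t - i, 0) for i in reversed(range(window))]  # pad at start
        out.append(stabilizer(raw[idx][None], frames[idx][None])[0])
    return torch.stack(out)  # (T, 1, H, W) temporally smoothed depth
```

In this pattern the single-image model is treated as a frozen black box, which is what makes the stabilizer applicable to different depth backbones without extra effort.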
Related papers
- FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution [50.55876151973996]
A versatile video depth estimation model should (1) be accurate across frames, (2) produce high-resolution depth maps, and (3) support real-time streaming.
We propose FlashDepth, a method that satisfies all three requirements, performing depth estimation on a 2044x1148 streaming video at 24 FPS.
arXiv Detail & Related papers (2025-04-09T17:59:31Z)
- Video Depth Anything: Consistent Depth Estimation for Super-Long Videos [60.857723250653976]
We propose Video Depth Anything for high-quality, consistent depth estimation in super-long videos.
Our model is trained on a joint dataset of video depth and unlabeled images, similar to Depth Anything V2.
Our approach sets a new state-of-the-art in zero-shot video depth estimation.
arXiv Detail & Related papers (2025-01-21T18:53:30Z)
- Align3R: Aligned Monocular Depth Estimation for Dynamic Videos [50.28715151619659]
We propose a novel video depth estimation method called Align3R to estimate temporally consistent depth maps for a dynamic video.
Our key idea is to utilize the recent DUSt3R model to align estimated monocular depth maps of different timesteps.
Experiments demonstrate that Align3R estimates consistent video depth and camera poses for a monocular video, outperforming baseline methods.
arXiv Detail & Related papers (2024-12-04T07:09:59Z)
- Video Depth without Video Models [34.11454612504574]
Video depth estimation lifts monocular video clips to 3D by inferring dense depth at every frame.
We show how to turn a single-image latent diffusion model (LDM) into a state-of-the-art video depth estimator.
Our model, which we call RollingDepth, has two main ingredients: (i) a multi-frame depth estimator that is derived from a single-image LDM and maps very short video snippets to depth snippets.
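As a rough illustration of the snippet-based design, the sketch below slides a multi-frame ("snippet") depth estimator over a video with overlapping windows and merges overlapping predictions by simple averaging; the snippet model interface and the averaging step are assumptions made for illustration, not the paper's actual merging procedure.

```python
# Hypothetical illustration of snippet-based ("rolling") video depth
# inference: a multi-frame estimator maps short overlapping snippets to
# depth snippets, and overlaps are merged by simple averaging. The real
# RollingDepth merging step is not described in this excerpt and differs.
import torch


@torch.no_grad()
def rolling_depth(frames, snippet_model, snippet_len=3, stride=1):
    """frames: (T, 3, H, W); snippet_model maps (1, S, 3, H, W) -> (1, S, 1, H, W)."""
    t, _, h, w = frames.shape
    depth_sum = torch.zeros(t, 1, h, w)
    counts = torch.zeros(t, 1, 1, 1)
    for start in range(0, t - snippet_len + 1, stride):
        idx = list(range(start, start + snippet_len))
        snippet_depth = snippet_model(frames[start:start + snippet_len][None])[0]
        depth_sum[idx] += snippet_depth  # accumulate overlapping predictions
        counts[idx] += 1
    return depth_sum / counts.clamp(min=1)  # (T, 1, H, W) averaged depth
```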
arXiv Detail & Related papers (2024-11-28T14:50:14Z)
- Depth Estimation From Monocular Images With Enhanced Encoder-Decoder Architecture [0.0]
This paper introduces a novel deep learning-based approach using an encoder-decoder architecture.
The Inception-ResNet-v2 model is utilized as the encoder.
Experimental results on the NYU Depth V2 dataset show that our model achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-10-15T13:46:19Z)
- Depth Any Video with Scalable Synthetic Data [98.42356740981839]
We develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse synthetic environments.
We leverage the powerful priors of generative video diffusion models to handle real-world videos effectively.
Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.
arXiv Detail & Related papers (2024-10-14T17:59:46Z)
- Learning Temporally Consistent Video Depth from Video Diffusion Priors [57.929828486615605]
This work addresses the challenge of video depth estimation.
We reformulate the prediction task into a conditional generation problem.
This allows us to leverage the prior knowledge embedded in existing video generation models.
arXiv Detail & Related papers (2024-06-03T16:20:24Z)
- SM4Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One Model [72.0795843450604]
Current approaches face challenges in maintaining consistent accuracy across diverse scenes.
These methods rely on extensive training datasets comprising millions, if not tens of millions, of samples.
This paper presents SM4Depth, a model that seamlessly works for both indoor and outdoor scenes.
arXiv Detail & Related papers (2024-03-13T14:08:25Z)
- FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models [67.96827539201071]
We propose a novel test-time optimization approach for 3D scene reconstruction.
Our method achieves state-of-the-art cross-dataset reconstruction on five zero-shot testing datasets.
arXiv Detail & Related papers (2023-08-10T17:55:02Z)
- ViDaS Video Depth-aware Saliency Network [40.08270905030302]
We introduce ViDaS, a two-stream, fully convolutional Video Depth-Aware Saliency network.
It addresses the problem of attention modeling "in-the-wild" via saliency prediction in videos.
The network consists of two visual streams: one for the RGB frames and one for the depth frames.
It is trained end-to-end and is evaluated in a variety of different databases with eye-tracking data.
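A generic two-stream layout of this kind can be sketched as follows; the layer choices and the fusion step are illustrative assumptions rather than the published ViDaS architecture.

```python
# Generic two-stream sketch (illustrative only, not the published ViDaS
# architecture): one convolutional stream for RGB frames, one for depth
# frames, fused into a single saliency map.
import torch
import torch.nn as nn


def _stream(in_ch: int, feat: int = 32) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True),
    )


class TwoStreamSaliency(nn.Module):
    def __init__(self, feat: int = 32):
        super().__init__()
        self.rgb_stream = _stream(3, feat)     # processes RGB frames
        self.depth_stream = _stream(1, feat)   # processes depth frames
        self.fuse = nn.Conv2d(2 * feat, 1, 1)  # fully convolutional head

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # rgb: (B, 3, H, W), depth: (B, 1, H, W) -> saliency map (B, 1, H, W)
        feats = torch.cat([self.rgb_stream(rgb), self.depth_stream(depth)], dim=1)
        return torch.sigmoid(self.fuse(feats))
```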
arXiv Detail & Related papers (2023-05-19T15:04:49Z)
- Towards Accurate Reconstruction of 3D Scene Shape from A Single Monocular Image [91.71077190961688]
We propose a two-stage framework that first predicts depth up to an unknown scale and shift from a single monocular image.
We then exploit 3D point cloud data to predict the depth shift and the camera's focal length, which allows us to recover 3D scene shapes.
We test our depth model on nine unseen datasets and achieve state-of-the-art performance on zero-shot evaluation.
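To make the "up to an unknown scale and shift" formulation concrete: a relative depth prediction relates to metric depth as d ≈ s·d̂ + t. The snippet below is a generic least-squares recovery of (s, t) from a few reference depths, shown only to illustrate the formulation; the paper itself predicts the shift and the focal length from 3D point cloud data rather than fitting against ground truth.

```python
# Generic illustration (not the paper's actual module) of the
# "depth up to an unknown scale and shift" formulation: metric depth is
# modeled as d ~= s * d_pred + t, and (s, t) can be recovered by least
# squares when a few reference metric depths are available.
import numpy as np


def fit_scale_shift(pred: np.ndarray, ref: np.ndarray) -> tuple[float, float]:
    """Least-squares fit of ref ~= s * pred + t over paired samples."""
    a = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)  # design matrix
    (s, t), *_ = np.linalg.lstsq(a, ref.ravel(), rcond=None)
    return float(s), float(t)


# Toy usage: an affine-invariant prediction aligned back to metric depth.
rng = np.random.default_rng(0)
metric = rng.uniform(1.0, 10.0, size=100)   # reference depths in meters
pred = (metric - 2.0) / 4.0                 # prediction up to scale and shift
s, t = fit_scale_shift(pred, metric)        # recovers s ~= 4.0, t ~= 2.0
np.testing.assert_allclose(s * pred + t, metric, rtol=1e-6)
```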
arXiv Detail & Related papers (2022-08-28T16:20:14Z)