NVDS+: Towards Efficient and Versatile Neural Stabilizer for Video Depth Estimation
- URL: http://arxiv.org/abs/2307.08695v3
- Date: Thu, 03 Oct 2024 17:58:03 GMT
- Title: NVDS+: Towards Efficient and Versatile Neural Stabilizer for Video Depth Estimation
- Authors: Yiran Wang, Min Shi, Jiaqi Li, Chaoyi Hong, Zihao Huang, Juewen Peng, Zhiguo Cao, Jianming Zhang, Ke Xian, Guosheng Lin
- Abstract summary: Video depth estimation aims to infer temporally consistent depth.
We introduce NVDS+ that stabilizes inconsistent depth estimated by various single-image models in a plug-and-play manner.
We also elaborate a large-scale Video Depth in the Wild dataset, which contains 14,203 videos with over two million frames.
- Score: 58.21817572577012
- License:
- Abstract: Video depth estimation aims to infer temporally consistent depth. One approach is to finetune a single-image model on each video with geometry constraints, which proves inefficient and lacks robustness. An alternative is learning to enforce consistency from data, which requires well-designed models and sufficient video depth data. To address both challenges, we introduce NVDS+ that stabilizes inconsistent depth estimated by various single-image models in a plug-and-play manner. We also elaborate a large-scale Video Depth in the Wild (VDW) dataset, which contains 14,203 videos with over two million frames, making it the largest natural-scene video depth dataset. Additionally, a bidirectional inference strategy is designed to improve consistency by adaptively fusing forward and backward predictions. We instantiate a model family ranging from small to large scales for different applications. The method is evaluated on the VDW dataset and three public benchmarks. To further prove the versatility, we extend NVDS+ to video semantic segmentation and several downstream applications like bokeh rendering, novel view synthesis, and 3D reconstruction. Experimental results show that our method achieves significant improvements in consistency, accuracy, and efficiency. Our work serves as a solid baseline and data foundation for learning-based video depth estimation. Code and dataset are available at: https://github.com/RaymondWang987/NVDS
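The abstract's bidirectional inference strategy fuses forward and backward temporal passes over the video. The sketch below is only an illustration of that idea, not the authors' implementation: the `stabilize` function is a hypothetical placeholder for any windowed temporal stabilizer, and the position-based linear fusion weights are an assumption (the actual NVDS+ weighting may be learned or computed differently).

```python
# Minimal sketch of bidirectional inference for video depth stabilization.
# Assumptions: `stabilize` stands in for a temporal stabilizer that refines the
# last frame of a window of initial single-image depth maps; the fusion weights
# are a simple position-based heuristic, not the paper's actual scheme.
import numpy as np

def stabilize(depth_window: np.ndarray) -> np.ndarray:
    """Placeholder stabilizer: returns the refined depth of the window's last frame."""
    return depth_window[-1]

def bidirectional_inference(initial_depths: np.ndarray, window: int = 4) -> np.ndarray:
    """Fuse forward-pass and backward-pass refined depths per frame.

    initial_depths: (T, H, W) inconsistent depths from a single-image model.
    Returns (T, H, W) fused depths.
    """
    T = initial_depths.shape[0]
    # Forward pass: each frame sees up to `window` past frames as context.
    fwd = np.stack([
        stabilize(initial_depths[max(0, t - window + 1): t + 1]) for t in range(T)
    ])
    # Backward pass: each frame sees up to `window` future frames, reversed in time.
    bwd = np.stack([
        stabilize(initial_depths[t: t + window][::-1]) for t in range(T)
    ])
    # Assumed adaptive weighting: trust the forward pass more as temporal context
    # accumulates, and the backward pass more near the start of the clip.
    w = np.linspace(0.0, 1.0, T).reshape(T, 1, 1)
    return w * fwd + (1.0 - w) * bwd
```

In practice the stabilizer would be the learned refinement network and the fusion could be learned per pixel; the linear ramp here is only meant to show how the two passes can be combined adaptively.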
Related papers
- Depth Estimation From Monocular Images With Enhanced Encoder-Decoder Architecture [0.0]
This paper introduces a novel deep learning-based approach using an encoder-decoder architecture.
The Inception-ResNet-v2 model is utilized as the encoder.
Experimental results on the NYU Depth V2 dataset show that our model achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-10-15T13:46:19Z)
- Depth Any Video with Scalable Synthetic Data [98.42356740981839]
We develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse synthetic environments.
We leverage the powerful priors of generative video diffusion models to handle real-world videos effectively.
Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.
arXiv Detail & Related papers (2024-10-14T17:59:46Z)
- Learning Temporally Consistent Video Depth from Video Diffusion Priors [57.929828486615605]
This work addresses the challenge of video depth estimation.
We reformulate the prediction task into a conditional generation problem.
This allows us to leverage the prior knowledge embedded in existing video generation models.
arXiv Detail & Related papers (2024-06-03T16:20:24Z)
- SM4Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One Model [72.0795843450604]
Current approaches face challenges in maintaining consistent accuracy across diverse scenes.
These methods rely on extensive datasets comprising millions, if not tens of millions, of training samples.
This paper presents SM4Depth, a model that seamlessly works for both indoor and outdoor scenes.
arXiv Detail & Related papers (2024-03-13T14:08:25Z)
- FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models [67.96827539201071]
We propose a novel test-time optimization approach for 3D scene reconstruction.
Our method achieves state-of-the-art cross-dataset reconstruction on five zero-shot testing datasets.
arXiv Detail & Related papers (2023-08-10T17:55:02Z)
- ViDaS Video Depth-aware Saliency Network [40.08270905030302]
We introduce ViDaS, a two-stream, fully convolutional Video, Depth-Aware Saliency network.
It addresses the problem of attention modeling "in-the-wild" via saliency prediction in videos.
The network consists of two visual streams: one for the RGB frames and one for the depth frames.
It is trained end-to-end and evaluated on a variety of databases with eye-tracking data.
arXiv Detail & Related papers (2023-05-19T15:04:49Z)
- Towards Accurate Reconstruction of 3D Scene Shape from A Single Monocular Image [91.71077190961688]
We propose a two-stage framework that first predicts depth up to an unknown scale and shift from a single monocular image.
We then exploit 3D point cloud data to predict the depth shift and the camera's focal length, which allows us to recover 3D scene shapes.
We test our depth model on nine unseen datasets and achieve state-of-the-art performance on zero-shot evaluation.
arXiv Detail & Related papers (2022-08-28T16:20:14Z)