Video Depth Propagation
- URL: http://arxiv.org/abs/2512.10725v1
- Date: Thu, 11 Dec 2025 15:08:37 GMT
- Title: Video Depth Propagation
- Authors: Luigi Piccinelli, Thiemo Wandel, Christos Sakaridis, Wim Abbeloos, Luc Van Gool
- Abstract summary: Existing methods rely on simple frame-by-frame monocular models, leading to temporal inconsistencies and inaccuracies. We propose VeloDepth, which effectively leverages an online video pipeline and performs deep feature propagation. Our design structurally enforces temporal consistency, resulting in stable depth predictions across consecutive frames with improved efficiency.
- Score: 54.523028170425256
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Depth estimation in videos is essential for visual perception in real-world applications. However, existing methods either rely on simple frame-by-frame monocular models, leading to temporal inconsistencies and inaccuracies, or use computationally demanding temporal modeling, unsuitable for real-time applications. These limitations significantly restrict general applicability and performance in practical settings. To address this, we propose VeloDepth, an efficient and robust online video depth estimation pipeline that effectively leverages spatiotemporal priors from previous depth predictions and performs deep feature propagation. Our method introduces a novel Propagation Module that refines and propagates depth features and predictions using flow-based warping coupled with learned residual corrections. In addition, our design structurally enforces temporal consistency, resulting in stable depth predictions across consecutive frames with improved efficiency. Comprehensive zero-shot evaluation on multiple benchmarks demonstrates the state-of-the-art temporal consistency and competitive accuracy of VeloDepth, alongside its significantly faster inference compared to existing video-based depth estimators. VeloDepth thus provides a practical, efficient, and accurate solution for real-time depth estimation suitable for diverse perception tasks. Code and models are available at https://github.com/lpiccinelli-eth/velodepth
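The abstract describes the Propagation Module only at a high level. The sketch below is one plausible reading of it, assuming previous-frame depth features are backward-warped with optical flow and a small CNN predicts a residual correction; all names and the network shape are illustrative assumptions, not the released code.

```python
# Hypothetical sketch of flow-based depth feature propagation with a learned
# residual correction, in the spirit of VeloDepth's Propagation Module.
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp_with_flow(features, flow):
    """Backward-warp `features` (B, C, H, W) by optical `flow` (B, 2, H, W)."""
    b, _, h, w = features.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=features.device, dtype=features.dtype),
        torch.arange(w, device=features.device, dtype=features.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow  # absolute coords
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    gx = 2.0 * grid[:, 0] / (w - 1) - 1.0
    gy = 2.0 * grid[:, 1] / (h - 1) - 1.0
    return F.grid_sample(features, torch.stack((gx, gy), dim=-1),
                         align_corners=True)

class PropagationModule(nn.Module):
    """Warp previous depth features forward in time, then refine with a CNN."""
    def __init__(self, channels):
        super().__init__()
        self.residual_net = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, prev_feat, curr_feat, flow):
        warped = warp_with_flow(prev_feat, flow)
        # Learned residual corrects warping errors (occlusions, flow noise).
        return warped + self.residual_net(torch.cat((warped, curr_feat), dim=1))
```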
Related papers
- Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation [35.24180364395977]
We show that adapting only the decoder is sufficient for effective test-time optimization. We propose a lightweight test-time adaptation method that updates only a low-rank subspace of the decoder using sparse depth supervision.
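As a concrete illustration of the idea, the sketch below optimizes only a low-rank delta on frozen decoder layers against sparse depth measurements at test time. The LoRA-style parametrization and all names are assumptions for illustration; the paper's actual subspace construction may differ.

```python
# Minimal sketch: test-time adaptation through low-rank decoder updates
# supervised by sparse depth. Assumes the adapters have been spliced into
# the decoder so that `model(image)` uses them during the forward pass.
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Adds a trainable low-rank delta (B @ A) to a frozen linear layer."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)  # frozen pretrained weights
        self.A = nn.Parameter(torch.zeros(rank, base.in_features))
        self.B = nn.Parameter(torch.randn(base.out_features, rank) * 1e-3)

    def forward(self, x):
        return self.base(x) + x @ self.A.t() @ self.B.t()

def test_time_adapt(model, adapters, image, sparse_depth, mask, steps=10):
    """Optimize only the adapter parameters against sparse depth points."""
    opt = torch.optim.Adam([p for a in adapters for p in a.parameters()], lr=1e-4)
    for _ in range(steps):
        pred = model(image)
        # L1 error restricted to pixels where sparse depth is available.
        loss = ((pred - sparse_depth).abs() * mask).sum() / mask.sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model(image).detach()
```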
arXiv Detail & Related papers (2026-03-02T11:45:19Z) - StableDPT: Temporal Stable Monocular Video Depth Estimation [14.453483279783908]
We propose a novel approach that adapts any state-of-the-art image-based depth estimation model for video processing. Our architecture builds upon an off-the-shelf Vision Transformer (ViT) encoder and enhances the Dense Prediction Transformer (DPT) head. Evaluations on multiple benchmark datasets demonstrate improved temporal consistency, competitive state-of-the-art performance, and 2x faster processing in real-world scenarios.
arXiv Detail & Related papers (2026-01-06T08:02:14Z) - Video Depth Anything: Consistent Depth Estimation for Super-Long Videos [60.857723250653976]
We propose Video Depth Anything for high-quality, consistent depth estimation in super-long videos. Our model is trained on a joint dataset of video depth and unlabeled images, similar to Depth Anything V2. Our approach sets a new state-of-the-art in zero-shot video depth estimation.
arXiv Detail & Related papers (2025-01-21T18:53:30Z) - MEDeA: Multi-view Efficient Depth Adjustment [45.90423821963144]
MEDeA is an efficient multi-view test-time depth adjustment method that is an order of magnitude faster than existing test-time approaches.
Our method sets a new state-of-the-art on TUM RGB-D, 7Scenes, and ScanNet benchmarks and successfully handles smartphone-captured data from the ARKitScenes dataset.
arXiv Detail & Related papers (2024-06-17T19:39:13Z) - Learning Temporally Consistent Video Depth from Video Diffusion Priors [62.36887303063542]
This work addresses the challenge of streamed video depth estimation. We argue that sharing contextual information between frames or clips is pivotal in fostering temporal consistency. We propose a consistent context-aware training and inference strategy for arbitrarily long videos to provide cross-clip context.
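One simple way to realize cross-clip context sharing is to run the model on overlapping clips and re-feed the trailing frames of each clip as context for the next. The sketch below shows that scheme; the window and overlap handling are hedged assumptions, not the paper's exact strategy.

```python
# Illustrative cross-clip context sharing for long-video depth estimation:
# consecutive clips overlap, and overlapping frames carry context forward.
def sliding_clip_inference(frames, model, clip_len=32, overlap=8):
    """Run a clip-based depth model over an arbitrarily long frame list."""
    depths = []
    context = []  # frames re-fed to the model to anchor the next clip
    start = 0
    while start < len(frames):
        clip = context + frames[start:start + clip_len - len(context)]
        pred = model(clip)  # one depth map per frame in the clip
        # Keep only predictions for the new (non-context) frames.
        depths.extend(pred[len(context):])
        start += len(clip) - len(context)
        context = clip[-overlap:]  # trailing frames seed the next clip
    return depths
```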
arXiv Detail & Related papers (2024-06-03T16:20:24Z) - NVDS+: Towards Efficient and Versatile Neural Stabilizer for Video Depth Estimation [58.21817572577012]
Video depth estimation aims to infer temporally consistent depth.
We introduce NVDS+ that stabilizes inconsistent depth estimated by various single-image models in a plug-and-play manner.
We also introduce a large-scale Video Depth in the Wild dataset, which contains 14,203 videos with over two million frames.
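The plug-and-play idea can be pictured as a wrapper around any single-image depth model, with a learned network fusing a short temporal window of its per-frame outputs. The sketch below assumes a simple 3D-convolutional stabilizer; `Stabilizer` and its shape are hypothetical stand-ins for NVDS+'s actual network.

```python
# Hedged sketch of plug-and-play depth stabilization over a temporal window.
import torch
import torch.nn as nn

class Stabilizer(nn.Module):
    """Fuses a temporal window of per-frame depth maps into a stable one."""
    def __init__(self, window: int = 4):
        super().__init__()
        self.window = window
        # Collapses the time dimension (kernel covers the full window).
        self.fuse = nn.Conv3d(1, 1, kernel_size=(window, 3, 3),
                              padding=(0, 1, 1))

    def forward(self, depth_window):  # (B, 1, T, H, W), T == window
        return self.fuse(depth_window).squeeze(2)  # (B, 1, H, W)

def stabilize_video(frames, mono_model, stabilizer):
    """Plug-and-play: wrap an arbitrary single-image depth model."""
    raw = [mono_model(f) for f in frames]     # inconsistent per-frame depth
    out = list(raw[: stabilizer.window - 1])  # warm-up: pass through as-is
    for t in range(stabilizer.window - 1, len(raw)):
        win = torch.stack(raw[t - stabilizer.window + 1 : t + 1], dim=2)
        out.append(stabilizer(win))
    return out
```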
arXiv Detail & Related papers (2023-07-17T17:57:01Z) - Robust Monocular Localization of Drones by Adapting Domain Maps to Depth Prediction Inaccuracies [0.4523163728236143]
We present a novel monocular localization framework that jointly trains deep learning-based depth prediction and Bayesian filtering-based pose reasoning.
The proposed cross-modal framework significantly outperforms deep learning-only predictions with respect to model scalability and tolerance to environmental variations.
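To make the Bayesian-filtering side concrete, the toy below runs a particle filter over a scalar position, reweighting particles by how well a depth reading matches what a known map predicts. The wall-distance measurement model and all parameters are illustrative assumptions, not the paper's cross-modal framework.

```python
# Minimal particle-filter sketch: pose reasoning fused with a depth cue.
import numpy as np

def pf_step(particles, weights, motion, depth_meas, expected_depth,
            noise_std=0.05, meas_std=0.3):
    """One predict/update step of a particle filter over scalar position."""
    # Predict: propagate particles through the motion model plus noise.
    particles = particles + motion + np.random.normal(0, noise_std, particles.shape)
    # Update: reweight by agreement between measured and expected depth.
    err = depth_meas - expected_depth(particles)
    weights = weights * np.exp(-0.5 * (err / meas_std) ** 2)
    weights = weights / weights.sum()
    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < len(particles) / 2:
        idx = np.random.choice(len(particles), len(particles), p=weights)
        particles = particles[idx]
        weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights

# Toy usage: a drone advances toward a wall at x = 10; depth = distance to it.
wall = 10.0
parts = np.random.uniform(0, 5, 500)
wts = np.full(500, 1.0 / 500)
for depth in [9.0, 8.0, 7.0]:  # simulated depth readings, one per step
    parts, wts = pf_step(parts, wts, motion=1.0, depth_meas=depth,
                         expected_depth=lambda x: wall - x)
print("estimated position:", np.sum(parts * wts))
```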
arXiv Detail & Related papers (2022-10-27T15:48:53Z) - Globally Consistent Video Depth and Pose Estimation with Efficient Test-Time Training [15.46056322267856]
We present GCVD, a globally consistent method for learning-based video structure from motion (SfM).
GCVD integrates a compact pose graph into the CNN-based optimization, achieving global consistency through an effective selection mechanism.
Experimental results show that GCVD outperforms the state-of-the-art methods on both depth and pose estimation.
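A pose graph encodes relative-pose constraints between frames and optimizes all poses to satisfy them jointly; loop closures are what make the result globally consistent. The toy below reduces poses to 1D translations so the least-squares structure stays visible; GCVD itself optimizes full camera poses within its CNN-based pipeline.

```python
# Toy pose-graph optimization with a loop-closure constraint (1D poses).
import numpy as np

def optimize_pose_graph(n_poses, edges, iters=50, lr=0.1):
    """edges: list of (i, j, measured_offset) relative constraints."""
    x = np.zeros(n_poses)  # pose estimates; pose 0 anchored at the origin
    for _ in range(iters):
        grad = np.zeros(n_poses)
        for i, j, z in edges:
            r = (x[j] - x[i]) - z  # residual of the relative constraint
            grad[j] += r           # d(0.5 * r^2) / dx_j
            grad[i] -= r           # d(0.5 * r^2) / dx_i
        grad[0] = 0.0              # keep the gauge (global offset) fixed
        x -= lr * grad             # gradient step on the total squared error
    return x

# Example: chain odometry plus one loop-closure edge pulling it consistent.
edges = [(0, 1, 1.0), (1, 2, 1.1), (2, 3, 0.9), (0, 3, 3.15)]
print(optimize_pose_graph(4, edges))
```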
arXiv Detail & Related papers (2022-08-04T15:12:03Z) - Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, we design a compact hybrid network inspired by recent Mixture-of-Experts models.
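The geometry underlying epipolar attention is standard: for a pixel in one frame, candidate matches in another frame lie on an epipolar line determined by the relative pose. The sketch below computes that line and samples points along it, which is the kind of restriction an EST-style transformer can exploit; the sampling scheme itself is illustrative.

```python
# Epipolar geometry sketch: fundamental matrix and line sampling.
import numpy as np

def fundamental_matrix(K, R, t):
    """F = K^-T [t]_x R K^-1 for shared calibration K and relative pose (R, t)."""
    tx = np.array([[0, -t[2], t[1]],
                   [t[2], 0, -t[0]],
                   [-t[1], t[0], 0]])  # cross-product matrix of t
    Kinv = np.linalg.inv(K)
    return Kinv.T @ tx @ R @ Kinv

def epipolar_samples(pixel, F, width, n=16):
    """Sample n points along the epipolar line of `pixel` in the other view."""
    l = F @ np.array([pixel[0], pixel[1], 1.0])  # line: ax + by + c = 0
    xs = np.linspace(0, width - 1, n)
    ys = -(l[0] * xs + l[2]) / l[1]  # assumes the line is not vertical
    return np.stack([xs, ys], axis=1)
```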
arXiv Detail & Related papers (2020-11-26T04:04:21Z) - Towards Streaming Perception [70.68520310095155]
We present an approach that coherently integrates latency and accuracy into a single metric for real-time online perception.
The key insight behind this metric is to jointly evaluate the output of the entire perception stack at every time instant.
We focus on the illustrative tasks of object detection and instance segmentation in urban video streams, and contribute a novel dataset with high-quality and temporally-dense annotations.
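The key insight translates into a simple evaluation rule: at every query instant, the stack is judged by its most recent *completed* output, so a slow-but-accurate model is penalized by staleness. The pairing logic below follows that idea; the accuracy function is a placeholder, and inputs are assumed sorted by time.

```python
# Minimal sketch of streaming evaluation: pair each query time with the
# latest prediction that finished before it, then average accuracy.
def streaming_pairs(outputs, query_times):
    """outputs: list of (finish_time, prediction), sorted by finish_time.
    query_times must be sorted ascending."""
    pairs, k = [], -1
    for t in query_times:
        while k + 1 < len(outputs) and outputs[k + 1][0] <= t:
            k += 1  # advance to the newest output already available at t
        pairs.append(None if k < 0 else outputs[k][1])
    return pairs

def streaming_score(outputs, queries, accuracy_fn):
    """queries: list of (time, ground_truth). Averages accuracy of the
    stale-but-available prediction at every query instant."""
    preds = streaming_pairs(outputs, [t for t, _ in queries])
    return sum(accuracy_fn(p, gt)
               for p, (_, gt) in zip(preds, queries)) / len(queries)
```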
arXiv Detail & Related papers (2020-05-21T01:51:35Z) - Cascaded Deep Video Deblurring Using Temporal Sharpness Prior [88.98348546566675]
The proposed algorithm mainly consists of optical flow estimation from intermediate latent frames and latent frame restoration steps.
It first develops a deep CNN model to estimate optical flow from intermediate latent frames and then restores the latent frames based on the estimated optical flow.
We show that exploring the domain knowledge of video deblurring is able to make the deep CNN model more compact and efficient.
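The flow-then-restore loop described above can be sketched as a cascade over latent frames: each stage estimates flow between the current latent frames and uses it to restore sharper ones, which then feed the next stage. The network interfaces below are assumptions for illustration.

```python
# Illustrative cascade for flow-guided video deblurring.
def cascade_deblur(blurry, flow_net, restore_net, stages=3):
    """blurry: list of (B, C, H, W) frames; returns restored latent frames."""
    latent = list(blurry)  # stage 0: latent frames initialized to the input
    for _ in range(stages):
        new_latent = []
        for t in range(len(latent)):
            prev_t = max(t - 1, 0)
            next_t = min(t + 1, len(latent) - 1)
            # Flow from the current latent frame toward its neighbors.
            f_prev = flow_net(latent[t], latent[prev_t])
            f_next = flow_net(latent[t], latent[next_t])
            # Restoration conditions on the frame, neighbors, and flows.
            new_latent.append(restore_net(
                latent[t], latent[prev_t], latent[next_t], f_prev, f_next))
        latent = new_latent  # sharper latent frames feed the next stage
    return latent
```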
arXiv Detail & Related papers (2020-04-06T09:13:49Z) - Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963]
In this work, we perform efficient semantic video segmentation in a per-frame fashion at inference time.
We employ compact models for real-time execution, and design new knowledge distillation methods to narrow the performance gap between compact models and large models, as sketched below.
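A common form of such distillation trains the compact student on ground-truth labels while also matching the large teacher's softened per-pixel predictions. The temperature and weighting recipe below is the standard one, not necessarily the paper's exact formulation.

```python
# Minimal sketch of per-pixel knowledge distillation for segmentation.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Cross-entropy on labels plus KL to the teacher's softened outputs.
    Logits: (B, C, H, W); labels: (B, H, W) with 255 marking ignored pixels."""
    ce = F.cross_entropy(student_logits, labels, ignore_index=255)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # standard temperature scaling of the distillation term
    return alpha * ce + (1.0 - alpha) * kl
```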
arXiv Detail & Related papers (2020-02-26T12:24:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.