StableDPT: Temporal Stable Monocular Video Depth Estimation
- URL: http://arxiv.org/abs/2601.02793v1
- Date: Tue, 06 Jan 2026 08:02:14 GMT
- Title: StableDPT: Temporal Stable Monocular Video Depth Estimation
- Authors: Ivan Sobko, Hayko Riemenschneider, Markus Gross, Christopher Schroers
- Abstract summary: We propose a novel approach that adapts any state-of-the-art image-based (depth) estimation model for video processing. Our architecture builds upon an off-the-shelf Vision Transformer (ViT) encoder and enhances the Dense Prediction Transformer (DPT) head. Evaluations on multiple benchmark datasets demonstrate improved temporal consistency, competitive state-of-the-art performance, and 2x faster processing in real-world scenarios.
- Score: 14.453483279783908
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Applying single image Monocular Depth Estimation (MDE) models to video sequences introduces significant temporal instability and flickering artifacts. We propose a novel approach that adapts any state-of-the-art image-based (depth) estimation model for video processing by integrating a new temporal module, trainable on a single GPU in a few days. Our architecture StableDPT builds upon an off-the-shelf Vision Transformer (ViT) encoder and enhances the Dense Prediction Transformer (DPT) head. The core of our contribution lies in the temporal layers within the head, which use an efficient cross-attention mechanism to integrate information from keyframes sampled across the entire video sequence. This allows the model to capture global context and inter-frame relationships, leading to more accurate and temporally stable depth predictions. Furthermore, we propose a novel inference strategy for processing videos of arbitrary length that avoids the scale misalignment and redundant computations associated with the overlapping windows used in other methods. Evaluations on multiple benchmark datasets demonstrate improved temporal consistency, competitive state-of-the-art performance, and 2x faster processing in real-world scenarios.
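As a hedged illustration of this design (not the authors' released code), a temporal layer of this kind can be written as cross-attention from the current frame's DPT tokens to tokens gathered from sampled keyframes; every name, shape, and the uniform sampling policy below are assumptions:

```python
# Minimal sketch of a keyframe cross-attention temporal layer (assumed
# names/shapes; illustrative only, not StableDPT's actual implementation).
import torch
import torch.nn as nn

class TemporalCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_tokens: torch.Tensor, keyframe_tokens: torch.Tensor):
        # frame_tokens:    (B, N, C) tokens of the frame being decoded
        # keyframe_tokens: (B, K*N, C) tokens from K keyframes of the video
        ctx, _ = self.attn(self.norm(frame_tokens), keyframe_tokens, keyframe_tokens)
        return frame_tokens + ctx  # residual fusion of global video context

def sample_keyframe_indices(num_frames: int, num_keyframes: int) -> list[int]:
    # Uniformly spread keyframes over the whole sequence (assumed policy).
    step = max(num_frames // num_keyframes, 1)
    return list(range(0, num_frames, step))[:num_keyframes]
```

Because the keyframes cover the entire sequence, every frame attends to the same global context, which is what allows a sequential decode to avoid the scale misalignment of overlapping windows.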
Related papers
- Video Depth Propagation [54.523028170425256]
Existing methods rely on simple frame-by-frame monocular models, leading to temporal inconsistencies and inaccuracies. We propose VeloDepth, which effectively leverages an online video pipeline and performs deep feature propagation. Our design structurally enforces temporal consistency, resulting in stable depth predictions across consecutive frames with improved efficiency.
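To make "deep feature propagation" concrete, here is a purely illustrative gated blend between consecutive frame features; the module name and gating design are assumptions, not VeloDepth's actual architecture:

```python
# Hypothetical gated propagation of decoder features from frame t-1 to t.
import torch
import torch.nn as nn

class FeaturePropagation(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # 1x1 conv predicts a per-pixel blending weight in [0, 1]
        self.gate = nn.Sequential(nn.Conv2d(2 * dim, dim, kernel_size=1), nn.Sigmoid())

    def forward(self, feat_t: torch.Tensor, feat_prev: torch.Tensor):
        # feat_t, feat_prev: (B, C, H, W) features of frames t and t-1
        g = self.gate(torch.cat([feat_t, feat_prev], dim=1))
        return g * feat_t + (1 - g) * feat_prev  # temporally blended features
```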
arXiv Detail & Related papers (2025-12-11T15:08:37Z)
- End-to-End Multi-Person Pose Estimation with Pose-Aware Video Transformer [7.19764062839405]
We present a fully end-to-end framework for multi-person 2D pose estimation in videos. A key challenge is to associate individuals across frames under complex and overlapping temporal trajectories. We introduce a novel Pose-Aware Video Transformer Network (PAVE-Net), which features a spatial encoder to model intra-frame relations and a temporal decoder to model pose dependencies across frames.
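The encoder/decoder split can be sketched as follows; this is an assumed skeleton for illustration, not PAVE-Net's published implementation:

```python
# Illustrative spatial-encoder / temporal-decoder skeleton (assumed names).
import torch
import torch.nn as nn

class SpatialTemporalPose(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        dec = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.spatial_encoder = nn.TransformerEncoder(enc, num_layers=2)
        self.temporal_decoder = nn.TransformerDecoder(dec, num_layers=2)

    def forward(self, frame_tokens: torch.Tensor, pose_queries: torch.Tensor):
        # frame_tokens: (T, N, C) tokens per frame of one clip
        # pose_queries: (P, C) learned queries, one per person track
        mem = self.spatial_encoder(frame_tokens)   # intra-frame relations
        mem = mem.reshape(1, -1, mem.shape[-1])    # flatten time: (1, T*N, C)
        out = self.temporal_decoder(pose_queries.unsqueeze(0), mem)
        return out  # (1, P, C) per-track features for downstream pose heads
```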
arXiv Detail & Related papers (2025-11-17T10:19:35Z)
- MiVID: Multi-Strategic Self-Supervision for Video Frame Interpolation using Diffusion Model [2.9795035162522194]
This article introduces MiVID, a lightweight, self-supervised, diffusion-based framework for video frame interpolation. Our model eliminates the need for explicit motion estimation by combining a 3D U-Net backbone with transformer-style temporal attention. We show that MiVID achieves results competitive with several supervised baselines in just 50 epochs.
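A minimal sketch of the self-supervised signal this setting suggests: hold out the middle frame of a triplet and reconstruct it from its neighbors (the model interface below is assumed):

```python
# Hold-out-the-middle-frame objective for self-supervised interpolation.
import torch
import torch.nn.functional as F

def interpolation_loss(model, f0: torch.Tensor, f1: torch.Tensor, f2: torch.Tensor):
    # f0, f2: context frames (B, C, H, W); f1: held-out middle frame
    pred = model(f0, f2)        # model synthesizes the in-between frame
    return F.l1_loss(pred, f1)  # photometric reconstruction loss
```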
arXiv Detail & Related papers (2025-11-08T14:10:04Z)
- ResidualViT for Efficient Temporally Dense Video Encoding [66.57779133786131]
We make three contributions to reduce the cost of computing features for temporally dense tasks. First, we introduce a vision transformer (ViT) architecture, dubbed ResidualViT, that leverages the large temporal redundancy in videos. Second, we propose a lightweight distillation strategy to approximate the frame-level features of the original foundation model.
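One simple way to exploit temporal redundancy, sketched here with assumed names and thresholds rather than ResidualViT's actual mechanism, is to recompute only the patch tokens whose content changed since the previous frame and reuse the rest:

```python
# Select patches worth recomputing based on per-patch frame difference.
import torch
import torch.nn.functional as F

def select_changed_patches(frame_t, frame_prev, patch: int = 16, thresh: float = 0.05):
    # frame_*: (B, C, H, W) in [0, 1]; returns (B, num_patches) bool mask
    diff = (frame_t - frame_prev).abs().mean(dim=1, keepdim=True)
    per_patch = F.avg_pool2d(diff, kernel_size=patch)  # mean change per patch
    return per_patch.flatten(1) > thresh               # True -> recompute token
```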
arXiv Detail & Related papers (2025-09-16T17:12:23Z)
- Buffer Anytime: Zero-Shot Video Depth and Normal from Image Priors [54.8852848659663]
Buffer Anytime is a framework for estimation of depth and normal maps (which we call geometric buffers) from video.
We demonstrate high-quality video buffer estimation by leveraging single-image priors with temporal consistency constraints.
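A common way to express such a temporal consistency constraint, shown here as a generic sketch rather than Buffer Anytime's exact loss, is to penalize differences between a prediction and its flow-warped neighbor (the optical flow is assumed to be given):

```python
# Flow-warped temporal consistency loss between consecutive depth maps.
import torch
import torch.nn.functional as F

def temporal_consistency_loss(depth_t, depth_prev, flow):
    # depth_*: (B, 1, H, W); flow: (B, 2, H, W), backward flow from t to t-1
    B, _, H, W = depth_t.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=flow.device, dtype=flow.dtype),
        torch.arange(W, device=flow.device, dtype=flow.dtype),
        indexing="ij")
    grid = torch.stack((
        2 * (xs + flow[:, 0]) / (W - 1) - 1,   # x coords, normalized to [-1, 1]
        2 * (ys + flow[:, 1]) / (H - 1) - 1),  # y coords, normalized to [-1, 1]
        dim=-1)                                # (B, H, W, 2) sampling grid
    warped = F.grid_sample(depth_prev, grid, align_corners=True)
    return F.l1_loss(depth_t, warped)
```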
arXiv Detail & Related papers (2024-11-26T09:28:32Z)
- D-NPC: Dynamic Neural Point Clouds for Non-Rigid View Synthesis from Monocular Video [53.83936023443193]
This paper contributes to the field by introducing a new method for dynamic novel view synthesis from monocular video, such as smartphone captures. Our approach represents the scene as a dynamic neural point cloud, an implicit time-conditioned point cloud that encodes local geometry and appearance in separate hash-encoded neural feature grids.
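The "hash-encoded neural feature grid" ingredient can be illustrated with a single-level spatial hash lookup; real systems use multiresolution grids and trilinear blending, and all names below are assumptions:

```python
# Simplified single-level hash-grid feature lookup (nearest voxel only).
import torch
import torch.nn as nn

class HashGrid3D(nn.Module):
    def __init__(self, table_size: int = 2**16, feat_dim: int = 4, resolution: int = 64):
        super().__init__()
        self.table = nn.Parameter(torch.randn(table_size, feat_dim) * 1e-2)
        self.res = resolution
        # large primes commonly used for spatial hashing
        self.register_buffer("primes", torch.tensor([1, 2654435761, 805459861]))

    def forward(self, xyz: torch.Tensor):
        # xyz: (N, 3) points in [0, 1]^3 -> (N, feat_dim) learned features
        idx = (xyz.clamp(0, 1) * (self.res - 1)).long()
        h = (idx * self.primes).sum(-1) % self.table.shape[0]
        return self.table[h]
```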
arXiv Detail & Related papers (2024-06-14T14:35:44Z)
- Patch Spatio-Temporal Relation Prediction for Video Anomaly Detection [19.643936110623653]
Video Anomaly Detection (VAD) aims to identify abnormalities within a specific context and timeframe.
Recent deep learning-based VAD models have shown promising results by generating high-resolution frames.
We propose a self-supervised learning approach for VAD through an inter-patch relationship prediction task.
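Such an inter-patch pretext task can be sketched as classifying the spatio-temporal relation between two patch embeddings; the head below is a minimal illustration with assumed names, not the paper's exact design:

```python
# Relation-prediction head over pairs of patch embeddings.
import torch
import torch.nn as nn

class PatchRelationHead(nn.Module):
    def __init__(self, dim: int, num_relations: int):
        super().__init__()
        # num_relations could enumerate spatial/temporal offsets (assumption)
        self.classifier = nn.Linear(2 * dim, num_relations)

    def forward(self, emb_a: torch.Tensor, emb_b: torch.Tensor):
        # emb_a, emb_b: (B, C) embeddings of two patches from the same clip
        return self.classifier(torch.cat([emb_a, emb_b], dim=1))
```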
arXiv Detail & Related papers (2024-03-28T03:07:16Z)
- UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for Temporal Forgery Localization [16.963092523737593]
We propose a novel framework for temporal forgery localization (TFL) that predicts forgery segments with multimodal adaptation.
Our approach achieves state-of-the-art performance on benchmark datasets, including Lav-DF, TVIL, and Psynd.
arXiv Detail & Related papers (2023-08-28T08:20:30Z)
- STDepthFormer: Predicting Spatio-temporal Depth from Video with a Self-supervised Transformer Model [0.0]
A self-supervised model that simultaneously predicts a sequence of future frames from video input with a spatio-temporal attention network is proposed.
The proposed model leverages prior scene knowledge such as object shape and texture similar to single-image depth inference methods.
It is implicitly capable of forecasting the motion of objects in the scene, rather than requiring complex models involving multi-object detection, segmentation and tracking.
arXiv Detail & Related papers (2023-03-02T12:22:51Z)
- Fast Non-Rigid Radiance Fields from Monocularized Data [66.74229489512683]
This paper proposes a new method for full 360° inward-facing novel view synthesis of non-rigidly deforming scenes.
At the core of our method are 1) An efficient deformation module that decouples the processing of spatial and temporal information for accelerated training and inference; and 2) A static module representing the canonical scene as a fast hash-encoded neural radiance field.
In both cases, our method is significantly faster than previous methods, converging in less than 7 minutes and achieving real-time framerates at 1K resolution, while obtaining a higher visual accuracy for generated novel views.
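The decoupling in (1) can be illustrated by a small module where a cheap per-time embedding conditions a spatial deformation MLP; this is a hedged sketch with assumed dimensions, not the paper's implementation:

```python
# Deformation module that processes time and space in separate branches.
import torch
import torch.nn as nn

class DecoupledDeformation(nn.Module):
    def __init__(self, t_dim: int = 16, hidden: int = 64):
        super().__init__()
        self.temporal = nn.Sequential(nn.Linear(1, t_dim), nn.ReLU())
        self.spatial = nn.Sequential(
            nn.Linear(3 + t_dim, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, xyz: torch.Tensor, t: torch.Tensor):
        # xyz: (N, 3) sample points; t: (N, 1) normalized timestamps
        code = self.temporal(t)                        # cheap time embedding
        offset = self.spatial(torch.cat([xyz, code], dim=1))
        return xyz + offset  # points bent into the canonical (static) scene
```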
arXiv Detail & Related papers (2022-12-02T18:51:10Z)
- Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963]
In this work, we tackle efficient semantic video segmentation in a per-frame fashion during inference.
We employ compact models for real-time execution. To narrow the performance gap between compact models and large models, new knowledge distillation methods are designed.
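A minimal sketch of per-pixel knowledge distillation for segmentation, assuming the classic softened-logits formulation rather than the paper's specific distillation methods:

```python
# Student mimics the teacher's soft per-pixel class distribution.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T: float = 4.0):
    # *_logits: (B, num_classes, H, W); T softens the distributions
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    # T^2 keeps gradient magnitudes comparable across temperatures
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```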
arXiv Detail & Related papers (2020-02-26T12:24:32Z)