Uncertainty-Guided Inference-Time Depth Adaptation for Transformer-Based Visual Tracking
- URL: http://arxiv.org/abs/2602.16160v2
- Date: Thu, 19 Feb 2026 22:36:27 GMT
- Title: Uncertainty-Guided Inference-Time Depth Adaptation for Transformer-Based Visual Tracking
- Authors: Patrick Poggi, Divake Kumar, Theja Tulabandhula, Amit Ranjan Trivedi
- Abstract summary: Transformer-based single-object trackers achieve state-of-the-art accuracy but rely on fixed-depth inference. We propose UncL-STARK, an architecture-preserving approach that enables dynamic, uncertainty-aware depth adaptation. Experiments on GOT-10k and LaSOT demonstrate up to 12% GFLOPs reduction, 8.9% latency reduction, and 10.8% energy savings.
- Score: 6.901398609610159
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based single-object trackers achieve state-of-the-art accuracy but rely on fixed-depth inference, executing the full encoder--decoder stack for every frame regardless of visual complexity, thereby incurring unnecessary computational cost in long video sequences dominated by temporally coherent frames. We propose UncL-STARK, an architecture-preserving approach that enables dynamic, uncertainty-aware depth adaptation in transformer-based trackers without modifying the underlying network or adding auxiliary heads. The model is fine-tuned to retain predictive robustness at multiple intermediate depths using random-depth training with knowledge distillation, thus enabling safe inference-time truncation. At runtime, we derive a lightweight uncertainty estimate directly from the model's corner localization heatmaps and use it in a feedback-driven policy that selects the encoder and decoder depth for the next frame based on the prediction confidence by exploiting temporal coherence in video. Extensive experiments on GOT-10k and LaSOT demonstrate up to 12% GFLOPs reduction, 8.9% latency reduction, and 10.8% energy savings while maintaining tracking accuracy within 0.2% of the full-depth baseline across both short-term and long-term sequences.
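The feedback-driven policy described in the abstract lends itself to a compact illustration. The sketch below is a minimal interpretation, assuming heatmap entropy as the uncertainty signal; the candidate depths, thresholds, and function names are illustrative assumptions rather than the authors' exact formulation.

```python
# Minimal sketch of uncertainty-guided depth selection (illustrative only).
import torch

def heatmap_entropy(corner_heatmap: torch.Tensor) -> float:
    """Shannon entropy of a corner-localization heatmap (H x W).
    A flatter heatmap yields higher entropy, i.e. a less confident prediction."""
    p = corner_heatmap.flatten().softmax(dim=0)
    return float(-(p * (p + 1e-12).log()).sum())

def select_depth(prev_uncertainty: float,
                 depths=(3, 6, 12),          # hypothetical truncation points
                 thresholds=(2.0, 4.0)) -> int:
    """Pick the encoder/decoder depth for the NEXT frame from the previous
    frame's uncertainty, exploiting temporal coherence: confident frames run
    a truncated stack, uncertain frames fall back to full depth."""
    if prev_uncertainty < thresholds[0]:
        return depths[0]
    if prev_uncertainty < thresholds[1]:
        return depths[1]
    return depths[2]
```

In such a loop, the tracker would execute only the first `select_depth(u_prev)` layers on the current frame, compute `u = heatmap_entropy(...)` from the resulting corner heatmaps, and carry that value forward as `u_prev` for the next frame.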
Related papers
- StableDPT: Temporal Stable Monocular Video Depth Estimation [14.453483279783908]
We propose a novel approach that adapts any state-of-the-art image-based depth estimation model for video processing. Our architecture builds upon an off-the-shelf Vision Transformer (ViT) encoder and enhances the Dense Prediction Transformer (DPT) head. Evaluations on multiple benchmark datasets demonstrate improved temporal consistency, competitive state-of-the-art performance, and, on top of that, 2x faster processing in real-world scenarios.
arXiv Detail & Related papers (2026-01-06T08:02:14Z) - FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing [97.35186681023025]
We introduce FFP-300K, a new large-scale dataset of high-fidelity video pairs at 720p resolution and 81 frames in length. We propose a novel framework designed for true guidance-free FFP that resolves the tension between maintaining first-frame appearance and preserving source video motion.
arXiv Detail & Related papers (2026-01-05T01:46:22Z) - Video Depth Propagation [54.523028170425256]
Existing methods rely on simple frame-by-frame monocular models, leading to temporal inconsistencies and inaccuracies. We propose VeloDepth, which effectively leverages an online video pipeline and performs deep feature propagation. Our design structurally enforces temporal consistency, resulting in stable depth predictions across consecutive frames with improved efficiency.
arXiv Detail & Related papers (2025-12-11T15:08:37Z) - Breaking Through the Spike: Spike Window Decoding for Accelerated and Precise Automatic Speech Recognition [26.665132884613477]
The Spike Window Decoding algorithm greatly improves inference speed by making the number of frames decoded in WFST linearly related to the number of spiking frames in the CTC output. Our method achieves SOTA recognition accuracy while significantly accelerating decoding, as demonstrated on both the AISHELL-1 and large-scale in-house datasets.
arXiv Detail & Related papers (2025-01-01T12:20:07Z) - Enhanced Encoder-Decoder Architecture for Accurate Monocular Depth Estimation [0.0]
This paper introduces a novel deep learning-based approach using an enhanced encoder-decoder architecture. It incorporates multi-scale feature extraction to enhance depth prediction accuracy across various object sizes and distances. Experimental results on the KITTI dataset show that our model achieves a significantly faster inference time of 0.019 seconds.
arXiv Detail & Related papers (2024-10-15T13:46:19Z) - Efficient Decoder-free Object Detection with Transformers [75.00499377197475]
Vision transformers (ViTs) are changing the landscape of object detection approaches.
We propose a decoder-free fully transformer-based (DFFT) object detector.
DFFT_SMALL achieves high efficiency in both training and inference stages.
arXiv Detail & Related papers (2022-06-14T13:22:19Z) - Convex Hull Prediction for Adaptive Video Streaming by Recurrent Learning [38.574550778712236]
We propose a deep learning based method of content aware convex hull prediction.
We employ a recurrent convolutional network (RCN) to implicitly analyze the complexity of video shots in order to predict their convex hulls.
Our proposed model provides better approximations of the optimal convex hulls and offers competitive time savings compared to existing approaches.
arXiv Detail & Related papers (2022-06-10T05:11:02Z) - Layer Pruning on Demand with Intermediate CTC [50.509073206630994]
We present a training and pruning method for ASR based on connectionist temporal classification (CTC).
We show that a Transformer-CTC model can be pruned to various depths on demand, improving the real-time factor from 0.005 to 0.002 on GPU.
arXiv Detail & Related papers (2021-06-17T02:40:18Z) - Intrinsic Temporal Regularization for High-resolution Human Video Synthesis [59.54483950973432]
Temporal consistency is crucial for extending image processing pipelines to the video domain.
We propose an effective intrinsic temporal regularization scheme, where an intrinsic confidence map is estimated via the frame generator to regulate motion estimation.
We apply our intrinsic temporal regularization to a single-image generator, leading to a powerful "INTERnet" capable of generating $512\times512$ resolution human action videos.
arXiv Detail & Related papers (2020-12-11T05:29:45Z) - Towards Streaming Perception [70.68520310095155]
We present an approach that coherently integrates latency and accuracy into a single metric for real-time online perception.
The key insight behind this metric is to jointly evaluate the output of the entire perception stack at every time instant.
We focus on the illustrative tasks of object detection and instance segmentation in urban video streams, and contribute a novel dataset with high-quality and temporally-dense annotations.
arXiv Detail & Related papers (2020-05-21T01:51:35Z)
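As a rough illustration of the streaming-evaluation idea in the last entry (not the paper's implementation), the sketch below matches, at each query timestamp, the most recent prediction that has already finished processing against that instant's ground truth, so latency directly lowers the measured accuracy; `pred_stream`, `gt_stream`, and `match_fn` are assumed placeholders.

```python
# Toy sketch of streaming evaluation: a prediction only counts at query time t
# if its processing finished by t, so slower models are penalized automatically.
def latest_finished(pred_stream, t_query):
    """pred_stream: list of (t_finished, prediction), sorted by t_finished."""
    latest = None
    for t_done, pred in pred_stream:
        if t_done > t_query:
            break
        latest = pred
    return latest  # None until the first prediction completes

def streaming_score(pred_stream, gt_stream, match_fn):
    """gt_stream: list of (t_query, ground_truth); match_fn scores pred vs. gt."""
    scores = [match_fn(latest_finished(pred_stream, t), gt) for t, gt in gt_stream]
    return sum(scores) / len(scores)
```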