Related papers: Exploiting Lightweight Hierarchical ViT and Dynamic Framework for Efficient Visual Tracking

URL: http://arxiv.org/abs/2506.20381v1
Date: Wed, 25 Jun 2025 12:46:46 GMT
Title: Exploiting Lightweight Hierarchical ViT and Dynamic Framework for Efficient Visual Tracking
Authors: Ben Kang, Xin Chen, Jie Zhao, Chunjuan Bo, Dong Wang, Huchuan Lu
Abstract summary: Transformer-based visual trackers have demonstrated significant advancements due to their powerful modeling capabilities. However, their practicality is limited on resource-constrained devices because of their slow processing speeds. We present HiT, a novel family of efficient tracking models that achieve high performance while maintaining fast operation across various devices.
Score: 49.07982079554859
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformer-based visual trackers have demonstrated significant advancements due to their powerful modeling capabilities. However, their practicality is limited on resource-constrained devices because of their slow processing speeds. To address this challenge, we present HiT, a novel family of efficient tracking models that achieve high performance while maintaining fast operation across various devices. The core innovation of HiT lies in its Bridge Module, which connects lightweight transformers to the tracking framework, enhancing feature representation quality. Additionally, we introduce a dual-image position encoding approach to effectively encode spatial information. HiT achieves an impressive speed of 61 frames per second (fps) on the NVIDIA Jetson AGX platform, alongside a competitive AUC of 64.6% on the LaSOT benchmark, outperforming all previous efficient trackers. Building on HiT, we propose DyHiT, an efficient dynamic tracker that flexibly adapts to scene complexity by selecting routes with varying computational requirements. DyHiT uses search area features extracted by the backbone network and inputs them into an efficient dynamic router to classify tracking scenarios. Based on the classification, DyHiT applies a divide-and-conquer strategy, selecting appropriate routes to achieve a superior trade-off between accuracy and speed. The fastest version of DyHiT achieves 111 fps on NVIDIA Jetson AGX while maintaining an AUC of 62.4% on LaSOT. Furthermore, we introduce a training-free acceleration method based on the dynamic routing architecture of DyHiT. This method significantly improves the execution speed of various high-performance trackers without sacrificing accuracy. For instance, our acceleration method enables the state-of-the-art tracker SeqTrack-B256 to achieve a 2.68 times speedup on an NVIDIA GeForce RTX 2080 Ti GPU while maintaining the same AUC of 69.9% on LaSOT.
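The routing idea in the abstract (classify the scene from search-area features, then pick a route with matching computational cost) can be illustrated with a toy sketch. This is not the DyHiT implementation: the abstract does not specify the router design, so the scoring rule, the `Route` type, the threshold values, and the names `router_score` and `select_route` are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Features = Sequence[float]
Box = Tuple[float, float, float, float]


@dataclass
class Route:
    name: str
    cost: float  # relative compute cost; illustrative only
    predict: Callable[[Features], Box]


def router_score(features: Features) -> float:
    """Toy difficulty score in [0, 1]: mean absolute feature value, clipped.

    Assumption: a real router would be a small learned classifier.
    """
    if not features:
        return 1.0  # no evidence, so treat the scene as hard
    mean_abs = sum(abs(x) for x in features) / len(features)
    return min(1.0, mean_abs)


def select_route(features: Features,
                 routes: Sequence[Route],
                 thresholds: Sequence[float]) -> Route:
    """Pick the cheapest route whose threshold covers the difficulty score.

    `routes` is ordered cheapest first; `thresholds` holds ascending cut
    points, one fewer than the number of routes. The last route is the
    fallback for the hardest scenes.
    """
    score = router_score(features)
    for route, cut in zip(routes, thresholds):
        if score <= cut:
            return route
    return routes[-1]


# Hypothetical routes: an early-exit head and the full tracking path.
ROUTES = [
    Route("fast", cost=1.0, predict=lambda f: (0.0, 0.0, 1.0, 1.0)),
    Route("full", cost=3.0, predict=lambda f: (0.0, 0.0, 1.0, 1.0)),
]
THRESHOLDS = [0.3]

easy = select_route([0.1, 0.1], ROUTES, THRESHOLDS)
hard = select_route([0.9, 0.8], ROUTES, THRESHOLDS)
print(easy.name, hard.name)
```

The design point the sketch tries to capture is that the routing decision itself must be cheap, since it runs on every frame before any route-specific computation.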

Related papers

Similarity-Guided Layer-Adaptive Vision Transformer for UAV Tracking [11.602484345266484]
Vision transformers (ViTs) have emerged as a popular backbone for visual tracking. ViTs are too cumbersome to deploy for unmanned aerial vehicle (UAV) tracking. We propose a similarity-guided layer adaptation approach to optimize the structure of ViTs.
arXiv Detail & Related papers (2025-03-09T14:02:30Z)
Two-stream Beats One-stream: Asymmetric Siamese Network for Efficient Visual Tracking [54.124445709376154]
We propose a novel asymmetric Siamese tracker named AsymTrack for efficient tracking. Building on this architecture, we devise an efficient template modulation mechanism to inject crucial cues into the search features. Experiments demonstrate that AsymTrack offers superior speed-precision trade-offs across different platforms.
arXiv Detail & Related papers (2025-03-01T14:44:54Z)
Temporal Correlation Meets Embedding: Towards a 2nd Generation of JDE-based Real-Time Multi-Object Tracking [52.04679257903805]
Joint Detection and Embedding (JDE) trackers have demonstrated excellent performance in Multi-Object Tracking (MOT) tasks. Our tracker, named TCBTrack, achieves state-of-the-art performance on multiple public benchmarks.
arXiv Detail & Related papers (2024-07-19T07:48:45Z)
Exploring Dynamic Transformer for Efficient Object Tracking [58.120191254379854]
We propose DyTrack, a dynamic transformer framework for efficient tracking. DyTrack automatically learns to configure proper reasoning routes for various inputs, gaining better utilization of the available computational budget. Experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model.
arXiv Detail & Related papers (2024-03-26T12:31:58Z)
LiteTrack: Layer Pruning with Asynchronous Feature Extraction for Lightweight and Efficient Visual Tracking [4.179339279095506]
LiteTrack is an efficient transformer-based tracking model optimized for high-speed operations across various devices. It achieves a more favorable trade-off between accuracy and efficiency than the other lightweight trackers. LiteTrack-B9 reaches competitive 72.2% AO on GOT-10k and 82.4% AUC on TrackingNet, and operates at 171 fps on an NVIDIA 2080Ti GPU.
arXiv Detail & Related papers (2023-09-17T12:01:03Z)
Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking [69.89887818921825]
HiT is a new family of efficient tracking models that can run at high speed on different devices. HiT achieves 64.6% AUC on the LaSOT benchmark, surpassing all previous efficient trackers.
arXiv Detail & Related papers (2023-08-14T02:51:34Z)
Efficient Visual Tracking via Hierarchical Cross-Attention Transformer [82.92565582642847]
We present an efficient tracking method via a hierarchical cross-attention transformer named HCAT. Our model runs at about 195 fps on GPU, 45 fps on CPU, and 55 fps on the NVIDIA Jetson AGX Xavier edge AI platform.
arXiv Detail & Related papers (2022-03-25T09:45:27Z)