FARTrack: Fast Autoregressive Visual Tracking with High Performance
- URL: http://arxiv.org/abs/2602.03214v1
- Date: Tue, 03 Feb 2026 07:29:36 GMT
- Title: FARTrack: Fast Autoregressive Visual Tracking with High Performance
- Authors: Guijie Wang, Tong Lin, Yifan Bai, Anjia Cao, Shiyi Liang, Wangbo Zhao, Xing Wei
- Abstract summary: FARTrack is a Fast Auto-Regressive Tracking framework. It delivers an AO of 70.6% on GOT-10k in real time. Our fastest model achieves a speed of 343 FPS on the GPU and 121 FPS on the CPU.
- Score: 17.53171333786429
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inference speed and tracking performance are two critical evaluation metrics in visual tracking. However, high-performance trackers often suffer from slow processing speeds, making them impractical for deployment on resource-constrained devices. To alleviate this issue, we propose FARTrack, a Fast Auto-Regressive Tracking framework. Because autoregression exploits the temporal structure of the trajectory sequence, it can maintain high performance while executing efficiently across various devices. FARTrack introduces Task-Specific Self-Distillation and Inter-frame Autoregressive Sparsification, designed from the perspectives of shallow-yet-accurate distillation and redundant-to-essential token optimization, respectively. Task-Specific Self-Distillation compresses the model by distilling task-specific tokens layer by layer, improving inference speed while avoiding suboptimal manual teacher-student layer-pair assignments. Meanwhile, Inter-frame Autoregressive Sparsification sequentially condenses multiple templates, avoiding additional runtime overhead while learning a temporally global optimal sparsification strategy. FARTrack demonstrates outstanding speed and competitive performance: it delivers an AO of 70.6% on GOT-10k in real time, and our fastest model reaches 343 FPS on the GPU and 121 FPS on the CPU.
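To make the two components more concrete, here is a minimal PyTorch-style sketch of how they might be realized. It is an illustration under stated assumptions (token shapes, MSE as the distillation objective, score-based top-k selection with a fixed token budget); all function and variable names are hypothetical, and none of this is the authors' released code.

```python
# Hypothetical sketch of the two FARTrack components, loosely following the
# abstract's description. Shapes, losses, and selection rules are
# assumptions, not the authors' actual implementation.
import torch
import torch.nn.functional as F

def self_distillation_loss(student_task_tokens, teacher_final_token):
    """Task-Specific Self-Distillation (sketch): every layer of a shallow
    student matches its task-specific token against the deep teacher's final
    task token, so no manual teacher-student layer pairing is needed.

    student_task_tokens: list of [B, D] tokens, one per student layer
    teacher_final_token: [B, D] token from the teacher's last layer
    """
    target = teacher_final_token.detach()  # no gradients into the teacher
    return torch.stack(
        [F.mse_loss(tok, target) for tok in student_task_tokens]
    ).mean()

def autoregressive_sparsify(kept_tokens, kept_scores, new_tokens, new_scores, budget):
    """Inter-frame Autoregressive Sparsification (sketch): merge the template
    tokens kept so far with the new frame's template tokens, then retain only
    the top-`budget` tokens by importance score, so the template memory stays
    a fixed size as frames arrive.

    kept_tokens / new_tokens: [B, N, D] token tensors
    kept_scores / new_scores: [B, N] importance scores (assumed learned)
    """
    tokens = torch.cat([kept_tokens, new_tokens], dim=1)   # [B, N_all, D]
    scores = torch.cat([kept_scores, new_scores], dim=1)   # [B, N_all]
    top = scores.topk(budget, dim=1).indices               # [B, budget]
    idx = top.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    return tokens.gather(1, idx), scores.gather(1, top)
```

In a tracking loop, the sparsification step would run once per new template frame, so the cost of carrying multiple templates stays constant rather than growing with sequence length.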
Related papers
- Track-On2: Enhancing Online Point Tracking with Memory [57.820749134569574]
We extend our prior model Track-On into Track-On2, a simple and efficient transformer-based model for online long-term tracking. Track-On2 improves both performance and efficiency through architectural refinements, more effective use of memory, and improved synthetic training strategies.
arXiv Detail & Related papers (2025-09-23T15:00:18Z)
- Exploiting Lightweight Hierarchical ViT and Dynamic Framework for Efficient Visual Tracking [49.07982079554859]
Transformer-based visual trackers have demonstrated significant advancements due to their powerful modeling capabilities. However, their practicality is limited on resource-constrained devices because of their slow processing speeds. We present HiT, a novel family of efficient tracking models that achieve high performance while maintaining fast operation across various devices.
arXiv Detail & Related papers (2025-06-25T12:46:46Z)
- Towards Low-Latency Event Stream-based Visual Object Tracking: A Slow-Fast Approach [32.91982063297922]
We propose SFTrack, a novel Slow-Fast Tracking paradigm that flexibly adapts to different operational requirements. The framework supports two complementary modes: a high-precision slow tracker for scenarios with sufficient computational resources, and an efficient fast tracker tailored for latency-aware, resource-constrained environments. It first performs graph-based representation learning from high-temporal-resolution event streams, and then integrates the learned graph-structured information into two FlashAttention-based vision backbones.
arXiv Detail & Related papers (2025-05-19T09:37:23Z)
- Learning Occlusion-Robust Vision Transformers for Real-Time UAV Tracking [11.146155422858824]
Single-stream architectures using Vision Transformer (ViT) backbones show great potential for real-time UAV tracking. We propose to learn Occlusion-Robust Representations (ORR) based on ViTs for UAV tracking. We also propose an Adaptive Feature-Based Knowledge Distillation (AFKD) method to create a more compact tracker.
arXiv Detail & Related papers (2025-04-12T14:06:50Z)
- Online Dense Point Tracking with Streaming Memory [54.22820729477756]
Dense point tracking is a challenging task requiring the continuous tracking of every point in the initial frame throughout a substantial portion of a video. Recent point tracking algorithms usually depend on sliding windows for indirect information propagation from the first frame to the current one. We present a lightweight and fast model with Streaming memory for dense POint Tracking and online video processing.
arXiv Detail & Related papers (2025-03-09T06:16:49Z)
- Temporal Correlation Meets Embedding: Towards a 2nd Generation of JDE-based Real-Time Multi-Object Tracking [52.04679257903805]
Joint Detection and Embedding (JDE) trackers have demonstrated excellent performance in Multi-Object Tracking (MOT) tasks.
Our tracker, named TCBTrack, achieves state-of-the-art performance on multiple public benchmarks.
arXiv Detail & Related papers (2024-07-19T07:48:45Z)
- Learning Motion Blur Robust Vision Transformers for Real-Time UAV Tracking [14.382072224997074]
Unmanned aerial vehicle (UAV) tracking is critical for applications like surveillance, search-and-rescue, and autonomous navigation. The high-speed movement of UAVs and targets introduces unique challenges, including real-time processing demands and severe motion blur. We propose an adaptive computation framework that dynamically exits Transformer blocks for real-time UAV tracking.
arXiv Detail & Related papers (2024-07-07T14:10:04Z)
- Exploring Dynamic Transformer for Efficient Object Tracking [58.120191254379854]
We propose DyTrack, a dynamic transformer framework for efficient tracking. DyTrack automatically learns to configure proper reasoning routes for various inputs, making better use of the available computational budget. Experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model.
arXiv Detail & Related papers (2024-03-26T12:31:58Z)
- DropMAE: Learning Representations via Masked Autoencoders with Spatial-Attention Dropout for Temporal Matching Tasks [77.84636815364905]
This paper studies masked autoencoder (MAE) video pre-training for various temporal matching-based downstream tasks. We propose DropMAE, which adaptively performs spatial-attention dropout during frame reconstruction to facilitate temporal correspondence learning in videos.
arXiv Detail & Related papers (2023-04-02T16:40:42Z)