FARTrack: Fast Autoregressive Visual Tracking with High Performance
- URL: http://arxiv.org/abs/2602.03214v1
- Date: Tue, 03 Feb 2026 07:29:36 GMT
- Title: FARTrack: Fast Autoregressive Visual Tracking with High Performance
- Authors: Guijie Wang, Tong Lin, Yifan Bai, Anjia Cao, Shiyi Liang, Wangbo Zhao, Xing Wei
- Abstract summary: FARTrack is a Fast Auto-Regressive Tracking framework. It delivers an AO of 70.6% on GOT-10k in real time. Our fastest model achieves a speed of 343 FPS on the GPU and 121 FPS on the CPU.
- Score: 17.53171333786429
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inference speed and tracking performance are two critical evaluation metrics in visual tracking. However, high-performance trackers often suffer from slow processing speeds, making them impractical for deployment on resource-constrained devices. To alleviate this issue, we propose FARTrack, a Fast Auto-Regressive Tracking framework. Because autoregression exploits the temporal structure of the trajectory sequence, it can maintain high performance while executing efficiently across various devices. FARTrack introduces Task-Specific Self-Distillation and Inter-frame Autoregressive Sparsification, designed from the perspectives of shallow-yet-accurate distillation and redundant-to-essential token optimization, respectively. Task-Specific Self-Distillation compresses the model by distilling task-specific tokens layer by layer, improving inference speed while avoiding suboptimal manual teacher-student layer-pair assignments. Meanwhile, Inter-frame Autoregressive Sparsification sequentially condenses multiple templates, avoiding additional runtime overhead while learning a temporally global optimal sparsification strategy. FARTrack demonstrates outstanding speed and competitive performance: it delivers an AO of 70.6% on GOT-10k in real time, and our fastest model reaches 343 FPS on the GPU and 121 FPS on the CPU.
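To make the two components more concrete, here is a minimal PyTorch-style sketch of how they might be realized. It is an illustration under stated assumptions (token shapes, MSE as the distillation objective, score-based top-k selection with a fixed token budget); all function and variable names are hypothetical, and none of this is the authors' released code.

```python
# Hypothetical sketch of the two FARTrack components, loosely following the
# abstract's description. Shapes, losses, and selection rules are
# assumptions, not the authors' actual implementation.
import torch
import torch.nn.functional as F

def self_distillation_loss(student_task_tokens, teacher_final_token):
    """Task-Specific Self-Distillation (sketch): every layer of a shallow
    student matches its task-specific token against the deep teacher's final
    task token, so no manual teacher-student layer pairing is needed.

    student_task_tokens: list of [B, D] tokens, one per student layer
    teacher_final_token: [B, D] token from the teacher's last layer
    """
    target = teacher_final_token.detach()  # no gradients into the teacher
    return torch.stack(
        [F.mse_loss(tok, target) for tok in student_task_tokens]
    ).mean()

def autoregressive_sparsify(kept_tokens, kept_scores, new_tokens, new_scores, budget):
    """Inter-frame Autoregressive Sparsification (sketch): merge the template
    tokens kept so far with the new frame's template tokens, then retain only
    the top-`budget` tokens by importance score, so the template memory stays
    a fixed size as frames arrive.

    kept_tokens / new_tokens: [B, N, D] token tensors
    kept_scores / new_scores: [B, N] importance scores (assumed learned)
    """
    tokens = torch.cat([kept_tokens, new_tokens], dim=1)   # [B, N_all, D]
    scores = torch.cat([kept_scores, new_scores], dim=1)   # [B, N_all]
    top = scores.topk(budget, dim=1).indices               # [B, budget]
    idx = top.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    return tokens.gather(1, idx), scores.gather(1, top)
```

In a tracking loop, the sparsification step would run once per new template frame, so the cost of carrying multiple templates stays constant rather than growing with sequence length.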
Related papers
- Track-On2: Enhancing Online Point Tracking with Memory [57.820749134569574]
We extend our prior model Track-On into Track-On2, a simple and efficient transformer-based model for online long-term tracking. Track-On2 improves both performance and efficiency through architectural refinements, more effective use of memory, and improved synthetic training strategies.
arXiv Detail & Related papers (2025-09-23T15:00:18Z)
- Exploiting Lightweight Hierarchical ViT and Dynamic Framework for Efficient Visual Tracking [49.07982079554859]
Transformer-based visual trackers have demonstrated significant advancements due to their powerful modeling capabilities. However, their practicality is limited on resource-constrained devices because of their slow processing speeds. We present HiT, a novel family of efficient tracking models that achieve high performance while maintaining fast operation across various devices.
arXiv Detail & Related papers (2025-06-25T12:46:46Z)
- Towards Low-Latency Event Stream-based Visual Object Tracking: A Slow-Fast Approach [32.91982063297922]
We propose SFTrack, a novel Slow-Fast Tracking paradigm that flexibly adapts to different operational requirements. The framework supports two complementary modes: a high-precision slow tracker for scenarios with sufficient computational resources, and an efficient fast tracker tailored for latency-aware, resource-constrained environments. It first performs graph-based representation learning from high-temporal-resolution event streams, and then integrates the learned graph-structured information into two FlashAttention-based vision backbones.
arXiv Detail & Related papers (2025-05-19T09:37:23Z)
- Learning Occlusion-Robust Vision Transformers for Real-Time UAV Tracking [11.146155422858824]
Single-stream architectures using Vision Transformer (ViT) backbones show great potential for real-time UAV tracking. We propose to learn Occlusion-Robust Representations (ORR) based on ViTs for UAV tracking. We also propose an Adaptive Feature-Based Knowledge Distillation (AFKD) method to create a more compact tracker.
arXiv Detail & Related papers (2025-04-12T14:06:50Z)
- Online Dense Point Tracking with Streaming Memory [54.22820729477756]
Dense point tracking is a challenging task requiring the continuous tracking of every point in the initial frame throughout a substantial portion of a video. Recent point tracking algorithms usually depend on sliding windows for indirect information propagation from the first frame to the current one. We present a lightweight and fast model with Streaming memory for dense POint Tracking and online video processing.
arXiv Detail & Related papers (2025-03-09T06:16:49Z)
- Temporal Correlation Meets Embedding: Towards a 2nd Generation of JDE-based Real-Time Multi-Object Tracking [52.04679257903805]
Joint Detection and Embedding (JDE) trackers have demonstrated excellent performance in Multi-Object Tracking (MOT) tasks.
Our tracker, named TCBTrack, achieves state-of-the-art performance on multiple public benchmarks.
arXiv Detail & Related papers (2024-07-19T07:48:45Z)
- Learning Motion Blur Robust Vision Transformers for Real-Time UAV Tracking [14.382072224997074]
Unmanned aerial vehicle (UAV) tracking is critical for applications like surveillance, search-and-rescue, and autonomous navigation. The high-speed movement of UAVs and targets introduces unique challenges, including real-time processing demands and severe motion blur. We propose an adaptive computation framework that dynamically exits Transformer blocks for real-time UAV tracking.
arXiv Detail & Related papers (2024-07-07T14:10:04Z)
- Exploring Dynamic Transformer for Efficient Object Tracking [58.120191254379854]
We propose DyTrack, a dynamic transformer framework for efficient tracking. DyTrack automatically learns to configure proper reasoning routes for various inputs, making better use of the available computational budget. Experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model.
arXiv Detail & Related papers (2024-03-26T12:31:58Z)
- DropMAE: Learning Representations via Masked Autoencoders with Spatial-Attention Dropout for Temporal Matching Tasks [77.84636815364905]
This paper studies masked autoencoder (MAE) video pre-training for various temporal matching-based downstream tasks. We propose DropMAE, which adaptively performs spatial-attention dropout during frame reconstruction to facilitate temporal correspondence learning in videos.
arXiv Detail & Related papers (2023-04-02T16:40:42Z)