AViTMP: A Tracking-Specific Transformer for Single-Branch Visual Tracking
- URL: http://arxiv.org/abs/2310.19542v3
- Date: Thu, 4 Jul 2024 03:37:57 GMT
- Title: AViTMP: A Tracking-Specific Transformer for Single-Branch Visual Tracking
- Authors: Chuanming Tang, Kai Wang, Joost van de Weijer, Jianlin Zhang, Yongmei Huang
- Abstract summary: We propose the Adaptive ViT Model Prediction tracker (AViTMP), a customised single-branch tracking method.
This method bridges the single-branch network with discriminative models for the first time.
We show that AViTMP achieves state-of-the-art performance, especially in terms of long-term tracking and robustness.
- Score: 17.133735660335343
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual object tracking is a fundamental component of transportation systems, especially for intelligent driving. Despite achieving state-of-the-art performance in visual tracking, recent single-branch trackers tend to overlook the weak prior assumptions associated with the Vision Transformer (ViT) encoder and the inference pipeline in visual tracking. Moreover, the effectiveness of discriminative trackers remains constrained by the adoption of the dual-branch pipeline. To tackle the limited effectiveness of vanilla ViT, we propose the Adaptive ViT Model Prediction tracker (AViTMP), a customised tracking method that bridges the single-branch network with discriminative models for the first time. Specifically, in the proposed AViT encoder, we introduce a tracking-tailored Adaptor module for the vanilla ViT and a joint target state embedding to enrich the target-prior embedding paradigm. We then combine the AViT encoder with a discriminative transformer-specific model predictor to predict the accurate target location. Furthermore, to mitigate the limitations of conventional inference practice, we present a novel inference pipeline called CycleTrack, which bolsters tracking robustness in the presence of distractors via bidirectional cycle tracking verification. In the experiments, we evaluate AViTMP on eight tracking benchmarks for a comprehensive assessment, including LaSOT, LaSOTExtSub, AVisT, etc. The experimental results establish that, under fair comparison, AViTMP achieves state-of-the-art performance, especially in terms of long-term tracking and robustness. The source code will be released at https://github.com/Tchuanm/AViTMP.
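The bidirectional verification behind CycleTrack can be pictured in a few lines of Python. The following is a minimal sketch under assumed interfaces: the single-step tracker `track_fn` and the IoU threshold `tau` are hypothetical stand-ins, not the released AViTMP implementation.

```python
# Sketch of a CycleTrack-style bidirectional verification step, assuming
# `track_fn(src_frame, src_box, dst_frame)` is any single-step tracker
# that returns an (x1, y1, x2, y2) box in `dst_frame`.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def cycle_track_step(track_fn, prev_frame, prev_box, cur_frame, tau=0.5):
    # Forward pass: locate the target in the current frame.
    cand_box = track_fn(prev_frame, prev_box, cur_frame)
    # Backward pass: track the candidate back to the previous frame.
    back_box = track_fn(cur_frame, cand_box, prev_frame)
    # Accept the candidate only if the cycle closes on the previous
    # target; otherwise treat it as a likely distractor.
    if iou(back_box, prev_box) >= tau:
        return cand_box
    return prev_box
```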
Related papers
- SFTrack: A Robust Scale and Motion Adaptive Algorithm for Tracking Small and Fast Moving Objects [2.9803250365852443]
This paper addresses the problem of multi-object tracking in Unmanned Aerial Vehicle (UAV) footage.
It plays a critical role in various UAV applications, including traffic monitoring systems and real-time suspect tracking by the police.
We propose a new tracking strategy, which initiates the tracking of target objects from low-confidence detections.
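A minimal sketch of one way such low-confidence initiation can work, using a common tentative-track pattern; the thresholds and confirmation rule are illustrative assumptions, not SFTrack's exact algorithm:

```python
# A low-confidence detection still spawns a *tentative* track, which is
# only confirmed after `min_hits` consecutive matches; a high-confidence
# detection is confirmed immediately. Thresholds are assumptions.

class TentativeTrack:
    def __init__(self, box, score, min_hits=3, high_conf=0.5):
        self.box, self.hits, self.min_hits = box, 1, min_hits
        self.confirmed = score >= high_conf

    def update(self, box):
        """Called once per frame when a detection matches this track."""
        self.box, self.hits = box, self.hits + 1
        if self.hits >= self.min_hits:
            self.confirmed = True
```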
arXiv Detail & Related papers (2024-10-26T05:09:20Z)
- Learning Motion Blur Robust Vision Transformers with Dynamic Early Exit for Real-Time UAV Tracking [14.382072224997074]
Single-stream architectures utilizing pre-trained ViT backbones offer improved performance, efficiency, and robustness.
We boost the efficiency of this framework by tailoring it into an adaptive framework that dynamically exits Transformer blocks for real-time UAV tracking.
We also improve the effectiveness of ViTs in handling motion blur, a common issue in UAV tracking caused by the fast movements of either the UAV, the tracked objects, or both.
arXiv Detail & Related papers (2024-07-07T14:10:04Z)
- Adaptively Bypassing Vision Transformer Blocks for Efficient Visual Tracking [11.361394596302334]
ABTrack is an adaptive computation framework that adaptively bypasses transformer blocks for efficient visual tracking.
We propose a Bypass Decision Module (BDM) to determine if a transformer block should be bypassed.
We introduce a novel ViT pruning method to reduce the dimension of the latent representation of tokens in each transformer block.
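As an illustration of the block-bypassing idea, here is a hedged sketch: the gating head, the mean-token summary, and the threshold are assumptions for exposition, not ABTrack's exact Bypass Decision Module.

```python
import torch
import torch.nn as nn

class BypassableBlock(nn.Module):
    """Wraps a transformer block with a lightweight bypass gate.

    A tiny linear head scores a cheap token summary; if the score
    (averaged over the batch for simplicity) falls below a threshold,
    the block is skipped and the input passes through unchanged.
    """

    def __init__(self, block: nn.Module, dim: int, threshold: float = 0.5):
        super().__init__()
        self.block = block
        self.gate = nn.Linear(dim, 1)   # bypass decision head
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); use the mean token as a summary.
        score = torch.sigmoid(self.gate(x.mean(dim=1)))  # (batch, 1)
        if score.mean().item() < self.threshold:
            return x            # bypass: identity shortcut
        return self.block(x)    # execute the full transformer block
```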
arXiv Detail & Related papers (2024-06-12T09:39:18Z)
- Exploring Dynamic Transformer for Efficient Object Tracking [58.120191254379854]
We propose DyTrack, a dynamic transformer framework for efficient tracking.
DyTrack automatically learns to configure proper reasoning routes for various inputs, gaining better utilization of the available computational budget.
Experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model.
arXiv Detail & Related papers (2024-03-26T12:31:58Z)
- Tracking with Human-Intent Reasoning [64.69229729784008]
This work proposes a new tracking task -- Instruction Tracking.
It involves providing implicit tracking instructions that require the tracker to perform tracking in video frames automatically.
TrackGPT is capable of performing complex reasoning-based tracking.
arXiv Detail & Related papers (2023-12-29T03:22:18Z)
- Unified Sequence-to-Sequence Learning for Single- and Multi-Modal Visual Object Tracking [64.28025685503376]
SeqTrack casts visual tracking as a sequence generation task, forecasting object bounding boxes in an autoregressive manner.
SeqTrackv2 integrates a unified interface for auxiliary modalities and a set of task-prompt tokens to specify the task.
This sequence learning paradigm not only simplifies the tracking framework, but also showcases superior performance across 14 challenging benchmarks.
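To make the sequence-generation framing concrete, here is a minimal sketch of autoregressive box decoding over quantised coordinate tokens; the decoder interface and bin count are assumptions, not SeqTrack's released code.

```python
import torch

def box_to_tokens(box, num_bins=1000):
    """Quantise normalised (x, y, w, h) in [0, 1] into discrete tokens,
    so a bounding box becomes a 4-token target sequence."""
    return [min(int(v * num_bins), num_bins - 1) for v in box]

def decode_box(decoder, visual_feats, num_bins=1000, start_token=0):
    """Generate 4 coordinate tokens autoregressively, one at a time.
    `decoder(feats, prefix)` is any callable returning next-token
    logits of shape (batch, seq_len, vocab)."""
    prefix = [start_token]
    for _ in range(4):                       # x, y, w, h
        logits = decoder(visual_feats, torch.tensor([prefix]))
        prefix.append(int(logits[0, -1].argmax()))
    # Map tokens back to normalised coordinates (bin centres).
    return [(t + 0.5) / num_bins for t in prefix[1:]]
```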
arXiv Detail & Related papers (2023-04-27T17:56:29Z)
- OmniTracker: Unifying Object Tracking by Tracking-with-Detection [119.51012668709502]
OmniTracker is presented to resolve all the tracking tasks with a fully shared network architecture, model weights, and inference pipeline.
Experiments on 7 tracking datasets, including LaSOT, TrackingNet, DAVIS16-17, MOT17, MOTS20, and YTVIS19, demonstrate that OmniTracker achieves on-par or even better results than both task-specific and unified tracking models.
arXiv Detail & Related papers (2023-03-21T17:59:57Z)
- SGDViT: Saliency-Guided Dynamic Vision Transformer for UAV Tracking [12.447854608181833]
This work presents a novel saliency-guided dynamic vision Transformer (SGDViT) for UAV tracking.
The proposed method introduces a task-specific object saliency mining network to refine the cross-correlation operation.
A lightweight saliency filtering Transformer further refines saliency information and increases the focus on appearance information.
arXiv Detail & Related papers (2023-03-08T05:01:00Z)
- Unsupervised Learning of Accurate Siamese Tracking [68.58171095173056]
We present a novel unsupervised tracking framework, in which temporal correspondence is learned on both the classification and regression branches.
Our tracker outperforms preceding unsupervised methods by a substantial margin, performing on par with supervised methods on large-scale datasets such as TrackingNet and LaSOT.
arXiv Detail & Related papers (2022-04-04T13:39:43Z)
- Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking [47.205979159070445]
We bridge the individual video frames and explore the temporal contexts across them via a transformer architecture for robust object tracking.
Different from classic usage of the transformer in natural language processing tasks, we separate its encoder and decoder into two parallel branches.
Our method sets several new state-of-the-art records on prevalent tracking benchmarks.
arXiv Detail & Related papers (2021-03-22T09:20:05Z)