SwinTrack: A Simple and Strong Baseline for Transformer Tracking
- URL: http://arxiv.org/abs/2112.00995v1
- Date: Thu, 2 Dec 2021 05:56:03 GMT
- Title: SwinTrack: A Simple and Strong Baseline for Transformer Tracking
- Authors: Liting Lin, Heng Fan, Yong Xu, Haibin Ling
- Abstract summary: We propose a fully attention-based Transformer tracking algorithm, Swin-Transformer Tracker (SwinTrack).
SwinTrack uses Transformer for both feature extraction and feature fusion, allowing full interactions between the target object and the search region for tracking.
In our thorough experiments, SwinTrack sets a new record with 0.717 SUC on LaSOT, surpassing STARK by 4.6% while still running at 45 FPS.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer has recently demonstrated clear potential in improving visual
tracking algorithms. Nevertheless, existing transformer-based trackers mostly
use Transformer to fuse and enhance the features generated by convolutional
neural networks (CNNs). By contrast, in this paper, we propose a fully
attention-based Transformer tracking algorithm, Swin-Transformer Tracker
(SwinTrack). SwinTrack uses Transformer for both feature extraction and feature
fusion, allowing full interactions between the target object and the search
region for tracking. To further improve performance, we investigate
comprehensively different strategies for feature fusion, position encoding, and
training loss. All these efforts make SwinTrack a simple yet solid baseline. In
our thorough experiments, SwinTrack sets a new record with 0.717 SUC on LaSOT,
surpassing STARK by 4.6% while still running at 45 FPS. Besides, it achieves
state-of-the-art performance with 0.483 SUC, 0.832 SUC, and 0.694 AO on the
challenging LaSOT_ext, TrackingNet, and GOT-10k benchmarks, respectively. Our implementation and
trained models are available at https://github.com/LitingLin/SwinTrack.
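For readers who want a concrete picture of the fully attention-based pipeline the abstract describes, the following is a minimal, illustrative PyTorch sketch: a shared Transformer stands in for the Swin backbone, a second Transformer fuses the concatenated template and search tokens, and small heads predict per-token scores and boxes. All module choices, names, and sizes here are assumptions for illustration, not the official implementation (which is in the linked repository).

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a fully attention-based tracking pipeline in the spirit of
# SwinTrack: a Transformer backbone embeds template and search patches, a second
# Transformer fuses the concatenated token sequences, and lightweight heads predict
# per-token classification scores and boxes. Positional encodings are omitted for
# brevity; all sizes and names are illustrative assumptions.


class ToyFullyAttentionalTracker(nn.Module):
    def __init__(self, dim=256, depth=4, heads=8, patch=16):
        super().__init__()
        # Patch embedding stands in for a (Swin) Transformer backbone stem.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)   # feature extraction
        self.fusion = nn.TransformerEncoder(layer, num_layers=depth)     # feature fusion
        self.cls_head = nn.Linear(dim, 1)   # foreground score per search token
        self.box_head = nn.Linear(dim, 4)   # normalized box per search token

    def tokens(self, img):
        x = self.patch_embed(img)                       # (B, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)             # (B, N, dim)

    def forward(self, template, search):
        z = self.backbone(self.tokens(template))        # template tokens
        x = self.backbone(self.tokens(search))          # search-region tokens
        fused = self.fusion(torch.cat([z, x], dim=1))   # joint attention over both
        xs = fused[:, z.shape[1]:]                      # keep only the search tokens
        return self.cls_head(xs).sigmoid(), self.box_head(xs).sigmoid()


if __name__ == "__main__":
    model = ToyFullyAttentionalTracker()
    scores, boxes = model(torch.randn(1, 3, 112, 112), torch.randn(1, 3, 224, 224))
    print(scores.shape, boxes.shape)  # (1, 196, 1) (1, 196, 4)
```

Running self-attention over the concatenated template and search tokens is what allows the "full interactions" between the target object and the search region mentioned in the abstract.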
Related papers
- Exploring Dynamic Transformer for Efficient Object Tracking
We propose DyTrack, a dynamic transformer framework for efficient tracking.
DyTrack automatically learns to configure proper reasoning routes for various inputs, gaining better utilization of the available computational budget.
Experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model.
arXiv Detail & Related papers (2024-03-26T12:31:58Z)
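The DyTrack summary above describes input-conditioned computation only at a high level. As a generic illustration of that idea (not DyTrack's actual routing mechanism), the hypothetical sketch below adds a learned halting gate after each Transformer block so that easy inputs can exit early and spend less of the compute budget.

```python
import torch
import torch.nn as nn

# Generic early-exit sketch of input-adaptive depth: a gate after each block
# estimates whether the current features are already good enough and, if so,
# skips the remaining blocks at inference time. This is NOT DyTrack's actual
# routing scheme, only an illustration of budget-aware dynamic computation.


class EarlyExitEncoder(nn.Module):
    def __init__(self, dim=256, heads=8, depth=8, exit_threshold=0.9):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            for _ in range(depth)
        )
        self.gates = nn.ModuleList(nn.Linear(dim, 1) for _ in range(depth))
        self.exit_threshold = exit_threshold

    def forward(self, tokens):
        used_blocks = 0
        for block, gate in zip(self.blocks, self.gates):
            tokens = block(tokens)
            used_blocks += 1
            # Confidence that the current features suffice (mean-pooled gate score).
            halt = torch.sigmoid(gate(tokens.mean(dim=1))).mean()
            if not self.training and halt > self.exit_threshold:
                break  # easy input: stop early and save compute
        return tokens, used_blocks


if __name__ == "__main__":
    enc = EarlyExitEncoder().eval()
    feats, depth_used = enc(torch.randn(1, 196, 256))
    print(feats.shape, "blocks used:", depth_used)
```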
- LiteTrack: Layer Pruning with Asynchronous Feature Extraction for Lightweight and Efficient Visual Tracking
LiteTrack is an efficient transformer-based tracking model optimized for high-speed operations across various devices.
It achieves a more favorable trade-off between accuracy and efficiency than other lightweight trackers.
LiteTrack-B9 reaches a competitive 72.2% AO on GOT-10k and 82.4% AUC on TrackingNet, and runs at 171 fps on an NVIDIA 2080Ti GPU.
arXiv Detail & Related papers (2023-09-17T12:01:03Z)
- Separable Self and Mixed Attention Transformers for Efficient Object Tracking
This paper proposes an efficient self- and mixed-attention transformer-based architecture for lightweight tracking.
With these contributions, the proposed lightweight tracker is the first to deploy a transformer-based backbone and head module concurrently.
Simulations show that our Separable Self and Mixed Attention-based Tracker, SMAT, surpasses the performance of related lightweight trackers on GOT10k, TrackingNet, LaSOT, NfS30, UAV123, and AVisT datasets.
arXiv Detail & Related papers (2023-09-07T19:23:02Z)
- Divert More Attention to Vision-Language Tracking
We show that ConvNets are still competitive, and can even be better while being more economical and friendly, in achieving SOTA tracking.
Our solution is to unleash the power of multimodal vision-language (VL) tracking, simply using ConvNets.
We show that our unified-adaptive VL representation, learned purely with the ConvNets, is a simple yet strong alternative to Transformer visual features.
arXiv Detail & Related papers (2022-07-03T16:38:24Z)
- SparseTT: Visual Tracking with Sparse Transformers
The self-attention mechanism, designed to model long-range dependencies, is key to the success of Transformers.
In this paper, we relieve the lack of focus in vanilla self-attention with a sparse attention mechanism that concentrates on the most relevant information in the search region.
We introduce a double-head predictor to boost the accuracy of foreground-background classification and regression of target bounding boxes.
arXiv Detail & Related papers (2022-05-08T04:00:28Z)
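The SparseTT summary above mentions focusing attention on the most relevant information in the search region. One generic way to realize such sparsity, shown in the hypothetical sketch below, is top-k attention: each query keeps only its k largest attention logits before the softmax. This illustrates the general technique, not SparseTT's exact formulation.

```python
import torch
import torch.nn.functional as F

# Generic top-k sparse attention: for each query, keep only the k largest
# attention logits and mask the rest before the softmax, so every query attends
# to its k most relevant keys only. Illustrative sketch, not SparseTT's design.


def topk_sparse_attention(q, k, v, topk=16):
    # q: (B, Nq, dim); k, v: (B, Nk, dim)
    scale = q.shape[-1] ** -0.5
    logits = torch.matmul(q, k.transpose(-2, -1)) * scale                     # (B, Nq, Nk)
    kth = logits.topk(min(topk, logits.shape[-1]), dim=-1).values[..., -1:]   # k-th largest per query
    logits = logits.masked_fill(logits < kth, float("-inf"))                  # drop all but the top-k keys
    attn = F.softmax(logits, dim=-1)
    return torch.matmul(attn, v)                                              # (B, Nq, dim)


if __name__ == "__main__":
    q = torch.randn(2, 196, 256)   # e.g. search-region queries
    kv = torch.randn(2, 49, 256)   # e.g. template keys/values
    print(topk_sparse_attention(q, kv, kv, topk=8).shape)  # torch.Size([2, 196, 256])
```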
- Efficient Visual Tracking via Hierarchical Cross-Attention Transformer
We present an efficient tracking method via a hierarchical cross-attention transformer named HCAT.
Our model runs at about 195 fps on a GPU, 45 fps on a CPU, and 55 fps on the NVIDIA Jetson AGX Xavier edge AI platform.
arXiv Detail & Related papers (2022-03-25T09:45:27Z)
- Efficient Visual Tracking with Exemplar Transformers
We introduce the Exemplar Transformer, an efficient transformer for real-time visual object tracking.
E.T.Track, our visual tracker that incorporates Exemplar Transformer layers, runs at 47 fps on a CPU.
This is up to 8 times faster than other transformer-based models.
arXiv Detail & Related papers (2021-12-17T18:57:54Z)
- Learning Tracking Representations via Dual-Branch Fully Transformer Networks
We present a Siamese-like dual-branch network based solely on Transformers for tracking.
We extract a feature vector for each patch based on its matching results with others within an attention window.
The method achieves results better than or comparable to those of the best-performing methods.
arXiv Detail & Related papers (2021-12-05T13:44:33Z)
- Transformer Tracking
Correlation plays a critical role in the tracking field, especially in popular Siamese-based trackers.
This work presents a novel attention-based feature fusion network, which effectively combines the template and search region features solely using attention.
Experiments show that our TransT achieves very promising results on six challenging datasets.
arXiv Detail & Related papers (2021-03-29T09:06:55Z)
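Several of the trackers above (TransT, HCAT, SMAT) replace explicit correlation with attention-based fusion of template and search features. The hypothetical sketch below shows the basic ingredient, a single cross-attention block in which search-region tokens query template tokens; the actual papers interleave multiple self- and cross-attention modules, so this is only an illustration.

```python
import torch
import torch.nn as nn

# Minimal attention-based feature fusion in the spirit of TransT-style trackers:
# search-region tokens query the template tokens via cross-attention, replacing
# explicit correlation. Names and sizes are illustrative assumptions.


class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim))

    def forward(self, search_tokens, template_tokens):
        # Queries come from the search region; keys/values come from the template.
        attended, _ = self.cross_attn(search_tokens, template_tokens, template_tokens)
        x = self.norm1(search_tokens + attended)
        return self.norm2(x + self.ffn(x))


if __name__ == "__main__":
    fuse = CrossAttentionFusion()
    out = fuse(torch.randn(1, 196, 256), torch.randn(1, 49, 256))
    print(out.shape)  # torch.Size([1, 196, 256])
```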