LoReTrack: Efficient and Accurate Low-Resolution Transformer Tracking
- URL: http://arxiv.org/abs/2405.17660v1
- Date: Mon, 27 May 2024 21:19:04 GMT
- Title: LoReTrack: Efficient and Accurate Low-Resolution Transformer Tracking
- Authors: Shaohua Dong, Yunhe Feng, Qing Yang, Yuewei Lin, Heng Fan
- Abstract summary: The Low-Resolution Transformer Tracker (LoReTrack) distills knowledge from a frozen high-resolution Transformer tracker to mitigate the information loss of low-resolution inputs.
With a 256x256 resolution, LoReTrack consistently improves over the same-resolution baseline, and shows competitive or even better results than a 384x384 high-resolution Transformer tracker.
With a 128x128 resolution, it runs at 25 fps on a CPU with 64.9%/46.4% SUC scores on LaSOT/LaSOText, surpassing all other CPU real-time trackers.
- Score: 12.670730236928353
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: High-performance Transformer trackers have shown excellent results, yet they often bear a heavy computational load. Observing that a smaller input can immediately and conveniently reduce computations without changing the model, an easy solution is to adopt a low-resolution input for efficient Transformer tracking. Albeit faster, this considerably hurts tracking accuracy due to information loss in low-resolution tracking. In this paper, we aim to mitigate such information loss to boost the performance of low-resolution Transformer tracking via dual knowledge distillation from a frozen high-resolution (but not larger) Transformer tracker. The core lies in two simple yet effective distillation modules, comprising query-key-value knowledge distillation (QKV-KD) and discrimination knowledge distillation (Disc-KD), across resolutions. The former, from the global view, allows the low-resolution tracker to inherit the features and interactions from the high-resolution tracker, while the latter, from the target-aware view, enhances the target-background distinguishing capacity by imitating discriminative regions from its high-resolution counterpart. With the dual knowledge distillation, our Low-Resolution Transformer Tracker (LoReTrack) enjoys not only high efficiency owing to reduced computation but also enhanced accuracy by distilling knowledge from the high-resolution tracker. In extensive experiments, LoReTrack with a 256x256 resolution consistently improves over the baseline at the same resolution, and shows competitive or even better results than a 384x384 high-resolution Transformer tracker, while running 52% faster and saving 56% MACs. Moreover, LoReTrack is resolution-scalable. With a 128x128 resolution, it runs at 25 fps on a CPU with 64.9%/46.4% SUC scores on LaSOT/LaSOText, surpassing all other CPU real-time trackers. Code will be released.
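The cross-resolution distillation idea in the abstract can be sketched in code. The snippet below is a minimal illustration, not the authors' released implementation: it assumes the teacher and student produce ViT-style token grids whose side lengths differ by an integer factor (e.g. a 256x256 teacher vs. a 128x128 student with 16x16 patches, giving 16x16 vs. 8x8 tokens; a 384-vs-256 pairing would instead need bilinear interpolation), and it uses simple block average-pooling plus MSE as stand-ins for the paper's QKV-KD and Disc-KD losses. The function names `pool_tokens`, `qkv_kd_loss`, and `disc_kd_loss` are hypothetical.

```python
import numpy as np


def pool_tokens(feat, src_hw, dst_hw):
    """Block-average a (src_h*src_w, dim) token map down to (dst_h*dst_w, dim).

    Illustrative assumption: src_h/dst_h and src_w/dst_w are integers.
    """
    src_h, src_w = src_hw
    dst_h, dst_w = dst_hw
    rh, rw = src_h // dst_h, src_w // dst_w
    dim = feat.shape[1]
    # Split each spatial axis into (dst, ratio) blocks and average the blocks.
    blocks = feat.reshape(dst_h, rh, dst_w, rw, dim)
    return blocks.mean(axis=(1, 3)).reshape(dst_h * dst_w, dim)


def qkv_kd_loss(student_qkv, teacher_qkv, src_hw, dst_hw):
    """QKV-KD sketch: MSE between the student's query/key/value token maps and
    the teacher's, after pooling the teacher grid to the student grid size."""
    loss = 0.0
    for s, t in zip(student_qkv, teacher_qkv):
        loss += np.mean((s - pool_tokens(t, src_hw, dst_hw)) ** 2)
    return loss / len(student_qkv)


def disc_kd_loss(student_map, teacher_map, src_hw, dst_hw):
    """Disc-KD sketch: align the student's target-aware response map with the
    teacher's, each normalized to a probability map over tokens first."""
    t_down = pool_tokens(teacher_map[:, None], src_hw, dst_hw)[:, 0]
    s = student_map / (student_map.sum() + 1e-8)
    t = t_down / (t_down.sum() + 1e-8)
    return np.mean((s - t) ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 8
    # Hypothetical shapes: 256x256 teacher -> 16x16 tokens, 128x128 student -> 8x8.
    s_qkv = [rng.standard_normal((8 * 8, dim)) for _ in range(3)]
    t_qkv = [rng.standard_normal((16 * 16, dim)) for _ in range(3)]
    total = qkv_kd_loss(s_qkv, t_qkv, (16, 16), (8, 8))
    print(f"QKV-KD loss: {total:.4f}")
```

In training, such losses would be added to the ordinary tracking loss so that gradients flow only into the low-resolution student while the high-resolution teacher stays frozen, matching the setup described in the abstract.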
Related papers
- Cross Resolution Encoding-Decoding For Detection Transformers [33.248031676529635]
Cross-Resolution Encoding-Decoding (CRED) is designed to fuse multiscale detection mechanisms.
CRED delivers accuracy similar to its high-resolution DETR counterpart with roughly 50% fewer FLOPs.
We plan to release pretrained CRED-DETRs for use by the community.
arXiv Detail & Related papers (2024-10-05T09:01:59Z) - Multi-resolution Rescored ByteTrack for Video Object Detection on Ultra-low-power Embedded Systems [13.225654514930595]
Multi-Resolution Rescored Byte-Track (MR2-ByteTrack) is a novel video object detection framework for ultra-low-power embedded processors.
MR2-ByteTrack reduces the average compute load of an off-the-shelf Deep Neural Network based object detector by up to 2.25×.
We demonstrate an average accuracy increase of 2.16% and a latency reduction of 43% on the GAP9 microcontroller.
arXiv Detail & Related papers (2024-04-17T15:45:49Z) - Exploring Dynamic Transformer for Efficient Object Tracking [58.120191254379854]
We propose DyTrack, a dynamic transformer framework for efficient tracking.
DyTrack automatically learns to configure proper reasoning routes for various inputs, gaining better utilization of the available computational budget.
Experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model.
arXiv Detail & Related papers (2024-03-26T12:31:58Z) - Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance [87.19164603145056]
We propose LoRAT, a method that unveils the power of large ViT model for tracking within laboratory-level resources.
The essence of our work lies in adapting LoRA, a technique that fine-tunes a small subset of model parameters without adding inference latency.
We design an anchor-free head to adapt PETR, enabling better performance with less computational overhead.
arXiv Detail & Related papers (2024-03-08T11:41:48Z) - PTSR: Patch Translator for Image Super-Resolution [16.243363392717434]
We propose a patch translator for image super-resolution (PTSR) to address this problem.
The proposed PTSR is a transformer-based GAN network with no convolution operation.
We introduce a novel patch translator module for regenerating the improved patches utilising multi-head attention.
arXiv Detail & Related papers (2023-10-20T01:45:00Z) - ZoomTrack: Target-aware Non-uniform Resizing for Efficient Visual
Tracking [40.13014036490452]
Transformers have enabled speed-oriented trackers to approach state-of-the-art (SOTA) performance at high speed.
We demonstrate that it is possible to narrow or even close this gap while achieving high tracking speed based on the smaller input size.
arXiv Detail & Related papers (2023-10-16T05:06:13Z) - Learning Disentangled Representation with Mutual Information
Maximization for Real-Time UAV Tracking [1.0541541376305243]
This paper exploits disentangled representation with mutual information (DR-MIM) to improve precision and efficiency for UAV tracking.
Our DR-MIM tracker significantly outperforms state-of-the-art UAV tracking methods.
arXiv Detail & Related papers (2023-08-20T13:16:15Z) - Exploring Lightweight Hierarchical Vision Transformers for Efficient
Visual Tracking [69.89887818921825]
HiT is a new family of efficient tracking models that can run at high speed on different devices.
HiT achieves 64.6% AUC on the LaSOT benchmark, surpassing all previous efficient trackers.
arXiv Detail & Related papers (2023-08-14T02:51:34Z) - Rethinking Resolution in the Context of Efficient Video Recognition [49.957690643214576]
Cross-resolution KD (ResKD) is a simple but effective method to boost recognition accuracy on low-resolution frames.
We extensively demonstrate its effectiveness over state-of-the-art architectures, i.e., 3D-CNNs and Video Transformers.
arXiv Detail & Related papers (2022-09-26T15:50:44Z) - Efficient Decoder-free Object Detection with Transformers [75.00499377197475]
Vision transformers (ViTs) are changing the landscape of object detection approaches.
We propose a decoder-free fully transformer-based (DFFT) object detector.
DFFT_SMALL achieves high efficiency in both training and inference stages.
arXiv Detail & Related papers (2022-06-14T13:22:19Z) - SALISA: Saliency-based Input Sampling for Efficient Video Object
Detection [58.22508131162269]
We propose SALISA, a novel non-uniform SALiency-based Input SAmpling technique for video object detection.
We show that SALISA significantly improves the detection of small objects.
arXiv Detail & Related papers (2022-04-05T17:59:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.