Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance
- URL: http://arxiv.org/abs/2403.05231v1
- Date: Fri, 8 Mar 2024 11:41:48 GMT
- Title: Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance
- Authors: Liting Lin, Heng Fan, Zhipeng Zhang, Yaowei Wang, Yong Xu, Haibin Ling
- Abstract summary: We propose LoRAT, a method that unveils the power of larger Vision Transformers (ViT) for tracking within laboratory-level resources.
The essence of our work lies in adapting LoRA, a technique that fine-tunes a small subset of model parameters without adding inference latency.
We design an anchor-free head solely based on a multilayer perceptron (MLP) to adapt PETR, enabling better performance with less computational overhead.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motivated by the Parameter-Efficient Fine-Tuning (PEFT) in large language
models, we propose LoRAT, a method that unveils the power of larger Vision
Transformers (ViT) for tracking within laboratory-level resources. The essence
of our work lies in adapting LoRA, a technique that fine-tunes a small subset
of model parameters without adding inference latency, to the domain of visual
tracking. However, unique challenges and potential domain gaps make this
transfer less straightforward than it first appears. Firstly, a transformer-based
tracker constructs unshared position embeddings for the template and search images.
This poses a challenge for transferring LoRA, which usually requires design
consistency between the pre-trained backbone and the downstream task.
Secondly, the inductive bias inherent in convolutional heads diminishes the
effectiveness of parameter-efficient fine-tuning in tracking models. To
overcome these limitations, we first decouple the position embeddings in
transformer-based trackers into shared spatial ones and independent type ones.
The shared embeddings, which describe the absolute coordinates of
multi-resolution images (namely, the template and search images), are inherited
from the pre-trained backbones. In contrast, the independent embeddings
indicate the sources of each token and are learned from scratch. Furthermore,
we design an anchor-free head solely based on a multilayer perceptron (MLP) to
adapt PETR, enabling better performance with less computational overhead. With
our design, 1) it becomes practical to train trackers with the ViT-g backbone
on GPUs with only 25.8 GB of memory (batch size of 16); 2) we reduce the
training time of the L-224 variant from 35.0 to 10.8 GPU hours; 3) we improve
the LaSOT SUC score from 0.703 to 0.743 with the L-224 variant; 4) we
accelerate the inference speed of the L-224 variant from 52 to 119 FPS. Code
and models will be released.
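As a reference for the LoRA mechanism the abstract builds on, below is a minimal PyTorch sketch of a LoRA-adapted linear layer (an illustrative assumption, not the released LoRAT code): the pre-trained weight is frozen, a low-rank update is trained in its place, and the update can be merged into the base weight afterwards, which is why no inference latency is added.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update."""

    def __init__(self, in_features, out_features, rank=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # pre-trained weight stays frozen
        self.base.bias.requires_grad_(False)
        # B is zero-initialized so training starts from the pre-trained behavior.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

    @torch.no_grad()
    def merge(self):
        # Fold the low-rank update into the base weight for latency-free inference.
        self.base.weight += (self.lora_B @ self.lora_A) * self.scaling
        self.lora_B.zero_()  # forward() now reduces to the merged base layer
```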
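The decoupling of position embeddings can be sketched in the same spirit. In the hypothetical module below (names and shapes are assumptions, not the released implementation), one spatial embedding inherited from the pre-trained backbone encodes absolute coordinates and is shared by the template and search tokens, interpolated to each input's resolution, while two type embeddings learned from scratch indicate the source of each token:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledPosEmbed(nn.Module):
    """Shared spatial embedding + independent per-source type embeddings."""

    def __init__(self, dim, pretrained_grid=14):
        super().__init__()
        # Spatial embedding, e.g. copied from the pre-trained ViT checkpoint.
        self.spatial = nn.Parameter(torch.zeros(1, dim, pretrained_grid, pretrained_grid))
        # Type embeddings: index 0 for template tokens, 1 for search tokens.
        self.type_embed = nn.Parameter(torch.zeros(2, dim))

    def _spatial_at(self, grid):
        # Interpolate the shared embedding to this input's token grid.
        pos = F.interpolate(self.spatial, size=(grid, grid),
                            mode="bicubic", align_corners=False)
        return pos.flatten(2).transpose(1, 2)  # (1, grid*grid, dim)

    def forward(self, template_tokens, search_tokens, t_grid, s_grid):
        t = template_tokens + self._spatial_at(t_grid) + self.type_embed[0]
        s = search_tokens + self._spatial_at(s_grid) + self.type_embed[1]
        return torch.cat([t, s], dim=1)  # joint token sequence for the backbone
```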
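Finally, the anchor-free MLP head can be illustrated as follows (a sketch under assumed layer sizes, not the paper's exact design): each search-region token is mapped by per-token MLPs to a foreground score and a normalized box, with no convolutions and hence none of the inductive bias the abstract identifies as harmful to parameter-efficient fine-tuning:

```python
import torch
import torch.nn as nn

class MLPHead(nn.Module):
    """Anchor-free head built solely from per-token MLPs."""

    def __init__(self, dim, hidden=256):
        super().__init__()
        def mlp(out_dim):
            return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, out_dim))
        self.cls = mlp(1)  # target-presence score per token
        self.box = mlp(4)  # normalized box parameters per token

    def forward(self, search_tokens):
        scores = self.cls(search_tokens).squeeze(-1)  # (B, N)
        boxes = self.box(search_tokens).sigmoid()     # (B, N, 4)
        best = scores.argmax(dim=-1)                  # peak token per image
        idx = torch.arange(boxes.size(0), device=boxes.device)
        return scores, boxes[idx, best]               # box at the peak token
```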
Related papers
- Exploring Dynamic Transformer for Efficient Object Tracking
We propose DyTrack, a dynamic transformer framework for efficient tracking.
DyTrack automatically learns to configure proper reasoning routes for various inputs, gaining better utilization of the available computational budget.
Experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model.
arXiv Detail & Related papers (2024-03-26T12:31:58Z)
- Separable Self and Mixed Attention Transformers for Efficient Object Tracking
This paper proposes an efficient self and mixed attention transformer-based architecture for lightweight tracking.
With these contributions, the proposed lightweight tracker is the first to deploy a transformer-based backbone and head module concurrently.
Simulations show that our Separable Self and Mixed Attention-based Tracker, SMAT, surpasses the performance of related lightweight trackers on GOT10k, TrackingNet, LaSOT, NfS30, UAV123, and AVisT datasets.
arXiv Detail & Related papers (2023-09-07T19:23:02Z)
- Efficient Training for Visual Tracking with Deformable Transformer
We present DETRack, a streamlined end-to-end visual object tracking framework.
Our framework utilizes an efficient encoder-decoder structure in which the deformable transformer decoder acts as the target head.
For training, we introduce a novel one-to-many label assignment and an auxiliary denoising technique.
arXiv Detail & Related papers (2023-09-06T03:07:43Z)
- Rethinking Hierarchies in Pre-trained Plain Vision Transformer
Self-supervised pre-training of vision transformers (ViT) via masked image modeling (MIM) has been proven very effective.
However, customized algorithms, e.g., GreenMIM, should be carefully designed for hierarchical ViTs, instead of simply using the vanilla MAE designed for the plain ViT.
This paper proposes a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training.
arXiv Detail & Related papers (2022-11-03T13:19:23Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- Efficient Visual Tracking with Exemplar Transformers
We introduce the Exemplar Transformer, an efficient transformer for real-time visual object tracking.
E.T.Track, our visual tracker that incorporates Exemplar Transformer layers, runs at 47 fps on a CPU.
This is up to 8 times faster than other transformer-based models.
arXiv Detail & Related papers (2021-12-17T18:57:54Z)
- Learning Tracking Representations via Dual-Branch Fully Transformer Networks
We present a Siamese-like Dual-branch network based on solely Transformers for tracking.
We extract a feature vector for each patch based on its matching results with others within an attention window.
The method achieves better or comparable results as the best-performing methods.
arXiv Detail & Related papers (2021-12-05T13:44:33Z)
- Siamese Transformer Pyramid Networks for Real-Time UAV Tracking
We introduce the Siamese Transformer Pyramid Network (SiamTPN), which inherits the advantages from both CNN and Transformer architectures.
Experiments on both aerial and popular general tracking benchmarks show competitive results while the tracker operates at high speed.
Our fastest variant tracker operates at over 30 Hz on a single CPU core and obtains an AUC score of 58.1% on the LaSOT dataset.
arXiv Detail & Related papers (2021-10-17T13:48:31Z)