Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance
- URL: http://arxiv.org/abs/2403.05231v1
- Date: Fri, 8 Mar 2024 11:41:48 GMT
- Title: Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance
- Authors: Liting Lin, Heng Fan, Zhipeng Zhang, Yaowei Wang, Yong Xu, Haibin Ling
- Abstract summary: We propose LoRAT, a method that unveils the power of larger Vision Transformers (ViT) for tracking within laboratory-level resources.
The essence of our work lies in adapting LoRA, a technique that fine-tunes a small subset of model parameters without adding inference latency.
We design an anchor-free head solely based on a multilayer perceptron (MLP) to adapt PETR, enabling better performance with less computational overhead.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motivated by the Parameter-Efficient Fine-Tuning (PEFT) in large language
models, we propose LoRAT, a method that unveils the power of larger Vision
Transformers (ViT) for tracking within laboratory-level resources. The essence
of our work lies in adapting LoRA, a technique that fine-tunes a small subset
of model parameters without adding inference latency, to the domain of visual
tracking. However, unique challenges and potential domain gaps make this
transfer less straightforward than it first appears. Firstly, a transformer-based
tracker constructs unshared position embeddings for the template and search images.
This poses a challenge for transferring LoRA, which usually requires design
consistency between the pre-trained backbone and the downstream task.
Secondly, the inductive bias inherent in convolutional heads diminishes the
effectiveness of parameter-efficient fine-tuning in tracking models. To
overcome these limitations, we first decouple the position embeddings in
transformer-based trackers into shared spatial ones and independent type ones.
The shared embeddings, which describe the absolute coordinates of
multi-resolution images (namely, the template and search images), are inherited
from the pre-trained backbones. In contrast, the independent embeddings
indicate the sources of each token and are learned from scratch. Furthermore,
we design an anchor-free head solely based on a multilayer perceptron (MLP) to
adapt PETR, enabling better performance with less computational overhead. With
our design, 1) it becomes practical to train trackers with the ViT-g backbone
on GPUs with only 25.8 GB of memory (batch size of 16); 2) we reduce the
training time of the L-224 variant from 35.0 to 10.8 GPU hours; 3) we improve
the LaSOT SUC score from 0.703 to 0.743 with the L-224 variant; 4) we
accelerate the inference speed of the L-224 variant from 52 to 119 FPS. Code
and models will be released.
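As a reference for the LoRA mechanism the abstract builds on, below is a minimal PyTorch sketch of a LoRA-adapted linear layer (an illustrative assumption, not the released LoRAT code): the pre-trained weight is frozen, a low-rank update is trained in its place, and the update can be merged into the base weight afterwards, which is why no inference latency is added.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update."""

    def __init__(self, in_features, out_features, rank=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # pre-trained weight stays frozen
        self.base.bias.requires_grad_(False)
        # B is zero-initialized so training starts from the pre-trained behavior.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

    @torch.no_grad()
    def merge(self):
        # Fold the low-rank update into the base weight for latency-free inference.
        self.base.weight += (self.lora_B @ self.lora_A) * self.scaling
        self.lora_B.zero_()  # forward() now reduces to the merged base layer
```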
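The decoupling of position embeddings can be sketched in the same spirit. In the hypothetical module below (names and shapes are assumptions, not the released implementation), one spatial embedding inherited from the pre-trained backbone encodes absolute coordinates and is shared by the template and search tokens, interpolated to each input's resolution, while two type embeddings learned from scratch indicate the source of each token:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledPosEmbed(nn.Module):
    """Shared spatial embedding + independent per-source type embeddings."""

    def __init__(self, dim, pretrained_grid=14):
        super().__init__()
        # Spatial embedding, e.g. copied from the pre-trained ViT checkpoint.
        self.spatial = nn.Parameter(torch.zeros(1, dim, pretrained_grid, pretrained_grid))
        # Type embeddings: index 0 for template tokens, 1 for search tokens.
        self.type_embed = nn.Parameter(torch.zeros(2, dim))

    def _spatial_at(self, grid):
        # Interpolate the shared embedding to this input's token grid.
        pos = F.interpolate(self.spatial, size=(grid, grid),
                            mode="bicubic", align_corners=False)
        return pos.flatten(2).transpose(1, 2)  # (1, grid*grid, dim)

    def forward(self, template_tokens, search_tokens, t_grid, s_grid):
        t = template_tokens + self._spatial_at(t_grid) + self.type_embed[0]
        s = search_tokens + self._spatial_at(s_grid) + self.type_embed[1]
        return torch.cat([t, s], dim=1)  # joint token sequence for the backbone
```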
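Finally, the anchor-free MLP head can be illustrated as follows (a sketch under assumed layer sizes, not the paper's exact design): each search-region token is mapped by per-token MLPs to a foreground score and a normalized box, with no convolutions and hence none of the inductive bias the abstract identifies as harmful to parameter-efficient fine-tuning:

```python
import torch
import torch.nn as nn

class MLPHead(nn.Module):
    """Anchor-free head built solely from per-token MLPs."""

    def __init__(self, dim, hidden=256):
        super().__init__()
        def mlp(out_dim):
            return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, out_dim))
        self.cls = mlp(1)  # target-presence score per token
        self.box = mlp(4)  # normalized box parameters per token

    def forward(self, search_tokens):
        scores = self.cls(search_tokens).squeeze(-1)  # (B, N)
        boxes = self.box(search_tokens).sigmoid()     # (B, N, 4)
        best = scores.argmax(dim=-1)                  # peak token per image
        idx = torch.arange(boxes.size(0), device=boxes.device)
        return scores, boxes[idx, best]               # box at the peak token
```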
Related papers
- Exploring Dynamic Transformer for Efficient Object Tracking
We propose DyTrack, a dynamic transformer framework for efficient tracking.
DyTrack automatically learns to configure proper reasoning routes for various inputs, gaining better utilization of the available computational budget.
Experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model.
arXiv Detail & Related papers (2024-03-26T12:31:58Z)
- Separable Self and Mixed Attention Transformers for Efficient Object Tracking
This paper proposes an efficient self and mixed attention transformer-based architecture for lightweight tracking.
With these contributions, the proposed lightweight tracker is the first to deploy a transformer-based backbone and head module concurrently.
Simulations show that our Separable Self and Mixed Attention-based Tracker, SMAT, surpasses the performance of related lightweight trackers on GOT10k, TrackingNet, LaSOT, NfS30, UAV123, and AVisT datasets.
arXiv Detail & Related papers (2023-09-07T19:23:02Z)
- Efficient Training for Visual Tracking with Deformable Transformer
We present DETRack, a streamlined end-to-end visual object tracking framework.
Our framework utilizes an efficient encoder-decoder structure in which the deformable transformer decoder acts as the target head.
For training, we introduce a novel one-to-many label assignment and an auxiliary denoising technique.
arXiv Detail & Related papers (2023-09-06T03:07:43Z)
- Rethinking Hierarchies in Pre-trained Plain Vision Transformer
Self-supervised pre-training of vision transformers (ViT) via masked image modeling (MIM) has been proven very effective.
However, customized algorithms, e.g., GreenMIM, should be carefully designed for hierarchical ViTs, instead of simply using the vanilla MAE designed for the plain ViT.
This paper proposes a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training.
arXiv Detail & Related papers (2022-11-03T13:19:23Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- Efficient Visual Tracking with Exemplar Transformers
We introduce the Exemplar Transformer, an efficient transformer for real-time visual object tracking.
E.T.Track, our visual tracker that incorporates Exemplar Transformer layers, runs at 47 fps on a CPU.
This is up to 8 times faster than other transformer-based models.
arXiv Detail & Related papers (2021-12-17T18:57:54Z)
- Learning Tracking Representations via Dual-Branch Fully Transformer Networks
We present a Siamese-like Dual-branch network based on solely Transformers for tracking.
We extract a feature vector for each patch based on its matching results with others within an attention window.
The method achieves better or comparable results as the best-performing methods.
arXiv Detail & Related papers (2021-12-05T13:44:33Z)
- Siamese Transformer Pyramid Networks for Real-Time UAV Tracking
We introduce the Siamese Transformer Pyramid Network (SiamTPN), which inherits the advantages from both CNN and Transformer architectures.
Experiments on both aerial and popular general tracking benchmarks show competitive results while the tracker operates at high speed.
Our fastest variant tracker operates at over 30 Hz on a single CPU core and obtains an AUC score of 58.1% on the LaSOT dataset.
arXiv Detail & Related papers (2021-10-17T13:48:31Z)