Efficient Training for Visual Tracking with Deformable Transformer
- URL: http://arxiv.org/abs/2309.02676v1
- Date: Wed, 6 Sep 2023 03:07:43 GMT
- Title: Efficient Training for Visual Tracking with Deformable Transformer
- Authors: Qingmao Wei, Guotian Zeng, Bi Zeng
- Abstract summary: We present DETRack, a streamlined end-to-end visual object tracking framework.
Our framework uses an efficient encoder-decoder structure in which the deformable transformer decoder acts as the target head.
For training, we introduce a novel one-to-many label assignment and an auxiliary denoising technique.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent Transformer-based visual tracking models have showcased superior
performance. Nevertheless, prior works have been resource-intensive, requiring
prolonged GPU training hours and incurring high GFLOPs during inference due to
inefficient training methods and convolution-based target heads. This intensive
resource use renders them unsuitable for real-world applications. In this
paper, we present DETRack, a streamlined end-to-end visual object tracking
framework. Our framework uses an efficient encoder-decoder structure in which
the deformable transformer decoder, acting as the target head, achieves higher
sparsity than traditional convolutional heads, resulting in lower GFLOPs. For
training, we introduce a novel one-to-many label assignment and an auxiliary
denoising technique, which together significantly accelerate the model's convergence.
Comprehensive experiments affirm the effectiveness and efficiency of our
proposed method. For instance, DETRack achieves 72.9% AO on the challenging
GOT-10k benchmark using only 20% of the training epochs required by the
baseline, and runs with lower GFLOPs than all other transformer-based trackers.
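The abstract does not spell out the assignment rule, but a generic one-to-many scheme can be sketched as follows: rather than matching a single decoder query to the ground-truth box (the one-to-one assignment of DETR-style models), the top-k queries by IoU all receive positive labels, densifying the supervision each epoch. The function and parameter names below are illustrative, not DETRack's.
```python
# Hypothetical sketch of one-to-many label assignment: the k queries whose
# predicted boxes best overlap the ground truth are all marked positive.
import torch
from torchvision.ops import box_iou

def one_to_many_assign(pred_boxes, gt_box, k=4):
    """pred_boxes: (num_queries, 4) xyxy; gt_box: (4,) xyxy."""
    ious = box_iou(pred_boxes, gt_box.unsqueeze(0)).squeeze(1)  # (num_queries,)
    topk = torch.topk(ious, k=k).indices
    labels = torch.zeros(pred_boxes.size(0), dtype=torch.long)
    labels[topk] = 1  # k positives instead of a single matched query
    return labels, topk

corners = torch.rand(16, 2) * 50
pred_boxes = torch.cat([corners, corners + torch.rand(16, 2) * 50], dim=1)
gt = torch.tensor([10.0, 10.0, 60.0, 60.0])
labels, pos_idx = one_to_many_assign(pred_boxes, gt, k=4)
```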
Related papers
- Visual Fourier Prompt Tuning [63.66866445034855]
We propose the Visual Fourier Prompt Tuning (VFPT) method as a general and effective solution for adapting large-scale transformer-based models.
Our approach incorporates the Fast Fourier Transform into prompt embeddings and harmoniously considers both spatial and frequency domain information.
Our results demonstrate that our approach outperforms current state-of-the-art baselines on two benchmarks.
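As an illustration of the idea (not the authors' implementation), frequency-domain information can be mixed into learnable prompts by passing a subset of prompt tokens through a 2D FFT before they are prepended to the patch sequence; the module name and `num_fourier` parameter below are assumptions.
```python
# Rough sketch of Fourier-augmented prompt tuning: a 2D FFT over the
# (token, channel) axes of a prompt block, keeping the real part so the
# result stays a real-valued tensor.
import torch
import torch.nn as nn

class FourierPrompt(nn.Module):
    def __init__(self, num_prompts=8, dim=768, num_fourier=4):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.num_fourier = num_fourier  # how many prompts pass through the FFT

    def forward(self, x):
        # x: (batch, seq_len, dim) patch embeddings
        p = self.prompts
        freq = torch.fft.fft2(p[: self.num_fourier]).real  # frequency-domain prompts
        prompts = torch.cat([freq, p[self.num_fourier:]], dim=0)
        prompts = prompts.unsqueeze(0).expand(x.size(0), -1, -1)
        return torch.cat([prompts, x], dim=1)  # prepend prompts to the sequence

tokens = torch.randn(2, 196, 768)
out = FourierPrompt()(tokens)  # (2, 204, 768)
```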
arXiv Detail & Related papers (2024-11-02T18:18:35Z)
- Investigating Low-Rank Training in Transformer Language Models: Efficiency and Scaling Analysis [16.253898272659242]
This study focuses on Transformer-based LLMs, specifically applying low-rank parametrization to feedforward networks (FFNs)
Experiments on the large RefinedWeb dataset show that low-rank parametrization is both efficient (e.g., a 2.6× FFN speed-up with 32% of the parameters) and effective during training.
Motivated by this finding, we develop wide and structured networks that surpass medium-sized and large Transformers in both perplexity and throughput.
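A minimal sketch of low-rank FFN parametrization, assuming a standard Transformer FFN: each dense weight matrix of shape d_in x d_out is factored into two thin matrices of shapes d_in x r and r x d_out, cutting parameters and FLOPs when the rank r is small. Dimensions below are illustrative.
```python
# Low-rank factorization of the two dense layers in a Transformer FFN.
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)  # d_in x r
        self.up = nn.Linear(rank, d_out, bias=True)    # r x d_out

    def forward(self, x):
        return self.up(self.down(x))

class LowRankFFN(nn.Module):
    def __init__(self, dim=768, hidden=3072, rank=128):
        super().__init__()
        self.fc1 = LowRankLinear(dim, hidden, rank)
        self.fc2 = LowRankLinear(hidden, dim, rank)
        self.act = nn.GELU()

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

x = torch.randn(2, 10, 768)
y = LowRankFFN()(x)  # same shape as x, far fewer FFN parameters
```
With dim=768, hidden=3072, and rank=128, the two factored layers hold roughly 1M parameters versus about 4.7M for the dense FFN.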
arXiv Detail & Related papers (2024-07-13T10:08:55Z)
- Exploring Dynamic Transformer for Efficient Object Tracking [58.120191254379854]
We propose DyTrack, a dynamic transformer framework for efficient tracking.
DyTrack automatically learns to configure suitable reasoning routes for different inputs, making better use of the available computational budget.
Experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model.
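The summary leaves the routing mechanism abstract; one common way to realize input-dependent routes is early exiting, sketched below with hypothetical names: a small gate after each block estimates whether the current features suffice, and confident inputs skip the remaining blocks.
```python
# Illustrative early-exit encoder (not DyTrack's actual architecture).
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    def __init__(self, dim=256, depth=6, threshold=0.9):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(depth)
        )
        self.gates = nn.ModuleList(nn.Linear(dim, 1) for _ in range(depth))
        self.threshold = threshold

    def forward(self, x):
        for block, gate in zip(self.blocks, self.gates):
            x = block(x)
            halt = torch.sigmoid(gate(x.mean(dim=1)))  # (batch, 1) confidence
            if halt.min() > self.threshold:            # all samples confident
                break                                   # skip remaining blocks
        return x

features = EarlyExitEncoder()(torch.randn(2, 64, 256))
```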
arXiv Detail & Related papers (2024-03-26T12:31:58Z)
- Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
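A hedged sketch of this general recipe (the modules below are stand-ins): the backbone runs under no_grad so no activations are kept for its backward pass, and only the small parallel head receives gradients.
```python
# Adaptation without backpropagating through the backbone.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())  # stand-in for a pretrained model
for p in backbone.parameters():
    p.requires_grad_(False)

adapter = nn.Linear(64, 10)  # lightweight parallel head, the only trained part

def forward(images):
    with torch.no_grad():       # no gradients stored for the backbone
        feats = backbone(images)
    return adapter(feats)       # gradients flow only through the adapter

logits = forward(torch.randn(4, 3, 32, 32))
logits.sum().backward()         # backward touches adapter parameters only
```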
arXiv Detail & Related papers (2024-02-05T10:55:47Z)
- SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding [131.0977050185209]
Selective Retraining (SiRi) can significantly outperform previous approaches on three popular benchmarks.
SiRi performs surprisingly well even with limited training data.
We also extend it to other transformer-based visual grounding models and additional vision-language tasks to verify its validity.
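The summary does not give the retraining schedule; a loose sketch of selective retraining in general, with hypothetical module names, is to reset a chosen submodule to fresh weights at fixed intervals while the rest of the model keeps its trained state.
```python
# Generic selective-retraining loop; the actual SiRi schedule and the choice
# of which parameters to reset are defined in the paper.
import torch.nn as nn

def reinit_module(module: nn.Module):
    for m in module.modules():
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            m.reset_parameters()  # fresh random weights for this submodule

model = nn.ModuleDict({
    "encoder": nn.Linear(256, 256),   # kept across rounds
    "head": nn.Linear(256, 4),        # selectively re-initialized
})

for round_idx in range(3):
    # ... train the full model for some epochs here ...
    reinit_module(model["head"])      # retrain a fresh head on a trained encoder
```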
arXiv Detail & Related papers (2022-07-27T07:01:01Z)
- Online Convolutional Re-parameterization [51.97831675242173]
We present Online Convolutional Re-parameterization (OREPA), a two-stage pipeline that aims to reduce the huge training overhead by squeezing a complex training-time block into a single convolution.
Compared with state-of-the-art re-parameterization models, OREPA reduces training-time memory cost by about 70% and accelerates training by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
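OREPA's block-squeezing relies on structural re-parameterization, which can be illustrated in a few lines (a generic sketch, not OREPA's specific blocks): parallel linear branches such as a 3x3 and a 1x1 convolution collapse into one equivalent 3x3 convolution by padding and summing their kernels.
```python
# Merging two parallel conv branches into a single equivalent convolution.
import torch
import torch.nn.functional as F

c = 8
w3 = torch.randn(c, c, 3, 3)             # 3x3 branch weights
w1 = torch.randn(c, c, 1, 1)             # 1x1 branch weights
w_merged = w3 + F.pad(w1, [1, 1, 1, 1])  # embed the 1x1 kernel at the center

x = torch.randn(2, c, 16, 16)
two_branch = F.conv2d(x, w3, padding=1) + F.conv2d(x, w1)
one_conv = F.conv2d(x, w_merged, padding=1)
assert torch.allclose(two_branch, one_conv, atol=1e-4)  # outputs are identical
```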
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
- Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline that incurs no additional computational cost.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
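Assuming the straightforward reading of "length re-scaling" (the exact ALR formula is in the paper), the sketch below rescales each predicted novel-class weight vector so its norm matches the average norm of the pretrained base-class weights.
```python
# Rescale novel classifier weights to the scale of the base weights,
# avoiding a norm mismatch at the classification layer.
import torch

def rescale_novel_weights(novel_w, base_w):
    """novel_w: (n_novel, d); base_w: (n_base, d)."""
    base_norm = base_w.norm(dim=1).mean()           # target vector length
    novel_norm = novel_w.norm(dim=1, keepdim=True)  # per-class current length
    return novel_w * (base_norm / novel_norm)

base = torch.randn(60, 512) * 2.0   # pretrained base classifier weights
novel = torch.randn(5, 512) * 0.3   # predicted novel weights, smaller norms
novel_scaled = rescale_novel_weights(novel, base)
```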
arXiv Detail & Related papers (2022-03-23T06:24:31Z)
- ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers [31.908276711898548]
Methods for data-efficient recognition from body poses increasingly leverage skeleton sequences structured as image-like arrays.
We look at this paradigm from the perspective of transformer networks, for the first time exploring visual transformers as data-efficient encoders of skeleton movement.
In our pipeline, body pose sequences cast as image-like representations are converted into patch embeddings and then passed to a visual transformer backbone optimized with deep metric learning.
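A minimal sketch of that input pipeline, with example shapes: a skeleton sequence of frames x joints x 3 coordinates is cast as a 3-channel image-like array, resized, and cut into ViT-style patch embeddings.
```python
# Skeleton sequence -> image-like array -> patch embeddings.
import torch
import torch.nn as nn

frames, joints = 64, 25
pose_seq = torch.rand(frames, joints, 3)             # xyz coordinates in [0, 1]
image_like = pose_seq.permute(2, 0, 1).unsqueeze(0)  # (1, 3, 64, 25) "image"
image_like = nn.functional.interpolate(image_like, size=(64, 64))

# Non-overlapping patch embedding, as in a standard ViT stem.
patch_embed = nn.Conv2d(3, 384, kernel_size=16, stride=16)
patches = patch_embed(image_like).flatten(2).transpose(1, 2)  # (1, 16, 384)
```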
arXiv Detail & Related papers (2022-02-23T11:11:54Z)
- EF-Train: Enable Efficient On-device CNN Training on FPGA Through Data Reshaping for Online Adaptation or Personalization [11.44696439060875]
EF-Train is an efficient DNN training accelerator with a unified channel-level parallelism-based convolution kernel.
It can achieve end-to-end training on resource-limited low-power edge-level FPGAs.
Our design achieves a throughput of 46.99 GFLOPS and an energy efficiency of 6.09 GFLOPS/W.
arXiv Detail & Related papers (2022-02-18T18:27:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.