MixFormerV2: Efficient Fully Transformer Tracking
- URL: http://arxiv.org/abs/2305.15896v2
- Date: Wed, 7 Feb 2024 12:20:21 GMT
- Title: MixFormerV2: Efficient Fully Transformer Tracking
- Authors: Yutao Cui, Tianhui Song, Gangshan Wu and Limin Wang
- Abstract summary: Transformer-based trackers have achieved strong accuracy on the standard benchmarks.
But their efficiency remains an obstacle to practical deployment on both GPU and CPU platforms.
We propose a fully transformer tracking framework, coined as MixFormerV2, without any dense convolutional operations or complex score prediction module.
- Score: 49.07428299165031
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based trackers have achieved strong accuracy on the standard
benchmarks. However, their efficiency remains an obstacle to practical
deployment on both GPU and CPU platforms. In this paper, to overcome this
issue, we propose a fully transformer tracking framework, coined as
MixFormerV2, without any dense convolutional operations or complex score
prediction module. Our key design is to introduce four special prediction
tokens and concatenate them with the tokens from the target template and search
areas. Then, we apply a unified transformer backbone to this mixed token
sequence. These prediction tokens are able to capture the complex correlation
between target template and search area via mixed attentions. Based on them, we
can easily predict the tracking box and estimate its confidence score through
simple MLP heads. To further improve the efficiency of MixFormerV2, we present
a new distillation-based model reduction paradigm, including dense-to-sparse
distillation and deep-to-shallow distillation. The former transfers
knowledge from the dense-head-based MixViT to our fully transformer tracker,
while the latter prunes some layers of the backbone. We
instantiate two types of MixFormerV2: MixFormerV2-B achieves an AUC
of 70.6% on LaSOT and an AUC of 57.4% on TNL2K with a high GPU speed of 165
FPS, and MixFormerV2-S surpasses FEAR-L by 2.7% AUC on LaSOT with a
real-time CPU speed.
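To make the token-mixing design described in the abstract more concrete, the sketch below renders it in PyTorch. It is not the authors' implementation: the `PredictionTokenTracker` and `distillation_losses` names, the use of a vanilla `nn.TransformerEncoder` as a stand-in for the mixed-attention backbone, the mean-pooling of the four prediction tokens, the layer counts and dimensions, and the L1/MSE distillation objectives are all illustrative assumptions.

```python
# Minimal sketch of the prediction-token idea from the abstract (not the
# official MixFormerV2 code). A vanilla TransformerEncoder stands in for the
# mixed-attention backbone; dimensions, depth and heads are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PredictionTokenTracker(nn.Module):
    def __init__(self, dim=768, depth=8, num_heads=12, num_pred_tokens=4):
        super().__init__()
        # Four learnable prediction tokens, concatenated with template/search tokens.
        self.pred_tokens = nn.Parameter(torch.zeros(1, num_pred_tokens, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        # Simple MLP heads on top of the prediction tokens: box and confidence.
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 4))
        self.score_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, template_tokens, search_tokens):
        # template_tokens: (B, N_t, dim), search_tokens: (B, N_s, dim)
        b = search_tokens.size(0)
        pred = self.pred_tokens.expand(b, -1, -1)
        # One mixed sequence lets attention model the template-search
        # correlation and lets the prediction tokens aggregate it.
        tokens = torch.cat([pred, template_tokens, search_tokens], dim=1)
        tokens = self.backbone(tokens)
        pred = tokens[:, :pred.size(1)].mean(dim=1)      # pooled prediction tokens
        box = self.box_head(pred).sigmoid()              # normalized (cx, cy, w, h)
        score = self.score_head(pred).sigmoid()          # confidence in [0, 1]
        return box, score


def distillation_losses(student_box, teacher_box, student_feats, teacher_feats):
    """Generic stand-ins for the two distillation stages mentioned in the
    abstract; the paper's actual objectives may differ."""
    # Dense-to-sparse: the sparse-token student imitates the dense-head teacher's box output.
    dense_to_sparse = F.l1_loss(student_box, teacher_box)
    # Deep-to-shallow: the pruned, shallower student mimics selected teacher features.
    deep_to_shallow = sum(F.mse_loss(s, t) for s, t in zip(student_feats, teacher_feats))
    return dense_to_sparse, deep_to_shallow
```

A dummy forward pass such as `PredictionTokenTracker()(torch.randn(1, 49, 768), torch.randn(1, 256, 768))` returns a normalized box and a confidence score, mirroring the role the abstract assigns to the prediction tokens and the simple MLP heads.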
Related papers
- Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules [96.21649779507831]
We propose a novel architecture dubbed mixture-of-modules (MoM).
MoM is motivated by an intuition that any layer, regardless of its position, can be used to compute a token.
We show that MoM provides not only a unified framework for Transformers but also a flexible and learnable approach for reducing redundancy.
arXiv Detail & Related papers (2024-07-09T08:50:18Z) - SDPose: Tokenized Pose Estimation via Circulation-Guide Self-Distillation [53.675725490807615]
We introduce SDPose, a new self-distillation method for improving the performance of small transformer-based models.
SDPose-T obtains 69.7% mAP with 4.4M parameters and 1.8 GFLOPs, while SDPose-S-V2 obtains 73.5% mAP on the MSCOCO validation dataset.
arXiv Detail & Related papers (2024-04-04T15:23:14Z) - SCHEME: Scalable Channel Mixer for Vision Transformers [52.605868919281086]
Vision Transformers have achieved impressive performance in many vision tasks.
Much less research has been devoted to the channel mixer or feature-mixing block (FFN or MLP).
We show that the dense connections can be replaced with a diagonal block structure that supports larger expansion ratios.
arXiv Detail & Related papers (2023-12-01T08:22:34Z) - Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation [73.31524865643709]
We present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D pose estimation from videos.
Our HoT begins with pruning pose tokens of redundant frames and ends with recovering full-length tokens, resulting in a few pose tokens in the intermediate transformer blocks.
Our method can achieve both high efficiency and estimation accuracy compared to the original VPT models.
arXiv Detail & Related papers (2023-11-20T18:59:51Z) - Separable Self and Mixed Attention Transformers for Efficient Object Tracking [3.9160947065896803]
This paper proposes an efficient self and mixed attention transformer-based architecture for lightweight tracking.
With these contributions, the proposed lightweight tracker deploys a transformer-based backbone and head module concurrently for the first time.
Simulations show that our Separable Self and Mixed Attention-based Tracker, SMAT, surpasses the performance of related lightweight trackers on GOT10k, TrackingNet, LaSOT, NfS30, UAV123, and AVisT datasets.
arXiv Detail & Related papers (2023-09-07T19:23:02Z) - Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: Context-Sharing Transformer (CST) and Semantic Gathering-Scattering Transformer (SGST).
CST learns the global-shared contextual information within image frames with a lightweight computation; SGST models the semantic correlation separately for the foreground and background.
Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours significantly increases the speed by 13 times and achieves new state-of-the-art ZVOS performance.
arXiv Detail & Related papers (2023-08-13T06:12:00Z) - ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer [6.473688838974095]
We propose a new type of multiplication-reduced model, dubbed ShiftAddViT, to achieve end-to-end inference speedups on GPUs.
Experiments on various 2D/3D vision tasks consistently validate the effectiveness of our proposed ShiftAddViT.
arXiv Detail & Related papers (2023-06-10T13:53:41Z) - Learning Spatial-Frequency Transformer for Visual Object Tracking [15.750739748843744]
Recent trackers adopt the Transformer to combine or replace the widely used ResNet as their new backbone network.
We believe these operations ignore the spatial prior of the target object which may lead to sub-optimal results.
We propose a unified Spatial-Frequency Transformer that models the Gaussian spatial Prior and High-frequency emphasis Attention (GPHA) simultaneously.
arXiv Detail & Related papers (2022-08-18T13:46:12Z) - Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection [39.37861288287621]
A MIM pre-trained vanilla ViT can work surprisingly well in the challenging object-level recognition scenario.
A random compact convolutional stem supplants the pre-trained large kernel patchify stem.
The proposed detector, named MIMDet, enables a MIM pre-trained vanilla ViT to outperform the hierarchical Swin Transformer by 2.3 box AP and 2.5 mask AP on COCO.
arXiv Detail & Related papers (2022-04-06T17:59:04Z) - DoT: An efficient Double Transformer for NLP tasks with tables [3.0079490585515343]
DoT is a double transformer model that decomposes the problem into two sub-tasks.
We show that for a small drop of accuracy, DoT improves training and inference time by at least 50%.
arXiv Detail & Related papers (2021-06-01T13:33:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.