Divert More Attention to Vision-Language Tracking
- URL: http://arxiv.org/abs/2207.01076v1
- Date: Sun, 3 Jul 2022 16:38:24 GMT
- Title: Divert More Attention to Vision-Language Tracking
- Authors: Mingzhe Guo, Zhipeng Zhang, Heng Fan, Liping Jing
- Abstract summary: We show that pure ConvNets are still competitive, and even better, while being more economical and easier to train for achieving SOTA tracking.
Our solution is to unleash the power of multimodal vision-language (VL) tracking, simply using ConvNets.
We show that our unified-adaptive VL representation, learned purely with the ConvNets, is a simple yet strong alternative to Transformer visual features.
- Score: 33.6802730856683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Relying on Transformers for complex visual feature learning, object tracking has reached a new standard for state-of-the-art (SOTA) performance. However, this advancement is accompanied by larger training datasets and longer training periods, making tracking increasingly expensive. In this paper, we demonstrate that reliance on Transformers is not necessary and that pure ConvNets are still competitive, and even better, while being more economical and easier to train for achieving SOTA tracking. Our solution is to unleash the power of multimodal vision-language (VL) tracking, simply using ConvNets. The essence lies in learning novel unified-adaptive VL representations with our modality mixer (ModaMixer) and asymmetrical ConvNet search. We show that our unified-adaptive VL representation, learned purely with ConvNets, is a simple yet strong alternative to Transformer visual features, improving a CNN-based Siamese tracker by 14.5% SUC on the challenging LaSOT benchmark (from 50.7% to 65.2%) and even outperforming several Transformer-based SOTA trackers. Besides empirical results, we theoretically analyze our approach to support its effectiveness. By revealing the potential of VL representations, we expect the community to divert more attention to VL tracking, and we hope to open more possibilities for future tracking beyond Transformers. Code and models will be released at
https://github.com/JudasDie/SOTS.
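
For readers unfamiliar with how a language description can be injected into purely convolutional features, the following is a minimal sketch of a ModaMixer-style block: a sentence embedding channel-wise modulates a ConvNet feature map from a Siamese backbone. The module name, dimensions, and modulation scheme here are illustrative assumptions, not the authors' released implementation (see https://github.com/JudasDie/SOTS for the official code).

```python
# Minimal sketch (assumption): a ModaMixer-style block that injects a language
# embedding into a ConvNet feature map via channel-wise modulation.
# This is NOT the authors' implementation; see https://github.com/JudasDie/SOTS.
import torch
import torch.nn as nn


class ModaMixerSketch(nn.Module):
    def __init__(self, vis_channels: int, lang_dim: int):
        super().__init__()
        # Project the sentence embedding to per-channel scale and shift factors.
        self.to_scale = nn.Linear(lang_dim, vis_channels)
        self.to_shift = nn.Linear(lang_dim, vis_channels)
        # Light fusion conv so that modulated channels can interact spatially.
        self.fuse = nn.Sequential(
            nn.Conv2d(vis_channels, vis_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(vis_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, vis_feat: torch.Tensor, lang_emb: torch.Tensor) -> torch.Tensor:
        # vis_feat: (B, C, H, W) ConvNet features; lang_emb: (B, D) sentence embedding.
        scale = self.to_scale(lang_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        shift = self.to_shift(lang_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        mixed = vis_feat * torch.sigmoid(scale) + shift  # language-conditioned features
        # Residual connection preserves the visual signal when language is uninformative.
        return vis_feat + self.fuse(mixed)


# Usage: mix a BERT-style sentence embedding (dim 768) into a 256-channel backbone stage.
mixer = ModaMixerSketch(vis_channels=256, lang_dim=768)
vis = torch.randn(2, 256, 31, 31)
lang = torch.randn(2, 768)
out = mixer(vis, lang)  # (2, 256, 31, 31)
```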
Related papers
- Divert More Attention to Vision-Language Object Tracking [87.31882921111048]
We argue that the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning motivate us to design a more effective vision-language representation for tracking.
Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos.
We then introduce a novel framework to improve tracking by learning a unified-adaptive VL representation, where the core components are the proposed asymmetric architecture search and modality mixer (ModaMixer).
arXiv Detail & Related papers (2023-07-19T15:22:06Z)
- EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention-based vision transformers (ViTs) have emerged as a highly competitive alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
- ViT-P: Rethinking Data-efficient Vision Transformers from Locality [9.515925867530262]
We make vision transformers as data-efficient as convolutional neural networks by introducing multi-focal attention bias.
Inspired by the attention distance in a well-trained ViT, we constrain the self-attention of ViT to have a multi-scale localized receptive field.
On CIFAR-100, our ViT-P Base model achieves state-of-the-art accuracy (83.16%) when trained from scratch.
arXiv Detail & Related papers (2022-03-04T14:49:48Z)
- SwinTrack: A Simple and Strong Baseline for Transformer Tracking [81.65306568735335]
We propose a fully attention-based Transformer tracking algorithm, the Swin-Transformer Tracker (SwinTrack).
SwinTrack uses Transformer for both feature extraction and feature fusion, allowing full interactions between the target object and the search region for tracking.
In our thorough experiments, SwinTrack sets a new record with 0.717 SUC on LaSOT, surpassing STARK by 4.6% while still running at 45 FPS.
arXiv Detail & Related papers (2021-12-02T05:56:03Z)
- An Empirical Study of Training End-to-End Vision-and-Language Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), and multimodal fusion (e.g., merged attention vs. co-attention).
arXiv Detail & Related papers (2021-11-03T17:55:36Z)
- ConvNets vs. Transformers: Whose Visual Representations are More Transferable? [49.62201738334348]
We investigate the transfer learning ability of ConvNets and vision transformers in 15 single-task and multi-task performance evaluations.
We observe consistent advantages of Transformer-based backbones on 13 downstream tasks.
arXiv Detail & Related papers (2021-08-11T16:20:38Z)
- Rethinking the Design Principles of Robust Vision Transformer [28.538786330184642]
Vision Transformers (ViTs) have shown that self-attention-based networks can surpass traditional convolutional neural networks (CNNs) in most vision tasks.
In this paper, we rethink the design principles of ViTs from the perspective of robustness.
By combining the robust design components, we propose the Robust Vision Transformer (RVT).
arXiv Detail & Related papers (2021-05-17T15:04:15Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
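
The LaSOT figures quoted above (65.2% for the ConvNet-based VL tracker, 0.717 for SwinTrack) are SUC scores, i.e., the area under the success curve that plots the fraction of frames whose predicted box overlaps the ground truth above an IoU threshold, swept over thresholds in [0, 1]. Below is a minimal sketch of that computation; it is illustrative and not taken from the LaSOT evaluation toolkit or either paper's code.

```python
# Minimal sketch (illustrative, not the official LaSOT toolkit): SUC is the area
# under the success curve, i.e., the fraction of frames whose predicted-vs-ground-truth
# IoU exceeds a threshold, averaged over thresholds in [0, 1].
import numpy as np


def iou(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    # Boxes as (N, 4) arrays of [x, y, w, h].
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-9)


def success_score(pred: np.ndarray, gt: np.ndarray, n_thresholds: int = 21) -> float:
    # Mean success rate over evenly spaced IoU thresholds approximates the AUC (SUC).
    overlaps = iou(pred, gt)
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    success = [(overlaps > t).mean() for t in thresholds]
    return float(np.mean(success))
```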