Learning Tracking Representations via Dual-Branch Fully Transformer Networks
- URL: http://arxiv.org/abs/2112.02571v1
- Date: Sun, 5 Dec 2021 13:44:33 GMT
- Title: Learning Tracking Representations via Dual-Branch Fully Transformer Networks
- Authors: Fei Xie, Chunyu Wang, Guangting Wang, Wankou Yang, Wenjun Zeng
- Abstract summary: We present a Siamese-like Dual-branch network based solely on Transformers for tracking.
We extract a feature vector for each patch based on its matching results with others within an attention window.
The method achieves results better than or comparable to those of the best-performing methods.
- Score: 82.21771581817937
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a Siamese-like Dual-branch network based solely on
Transformers for tracking. Given a template and a search image, we divide them
into non-overlapping patches and extract a feature vector for each patch based
on its matching results with the others within an attention window. For each
token, we estimate whether it contains the target object and the corresponding
size. The advantage of the approach is that the features are learned from
matching, and ultimately, for matching, so they are aligned with the object
tracking task. The method achieves results better than or comparable to those
of the best-performing methods, which first use a CNN to extract features and
then a Transformer to fuse them. It outperforms the state-of-the-art methods
on the GOT-10k and VOT2020 benchmarks. In addition, the method achieves
real-time inference speed (about $40$ fps) on one GPU. The code and models
will be released.
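As a rough illustration of the approach, here is a minimal PyTorch sketch, assuming a plain ViT-style encoder with global attention in place of the paper's windowed attention; the module names, dimensions, and head design are illustrative, not the authors' released code.

```python
# A minimal sketch of the idea above: template and search patches become
# tokens, attend to one another in a shared encoder, and each search token
# predicts a target score and a size. Global attention is used here for
# brevity; the paper restricts matching to local attention windows.
import torch
import torch.nn as nn

class PatchMatchTracker(nn.Module):
    def __init__(self, patch=16, dim=256, heads=8, depth=4):
        super().__init__()
        # Non-overlapping patch embedding: one token per patch.
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Per-token heads: target presence and normalized box size (w, h).
        self.cls_head = nn.Linear(dim, 1)
        self.size_head = nn.Linear(dim, 2)

    def forward(self, template, search):
        z = self.embed(template).flatten(2).transpose(1, 2)  # (B, Nz, C)
        x = self.embed(search).flatten(2).transpose(1, 2)    # (B, Nx, C)
        # Features are learned from matching: all tokens attend jointly.
        tokens = self.encoder(torch.cat([z, x], dim=1))
        x_tok = tokens[:, z.shape[1]:]                       # search tokens
        return self.cls_head(x_tok).sigmoid(), self.size_head(x_tok)

scores, sizes = PatchMatchTracker()(torch.rand(1, 3, 128, 128),
                                    torch.rand(1, 3, 256, 256))
```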
Related papers
- Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance [87.19164603145056]
We propose LoRAT, a method that unveils the power of large ViT models for tracking within laboratory-level resources.
The essence of our work lies in adapting LoRA, a technique that fine-tunes a small subset of model parameters without adding inference latency.
We design an anchor-free head solely based on MLP to adapt PETR, enabling better performance with less computational overhead.
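For context, a minimal sketch of the LoRA mechanism the summary refers to; the rank, scaling, and layer choice are assumptions, not LoRAT's actual configuration.

```python
# A minimal sketch of a LoRA adapter on one linear layer: only the low-rank
# factors A and B are trained, and their product can be folded into the
# frozen weight after training, so no inference latency is added.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pretrained weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

    @torch.no_grad()
    def merge(self):
        # Fold the update into the base weight: W <- W + scale * B @ A.
        self.base.weight += self.scale * (self.B @ self.A)

layer = LoRALinear(nn.Linear(768, 768))
y = layer(torch.rand(2, 768))
layer.merge()  # after training: adapter cost disappears at inference
```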
arXiv Detail & Related papers (2024-03-08T11:41:48Z)
- Revisiting Color-Event based Tracking: A Unified Network, Dataset, and Metric [53.88188265943762]
We propose a single-stage backbone network for Color-Event Unified Tracking (CEUTrack), which achieves the above functions simultaneously.
Our proposed CEUTrack is simple, effective, and efficient, which achieves over 75 FPS and new SOTA performance.
arXiv Detail & Related papers (2022-11-20T16:01:31Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- Green Hierarchical Vision Transformer for Masked Image Modeling [54.14989750044489]
We present an efficient approach for Masked Image Modeling with hierarchical Vision Transformers (ViTs).
We design a Group Window Attention scheme following the divide-and-conquer strategy.
We further improve the grouping strategy via a dynamic programming algorithm that minimizes the overall cost of attention on the grouped patches, as sketched below.
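A toy version of such a dynamic program is sketched below, assuming a simple cost model (padding within each group plus a fixed per-group overhead) that is our illustration, not the paper's exact formulation.

```python
# Toy dynamic program for the grouping step: windows holding different
# numbers of visible patches are padded to a common size within each group,
# and each group pays a fixed batching overhead. Both the cost model and
# the constant below are assumptions for illustration only.
OVERHEAD = 64  # assumed fixed cost per grouped attention call

def group_windows(sizes):
    """Split sorted window sizes into contiguous groups, minimizing
    sum over groups of OVERHEAD + (#windows) * (max size)^2."""
    sizes = sorted(sizes)
    n = len(sizes)
    best = [0.0] + [float("inf")] * n  # best[i]: min cost of first i windows
    cut = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):             # last group spans windows j..i-1
            cost = best[j] + OVERHEAD + (i - j) * sizes[i - 1] ** 2
            if cost < best[i]:
                best[i], cut[i] = cost, j
    groups, i = [], n
    while i > 0:                       # walk the cut points back
        groups.append(sizes[cut[i]:i])
        i = cut[i]
    return groups[::-1], best[n]

print(group_windows([3, 3, 4, 9, 10, 10]))
# ([[3, 3, 4], [9, 10, 10]], 476.0): padding small and large windows
# separately beats one big padded batch (664.0).
```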
arXiv Detail & Related papers (2022-05-26T17:34:42Z)
- MatchFormer: Interleaving Attention in Transformers for Feature Matching [31.175513306917654]
We propose a novel hierarchical extract-and-match transformer, termed MatchFormer.
We interleave self-attention for feature extraction and cross-attention for feature matching, enabling a human-intuitive extract-and-match scheme (sketched below).
Thanks to this strategy, MatchFormer is a multi-win solution in efficiency, robustness, and precision.
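A minimal sketch of one such interleaved block, with illustrative dimensions:

```python
# Self-attention extracts features within each image; cross-attention
# matches tokens across the two images. Dimensions and the number of
# blocks are illustrative assumptions.
import torch
import torch.nn as nn

class InterleavedBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, a, b):
        # Extract: tokens of each image attend to themselves.
        a = self.norm1(a + self.self_attn(a, a, a)[0])
        b = self.norm1(b + self.self_attn(b, b, b)[0])
        # Match: each image's tokens attend to the other image's tokens.
        a2 = self.norm2(a + self.cross_attn(a, b, b)[0])
        b2 = self.norm2(b + self.cross_attn(b, a, a)[0])
        return a2, b2

a, b = torch.rand(1, 100, 256), torch.rand(1, 120, 256)
for block in [InterleavedBlock() for _ in range(4)]:
    a, b = block(a, b)   # extract and match, interleaved at every stage
```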
arXiv Detail & Related papers (2022-03-17T22:49:14Z)
- TrTr: Visual Tracking with Transformer [29.415900191169587]
We propose a novel tracker network based on the powerful attention mechanism of the Transformer encoder-decoder architecture.
We design classification and regression heads on the output of the Transformer to localize the target based on shape-agnostic anchors.
Our method performs favorably against state-of-the-art algorithms.
arXiv Detail & Related papers (2021-05-09T02:32:28Z)
- Transformer Tracking [76.96796612225295]
Correlation plays a critical role in the tracking field, especially in popular Siamese-based trackers.
This work presents a novel attention-based feature fusion network, which effectively combines the template and search-region features using attention alone; see the sketch below.
Experiments show that our TransT achieves very promising results on six challenging datasets.
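To make the contrast concrete, here is a minimal, illustrative comparison of classic Siamese correlation with an attention-style fusion in the spirit of TransT; shapes are assumptions.

```python
# (a) The fixed correlation of classic Siamese trackers versus
# (b) attention fusion, where every search token adaptively aggregates
# template information instead of taking one sliding-window inner product.
import torch
import torch.nn.functional as F

z = torch.rand(1, 256, 8, 8)    # template features
x = torch.rand(1, 256, 32, 32)  # search-region features

# (a) Correlation: the template acts as a fixed convolution kernel.
response = F.conv2d(x, z)       # (1, 1, 25, 25) similarity map

# (b) Attention fusion between search queries and template keys/values.
q = x.flatten(2).transpose(1, 2)   # (1, 1024, 256) queries from the search
kv = z.flatten(2).transpose(1, 2)  # (1, 64, 256) keys/values from template
attn = torch.softmax(q @ kv.transpose(1, 2) / 256 ** 0.5, dim=-1)
fused = attn @ kv                  # (1, 1024, 256) fused features
```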
arXiv Detail & Related papers (2021-03-29T09:06:55Z)
- Single Object Tracking through a Fast and Effective Single-Multiple Model Convolutional Neural Network [0.0]
Recent state-of-the-art (SOTA) approaches rely on a matching network with a heavy structure to distinguish the target from other objects in the scene.
In this article, an architecture is proposed that, in contrast to previous approaches, identifies the object location in a single shot.
The presented tracker performs comparably with the SOTA in challenging situations while running far faster (up to $120$ FPS on a 1080Ti).
arXiv Detail & Related papers (2021-03-28T11:02:14Z)
- Multiple Convolutional Features in Siamese Networks for Object Tracking [13.850110645060116]
Multiple Features-Siamese Tracker (MFST) is a novel tracking algorithm that exploits several hierarchical feature maps for robust tracking.
MFST achieves high tracking accuracy and outperforms the standard Siamese tracker on object tracking benchmarks; a minimal sketch of the multi-feature idea follows.
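A minimal sketch of the multi-feature idea, with a toy backbone standing in for MFST's pretrained features:

```python
# Correlation responses from several hierarchical feature levels are
# resized to a common resolution and combined with learned weights. The
# toy backbone and softmax weighting are assumptions, not MFST's design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiFeatureSiamese(nn.Module):
    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        self.stages = nn.ModuleList()
        c_in = 3
        for c in channels:  # toy hierarchy; MFST uses pretrained CNN layers
            self.stages.append(nn.Sequential(
                nn.Conv2d(c_in, c, 3, stride=2, padding=1), nn.ReLU()))
            c_in = c
        self.weights = nn.Parameter(torch.ones(len(channels)))

    def forward(self, template, search):
        maps = []
        for stage in self.stages:
            template, search = stage(template), stage(search)
            r = F.conv2d(search, template)          # per-level correlation
            maps.append(F.interpolate(r, size=(17, 17)))
        w = torch.softmax(self.weights, dim=0)      # learned level weights
        return sum(wi * m for wi, m in zip(w, maps))

response = MultiFeatureSiamese()(torch.rand(1, 3, 64, 64),
                                 torch.rand(1, 3, 128, 128))
```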
arXiv Detail & Related papers (2021-03-01T08:02:27Z)