Transformer Tracking
- URL: http://arxiv.org/abs/2103.15436v1
- Date: Mon, 29 Mar 2021 09:06:55 GMT
- Title: Transformer Tracking
- Authors: Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang and Huchuan Lu
- Abstract summary: Correlation plays a critical role in the tracking field, especially in popular Siamese-based trackers.
This work presents a novel attention-based feature fusion network, which effectively combines the template and search region features solely using attention.
Experiments show that our TransT achieves very promising results on six challenging datasets.
- Score: 76.96796612225295
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Correlation plays a critical role in the tracking field, especially in
recent popular Siamese-based trackers. The correlation operation is a simple
fusion method for computing the similarity between the template and the search
region. However, the correlation operation itself is a local linear matching
process, so it easily loses semantic information and falls into local optima,
which may be the bottleneck of designing high-accuracy tracking
algorithms. Is there any better feature fusion method than correlation? To
address this issue, inspired by Transformer, this work presents a novel
attention-based feature fusion network, which effectively combines the template
and search region features solely using attention. Specifically, the proposed
method includes an ego-context augment module based on self-attention and a
cross-feature augment module based on cross-attention. Finally, we present a
Transformer tracking (named TransT) method based on the Siamese-like feature
extraction backbone, the designed attention-based fusion mechanism, and the
classification and regression head. Experiments show that our TransT achieves
very promising results on six challenging datasets, especially on large-scale
LaSOT, TrackingNet, and GOT-10k benchmarks. Our tracker runs at approximately
50 fps on GPU. Code and models are available at
https://github.com/chenxin-dlut/TransT.
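As a rough illustration of the fusion design described in the abstract (not the authors' code), the sketch below pairs a self-attention "ego-context augment" step with a cross-attention "cross-feature augment" step using standard PyTorch modules. The class names, dimensions, single-layer structure, and toy tensor shapes are illustrative assumptions; the official implementation at https://github.com/chenxin-dlut/TransT remains the reference.

```python
# Minimal sketch of attention-based template/search fusion in the spirit of
# TransT. All names and shapes here are illustrative assumptions, not the
# authors' implementation.
import torch
import torch.nn as nn


class EgoContextAugment(nn.Module):
    """Self-attention over one branch's feature tokens (template or search)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        out, _ = self.attn(x, x, x)   # query, key, and value all come from x
        return self.norm(x + out)     # residual connection + layer norm


class CrossFeatureAugment(nn.Module):
    """Cross-attention: queries from one branch, keys/values from the other."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, q_feat, kv_feat):
        out, _ = self.attn(q_feat, kv_feat, kv_feat)
        return self.norm(q_feat + out)


# Toy usage with flattened backbone feature maps (batch, tokens, channels).
template = torch.randn(1, 64, 256)    # e.g. an 8x8 template feature map
search = torch.randn(1, 256, 256)     # e.g. a 16x16 search-region feature map
eca, cfa = EgoContextAugment(), CrossFeatureAugment()
template, search = eca(template), eca(search)   # augment each branch separately
search = cfa(search, template)                  # search tokens attend to the template
# `search` would then feed a classification and regression head for prediction.
```

In the paper's full model the fusion modules are stacked for several rounds before the prediction head; a single round is shown here only to illustrate how attention alone can replace the correlation operation.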
Related papers
- Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers [55.46413719810273]
Rich spatio-temporal information is crucial for capturing the complicated target appearance in visual tracking.
Our method improves the tracker's performance on six popular tracking benchmarks.
arXiv Detail & Related papers (2024-03-15T02:39:26Z)
- Separable Self and Mixed Attention Transformers for Efficient Object Tracking [3.9160947065896803]
This paper proposes an efficient self and mixed attention transformer-based architecture for lightweight tracking.
With these contributions, the proposed lightweight tracker deploys a transformer-based backbone and head module concurrently for the first time.
Simulations show that our Separable Self and Mixed Attention-based Tracker, SMAT, surpasses the performance of related lightweight trackers on GOT10k, TrackingNet, LaSOT, NfS30, UAV123, and AVisT datasets.
arXiv Detail & Related papers (2023-09-07T19:23:02Z)
- Compact Transformer Tracker with Correlative Masked Modeling [16.234426179567837]
The Transformer framework has shown superior performance in visual object tracking.
Recent advances focus on exploring attention mechanism variants for better information aggregation.
In this paper, we prove that the vanilla self-attention structure is sufficient for information aggregation.
arXiv Detail & Related papers (2023-01-26T04:58:08Z)
- Revisiting Color-Event based Tracking: A Unified Network, Dataset, and Metric [53.88188265943762]
We propose a single-stage backbone network for Color-Event Unified Tracking (CEUTrack), which achieves the above functions simultaneously.
Our proposed CEUTrack is simple, effective, and efficient, achieving over 75 FPS and new SOTA performance.
arXiv Detail & Related papers (2022-11-20T16:01:31Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- SparseTT: Visual Tracking with Sparse Transformers [43.1666514605021]
The self-attention mechanism, designed to model long-range dependencies, is the key to the success of Transformers.
In this paper, we relieve this issue with a sparse attention mechanism by focusing on the most relevant information in the search regions.
We introduce a double-head predictor to boost the accuracy of foreground-background classification and regression of target bounding boxes.
arXiv Detail & Related papers (2022-05-08T04:00:28Z)
- High-Performance Transformer Tracking [74.07751002861802]
We present a Transformer tracking (named TransT) method based on the Siamese-like feature extraction backbone, the designed attention-based fusion mechanism, and the classification and regression head.
Experiments show that our TransT and TransT-M methods achieve promising results on seven popular datasets.
arXiv Detail & Related papers (2022-03-25T09:33:29Z)
- MFGNet: Dynamic Modality-Aware Filter Generation for RGB-T Tracking [72.65494220685525]
We propose a new dynamic modality-aware filter generation module (named MFGNet) to boost the message communication between visible and thermal data.
We generate dynamic modality-aware filters with two independent networks. The visible and thermal filters are then used to perform a dynamic convolutional operation on their corresponding input feature maps.
To address issues caused by heavy occlusion, fast motion, and out-of-view, we propose to conduct a joint local and global search by exploiting a new direction-aware target-driven attention mechanism.
arXiv Detail & Related papers (2021-07-22T03:10:51Z)
- TrTr: Visual Tracking with Transformer [29.415900191169587]
We propose a novel tracker network based on a powerful attention mechanism, the Transformer encoder-decoder architecture.
We design the classification and regression heads using the output of the Transformer to localize the target based on shape-agnostic anchors.
Our method performs favorably against state-of-the-art algorithms.
arXiv Detail & Related papers (2021-05-09T02:32:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.