TrTr: Visual Tracking with Transformer
- URL: http://arxiv.org/abs/2105.03817v1
- Date: Sun, 9 May 2021 02:32:28 GMT
- Title: TrTr: Visual Tracking with Transformer
- Authors: Moju Zhao and Kei Okada and Masayuki Inaba
- Abstract summary: We propose a novel tracker network based on a powerful attention mechanism, the Transformer encoder-decoder architecture.
We design the classification and regression heads using the output of the Transformer to localize the target based on shape-agnostic anchors.
Our method performs favorably against state-of-the-art algorithms.
- Score: 29.415900191169587
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Template-based discriminative trackers are currently the dominant tracking
methods due to their robustness and accuracy, and the Siamese-network-based
methods that depend on cross-correlation operation between features extracted
from template and search images show the state-of-the-art tracking performance.
However, the general cross-correlation operation can only capture the relationship
between local patches in two feature maps. In this paper, we propose a novel
tracker network based on a powerful attention mechanism, the Transformer
encoder-decoder architecture, to capture global and rich contextual
interdependencies. In this new architecture, the features of the template image are
processed by a self-attention module in the encoder part to learn strong
context information, which is then sent to the decoder part to compute
cross-attention with the search image features processed by another
self-attention module. In addition, we design the classification and regression
heads using the output of the Transformer to localize the target based on
shape-agnostic anchors. We extensively evaluate our tracker, TrTr, on the VOT2018,
VOT2019, OTB-100, UAV, NfS, TrackingNet, and LaSOT benchmarks, and our method
performs favorably against state-of-the-art algorithms. Training code and
pretrained models are available at https://github.com/tongtybj/TrTr.
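The encoder-decoder flow described above maps naturally onto standard Transformer modules. The following is a minimal sketch of that flow using PyTorch's built-in Transformer layers; the layer counts, head shapes, and the class name TrTrSketch are illustrative assumptions, not the authors' implementation, which is available in the linked repository.

```python
# Minimal sketch of the described encoder-decoder attention flow:
# template features pass through encoder self-attention, search features
# pass through the decoder (self-attention + cross-attention with the
# encoded template), and classification/regression heads read the output.
import torch
import torch.nn as nn


class TrTrSketch(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=2):
        super().__init__()
        # Encoder: self-attention over flattened template features.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Decoder: self-attention over search features plus
        # cross-attention with the encoded template features.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        # Heads: per-location classification and box regression
        # (output sizes here are assumptions, not the paper's exact heads).
        self.cls_head = nn.Linear(d_model, 2)   # target / background
        self.reg_head = nn.Linear(d_model, 4)   # box offsets per location

    def forward(self, template_feat, search_feat):
        # template_feat: (HW_t, B, C) flattened template feature map
        # search_feat:   (HW_s, B, C) flattened search feature map
        memory = self.encoder(template_feat)
        decoded = self.decoder(search_feat, memory)
        return self.cls_head(decoded), self.reg_head(decoded)


# Usage with dummy flattened feature maps (sequence-first layout).
template = torch.randn(64, 1, 256)   # e.g. an 8x8 template grid
search = torch.randn(256, 1, 256)    # e.g. a 16x16 search grid
cls_logits, box_offsets = TrTrSketch()(template, search)
print(cls_logits.shape, box_offsets.shape)  # (256, 1, 2) (256, 1, 4)
```

Because every search-region location attends to every template location in the cross-attention step, this design captures the global interdependencies that a local cross-correlation window cannot.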
Related papers
- With a Little Help from your own Past: Prototypical Memory Networks for
Image Captioning [47.96387857237473]
We devise a network which can perform attention over activations obtained while processing other training samples.
Our memory models the distribution of past keys and values through the definition of prototype vectors.
We demonstrate that our proposal can increase the performance of an encoder-decoder Transformer by 3.7 CIDEr points both when training in cross-entropy only and when fine-tuning with self-critical sequence training.
arXiv Detail & Related papers (2023-08-23T18:53:00Z) - Compact Transformer Tracker with Correlative Masked Modeling [16.234426179567837]
The Transformer framework has been showing superior performance in visual object tracking.
Recent advances focus on exploring attention mechanism variants for better information aggregation.
In this paper, we prove that the vanilla self-attention structure is sufficient for information aggregation.
arXiv Detail & Related papers (2023-01-26T04:58:08Z) - Joint Spatial-Temporal and Appearance Modeling with Transformer for
Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z) - High-Performance Transformer Tracking [74.07751002861802]
We present a Transformer tracking method (named TransT) based on a Siamese-like feature extraction backbone, an attention-based fusion mechanism, and a classification and regression head.
Experiments show that our TransT and TransT-M methods achieve promising results on seven popular datasets.
arXiv Detail & Related papers (2022-03-25T09:33:29Z) - Learning Tracking Representations via Dual-Branch Fully Transformer
Networks [82.21771581817937]
We present a Siamese-like dual-branch network based solely on Transformers for tracking.
We extract a feature vector for each patch based on its matching results with others within an attention window.
The method achieves results better than or comparable to those of the best-performing methods.
arXiv Detail & Related papers (2021-12-05T13:44:33Z) - Learning Dynamic Compact Memory Embedding for Deformable Visual Object
Tracking [82.34356879078955]
We propose a compact memory embedding to enhance the discrimination of the segmentation-based deformable visual tracking method.
Our method outperforms excellent segmentation-based trackers, i.e., D3S and SiamMask, on the DAVIS 2017 benchmark.
arXiv Detail & Related papers (2021-11-23T03:07:12Z) - Transformer Tracking [76.96796612225295]
Correlation plays a critical role in the tracking field, especially in popular Siamese-based trackers.
This work presents a novel attention-based feature fusion network, which effectively combines the template and search region features solely using attention.
Experiments show that our TransT achieves very promising results on six challenging datasets.
arXiv Detail & Related papers (2021-03-29T09:06:55Z) - Learning Spatio-Appearance Memory Network for High-Performance Visual
Tracking [79.80401607146987]
Existing object trackers usually learn a bounding-box-based template to match visual targets across frames, which cannot accurately learn a pixel-wise representation.
This paper presents a novel segmentation-based tracking architecture, which is equipped with a spatio-appearance memory network to learn accurate spatio-temporal correspondence.
arXiv Detail & Related papers (2020-09-21T08:12:02Z)