TrTr: Visual Tracking with Transformer
- URL: http://arxiv.org/abs/2105.03817v1
- Date: Sun, 9 May 2021 02:32:28 GMT
- Title: TrTr: Visual Tracking with Transformer
- Authors: Moju Zhao and Kei Okada and Masayuki Inaba
- Abstract summary: We propose a novel tracker network based on a powerful attention mechanism, the Transformer encoder-decoder architecture.
We design the classification and regression heads using the output of the Transformer to localize the target based on shape-agnostic anchors.
Our method performs favorably against state-of-the-art algorithms.
- Score: 29.415900191169587
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Template-based discriminative trackers are currently the dominant tracking
methods due to their robustness and accuracy, and the Siamese-network-based
methods that depend on cross-correlation operation between features extracted
from template and search images show the state-of-the-art tracking performance.
However, the general cross-correlation operation can only capture the relationship
between local patches in two feature maps. In this paper, we propose a novel
tracker network based on a powerful attention mechanism, the Transformer
encoder-decoder architecture, to capture global and rich contextual
interdependencies. In this new architecture, the features of the template image are
processed by a self-attention module in the encoder part to learn strong
context information, which is then sent to the decoder part to compute
cross-attention with the search image features processed by another
self-attention module. In addition, we design the classification and regression
heads using the output of the Transformer to localize the target based on
shape-agnostic anchors. We extensively evaluate our tracker, TrTr, on the VOT2018,
VOT2019, OTB-100, UAV, NfS, TrackingNet, and LaSOT benchmarks, and our method
performs favorably against state-of-the-art algorithms. Training code and
pretrained models are available at https://github.com/tongtybj/TrTr.
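The encoder-decoder flow described above maps naturally onto standard Transformer modules. The following is a minimal sketch of that flow using PyTorch's built-in Transformer layers; the layer counts, head shapes, and the class name TrTrSketch are illustrative assumptions, not the authors' implementation, which is available in the linked repository.

```python
# Minimal sketch of the described encoder-decoder attention flow:
# template features pass through encoder self-attention, search features
# pass through the decoder (self-attention + cross-attention with the
# encoded template), and classification/regression heads read the output.
import torch
import torch.nn as nn


class TrTrSketch(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=2):
        super().__init__()
        # Encoder: self-attention over flattened template features.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Decoder: self-attention over search features plus
        # cross-attention with the encoded template features.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        # Heads: per-location classification and box regression
        # (output sizes here are assumptions, not the paper's exact heads).
        self.cls_head = nn.Linear(d_model, 2)   # target / background
        self.reg_head = nn.Linear(d_model, 4)   # box offsets per location

    def forward(self, template_feat, search_feat):
        # template_feat: (HW_t, B, C) flattened template feature map
        # search_feat:   (HW_s, B, C) flattened search feature map
        memory = self.encoder(template_feat)
        decoded = self.decoder(search_feat, memory)
        return self.cls_head(decoded), self.reg_head(decoded)


# Usage with dummy flattened feature maps (sequence-first layout).
template = torch.randn(64, 1, 256)   # e.g. an 8x8 template grid
search = torch.randn(256, 1, 256)    # e.g. a 16x16 search grid
cls_logits, box_offsets = TrTrSketch()(template, search)
print(cls_logits.shape, box_offsets.shape)  # (256, 1, 2) (256, 1, 4)
```

Because every search-region location attends to every template location in the cross-attention step, this design captures the global interdependencies that a local cross-correlation window cannot.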
Related papers
- With a Little Help from your own Past: Prototypical Memory Networks for
Image Captioning [47.96387857237473]
We devise a network which can perform attention over activations obtained while processing other training samples.
Our memory models the distribution of past keys and values through the definition of prototype vectors.
We demonstrate that our proposal can increase the performance of an encoder-decoder Transformer by 3.7 CIDEr points both when training in cross-entropy only and when fine-tuning with self-critical sequence training.
arXiv Detail & Related papers (2023-08-23T18:53:00Z) - Compact Transformer Tracker with Correlative Masked Modeling [16.234426179567837]
The Transformer framework has been showing superior performance in visual object tracking.
Recent advances focus on exploring attention mechanism variants for better information aggregation.
In this paper, we prove that the vanilla self-attention structure is sufficient for information aggregation.
arXiv Detail & Related papers (2023-01-26T04:58:08Z) - Joint Spatial-Temporal and Appearance Modeling with Transformer for
Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z) - High-Performance Transformer Tracking [74.07751002861802]
We present a Transformer tracking method (named TransT) based on a Siamese-like feature extraction backbone, an attention-based fusion mechanism, and a classification and regression head.
Experiments show that our TransT and TransT-M methods achieve promising results on seven popular datasets.
arXiv Detail & Related papers (2022-03-25T09:33:29Z) - Learning Tracking Representations via Dual-Branch Fully Transformer
Networks [82.21771581817937]
We present a Siamese-like dual-branch network based solely on Transformers for tracking.
We extract a feature vector for each patch based on its matching results with others within an attention window.
The method achieves results better than or comparable to those of the best-performing methods.
arXiv Detail & Related papers (2021-12-05T13:44:33Z) - Learning Dynamic Compact Memory Embedding for Deformable Visual Object
Tracking [82.34356879078955]
We propose a compact memory embedding to enhance the discrimination of the segmentation-based deformable visual tracking method.
Our method outperforms excellent segmentation-based trackers, i.e., D3S and SiamMask, on the DAVIS 2017 benchmark.
arXiv Detail & Related papers (2021-11-23T03:07:12Z) - Transformer Tracking [76.96796612225295]
Correlation plays a critical role in the tracking field, especially in popular Siamese-based trackers.
This work presents a novel attention-based feature fusion network, which effectively combines the template and search region features solely using attention.
Experiments show that our TransT achieves very promising results on six challenging datasets.
arXiv Detail & Related papers (2021-03-29T09:06:55Z) - Learning Spatio-Appearance Memory Network for High-Performance Visual
Tracking [79.80401607146987]
Existing object trackers usually learn a bounding-box-based template to match visual targets across frames, which cannot accurately learn a pixel-wise representation.
This paper presents a novel segmentation-based tracking architecture, which is equipped with a spatio-appearance memory network to learn accurate spatio-temporal correspondence.
arXiv Detail & Related papers (2020-09-21T08:12:02Z)