High-Performance Transformer Tracking
- URL: http://arxiv.org/abs/2203.13533v1
- Date: Fri, 25 Mar 2022 09:33:29 GMT
- Title: High-Performance Transformer Tracking
- Authors: Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Huchuan Lu
- Abstract summary: We present a Transformer tracking method (named TransT) built on a Siamese-like feature extraction backbone, an attention-based fusion mechanism, and a classification and regression head.
Experiments show that our TransT and TransT-M methods achieve promising results on seven popular datasets.
- Score: 74.07751002861802
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Correlation plays a critical role in the tracking field, especially in recent
popular Siamese-based trackers. The correlation operation is a simple fusion scheme that
considers the similarity between the template and the search region. However, correlation
is a local linear matching process that loses semantic information and easily falls into
local optima, which may be the bottleneck in designing high-accuracy tracking algorithms.
In this work, to determine whether a better feature fusion method than correlation exists,
we present a novel attention-based feature fusion network inspired by the Transformer.
This network effectively combines the template and search region features using attention.
Specifically, the proposed method includes an ego-context augment module based on
self-attention and a cross-feature augment module based on cross-attention. First, we
present a Transformer tracking method (named TransT) based on a Siamese-like feature
extraction backbone, the designed attention-based fusion mechanism, and a classification
and regression head. Based on the TransT baseline, we further design a segmentation branch
to generate an accurate mask. Finally, we extend TransT with a multi-template design and
an IoU prediction head to obtain a stronger version, named TransT-M. Experiments show that
our TransT and TransT-M methods achieve promising results on seven popular datasets. Code
and models are available at https://github.com/chenxin-dlut/TransT-M.
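The fusion described in the abstract amounts to repeated self-attention within each branch (ego-context augment) followed by cross-attention between the template and search branches (cross-feature augment). Below is a minimal sketch of that pattern, assuming PyTorch; the module names, dimensions, and layer structure here are illustrative assumptions, not the authors' exact TransT implementation (which is available in the repository linked above).

```python
# Minimal sketch of the attention-based fusion described above:
# ego-context augment (self-attention) + cross-feature augment (cross-attention).
# Dimensions, heads, and names are illustrative, not the authors' exact code.
import torch
import torch.nn as nn


class EgoContextAugment(nn.Module):
    """Self-attention over one branch (template or search region)."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (B, N, dim)
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)               # residual + norm


class CrossFeatureAugment(nn.Module):
    """Cross-attention: queries from one branch, keys/values from the other."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, q_feat, kv_feat):         # (B, Nq, dim), (B, Nkv, dim)
        out, _ = self.attn(q_feat, kv_feat, kv_feat)
        return self.norm(q_feat + out)


class FusionLayer(nn.Module):
    """One round of ego-context + cross-feature augmentation for both branches."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.eca_t = EgoContextAugment(dim, heads)
        self.eca_s = EgoContextAugment(dim, heads)
        self.cfa_t = CrossFeatureAugment(dim, heads)
        self.cfa_s = CrossFeatureAugment(dim, heads)

    def forward(self, template, search):
        template, search = self.eca_t(template), self.eca_s(search)
        return self.cfa_t(template, search), self.cfa_s(search, template)


if __name__ == "__main__":
    layer = FusionLayer()
    z = torch.randn(2, 64, 256)    # template tokens  (e.g. an 8x8 feature map)
    x = torch.randn(2, 256, 256)   # search tokens    (e.g. a 16x16 feature map)
    z, x = layer(z, x)
    print(z.shape, x.shape)        # torch.Size([2, 64, 256]) torch.Size([2, 256, 256])
```

Stacking a few such fusion layers and passing the augmented search-region tokens to classification and regression heads mirrors the overall pipeline the abstract describes.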
Related papers
- Hierarchical Cross-modal Transformer for RGB-D Salient Object Detection [6.385624548310884]
We propose the Hierarchical Cross-modal Transformer (HCT), a new multi-modal transformer, to tackle this problem.
Unlike previous multi-modal transformers that directly connect all patches from the two modalities, we explore the cross-modal complementarity hierarchically.
We present a Feature Pyramid module for Transformer (FPT) to boost informative cross-scale integration as well as a consistency-complementarity module to disentangle the multi-modal integration path.
arXiv Detail & Related papers (2023-02-16T03:23:23Z)
- OST: Efficient One-stream Network for 3D Single Object Tracking in Point Clouds [6.661881950861012]
We propose a novel one-stream network with the strength of instance-level encoding, which avoids the correlation operations used in previous Siamese networks.
The proposed method achieves strong performance not only for class-specific tracking but also for class-agnostic tracking, with less computation and higher efficiency.
arXiv Detail & Related papers (2022-10-16T12:31:59Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
- TransCMD: Cross-Modal Decoder Equipped with Transformer for RGB-D Salient Object Detection [86.94578023985677]
In this work, we rethink this task from the perspective of global information alignment and transformation.
Specifically, the proposed method (TransCMD) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path.
Experimental results on seven RGB-D SOD benchmark datasets demonstrate that a simple two-stream encoder-decoder framework can surpass the state-of-the-art purely CNN-based methods.
arXiv Detail & Related papers (2021-12-04T15:45:34Z)
- TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers [6.205844084751411]
We present TransMVSNet, based on our exploration of feature matching in multi-view stereo (MVS).
We propose a powerful Feature Matching Transformer (FMT) to leverage intra- (self-) and inter- (cross-) attention to aggregate long-range context information.
Our method achieves state-of-the-art performance on the DTU dataset, the Tanks and Temples benchmark, and the BlendedMVS dataset.
arXiv Detail & Related papers (2021-11-29T15:31:49Z)
- TrTr: Visual Tracking with Transformer [29.415900191169587]
We propose a novel tracker network based on a powerful attention architecture, the Transformer encoder-decoder.
We design classification and regression heads on the Transformer output to localize the target with shape-agnostic anchors; a minimal head sketch follows this entry.
Our method performs favorably against state-of-the-art algorithms.
arXiv Detail & Related papers (2021-05-09T02:32:28Z)
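Both TrTr and TransT attach lightweight classification and regression heads to the fused Transformer output, predicting a foreground score and a normalized box at every search-region location rather than relying on predefined anchor shapes. The following is a minimal sketch of such heads, assuming PyTorch; the layer sizes and the two-head split are illustrative, not either paper's exact design.

```python
# Minimal sketch of anchor-free classification and regression heads applied to
# fused per-location Transformer features. Sizes are illustrative assumptions.
import torch
import torch.nn as nn


def mlp(in_dim, hidden, out_dim, layers=3):
    """Small MLP head applied independently at each feature location."""
    mods, dims = [], [in_dim] + [hidden] * (layers - 1) + [out_dim]
    for i in range(layers):
        mods.append(nn.Linear(dims[i], dims[i + 1]))
        if i < layers - 1:
            mods.append(nn.ReLU(inplace=True))
    return nn.Sequential(*mods)


class TrackingHead(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.cls_head = mlp(dim, dim, 2)   # foreground / background per location
        self.reg_head = mlp(dim, dim, 4)   # normalized (cx, cy, w, h) per location

    def forward(self, fused):              # fused search features: (B, N, dim)
        cls_logits = self.cls_head(fused)          # (B, N, 2)
        boxes = self.reg_head(fused).sigmoid()     # (B, N, 4), values in [0, 1]
        return cls_logits, boxes


if __name__ == "__main__":
    head = TrackingHead()
    feats = torch.randn(2, 256, 256)       # e.g. 16x16 search-region locations
    cls_logits, boxes = head(feats)
    print(cls_logits.shape, boxes.shape)   # (2, 256, 2) (2, 256, 4)
```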
- TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking [74.82415271960315]
We propose a solution named TransMOT to efficiently model the spatial and temporal interactions among objects in a video.
TransMOT is not only more computationally efficient than the traditional Transformer, but it also achieves better tracking accuracy.
The proposed method is evaluated on multiple benchmark datasets including MOT15, MOT16, MOT17, and MOT20.
arXiv Detail & Related papers (2021-04-01T01:49:05Z)
- Transformer Tracking [76.96796612225295]
Correlation plays a critical role in the tracking field, especially in popular Siamese-based trackers.
This work presents a novel attention-based feature fusion network, which effectively combines the template and search region features solely using attention.
Experiments show that our TransT achieves very promising results on six challenging datasets.
arXiv Detail & Related papers (2021-03-29T09:06:55Z)