Learning Spatial-Frequency Transformer for Visual Object Tracking
- URL: http://arxiv.org/abs/2208.08829v1
- Date: Thu, 18 Aug 2022 13:46:12 GMT
- Title: Learning Spatial-Frequency Transformer for Visual Object Tracking
- Authors: Chuanming Tang, Xiao Wang, Yuanchao Bai, Zhe Wu, Jianlin Zhang,
Yongmei Huang
- Abstract summary: Recent trackers adopt the Transformer to combine with or replace the widely used ResNet as their backbone network.
We believe these operations ignore the spatial prior of the target object, which may lead to sub-optimal results.
We propose a unified Spatial-Frequency Transformer that models the Gaussian spatial Prior and High-frequency emphasis Attention (GPHA) simultaneously.
- Score: 15.750739748843744
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent trackers adopt the Transformer to combine with or replace the
widely used ResNet as their backbone network. Although these trackers work well
in regular scenarios, they simply flatten the 2D features into a sequence to
fit the Transformer, which ignores the spatial prior of the target object and
may lead to sub-optimal results. In addition, many works demonstrate that
self-attention acts as a low-pass filter, independent of the input features and
keys/queries; that is, it suppresses the high-frequency components of the input
features while preserving or even amplifying the low-frequency information. To
handle these issues, we propose a unified Spatial-Frequency Transformer that
models a Gaussian spatial Prior and High-frequency emphasis Attention (GPHA)
simultaneously. Specifically, the Gaussian spatial prior is generated by dual
Multi-Layer Perceptrons (MLPs) and injected into the similarity matrix produced
by multiplying the Query and Key features in self-attention. The result is fed
into a Softmax layer and then decomposed into two components, i.e., the direct
signal and the high-frequency signal. The low- and high-pass branches are
rescaled and recombined to approximate an all-pass filter, so high-frequency
features are preserved through the stacked self-attention layers. We further
integrate the Spatial-Frequency Transformer into the Siamese tracking framework
and propose a novel tracking algorithm, termed SFTransT. A cross-scale-fusion
SwinTransformer is adopted as the backbone, and a multi-head cross-attention
module boosts the interaction between the search and template features. The
output is fed into the tracking head for target localization. Extensive
experiments on both short-term and long-term tracking benchmarks demonstrate
the effectiveness of the proposed framework.
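The GPHA mechanism described in the abstract can be summarized in a short sketch. The PyTorch-style code below is an illustrative reconstruction from the abstract only, not the authors' implementation: it assumes the Gaussian spatial prior is an additive bias on the attention logits whose per-head bandwidths are predicted by the dual MLPs, and that the frequency decomposition splits the softmax attention map into a uniform direct (low-pass) component plus a high-frequency residual before rescaling and recombining them. The class and parameter names (GPHASelfAttention, low_scale, high_scale, etc.) are hypothetical.

```python
# Hedged sketch of the GPHA idea (assumptions noted in comments; not the authors' code).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class GPHASelfAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # "Dual MLPs" (assumed form): predict per-head Gaussian bandwidths along height/width.
        self.mlp_h = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_heads))
        self.mlp_w = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_heads))
        # Learnable rescaling of the low- and high-pass branches.
        self.low_scale = nn.Parameter(torch.ones(num_heads))
        self.high_scale = nn.Parameter(torch.ones(num_heads))

    def forward(self, x, hw):
        # x: (B, N, C) flattened tokens; hw = (H, W) of the original 2D feature map, N = H * W.
        B, N, C = x.shape
        H, W = hw
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        # Similarity matrix from Query and Key features.
        logits = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)  # (B, heads, N, N)

        # Gaussian spatial prior: penalize attention between spatially distant tokens.
        ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float().to(x.device)
        dy = coords[:, 0:1] - coords[:, 0:1].t()                      # (N, N) row offsets
        dx = coords[:, 1:2] - coords[:, 1:2].t()                      # (N, N) column offsets
        sigma_h = F.softplus(self.mlp_h(x.mean(dim=1))) + 1e-6        # (B, heads)
        sigma_w = F.softplus(self.mlp_w(x.mean(dim=1))) + 1e-6        # (B, heads)
        prior = -(dy ** 2)[None, None] / (2 * sigma_h[..., None, None] ** 2) \
                - (dx ** 2)[None, None] / (2 * sigma_w[..., None, None] ** 2)
        attn = F.softmax(logits + prior, dim=-1)

        # Frequency decomposition of the attention map (assumed form):
        # uniform averaging is the direct / low-pass part, the residual is high-pass.
        dc = torch.full_like(attn, 1.0 / N)
        hc = attn - dc
        attn = self.low_scale.view(1, -1, 1, 1) * dc + self.high_scale.view(1, -1, 1, 1) * hc

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

With both rescaling weights at 1, the module reduces to standard softmax attention with a Gaussian positional bias; letting the high-frequency weight grow is what allows stacked layers to retain high-frequency detail rather than behaving as a pure low-pass filter.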
Related papers
- A Hybrid Transformer-Mamba Network for Single Image Deraining [70.64069487982916]
Existing deraining Transformers employ self-attention mechanisms with fixed-range windows or along channel dimensions.
We introduce a novel dual-branch hybrid Transformer-Mamba network, denoted as TransMamba, aimed at effectively capturing long-range rain-related dependencies.
arXiv Detail & Related papers (2024-08-31T10:03:19Z)
- U-shaped Transformer: Retain High Frequency Context in Time Series Analysis [0.5710971447109949]
In this paper, we consider the low-pass characteristics of transformers and try to incorporate their advantages.
We introduce patch merge and split operations to extract features at different scales, and use larger datasets to fully exploit the transformer backbone.
Our experiments demonstrate that the model performs at an advanced level across multiple datasets with relatively low cost.
arXiv Detail & Related papers (2023-07-18T07:15:26Z)
- STMixer: A One-Stage Sparse Action Detector [48.0614066856134]
We propose a new one-stage action detector, termed STMixer.
We present a query-based adaptive feature sampling module, which endows our STMixer with the flexibility of mining a set of discriminative video features.
We obtain the state-of-the-art results on the datasets of AVA, UCF101-24, and JHMDB.
arXiv Detail & Related papers (2023-03-28T10:47:06Z)
- Multi-Scale Wavelet Transformer for Face Forgery Detection [43.33712402517951]
We propose a multi-scale wavelet transformer framework for face forgery detection.
Frequency-based spatial attention is designed to guide the spatial feature extractor to concentrate more on forgery traces.
Cross-modality attention is proposed to fuse the frequency features with the spatial features.
arXiv Detail & Related papers (2022-10-08T03:39:36Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- Inception Transformer [151.939077819196]
Inception Transformer, or iFormer, learns comprehensive features with both high- and low-frequency information in visual data.
We benchmark the iFormer on a series of vision tasks, and showcase that it achieves impressive performance on image classification, COCO detection and ADE20K segmentation.
arXiv Detail & Related papers (2022-05-25T17:59:54Z)
- High-Performance Transformer Tracking [74.07751002861802]
We present a Transformer tracking (named TransT) method based on the Siamese-like feature extraction backbone, the designed attention-based fusion mechanism, and the classification and regression head.
Experiments show that our TransT and TransT-M methods achieve promising results on seven popular datasets.
arXiv Detail & Related papers (2022-03-25T09:33:29Z)
- MFGNet: Dynamic Modality-Aware Filter Generation for RGB-T Tracking [72.65494220685525]
We propose a new dynamic modality-aware filter generation module (named MFGNet) to boost the message communication between visible and thermal data.
We generate dynamic modality-aware filters with two independent networks. The visible and thermal filters will be used to conduct a dynamic convolutional operation on their corresponding input feature maps respectively.
To address issues caused by heavy occlusion, fast motion, and out-of-view targets, we propose to conduct a joint local and global search by exploiting a new direction-aware target-driven attention mechanism.
arXiv Detail & Related papers (2021-07-22T03:10:51Z)
- Transformer Tracking [76.96796612225295]
Correlation plays a critical role in the tracking field, especially in popular Siamese-based trackers.
This work presents a novel attention-based feature fusion network, which effectively combines the template and search region features solely using attention.
Experiments show that our TransT achieves very promising results on six challenging datasets.
arXiv Detail & Related papers (2021-03-29T09:06:55Z)
- Efficient Two-Stream Network for Violence Detection Using Separable Convolutional LSTM [0.0]
We propose an efficient two-stream deep learning architecture leveraging Separable Convolutional LSTM (SepConvLSTM) and pre-trained MobileNet.
SepConvLSTM is constructed by replacing convolution operation at each gate of ConvLSTM with a depthwise separable convolution.
Our model improves accuracy on the larger and more challenging RWF-2000 dataset by more than a 2% margin.
arXiv Detail & Related papers (2021-02-21T12:01:48Z)
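The SepConvLSTM construction summarized in the last entry (replacing the convolution at each ConvLSTM gate with a depthwise separable convolution) can be sketched roughly as follows. This is a generic illustration of that technique under assumed layer sizes and names, not the authors' implementation.

```python
# Minimal sketch of a ConvLSTM cell with depthwise-separable gate convolutions.
import torch
import torch.nn as nn


class SepConv2d(nn.Module):
    """Depthwise-separable convolution: per-channel spatial conv followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


class SepConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hidden_ch, kernel_size=3):
        super().__init__()
        # One separable convolution produces all four gate pre-activations at once.
        self.gates = SepConv2d(in_ch + hidden_ch, 4 * hidden_ch, kernel_size)
        self.hidden_ch = hidden_ch

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)   # update cell state
        h = o * torch.tanh(c)           # update hidden state
        return h, c
```

The depthwise-separable gates keep the recurrent spatial modeling of a ConvLSTM while cutting the per-gate parameter and compute cost, which is the efficiency argument made in that paper.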