Unified Single-Stage Transformer Network for Efficient RGB-T Tracking
- URL: http://arxiv.org/abs/2308.13764v1
- Date: Sat, 26 Aug 2023 05:09:57 GMT
- Title: Unified Single-Stage Transformer Network for Efficient RGB-T Tracking
- Authors: Jianqiang Xia, DianXi Shi, Ke Song, Linna Song, XiaoLei Wang,
Songchang Jin, Li Zhou, Yu Cheng, Lei Jin, Zheng Zhu, Jianan Li, Gang Wang,
Junliang Xing, Jian Zhao
- Abstract summary: We propose a single-stage Transformer RGB-T tracking network, namely USTrack, which unifies the above three stages into a single ViT (Vision Transformer) backbone.
With this structure, the network can extract fusion features of the template and search region under the mutual interaction of modalities.
Experiments on three popular RGB-T tracking benchmarks demonstrate that our method achieves new state-of-the-art performance while maintaining the fastest inference speed 84.2FPS.
- Score: 47.88113335927079
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing RGB-T tracking networks extract modality features in a separate
manner, which lacks interaction and mutual guidance between modalities. This
limits the network's ability to adapt to the diverse dual-modality appearances
of targets and the dynamic relationships between the modalities. Additionally,
the three-stage fusion tracking paradigm followed by these networks
significantly restricts the tracking speed. To overcome these problems, we
propose a unified single-stage Transformer RGB-T tracking network, namely
USTrack, which unifies the above three stages into a single ViT (Vision
Transformer) backbone with a dual embedding layer through self-attention
mechanism. With this structure, the network can extract fusion features of the
template and search region under the mutual interaction of modalities.
Simultaneously, relation modeling is performed between these features,
efficiently obtaining the search region fusion features with better
target-background discriminability for prediction. Furthermore, we introduce a
novel feature selection mechanism based on modality reliability to mitigate the
influence of invalid modalities for prediction, further improving the tracking
performance. Extensive experiments on three popular RGB-T tracking benchmarks
demonstrate that our method achieves new state-of-the-art performance while
maintaining the fastest inference speed of 84.2 FPS. In particular, MPR/MSR on the
short-term and long-term subsets of the VTUAV dataset increase by 11.1%/11.7% and
11.3%/9.7%, respectively.
Related papers
- Cross Fusion RGB-T Tracking with Bi-directional Adapter [8.425592063392857]
We propose a novel Cross Fusion RGB-T Tracking architecture (CFBT).
The effectiveness of CFBT relies on three newly designed cross-temporal information fusion modules.
Experiments on three popular RGB-T tracking benchmarks demonstrate that our method achieves new state-of-the-art performance.
arXiv Detail & Related papers (2024-08-30T02:45:56Z)
- X Modality Assisting RGBT Object Tracking [36.614908357546035]
We propose a novel X Modality Assisting Network (X-Net) to shed light on the impact of the fusion paradigm.
To tackle the feature learning hurdles stemming from significant differences between RGB and thermal modalities, a plug-and-play pixel-level generation module (PGM) is proposed.
We also propose a feature-level interaction module (FIM) that incorporates a mixed feature interaction transformer and a spatial-dimensional feature translation strategy.
arXiv Detail & Related papers (2023-12-27T05:38:54Z)
- RGB-T Tracking Based on Mixed Attention [5.151994214135177]
RGB-T tracking uses images from both the visible and thermal modalities.
This paper proposes an RGB-T tracker based on a mixed attention mechanism to achieve complementary fusion of the modalities.
arXiv Detail & Related papers (2023-04-09T15:59:41Z)
- Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
arXiv Detail & Related papers (2023-03-14T02:58:27Z)
- Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in Driving Scenes [82.4186966781934]
We introduce a simple, efficient, and effective two-stage detector, termed as Ret3D.
At the core of Ret3D is the utilization of novel intra-frame and inter-frame relation modules.
With negligible extra overhead, Ret3D achieves the state-of-the-art performance.
arXiv Detail & Related papers (2022-08-18T03:48:58Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework [76.70603443624012]
We propose a novel one-stream tracking (OSTrack) framework that unifies feature learning and relation modeling.
In this way, discriminative target-oriented features can be dynamically extracted by mutual guidance.
OSTrack achieves state-of-the-art performance on multiple benchmarks, in particular, it shows impressive results on the one-shot tracking benchmark GOT-10k.
arXiv Detail & Related papers (2022-03-22T18:37:11Z)
- Temporal Aggregation for Adaptive RGBT Tracking [14.00078027541162]
We propose an RGBT tracker that takes temporal clues into account for robust appearance model learning.
Unlike most existing RGBT trackers, which perform tracking with only spatial information, this method further incorporates temporal information.
arXiv Detail & Related papers (2022-01-22T02:31:56Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
- Parameter Sharing Exploration and Hetero-Center based Triplet Loss for Visible-Thermal Person Re-Identification [17.402673438396345]
This paper focuses on the visible-thermal cross-modality person re-identification (VT Re-ID) task.
Our proposed method distinctly outperforms the state-of-the-art methods by large margins.
arXiv Detail & Related papers (2020-08-14T07:40:35Z)