Transformer RGBT Tracking with Spatio-Temporal Multimodal Tokens
- URL: http://arxiv.org/abs/2401.01674v1
- Date: Wed, 3 Jan 2024 11:16:38 GMT
- Title: Transformer RGBT Tracking with Spatio-Temporal Multimodal Tokens
- Authors: Dengdi Sun, Yajie Pan, Andong Lu, Chenglong Li, Bin Luo
- Abstract summary: We propose a novel Transformer RGBT tracking approach, which mixes spatio-temporal multimodal tokens from the static multimodal templates and multimodal search regions in Transformer to handle target appearance changes.
Our module is inserted into the transformer backbone network and inherits joint feature extraction, search-template matching, and cross-modal interaction.
Experiments on three RGBT benchmark datasets show that the proposed approach maintains competitive performance compared to other state-of-the-art tracking algorithms.
- Score: 13.608089918718797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many RGBT tracking researches primarily focus on modal fusion design, while
overlooking the effective handling of target appearance changes. While some
approaches introduce historical frames or fuse and replace the initial template to
incorporate temporal information, they risk disrupting the original target
appearance and accumulating errors over time. To alleviate
these limitations, we propose a novel Transformer RGBT tracking approach, which
mixes spatio-temporal multimodal tokens from the static multimodal templates
and multimodal search regions in Transformer to handle target appearance
changes, for robust RGBT tracking. We introduce independent dynamic template
tokens to interact with the search region, embedding temporal information to
address appearance changes, while retaining the initial static template tokens in
the joint feature extraction process. This preserves the original, reliable target
appearance information and prevents the deviations from the target appearance that
traditional temporal updates can cause. We also use attention mechanisms to enhance the target features of
multimodal template tokens by incorporating supplementary modal cues, and let
the multimodal search region tokens interact with the multimodal dynamic template
tokens via attention, conveying the multimodal-enhanced target change
information. Our module is inserted into the
transformer backbone network and inherits joint feature extraction,
search-template matching, and cross-modal interaction. Extensive experiments on
three RGBT benchmark datasets show that the proposed approach maintains
competitive performance compared to other state-of-the-art tracking algorithms
while running at 39.1 FPS.
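
As a rough illustration of the token interaction described above, here is a minimal PyTorch-style sketch, not the authors' implementation: it assumes paired RGB and thermal (TIR) token sequences, and all module and argument names (SpatioTemporalTokenMixer, dyn_tpl_rgb, etc.) are hypothetical.

```python
# Minimal sketch of the described token mixing, assuming standard multi-head
# attention; one plausible reading of the abstract, not the paper's code.
import torch
import torch.nn as nn


class SpatioTemporalTokenMixer(nn.Module):
    """Mixes dynamic multimodal template tokens with multimodal search tokens.

    - Cross-modal attention enhances each modality's dynamic template tokens
      with cues from the other modality.
    - Search-region tokens then attend to the enhanced dynamic template tokens
      to pick up temporal target-change information, while the static template
      tokens are left untouched for the backbone's joint feature extraction.
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_modal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, dyn_tpl_rgb, dyn_tpl_tir, search_rgb, search_tir):
        # All inputs are (B, N, C) token sequences.
        # 1) Enhance each modality's dynamic template tokens with the other modality.
        rgb_enh, _ = self.cross_modal_attn(dyn_tpl_rgb, dyn_tpl_tir, dyn_tpl_tir)
        tir_enh, _ = self.cross_modal_attn(dyn_tpl_tir, dyn_tpl_rgb, dyn_tpl_rgb)
        dyn_tpl = self.norm(torch.cat([rgb_enh, tir_enh], dim=1))

        # 2) Let multimodal search tokens query the enhanced dynamic template
        #    tokens, conveying multimodal-enhanced target-change information.
        search = torch.cat([search_rgb, search_tir], dim=1)
        search_out, _ = self.temporal_attn(search, dyn_tpl, dyn_tpl)
        return search + search_out  # residual update of the search-region tokens


if __name__ == "__main__":
    B, N, C = 2, 64, 256
    mixer = SpatioTemporalTokenMixer(dim=C)
    toks = [torch.randn(B, N, C) for _ in range(4)]
    print(mixer(*toks).shape)  # torch.Size([2, 128, 256])
```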
Related papers
- Unified Local and Global Attention Interaction Modeling for Vision Transformers [1.9571946424055506]
We present a novel method that extends the self-attention mechanism of a vision transformer (ViT) for more accurate object detection across diverse datasets.
ViTs show strong capability for image understanding tasks such as object detection, segmentation, and classification.
We introduce two modifications to the traditional self-attention framework: a novel aggressive convolution pooling strategy for local feature mixing, and a new conceptual attention transformation.
arXiv Detail & Related papers (2024-12-25T04:53:19Z) - Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking [53.33637391723555]
We propose a unified multimodal spatial-temporal tracking approach named STTrack.
In contrast to previous paradigms, we introduce a temporal state generator (TSG) that continuously generates a sequence of tokens containing multimodal temporal information.
These temporal information tokens are used to guide the localization of the target in the next time state, establish long-range contextual relationships between video frames, and capture the temporal trajectory of the target.
arXiv Detail & Related papers (2024-12-20T09:10:17Z) - Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent [53.637837706712794]
We propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs.
Specifically, we introduce a Ghost Spatial Masking (GSM) module embedded within a Transformer encoder for spatial feature extraction.
We benchmark three practical sports game datasets, Basketball-U, Football-U, and Soccer-U, for evaluation.
arXiv Detail & Related papers (2024-05-27T22:15:23Z) - Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification [64.36210786350568]
We propose a novel learning framework named EDITOR to select diverse tokens from vision Transformers for multi-modal object ReID.
Our framework can generate more discriminative features for multi-modal object ReID.
arXiv Detail & Related papers (2024-03-15T12:44:35Z) - Temporal Adaptive RGBT Tracking with Modality Prompt [10.431364270734331]
RGBT tracking has been widely used in various fields such as robotics, surveillance, and autonomous driving.
Existing RGBT trackers fully explore the spatial information between the template and the search region and locate the target based on the appearance matching results.
These RGBT trackers have very limited exploitation of temporal information, either ignoring temporal information or exploiting it through online sampling and training.
arXiv Detail & Related papers (2024-01-02T15:20:50Z) - Unsupervised Multi-modal Feature Alignment for Time Series Representation Learning [20.655943795843037]
We introduce an innovative approach that focuses on aligning and binding time series representations encoded from different modalities.
In contrast to conventional methods that fuse features from multiple modalities, our proposed approach simplifies the neural architecture by retaining a single time series encoder.
Our approach outperforms existing state-of-the-art URL methods across diverse downstream tasks.
arXiv Detail & Related papers (2023-12-09T22:31:20Z) - Correlated Attention in Transformers for Multivariate Time Series [22.542109523780333]
We propose a novel correlated attention mechanism, which efficiently captures feature-wise dependencies, and can be seamlessly integrated within the encoder blocks of existing Transformers.
In particular, correlated attention operates across feature channels to compute cross-covariance matrices between queries and keys with different lag values, and selectively aggregate representations at the sub-series level.
This architecture facilitates automated discovery and representation learning of not only instantaneous but also lagged cross-correlations, while inherently capturing time series auto-correlation.
arXiv Detail & Related papers (2023-11-20T17:35:44Z) - Cross-modal Orthogonal High-rank Augmentation for RGB-Event Transformer-trackers [58.802352477207094]
We explore the great potential of a pre-trained vision Transformer (ViT) to bridge the vast distribution gap between two modalities.
We propose a mask modeling strategy that randomly masks a specific modality of some tokens to enforce proactive interaction between tokens from different modalities.
Experiments demonstrate that our plug-and-play training augmentation techniques can significantly boost state-of-the-art one-stream and two-stream trackers in terms of both tracking precision and success rate.
arXiv Detail & Related papers (2023-07-09T08:58:47Z) - Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
arXiv Detail & Related papers (2023-03-14T02:58:27Z) - Multimodal Token Fusion for Vision Transformers [54.81107795090239]
We propose a multimodal token fusion method (TokenFusion) for transformer-based vision tasks.
To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features.
The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact.
arXiv Detail & Related papers (2022-04-19T07:47:50Z)
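
For the TokenFusion entry above, the following is a minimal sketch of the "substitute uninformative tokens with projected and aggregated inter-modal features" idea, assuming informativeness is predicted by a small per-token score head with an arbitrary 0.5 threshold; names such as TokenSubstitution and score_head are hypothetical, not the paper's API.

```python
# Minimal sketch (not the TokenFusion implementation): replaces low-scoring
# tokens of one modality with a projection of the other modality's tokens.
import torch
import torch.nn as nn


class TokenSubstitution(nn.Module):
    def __init__(self, dim: int = 256, threshold: float = 0.5):
        super().__init__()
        # Per-token informativeness score in [0, 1] (assumption).
        self.score_head = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        # Projects the other modality's token before substitution.
        self.proj = nn.Linear(dim, dim)
        self.threshold = threshold

    def forward(self, tokens_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
        # tokens_a, tokens_b: (B, N, C), spatially aligned token sequences.
        scores = self.score_head(tokens_a)          # (B, N, 1), low = uninformative
        mask = (scores < self.threshold).float()    # 1 where the token is replaced
        substitute = self.proj(tokens_b)            # projected inter-modal features
        return (1.0 - mask) * tokens_a + mask * substitute


if __name__ == "__main__":
    sub = TokenSubstitution()
    a, b = torch.randn(2, 196, 256), torch.randn(2, 196, 256)
    print(sub(a, b).shape)  # torch.Size([2, 196, 256])
```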