Transformer RGBT Tracking with Spatio-Temporal Multimodal Tokens
- URL: http://arxiv.org/abs/2401.01674v1
- Date: Wed, 3 Jan 2024 11:16:38 GMT
- Title: Transformer RGBT Tracking with Spatio-Temporal Multimodal Tokens
- Authors: Dengdi Sun, Yajie Pan, Andong Lu, Chenglong Li, Bin Luo
- Abstract summary: We propose a novel Transformer RGBT tracking approach, which mixes spatio-temporal multimodal tokens from static multimodal templates and multimodal search regions in the Transformer to handle target appearance changes.
Our module is inserted into the transformer backbone network and inherits joint feature extraction, search-template matching, and cross-modal interaction.
Experiments on three RGBT benchmark datasets show that the proposed approach maintains competitive performance compared to other state-of-the-art tracking algorithms.
- Score: 13.608089918718797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many RGBT tracking studies focus primarily on modal fusion design while
overlooking the effective handling of target appearance changes. While some
approaches have introduced historical frames or fused and replaced the initial
template to incorporate temporal information, they risk disrupting the original
target appearance and accumulating errors over time. To alleviate
these limitations, we propose a novel Transformer RGBT tracking approach, which
mixes spatio-temporal multimodal tokens from the static multimodal templates
and multimodal search regions in Transformer to handle target appearance
changes, for robust RGBT tracking. We introduce independent dynamic template
tokens that interact with the search region, embedding temporal information to
address appearance changes, while retaining the initial static template tokens
in the joint feature extraction process so that the original, reliable target
appearance information is preserved and the deviations caused by traditional
temporal updates are avoided. We also use attention to enhance the target
features of the multimodal template tokens with supplementary cues from the
other modality, and let the multimodal search-region tokens interact with the
multimodal dynamic template tokens via attention, which conveys
multimodal-enhanced target-change information. Our module is inserted into the
transformer backbone network and inherits joint feature extraction,
search-template matching, and cross-modal interaction. Extensive experiments on
three RGBT benchmark datasets show that the proposed approach maintains
competitive performance compared to other state-of-the-art tracking algorithms
while running at 39.1 FPS.
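The core module described above lends itself to a compact illustration. The following is a minimal PyTorch-style sketch, not the authors' code: the module name, token shapes, and the use of standard multi-head attention layers are all assumptions made for clarity.

```python
# Minimal sketch (assumed design, not the paper's implementation):
# dynamic template tokens are first enhanced across modalities, then the
# search-region tokens attend to them to pick up target-change cues.
import torch
import torch.nn as nn

class SpatioTemporalTokenMixer(nn.Module):  # hypothetical name
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Cross-modal enhancement between RGB and thermal template tokens.
        self.cross_modal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Temporal interaction: search tokens query the dynamic templates.
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, search_tokens, dyn_template_rgb, dyn_template_tir):
        # Enhance each modality's dynamic template with the other's cues.
        rgb_enh, _ = self.cross_modal_attn(dyn_template_rgb, dyn_template_tir, dyn_template_tir)
        tir_enh, _ = self.cross_modal_attn(dyn_template_tir, dyn_template_rgb, dyn_template_rgb)
        dyn_templates = torch.cat([rgb_enh, tir_enh], dim=1)
        # Convey multimodal-enhanced target-change information to the search region.
        out, _ = self.temporal_attn(search_tokens, dyn_templates, dyn_templates)
        return self.norm(search_tokens + out)

# Usage with assumed shapes: 400 search tokens, 64 tokens per dynamic template.
mixer = SpatioTemporalTokenMixer()
out = mixer(torch.randn(1, 400, 256), torch.randn(1, 64, 256), torch.randn(1, 64, 256))
print(out.shape)  # torch.Size([1, 400, 256])
```

In this reading, the static template tokens keep flowing through the backbone's joint feature extraction unchanged, which is how the original target appearance stays protected from temporal drift.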
Related papers
- Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation [12.455034591553506]
Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields.
Previous work ignored inter-modal alignment and intra-modal noise prior to multimodal fusion.
We have developed a novel approach called Masked Graph Learning with Recurrent Alignment (MGLRA) to tackle this problem.
arXiv Detail & Related papers (2024-07-23T02:23:51Z)
- Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent [53.637837706712794]
We propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs.
Specifically, we introduce a Ghost Spatial Masking (GSM) module embedded within a Transformer encoder for spatial feature extraction.
We benchmark three practical sports game datasets, Basketball-U, Football-U, and Soccer-U, for evaluation.
arXiv Detail & Related papers (2024-05-27T22:15:23Z)
- Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification [64.36210786350568]
We propose a novel learning framework named EDITOR to select diverse tokens from vision Transformers for multi-modal object ReID.
Our framework can generate more discriminative features for multi-modal object ReID.
arXiv Detail & Related papers (2024-03-15T12:44:35Z)
- Temporal Adaptive RGBT Tracking with Modality Prompt [10.431364270734331]
RGBT tracking has been widely used in various fields such as robotics, surveillance processing, and autonomous driving.
Existing RGBT trackers fully explore the spatial information between the template and the search region and locate the target based on the appearance matching results.
These RGBT trackers have very limited exploitation of temporal information, either ignoring temporal information or exploiting it through online sampling and training.
arXiv Detail & Related papers (2024-01-02T15:20:50Z)
- Unsupervised Multi-modal Feature Alignment for Time Series Representation Learning [20.655943795843037]
We introduce an innovative approach that focuses on aligning and binding time series representations encoded from different modalities.
In contrast to conventional methods that fuse features from multiple modalities, our proposed approach simplifies the neural architecture by retaining a single time series encoder.
Our approach outperforms existing state-of-the-art URL methods across diverse downstream tasks.
arXiv Detail & Related papers (2023-12-09T22:31:20Z)
- Correlated Attention in Transformers for Multivariate Time Series [22.542109523780333]
We propose a novel correlated attention mechanism, which efficiently captures feature-wise dependencies, and can be seamlessly integrated within the encoder blocks of existing Transformers.
In particular, correlated attention operates across feature channels to compute cross-covariance matrices between queries and keys with different lag values, and selectively aggregate representations at the sub-series level.
This architecture facilitates automated discovery and representation learning of not only instantaneous but also lagged cross-correlations, while inherently capturing time series auto-correlation.
arXiv Detail & Related papers (2023-11-20T17:35:44Z)
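A rough sketch of the mechanism as summarized: keys and values are shifted by a set of lags, channel-wise cross-covariance matrices are computed against the queries, and the per-lag outputs are aggregated. The lag set, normalization, and mean aggregation below are assumptions, not the paper's exact design.

```python
# Hedged sketch of lagged, channel-wise (cross-covariance) attention.
import torch

def correlated_attention(q, k, v, lags=(0, 1, 2)):
    """q, k, v: (batch, time, channels). Attention acts across channels,
    with keys/values rolled by each lag to expose lagged correlations."""
    b, t, d = q.shape
    outputs = []
    for lag in lags:
        k_lag = torch.roll(k, shifts=lag, dims=1)
        v_lag = torch.roll(v, shifts=lag, dims=1)
        # Channel-wise cross-covariance between queries and lagged keys: (b, d, d).
        cov = torch.einsum("btd,bte->bde", q, k_lag) / t
        attn = torch.softmax(cov / d ** 0.5, dim=-1)
        outputs.append(torch.einsum("bde,bte->btd", attn, v_lag))
    return torch.stack(outputs).mean(dim=0)  # aggregate across lags

x = torch.randn(2, 96, 32)                  # 2 series, 96 steps, 32 channels
print(correlated_attention(x, x, x).shape)  # torch.Size([2, 96, 32])
```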
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
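One plausible reading of IMQ is a small set of learnable queries that cross-attend to a modality's token sequence to pool global context. The sketch below is an illustration under that assumption, not the paper's code:

```python
# Hypothetical sketch of learnable queries aggregating per-modality context.
import torch
import torch.nn as nn

class ImplicitQueryPool(nn.Module):  # assumed name and design
    def __init__(self, dim: int = 256, num_queries: int = 8, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, modality_tokens):
        # Learned queries aggregate global contextual cues within one modality.
        q = self.queries.expand(modality_tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, modality_tokens, modality_tokens)
        return pooled  # (batch, num_queries, dim)

pool = ImplicitQueryPool()
print(pool(torch.randn(4, 196, 256)).shape)  # torch.Size([4, 8, 256])
```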
- Cross-modal Orthogonal High-rank Augmentation for RGB-Event Transformer-trackers [58.802352477207094]
We explore the great potential of a pre-trained vision Transformer (ViT) to bridge the vast distribution gap between two modalities.
We propose a mask modeling strategy that randomly masks some tokens of a specific modality to enforce proactive interaction between tokens from different modalities.
Experiments demonstrate that our plug-and-play training augmentation techniques can significantly boost state-of-the-art one-stream and two-stream trackers in terms of both tracking precision and success rate.
arXiv Detail & Related papers (2023-07-09T08:58:47Z)
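The masking strategy might look like the sketch below: at randomly chosen token positions, exactly one modality's token is replaced by a mask token, so the model must borrow the missing information from the other modality. The mask ratio, the 50/50 modality choice, and the zero mask token are all assumptions.

```python
# Assumed illustration of per-position, single-modality token masking.
import torch

def mask_one_modality(rgb, event, mask_token, ratio=0.3):
    """rgb, event: (batch, tokens, dim); mask_token: (dim,)."""
    b, n, _ = rgb.shape
    masked = torch.rand(b, n) < ratio       # positions to mask at all
    pick_rgb = torch.rand(b, n) < 0.5       # which modality to mask there
    rgb = torch.where((masked & pick_rgb).unsqueeze(-1), mask_token, rgb)
    event = torch.where((masked & ~pick_rgb).unsqueeze(-1), mask_token, event)
    return rgb, event

rgb, ev = torch.randn(2, 64, 256), torch.randn(2, 64, 256)
rgb_m, ev_m = mask_one_modality(rgb, ev, torch.zeros(256))
print(rgb_m.shape, ev_m.shape)  # torch.Size([2, 64, 256]) each
```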
- Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
arXiv Detail & Related papers (2023-03-14T02:58:27Z)
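The streaming pattern described above can be sketched as a FIFO memory bank that the current frame's features query via attention; the read-out mechanism and update rule here are assumptions for illustration only.

```python
# Simplified sketch (assumed design): one frame in, history in a FIFO bank.
import torch
import torch.nn as nn

class MemoryBankTracker(nn.Module):  # hypothetical name
    def __init__(self, dim: int = 128, capacity: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.capacity = capacity
        self.bank = []  # past-frame features, each (batch, tokens, dim)

    def forward(self, cur):
        if self.bank:
            mem = torch.cat(self.bank, dim=1)    # concatenate history tokens
            fused, _ = self.attn(cur, mem, mem)  # current frame queries history
            cur = cur + fused
        self.bank.append(cur.detach())           # store without gradients
        self.bank = self.bank[-self.capacity:]   # FIFO eviction
        return cur

tracker = MemoryBankTracker()
for _ in range(3):  # frames arrive one at a time
    out = tracker(torch.randn(1, 64, 128))
print(out.shape)  # torch.Size([1, 64, 128])
```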
- Multimodal Token Fusion for Vision Transformers [54.81107795090239]
We propose a multimodal token fusion method (TokenFusion) for transformer-based vision tasks.
To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features.
The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact.
arXiv Detail & Related papers (2022-04-19T07:47:50Z)
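A hedged sketch of the substitution idea as summarized above (the scoring head, the threshold, and the hard swap are assumptions, not the released TokenFusion code):

```python
# Assumed illustration: uninformative tokens in one modality are replaced
# by a projection of the corresponding token from the other modality.
import torch
import torch.nn as nn

class TokenSubstitution(nn.Module):  # hypothetical name
    def __init__(self, dim: int = 256):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # per-token informativeness score
        self.proj = nn.Linear(dim, dim)  # projects the other modality's token

    def forward(self, x_a, x_b, threshold: float = 0.5):
        s = torch.sigmoid(self.score(x_a))     # (batch, tokens, 1)
        swap = (s < threshold).float()         # 1 where token is uninformative
        return swap * self.proj(x_b) + (1 - swap) * x_a

fuse = TokenSubstitution()
rgb, depth = torch.randn(2, 196, 256), torch.randn(2, 196, 256)
print(fuse(rgb, depth).shape)  # torch.Size([2, 196, 256])
```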
- Cross-Modality Fusion Transformer for Multispectral Object Detection [0.0]
Multispectral image pairs can provide combined information that makes object detection applications more reliable and robust.
We present a simple yet effective cross-modality feature fusion approach, named Cross-Modality Fusion Transformer (CFT) in this paper.
arXiv Detail & Related papers (2021-10-30T15:34:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.