Transformer RGBT Tracking with Spatio-Temporal Multimodal Tokens
- URL: http://arxiv.org/abs/2401.01674v1
- Date: Wed, 3 Jan 2024 11:16:38 GMT
- Title: Transformer RGBT Tracking with Spatio-Temporal Multimodal Tokens
- Authors: Dengdi Sun, Yajie Pan, Andong Lu, Chenglong Li, Bin Luo
- Abstract summary: We propose a novel Transformer RGBT tracking approach, which mixes spatio-temporal multimodal tokens from static multimodal templates and multimodal search regions in the Transformer to handle target appearance changes.
Our module is inserted into the transformer backbone network and inherits joint feature extraction, search-template matching, and cross-modal interaction.
Experiments on three RGBT benchmark datasets show that the proposed approach maintains competitive performance compared to other state-of-the-art tracking algorithms.
- Score: 13.608089918718797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many RGBT tracking studies focus primarily on modal fusion design while
overlooking the effective handling of target appearance changes. While some
approaches have introduced historical frames or fused and replaced the initial
template to incorporate temporal information, they risk disrupting the original
target appearance and accumulating errors over time. To alleviate
these limitations, we propose a novel Transformer RGBT tracking approach, which
mixes spatio-temporal multimodal tokens from the static multimodal templates
and multimodal search regions in Transformer to handle target appearance
changes, for robust RGBT tracking. We introduce independent dynamic template
tokens that interact with the search region, embedding temporal information to
address appearance changes, while retaining the initial static template tokens
in the joint feature extraction process so that the original, reliable target
appearance information is preserved and the deviations caused by traditional
temporal updates are avoided. We also use attention to enhance the target
features of the multimodal template tokens with supplementary cues from the
other modality, and let the multimodal search-region tokens interact with the
multimodal dynamic template tokens via attention, which conveys
multimodal-enhanced target-change information. Our module is inserted into the
transformer backbone network and inherits joint feature extraction,
search-template matching, and cross-modal interaction. Extensive experiments on
three RGBT benchmark datasets show that the proposed approach maintains
competitive performance compared to other state-of-the-art tracking algorithms
while running at 39.1 FPS.
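The core module described above lends itself to a compact illustration. The following is a minimal PyTorch-style sketch, not the authors' code: the module name, token shapes, and the use of standard multi-head attention layers are all assumptions made for clarity.

```python
# Minimal sketch (assumed design, not the paper's implementation):
# dynamic template tokens are first enhanced across modalities, then the
# search-region tokens attend to them to pick up target-change cues.
import torch
import torch.nn as nn

class SpatioTemporalTokenMixer(nn.Module):  # hypothetical name
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Cross-modal enhancement between RGB and thermal template tokens.
        self.cross_modal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Temporal interaction: search tokens query the dynamic templates.
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, search_tokens, dyn_template_rgb, dyn_template_tir):
        # Enhance each modality's dynamic template with the other's cues.
        rgb_enh, _ = self.cross_modal_attn(dyn_template_rgb, dyn_template_tir, dyn_template_tir)
        tir_enh, _ = self.cross_modal_attn(dyn_template_tir, dyn_template_rgb, dyn_template_rgb)
        dyn_templates = torch.cat([rgb_enh, tir_enh], dim=1)
        # Convey multimodal-enhanced target-change information to the search region.
        out, _ = self.temporal_attn(search_tokens, dyn_templates, dyn_templates)
        return self.norm(search_tokens + out)

# Usage with assumed shapes: 400 search tokens, 64 tokens per dynamic template.
mixer = SpatioTemporalTokenMixer()
out = mixer(torch.randn(1, 400, 256), torch.randn(1, 64, 256), torch.randn(1, 64, 256))
print(out.shape)  # torch.Size([1, 400, 256])
```

In this reading, the static template tokens keep flowing through the backbone's joint feature extraction unchanged, which is how the original target appearance stays protected from temporal drift.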
Related papers
- Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation [12.455034591553506]
Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields.
Previous work ignored inter-modal alignment and intra-modal noise prior to multimodal fusion.
We have developed a novel approach called Masked Graph Learning with Recurrent Alignment (MGLRA) to tackle this problem.
arXiv Detail & Related papers (2024-07-23T02:23:51Z)
- Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent [53.637837706712794]
We propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs.
Specifically, we introduce a Ghost Spatial Masking (GSM) module embedded within a Transformer encoder for spatial feature extraction.
We benchmark three practical sports game datasets, Basketball-U, Football-U, and Soccer-U, for evaluation.
arXiv Detail & Related papers (2024-05-27T22:15:23Z)
- Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification [64.36210786350568]
We propose a novel learning framework named EDITOR to select diverse tokens from vision Transformers for multi-modal object ReID.
Our framework can generate more discriminative features for multi-modal object ReID.
arXiv Detail & Related papers (2024-03-15T12:44:35Z)
- Temporal Adaptive RGBT Tracking with Modality Prompt [10.431364270734331]
RGBT tracking has been widely used in various fields such as robotics, surveillance processing, and autonomous driving.
Existing RGBT trackers fully explore the spatial information between the template and the search region and locate the target based on the appearance matching results.
These RGBT trackers have very limited exploitation of temporal information, either ignoring temporal information or exploiting it through online sampling and training.
arXiv Detail & Related papers (2024-01-02T15:20:50Z)
- Unsupervised Multi-modal Feature Alignment for Time Series Representation Learning [20.655943795843037]
We introduce an innovative approach that focuses on aligning and binding time series representations encoded from different modalities.
In contrast to conventional methods that fuse features from multiple modalities, our proposed approach simplifies the neural architecture by retaining a single time series encoder.
Our approach outperforms existing state-of-the-art URL methods across diverse downstream tasks.
arXiv Detail & Related papers (2023-12-09T22:31:20Z)
- Correlated Attention in Transformers for Multivariate Time Series [22.542109523780333]
We propose a novel correlated attention mechanism, which efficiently captures feature-wise dependencies, and can be seamlessly integrated within the encoder blocks of existing Transformers.
In particular, correlated attention operates across feature channels to compute cross-covariance matrices between queries and keys with different lag values, and selectively aggregate representations at the sub-series level.
This architecture facilitates automated discovery and representation learning of not only instantaneous but also lagged cross-correlations, while inherently capturing time series auto-correlation.
arXiv Detail & Related papers (2023-11-20T17:35:44Z)
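A rough sketch of the mechanism as summarized: keys and values are shifted by a set of lags, channel-wise cross-covariance matrices are computed against the queries, and the per-lag outputs are aggregated. The lag set, normalization, and mean aggregation below are assumptions, not the paper's exact design.

```python
# Hedged sketch of lagged, channel-wise (cross-covariance) attention.
import torch

def correlated_attention(q, k, v, lags=(0, 1, 2)):
    """q, k, v: (batch, time, channels). Attention acts across channels,
    with keys/values rolled by each lag to expose lagged correlations."""
    b, t, d = q.shape
    outputs = []
    for lag in lags:
        k_lag = torch.roll(k, shifts=lag, dims=1)
        v_lag = torch.roll(v, shifts=lag, dims=1)
        # Channel-wise cross-covariance between queries and lagged keys: (b, d, d).
        cov = torch.einsum("btd,bte->bde", q, k_lag) / t
        attn = torch.softmax(cov / d ** 0.5, dim=-1)
        outputs.append(torch.einsum("bde,bte->btd", attn, v_lag))
    return torch.stack(outputs).mean(dim=0)  # aggregate across lags

x = torch.randn(2, 96, 32)                  # 2 series, 96 steps, 32 channels
print(correlated_attention(x, x, x).shape)  # torch.Size([2, 96, 32])
```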
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
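One plausible reading of IMQ is a small set of learnable queries that cross-attend to a modality's token sequence to pool global context. The sketch below is an illustration under that assumption, not the paper's code:

```python
# Hypothetical sketch of learnable queries aggregating per-modality context.
import torch
import torch.nn as nn

class ImplicitQueryPool(nn.Module):  # assumed name and design
    def __init__(self, dim: int = 256, num_queries: int = 8, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, modality_tokens):
        # Learned queries aggregate global contextual cues within one modality.
        q = self.queries.expand(modality_tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, modality_tokens, modality_tokens)
        return pooled  # (batch, num_queries, dim)

pool = ImplicitQueryPool()
print(pool(torch.randn(4, 196, 256)).shape)  # torch.Size([4, 8, 256])
```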
- Cross-modal Orthogonal High-rank Augmentation for RGB-Event Transformer-trackers [58.802352477207094]
We explore the great potential of a pre-trained vision Transformer (ViT) to bridge the vast distribution gap between two modalities.
We propose a mask modeling strategy that randomly masks some tokens of a specific modality to enforce proactive interaction between tokens from different modalities.
Experiments demonstrate that our plug-and-play training augmentation techniques can significantly boost state-of-the-art one-stream and two-stream trackers in terms of both tracking precision and success rate.
arXiv Detail & Related papers (2023-07-09T08:58:47Z)
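The masking strategy might look like the sketch below: at randomly chosen token positions, exactly one modality's token is replaced by a mask token, so the model must borrow the missing information from the other modality. The mask ratio, the 50/50 modality choice, and the zero mask token are all assumptions.

```python
# Assumed illustration of per-position, single-modality token masking.
import torch

def mask_one_modality(rgb, event, mask_token, ratio=0.3):
    """rgb, event: (batch, tokens, dim); mask_token: (dim,)."""
    b, n, _ = rgb.shape
    masked = torch.rand(b, n) < ratio       # positions to mask at all
    pick_rgb = torch.rand(b, n) < 0.5       # which modality to mask there
    rgb = torch.where((masked & pick_rgb).unsqueeze(-1), mask_token, rgb)
    event = torch.where((masked & ~pick_rgb).unsqueeze(-1), mask_token, event)
    return rgb, event

rgb, ev = torch.randn(2, 64, 256), torch.randn(2, 64, 256)
rgb_m, ev_m = mask_one_modality(rgb, ev, torch.zeros(256))
print(rgb_m.shape, ev_m.shape)  # torch.Size([2, 64, 256]) each
```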
- Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
arXiv Detail & Related papers (2023-03-14T02:58:27Z)
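The streaming pattern described above can be sketched as a FIFO memory bank that the current frame's features query via attention; the read-out mechanism and update rule here are assumptions for illustration only.

```python
# Simplified sketch (assumed design): one frame in, history in a FIFO bank.
import torch
import torch.nn as nn

class MemoryBankTracker(nn.Module):  # hypothetical name
    def __init__(self, dim: int = 128, capacity: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.capacity = capacity
        self.bank = []  # past-frame features, each (batch, tokens, dim)

    def forward(self, cur):
        if self.bank:
            mem = torch.cat(self.bank, dim=1)    # concatenate history tokens
            fused, _ = self.attn(cur, mem, mem)  # current frame queries history
            cur = cur + fused
        self.bank.append(cur.detach())           # store without gradients
        self.bank = self.bank[-self.capacity:]   # FIFO eviction
        return cur

tracker = MemoryBankTracker()
for _ in range(3):  # frames arrive one at a time
    out = tracker(torch.randn(1, 64, 128))
print(out.shape)  # torch.Size([1, 64, 128])
```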
- Multimodal Token Fusion for Vision Transformers [54.81107795090239]
We propose a multimodal token fusion method (TokenFusion) for transformer-based vision tasks.
To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features.
The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact.
arXiv Detail & Related papers (2022-04-19T07:47:50Z)
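A hedged sketch of the substitution idea as summarized above (the scoring head, the threshold, and the hard swap are assumptions, not the released TokenFusion code):

```python
# Assumed illustration: uninformative tokens in one modality are replaced
# by a projection of the corresponding token from the other modality.
import torch
import torch.nn as nn

class TokenSubstitution(nn.Module):  # hypothetical name
    def __init__(self, dim: int = 256):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # per-token informativeness score
        self.proj = nn.Linear(dim, dim)  # projects the other modality's token

    def forward(self, x_a, x_b, threshold: float = 0.5):
        s = torch.sigmoid(self.score(x_a))     # (batch, tokens, 1)
        swap = (s < threshold).float()         # 1 where token is uninformative
        return swap * self.proj(x_b) + (1 - swap) * x_a

fuse = TokenSubstitution()
rgb, depth = torch.randn(2, 196, 256), torch.randn(2, 196, 256)
print(fuse(rgb, depth).shape)  # torch.Size([2, 196, 256])
```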
- Cross-Modality Fusion Transformer for Multispectral Object Detection [0.0]
Multispectral image pairs can provide combined information that makes object detection applications more reliable and robust.
We present a simple yet effective cross-modality feature fusion approach, named Cross-Modality Fusion Transformer (CFT) in this paper.
arXiv Detail & Related papers (2021-10-30T15:34:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.