RGBT Tracking via Progressive Fusion Transformer with Dynamically Guided
Learning
- URL: http://arxiv.org/abs/2303.14778v2
- Date: Sat, 22 Apr 2023 08:55:55 GMT
- Title: RGBT Tracking via Progressive Fusion Transformer with Dynamically Guided
Learning
- Authors: Yabin Zhu, Chenglong Li, Xiao Wang, Jin Tang, Zhixiang Huang
- Abstract summary: We propose a novel Progressive Fusion Transformer called ProFormer.
It integrates single-modality information into the multimodal representation for robust RGBT tracking.
ProFormer achieves new state-of-the-art performance on the RGBT210, RGBT234, LasHeR, and VTUAV datasets.
- Score: 37.067605349559
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing Transformer-based RGBT tracking methods either use cross-attention
to fuse the two modalities, or use self-attention and cross-attention to model
both modality-specific and modality-sharing information. However, the
significant appearance gap between modalities limits the feature representation
ability of certain modalities during the fusion process. To address this
problem, we propose a novel Progressive Fusion Transformer called ProFormer,
which progressively integrates single-modality information into the multimodal
representation for robust RGBT tracking. In particular, ProFormer first uses a
self-attention module to collaboratively extract the multimodal representation,
and then uses two cross-attention modules to interact it with the features of
the two modalities, respectively. In this way, the modality-specific
information can be well activated in the multimodal representation. Finally, a
feed-forward network is used to fuse two interacted multimodal representations
for the further enhancement of the final multimodal representation. In
addition, existing learning methods for RGBT trackers either fuse multimodal
features into a single representation for final classification, or exploit the relationship between
the unimodal branches and the fused branch through a competitive learning strategy.
However, they either ignore the learning of single-modality branches or result
in one branch failing to be well optimized. To solve these problems, we propose
a dynamically guided learning algorithm that adaptively uses well-performing
branches to guide the learning of other branches, for enhancing the
representation ability of each branch. Extensive experiments demonstrate that
our proposed ProFormer achieves new state-of-the-art performance on the RGBT210,
RGBT234, LasHeR, and VTUAV datasets.
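The progressive fusion described above (joint self-attention, two modality-specific cross-attention modules, and a feed-forward fusion step) can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' released implementation: the token layout (batch, tokens, dim), the concatenation-based multimodal extraction, the residual connection, and the way the two interacted representations are merged before the feed-forward network are all assumptions.

```python
# Minimal sketch of the progressive fusion block; layer names and the exact
# merging scheme are assumptions, not the paper's released design.
import torch
import torch.nn as nn


class ProgressiveFusionBlock(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Self-attention collaboratively extracts a multimodal representation.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Two cross-attention modules interact the multimodal representation
        # with the RGB and thermal features, respectively.
        self.cross_attn_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn_tir = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # A feed-forward network fuses the two interacted representations.
        self.ffn = nn.Sequential(
            nn.Linear(2 * dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, f_rgb: torch.Tensor, f_tir: torch.Tensor) -> torch.Tensor:
        # Jointly attend over the concatenated tokens of both modalities.
        tokens = torch.cat([f_rgb, f_tir], dim=1)
        multimodal, _ = self.self_attn(tokens, tokens, tokens)
        # Interact the multimodal representation with each modality's features.
        inter_rgb, _ = self.cross_attn_rgb(multimodal, f_rgb, f_rgb)
        inter_tir, _ = self.cross_attn_tir(multimodal, f_tir, f_tir)
        # Fuse the two interacted multimodal representations.
        fused = self.ffn(torch.cat([inter_rgb, inter_tir], dim=-1))
        return self.norm(fused + multimodal)


if __name__ == "__main__":
    block = ProgressiveFusionBlock()
    rgb = torch.randn(2, 64, 256)
    tir = torch.randn(2, 64, 256)
    print(block(rgb, tir).shape)  # torch.Size([2, 128, 256])
```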
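The dynamically guided learning algorithm can likewise be illustrated with a small loss sketch in which the branch that currently performs best guides the others. The loss-based guide selection, the KL-divergence guidance term, and the guide_weight factor are assumed for illustration and are not the paper's exact criterion or weighting.

```python
# Hedged sketch of dynamically guided learning: the branch with the lowest
# supervised loss on the current batch provides soft targets for the others.
import torch
import torch.nn.functional as F


def dynamically_guided_loss(logits_per_branch, labels, guide_weight=0.5):
    """logits_per_branch: dict of branch name -> (batch, num_classes) logits."""
    # Supervised loss of every branch (e.g. RGB, thermal, fused).
    sup_losses = {
        name: F.cross_entropy(logits, labels)
        for name, logits in logits_per_branch.items()
    }
    # The best-performing branch acts as the guide for this step.
    guide = min(sup_losses, key=sup_losses.get)
    guide_probs = F.softmax(logits_per_branch[guide].detach(), dim=-1)

    total = sum(sup_losses.values())
    for name, logits in logits_per_branch.items():
        if name == guide:
            continue
        # Weaker branches are pulled toward the guide branch's predictions.
        total = total + guide_weight * F.kl_div(
            F.log_softmax(logits, dim=-1), guide_probs, reduction="batchmean"
        )
    return total


if __name__ == "__main__":
    logits = {k: torch.randn(4, 2, requires_grad=True) for k in ("rgb", "tir", "fused")}
    labels = torch.randint(0, 2, (4,))
    dynamically_guided_loss(logits, labels).backward()
```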
Related papers
- Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation [12.455034591553506]
Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields.
Previous work ignored the inter-modal alignment process and the intra-modal noise information before multimodal fusion.
We have developed a novel approach called Masked Graph Learning with Recurrent Alignment (MGLRA) to tackle this problem.
arXiv Detail & Related papers (2024-07-23T02:23:51Z) - Multimodal Information Interaction for Medical Image Segmentation [24.024848382458767]
We introduce an innovative Multimodal Information Cross Transformer (MicFormer)
It queries features from one modality and retrieves corresponding responses from another, facilitating effective communication between bimodal features.
Compared to other multimodal segmentation techniques, our method outperforms them by margins of 2.83 and 4.23, respectively.
arXiv Detail & Related papers (2024-04-25T07:21:14Z) - Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z) - Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - NestedFormer: Nested Modality-Aware Transformer for Brain Tumor
Segmentation [29.157465321864265]
We propose a novel Nested Modality-Aware Transformer (NestedFormer) to explore the intra-modality and inter-modality relationships of multi-modal MRIs for brain tumor segmentation.
Built on the transformer-based multi-encoder and single-decoder structure, we perform nested multi-modal fusion for high-level representations of different modalities.
arXiv Detail & Related papers (2022-08-31T14:04:25Z) - LMR-CBT: Learning Modality-fused Representations with CB-Transformer for
Multimodal Emotion Recognition from Unaligned Multimodal Sequences [5.570499497432848]
We propose an efficient neural network to learn modality-fused representations with CB-Transformer (LMR-CBT) for multimodal emotion recognition.
We conduct word-aligned and unaligned experiments on three challenging datasets.
arXiv Detail & Related papers (2021-12-03T03:43:18Z) - Learning Deep Multimodal Feature Representation with Asymmetric
Multi-layer Fusion [63.72912507445662]
We propose a compact and effective framework to fuse multimodal features at multiple layers in a single network.
We verify that multimodal features can be learnt within a shared single network by merely maintaining modality-specific batch normalization layers in the encoder.
Secondly, we propose a bidirectional multi-layer fusion scheme, where multimodal features can be exploited progressively.
arXiv Detail & Related papers (2021-08-11T03:42:13Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal
Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z) - Accelerated Multi-Modal MR Imaging with Transformers [92.18406564785329]
We propose a multi-modal transformer (MTrans) for accelerated MR imaging.
By restructuring the transformer architecture, our MTrans gains a powerful ability to capture deep multi-modal information.
Our framework provides two appealing benefits, the first being that MTrans is the first attempt at using improved transformers for multi-modal MR imaging, affording more global information compared with CNN-based methods.
arXiv Detail & Related papers (2021-06-27T15:01:30Z)