RGBT Tracking via All-layer Multimodal Interactions with Progressive Fusion Mamba
- URL: http://arxiv.org/abs/2408.08827v1
- Date: Fri, 16 Aug 2024 16:22:34 GMT
- Title: RGBT Tracking via All-layer Multimodal Interactions with Progressive Fusion Mamba
- Authors: Andong Lu, Wanyu Wang, Chenglong Li, Jin Tang, Bin Luo
- Abstract summary: This paper presents a novel All-layer multimodal Interaction Network, named AINet, for robust RGBT tracking.
We show that AINet achieves leading performance against existing state-of-the-art methods.
- Score: 22.449878625622844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing RGBT tracking methods often design various interaction models to perform cross-modal fusion at each layer, but cannot execute feature interactions among all layers, which play a critical role in robust multimodal representation, because of the large computational burden. To address this issue, this paper presents a novel All-layer multimodal Interaction Network, named AINet, which performs efficient and effective feature interactions across all modalities and layers in a progressive fusion Mamba, for robust RGBT tracking. Although modality features in different layers are known to contain different cues, building multimodal interactions in each layer is challenging because interaction capability and efficiency are difficult to balance. Considering that the feature discrepancy between the RGB and thermal modalities reflects their complementary information to some extent, we design a Difference-based Fusion Mamba (DFM) to achieve enhanced fusion of the two modalities with linear complexity. When features from all layers interact, a huge number of tokens (3840 in this work) is involved and the computational burden is thus large. To handle this problem, we design an Order-dynamic Fusion Mamba (OFM) that executes efficient and effective feature interactions across all layers by dynamically adjusting the scan order of different layers in Mamba. Extensive experiments on four public RGBT tracking datasets show that AINet achieves leading performance against existing state-of-the-art methods.
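The abstract describes two mechanisms: a Difference-based Fusion Mamba (DFM) that exploits the RGB-thermal feature discrepancy as a complementarity cue, and an Order-dynamic Fusion Mamba (OFM) that dynamically reorders the scan over per-layer token sequences. The minimal PyTorch sketch below only illustrates these two ideas as stated in the abstract; the internal designs (gating the thermal features by the modality difference, scoring layers to choose a scan order) and all module/variable names are assumptions for illustration, not the paper's actual implementation, and the Mamba block itself is stubbed with `nn.Identity`.

```python
import torch
import torch.nn as nn


class DifferenceBasedFusion(nn.Module):
    """Sketch of a difference-based fusion step (assumed design): the RGB/thermal
    feature difference gates how much thermal information is injected before a
    linear-complexity sequence model (a Mamba-style block, passed in as a stub)."""

    def __init__(self, dim: int, seq_model: nn.Module):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.seq_model = seq_model  # stand-in for a real Mamba block

    def forward(self, rgb: torch.Tensor, tir: torch.Tensor) -> torch.Tensor:
        # rgb, tir: (batch, tokens, dim)
        diff = rgb - tir              # modality discrepancy as a complementarity cue
        g = self.gate(diff)           # per-token, per-channel gate in [0, 1]
        fused = rgb + g * tir         # inject gated thermal information
        return self.seq_model(fused)  # linear-complexity sequence mixing


class OrderDynamicFusion(nn.Module):
    """Sketch of order-dynamic scanning (assumed design): per-layer token
    sequences are scored, reordered by that score, concatenated, and scanned
    once, so all layers interact without a quadratic attention cost."""

    def __init__(self, dim: int, seq_model: nn.Module):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)
        self.seq_model = seq_model

    def forward(self, layer_feats: list[torch.Tensor]) -> torch.Tensor:
        # layer_feats: list of (batch, tokens, dim), one entry per backbone layer
        scores = torch.stack(
            [self.scorer(f.mean(dim=1)).squeeze(-1) for f in layer_feats], dim=1
        )                                                     # (batch, num_layers)
        order = scores.mean(dim=0).argsort(descending=True)   # shared scan order
        tokens = torch.cat([layer_feats[int(i)] for i in order], dim=1)
        return self.seq_model(tokens)  # one scan over all layers' tokens


if __name__ == "__main__":
    dim = 256
    seq_model = nn.Identity()          # replace with a real Mamba block
    dfm = DifferenceBasedFusion(dim, seq_model)
    ofm = OrderDynamicFusion(dim, seq_model)
    rgb = torch.randn(2, 320, dim)
    tir = torch.randn(2, 320, dim)
    fused_layers = [dfm(rgb, tir) for _ in range(4)]  # e.g. 4 backbone layers
    out = ofm(fused_layers)
    print(out.shape)                   # torch.Size([2, 1280, 256])
```

The point of the sketch is the complexity argument from the abstract: because the fused tokens from all layers are processed by a single linear-complexity scan rather than pairwise attention, interacting across every layer stays tractable even with thousands of tokens.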
Related papers
- COMO: Cross-Mamba Interaction and Offset-Guided Fusion for Multimodal Object Detection [9.913133285133998]
Single-modal object detection tasks often experience performance degradation when encountering diverse scenarios.
In contrast, multimodal object detection tasks can offer more comprehensive information about object features by integrating data from various modalities.
In this paper, we propose a novel approach called the CrOss-Mamba interaction and Offset-guided fusion framework.
arXiv Detail & Related papers (2024-12-24T01:14:48Z) - AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment [37.213291617683325]
Cross-modal alignment is crucial for multimodal representation fusion.
We propose AlignMamba, an efficient and effective method for multimodal fusion.
Experiments on complete and incomplete multimodal fusion tasks demonstrate the effectiveness and efficiency of the proposed method.
arXiv Detail & Related papers (2024-12-01T14:47:41Z) - DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z) - Multi-layer Learnable Attention Mask for Multimodal Tasks [2.378535917357144]
The Learnable Attention Mask (LAM) is strategically designed to globally regulate attention maps and prioritize critical tokens.
LAM adeptly captures associations between tokens in a BERT-like transformer network.
Comprehensive experimental validation is conducted on various datasets, such as MADv2, QVHighlights, ImageNet 1K, and MSRVTT.
arXiv Detail & Related papers (2024-06-04T20:28:02Z) - SurvMamba: State Space Model with Multi-grained Multi-modal Interaction for Survival Prediction [8.452410804749512]
We propose a Mamba-based structured state space model with multi-grained multi-modal interaction, named SurvMamba, for survival prediction.
SurvMamba is implemented with a Hierarchical Interaction Mamba (HIM) module that facilitates efficient intra-modal interactions at different granularities.
An Interaction Fusion Mamba (IFM) module is used for cascaded inter-modal interactive fusion, yielding more comprehensive features for survival prediction.
arXiv Detail & Related papers (2024-04-11T15:58:12Z) - ICAFusion: Iterative Cross-Attention Guided Feature Fusion for
Multispectral Object Detection [25.66305300362193]
A novel feature fusion framework of dual cross-attention transformers is proposed to model global feature interaction.
This framework enhances the discriminability of object features through the query-guided cross-attention mechanism.
The proposed method achieves superior performance and faster inference, making it suitable for various practical scenarios.
arXiv Detail & Related papers (2023-08-15T00:02:10Z) - Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z) - Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z) - Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z) - Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion [112.27103169303184]
Multimodal Knowledge Graphs (MKGs) organize visual-text factual knowledge.
MKGformer achieves SOTA performance on four datasets covering multimodal link prediction, multimodal RE, and multimodal NER.
arXiv Detail & Related papers (2022-05-04T23:40:04Z) - Learning Deep Multimodal Feature Representation with Asymmetric Multi-layer Fusion [63.72912507445662]
We propose a compact and effective framework to fuse multimodal features at multiple layers in a single network.
First, we verify that multimodal features can be learnt within a shared single network by merely maintaining modality-specific batch normalization layers in the encoder.
Second, we propose a bidirectional multi-layer fusion scheme, in which multimodal features can be exploited progressively.
arXiv Detail & Related papers (2021-08-11T03:42:13Z) - Memory based fusion for multi-modal deep learning [39.29589204750581]
We present a novel Memory based Attentive Fusion layer, which fuses modes by incorporating both the current features and long-term dependencies in the data.
arXiv Detail & Related papers (2020-07-16T02:05:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.