Generative-based Fusion Mechanism for Multi-Modal Tracking
- URL: http://arxiv.org/abs/2309.01728v3
- Date: Thu, 30 Nov 2023 15:21:01 GMT
- Title: Generative-based Fusion Mechanism for Multi-Modal Tracking
- Authors: Zhangyong Tang, Tianyang Xu, Xuefeng Zhu, Xiao-Jun Wu, Josef Kittler
- Abstract summary: We introduce two generative techniques, Conditional Generative Adversarial Networks (CGANs) and Diffusion Models (DMs), for information fusion in multi-modal tracking.
We condition these multi-modal features with random noise in the GM framework, effectively transforming the original training samples into harder instances.
This design excels at extracting discriminative clues from the features, enhancing the ultimate tracking performance.
- Score: 35.77340348091937
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative models (GMs) have received increasing research interest for their
remarkable capacity to achieve comprehensive understanding. However, their
potential application in the domain of multi-modal tracking has remained
relatively unexplored. In this context, we seek to uncover the potential of
harnessing generative techniques to address the critical challenge, information
fusion, in multi-modal tracking. In this paper, we delve into two prominent GM
techniques, namely, Conditional Generative Adversarial Networks (CGANs) and
Diffusion Models (DMs). Different from the standard fusion process where the
features from each modality are directly fed into the fusion block, we
condition these multi-modal features with random noise in the GM framework,
effectively transforming the original training samples into harder instances.
This design excels at extracting discriminative clues from the features,
enhancing the ultimate tracking performance. To quantitatively gauge the
effectiveness of our approach, we conduct extensive experiments across two
multi-modal tracking tasks, three baseline methods, and three challenging
benchmarks. The experimental results demonstrate that the proposed
generative-based fusion mechanism achieves state-of-the-art performance,
setting new records on LasHeR and RGBD1K.
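As a loose illustration of the mechanism described in the abstract (perturbing the multi-modal features with random noise before fusion), the following PyTorch-style sketch shows one way such a noise-conditioned fusion block could look. The module name, feature shapes, noise schedule, and denoising network are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a noise-conditioned fusion block (not the authors' code).
# Feature names, shapes, and the noise schedule are illustrative assumptions.
import torch
import torch.nn as nn


class NoisyFusionBlock(nn.Module):
    """Fuses two modality feature maps after perturbing them with random noise,
    loosely following the 'condition features with random noise' idea."""

    def __init__(self, dim: int = 256, num_steps: int = 100):
        super().__init__()
        self.num_steps = num_steps
        # Simple linear noise schedule (assumption).
        self.register_buffer("alphas", torch.linspace(0.999, 0.90, num_steps))
        self.denoise = nn.Sequential(
            nn.Linear(2 * dim, 2 * dim),
            nn.GELU(),
            nn.Linear(2 * dim, dim),
        )

    def forward(self, feat_rgb: torch.Tensor, feat_aux: torch.Tensor) -> torch.Tensor:
        # feat_rgb, feat_aux: (B, N, dim) token features from each modality.
        x = torch.cat([feat_rgb, feat_aux], dim=-1)            # (B, N, 2*dim)
        t = torch.randint(0, self.num_steps, (x.size(0),), device=x.device)
        a = self.alphas[t].view(-1, 1, 1)
        noisy = a.sqrt() * x + (1 - a).sqrt() * torch.randn_like(x)
        # Denoising acts as the fusion step: harder (noisier) inputs force the
        # block to rely on discriminative cues shared across modalities.
        return self.denoise(noisy)                             # (B, N, dim) fused feature


if __name__ == "__main__":
    block = NoisyFusionBlock(dim=256)
    fused = block(torch.randn(2, 64, 256), torch.randn(2, 64, 256))
    print(fused.shape)  # torch.Size([2, 64, 256])
```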
Related papers
- RADAR: Robust Two-stage Modality-incomplete Industrial Anomaly Detection [61.71770293720491]
We propose a novel two-stage Robust modAlity-incomplete fusing and Detecting frAmewoRk, abbreviated as RADAR.
Our bootstrapping philosophy is to enhance both stages of Modality-Incomplete Industrial Anomaly Detection (MIIAD), improving the robustness of the Multimodal Transformer.
Our experimental results demonstrate that the proposed RADAR significantly surpasses conventional MIAD methods in terms of effectiveness and robustness.
arXiv Detail & Related papers (2024-10-02T16:47:55Z)
- Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations [19.731611716111566]
We propose a Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations.
We introduce a predictive self-attention module to capture reliable context dynamics within modalities.
A hierarchical cross-modal attention module is designed to explore valuable element correlations among modalities.
A double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner.
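A minimal sketch of how a double-discriminator objective of this kind could be set up: one discriminator classifies the modality from the exclusive features, while the agnostic features are pushed toward modality confusion. The discriminator shapes, the confusion loss, and the number of modalities are assumptions, not the paper's design.

```python
# Hypothetical double-discriminator sketch (illustrative only; not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_MODALITIES = 3  # e.g. audio / vision / text (assumption)

disc_agnostic = nn.Linear(128, NUM_MODALITIES)   # should FAIL to guess the modality
disc_exclusive = nn.Linear(128, NUM_MODALITIES)  # should SUCCEED in guessing it


def double_discriminator_loss(z_agn, z_exc, modality_ids):
    # z_agn, z_exc: (B, 128) agnostic / exclusive features; modality_ids: (B,) long.
    # Exclusive branch: plain classification, keeps modality-specific cues.
    loss_exc = F.cross_entropy(disc_exclusive(z_exc), modality_ids)
    # Agnostic branch: push predictions toward uniform so that modality identity
    # cannot be recovered from the shared representation. In practice the
    # discriminator and encoder would be trained adversarially (alternating
    # updates or a gradient-reversal layer).
    logits_agn = disc_agnostic(z_agn)
    uniform = torch.full_like(logits_agn, 1.0 / NUM_MODALITIES)
    loss_agn = F.kl_div(F.log_softmax(logits_agn, dim=-1), uniform, reduction="batchmean")
    return loss_exc + loss_agn
```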
arXiv Detail & Related papers (2024-07-06T04:36:48Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
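A hypothetical sketch of learnable query tokens that pool global context within one modality via cross-attention, in the spirit of an implicit manipulation query; the dimensions and module choices are assumptions, not the paper's architecture.

```python
# Rough sketch of learnable queries aggregating per-modality context via
# cross-attention. All dimensions and module choices are assumptions.
import torch
import torch.nn as nn


class ImplicitQueryAggregator(nn.Module):
    def __init__(self, dim: int = 256, num_queries: int = 8, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) features of one modality (image patches or text tokens).
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        # Queries attend over the whole token sequence, pooling global context.
        out, _ = self.cross_attn(q, tokens, tokens)
        return out  # (B, num_queries, dim) aggregated cues for this modality
```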
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Provable Dynamic Fusion for Low-Quality Multimodal Data [94.39538027450948]
Dynamic multimodal fusion emerges as a promising learning paradigm.
Despite its widespread use, theoretical justification for dynamic fusion is still notably lacking.
This paper provides a theoretical understanding of when dynamic fusion is beneficial, studied under a widely used multimodal fusion framework from the generalization perspective.
A novel multimodal fusion framework termed Quality-aware Multimodal Fusion (QMF) is proposed, which can improve the performance in terms of classification accuracy and model robustness.
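A rough sketch of quality-aware dynamic fusion: each unimodal prediction is weighted by a simple confidence proxy. The entropy-based quality score used here is an assumption for illustration, not the QMF formulation.

```python
# Illustrative sketch of quality-aware weighting of unimodal predictions;
# the entropy-based quality proxy is an assumption, not QMF itself.
import torch
import torch.nn.functional as F


def quality_weighted_fusion(logits_per_modality):
    """logits_per_modality: list of (B, C) unimodal logits."""
    fused, weights = 0.0, []
    for logits in logits_per_modality:
        # Negative entropy of the unimodal prediction as a crude quality score:
        # confident (low-entropy) modalities receive larger fusion weights.
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1, keepdim=True)
        weights.append(-entropy)
    weights = F.softmax(torch.cat(weights, dim=-1), dim=-1)      # (B, M)
    for m, logits in enumerate(logits_per_modality):
        fused = fused + weights[:, m:m + 1] * logits
    return fused  # (B, C) dynamically fused logits
```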
arXiv Detail & Related papers (2023-06-03T08:32:35Z)
- Quantifying & Modeling Multimodal Interactions: An Information Decomposition Framework [89.8609061423685]
We propose an information-theoretic approach, based on partial information decomposition (PID), to quantify the degree of redundancy, uniqueness, and synergy relating input modalities to an output task.
To validate PID estimation, we conduct extensive experiments on both synthetic datasets where the PID is known and on large-scale multimodal benchmarks.
We demonstrate their usefulness in (1) quantifying interactions within multimodal datasets, (2) quantifying interactions captured by multimodal models, (3) principled approaches for model selection, and (4) three real-world case studies.
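For reference, the standard partial information decomposition of the task-relevant information carried by two modalities can be written as follows (textbook PID notation, not an equation quoted from the paper):

```latex
% Standard partial information decomposition (PID) of two modalities X_1, X_2
% with respect to a target Y: redundancy R, unique terms U_1, U_2, synergy S.
\[
  I(X_1, X_2;\, Y) = R + U_1 + U_2 + S, \qquad
  I(X_1; Y) = R + U_1, \qquad
  I(X_2; Y) = R + U_2 .
\]
```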
arXiv Detail & Related papers (2023-02-23T18:59:05Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
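A minimal sketch of fusing two bimodal pairs and combining the results; pairing the (assumed) dominant text modality with each of the other two modalities is an illustrative choice, not necessarily the paper's configuration.

```python
# Minimal sketch of fusing two bimodal pairs and combining them for a
# sentiment head. Pairings, dimensions, and the head are assumptions.
import torch
import torch.nn as nn


class BimodalPairFusion(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.pair_a = nn.Linear(2 * dim, dim)   # e.g. text + acoustic
        self.pair_b = nn.Linear(2 * dim, dim)   # e.g. text + visual
        self.head = nn.Linear(2 * dim, 1)       # sentiment regression head

    def forward(self, text, acoustic, visual):
        # Each input: (B, dim) utterance-level features.
        fa = torch.relu(self.pair_a(torch.cat([text, acoustic], dim=-1)))
        fb = torch.relu(self.pair_b(torch.cat([text, visual], dim=-1)))
        return self.head(torch.cat([fa, fb], dim=-1))  # (B, 1)
```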
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis [48.776247141839875]
We propose a novel framework, MISA, which projects each modality to two distinct subspaces.
The first subspace is modality-invariant, where the representations across modalities learn their commonalities and reduce the modality gap.
Our experiments on popular sentiment analysis benchmarks, MOSI and MOSEI, demonstrate significant gains over state-of-the-art models.
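A schematic sketch of projecting each modality into a modality-invariant subspace (shared encoder) and a modality-specific subspace (private encoders), with a simple alignment term that shrinks the modality gap; the encoder shapes and the loss are assumptions, not MISA's exact design.

```python
# Schematic sketch of two-subspace projections; dimensions, the shared/private
# encoder split, and the alignment loss are assumptions, not MISA's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoSubspaceProjector(nn.Module):
    def __init__(self, num_modalities: int = 3, dim: int = 128):
        super().__init__()
        # One shared encoder -> modality-invariant subspace (weights shared).
        self.invariant = nn.Linear(dim, dim)
        # One private encoder per modality -> modality-specific subspace.
        self.specific = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_modalities)])

    def forward(self, feats):
        # feats: list of (B, dim) features, one per modality.
        inv = [self.invariant(f) for f in feats]
        spec = [enc(f) for enc, f in zip(self.specific, feats)]
        # Aligning the invariant projections reduces the modality gap; mean
        # pairwise distance is used here as a simple proxy (assumption).
        gap = sum(F.mse_loss(inv[i], inv[j])
                  for i in range(len(inv)) for j in range(i + 1, len(inv)))
        return inv, spec, gap
```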
arXiv Detail & Related papers (2020-05-07T15:13:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.