Sparse Fusion for Multimodal Transformers
- URL: http://arxiv.org/abs/2111.11992v1
- Date: Tue, 23 Nov 2021 16:43:49 GMT
- Title: Sparse Fusion for Multimodal Transformers
- Authors: Yi Ding, Alex Rich, Mason Wang, Noah Stier, Pradeep Sen, Matthew Turk, Tobias Höllerer
- Abstract summary: We present Sparse Fusion Transformers (SFT), a novel multimodal fusion method for transformers.
Key to our idea is a sparse-pooling block that reduces unimodal token sets prior to cross-modality modeling.
State-of-the-art performance is obtained on multiple benchmarks under similar experimental conditions, with up to a six-fold reduction in computational cost and memory requirements.
- Score: 7.98117428941095
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal classification is a core task in human-centric machine learning.
We observe that information is highly complementary across modalities; thus,
unimodal information can be drastically sparsified prior to multimodal fusion
without loss of accuracy. To this end, we present Sparse Fusion Transformers
(SFT), a novel multimodal fusion method for transformers that performs
comparably to existing state-of-the-art methods while having greatly reduced
memory footprint and computation cost. Key to our idea is a sparse-pooling
block that reduces unimodal token sets prior to cross-modality modeling.
Evaluations are conducted on multiple multimodal benchmark datasets for a wide
range of classification tasks. State-of-the-art performance is obtained on
multiple benchmarks under similar experimental conditions, with up to a
six-fold reduction in computational cost and memory requirements. Extensive
ablation studies showcase the benefits of combining sparsification and
multimodal learning over naive approaches. This paves the way for enabling
multimodal learning on low-resource devices.
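As a rough illustration of the idea described above, the following is a minimal sketch of sparse pre-fusion pooling, assuming a learned score-based top-k selection as the sparse-pooling block and a small transformer encoder for cross-modality modeling; the paper's actual block design, token budgets, and hyperparameters may differ, and the names SparsePool and SparseFusionClassifier are hypothetical.

```python
# Minimal sketch: score-based top-k pooling of each unimodal token set,
# followed by a standard transformer encoder over the concatenated (much
# smaller) token sequence. All settings here are illustrative assumptions.
import torch
import torch.nn as nn


class SparsePool(nn.Module):
    """Keep the k highest-scoring tokens of a unimodal token set."""

    def __init__(self, dim: int, k: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # learned per-token importance score
        self.k = k

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        scores = self.score(tokens).squeeze(-1)            # (batch, num_tokens)
        idx = scores.topk(self.k, dim=1).indices           # (batch, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return tokens.gather(1, idx)                       # (batch, k, dim)


class SparseFusionClassifier(nn.Module):
    def __init__(self, dim=256, k=8, num_classes=10):
        super().__init__()
        self.pool_a = SparsePool(dim, k)   # e.g. audio tokens
        self.pool_b = SparsePool(dim, k)   # e.g. visual tokens
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tok_a: torch.Tensor, tok_b: torch.Tensor) -> torch.Tensor:
        # Sparsify each modality *before* cross-modality attention, so the
        # fusion encoder only sees 2k tokens instead of the full token sets.
        fused = self.fusion(torch.cat([self.pool_a(tok_a), self.pool_b(tok_b)], dim=1))
        return self.head(fused.mean(dim=1))                # pooled logits


# Example: 128 audio tokens and 196 visual tokens reduced to 8 + 8 before fusion.
logits = SparseFusionClassifier()(torch.randn(2, 128, 256), torch.randn(2, 196, 256))
```

The design point is that cross-modal attention runs over 2k tokens rather than the full unimodal token sets, which is where savings of this kind in compute and memory would come from.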
Related papers
- GSIFN: A Graph-Structured and Interlaced-Masked Multimodal Transformer-based Fusion Network for Multimodal Sentiment Analysis [0.0]
Multimodal Sentiment Analysis (MSA) leverages multiple data modalities to analyze human sentiment.
Existing MSA models generally employ cutting-edge multimodal fusion and representation learning-based methods to promote MSA capability.
Our proposed GSIFN incorporates two main components to solve these problems: (i) a graph-structured and interlaced-masked multimodal Transformer.
It adopts the Interlaced Mask mechanism to construct robust multimodal graph embedding, achieve all-modal-in-one Transformer-based fusion, and greatly reduce the computational overhead.
arXiv Detail & Related papers (2024-08-27T06:44:28Z)
- Deep Equilibrium Multimodal Fusion [88.04713412107947]
Multimodal fusion integrates the complementary information present in multiple modalities and has gained much attention recently.
We propose a novel deep equilibrium (DEQ) method towards multimodal fusion via seeking a fixed point of the dynamic multimodal fusion process.
Experiments on BRCA, MM-IMDB, CMU-MOSI, SUN RGB-D, and VQA-v2 demonstrate the superiority of our DEQ fusion.
arXiv Detail & Related papers (2023-06-29T03:02:20Z)
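A minimal sketch of the fixed-point view of fusion described in the Deep Equilibrium Multimodal Fusion entry above: the fused state z is defined implicitly by z = f(z, x_a, x_b) and is found here by plain forward iteration. The FusionCell, solver, and tolerance are illustrative assumptions; DEQ models are normally trained by differentiating through the fixed point implicitly, which this sketch omits.

```python
# Sketch of deep-equilibrium-style fusion: the fused state z is the fixed point
# of a small fusion cell f(z, x_a, x_b), found here by plain forward iteration.
import torch
import torch.nn as nn


class FusionCell(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.mix = nn.Linear(3 * dim, dim)

    def forward(self, z, x_a, x_b):
        # One update of the fused state from the current state and both inputs.
        return torch.tanh(self.mix(torch.cat([z, x_a, x_b], dim=-1)))


def deq_fuse(cell, x_a, x_b, max_iter=50, tol=1e-4):
    """Iterate z <- f(z, x_a, x_b) until the update is (approximately) a fixed point."""
    z = torch.zeros_like(x_a)
    for _ in range(max_iter):
        z_next = cell(z, x_a, x_b)
        if (z_next - z).norm() < tol * (z.norm() + 1e-8):
            return z_next
        z = z_next
    return z


x_a, x_b = torch.randn(4, 64), torch.randn(4, 64)    # pooled unimodal features
z_star = deq_fuse(FusionCell(64), x_a, x_b)          # equilibrium fused feature
```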
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
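As a hedged sketch of the module described in the Learning Unseen Modality Interaction entry above, the snippet below projects each modality's features into one shared space and fuses whatever subset of modalities is present by simple mean pooling; the dimensions, the name CommonSpaceFusion, and the mean-pooling choice are assumptions rather than the paper's design.

```python
# Sketch of projecting per-modality features into one shared space so that any
# (possibly unseen) combination of modalities can be fused by simple pooling.
import torch
import torch.nn as nn


class CommonSpaceFusion(nn.Module):
    def __init__(self, modality_dims: dict, common_dim: int = 128):
        super().__init__()
        # One projection head per modality, all mapping into the same space.
        self.proj = nn.ModuleDict(
            {name: nn.Linear(d, common_dim) for name, d in modality_dims.items()}
        )

    def forward(self, feats: dict) -> torch.Tensor:
        # Only the modalities present at inference time are projected and fused,
        # so combinations never seen together during training are still handled.
        shared = [self.proj[name](x) for name, x in feats.items()]
        return torch.stack(shared, dim=0).mean(dim=0)


fusion = CommonSpaceFusion({"video": 512, "audio": 128, "text": 300})
out = fusion({"audio": torch.randn(2, 128), "text": torch.randn(2, 300)})
```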
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- Efficient Multimodal Fusion via Interactive Prompting [62.08292938484994]
Large-scale pre-training has brought unimodal fields such as computer vision and natural language processing to a new era.
We propose an efficient and flexible multimodal fusion method, namely PMF, tailored for fusing unimodally pre-trained transformers.
arXiv Detail & Related papers (2023-04-13T07:31:51Z)
- Generalizing Multimodal Variational Methods to Sets [35.69942798534849]
This paper presents a novel variational method on sets called the Set Multimodal VAE (SMVAE) for learning a multimodal latent space.
By modeling the joint-modality posterior distribution directly, the proposed SMVAE learns to exchange information between multiple modalities and compensate for the drawbacks caused by factorization.
arXiv Detail & Related papers (2022-12-19T23:50:19Z)
- Dynamic Multimodal Fusion [8.530680502975095]
Dynamic multimodal fusion (DynMM) is a new approach that adaptively fuses multimodal data and generates data-dependent forward paths during inference.
Results on various multimodal tasks demonstrate the efficiency and wide applicability of our approach.
arXiv Detail & Related papers (2022-03-31T21:35:13Z)
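The Dynamic Multimodal Fusion entry above describes data-dependent forward paths; the sketch below illustrates one way such routing could look, with a small gate choosing per sample between a cheap unimodal branch and a fuller fusion branch. The branch definitions and hard gating rule are illustrative assumptions, not DynMM's actual architecture.

```python
# Sketch of data-dependent forward paths: a small gate routes each sample either
# through a cheap unimodal branch or through a more expensive fusion branch.
import torch
import torch.nn as nn


class DynamicFusion(nn.Module):
    def __init__(self, dim=64, num_classes=5):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 2)                 # picks one of two paths
        self.cheap = nn.Linear(dim, num_classes)          # unimodal path (modality A only)
        self.full = nn.Sequential(                        # multimodal fusion path
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_classes)
        )

    def forward(self, x_a, x_b):
        pair = torch.cat([x_a, x_b], dim=-1)
        route = self.gate(pair).argmax(dim=-1)            # (batch,) 0 = cheap, 1 = full
        # Both paths are computed here for brevity; a real dynamic network would
        # skip the unused branch at inference time to save the advertised compute.
        out_cheap, out_full = self.cheap(x_a), self.full(pair)
        return torch.where(route.unsqueeze(-1) == 1, out_full, out_cheap)


logits = DynamicFusion()(torch.randn(8, 64), torch.randn(8, 64))
```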
- Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis [16.32509144501822]
We propose a framework named MultiModal InfoMax (MMIM), which hierarchically maximizes the Mutual Information (MI) in unimodal input pairs.
The framework is jointly trained with the main task (MSA) to improve the performance of the downstream MSA task.
arXiv Detail & Related papers (2021-09-01T14:45:16Z)
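One standard way to maximize mutual information between unimodal pairs, as in the MMIM entry above, is an InfoNCE-style lower bound used as an auxiliary loss next to the main task loss; the estimator below is an illustrative stand-in, not MMIM's exact hierarchy of bounds.

```python
# Sketch of an InfoNCE-style mutual-information lower bound between two
# modality representations, used as an auxiliary loss alongside the main task.
import torch
import torch.nn.functional as F


def infonce_mi_loss(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1):
    """Negative InfoNCE bound: minimizing it pulls paired (z_a, z_b) together."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)       # matched pairs are the positives


loss_mi = infonce_mi_loss(torch.randn(16, 64), torch.randn(16, 64))
# total_loss = task_loss + lambda_mi * loss_mi    # jointly trained with the main task
```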
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z)
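For the Attention Bottlenecks entry above, here is a minimal sketch of bottleneck-style fusion: a handful of shared bottleneck tokens are the only route through which the two modalities exchange information at a fusion layer. The single sequential layer, token counts, and class name are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of fusion through a small set of shared "bottleneck" tokens: each
# modality attends only to [its own tokens + bottleneck tokens], so cross-modal
# information must pass through the bottleneck.
import torch
import torch.nn as nn


class BottleneckFusionLayer(nn.Module):
    def __init__(self, dim=256, nhead=4, num_bottleneck=4):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(1, num_bottleneck, dim))
        self.enc_a = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.enc_b = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)

    def forward(self, tok_a, tok_b):
        n_b = self.bottleneck.size(1)
        btl = self.bottleneck.expand(tok_a.size(0), -1, -1)
        # Modality A updates itself and the bottleneck tokens...
        out_a = self.enc_a(torch.cat([tok_a, btl], dim=1))
        tok_a, btl = out_a[:, :-n_b], out_a[:, -n_b:]
        # ...then modality B sees only the (updated) bottleneck, not A's tokens.
        out_b = self.enc_b(torch.cat([tok_b, btl], dim=1))
        return tok_a, out_b[:, :-n_b], out_b[:, -n_b:]


tok_a, tok_b, btl = BottleneckFusionLayer()(torch.randn(2, 50, 256), torch.randn(2, 30, 256))
```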
This list is automatically generated from the titles and abstracts of the papers on this site.