Multimodal Fusion Refiner Networks
- URL: http://arxiv.org/abs/2104.03435v1
- Date: Thu, 8 Apr 2021 00:02:01 GMT
- Title: Multimodal Fusion Refiner Networks
- Authors: Sethuraman Sankaran, David Yang, Ser-Nam Lim
- Abstract summary: We develop a Refiner Fusion Network (ReFNet) that enables fusion modules to combine strong unimodal representations with strong multimodal representations.
ReFNet combines the fusion network with a decoding/defusing module, which imposes a modality-centric responsibility condition.
We demonstrate that the Refiner Fusion Network can improve upon the performance of powerful baseline fusion modules such as multimodal transformers.
- Score: 22.93868090722948
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tasks that rely on multi-modal information typically include a fusion module
that combines information from different modalities. In this work, we develop a
Refiner Fusion Network (ReFNet) that enables fusion modules to combine strong
unimodal representations with strong multimodal representations. ReFNet combines
the fusion network with a decoding/defusing module, which imposes a
modality-centric responsibility condition. This approach addresses a significant gap in
existing multimodal fusion frameworks by ensuring that both unimodal and fused
representations are strongly encoded in the latent fusion space. We demonstrate
that the Refiner Fusion Network can improve upon the performance of powerful
baseline fusion modules such as multimodal transformers. The refiner network
enables inducing graphical representations of the fused embeddings in the
latent space, which we prove under certain conditions and which is supported by
strong empirical results in the numerical experiments. These graph structures
are further strengthened by combining the ReFNet with a Multi-Similarity
contrastive loss function. The modular nature of the Refiner Fusion Network
allows it to be combined easily with different fusion architectures, and in
addition, the refiner step can be applied for pre-training on unlabeled
datasets, thus leveraging unsupervised data towards improving performance. We
demonstrate the power of Refiner Fusion Networks on three datasets, and further
show that they can maintain performance with only a small fraction of labeled
data.
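The fuse/defuse structure described in the abstract can be illustrated compactly. The following is a minimal PyTorch sketch, assuming two modalities, concatenation-based fusion, linear defusing heads, and a simple reconstruction objective; the class name, dimensions, and MSE loss are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of the fuse/defuse (refiner) idea: the fused embedding must
# allow each modality's representation to be decoded back out of it.
# Assumptions: concatenation fusion, linear defusing heads, MSE reconstruction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefinerFusionSketch(nn.Module):
    def __init__(self, dim_a: int, dim_b: int, fused_dim: int):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(dim_a + dim_b, fused_dim), nn.ReLU())
        # Defusing/decoding heads: each must recover its own modality's
        # representation from the fused embedding (the modality-centric
        # responsibility condition).
        self.defuse_a = nn.Linear(fused_dim, dim_a)
        self.defuse_b = nn.Linear(fused_dim, dim_b)

    def forward(self, z_a: torch.Tensor, z_b: torch.Tensor):
        fused = self.fuse(torch.cat([z_a, z_b], dim=-1))
        # Refiner loss: the fused code should retain both unimodal signals.
        refine_loss = (F.mse_loss(self.defuse_a(fused), z_a)
                       + F.mse_loss(self.defuse_b(fused), z_b))
        return fused, refine_loss
```

In a supervised setting, a refiner loss of this kind would be added to the task loss; on unlabeled data it can serve on its own as a pre-training signal, in line with the abstract's point about leveraging unsupervised data.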
Related papers
- Part-Whole Relational Fusion Towards Multi-Modal Scene Understanding [51.96911650437978]
Multi-modal fusion has played a vital role in multi-modal scene understanding.
Most existing methods focus on cross-modal fusion involving two modalities, often overlooking more complex multi-modal fusion.
We propose a relational Part-Whole Fusion (PWRF) framework for multi-modal scene understanding.
arXiv Detail & Related papers (2024-10-19T02:27:30Z)
- StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation [63.31007867379312]
We propose StitchFusion, a framework that integrates large-scale pre-trained models directly as encoders and feature fusers.
We introduce a multi-directional adapter module (MultiAdapter) to enable cross-modal information transfer during encoding.
Our model achieves state-of-the-art performance on four multi-modal segmentation datasets with minimal additional parameters.
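As a rough illustration of cross-modal information transfer through adapters during encoding, here is a generic bottleneck-adapter sketch in PyTorch; `CrossModalAdapter`, its bottleneck size, and the residual update are assumptions and do not reproduce StitchFusion's actual MultiAdapter design.

```python
# Generic bottleneck adapter that injects one modality's features into
# another as a residual update (illustrative; not the MultiAdapter itself).
import torch
import torch.nn as nn

class CrossModalAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, target_feat: torch.Tensor, source_feat: torch.Tensor) -> torch.Tensor:
        # Compress the source modality's features, then add them to the target
        # branch so the encoders exchange information with few extra parameters.
        return target_feat + self.up(self.act(self.down(source_feat)))
```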
arXiv Detail & Related papers (2024-08-02T15:41:16Z)
- LMFNet: An Efficient Multimodal Fusion Approach for Semantic Segmentation in High-Resolution Remote Sensing [25.016421338677816]
Current methods often process only two types of data, missing out on the rich information that additional modalities can provide.
We propose a novel Lightweight Multimodal data Fusion Network (LMFNet).
LMFNet accommodates various data types simultaneously, including RGB, NirRG, and DSM, through a weight-sharing, multi-branch vision transformer.
arXiv Detail & Related papers (2024-04-21T13:29:42Z)
- ReFusion: Learning Image Fusion from Reconstruction with Learnable Loss via Meta-Learning [17.91346343984845]
We introduce a unified image fusion framework based on meta-learning, named ReFusion.
ReFusion employs a parameterized loss function, dynamically adjusted by the training framework according to the specific scenario and task.
It is capable of adapting to various tasks, including infrared-visible, medical, multi-focus, and multi-exposure image fusion.
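The notion of a parameterized loss can be sketched as a set of learnable weights over candidate loss terms; the meta-learning procedure ReFusion uses to adapt such parameters is not reproduced here, and `ParameterizedFusionLoss` is a hypothetical illustration.

```python
# Sketch of a parameterized fusion loss: learnable logits, passed through a
# softmax, weight several candidate loss terms (each a scalar tensor).
import torch
import torch.nn as nn

class ParameterizedFusionLoss(nn.Module):
    def __init__(self, num_terms: int):
        super().__init__()
        # Learnable logits; softmax turns them into positive weights summing to 1.
        self.logits = nn.Parameter(torch.zeros(num_terms))

    def forward(self, loss_terms):
        weights = torch.softmax(self.logits, dim=0)
        # Weighted combination of the individual loss terms.
        return sum(w * t for w, t in zip(weights, loss_terms))
```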
arXiv Detail & Related papers (2023-12-13T07:40:39Z)
- Deep Equilibrium Multimodal Fusion [88.04713412107947]
Multimodal fusion integrates the complementary information present in multiple modalities and has gained much attention recently.
We propose a novel deep equilibrium (DEQ) method for multimodal fusion that seeks a fixed point of the dynamic multimodal fusion process.
Experiments on BRCA, MM-IMDB, CMU-MOSI, SUN RGB-D, and VQA-v2 demonstrate the superiority of our DEQ fusion.
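The fixed-point view of fusion admits a simple illustration: repeatedly apply a fusion cell until the fused state stops changing. The sketch below assumes a user-supplied `cell` taking (state, inputs) and uses plain iteration; the cited DEQ approach additionally trains through the equilibrium with implicit differentiation, which is not shown.

```python
# Illustrative fixed-point iteration for fusion: apply a fusion cell until the
# fused state converges (forward pass only; not the paper's DEQ solver).
import torch
import torch.nn as nn

def fixed_point_fusion(cell: nn.Module, inputs: torch.Tensor,
                       max_iters: int = 50, tol: float = 1e-4) -> torch.Tensor:
    z = torch.zeros_like(inputs)                  # initial fused state
    for _ in range(max_iters):
        z_next = cell(z, inputs)                  # one step of the fusion dynamics
        if torch.norm(z_next - z) / (torch.norm(z_next) + 1e-8) < tol:
            return z_next                         # (approximately) a fixed point
        z = z_next
    return z
```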
arXiv Detail & Related papers (2023-06-29T03:02:20Z)
- CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion [138.40422469153145]
We propose a novel Correlation-Driven feature Decomposition Fusion (CDDFuse) network.
We show that CDDFuse achieves promising results in multiple fusion tasks, including infrared-visible image fusion and medical image fusion.
arXiv Detail & Related papers (2022-11-26T02:40:28Z)
- ScaleVLAD: Improving Multimodal Sentiment Analysis via Multi-Scale Fusion of Locally Descriptors [15.042741192427334]
This paper proposes a fusion model named ScaleVLAD to gather multi-scale representations from text, video, and audio.
Experiments on three popular sentiment analysis benchmarks, IEMOCAP, MOSI, and MOSEI, demonstrate significant gains over baselines.
arXiv Detail & Related papers (2021-12-02T16:09:33Z)
- Multi-modal land cover mapping of remote sensing images using pyramid attention and gated fusion networks [20.66034058363032]
We propose a new multi-modality network for land cover mapping of multi-modal remote sensing data based on a novel pyramid attention fusion (PAF) module and a gated fusion unit (GFU).
The PAF module is designed to efficiently obtain rich fine-grained contextual representations from each modality with a built-in cross-level and cross-view attention fusion mechanism.
The GFU utilizes a novel gating mechanism for the early merging of features, thereby diminishing hidden redundancies and noise.
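A gating mechanism of the kind mentioned above can be sketched as a learned, element-wise convex combination of two modalities' features; `GatedFusion` is a generic illustration, not the cited GFU design.

```python
# Simple gated fusion sketch: a learned gate decides, per feature, how much of
# each modality to keep before merging.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([feat_a, feat_b], dim=-1))  # gate values in (0, 1)
        return g * feat_a + (1.0 - g) * feat_b              # element-wise convex combination
```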
arXiv Detail & Related papers (2021-11-06T10:01:01Z)
- Learning Deep Multimodal Feature Representation with Asymmetric Multi-layer Fusion [63.72912507445662]
We propose a compact and effective framework to fuse multimodal features at multiple layers in a single network.
We verify that multimodal features can be learnt within a shared single network by merely maintaining modality-specific batch normalization layers in the encoder.
In addition, we propose a bidirectional multi-layer fusion scheme, where multimodal features can be exploited progressively.
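The idea of sharing one network while keeping modality-specific batch normalization can be sketched as follows; `SharedEncoderModalityBN` is an illustrative block, not the cited architecture.

```python
# Sketch of a shared encoder block with modality-specific batch norm:
# the convolution weights are shared, only the normalization differs.
import torch
import torch.nn as nn

class SharedEncoderModalityBN(nn.Module):
    def __init__(self, channels: int, num_modalities: int = 2):
        super().__init__()
        # Convolution weights shared across modalities...
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # ...while each modality keeps its own batch-norm statistics.
        self.bns = nn.ModuleList([nn.BatchNorm2d(channels) for _ in range(num_modalities)])

    def forward(self, x: torch.Tensor, modality: int) -> torch.Tensor:
        return torch.relu(self.bns[modality](self.conv(x)))
```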
arXiv Detail & Related papers (2021-08-11T03:42:13Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
The Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- Memory based fusion for multi-modal deep learning [39.29589204750581]
We present a novel Memory based Attentive Fusion layer, which fuses modes by incorporating both the current features and long-term dependencies in the data.
arXiv Detail & Related papers (2020-07-16T02:05:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.