Memory based fusion for multi-modal deep learning
- URL: http://arxiv.org/abs/2007.08076v3
- Date: Fri, 23 Oct 2020 05:22:34 GMT
- Title: Memory based fusion for multi-modal deep learning
- Authors: Darshana Priyasad, Tharindu Fernando, Simon Denman, Sridha Sridharan,
Clinton Fookes
- Abstract summary: We present a novel Memory based Attentive Fusion (MBAF) layer, which fuses modes by incorporating both the current features and long-term dependencies in the data.
- Score: 39.29589204750581
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The use of multi-modal data for deep machine learning has shown promise when
compared to uni-modal approaches, with fusion of multi-modal features resulting
in improved performance in several applications. However, most state-of-the-art
methods use naive fusion, which processes feature streams independently,
ignoring possible long-term dependencies within the data during fusion. In this
paper, we present a novel Memory based Attentive Fusion (MBAF) layer, which fuses
modes by incorporating both the current features and long-term dependencies in
the data, thus allowing the model to understand the relative importance of
modes over time. We introduce an explicit memory block within the fusion layer
which stores features containing long-term dependencies of the fused data. The
feature inputs from uni-modal encoders are fused through attentive composition
and transformation, followed by naive fusion of the resultant memory-derived
features with layer inputs. Following state-of-the-art methods, we have
evaluated the performance and the generalizability of the proposed fusion
approach on two different datasets with different modalities. In our
experiments, we replace the naive fusion layer in benchmark networks with our
proposed layer to enable a fair comparison. Experimental results indicate that
the MBAF layer can generalise across different modalities and networks to
enhance fusion and improve performance.
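For intuition, the following is a minimal PyTorch sketch of the attend-transform-fuse flow described in the abstract. It is not the authors' implementation: the class name MBAFLayer, the use of learnable memory slots (rather than a memory populated from the fused data over time), the dot-product attention, and all dimensions are illustrative assumptions.
```python
# Illustrative sketch only: memory size, attention form, and output fusion are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MBAFLayer(nn.Module):
    """Fuses two uni-modal feature vectors with the help of an explicit memory block."""

    def __init__(self, dim_a: int, dim_b: int, fused_dim: int, memory_slots: int = 32):
        super().__init__()
        fused_in = dim_a + dim_b
        # Explicit memory block: slots intended to hold long-term fused context.
        self.memory = nn.Parameter(torch.randn(memory_slots, fused_dim) * 0.02)
        self.query = nn.Linear(fused_in, fused_dim)             # attentive composition
        self.transform = nn.Linear(fused_dim, fused_dim)        # transformation of the memory read
        self.out = nn.Linear(fused_in + fused_dim, fused_dim)   # fusion of memory-derived features with inputs

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # Concatenate the uni-modal encoder outputs.
        inputs = torch.cat([feat_a, feat_b], dim=-1)               # (batch, dim_a + dim_b)
        # Attend over the memory slots, using the concatenated inputs as the query.
        q = self.query(inputs)                                     # (batch, fused_dim)
        attn = F.softmax(q @ self.memory.t(), dim=-1)              # (batch, memory_slots)
        memory_read = attn @ self.memory                           # (batch, fused_dim)
        # Transform the memory-derived features, then fuse them with the layer inputs.
        memory_feat = torch.tanh(self.transform(memory_read))
        return self.out(torch.cat([inputs, memory_feat], dim=-1))  # (batch, fused_dim)
```
In the paper the memory stores features containing long-term dependencies of the fused data, so a write-enabled memory updated over time would replace the static learnable slots used here for brevity.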
Related papers
- Appformer: A Novel Framework for Mobile App Usage Prediction Leveraging Progressive Multi-Modal Data Fusion and Feature Extraction [9.53224378857976]
Appformer is a novel mobile application prediction framework inspired by the efficiency of Transformer-like architectures.
The framework employs Points of Interest (POIs) associated with base stations, optimizing them through comparative experiments to identify the most effective clustering method.
The Feature Extraction Module, employing Transformer-like architectures specialized for time series analysis, adeptly distils comprehensive features.
arXiv Detail & Related papers (2024-07-28T06:41:31Z) - Progressively Modality Freezing for Multi-Modal Entity Alignment [27.77877721548588]
We propose a novel strategy of progressive modality freezing, called PMF, that focuses on alignment-relevant features.
Notably, our approach introduces a pioneering cross-modal association loss to foster modal consistency.
Empirical evaluations across nine datasets confirm PMF's superiority.
arXiv Detail & Related papers (2024-07-23T04:22:30Z) - Fusion-Mamba for Cross-modality Object Detection [63.56296480951342]
Cross-modality fusion of information from different modalities effectively improves object detection performance.
We design a Fusion-Mamba block (FMB) to map cross-modal features into a hidden state space for interaction.
Our proposed approach outperforms state-of-the-art methods, improving mAP by 5.9% on the M3FD dataset and 4.9% on the FLIR-Aligned dataset.
arXiv Detail & Related papers (2024-04-14T05:28:46Z) - From Text to Pixels: A Context-Aware Semantic Synergy Solution for
Infrared and Visible Image Fusion [66.33467192279514]
We introduce a text-guided multi-modality image fusion method that leverages the high-level semantics from textual descriptions to integrate semantics from infrared and visible images.
Our method not only produces visually superior fusion results but also achieves a higher detection mAP than existing methods, attaining state-of-the-art results.
arXiv Detail & Related papers (2023-12-31T08:13:47Z) - Deep Equilibrium Multimodal Fusion [88.04713412107947]
Multimodal fusion integrates the complementary information present in multiple modalities and has gained much attention recently.
We propose a novel deep equilibrium (DEQ) method for multimodal fusion that seeks a fixed point of the dynamic multimodal fusion process (a generic fixed-point sketch is given after this list).
Experiments on BRCA, MM-IMDB, CMU-MOSI, SUN RGB-D, and VQA-v2 demonstrate the superiority of our DEQ fusion.
arXiv Detail & Related papers (2023-06-29T03:02:20Z) - Improving Multimodal Fusion with Hierarchical Mutual Information
Maximization for Multimodal Sentiment Analysis [16.32509144501822]
We propose a framework named MultiModal InfoMax (MMIM), which hierarchically maximizes the Mutual Information (MI) in unimodal input pairs.
The framework is jointly trained with the main task (MSA) to improve the performance of the downstream MSA task.
arXiv Detail & Related papers (2021-09-01T14:45:16Z) - Learning Deep Multimodal Feature Representation with Asymmetric
Multi-layer Fusion [63.72912507445662]
We propose a compact and effective framework to fuse multimodal features at multiple layers in a single network.
We verify that multimodal features can be learnt within a shared single network by merely maintaining modality-specific batch normalization layers in the encoder.
Secondly, we propose a bidirectional multi-layer fusion scheme, where multimodal features can be exploited progressively.
arXiv Detail & Related papers (2021-08-11T03:42:13Z) - Multimodal Fusion Refiner Networks [22.93868090722948]
We develop a Refiner Fusion Network (ReFNet) that enables fusion modules to combine strong unimodal representation with strong multimodal representations.
ReFNet combines the fusion network with a decoding/defusing module, which imposes a modality-centric responsibility condition.
We demonstrate that the Refiner Fusion Network can improve upon the performance of powerful baseline fusion modules such as multimodal transformers.
arXiv Detail & Related papers (2021-04-08T00:02:01Z) - Deep Multimodal Fusion by Channel Exchanging [87.40768169300898]
This paper proposes a parameter-free multimodal fusion framework that dynamically exchanges channels between sub-networks of different modalities.
The validity of such an exchanging process is also guaranteed by sharing convolutional filters while keeping separate BN layers across modalities, which, as an added benefit, allows our multimodal architecture to be almost as compact as a unimodal network.
arXiv Detail & Related papers (2020-11-10T09:53:20Z) - Multi-Modality Cascaded Fusion Technology for Autonomous Driving [18.93984652806857]
We propose a general multi-modality cascaded fusion framework, exploiting the advantages of decision-level and feature-level fusion.
In the fusion process, dynamic coordinate alignment (DCA) is conducted to reduce the error between sensors from different modalities.
The proposed step-by-step cascaded fusion framework is more interpretable and flexible compared to end-to-end fusion methods.
arXiv Detail & Related papers (2020-02-08T10:59:18Z)
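For the deep equilibrium fusion entry above, the following is a generic fixed-point iteration sketch under the usual DEQ formulation z* = f(z*, x). The single-layer fusion cell, the naive forward iteration (practical DEQ solvers typically use root-finding such as Anderson acceleration or Broyden's method), and the convergence threshold are assumptions, not details from that paper.
```python
# Generic fixed-point fusion sketch; not the DEQ fusion paper's actual layer or solver.
import torch
import torch.nn as nn


class FixedPointFusion(nn.Module):
    """Iterates z = f(z, x) until convergence, where x is the injected multimodal input."""

    def __init__(self, dim: int):
        super().__init__()
        self.cell = nn.Linear(2 * dim, dim)  # assumed single-layer fusion cell

    def forward(self, x: torch.Tensor, max_iter: int = 50, tol: float = 1e-4) -> torch.Tensor:
        # x: e.g. a projection of concatenated modality features, shape (batch, dim).
        z = torch.zeros_like(x)
        for _ in range(max_iter):
            z_next = torch.tanh(self.cell(torch.cat([z, x], dim=-1)))
            if (z_next - z).abs().max() < tol:  # simple convergence check
                return z_next
            z = z_next
        return z
```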
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.