Deep Multimodal Fusion by Channel Exchanging
- URL: http://arxiv.org/abs/2011.05005v2
- Date: Sat, 5 Dec 2020 05:42:46 GMT
- Title: Deep Multimodal Fusion by Channel Exchanging
- Authors: Yikai Wang, Wenbing Huang, Fuchun Sun, Tingyang Xu, Yu Rong, Junzhou Huang
- Abstract summary: This paper proposes a parameter-free multimodal fusion framework that dynamically exchanges channels between sub-networks of different modalities.
The validity of such exchanging process is also guaranteed by sharing convolutional filters yet keeping separate BN layers across modalities, which, as an add-on benefit, allows our multimodal architecture to be almost as compact as a unimodal network.
- Score: 87.40768169300898
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep multimodal fusion by using multiple sources of data for classification
or regression has exhibited a clear advantage over the unimodal counterpart on
various applications. Yet, current methods including aggregation-based and
alignment-based fusion are still inadequate in balancing the trade-off between
inter-modal fusion and intra-modal processing, incurring a bottleneck of
performance improvement. To this end, this paper proposes
Channel-Exchanging-Network (CEN), a parameter-free multimodal fusion framework
that dynamically exchanges channels between sub-networks of different
modalities. Specifically, the channel exchanging process is self-guided by
individual channel importance that is measured by the magnitude of
Batch-Normalization (BN) scaling factor during training. The validity of such
exchanging process is also guaranteed by sharing convolutional filters yet
keeping separate BN layers across modalities, which, as an add-on benefit,
allows our multimodal architecture to be almost as compact as a unimodal
network. Extensive experiments on semantic segmentation via RGB-D data and
image translation through multi-domain input verify the effectiveness of our
CEN compared to current state-of-the-art methods. Detailed ablation studies
have also been carried out, which provably affirm the advantage of each
component we propose. Our code is available at https://github.com/yikaiw/CEN.
Related papers
- Rethinking Normalization Strategies and Convolutional Kernels for Multimodal Image Fusion [25.140475569677758]
Multimodal image fusion aims to integrate information from different modalities to obtain a comprehensive image.
Existing methods tend to prioritize natural image fusion and focus on information complementarity and network training strategies.
This paper dissects the significant differences between the two tasks regarding fusion goals, statistical properties, and data distribution.
arXiv Detail & Related papers (2024-11-15T08:36:24Z)
- SOFTS: Efficient Multivariate Time Series Forecasting with Series-Core Fusion [59.96233305733875]
Time series forecasting plays a crucial role in various fields such as finance, traffic management, energy, and healthcare.
Several methods utilize mechanisms like attention or mixers to capture channel correlations.
This paper presents an efficient MLP-based model, the Series-cOre Fused Time Series forecaster (SOFTS).
arXiv Detail & Related papers (2024-04-22T14:06:35Z)
- Deep Equilibrium Multimodal Fusion [88.04713412107947]
Multimodal fusion integrates the complementary information present in multiple modalities and has gained much attention recently.
We propose a novel deep equilibrium (DEQ) method towards multimodal fusion via seeking a fixed point of the dynamic multimodal fusion process.
Experiments on BRCA, MM-IMDB, CMU-MOSI, SUN RGB-D, and VQA-v2 demonstrate the superiority of our DEQ fusion.
arXiv Detail & Related papers (2023-06-29T03:02:20Z)
- Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z)
- Dynamic Multimodal Fusion [8.530680502975095]
Dynamic multimodal fusion (DynMM) is a new approach that adaptively fuses multimodal data and generates data-dependent forward paths during inference.
Results on various multimodal tasks demonstrate the efficiency and wide applicability of our approach.
arXiv Detail & Related papers (2022-03-31T21:35:13Z)
- Multi-Modal Mutual Information Maximization: A Novel Approach for Unsupervised Deep Cross-Modal Hashing [73.29587731448345]
We propose a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH).
We learn informative representations that can preserve both intra- and inter-modal similarities.
The proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
arXiv Detail & Related papers (2021-12-13T08:58:03Z)
- Channel Exchanging Networks for Multimodal and Multitask Dense Image Prediction [125.18248926508045]
We propose Channel-Exchanging-Network (CEN) which is self-adaptive, parameter-free, and more importantly, applicable for both multimodal fusion and multitask learning.
CEN dynamically exchanges channels between sub-networks of different modalities.
For the application of dense image prediction, the validity of CEN is tested by four different scenarios.
arXiv Detail & Related papers (2021-12-04T05:47:54Z)
- A cross-modal fusion network based on self-attention and residual structure for multimodal emotion recognition [7.80238628278552]
We propose a novel cross-modal fusion network based on self-attention and residual structure (CFN-SR) for multimodal emotion recognition.
To verify the effectiveness of the proposed method, we conduct experiments on the RAVDESS dataset.
The experimental results show that the proposed CFN-SR achieves the state-of-the-art and obtains 75.76% accuracy with 26.30M parameters.
arXiv Detail & Related papers (2021-11-03T12:24:03Z)
- Deep feature selection-and-fusion for RGB-D semantic segmentation [8.831857715361624]
This work proposes a unified and efficient feature selection-and-fusion network (FSFNet).
FSFNet contains a symmetric cross-modality residual fusion module used for explicit fusion of multi-modality information.
Compared with the state-of-the-art methods, experimental evaluations demonstrate that the proposed model achieves competitive performance on two public datasets.
arXiv Detail & Related papers (2021-05-10T04:02:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.