Learning Deep Multimodal Feature Representation with Asymmetric Multi-layer Fusion
- URL: http://arxiv.org/abs/2108.05009v1
- Date: Wed, 11 Aug 2021 03:42:13 GMT
- Title: Learning Deep Multimodal Feature Representation with Asymmetric Multi-layer Fusion
- Authors: Yikai Wang, Fuchun Sun, Ming Lu, Anbang Yao
- Abstract summary: We propose a compact and effective framework to fuse multimodal features at multiple layers in a single network.
We verify that multimodal features can be learnt within a shared single network by merely maintaining modality-specific batch normalization layers in the encoder.
We also propose a bidirectional multi-layer fusion scheme, where multimodal features can be exploited progressively.
- Score: 63.72912507445662
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a compact and effective framework to fuse multimodal features at
multiple layers in a single network. The framework consists of two innovative
fusion schemes. Firstly, unlike existing multimodal methods that necessitate
individual encoders for different modalities, we verify that multimodal
features can be learnt within a shared single network by merely maintaining
modality-specific batch normalization layers in the encoder, which also enables
implicit fusion via joint feature representation learning. Secondly, we propose
a bidirectional multi-layer fusion scheme, where multimodal features can be
exploited progressively. To take full advantage of this scheme, we introduce two
asymmetric fusion operations, channel shuffle and pixel shift, which learn
different fused features with respect to different fusion directions. Both
operations are parameter-free: they strengthen multimodal feature interactions
across channels and enhance spatial feature discrimination within channels. We
conduct extensive experiments on semantic segmentation and image translation
tasks, based on three publicly available datasets covering diverse modalities.
Results indicate that our proposed framework is general, compact, and superior
to state-of-the-art fusion frameworks.
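
To make the two fusion schemes concrete, a minimal PyTorch-style sketch of the first one (a shared encoder whose convolution weights are reused by every modality while each modality keeps its own batch normalization statistics) is given below. The names ModalitySpecificBN, SharedEncoderBlock, and num_modalities are illustrative assumptions and are not taken from the authors' released code.

```python
import torch
import torch.nn as nn


class ModalitySpecificBN(nn.Module):
    """One BatchNorm2d per modality; every other weight in the encoder is shared.
    (Illustrative sketch, not the paper's implementation.)"""

    def __init__(self, num_features: int, num_modalities: int = 2):
        super().__init__()
        self.bns = nn.ModuleList(
            nn.BatchNorm2d(num_features) for _ in range(num_modalities)
        )

    def forward(self, x: torch.Tensor, modality: int) -> torch.Tensor:
        # Route the feature map through the BN layer belonging to its modality.
        return self.bns[modality](x)


class SharedEncoderBlock(nn.Module):
    """Convolution shared across modalities, normalization kept modality-specific."""

    def __init__(self, in_ch: int, out_ch: int, num_modalities: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn = ModalitySpecificBN(out_ch, num_modalities)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor, modality: int) -> torch.Tensor:
        return self.act(self.bn(self.conv(x), modality))


# Usage: RGB and depth pass through the same convolution but different BN layers.
block = SharedEncoderBlock(3, 64)
rgb_feat = block(torch.randn(2, 3, 64, 64), modality=0)
depth_feat = block(torch.randn(2, 3, 64, 64), modality=1)
```

The second scheme relies on two parameter-free operations, channel shuffle and pixel shift. The sketch below is one plausible reading of them under stated assumptions (a channel count divisible by four and a shift smaller than the feature-map size); the exact channel grouping and shift pattern used in the paper may differ.

```python
def channel_shuffle_fuse(x_a: torch.Tensor, x_b: torch.Tensor):
    """Parameter-free channel-level mixing: interleave the channels of the two
    modality streams, then split back so each returned stream keeps its original
    channel count but carries half of its channels from the other modality."""
    n, c, h, w = x_a.shape
    interleaved = torch.stack((x_a, x_b), dim=2).reshape(n, 2 * c, h, w)
    return interleaved[:, :c], interleaved[:, c:]


def pixel_shift(x: torch.Tensor, shift: int = 1) -> torch.Tensor:
    """Parameter-free spatial shift: move four channel groups by a few pixels in
    four different directions to sharpen spatial discrimination before fusion."""
    n, c, h, w = x.shape
    out = torch.zeros_like(x)
    q = c // 4
    out[:, 0 * q:1 * q, :, shift:] = x[:, 0 * q:1 * q, :, :-shift]   # shift right
    out[:, 1 * q:2 * q, :, :-shift] = x[:, 1 * q:2 * q, :, shift:]   # shift left
    out[:, 2 * q:3 * q, shift:, :] = x[:, 2 * q:3 * q, :-shift, :]   # shift down
    out[:, 3 * q:, :-shift, :] = x[:, 3 * q:, shift:, :]             # shift up
    return out
```

In a bidirectional multi-layer scheme, operations of this kind would be applied at several encoder stages, fusing features in each direction before passing them on to the next layer.
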
Related papers
- Part-Whole Relational Fusion Towards Multi-Modal Scene Understanding [51.96911650437978]
Multi-modal fusion has played a vital role in multi-modal scene understanding.
Most existing methods focus on cross-modal fusion involving two modalities, often overlooking more complex multi-modal fusion.
We propose a relational Part-Whole Fusion (PWRF) framework for multi-modal scene understanding.
arXiv Detail & Related papers (2024-10-19T02:27:30Z) - StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation [63.31007867379312]
We propose StitchFusion, a framework that integrates large-scale pre-trained models directly as encoders and feature fusers.
We introduce a multi-directional adapter module (MultiAdapter) to enable cross-modal information transfer during encoding.
Our model achieves state-of-the-art performance on four multi-modal segmentation datasets with minimal additional parameters.
arXiv Detail & Related papers (2024-08-02T15:41:16Z) - U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M, an Unbiased Multiscale Modal Fusion Model for multimodal semantic segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z) - Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment
Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z) - AttX: Attentive Cross-Connections for Fusion of Wearable Signals in
Emotion Recognition [15.21696076393078]
Cross-modal attentive connections are a new, dynamic, and effective technique for multimodal representation learning from wearable data.
We perform extensive experiments on three public multimodal wearable datasets, WESAD, SWELL-KW, and CASE.
Our method achieves performance that is superior or competitive to the state of the art and outperforms a variety of baseline uni-modal and classical multimodal methods.
arXiv Detail & Related papers (2022-06-09T17:18:33Z) - CMF: Cascaded Multi-model Fusion for Referring Image Segmentation [24.942658173937563]
We address the task of referring image segmentation (RIS), which aims at predicting a segmentation mask for the object described by a natural language expression.
We propose a simple yet effective Cascaded Multi-modal Fusion (CMF) module, which stacks multiple atrous convolutional layers in parallel (see the sketch after this list).
Experimental results on four benchmark datasets demonstrate that our method outperforms most state-of-the-art methods.
arXiv Detail & Related papers (2021-06-16T08:18:39Z) - MSAF: Multimodal Split Attention Fusion [6.460517449962825]
We propose a novel multimodal fusion module that learns to emphasize more contributive features across all modalities.
Our approach achieves competitive results in each task and outperforms other application-specific networks and multimodal fusion benchmarks.
arXiv Detail & Related papers (2020-12-13T22:42:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences arising from its use.