Progressive Fusion for Multimodal Integration
- URL: http://arxiv.org/abs/2209.00302v1
- Date: Thu, 1 Sep 2022 09:08:33 GMT
- Title: Progressive Fusion for Multimodal Integration
- Authors: Shiv Shankar, Laure Thompson, Madalina Fiterau
- Abstract summary: We present an iterative representation refinement approach, called Progressive Fusion, which mitigates the issues with late fusion representations.
We show that our approach consistently improves performance, for instance attaining a 5% reduction in MSE and 40% improvement in robustness on multimodal time series prediction.
- Score: 12.94175198001421
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Integration of multimodal information from various sources has been shown to
boost the performance of machine learning models and thus has received
increased attention in recent years. Often such models use deep
modality-specific networks to obtain unimodal features which are combined to
obtain "late-fusion" representations. However, these designs run the risk of
information loss in the respective unimodal pipelines. On the other hand,
"early-fusion" methodologies, which combine features early, suffer from the
problems associated with feature heterogeneity and high sample complexity. In
this work, we present an iterative representation refinement approach, called
Progressive Fusion, which mitigates the issues with late fusion
representations. Our model-agnostic technique introduces backward connections
that make late stage fused representations available to early layers, improving
the expressiveness of the representations at those stages, while retaining the
advantages of late fusion designs. We test Progressive Fusion on tasks
including affective sentiment detection, multimedia analysis, and time series
fusion with different models, demonstrating its versatility. We show that our
approach consistently improves performance, for instance attaining a 5%
reduction in MSE and 40% improvement in robustness on multimodal time series
prediction.
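As a rough illustration of the idea described in the abstract, the sketch below wires backward connections from a late-fusion head back into the unimodal branches and refines the fused representation over a few iterations. It is a minimal PyTorch sketch under assumed shapes, module names, and iteration count, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): progressive fusion with
# backward connections that feed the fused representation back to the
# unimodal branches for iterative refinement.
import torch
import torch.nn as nn


class ProgressiveFusion(nn.Module):
    def __init__(self, dim_a, dim_b, hidden=64, fused=32, steps=3):
        super().__init__()
        self.steps = steps
        # Unimodal encoders also receive the previous fused representation
        # through a backward connection (concatenated to the raw input).
        self.enc_a = nn.Sequential(nn.Linear(dim_a + fused, hidden), nn.ReLU())
        self.enc_b = nn.Sequential(nn.Linear(dim_b + fused, hidden), nn.ReLU())
        # Late-fusion head over the concatenated unimodal features.
        self.fuse = nn.Linear(2 * hidden, fused)

    def forward(self, x_a, x_b):
        z = x_a.new_zeros(x_a.size(0), self.fuse.out_features)  # initial fused state
        for _ in range(self.steps):
            h_a = self.enc_a(torch.cat([x_a, z], dim=-1))
            h_b = self.enc_b(torch.cat([x_b, z], dim=-1))
            z = self.fuse(torch.cat([h_a, h_b], dim=-1))  # refined fused representation
        return z


# Usage example with random inputs
model = ProgressiveFusion(dim_a=10, dim_b=20)
out = model(torch.randn(4, 10), torch.randn(4, 20))
print(out.shape)  # torch.Size([4, 32])
```

Because the fused state re-enters the unimodal encoders, each branch can adjust its features using cross-modal context while the final prediction still comes from a late-fusion head, which is the trade-off the abstract describes.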
Related papers
- MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling [64.09238330331195]
We propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework.
Unlike the discretization line of methods, MMAR takes in continuous-valued image tokens to avoid information loss.
We show that MMAR achieves markedly better performance than other joint multi-modal models.
arXiv Detail & Related papers (2024-10-14T17:57:18Z)
- MMLF: Multi-modal Multi-class Late Fusion for Object Detection with Uncertainty Estimation [13.624431305114564]
This paper introduces a pioneering Multi-modal Multi-class Late Fusion method, designed to enable multi-class detection through late fusion.
Experiments conducted on the KITTI validation and official test datasets illustrate substantial performance improvements.
Our approach incorporates uncertainty analysis into the classification fusion process, rendering our model more transparent and trustworthy.
arXiv Detail & Related papers (2024-10-11T11:58:35Z)
- Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations [19.731611716111566]
We propose a Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations.
We introduce a predictive self-attention module to capture reliable context dynamics within modalities.
A hierarchical cross-modal attention module is designed to explore valuable element correlations among modalities.
A double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner.
arXiv Detail & Related papers (2024-07-06T04:36:48Z)
- UniCat: Crafting a Stronger Fusion Baseline for Multimodal Re-Identification [0.9831489366502301]
We show that prevailing late-fusion techniques often produce suboptimal latent representations when compared to methods that train modalities in isolation.
We argue that this effect is largely due to the inadvertent relaxation of the training objectives on individual modalities when using fusion.
Our findings also show that unimodal concatenation (UniCat) and other late-fusion ensembling of unimodal backbones exceed the current state-of-the-art performance across several multimodal ReID benchmarks.
arXiv Detail & Related papers (2023-10-28T20:30:59Z)
- Deep Equilibrium Multimodal Fusion [88.04713412107947]
Multimodal fusion integrates the complementary information present in multiple modalities and has gained much attention recently.
We propose a novel deep equilibrium (DEQ) method towards multimodal fusion via seeking a fixed point of the dynamic multimodal fusion process.
Experiments on BRCA, MM-IMDB, CMU-MOSI, SUN RGB-D, and VQA-v2 demonstrate the superiority of our DEQ fusion.
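As a hedged illustration of this fixed-point view of fusion (not the paper's DEQ solver), the toy sketch below iterates a fusion operator on the joint state until it stops changing; all names and dimensions are assumptions.

```python
# Toy illustration (assumption, not the paper's implementation): treat fusion
# as finding a fixed point z* = f(z*, x_a, x_b) by simple iteration.
import torch
import torch.nn as nn


class FixedPointFusion(nn.Module):
    def __init__(self, dim_a, dim_b, fused=32, max_iter=50, tol=1e-4):
        super().__init__()
        self.max_iter, self.tol = max_iter, tol
        self.f = nn.Sequential(nn.Linear(dim_a + dim_b + fused, fused), nn.Tanh())

    def forward(self, x_a, x_b):
        z = x_a.new_zeros(x_a.size(0), self.f[0].out_features)  # initial fused state
        for _ in range(self.max_iter):
            z_next = self.f(torch.cat([x_a, x_b, z], dim=-1))
            if (z_next - z).norm() < self.tol:  # stop once the state has converged
                return z_next
            z = z_next
        return z


# Usage example with random inputs
z = FixedPointFusion(dim_a=10, dim_b=20)(torch.randn(4, 10), torch.randn(4, 20))
```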
arXiv Detail & Related papers (2023-06-29T03:02:20Z)
- A Task-guided, Implicitly-searched and Meta-initialized Deep Model for Image Fusion [69.10255211811007]
We present a Task-guided, Implicitly-searched and Meta-initialized (TIM) deep model to address the image fusion problem in challenging real-world scenarios.
Specifically, we propose a constrained strategy to incorporate information from downstream tasks to guide the unsupervised learning process of image fusion.
Within this framework, we then design an implicit search scheme to automatically discover compact architectures for our fusion model with high efficiency.
arXiv Detail & Related papers (2023-05-25T08:54:08Z)
- DDFM: Denoising Diffusion Model for Multi-Modality Image Fusion [144.9653045465908]
We propose a novel fusion algorithm based on the denoising diffusion probabilistic model (DDPM).
Our approach yields promising fusion results in infrared-visible image fusion and medical image fusion.
arXiv Detail & Related papers (2023-03-13T04:06:42Z)
- MMLatch: Bottom-up Top-down Fusion for Multimodal Sentiment Analysis [84.7287684402508]
Current deep learning approaches for multimodal fusion rely on bottom-up fusion of high and mid-level latent modality representations.
Models of human perception highlight the importance of top-down fusion, where high-level representations affect the way sensory inputs are perceived.
We propose a neural architecture that captures top-down cross-modal interactions, using a feedback mechanism in the forward pass during network training.
arXiv Detail & Related papers (2022-01-24T17:48:04Z)
- Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z)
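For the attention-bottleneck idea in the last entry, the sketch below restricts cross-modal exchange to a small set of shared bottleneck tokens within a single layer; it is an assumed PyTorch illustration of the concept, not the paper's architecture or code.

```python
# Hedged sketch (names, shapes, and layer structure are assumptions): cross-modal
# exchange happens only through a few shared "bottleneck" tokens.
import torch
import torch.nn as nn

dim, n_bottleneck = 64, 4
attn_a = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
attn_b = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)


def bottleneck_layer(tok_a, tok_b, btl):
    # Each modality attends only over its own tokens plus the shared bottleneck.
    seq_a = torch.cat([tok_a, btl], dim=1)
    seq_b = torch.cat([tok_b, btl], dim=1)
    out_a, _ = attn_a(seq_a, seq_a, seq_a)
    out_b, _ = attn_b(seq_b, seq_b, seq_b)
    new_a, btl_a = out_a[:, :tok_a.size(1)], out_a[:, tok_a.size(1):]
    new_b, btl_b = out_b[:, :tok_b.size(1)], out_b[:, tok_b.size(1):]
    # Averaging the two bottleneck views lets information flow between streams.
    return new_a, new_b, (btl_a + btl_b) / 2


# Usage example with random token sequences for two modalities
tok_a, tok_b = torch.randn(2, 16, dim), torch.randn(2, 10, dim)
btl = torch.randn(2, n_bottleneck, dim)
tok_a, tok_b, btl = bottleneck_layer(tok_a, tok_b, btl)
```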
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.