MultiModN- Multimodal, Multi-Task, Interpretable Modular Networks
- URL: http://arxiv.org/abs/2309.14118v2
- Date: Mon, 6 Nov 2023 14:55:52 GMT
- Title: MultiModN- Multimodal, Multi-Task, Interpretable Modular Networks
- Authors: Vinitra Swamy, Malika Satayeva, Jibril Frej, Thierry Bossy, Thijs
Vogels, Martin Jaggi, Tanja K\"aser, Mary-Anne Hartley
- Abstract summary: We present MultiModN, a network that fuses latent representations in a sequence of any number, combination, or type of modality.
We show that MultiModN's sequential MM fusion does not compromise performance compared with a baseline of parallel fusion.
- Score: 31.59812777504438
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Predicting multiple real-world tasks in a single model often requires a
particularly diverse feature space. Multimodal (MM) models aim to extract the
synergistic predictive potential of multiple data types to create a shared
feature space with aligned semantic meaning across inputs of drastically
varying sizes (i.e. images, text, sound). Most current MM architectures fuse
these representations in parallel, which not only limits their interpretability
but also creates a dependency on modality availability. We present MultiModN, a
multimodal, modular network that fuses latent representations in a sequence of
any number, combination, or type of modality while providing granular real-time
predictive feedback on any number or combination of predictive tasks.
MultiModN's composable pipeline is interpretable-by-design, as well as innately
multi-task and robust to the fundamental issue of biased missingness. We
perform four experiments on several benchmark MM datasets across 10 real-world
tasks (predicting medical diagnoses, academic performance, and weather), and
show that MultiModN's sequential MM fusion does not compromise performance
compared with a baseline of parallel fusion. By simulating the challenging bias
of missing not-at-random (MNAR), this work shows that, contrary to MultiModN,
parallel fusion baselines erroneously learn MNAR and suffer catastrophic
failure when faced with different patterns of MNAR at inference. To the best of
our knowledge, this is the first inherently MNAR-resistant approach to MM
modeling. In conclusion, MultiModN provides granular insights, robustness, and
flexibility without compromising performance.
Related papers
- MIND: Modality-Informed Knowledge Distillation Framework for Multimodal Clinical Prediction Tasks [50.98856172702256]
We propose the Modality-INformed knowledge Distillation (MIND) framework, a multimodal model compression approach.
MIND transfers knowledge from ensembles of pre-trained deep neural networks of varying sizes into a smaller multimodal student.
We evaluate MIND on binary and multilabel clinical prediction tasks using time series data and chest X-ray images.
arXiv Detail & Related papers (2025-02-03T08:50:00Z) - MIFNet: Learning Modality-Invariant Features for Generalizable Multimodal Image Matching [54.740256498985026]
Keypoint detection and description methods often struggle with multimodal data.
We propose a modality-invariant feature learning network (MIFNet) to compute modality-invariant features for keypoint descriptions in multimodal image matching.
arXiv Detail & Related papers (2025-01-20T06:56:30Z) - MambaPro: Multi-Modal Object Re-Identification with Mamba Aggregation and Synergistic Prompt [60.10555128510744]
Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary image information from different modalities.
Recently, large-scale pre-trained models like CLIP have demonstrated impressive performance in traditional single-modal object ReID tasks.
We introduce a novel framework called MambaPro for multi-modal object ReID.
arXiv Detail & Related papers (2024-12-14T06:33:53Z) - U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantics.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z) - Toward Robust Multimodal Learning using Multimodal Foundational Models [30.755818450393637]
We propose TRML, Toward Robust Multimodal Learning using Multimodal Foundational Models.
TRML employs generated virtual modalities to replace missing modalities.
We also design a semantic matching learning module to align semantic spaces generated and missing modalities.
arXiv Detail & Related papers (2024-01-20T04:46:43Z) - MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal
Contributions in Vision and Language Models & Tasks [20.902155496422417]
Vision and language models exploit unrobust indicators in individual modalities instead of focusing on relevant information in each modality.
We propose MM-SHAP, a performance-agnostic multimodality score based on Shapley values.
arXiv Detail & Related papers (2022-12-15T21:41:06Z) - OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist
Models [72.8156832931841]
Generalist models are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model.
We release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction.
arXiv Detail & Related papers (2022-12-08T17:07:09Z) - Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment
Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z) - Variational Dynamic Mixtures [18.730501689781214]
We develop variational dynamic mixtures (VDM) to infer sequential latent variables.
In an empirical study, we show that VDM outperforms competing approaches on highly multi-modal datasets.
arXiv Detail & Related papers (2020-10-20T16:10:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.