Neuro-Inspired Hierarchical Multimodal Learning
- URL: http://arxiv.org/abs/2309.15877v3
- Date: Tue, 23 Apr 2024 17:57:33 GMT
- Title: Neuro-Inspired Hierarchical Multimodal Learning
- Authors: Xiongye Xiao, Gengshuo Liu, Gaurav Gupta, Defu Cao, Shixuan Li, Yaxing Li, Tianqing Fang, Mingxi Cheng, Paul Bogdan,
- Abstract summary: We develop the Information-Theoretic Hierarchical Perception (ITHP) model, which utilizes the concept of information bottleneck.
Our model designates the prime modality as input, while the remaining modalities act as detectors in the information pathway.
Our model consistently distills crucial information in multimodal learning scenarios, outperforming state-of-the-art benchmarks.
- Score: 16.8379583872582
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Integrating and processing information from various sources or modalities are critical for obtaining a comprehensive and accurate perception of the real world. Drawing inspiration from neuroscience, we develop the Information-Theoretic Hierarchical Perception (ITHP) model, which utilizes the concept of information bottleneck. Distinct from most traditional fusion models that aim to incorporate all modalities as input, our model designates the prime modality as input, while the remaining modalities act as detectors in the information pathway. Our proposed perception model focuses on constructing an effective and compact information flow by achieving a balance between the minimization of mutual information between the latent state and the input modal state, and the maximization of mutual information between the latent states and the remaining modal states. This approach leads to compact latent state representations that retain relevant information while minimizing redundancy, thereby substantially enhancing the performance of downstream tasks. Experimental evaluations on both the MUStARD and CMU-MOSI datasets demonstrate that our model consistently distills crucial information in multimodal learning scenarios, outperforming state-of-the-art benchmarks.
Related papers
- An Information Criterion for Controlled Disentanglement of Multimodal Data [39.601584166020274]
Multimodal representation learning seeks to relate and decompose information inherent in multiple modalities.
Disentangled Self-Supervised Learning (DisentangledSSL) is a novel self-supervised approach for learning disentangled representations.
arXiv Detail & Related papers (2024-10-31T14:57:31Z) - On Discriminative Probabilistic Modeling for Self-Supervised Representation Learning [85.75164588939185]
We study the discriminative probabilistic modeling problem on a continuous domain for (multimodal) self-supervised representation learning.
We conduct generalization error analysis to reveal the limitation of current InfoNCE-based contrastive loss for self-supervised representation learning.
arXiv Detail & Related papers (2024-10-11T18:02:46Z) - Neuro-Inspired Information-Theoretic Hierarchical Perception for Multimodal Learning [16.8379583872582]
We develop the Information-Theoretic Hierarchical Perception (ITHP) model, which utilizes the concept of information bottleneck.
We show that ITHP consistently distills crucial information in multimodal learning scenarios, outperforming state-of-the-art benchmarks.
arXiv Detail & Related papers (2024-04-15T01:34:44Z) - NativE: Multi-modal Knowledge Graph Completion in the Wild [51.80447197290866]
We propose a comprehensive framework NativE to achieve MMKGC in the wild.
NativE proposes a relation-guided dual adaptive fusion module that enables adaptive fusion for any modalities.
We construct a new benchmark called WildKGC with five datasets to evaluate our method.
arXiv Detail & Related papers (2024-03-28T03:04:00Z) - Balanced Multi-modal Federated Learning via Cross-Modal Infiltration [19.513099949266156]
Federated learning (FL) underpins advancements in privacy-preserving distributed computing.
We propose a novel Cross-Modal Infiltration Federated Learning (FedCMI) framework.
arXiv Detail & Related papers (2023-12-31T05:50:15Z) - Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Correlation Information Bottleneck: Towards Adapting Pretrained
Multimodal Models for Robust Visual Question Answering [63.87200781247364]
Correlation Information Bottleneck (CIB) seeks a tradeoff between compression and redundancy in representations.
We derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations.
arXiv Detail & Related papers (2022-09-14T22:04:10Z) - DIME: Fine-grained Interpretations of Multimodal Models via Disentangled
Local Explanations [119.1953397679783]
We focus on advancing the state-of-the-art in interpreting multimodal models.
Our proposed approach, DIME, enables accurate and fine-grained analysis of multimodal models.
arXiv Detail & Related papers (2022-03-03T20:52:47Z) - Discriminative Multimodal Learning via Conditional Priors in Generative
Models [21.166519800652047]
This research studies the realistic scenario in which all modalities and class labels are available for model training.
We show, in this scenario, that the variational lower bound limits mutual information between joint representations and missing modalities.
arXiv Detail & Related papers (2021-10-09T17:22:24Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal
Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
Model takes two bimodal pairs as input due to known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z) - MHVAE: a Human-Inspired Deep Hierarchical Generative Model for
Multimodal Representation Learning [8.70928211339504]
We contribute the Multimodal Hierarchical Vari Auto-encoder (MHVAE), a hierarchical multimodal generative model for representation learning.
Inspired by human cognitive models, the MHVAE is able to learn modality-specific distributions and a joint-modality distribution, responsible for cross-modality inference.
Our model performs on par with other state-of-the-art generative models regarding joint-modality reconstruction from arbitrary input modalities and cross-modality inference.
arXiv Detail & Related papers (2020-06-04T16:24:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.