Unity by Diversity: Improved Representation Learning in Multimodal VAEs
- URL: http://arxiv.org/abs/2403.05300v4
- Date: Fri, 01 Nov 2024 10:19:01 GMT
- Title: Unity by Diversity: Improved Representation Learning in Multimodal VAEs
- Authors: Thomas M. Sutter, Yang Meng, Andrea Agostini, Daphné Chopard, Norbert Fortin, Julia E. Vogt, Bahbak Shahbaba, Stephan Mandt,
- Abstract summary: We show that a better latent representation can be obtained by replacing hard constraints with a soft constraint.
We show improved learned latent representations and imputation of missing data modalities compared to existing methods.
- Score: 24.691068754720106
- License:
- Abstract: Variational Autoencoders for multimodal data hold promise for many tasks in data analysis, such as representation learning, conditional generation, and imputation. Current architectures either share the encoder output, decoder input, or both across modalities to learn a shared representation. Such architectures impose hard constraints on the model. In this work, we show that a better latent representation can be obtained by replacing these hard constraints with a soft constraint. We propose a new mixture-of-experts prior, softly guiding each modality's latent representation towards a shared aggregate posterior. This approach results in a superior latent representation and allows each encoding to preserve information better from its uncompressed original features. In extensive experiments on multiple benchmark datasets and two challenging real-world datasets, we show improved learned latent representations and imputation of missing data modalities compared to existing methods.
Related papers
- IBMEA: Exploring Variational Information Bottleneck for Multi-modal Entity Alignment [17.570243718626994]
Multi-modal entity alignment (MMEA) aims to identify equivalent entities between multi-modal knowledge graphs (MMKGs)
We devise multi-modal variational encoders to generate modal-specific entity representations as probability distributions.
We also propose four modal-specific information bottleneck regularizers, limiting the misleading clues in refining modal-specific entity representations.
arXiv Detail & Related papers (2024-07-27T17:12:37Z) - ReFiNe: Recursive Field Networks for Cross-modal Multi-scene Representation [37.24514001359966]
We show how to encode multiple shapes represented as continuous neural fields with a higher degree of precision than previously possible.
We demonstrate state-of-the-art multi-scene reconstruction and compression results with a single network per dataset.
arXiv Detail & Related papers (2024-06-06T17:55:34Z) - NativE: Multi-modal Knowledge Graph Completion in the Wild [51.80447197290866]
We propose a comprehensive framework NativE to achieve MMKGC in the wild.
NativE proposes a relation-guided dual adaptive fusion module that enables adaptive fusion for any modalities.
We construct a new benchmark called WildKGC with five datasets to evaluate our method.
arXiv Detail & Related papers (2024-03-28T03:04:00Z) - Preserving Modality Structure Improves Multi-Modal Learning [64.10085674834252]
Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings without relying on human annotations.
These methods often struggle to generalize well on out-of-domain data as they ignore the semantic structure present in modality-specific embeddings.
We propose a novel Semantic-Structure-Preserving Consistency approach to improve generalizability by preserving the modality-specific relationships in the joint embedding space.
arXiv Detail & Related papers (2023-08-24T20:46:48Z) - Semi-supervised Multimodal Representation Learning through a Global Workspace [2.8948274245812335]
"Global Workspace" is a shared representation for two input modalities.
This architecture is amenable to self-supervised training via cycle-consistency.
We show that such an architecture can be trained to align and translate between two modalities with very little need for matched data.
arXiv Detail & Related papers (2023-06-27T12:41:36Z) - Learning Multimodal Data Augmentation in Feature Space [65.54623807628536]
LeMDA is an easy-to-use method that automatically learns to jointly augment multimodal data in feature space.
We show that LeMDA can profoundly improve the performance of multimodal deep learning architectures.
arXiv Detail & Related papers (2022-12-29T20:39:36Z) - Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE)
M3AE learns a unified encoder for both vision and language data via masked token prediction.
We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
arXiv Detail & Related papers (2022-05-27T19:09:42Z) - Unsupervised Multimodal Language Representations using Convolutional
Autoencoders [5.464072883537924]
We propose extracting unsupervised Multimodal Language representations that are universal and can be applied to different tasks.
We map the word-level aligned multimodal sequences to 2-D matrices and then use Convolutional Autoencoders to learn embeddings by combining multiple datasets.
It is also shown that our method is extremely lightweight and can be easily generalized to other tasks and unseen data with small performance drop and almost the same number of parameters.
arXiv Detail & Related papers (2021-10-06T18:28:07Z) - Deep Partial Multi-View Learning [94.39367390062831]
We propose a novel framework termed Cross Partial Multi-View Networks (CPM-Nets)
We fifirst provide a formal defifinition of completeness and versatility for multi-view representation.
We then theoretically prove the versatility of the learned latent representations.
arXiv Detail & Related papers (2020-11-12T02:29:29Z) - Relation-Guided Representation Learning [53.60351496449232]
We propose a new representation learning method that explicitly models and leverages sample relations.
Our framework well preserves the relations between samples.
By seeking to embed samples into subspace, we show that our method can address the large-scale and out-of-sample problem.
arXiv Detail & Related papers (2020-07-11T10:57:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.