Hellinger Multimodal Variational Autoencoders
- URL: http://arxiv.org/abs/2601.06572v1
- Date: Sat, 10 Jan 2026 13:39:36 GMT
- Title: Hellinger Multimodal Variational Autoencoders
- Authors: Huyen Khanh Vo, Isabel Valera
- Abstract summary: Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. We propose HELVAE, a multimodal VAE that avoids sub-sampling. HELVAE empirically achieves better trade-offs between generative coherence and quality, outperforming state-of-the-art multimodal VAE models.
- Score: 7.778719963322215
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. Predominant methods aggregate unimodal inference distributions using either a product of experts (PoE), a mixture of experts (MoE), or their combinations to approximate the joint posterior. In this work, we revisit multimodal inference through the lens of probabilistic opinion pooling, an optimization-based approach. We start from Hölder pooling with $\alpha = 0.5$, which corresponds to the unique symmetric member of the $\alpha$-divergence family, and derive a moment-matching approximation, termed Hellinger. We then leverage such an approximation to propose HELVAE, a multimodal VAE that avoids sub-sampling, yielding an efficient yet effective model that: (i) learns more expressive latent representations as additional modalities are observed; and (ii) empirically achieves better trade-offs between generative coherence and quality, outperforming state-of-the-art multimodal VAE models.
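The three aggregation rules named in the abstract can be compared on a toy problem. The sketch below is not the HELVAE implementation: it assumes one-dimensional Gaussian experts, omits any prior expert, and assumes the common power-mean form of Hölder pooling, q(z) ∝ (Σ_i w_i p_i(z)^α)^(1/α). The moment matching is done numerically on a grid rather than with the closed-form approximation derived in the paper, and all function names are hypothetical.

```python
import numpy as np


def gaussian_pdf(z, mu, var):
    """Density of a 1-D Gaussian N(mu, var) evaluated at z."""
    return np.exp(-0.5 * (z - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def poe_moments(mus, variances):
    """Product of Gaussian experts: precisions add, means are precision-weighted."""
    mus = np.asarray(mus, dtype=float)
    precisions = 1.0 / np.asarray(variances, dtype=float)
    var = 1.0 / precisions.sum()
    return var * (precisions * mus).sum(), var


def moe_moments(mus, variances, weights):
    """Mean and variance of a weighted mixture of Gaussian experts."""
    mus = np.asarray(mus, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mu = (weights * mus).sum()
    var = (weights * (variances + mus ** 2)).sum() - mu ** 2
    return mu, var


def holder_moments(mus, variances, weights, grid, alpha=0.5):
    """Moment-match a Gaussian to the (assumed) power-mean pooled density.

    The pooled density is evaluated on a uniform grid, normalised with a
    Riemann sum, and summarised by its mean and variance.
    """
    powered = sum(
        w * gaussian_pdf(grid, m, v) ** alpha
        for w, m, v in zip(weights, mus, variances)
    )
    pooled = powered ** (1.0 / alpha)
    dz = grid[1] - grid[0]
    pooled = pooled / (pooled.sum() * dz)
    mu = (grid * pooled).sum() * dz
    var = ((grid - mu) ** 2 * pooled).sum() * dz
    return mu, var


# Two unit-variance experts centred at 0 and 2, equally weighted.
mus, variances, weights = [0.0, 2.0], [1.0, 1.0], [0.5, 0.5]
grid = np.linspace(-10.0, 12.0, 20001)

poe_mu, poe_var = poe_moments(mus, variances)
moe_mu, moe_var = moe_moments(mus, variances, weights)
hel_mu, hel_var = holder_moments(mus, variances, weights, grid, alpha=0.5)

print(f"PoE:            mean={poe_mu:.3f}, var={poe_var:.3f}")
print(f"MoE:            mean={moe_mu:.3f}, var={moe_var:.3f}")
print(f"Hölder (α=0.5): mean={hel_mu:.3f}, var={hel_var:.3f}")
```

In this toy setting all three rules agree on the mean, while the variance of the α = 0.5 pool falls between the sharper PoE and the broader MoE. That is consistent with reading Hölder pooling as interpolating between the two, though the example says nothing about how HELVAE behaves in practice.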
Related papers
- HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation [72.69742127579508]
Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models). In this work, we propose HBridge, an asymmetric H-shaped architecture that enables heterogeneous experts to optimally leverage pretrained priors. Extensive experiments across multiple benchmarks demonstrate the effectiveness and superior performance of HBridge.
arXiv Detail & Related papers (2025-11-25T17:23:38Z)
- Amplifying Prominent Representations in Multimodal Learning via Variational Dirichlet Process [55.91649771370862]
The Dirichlet process (DP) mixture model is a powerful non-parametric method that can amplify the most prominent features. We propose a new DP-driven multimodal learning framework that automatically achieves an optimal balance between prominent intra-modal representation learning and cross-modal alignment.
arXiv Detail & Related papers (2025-10-23T16:53:24Z)
- Importance Sampling for Multi-Negative Multimodal Direct Preference Optimization [68.64764778089229]
We propose MISP-DPO, the first framework to incorporate multiple, semantically diverse negative images in multimodal DPO. Our method embeds prompts and candidate images in CLIP space and applies a sparse autoencoder to uncover semantic deviations into interpretable factors. Experiments across five benchmarks demonstrate that MISP-DPO consistently improves multimodal alignment over prior methods.
arXiv Detail & Related papers (2025-09-30T03:24:09Z)
- Aggregation of Dependent Expert Distributions in Multimodal Variational Autoencoders [32.87811217394167]
Multimodal learning with variational autoencoders (VAEs) requires estimating joint distributions to evaluate the evidence lower bound (ELBO). This research introduces a novel methodology for aggregating single-modality distributions by exploiting the principle of consensus of dependent experts (CoDE). The resulting CoDE-VAE model demonstrates better performance in terms of balancing the trade-off between generative coherence and generative quality, as well as generating more precise log-likelihood estimations.
arXiv Detail & Related papers (2025-05-02T09:24:10Z)
- Multimodal Variational Autoencoder: a Barycentric View [3.413330490927693]
We provide an alternative generic and theoretical formulation of multimodal VAE through the lens of barycenter. In particular, we explore the Wasserstein barycenter defined by the 2-Wasserstein distance, which better preserves the geometry of unimodal distributions. Empirical studies on three multimodal benchmarks demonstrated the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-12-29T15:02:50Z)
- Score-Based Multimodal Autoencoder [0.9208007322096533]
Multimodal Variational Autoencoders (VAEs) facilitate the construction of a tractable posterior within the latent space given multiple modalities. Previous studies have shown that as the number of modalities increases, the generative quality of each modality declines. In this study, we explore an alternative approach to enhance the generative performance of multimodal VAEs by jointly modeling the latent space of independently trained unimodal VAEs.
arXiv Detail & Related papers (2023-05-25T04:43:47Z)
- Generalizing Multimodal Variational Methods to Sets [35.69942798534849]
This paper presents a novel variational method on sets called the Set Multimodal VAE (SMVAE) for learning a multimodal latent space.
By modeling the joint-modality posterior distribution directly, the proposed SMVAE learns to exchange information between multiple modalities and compensate for the drawbacks caused by factorization.
arXiv Detail & Related papers (2022-12-19T23:50:19Z)
- A Unified Framework for Multi-distribution Density Ratio Estimation [101.67420298343512]
Binary density ratio estimation (DRE) provides the foundation for many state-of-the-art machine learning algorithms.
We develop a general framework from the perspective of Bregman divergence minimization.
We show that our framework leads to methods that strictly generalize their counterparts in binary DRE.
arXiv Detail & Related papers (2021-12-07T01:23:20Z)
- Efficient Model-Based Multi-Agent Mean-Field Reinforcement Learning [89.31889875864599]
We propose an efficient model-based reinforcement learning algorithm for learning in multi-agent systems.
Our main theoretical contributions are the first general regret bounds for model-based reinforcement learning for MFC.
We provide a practical parametrization of the core optimization problem.
arXiv Detail & Related papers (2021-07-08T18:01:02Z)
- Permutation Invariant Policy Optimization for Mean-Field Multi-Agent Reinforcement Learning: A Principled Approach [128.62787284435007]
We propose the mean-field proximal policy optimization (MF-PPO) algorithm, at the core of which is a permutation-invariant actor-critic neural architecture.
We prove that MF-PPO attains the globally optimal policy at a sublinear rate of convergence.
In particular, we show that the inductive bias introduced by the permutation-invariant neural architecture enables MF-PPO to outperform existing competitors.
arXiv Detail & Related papers (2021-05-18T04:35:41Z)