Related papers: Hellinger Multimodal Variational Autoencoders

URL: http://arxiv.org/abs/2601.06572v1
Date: Sat, 10 Jan 2026 13:39:36 GMT
Title: Hellinger Multimodal Variational Autoencoders
Authors: Huyen Khanh Vo, Isabel Valera
Abstract summary: Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. We propose HELVAE, a multimodal VAE that avoids sub-sampling. HELVAE empirically achieves better trade-offs between generative coherence and quality, outperforming state-of-the-art multimodal VAE models.
Score: 7.778719963322215
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. Predominant methods aggregate unimodal inference distributions using either a product of experts (PoE), a mixture of experts (MoE), or their combinations to approximate the joint posterior. In this work, we revisit multimodal inference through the lens of probabilistic opinion pooling, an optimization-based approach. We start from Hölder pooling with $\alpha = 0.5$, which corresponds to the unique symmetric member of the $\alpha$-divergence family, and derive a moment-matching approximation, termed Hellinger. We then leverage such an approximation to propose HELVAE, a multimodal VAE that avoids sub-sampling, yielding an efficient yet effective model that: (i) learns more expressive latent representations as additional modalities are observed; and (ii) empirically achieves better trade-offs between generative coherence and quality, outperforming state-of-the-art multimodal VAE models.
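
For readers unfamiliar with the two baseline aggregation rules named in the abstract, the following is a minimal sketch of how a product of experts and a mixture of experts combine one-dimensional Gaussian unimodal posteriors. It is not the paper's Hellinger approximation, whose closed form is not given on this page. The function names, the uniform mixture weights, the omission of a prior expert, and the toy inputs are all assumptions made for illustration.

```python
# Illustrative sketch only: standard Gaussian PoE and moment-matched MoE.
# This is NOT the HELVAE / Hellinger pooling rule from the paper.
# Function names and the toy inputs below are hypothetical.


def product_of_experts(means, variances):
    """Gaussian PoE: precisions add and the mean is precision-weighted."""
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    mean = sum(p * m for p, m in zip(precisions, means)) / total_precision
    return mean, 1.0 / total_precision


def mixture_of_experts_moments(means, variances, weights=None):
    """Mean and variance of a Gaussian mixture (single-Gaussian moment match)."""
    n = len(means)
    if weights is None:
        # Assumption: uniform weights over the experts.
        weights = [1.0 / n] * n
    mean = sum(w * m for w, m in zip(weights, means))
    second_moment = sum(
        w * (v + m * m) for w, m, v in zip(weights, means, variances)
    )
    return mean, second_moment - mean * mean


# Two unimodal posteriors that disagree about the latent mean.
means = [0.0, 2.0]
variances = [1.0, 1.0]

poe_mean, poe_var = product_of_experts(means, variances)
moe_mean, moe_var = mixture_of_experts_moments(means, variances)

print("PoE:", poe_mean, poe_var)
print("MoE:", moe_mean, moe_var)
```

In this toy case both rules agree on the mean, but the product sharpens the variance (0.5) while the moment-matched mixture widens it (2.0), which illustrates the kind of gap between the two pooling families that alternative aggregation rules aim to address.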

Related papers

HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation [72.69742127579508]
Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models). In this work, we propose HBridge, an asymmetric H-shaped architecture that enables heterogeneous experts to optimally leverage pretrained priors. Extensive experiments across multiple benchmarks demonstrate the effectiveness and superior performance of HBridge.
arXiv Detail & Related papers (2025-11-25T17:23:38Z)
Amplifying Prominent Representations in Multimodal Learning via Variational Dirichlet Process [55.91649771370862]
Dirichlet process (DP) mixture model is a powerful non-parametric method that can amplify the most prominent features. We propose a new DP-driven multimodal learning framework that automatically achieves an optimal balance between prominent intra-modal representation learning and cross-modal alignment.
arXiv Detail & Related papers (2025-10-23T16:53:24Z)
Importance Sampling for Multi-Negative Multimodal Direct Preference Optimization [68.64764778089229]
We propose MISP-DPO, the first framework to incorporate multiple, semantically diverse negative images in multimodal DPO. Our method embeds prompts and candidate images in CLIP space and applies a sparse autoencoder to uncover semantic deviations into interpretable factors. Experiments across five benchmarks demonstrate that MISP-DPO consistently improves multimodal alignment over prior methods.
arXiv Detail & Related papers (2025-09-30T03:24:09Z)
Aggregation of Dependent Expert Distributions in Multimodal Variational Autoencoders [32.87811217394167]
Multimodal learning with variational autoencoders (VAEs) requires estimating joint distributions to evaluate the evidence lower bound (ELBO). This research introduces a novel methodology for aggregating single-modality distributions by exploiting the principle of consensus of dependent experts (CoDE). The resulting CoDE-VAE model demonstrates better performance in terms of balancing the trade-off between generative coherence and generative quality, as well as generating more precise log-likelihood estimations.
arXiv Detail & Related papers (2025-05-02T09:24:10Z)
Multimodal Variational Autoencoder: a Barycentric View [3.413330490927693]
We provide an alternative generic and theoretical formulation of multimodal VAE through the lens of barycenter. In particular, we explore the Wasserstein barycenter defined by the 2-Wasserstein distance, which better preserves the geometry of unimodal distributions. Empirical studies on three multimodal benchmarks demonstrated the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-12-29T15:02:50Z)
Score-Based Multimodal Autoencoder [0.9208007322096533]
Multimodal Variational Autoencoders (VAEs) facilitate the construction of a tractable posterior within the latent space given multiple modalities. Previous studies have shown that as the number of modalities increases, the generative quality of each modality declines. In this study, we explore an alternative approach to enhance the generative performance of multimodal VAEs by jointly modeling the latent space of independently trained unimodal VAEs.
arXiv Detail & Related papers (2023-05-25T04:43:47Z)
Generalizing Multimodal Variational Methods to Sets [35.69942798534849]
This paper presents a novel variational method on sets called the Set Multimodal VAE (SMVAE) for learning a multimodal latent space. By modeling the joint-modality posterior distribution directly, the proposed SMVAE learns to exchange information between multiple modalities and compensate for the drawbacks caused by factorization.
arXiv Detail & Related papers (2022-12-19T23:50:19Z)
A Unified Framework for Multi-distribution Density Ratio Estimation [101.67420298343512]
Binary density ratio estimation (DRE) provides the foundation for many state-of-the-art machine learning algorithms. We develop a general framework from the perspective of Bregman divergence minimization. We show that our framework leads to methods that strictly generalize their counterparts in binary DRE.
arXiv Detail & Related papers (2021-12-07T01:23:20Z)
Efficient Model-Based Multi-Agent Mean-Field Reinforcement Learning [89.31889875864599]
We propose an efficient model-based reinforcement learning algorithm for learning in multi-agent systems. Our main theoretical contributions are the first general regret bounds for model-based reinforcement learning for MFC. We provide a practical parametrization of the core optimization problem.
arXiv Detail & Related papers (2021-07-08T18:01:02Z)
Permutation Invariant Policy Optimization for Mean-Field Multi-Agent Reinforcement Learning: A Principled Approach [128.62787284435007]
We propose the mean-field proximal policy optimization (MF-PPO) algorithm, at the core of which is a permutation-invariant actor-critic neural architecture. We prove that MF-PPO attains the globally optimal policy at a sublinear rate of convergence. In particular, we show that the inductive bias introduced by the permutation-invariant neural architecture enables MF-PPO to outperform existing competitors.
arXiv Detail & Related papers (2021-05-18T04:35:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.