Fugu-MT paper translation (abstract): Learning Multimodal Energy-Based Model with Multimodal Variational Auto-Encoder via MCMC Revision

Paper overview: Learning Multimodal Energy-Based Model with Multimodal Variational Auto-Encoder via MCMC Revision

arxiv url: http://arxiv.org/abs/2605.00644v1
Date: Fri, 01 May 2026 13:25:11 GMT
Status: Translation complete
Last updated in system: 2026-05-04 17:43:28.964768
Title: Learning Multimodal Energy-Based Model with Multimodal Variational Auto-Encoder via MCMC Revision
Title (reference translation): Learning a Multimodal Energy-Based Model Using a Multimodal Variational Auto-Encoder via MCMC Revision
Authors: Jiali Cui, Zhiqiang Lao, Heather Yu
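Abstract summary: We study the learning problem of the multimodal EBM, the shared latent generator, and the joint inference model. The generator is learned to produce coherent multimodal samples that serve as strong initial states for EBM sampling.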
Reference score (independently computed attention level): 9.644873133156656
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Energy-based models (EBMs) are a flexible class of deep generative models and are well-suited to capture complex dependencies in multimodal data. However, learning multimodal EBM by maximum likelihood requires Markov Chain Monte Carlo (MCMC) sampling in the joint data space, where noise-initialized Langevin dynamics often mixes poorly and fails to discover coherent inter-modal relationships. Multimodal VAEs have made progress in capturing such inter-modal dependencies by introducing a shared latent generator and a joint inference model. However, both the shared latent generator and joint inference model are parameterized as unimodal Gaussian (or Laplace), which severely limits their ability to approximate the complex structure induced by multimodal data. In this work, we study the learning problem of the multimodal EBM, shared latent generator, and joint inference model. We present a learning framework that effectively interweaves their MLE updates with corresponding MCMC refinements in both the data and latent spaces. Specifically, the generator is learned to produce coherent multimodal samples that serve as strong initial states for EBM sampling, while the inference model is learned to provide informative latent initializations for generator posterior sampling. Together, these two models serve as complementary models that enable effective EBM sampling and learning, yielding realistic and coherent multimodal EBM samples. Extensive experiments demonstrate superior performance for multimodal synthesis quality and coherence compared to various baselines. We conduct various analyses and ablation studies to validate the effectiveness and scalability of the proposed multimodal framework.
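Abstract (reference translation): Energy-based models (EBMs) are a flexible class of deep generative models and are well suited to capturing complex dependencies in multimodal data. However, learning a multimodal EBM by maximum likelihood requires Markov chain Monte Carlo (MCMC) sampling in the joint data space, where noise-initialized Langevin dynamics often mixes poorly and fails to discover coherent inter-modal relationships. Multimodal VAEs capture such inter-modal dependencies by introducing a shared latent generator and a joint inference model. However, both the shared latent generator and the joint inference model are parameterized as unimodal Gaussian (or Laplace) distributions, which severely limits their ability to approximate the complex structure induced by multimodal data. In this work, we study the learning problem of the multimodal EBM, the shared latent generator, and the joint inference model. We present a learning framework that interweaves their MLE updates with the corresponding MCMC refinements in both the data and latent spaces. Specifically, the generator learns to produce coherent multimodal samples that serve as strong initial states for EBM sampling, while the inference model learns to provide informative latent initializations for generator posterior sampling. Together, these two models act as complementary models that enable effective EBM sampling and learning, yielding realistic and coherent multimodal EBM samples. Extensive experiments show superior multimodal synthesis quality and coherence compared with various baselines. Various analyses and ablation studies are conducted to validate the effectiveness and scalability of the proposed multimodal framework.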
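
The abstract's point about initialization can be illustrated with a minimal sketch. It is not the authors' implementation: the two-well `energy`, the `warm_init` proposal (a hypothetical stand-in for a learned generator), and all step sizes and chain lengths are assumptions chosen for illustration. The two coordinates play the role of two modalities, and the low-energy wells sit where they agree. Short-run Langevin dynamics is then started either from broad noise or from the warm-start proposal.

```python
import math
import random


def energy(x, y):
    """Toy energy with two wells at (2, 2) and (-2, -2) (assumed, not from the paper)."""
    a = -0.5 * ((x - 2.0) ** 2 + (y - 2.0) ** 2)
    b = -0.5 * ((x + 2.0) ** 2 + (y + 2.0) ** 2)
    m = max(a, b)  # log-sum-exp shift for numerical stability
    return -(m + math.log(math.exp(a - m) + math.exp(b - m)))


def grad_energy(x, y):
    """Analytic gradient of the toy energy with respect to (x, y)."""
    d_pos = (x - 2.0) ** 2 + (y - 2.0) ** 2
    d_neg = (x + 2.0) ** 2 + (y + 2.0) ** 2
    # Responsibility of the (2, 2) well, written as a stable sigmoid via tanh.
    w_pos = 0.5 * (1.0 + math.tanh(0.25 * (d_neg - d_pos)))
    w_neg = 1.0 - w_pos
    return (w_pos * (x - 2.0) + w_neg * (x + 2.0),
            w_pos * (y - 2.0) + w_neg * (y + 2.0))


def langevin(state, n_steps, step_size, rng):
    """Unadjusted Langevin dynamics: x <- x - (s^2 / 2) * grad E(x) + s * noise."""
    x, y = state
    for _ in range(n_steps):
        gx, gy = grad_energy(x, y)
        x += -0.5 * step_size ** 2 * gx + step_size * rng.gauss(0.0, 1.0)
        y += -0.5 * step_size ** 2 * gy + step_size * rng.gauss(0.0, 1.0)
    return x, y


def noise_init(rng):
    """Broad noise initialization, as in noise-initialized sampling."""
    return rng.gauss(0.0, 5.0), rng.gauss(0.0, 5.0)


def warm_init(rng):
    """Hypothetical generator stand-in: proposes points near a coherent well."""
    c = rng.choice([-2.0, 2.0])
    return c + rng.gauss(0.0, 0.5), c + rng.gauss(0.0, 0.5)


def short_run_mean_energy(init, n_chains=200, n_steps=10, step_size=0.1, seed=0):
    """Mean energy of the final states of short Langevin chains started from `init`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_chains):
        x, y = langevin(init(rng), n_steps, step_size, rng)
        total += energy(x, y)
    return total / n_chains


if __name__ == "__main__":
    print("noise init:", short_run_mean_energy(noise_init))
    print("warm init: ", short_run_mean_energy(warm_init))
```

With only a few Langevin steps, chains started from broad noise stay far from the wells, while warm-started chains begin in low-energy regions and only need local refinement. The latent-space counterpart described in the abstract (an inference model initializing generator posterior sampling) would follow the same pattern and is not sketched here.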

Related papers