Fugu-MT paper translation (abstract): Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

Paper overview: Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

arxiv url: http://arxiv.org/abs/2508.05954v1
Date: Fri, 08 Aug 2025 02:38:47 GMT
Status: Translation complete
Last updated in system: 2025-08-11 20:39:06.052606
Title: Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents
Title (reference translation): Bifrost-1: Bridging Multimodal LLMs and Diffusion Models Using Patch-level CLIP Latents
Authors: Han Lin, Jaemin Cho, Amir Zadeh, Chuan Li, Mohit Bansal
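Abstract summary: Bifrost-1 is a unified framework that bridges pretrained multimodal LLMs (MLLMs) and diffusion models. By seamlessly integrating pretrained MLLMs and diffusion models with patch-level CLIP latents, it enables high-fidelity, controllable image generation. Experiments show that Bifrost-1 achieves comparable or better performance than previous methods in terms of visual fidelity and multimodal understanding.
Reference score (attention score computed independently by this site): 55.82787697101274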
License: http://creativecommons.org/licenses/by/4.0/
Abstract: There is growing interest in integrating high-fidelity visual synthesis capabilities into large language models (LLMs) without compromising their strong reasoning capabilities. Existing methods that directly train LLMs or bridge LLMs and diffusion models usually suffer from costly training since the backbone LLMs have not seen image representations during pretraining. We present Bifrost-1, a unified framework that bridges pretrained multimodal LLMs (MLLMs) and diffusion models using patch-level CLIP image embeddings as latent variables, which are natively aligned with the MLLM's CLIP visual encoder. These patch-level image embeddings are integrated into the diffusion model with a lightweight adaptation of its ControlNet. To retain the original multimodal reasoning capabilities of MLLMs, we equip the MLLM with a visual generation branch initialized from the original MLLM parameters when predicting the patch-level image embeddings. By seamlessly integrating pretrained MLLMs and diffusion models with patch-level CLIP latents, our framework enables high-fidelity controllable image generation with significant training efficiency. Our experiments demonstrate that Bifrost-1 achieves comparable or better performance than previous methods in terms of visual fidelity and multimodal understanding, with substantially lower compute during training. We also provide comprehensive ablation studies showing the effectiveness of our design choices.
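Abstract (reference translation): Interest is growing in integrating high-fidelity visual synthesis into large language models (LLMs) without compromising their strong reasoning capabilities. Existing methods that train LLMs directly, or that bridge LLMs and diffusion models, usually incur costly training because the backbone LLM has not seen image representations during pretraining. This paper proposes Bifrost-1, a unified framework that bridges pretrained multimodal LLMs (MLLMs) and diffusion models, using as latent variables patch-level CLIP image embeddings that are natively aligned with the MLLM's CLIP visual encoder. These patch-level image embeddings are integrated into the diffusion model through a lightweight adaptation of its ControlNet. To retain the MLLM's original multimodal reasoning capabilities, the MLLM is equipped with a visual generation branch, initialized from the original MLLM parameters, for predicting the patch-level image embeddings. By seamlessly integrating pretrained MLLMs and diffusion models with patch-level CLIP latents, the framework enables high-fidelity, controllable image generation with significant training efficiency. Experiments show that Bifrost-1 achieves comparable or better performance than previous methods in terms of visual fidelity and multimodal understanding, with substantially lower training compute. Comprehensive ablation studies showing the effectiveness of the design choices are also provided.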
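Two ideas in the abstract can be pictured with a toy sketch: an image is represented as a grid of patch-level embeddings, and the generation branch starts as a copy of the original MLLM parameters so that training the branch leaves the originals untouched. This is not the authors' implementation. The function names, the 224-pixel image size, the 14-pixel patch size, and the parameter dictionary are all assumptions chosen for illustration.

```python
import copy


def num_patches(image_size: int, patch_size: int) -> int:
    """Number of patch-level embeddings for a square image (assumed ViT-style patching)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def init_generation_branch(mllm_params: dict) -> dict:
    """Return an independent copy of the parameters to serve as the trainable branch.

    Hypothetical helper: the original parameters are left as-is, so the
    understanding path keeps its behaviour while the copy is updated.
    """
    return copy.deepcopy(mllm_params)


# Toy stand-in for MLLM weights (hypothetical names and values).
mllm_params = {"attn.weight": [0.1, -0.2], "mlp.weight": [0.3, 0.4]}
branch_params = init_generation_branch(mllm_params)

# Simulated training update applied to the branch only.
branch_params["attn.weight"][0] += 0.05

print(num_patches(224, 14))           # 256 patch embeddings under these assumed sizes
print(mllm_params["attn.weight"][0])  # 0.1, the original value is unchanged
```

The deep copy is the point of the sketch: a shallow copy would share the underlying lists, and an update to the branch would silently alter the original parameters.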

Related papers