Fugu-MT Paper Translation (Abstract): UniFusion: Vision-Language Model as Unified Encoder in Image Generation

Paper overview: UniFusion: Vision-Language Model as Unified Encoder in Image Generation

arxiv url: http://arxiv.org/abs/2510.12789v1
Date: Tue, 14 Oct 2025 17:57:56 GMT
Status: Translation complete
Last updated in system: 2025-10-15 19:02:32.439611
Title: UniFusion: Vision-Language Model as Unified Encoder in Image Generation
Title (reference translation): UniFusion: A Vision-Language Model as a Unified Encoder in Image Generation
Authors: Kevin Li, Manuel Brack, Sudeep Katakol, Hareesh Ravi, Ajinkya Kale,
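Abstract summary: We present UniFusion, a diffusion-based generative model conditioned on a frozen large vision-language model (VLM) that serves as a unified multimodal encoder. We show that Layerwise Attention Pooling (LAP) outperforms other shallow fusion architectures on text-image alignment for generation and on faithful transfer of visual information from the VLM to the diffusion model, which is key for editing. We propose VLM-Enabled Rewriting Injection with Flexible Inference, which conditions a diffusion transformer (DiT) only on the text tokens generated by the VLM.
Reference score (independently computed attention score): 12.811191961286852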
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Although recent advances in visual generation have been remarkable, most existing architectures still depend on distinct encoders for images and text. This separation constrains diffusion models' ability to perform cross-modal reasoning and knowledge transfer. Prior attempts to bridge this gap often use the last-layer information from a VLM, employ multiple visual encoders, or train large unified models jointly for text and image generation, which demands substantial computational resources and large-scale data, limiting their accessibility. We present UniFusion, a diffusion-based generative model conditioned on a frozen large vision-language model (VLM) that serves as a unified multimodal encoder. At the core of UniFusion is the Layerwise Attention Pooling (LAP) mechanism, which extracts both high-level semantics and low-level details from the text and visual tokens of a frozen VLM to condition a diffusion generative model. We demonstrate that LAP outperforms other shallow fusion architectures on text-image alignment for generation and on faithful transfer of visual information from the VLM to the diffusion model, which is key for editing. We propose VLM-Enabled Rewriting Injection with Flexible Inference (VERIFI), which conditions a diffusion transformer (DiT) only on the text tokens generated by the VLM during in-model prompt rewriting. VERIFI combines the alignment of the conditioning distribution with the VLM's reasoning capabilities for increased capabilities and flexibility at inference. In addition, finetuning on an editing task not only improves text-image alignment for generation, indicative of cross-modality knowledge transfer, but also exhibits tremendous generalization capabilities. Our model, when trained on single-image editing, generalizes zero-shot to multiple image references, further motivating the unified encoder design of UniFusion.
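Abstract (reference translation): Recent advances in visual generation have been remarkable, but most existing architectures still rely on separate encoders for images and text. This separation constrains the ability of diffusion models to perform cross-modal reasoning and knowledge transfer. Previous attempts to bridge this gap have often used last-layer information from a VLM, employed multiple visual encoders, or jointly trained large unified models for text and image generation. At the core of UniFusion is the Layerwise Attention Pooling (LAP) mechanism, which extracts both high-level semantics and low-level details from the text and visual tokens of a frozen VLM to condition a diffusion generative model. We show that LAP outperforms other shallow fusion architectures on text-image alignment for generation and on faithful transfer of visual information from the VLM to the diffusion model, which is key for editing. We propose VLM-Enabled Rewriting Injection with Flexible Inference (VERIFI), which conditions a diffusion transformer (DiT) only on the text tokens generated by the VLM. VERIFI combines the alignment of the conditioning distribution with the VLM's reasoning capabilities, improving capability and flexibility at inference. In addition, finetuning on an editing task not only improves text-image alignment for generation, indicating cross-modality knowledge transfer, but also shows tremendous generalization capabilities. When trained on single-image editing, the model generalizes zero-shot to multiple image references, further motivating UniFusion's unified encoder design.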
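
Note: The abstract describes LAP only at a high level (pooling information from multiple layers of a frozen VLM per token), so the following is a generic illustration of attention pooling across layers and not the authors' implementation. The single learned `query` vector, the dot-product scoring, and the function names are all assumptions introduced for this sketch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool_layers(layer_states, query):
    """Pool one token's hidden states across layers.

    layer_states: list of L vectors (one per VLM layer), each of length d.
    query: hypothetical learned query vector of length d (an assumption,
           not taken from the paper).
    Returns (pooled_vector, layer_weights).
    """
    d = len(query)
    # Scaled dot-product score of the query against each layer's state.
    scores = [
        sum(q * h for q, h in zip(query, state)) / math.sqrt(d)
        for state in layer_states
    ]
    weights = softmax(scores)
    # Weighted sum of the per-layer states.
    pooled = [
        sum(w * state[i] for w, state in zip(weights, layer_states))
        for i in range(d)
    ]
    return pooled, weights


def layerwise_attention_pool(hidden_states, query):
    """hidden_states[l][t] is the vector of token t at layer l."""
    num_tokens = len(hidden_states[0])
    return [
        pool_layers([layer[t] for layer in hidden_states], query)[0]
        for t in range(num_tokens)
    ]


# Toy input: 3 layers, 2 tokens, hidden size 2.
hidden_states = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
    [[3.0, 0.0], [0.0, 3.0]],
]
query = [1.0, 0.0]
pooled = layerwise_attention_pool(hidden_states, query)
print(pooled)
```

In this sketch each token receives its own mixture of shallow and deep layer features, which is one simple way to expose both low-level detail and high-level semantics to a downstream conditioning module. With an all-zero query the weights are uniform and the result reduces to a plain average over layers.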

Related papers