Fugu-MT Paper Translation (Abstract): Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation

Paper overview: Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation

arxiv url: http://arxiv.org/abs/2605.26111v1
Date: Mon, 25 May 2026 17:59:35 GMT
Status: Translation complete
Last updated in system: 2026-05-26 19:50:20.659549
Title: Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation
Authors: Shuhong Zheng, Aashish Kumar Misraa, Yu-Teng Li, Yu-Jhe Li, Igor Gilitschenski
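Abstract summary: Existing approaches often encode text and reference images separately. Recent frameworks that connect multimodal models and diffusion models improve instruction following, but largely overlook identity preservation. The authors condition diffusion models on Multimodal Large Language Models that jointly encode text and reference images, and augment this with VAE-based identity conditioning. The proposed approach harmonizes multimodal understanding with identity preservation, mitigates copy-paste issues, and achieves superior performance regarding human preference on subject-driven image generation.
Reference score (independently computed attention score): 22.419513267677278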
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Subject-driven image generation aims to synthesize new images that preserve the identity of the given subject while following textual instructions. Existing approaches often encode text and reference images separately. This limits cross-modal reasoning abilities and causes copy-paste artifacts. Recent frameworks that connect multimodal models and diffusion models improve instruction following, but largely overlook identity preservation. To address these limitations, we condition diffusion models on Multimodal Large Language Models (MLLMs) that jointly encode text and reference images, and augment it with VAE-based identity conditioning. A novel Dual Layer Aggregation (DLA) module is designed to aggregate multi-level MLLM features for optimal conditioning, and a multi-stage denoising strategy is applied to progressively balance the semantic information from MLLM and fine-detail identity from VAE during inference. Extensive experiments demonstrate that our approach harmonizes multimodal understanding with identity preservation, mitigates copy-paste issues, and achieves superior performance regarding human preference on subject-driven image generation. Our project website is available at https://zsh2000.github.io/squeeze-mllm-subject-gen/.
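
The abstract names two mechanisms, multi-level feature aggregation and a multi-stage denoising strategy, without giving their details. The sketch below is only an illustration of those two ideas, not the authors' implementation: the softmax-weighted sum, the two-stage schedule, the ordering of the stages, and every name (`aggregate_layers`, `conditioning_weights`, `switch_fraction`) are assumptions made for this example.

```python
import math


def aggregate_layers(layer_features, layer_logits):
    """Combine per-layer feature vectors with a softmax-weighted sum.

    Hypothetical aggregation rule: the abstract does not say how the
    DLA module weights or combines the multi-level MLLM features.
    """
    if len(layer_features) != len(layer_logits):
        raise ValueError("need exactly one logit per layer")
    # Numerically stable softmax over the per-layer logits.
    peak = max(layer_logits)
    exps = [math.exp(x - peak) for x in layer_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_features[0])
    # Weighted sum across layers, computed per feature dimension.
    return [
        sum(w * feats[i] for w, feats in zip(weights, layer_features))
        for i in range(dim)
    ]


def conditioning_weights(step, num_steps, switch_fraction=0.5):
    """Return (mllm_weight, vae_weight) for one denoising step.

    Hypothetical two-stage schedule: early steps use only the semantic
    MLLM condition, later steps also use the VAE identity condition.
    The actual stages and their order are not given in the abstract.
    """
    if not 0 <= step < num_steps:
        raise ValueError("step must be in [0, num_steps)")
    if step < switch_fraction * num_steps:
        return 1.0, 0.0
    return 1.0, 1.0


if __name__ == "__main__":
    # Two layers with equal logits average their features.
    print(aggregate_layers([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0]))
    for t in range(4):
        print(t, conditioning_weights(t, num_steps=4))
```

In a real pipeline the logits would be learned parameters and the two weights would scale the corresponding conditioning signals inside the denoiser; both are left out here because the abstract does not describe them.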

