Fugu-MT Paper Translation (Abstract): UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

Paper overview: UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

arxiv url: http://arxiv.org/abs/2605.12088v2
Date: Wed, 13 May 2026 15:41:37 GMT
Status: Translation complete
System update date: 2026-05-14 17:13:58.898554
Title: UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
Authors: Yiyan Xu, Qiulin Wang, Wenjie Wang, Yunyao Mao, Xintao Wang, Pengfei Wan, Kun Gai, Fuli Feng
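Abstract summary: We propose a unified visual conditioning framework that fuses ViT and VAE features before VLM encoding. Experiments on two multi-reference generation benchmarks show that UniCustom consistently improves subject consistency, instruction following, and compositional fidelity.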
Reference score (independently computed attention score): 65.53694602893042
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-reference image generation aims to synthesize images from textual instructions while faithfully preserving subject identities from multiple reference images. Existing VLM-enhanced diffusion models commonly rely on decoupled visual conditioning: semantic ViT features are processed by the VLM for instruction understanding, whereas appearance-rich VAE features are injected later into the diffusion backbone. Despite its intuitive design, this separation makes it difficult for the model to associate each semantically grounded subject with visual details from the correct reference image. As a result, the model may recognize which subject is being referred to, but fail to preserve its identity and fine-grained appearance, leading to attribute leakage and cross-reference confusion in complex multi-reference settings. To address this issue, we propose UniCustom, a unified visual conditioning framework that fuses ViT and VAE features before VLM encoding. This early fusion exposes the VLM to both semantic cues and appearance-rich details, enabling its hidden states to jointly encode the referred subject and corresponding visual appearance with only a lightweight linear fusion layer. To learn such unified representations, we adopt a two-stage training strategy: reconstruction-oriented pretraining that preserves reference-specific appearance details in the fused hidden states, followed by supervised finetuning on single- and multi-reference generation tasks. We further introduce a slot-wise binding regularization that encourages each image slot to preserve low-level details of its corresponding reference, thereby reducing cross-reference entanglement. Experiments on two multi-reference generation benchmarks demonstrate that UniCustom consistently improves subject consistency, instruction following, and compositional fidelity over strong baselines.
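The early fusion described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The abstract only states that ViT and VAE features are fused by a lightweight linear layer before VLM encoding. Everything else here is an assumption made for illustration: the one-to-one alignment of ViT and VAE tokens, concatenation followed by a single linear projection, the toy dimensions, and the hypothetical names `linear`, `fuse_tokens`, and `build_vlm_input`.

```python
import random


def linear(in_dim, out_dim, rng):
    """Create (weights, bias) for a dense layer; weights has shape out_dim x in_dim."""
    scale = in_dim ** -0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def fuse_tokens(vit_tokens, vae_tokens, weights, bias):
    """Fuse aligned ViT and VAE tokens: concatenate features, then project linearly.

    Assumption: both encoders yield the same number of tokens per image.
    """
    if len(vit_tokens) != len(vae_tokens):
        raise ValueError("ViT and VAE tokens must be aligned one-to-one")
    fused = []
    for vit, vae in zip(vit_tokens, vae_tokens):
        x = vit + vae  # list concatenation along the feature axis
        fused.append([sum(w * v for w, v in zip(row, x)) + b
                      for row, b in zip(weights, bias)])
    return fused


def build_vlm_input(text_tokens, reference_images, weights, bias):
    """Lay out one slot of fused tokens per reference image, followed by the text.

    Returns the token sequence and the (start, end) index range of each slot.
    """
    sequence, slots = [], []
    for vit_tokens, vae_tokens in reference_images:
        start = len(sequence)
        sequence.extend(fuse_tokens(vit_tokens, vae_tokens, weights, bias))
        slots.append((start, len(sequence)))
    sequence.extend(text_tokens)
    return sequence, slots


# Toy usage with random stand-ins for encoder outputs (all sizes are assumptions).
rng = random.Random(0)
VIT_DIM, VAE_DIM, HIDDEN = 4, 2, 3


def fake_features(num_tokens, dim):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]


weights, bias = linear(VIT_DIM + VAE_DIM, HIDDEN, rng)
references = [
    (fake_features(5, VIT_DIM), fake_features(5, VAE_DIM)),  # reference image 1
    (fake_features(3, VIT_DIM), fake_features(3, VAE_DIM)),  # reference image 2
]
text = fake_features(2, HIDDEN)  # stand-in for embedded instruction tokens
sequence, slots = build_vlm_input(text, references, weights, bias)
print(slots)          # [(0, 5), (5, 8)]
print(len(sequence))  # 10
```

Recording the index range of each slot is one plausible way to let a later loss address the hidden states of a single reference, which is the kind of per-slot access the slot-wise binding regularization would need. The abstract does not specify how that regularization is computed, so it is not sketched here.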

