Fugu-MT Paper Translation (Abstract): Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers

Paper overview: Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers

arxiv url: http://arxiv.org/abs/2603.05769v1
Date: Fri, 06 Mar 2026 00:09:49 GMT
Status: Translation complete
Last updated in system: 2026-03-09 13:17:44.699564
Title: Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers
Authors: Ruidong Chen, Yancheng Bai, Xuanpu Zhang, Jianhao Zeng, Lanjun Wang, Dan Song, Lei Sun, Xiangxiang Chu, Anan Liu
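Abstract summary: Region-instructed layout control in text-to-image generation is highly practical, yet existing methods suffer from limitations. We propose LayerBind, which models regional generation as distinct layers and binds them during generation.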
Reference score (independently computed attention measure): 49.08465459791972
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Region-instructed layout control in text-to-image generation is highly practical, yet existing methods suffer from limitations: (i) training-based approaches inherit data bias and often degrade image quality, and (ii) current techniques struggle with occlusion order, limiting real-world usability. To address these issues, we propose LayerBind. By modeling regional generation as distinct layers and binding them during the generation, our method enables precise regional and occlusion controllability. Our motivation stems from the observation that spatial layout and occlusion are established at a very early denoising stage, suggesting that rearranging the early latent structure is sufficient to modify the final output. Building on this, we structure the scheme into two phases: instance initialization and subsequent semantic nursing. (1) First, leveraging the contextual sharing mechanism in multimodal joint attention, Layer-wise Instance Initialization creates per-instance branches that attend to their own regions while anchoring to the shared background. At a designated early step, these branches are fused according to the layer order to form a unified latent with a pre-established layout. (2) Then, Layer-wise Semantic Nursing reinforces regional details and maintains the occlusion order via a layer-wise attention enhancement. Specifically, a sequential layered attention path operates alongside the standard global path, with updates composited under a layer-transparency scheduler. LayerBind is training-free and plug-and-play, serving as a regional and occlusion controller across Diffusion Transformers. Beyond generation, it natively supports editable workflows, allowing for flexible modifications like changing instances or rearranging visible orders. Both qualitative and quantitative results demonstrate LayerBind's effectiveness, highlighting its strong potential for creative applications.
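The abstract says per-instance branches are "fused according to the layer order" and that later updates are "composited under a layer-transparency scheduler", but it gives no formulas. As a rough illustration only, the sketch below shows plain back-to-front alpha compositing of per-instance grids over a shared background, which is one simple way a layer order can decide which instance stays visible where regions overlap. It is not the authors' implementation: the function name `fuse_layers`, the list-of-lists grid representation, the binary region masks, and the per-layer `alphas` (a stand-in for whatever the transparency scheduler outputs at a given step) are all assumptions made for this example.

```python
from typing import List, Sequence

Grid = List[List[float]]  # hypothetical stand-in for a 2D latent map


def fuse_layers(background: Grid,
                layers: Sequence[Grid],
                masks: Sequence[Grid],
                alphas: Sequence[float]) -> Grid:
    """Composite per-instance grids over a background, back to front.

    layers[0] is the rearmost instance and layers[-1] the frontmost, so a
    later layer covers (occludes) earlier ones wherever its mask is set.
    """
    if not (len(layers) == len(masks) == len(alphas)):
        raise ValueError("layers, masks and alphas must have equal length")
    fused = [row[:] for row in background]  # copy so the input is untouched
    for layer, mask, alpha in zip(layers, masks, alphas):
        for y, row in enumerate(fused):
            for x, value in enumerate(row):
                w = alpha * mask[y][x]  # effective opacity at this cell
                row[x] = (1.0 - w) * value + w * layer[y][x]
    return fused


# Toy 1x4 example: two instances whose regions overlap in the middle cells.
background = [[0.0, 0.0, 0.0, 0.0]]
inst_a = [[1.0, 1.0, 1.0, 1.0]]
inst_b = [[2.0, 2.0, 2.0, 2.0]]
mask_a = [[1.0, 1.0, 1.0, 0.0]]
mask_b = [[0.0, 1.0, 1.0, 1.0]]

# B in front of A, then the order swapped so that A is in front of B.
b_front = fuse_layers(background, [inst_a, inst_b], [mask_a, mask_b], [1.0, 1.0])
a_front = fuse_layers(background, [inst_b, inst_a], [mask_b, mask_a], [1.0, 1.0])
print(b_front)  # [[1.0, 2.0, 2.0, 2.0]]
print(a_front)  # [[1.0, 1.0, 1.0, 2.0]]
```

Swapping the order of the two layers changes only the overlapping cells, which mirrors the "rearranging visible orders" editing use case the abstract mentions. An alpha below 1.0 would let the layers underneath show through.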


Related papers