Fugu-MT Paper Translation (Abstract): AI-T2I: Aggregating-and-Isolating Cross-Attention to Diffusion Models for Text-to-Image Synthesis

Paper abstract: AI-T2I: Aggregating-and-Isolating Cross-Attention to Diffusion Models for Text-to-Image Synthesis

arxiv url: http://arxiv.org/abs/2605.25763v2
Date: Tue, 26 May 2026 02:36:41 GMT
Status: Translation complete
Last updated in system: 2026-05-27 17:51:41.168715
Title: AI-T2I: Aggregating-and-Isolating Cross-Attention to Diffusion Models for Text-to-Image Synthesis
Authors: Shipeng Cao, Biao Qian, Haipeng Liu, Yang Wang, Meng Wang
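Abstract summary: This paper proposes an Aggregating-and-Isolating cross-attention approach to diffusion models for text-to-image synthesis, dubbed AI-T2I. AI-T2I also generalizes well to other tasks, e.g., controllable layout generation and personalized generation.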
Reference score (independently computed attention metric): 12.76456980137364
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text-to-image synthesis has made significant progress, benefiting from the strong generative capabilities of diffusion models. However, these models struggle to achieve precise text-to-image alignment within cross-attention maps during the denoising process. Existing works primarily focus on inter-subject-token activations (i.e., cross-attention scores) overlap for different subjects, overlooking the intra-subject-token activations scattering issue for identical subjects. In this paper, we propose an Aggregating-and-Isolating cross-attention approach to diffusion models for Text-to-Image synthesis, dubbed AI-T2I. Technically, to address the scattering issue, we devise an aggregation loss to identify and consolidate the scattered intra-token activations, which implicitly helps mitigate the potential overlap issue. Upon that, an isolation loss is further introduced to push the inter-token activations apart, thus fulfilling precise text-to-image alignment. Extensive experiments on various benchmarks demonstrate the superiority of AI-T2I over the state-of-the-art works for text-to-image synthesis. Furthermore, our AI-T2I exhibits excellent generalization across other tasks, e.g., controllable layout generation and personalized generation.
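The abstract names two losses but gives no formulas. The following is a minimal sketch of the general idea only, under stated assumptions: each subject token's cross-attention map is treated as a small non-negative grid, scattering is measured as spatial variance around the map's centroid, and overlap between two tokens is measured as the sum of element-wise minima of the normalized maps. The function names `aggregation_penalty` and `isolation_penalty` and both formulas are hypothetical stand-ins chosen for illustration; they are not the paper's definitions.

```python
from typing import List

# One cross-attention map for a single subject token: rows x cols of scores >= 0.
Grid = List[List[float]]


def normalize(attn: Grid) -> Grid:
    """Scale a map so its entries sum to 1 (a spatial distribution)."""
    total = sum(sum(row) for row in attn)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    return [[v / total for v in row] for row in attn]


def aggregation_penalty(attn: Grid) -> float:
    """Spatial variance of the map around its centroid.

    Hypothetical stand-in for an aggregation loss: small when one token's
    activations form a compact region, large when they are scattered.
    """
    p = normalize(attn)
    cy = sum(y * v for y, row in enumerate(p) for v in row)
    cx = sum(x * v for row in p for x, v in enumerate(row))
    return sum(
        ((y - cy) ** 2 + (x - cx) ** 2) * v
        for y, row in enumerate(p)
        for x, v in enumerate(row)
    )


def isolation_penalty(attn_a: Grid, attn_b: Grid) -> float:
    """Overlap of two normalized maps (sum of element-wise minima).

    Hypothetical stand-in for an isolation loss: 0 when the two tokens
    attend to disjoint regions, 1 when their maps coincide.
    """
    pa, pb = normalize(attn_a), normalize(attn_b)
    return sum(min(a, b) for ra, rb in zip(pa, pb) for a, b in zip(ra, rb))


# Toy 4x4 maps (illustrative values, not data from the paper).
compact = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
scattered = [[1, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 1]]
other = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]

print(aggregation_penalty(compact), aggregation_penalty(scattered))
print(isolation_penalty(compact, other), isolation_penalty(compact, compact))
```

On these toy maps the scattered grid receives a larger aggregation penalty than the compact one, and the disjoint pair receives a lower isolation penalty than a map compared with itself. How such penalties would be weighted or applied during denoising is not stated in the abstract and is left out of the sketch.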

Related papers