Fugu-MT Paper Translation (Abstract): DeCoT: Decomposing Complex Instructions for Enhanced Text-to-Image Generation with Large Language Models

Paper overview: DeCoT: Decomposing Complex Instructions for Enhanced Text-to-Image Generation with Large Language Models

arxiv url: http://arxiv.org/abs/2508.12396v1
Date: Sun, 17 Aug 2025 15:15:39 GMT
Status: Translation complete
System update date: 2025-08-19 14:49:10.740284
Title: DeCoT: Decomposing Complex Instructions for Enhanced Text-to-Image Generation with Large Language Models
Title (reference translation): DeCoT: Decomposing Complex Instructions for Text-to-Image Generation Using Large Language Models
Authors: Xiaochuan Lin, Xiangyong Chen, Xuan Li, Yichen Su
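Abstract summary: This paper proposes DeCoT (Decomposition-CoT), a framework that enhances T2I models' understanding and execution of complex instructions. Extensive experiments on the LongBench-T2I dataset show that DeCoT consistently and substantially improves the performance of leading T2I models.
Reference score (independently computed attention score): 9.800887055353096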
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite remarkable advancements, current Text-to-Image (T2I) models struggle with complex, long-form textual instructions, frequently failing to accurately render intricate details, spatial relationships, or specific constraints. This limitation is highlighted by benchmarks such as LongBench-T2I, which reveal deficiencies in handling composition, specific text, and fine textures. To address this, we propose DeCoT (Decomposition-CoT), a novel framework that leverages Large Language Models (LLMs) to significantly enhance T2I models' understanding and execution of complex instructions. DeCoT operates in two core stages: first, Complex Instruction Decomposition and Semantic Enhancement, where an LLM breaks down raw instructions into structured, actionable semantic units and clarifies ambiguities; second, Multi-Stage Prompt Integration and Adaptive Generation, which transforms these units into a hierarchical or optimized single prompt tailored for existing T2I models. Extensive experiments on the LongBench-T2I dataset demonstrate that DeCoT consistently and substantially improves the performance of leading T2I models across all evaluated dimensions, particularly in challenging aspects like "Text" and "Composition". Quantitative results, validated by multiple MLLM evaluators (Gemini-2.0-Flash and InternVL3-78B), show that DeCoT, when integrated with Infinity-8B, achieves an average score of 3.52, outperforming the baseline Infinity-8B (3.44). Ablation studies confirm the critical contribution of each DeCoT component and the importance of sophisticated LLM prompting. Furthermore, human evaluations corroborate these findings, indicating superior perceptual quality and instruction fidelity. DeCoT effectively bridges the gap between high-level user intent and T2I model requirements, leading to more faithful and accurate image generation.
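Abstract (reference translation): Despite remarkable progress, current Text-to-Image (T2I) models struggle with complex, long-form instructions and often fail to accurately render intricate details, spatial relationships, or specific constraints. This limitation is highlighted by benchmarks such as LongBench-T2I. To address it, we propose DeCoT, a new framework that uses Large Language Models (LLMs) to significantly improve the understanding and execution of complex instructions. DeCoT operates in two core stages: first, Complex Instruction Decomposition and Semantic Enhancement, in which an LLM breaks raw instructions into structured, actionable semantic units and clarifies ambiguities; second, Multi-Stage Prompt Integration and Adaptive Generation, which turns these units into a hierarchical or optimized single prompt for existing T2I models. Extensive experiments on the LongBench-T2I dataset show that DeCoT consistently and substantially improves the performance of leading T2I models across all evaluated dimensions, particularly in challenging aspects such as "Text" and "Composition". Quantitative results validated by multiple MLLM evaluators (Gemini-2.0-Flash and InternVL3-78B) show that DeCoT integrated with Infinity-8B achieves an average score of 3.52, exceeding the baseline Infinity-8B (3.44). Ablation studies confirm the critical contribution of each DeCoT component and the importance of sophisticated LLM prompting. Human evaluations further corroborate these findings, indicating superior perceptual quality and instruction fidelity. DeCoT effectively bridges the gap between high-level user intent and T2I model requirements, leading to more faithful and accurate image generation.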
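
The two stages named in the abstract can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the `SemanticUnit` type, the prompt template, the category labels, the ordering rule in `integrate`, and the `fake_llm` stub are all hypothetical, and a real system would call an actual LLM and pass the resulting prompt to a T2I model.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class SemanticUnit:
    """One decomposed piece of an instruction (hypothetical structure)."""
    category: str  # illustrative labels such as "composition" or "text"
    content: str


# Hypothetical prompt; the abstract does not give the actual wording.
DECOMPOSE_TEMPLATE = (
    "Split the following image instruction into semantic units. "
    "Return a JSON list of objects with keys 'category' and 'content'.\n"
    "Instruction: {instruction}"
)


def decompose(instruction: str, llm: Callable[[str], str]) -> List[SemanticUnit]:
    """Stage 1 (sketch): ask an LLM to break the instruction into units."""
    raw = llm(DECOMPOSE_TEMPLATE.format(instruction=instruction))
    return [SemanticUnit(item["category"], item["content"]) for item in json.loads(raw)]


def integrate(
    units: Sequence[SemanticUnit],
    order: Sequence[str] = ("composition", "object", "text", "texture"),
) -> str:
    """Stage 2 (sketch): merge units into one prompt, ordered by category.

    The ordering is an assumption made for illustration only.
    """
    rank = {name: i for i, name in enumerate(order)}
    # sorted() is stable, so units with unknown categories keep their order at the end.
    ordered = sorted(units, key=lambda u: rank.get(u.category, len(order)))
    return ", ".join(u.content for u in ordered)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    return json.dumps([
        {"category": "text", "content": "a sign reading 'OPEN'"},
        {"category": "composition", "content": "a cafe on the left of a street"},
    ])


units = decompose("A street with a cafe on the left whose sign says OPEN", fake_llm)
prompt = integrate(units)
print(prompt)
```

Passing the LLM as a plain callable keeps the sketch independent of any particular model API; swapping `fake_llm` for a real client is the only change needed to try it with an actual model.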

Related papers