Fugu-MT paper translation (abstract): The Intricate Dance of Prompt Complexity, Quality, Diversity, and Consistency in T2I Models

Paper overview: The Intricate Dance of Prompt Complexity, Quality, Diversity, and Consistency in T2I Models

arxiv url: http://arxiv.org/abs/2510.19557v1
Date: Wed, 22 Oct 2025 13:13:27 GMT
Status: translation complete
Last updated in system: 2025-10-25 03:08:15.821547
Title: The Intricate Dance of Prompt Complexity, Quality, Diversity, and Consistency in T2I Models
Title (reference translation): The intricate dance of prompt complexity, quality, diversity, and consistency in T2I models
Authors: Xiaofeng Zhang, Aaron Courville, Michal Drozdzal, Adriana Romero-Soriano
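Abstract summary: Text-to-image (T2I) models offer great potential for creating virtually limitless synthetic data. Previous work has evaluated the utility of synthetic data from T2I models on three key desiderata: quality, diversity, and consistency. The paper introduces a new evaluation framework for comparing the utility of real and synthetic data.
Reference score (independently computed attention measure): 12.156662936278751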
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Text-to-image (T2I) models offer great potential for creating virtually limitless synthetic data, a valuable resource compared to fixed and finite real datasets. Previous works evaluate the utility of synthetic data from T2I models on three key desiderata: quality, diversity, and consistency. While prompt engineering is the primary means of interacting with T2I models, the systematic impact of prompt complexity on these critical utility axes remains underexplored. In this paper, we first conduct synthetic experiments to motivate the difficulty of generalization w.r.t. prompt complexity and explain the observed difficulty with theoretical derivations. Then, we introduce a new evaluation framework that can compare the utility of real data and synthetic data, and present a comprehensive analysis of how prompt complexity influences the utility of synthetic data generated by commonly used T2I models. We conduct our study across diverse datasets, including CC12M, ImageNet-1k, and DCI, and evaluate different inference-time intervention methods. Our synthetic experiments show that generalizing to more general conditions is harder than the other way round, since the former needs an estimated likelihood that is not learned by diffusion models. Our large-scale empirical experiments reveal that increasing prompt complexity results in lower conditional diversity and prompt consistency, while reducing the synthetic-to-real distribution shift, which aligns with the synthetic experiments. Moreover, current inference-time interventions can augment the diversity of the generations at the expense of moving outside the support of real data. Among those interventions, prompt expansion, by deliberately using a pre-trained language model as a likelihood estimator, consistently achieves the highest performance in both image diversity and aesthetics, even higher than that of real data.
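Abstract (reference translation): Text-to-image (T2I) models offer great potential for creating virtually limitless synthetic data, a valuable resource compared with fixed, finite real datasets. Previous work has evaluated the utility of synthetic data from T2I models on three key desiderata: quality, diversity, and consistency. Although prompt engineering is the primary means of interacting with T2I models, the systematic effect of prompt complexity on these utility axes remains underexplored. The paper first runs synthetic experiments to motivate the difficulty of generalizing with respect to prompt complexity, and explains that difficulty with theoretical derivations. It then introduces a new evaluation framework that can compare the utility of real and synthetic data, and presents a comprehensive analysis of how prompt complexity influences the utility of synthetic data generated by commonly used T2I models. The study covers diverse datasets, including CC12M, ImageNet-1k, and DCI, and evaluates different inference-time intervention methods. The synthetic experiments show that generalizing to more general conditions is harder than the reverse, because it requires an estimated likelihood that diffusion models do not learn. The large-scale empirical experiments show that increasing prompt complexity lowers conditional diversity and prompt consistency while reducing the synthetic-to-real distribution shift, in line with the synthetic experiments. Moreover, current inference-time interventions can increase the diversity of the generations, at the cost of moving outside the support of the real data. Among those interventions, prompt expansion, which deliberately uses a pre-trained language model as a likelihood estimator, consistently achieves the highest performance in both image diversity and aesthetics, even higher than that of real data.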
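
Illustrative sketch (not from the paper): the abstract says that generalizing to more general conditions needs an estimated likelihood that diffusion models do not learn, and that prompt expansion supplies one through a language model. One plausible reading is the mixture p(x | general) = sum over specific prompts s of p(x | s) * p(s | general). The toy Python below assumes that reading. The 1-D Gaussians, the prompt strings, and the weights are all made-up stand-ins for a T2I model and a language model, and every function name is hypothetical.

```python
import random

# Hypothetical stand-in for a T2I model: each specific prompt maps to a
# 1-D Gaussian (mean, std) playing the role of p(x | specific prompt).
SPECIFIC_MODELS = {
    "a cat": (-2.0, 0.5),
    "a dog": (3.0, 0.5),
}


def sample_specific(prompt, rng):
    """Draw one sample from the toy conditional p(x | specific prompt)."""
    mean, std = SPECIFIC_MODELS[prompt]
    return rng.gauss(mean, std)


def sample_general(expansion_weights, n, rng):
    """Sample from p(x | general) = sum_s p(x | s) * p(s | general).

    expansion_weights plays the role of p(s | general). In prompt
    expansion this would come from a language model; here it is given.
    """
    prompts = list(expansion_weights)
    weights = [expansion_weights[p] for p in prompts]
    chosen = rng.choices(prompts, weights=weights, k=n)
    return [sample_specific(p, rng) for p in chosen]


def mixture_mean(expansion_weights):
    """Exact mean of the mixture, used to sanity-check the sampler."""
    total = sum(expansion_weights.values())
    return sum(
        w * SPECIFIC_MODELS[p][0] for p, w in expansion_weights.items()
    ) / total


rng = random.Random(0)
weights = {"a cat": 0.3, "a dog": 0.7}  # assumed p(s | "an animal")
samples = sample_general(weights, 20000, rng)
empirical_mean = sum(samples) / len(samples)
print("empirical mean:", empirical_mean)
print("mixture mean:  ", mixture_mean(weights))
```

Without the weights, the conditional samplers alone do not determine how to sample for the general prompt. That missing piece is what the sketch treats as the "likelihood estimator".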

Related papers