Fugu-MT Paper Translation (Abstract): Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?

Paper abstract: Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?

arxiv url: http://arxiv.org/abs/2509.03516v2
Date: Wed, 01 Oct 2025 15:26:16 GMT
Status: Translation complete
System update date: 2025-10-02 14:33:21.733414
Title: Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?
Title (reference translation): Can Text-to-Image Models Take the Stage?
Authors: Ouxiang Li, Yuan Wang, Xinting Hu, Huijuan Huang, Rui Chen, Jiarong Ou, Xin Tao, Pengfei Wan, Xiaojuan Qi, Fuli Feng
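Abstract summary: T2I-CoReBench is a comprehensive and complex benchmark that evaluates both the composition and reasoning capabilities of T2I models. To increase complexity, reflecting inherent real-world complexity, each prompt is curated with higher compositional density. In total, the benchmark comprises 1,080 challenging prompts and around 13,500 checklist questions.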
Reference score (independently computed attention score): 63.66192651248858
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text-to-image (T2I) generation aims to synthesize images from textual prompts, which jointly specify what must be shown and imply what can be inferred, which thus correspond to two core capabilities: composition and reasoning. Despite recent advances of T2I models in both composition and reasoning, existing benchmarks remain limited in evaluation. They not only fail to provide comprehensive coverage across and within both capabilities, but also largely restrict evaluation to low scene density and simple one-to-one reasoning. To address these limitations, we propose T2I-CoReBench, a comprehensive and complex benchmark that evaluates both composition and reasoning capabilities of T2I models. To ensure comprehensiveness, we structure composition around scene graph elements (instance, attribute, and relation) and reasoning around the philosophical framework of inference (deductive, inductive, and abductive), formulating a 12-dimensional evaluation taxonomy. To increase complexity, driven by the inherent real-world complexities, we curate each prompt with higher compositional density for composition and greater reasoning intensity for reasoning. To facilitate fine-grained and reliable evaluation, we also pair each evaluation prompt with a checklist that specifies individual yes/no questions to assess each intended element independently. In statistics, our benchmark comprises 1,080 challenging prompts and around 13,500 checklist questions. Experiments across 28 current T2I models reveal that their composition capability still remains limited in high compositional scenarios, while the reasoning capability lags even further behind as a critical bottleneck, with all models struggling to infer implicit elements from prompts.
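Abstract (reference translation): Text-to-image (T2I) generation aims to synthesize images from textual prompts. Although recent T2I models have advanced in both composition and reasoning, existing benchmarks remain limited in evaluation: they fail to cover both capabilities comprehensively, and they largely restrict evaluation to low scene density and simple one-to-one reasoning. To address these limitations, we propose T2I-CoReBench, a comprehensive and complex benchmark that evaluates both the composition and reasoning capabilities of T2I models. To ensure comprehensiveness, we structure composition around scene graph elements (instance, attribute, and relation) and reasoning around the philosophical framework of inference (deductive, inductive, and abductive), forming a 12-dimensional evaluation taxonomy. To increase complexity, reflecting inherent real-world complexity, we curate each prompt with higher compositional density for composition and greater reasoning intensity for reasoning. We also pair each evaluation prompt with a checklist of individual yes/no questions so that each intended element is assessed independently. In total, the benchmark comprises 1,080 challenging prompts and around 13,500 checklist questions. Experiments on 28 current T2I models show that composition capability remains limited in highly compositional scenarios, while reasoning capability lags even further behind as a critical bottleneck, with all models struggling to infer implicit elements from prompts.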
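The checklist-based evaluation described in the abstract can be illustrated with a minimal sketch. The abstract does not state how yes/no answers are aggregated, so this assumes a per-prompt score equal to the fraction of "yes" answers and a benchmark score equal to the unweighted mean over prompts. The names `ChecklistItem`, `prompt_score`, and `benchmark_score`, and the example questions, are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    question: str  # a yes/no question about one intended element
    answer: bool   # True if a judge (human or model) answered "yes"


def prompt_score(items: list[ChecklistItem]) -> float:
    """Fraction of checklist questions answered "yes" for one image (assumed metric)."""
    if not items:
        raise ValueError("checklist must contain at least one question")
    return sum(item.answer for item in items) / len(items)


def benchmark_score(per_prompt: list[list[ChecklistItem]]) -> float:
    """Unweighted mean of per-prompt scores (assumed aggregation)."""
    return sum(prompt_score(items) for items in per_prompt) / len(per_prompt)


# Hypothetical example data, not taken from the benchmark.
example = [
    [ChecklistItem("Is there a red umbrella?", True),
     ChecklistItem("Is the umbrella to the left of the bench?", False)],
    [ChecklistItem("Has the ice in the glass melted?", True)],
]
print(benchmark_score(example))  # 0.75

# Average checklist length implied by the abstract's approximate totals.
print(13_500 / 1_080)  # 12.5
```

Under the totals reported in the abstract, each prompt carries roughly 12.5 checklist questions on average, so a single missed element moves a per-prompt score by about 8 percentage points under this assumed metric.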


Related papers