Fugu-MT Paper Translation (Abstract): AcT2I: Evaluating and Improving Action Depiction in Text-to-Image Models

Paper overview: AcT2I: Evaluating and Improving Action Depiction in Text-to-Image Models

arxiv url: http://arxiv.org/abs/2509.16141v1
Date: Fri, 19 Sep 2025 16:41:39 GMT
Status: Translation complete
System update date: 2025-09-22 18:18:11.247982
Title: AcT2I: Evaluating and Improving Action Depiction in Text-to-Image Models
Title (reference translation): AcT2I: Evaluating and Improving Action Depiction in Text-to-Image Models
Authors: Vatsal Malaviya, Agneet Chatterjee, Maitreya Patel, Yezhou Yang, Chitta Baral
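Abstract summary: This paper introduces AcT2I, a benchmark for evaluating the performance of T2I models in generating images from action-centric prompts. We experimentally validate that leading T2I models do not fare well on AcT2I. To address this limitation, we develop a training-free knowledge distillation technique that uses large language models.
Reference score (independently computed attention score): 58.85362281293525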
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Text-to-Image (T2I) models have recently achieved remarkable success in generating images from textual descriptions. However, challenges still persist in accurately rendering complex scenes where actions and interactions form the primary semantic focus. Our key observation in this work is that T2I models frequently struggle to capture nuanced and often implicit attributes inherent in action depiction, leading to generating images that lack key contextual details. To enable systematic evaluation, we introduce AcT2I, a benchmark designed to evaluate the performance of T2I models in generating images from action-centric prompts. We experimentally validate that leading T2I models do not fare well on AcT2I. We further hypothesize that this shortcoming arises from the incomplete representation of the inherent attributes and contextual dependencies in the training corpora of existing T2I models. We build upon this by developing a training-free, knowledge distillation technique utilizing Large Language Models to address this limitation. Specifically, we enhance prompts by incorporating dense information across three dimensions, observing that injecting prompts with temporal details significantly improves image generation accuracy, with our best model achieving an increase of 72%. Our findings highlight the limitations of current T2I methods in generating images that require complex reasoning and demonstrate that integrating linguistic knowledge in a systematic way can notably advance the generation of nuanced and contextually accurate images.
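Abstract (reference translation): Text-to-Image (T2I) models have recently succeeded in generating images from textual descriptions. However, challenges remain in accurately rendering complex scenes where actions and interactions form the primary semantic focus. Our key observation is that T2I models often struggle to capture the nuanced and implicit attributes inherent in action depiction, and so generate images that lack important contextual details. We introduce AcT2I, a benchmark that evaluates the performance of T2I models in generating images from action-centric prompts. We experimentally validate that leading T2I models do not fare well on AcT2I. We further hypothesize that this shortcoming arises from the incomplete representation of inherent attributes and contextual dependencies in the training corpora of existing T2I models. To address this limitation, we develop a training-free knowledge distillation technique that uses large language models. Specifically, we enhance prompts by incorporating dense information across three dimensions; injecting temporal details into prompts significantly improves image generation accuracy, with our best model achieving an increase of 72%. This work highlights the limitations of current T2I methods in generating images that require complex reasoning, and shows that systematically integrating linguistic knowledge can advance the generation of nuanced and contextually accurate images.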

Related papers