Fugu-MT Paper Translation (Abstract): AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters

Paper summary: AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters

arxiv url: http://arxiv.org/abs/2605.22645v1
Date: Thu, 21 May 2026 15:51:53 GMT
Status: Translation complete
Last updated in system: 2026-05-22 16:35:42.332635
Title: AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters
Authors: Hanjun Luo, Zhimu Huang, Sylvia Chung, Yiran Wang, Yingbin Jin, Jialin Li, Jiang Li, Xinfeng Li, Hanan Salam
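Abstract summary: AtelierEval is the first unified benchmark that quantifies prompting proficiency across 360 expert-crafted tasks. To enable scalable and reliable evaluation, we propose AtelierJudge, a skill-based, memory-augmented agentic evaluator.
Reference score (independently computed attention score): 10.947354016765097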
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text-to-image (T2I) systems increasingly rely on upstream prompters, either humans or multimodal large language models (MLLMs), to translate user intent into detailed prompts. Yet current benchmarks fix the prompt and only evaluate T2I models, leaving the prompting proficiency of this upstream component entirely unmeasured. We introduce AtelierEval, the first unified benchmark that quantifies prompting proficiency across 360 expert-crafted tasks. Grounded in a cognitive view, it spans three task categories and instantiates tasks using a taxonomy of real-world challenges, with a dual interface for both humans and MLLMs. To enable scalable and reliable evaluation, we propose AtelierJudge, a skill-based, memory-augmented agentic evaluator. It produces subjective and objective scores for prompt-image pairs, achieving a Spearman correlation of 0.79 with human experts, approaching human performance. Extensive experiments benchmark 8 MLLMs against 48 human users across 4 T2I backends, validate AtelierEval as a robust diagnostic tool, and reveal the superiority of mimicry over planning, advocating for an image-augmented direction for future prompters. Our work is released to support future research.
