Fugu-MT Paper Translation (Summary): ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks

Paper summary: ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks

arxiv url: http://arxiv.org/abs/2603.27862v1
Date: Sun, 29 Mar 2026 20:42:05 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:45.147782
Title: ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks
Title (reference translation): ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks
Authors: Samin Mahdizadeh Sani, Max Ku, Nima Jamali, Matina Mahdizadeh Sani, Paria Khoshtab, Wei-Chieh Sun, Parnian Fazel, Zhi Rui Tam, Thomas Chong, Edisy Kin Wai Chan, Donald Wai Tong Tsang, Chiao-Wei Hsu, Ting Wai Lam, Ho Yin Sam Ng, Chiafeng Chu, Chak-Wing Mak, Keming Wu, Hiu Tung Wong, Yik Chun Ho, Chi Ruan, Zhuofeng Li, I-Sheng Fang, Shih-Ying Yeh, Ho Kei Cheng, Ping Nie, Wenhu Chen
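Abstract summary: ImagenWorld is a benchmark of 3.6K condition sets spanning six core tasks (generation and editing, with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots). The benchmark is supported by 20K fine-grained human annotations and an explainable evaluation schema that tags localized object-level and segment-level errors.
Reference score (independently computed attention score): 33.143423584118516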
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Advances in diffusion, autoregressive, and hybrid models have enabled high-quality image synthesis for tasks such as text-to-image generation, editing, and reference-guided composition. Yet existing benchmarks remain limited: they either focus on isolated tasks, cover only narrow domains, or provide opaque scores without explaining failure modes. We introduce ImagenWorld, a benchmark of 3.6K condition sets spanning six core tasks (generation and editing, with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots). The benchmark is supported by 20K fine-grained human annotations and an explainable evaluation schema that tags localized object-level and segment-level errors, complementing automated VLM-based metrics. Our large-scale evaluation of 14 models yields several insights: (1) Models typically struggle more in editing tasks than in generation tasks, especially in local edits. (2) Models excel in artistic and photorealistic settings but struggle with symbolic and text-heavy domains such as screenshots and information graphics. (3) Closed-source systems lead overall, while targeted data curation (e.g., Qwen-Image) narrows the gap in text-heavy cases. (4) Modern VLM-based metrics achieve Kendall accuracies up to 0.79, approximating human ranking, but fall short of fine-grained, explainable error attribution. ImagenWorld provides both a rigorous benchmark and a diagnostic tool to advance robust image generation.
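Abstract (reference translation): Progress in diffusion, autoregressive, and hybrid models has made high-quality image synthesis possible for tasks such as text-to-image generation, editing, and reference-guided composition. Existing benchmarks, however, are limited: they focus on isolated tasks, cover only narrow domains, or report opaque scores without explaining failure modes. ImagenWorld is a benchmark of 3.6K condition sets spanning six core tasks (generation and editing, with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots). It is supported by 20K fine-grained human annotations and an explainable evaluation schema that tags localized object-level and segment-level errors, complementing automated VLM-based metrics. The evaluation of 14 models yields several insights: (1) Models usually struggle more with editing tasks than with generation tasks, especially local edits. (2) Models do well in artistic and photorealistic settings but struggle in symbolic and text-heavy domains such as screenshots and information graphics. (3) Closed-source systems lead overall, while targeted data curation (e.g., Qwen-Image) narrows the gap in text-heavy cases. (4) Modern VLM-based metrics reach Kendall accuracies of up to 0.79, approximating human ranking, but fall short of fine-grained, explainable error attribution. ImagenWorld provides both a rigorous benchmark and a diagnostic tool for advancing robust image generation.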
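The abstract reports "Kendall accuracies up to 0.79" for VLM-based metrics but does not define the term. One common reading is pairwise ranking agreement: the fraction of item pairs that an automated metric orders the same way as the human scores. The Python sketch below illustrates that reading only. The function name `kendall_pairwise_accuracy`, the tie handling, and the example scores are all assumptions for illustration, not the paper's definition or data.

```python
from itertools import combinations


def kendall_pairwise_accuracy(human, metric):
    """Fraction of item pairs that the metric orders the same way as humans.

    Assumed conventions (not taken from the paper): pairs tied in the
    human scores are skipped, and a metric tie on a pair that humans
    do order counts as a disagreement.
    """
    if len(human) != len(metric):
        raise ValueError("score lists must have the same length")
    agree = 0
    total = 0
    for i, j in combinations(range(len(human)), 2):
        human_diff = human[i] - human[j]
        if human_diff == 0:
            continue  # no human preference for this pair
        total += 1
        metric_diff = metric[i] - metric[j]
        if human_diff * metric_diff > 0:
            agree += 1  # same sign: both rank the pair the same way
    if total == 0:
        raise ValueError("no pairs with distinct human scores")
    return agree / total


# Hypothetical scores for four generated images (not data from the paper).
human_scores = [0.9, 0.4, 0.7, 0.2]
metric_scores = [0.8, 0.5, 0.3, 0.1]
accuracy = kendall_pairwise_accuracy(human_scores, metric_scores)
print(f"pairwise accuracy: {accuracy:.3f}")
```

In this toy example the metric agrees with the human ordering on five of the six pairs. When there are no ties, this quantity equals (tau + 1) / 2, where tau is Kendall's rank correlation coefficient, so an accuracy of 0.5 corresponds to no rank correlation.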
