Fugu-MT Paper Translation (Abstract): WeGenBench: A Multidimensional Diagnostic Benchmark towards Text-to-Image Model Optimization

Paper overview: WeGenBench: A Multidimensional Diagnostic Benchmark towards Text-to-Image Model Optimization

arxiv url: http://arxiv.org/abs/2606.20100v1
Date: Thu, 18 Jun 2026 11:20:05 GMT
Status: Translation complete
Last updated in system: 2026-06-19 18:23:39.817641
Title: WeGenBench: A Multidimensional Diagnostic Benchmark towards Text-to-Image Model Optimization
Authors: Qian Liang, Xiaomin Li, Ying Zhang, Jia Xu, Lihao Ni, Hongrui Li, Jingjing Li, Jing Lyu, Chen Li
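Abstract summary: WeGenBench is a new benchmark for the comprehensive evaluation of text-to-image generation capabilities. It comprises a total of 4,000 test prompts across two primary categories, carefully balanced between Chinese and English. The proposed approach yields both assessment outcomes and detailed reasoning trajectories, which facilitates rigorous verification of the accuracy and soundness of the evaluation results.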
Reference score (independently computed attention score): 16.83270042322323
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent text-to-image generation models have demonstrated remarkable capabilities in synthesizing highly realistic images from text inputs alone. Although existing benchmarks can evaluate the generation capabilities of various models to some extent, they struggle to comprehensively and accurately measure performance across multiple dimensions, often failing to reveal the inherent deficiencies of models in specific categories. To address these limitations, we propose WeGenBench, a novel benchmark designed for the comprehensive, multi-perspective evaluation of text-to-image generation capabilities. Our benchmark comprises a total of 4,000 test prompts across two primary categories, meticulously balanced between Chinese and English to evaluate bilingual and cross-cultural generation capabilities. Beyond macroscopic scene classification, we annotate each prompt with multi-dimensional tags tailored to the distinct content and challenges of each language, thereby refining the generation tasks into more specific sub-categories. Through a cross-dimensional evaluation mechanism leveraging both scene classifications and multi-dimensional tags, WeGenBench can precisely pinpoint model shortcomings in specific generation categories. Furthermore, to measure generation quality more accurately, we design and validate several novel evaluation metrics by integrating Vision-Language Models (VLMs), which assess model performance on domain-specific tasks from three core aspects. Crucially, our approach yields both the assessment outcomes and the detailed reasoning trajectories, facilitating a rigorous verification of the accuracy and soundness of the evaluation results. Finally, we conduct systematic benchmarking on current state-of-the-art methods and provide an in-depth analysis of the limitations present in existing models.
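The cross-dimensional evaluation described in the abstract can be pictured as grouping per-prompt scores by language, scene classification, and tag, then looking for the lowest-scoring groups. The following is only a minimal sketch of that idea, not the authors' implementation: the record fields (`lang`, `scene`, `tags`, `score`), the tag and scene names, the score values, and both function names are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-prompt evaluation records. Field names, scene names,
# tag names, and scores are invented for illustration only.
records = [
    {"lang": "en", "scene": "poster", "tags": ["text_rendering"], "score": 0.25},
    {"lang": "en", "scene": "poster", "tags": ["text_rendering", "counting"], "score": 0.75},
    {"lang": "zh", "scene": "poster", "tags": ["text_rendering"], "score": 1.0},
    {"lang": "zh", "scene": "landscape", "tags": ["counting"], "score": 0.625},
]


def cross_dimensional_means(records):
    """Average the scores within each (language, scene, tag) cell."""
    buckets = defaultdict(list)
    for record in records:
        # A prompt with several tags contributes its score to every tag cell.
        for tag in record["tags"]:
            buckets[(record["lang"], record["scene"], tag)].append(record["score"])
    return {cell: mean(scores) for cell, scores in buckets.items()}


def weakest_cells(records, k=3):
    """Return the k cells with the lowest mean score, lowest first."""
    means = cross_dimensional_means(records)
    return sorted(means.items(), key=lambda item: item[1])[:k]


if __name__ == "__main__":
    for cell, avg in weakest_cells(records, k=2):
        print(cell, avg)
```

Keying the cells on the language as well as the scene and tag keeps Chinese and English results separate, which matches the abstract's emphasis on bilingual evaluation; how the paper actually combines these dimensions is not stated in the abstract.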

