Fugu-MT Paper Translation (Abstract): GIR-Bench: Versatile Benchmark for Generating Images with Reasoning

Paper abstract: GIR-Bench: Versatile Benchmark for Generating Images with Reasoning

arxiv url: http://arxiv.org/abs/2510.11026v1
Date: Mon, 13 Oct 2025 05:50:44 GMT
Status: Translation complete
Last updated in system: 2025-10-14 18:06:30.213271
Title: GIR-Bench: Versatile Benchmark for Generating Images with Reasoning
Title (reference translation): GIR-Bench: A Versatile Benchmark for Image Generation with Reasoning
Authors: Hongxiang Li, Yaowei Li, Bin Lin, Yuwei Niu, Yuhang Yang, Xiaoshuang Huang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Long Chen
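Abstract summary: Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and generation. GIR-Bench is a comprehensive benchmark that evaluates unified models from three complementary perspectives.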
Reference score (attention score computed independently by this site): 40.09327641816171
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and generation, showing great promise for advanced multimodal intelligence. However, the community still lacks a rigorous reasoning-centric benchmark to systematically evaluate the alignment between understanding and generation, and their generalization potential in complex visual tasks. To this end, we introduce GIR-Bench, a comprehensive benchmark that evaluates unified models across three complementary perspectives. Firstly, we investigate understanding-generation consistency (GIR-Bench-UGC), asking whether models can consistently leverage the same knowledge in both understanding and generation tasks. Secondly, we investigate whether models can perform reasoning-centric text-to-image generation that requires applying logical constraints and implicit knowledge to generate faithful visual content (GIR-Bench-T2I). Thirdly, we evaluate whether models can handle multi-step reasoning in editing (GIR-Bench-Edit). For each subset, we carefully design different task-specific evaluation pipelines tailored for each task. This enables fine-grained and interpretable evaluation while mitigating biases from the prevalent MLLM-as-a-Judge paradigm. Extensive ablations over various unified models and generation-only systems have shown that: Although unified models are more capable of reasoning-driven visual tasks, they still exhibit a persistent gap between understanding and generation. The data and code for GIR-Bench are available at https://hkust-longgroup.github.io/GIR-Bench.
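Abstract (reference translation): Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and generation, enabling advanced multimodal intelligence. However, the community still lacks a rigorous reasoning-centric benchmark that systematically evaluates the alignment between understanding and generation. To this end, we introduce GIR-Bench, a comprehensive benchmark that evaluates unified models from three complementary perspectives. First, we examine understanding-generation consistency (GIR-Bench-UGC), asking whether models can consistently leverage the same knowledge in both understanding and generation tasks. Second, we examine whether models can perform reasoning-centric text-to-image generation that applies logical constraints and implicit knowledge to produce faithful visual content (GIR-Bench-T2I). Third, we evaluate whether models can handle multi-step reasoning in editing (GIR-Bench-Edit). For each subset, we carefully design a task-specific evaluation pipeline suited to that task. This enables fine-grained, interpretable evaluation while mitigating biases from the MLLM-as-a-Judge paradigm. Unified models are more capable at reasoning-driven visual tasks, yet a persistent gap remains between understanding and generation. The data and code for GIR-Bench are available at https://hkust-longgroup.github.io/GIR-Bench.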

Related papers