Fugu-MT Paper Translation (Abstract): Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark

Paper summary: Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark

arxiv url: http://arxiv.org/abs/2510.13759v1
Date: Wed, 15 Oct 2025 17:10:35 GMT
Status: Translation complete
Last updated in system: 2025-10-16 20:13:28.779839
Title: Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark
Authors: Kai Zou, Ziqi Huang, Yuhao Dong, Shulin Tian, Dian Zheng, Hongbo Liu, Jingwen He, Bin Liu, Yu Qiao, Ziwei Liu
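Abstract summary: Unified multimodal models aim to jointly enable visual understanding and generation, yet current benchmarks rarely examine their true integration. The proposed Uni-MMMU is a comprehensive benchmark that unfolds the bidirectional synergy between generation and understanding across eight reasoning-centric domains.
Reference score (attention score computed independently by this site): 69.8473923357969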
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Unified multimodal models aim to jointly enable visual understanding and generation, yet current benchmarks rarely examine their true integration. Existing evaluations either treat the two abilities in isolation or overlook tasks that inherently couple them. To address this gap, we present Uni-MMMU, a comprehensive and discipline-aware benchmark that systematically unfolds the bidirectional synergy between generation and understanding across eight reasoning-centric domains, including science, coding, mathematics, and puzzles. Each task is bidirectionally coupled, demanding models to (i) leverage conceptual understanding to guide precise visual synthesis, or (ii) utilize generation as a cognitive scaffold for analytical reasoning. Uni-MMMU incorporates verifiable intermediate reasoning steps, unique ground truths, and a reproducible scoring protocol for both textual and visual outputs. Through extensive evaluation of state-of-the-art unified, generation-only, and understanding-only models, we reveal substantial performance disparities and cross-modal dependencies, offering new insights into when and how these abilities reinforce one another, and establishing a reliable foundation for advancing unified models.

Related papers