Fugu-MT Paper Translation (Abstract): RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark

Paper summary: RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark

arxiv url: http://arxiv.org/abs/2509.24897v1
Date: Mon, 29 Sep 2025 15:07:28 GMT
Status: Translation complete
System update date: 2025-09-30 22:32:20.076364
Title: RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark
Title (reference translation): RealUnify: Do Unified Models Truly Benefit from Unification?
Authors: Yang Shi, Yuhao Dong, Yue Ding, Yuran Wang, Xuanyu Zhu, Sheng Zhou, Wenting Liu, Haochen Tian, Rundong Wang, Huanqian Wang, Zuyan Liu, Bohan Zeng, Ruizhe Chen, Qixun Wang, Zhuoran Zhang, Xinlong Chen, Chengzhuo Tong, Bozhou Li, Chaoyou Fu, Qiang Liu, Haotian Wang, Wenjing Yang, Yuanxing Zhang, Pengfei Wan, Yi-Fan Zhang, Ziwei Liu
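Abstract summary: This paper introduces RealUnify, a benchmark for evaluating bidirectional capability synergy. RealUnify comprises 1,000 meticulously annotated instances spanning 10 categories and 32 subtasks. Current unified models still struggle to achieve effective synergy, indicating that architectural unification alone is insufficient.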
Reference score (independently computed attention score): 71.3555284685426
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The integration of visual understanding and generation into unified multimodal models represents a significant stride toward general-purpose AI. However, a fundamental question remains unanswered by existing benchmarks: does this architectural unification actually enable synergetic interaction between the constituent capabilities? Existing evaluation paradigms, which primarily assess understanding and generation in isolation, are insufficient for determining whether a unified model can leverage its understanding to enhance its generation, or use generative simulation to facilitate deeper comprehension. To address this critical gap, we introduce RealUnify, a benchmark specifically designed to evaluate bidirectional capability synergy. RealUnify comprises 1,000 meticulously human-annotated instances spanning 10 categories and 32 subtasks. It is structured around two core axes: 1) Understanding Enhances Generation, which requires reasoning (e.g., commonsense, logic) to guide image generation, and 2) Generation Enhances Understanding, which necessitates mental simulation or reconstruction (e.g., of transformed or disordered visual inputs) to solve reasoning tasks. A key contribution is our dual-evaluation protocol, which combines direct end-to-end assessment with a diagnostic stepwise evaluation that decomposes tasks into distinct understanding and generation phases. This protocol allows us to precisely discern whether performance bottlenecks stem from deficiencies in core abilities or from a failure to integrate them. Through large-scale evaluations of 12 leading unified models and 6 specialized baselines, we find that current unified models still struggle to achieve effective synergy, indicating that architectural unification alone is insufficient. These results highlight the need for new training strategies and inductive biases to fully unlock the potential of unified modeling.
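Abstract (reference translation): Integrating visual understanding and generation into unified multimodal models is an important step toward general-purpose AI. However, does this architectural unification actually enable synergetic interaction between the constituent capabilities? Existing evaluation paradigms, which mainly assess understanding and generation in isolation, are insufficient for determining whether a unified model can leverage its understanding to enhance generation, or use generative simulation to facilitate deeper comprehension. To address this critical gap, we introduce RealUnify, a benchmark specifically designed to evaluate bidirectional capability synergy. RealUnify comprises 1,000 meticulously annotated instances spanning 10 categories and 32 subtasks. It is structured around two core axes: 1) Understanding Enhances Generation, which requires reasoning (e.g., commonsense, logic) to guide image generation, and 2) Generation Enhances Understanding, which requires mental simulation or reconstruction (e.g., of transformed or disordered visual inputs) to solve reasoning tasks. The dual-evaluation protocol combines direct end-to-end assessment with a stepwise evaluation that decomposes tasks into distinct understanding and generation phases. This protocol makes it possible to discern precisely whether performance bottlenecks stem from deficiencies in core abilities or from a failure to integrate them. Through large-scale evaluations of 12 leading unified models and 6 specialized baselines, we show that current unified models still struggle to achieve effective synergy and that architectural unification alone is insufficient. These results highlight the need for new training strategies and inductive biases to fully unlock the potential of unified modeling.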
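
The dual-evaluation protocol described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `InstanceResult` record, the `diagnose` function, the `margin` threshold, and the decision rule itself are all hypothetical, chosen only to show one possible way of comparing direct end-to-end accuracy against stepwise accuracy. The assumed reading is that a large stepwise gain points to an integration failure, while little or no gain points to weak core abilities.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Union


@dataclass(frozen=True)
class InstanceResult:
    # Hypothetical record: whether one benchmark instance was solved in each mode.
    direct_correct: bool    # direct end-to-end evaluation
    stepwise_correct: bool  # evaluation split into understanding and generation phases


def accuracy(flags: Iterable[bool]) -> float:
    """Return the fraction of True values, or 0.0 for an empty input."""
    items: List[bool] = list(flags)
    return sum(items) / len(items) if items else 0.0


def diagnose(results: List[InstanceResult],
             margin: float = 0.05) -> Dict[str, Union[float, str]]:
    """Compare direct and stepwise accuracy over a set of instances.

    Assumption (not taken from the paper): a stepwise gain larger than
    `margin` is read as an integration failure; otherwise weak core
    abilities are treated as the likelier bottleneck.
    """
    direct = accuracy(r.direct_correct for r in results)
    stepwise = accuracy(r.stepwise_correct for r in results)
    gap = stepwise - direct
    bottleneck = "integration" if gap > margin else "core_ability"
    return {"direct": direct, "stepwise": stepwise,
            "gap": gap, "bottleneck": bottleneck}


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy = [
        InstanceResult(direct_correct=False, stepwise_correct=True),
        InstanceResult(direct_correct=False, stepwise_correct=True),
        InstanceResult(direct_correct=True, stepwise_correct=True),
        InstanceResult(direct_correct=False, stepwise_correct=False),
    ]
    print(diagnose(toy))
```

In practice the threshold and the two-way labelling would need to be replaced by whatever criteria the benchmark itself defines; the sketch only shows the shape of the comparison.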
