Fugu-MT Paper Translation (Abstract): IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

Paper overview: IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

arxiv url: http://arxiv.org/abs/2606.09169v1
Date: Mon, 08 Jun 2026 08:08:20 GMT
Status: Translation complete
System update date: 2026-06-09 14:42:06.822859
Title: IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation
Authors: Lingyi Meng, Zecong Tang, Haoran Li, Tengju Ru, Zhejun Cui, Weitong Lian, Qi Kang, Hangshuo Cao, Yichen Zhu, Yechi Liu, Kaixuan Wang, Yu-Jie Yuan, Chunwei Wang, Yu Zhang, Bo Dai
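Abstract summary: We propose IMUG-Bench, a benchmark for multi-turn interleaved image-text dialogue of unified multimodal models (UMMs). IMUG-Bench comprises three classes (Static Spatial, Temporal Causal, and Hybrid), covering 3,113 samples and 12,034 interaction turns. Large-scale experiments on IMUG-Bench systematically evaluate mainstream open-source and closed-source UMMs, revealing their capability boundaries and failure modes and uncovering pronounced exposure bias on the generation side in multi-turn interactions.
Reference score (independently computed attention score): 30.102836710504565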
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation within a single framework. Mastering dynamic, multi-turn interleaved image-text dialogues is a crucial task for UMMs in real-world applications. However, existing benchmarks fail to evaluate this important task, as they are often limited to single-turn or static settings, and typically overlook exposure bias in multi-turn interactions. To bridge this gap, we propose IMUG-Bench, a comprehensive benchmark for multi-turn interleaved image-text dialogue of UMMs that jointly evaluates their understanding and generation capabilities. Our IMUG-Bench comprises three classes: Static Spatial, Temporal Causal, and Hybrid, covering 3,113 samples and 12,034 interaction turns. It also includes dynamic understanding questions, thereby supporting evaluation that better reflects real-world multi-turn interaction scenarios. Large-scale experiments on IMUG-Bench systematically evaluate mainstream open-source and closed-source UMMs, revealing their capability boundaries and failure modes, and uncovering pronounced exposure bias on the generation side in multi-turn interactions. We further explore several test-time scaling strategies, including Chain-of-Thought, Self-Verification, and Best-of-N Sampling, which effectively improve generation accuracy and mitigate exposure bias in generation tasks. These findings provide insights into enhancing the robustness and multi-turn interaction capability of future UMMs.
