Fugu-MT Paper Translation (Abstract): MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

Paper overview: MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

arxiv url: http://arxiv.org/abs/2606.00793v2
Date: Mon, 08 Jun 2026 08:58:38 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:04.683683
Title: MBench: A Comprehensive Benchmark on Memory Capability for Video World Models
Title (reference translation): MBench: A Comprehensive Benchmark on the Memory Capability of Video World Models
Authors: Shengjun Zhang, Zhang Zhang, Simin Huang, Zhenyu Tang, Hanyang Wang, Chensheng Dai, Min Chen, Yifan Li, Yuxin Li, Yingjie Chen, Hao Liu, Chen Li, Jing Lyu, Yueqi Duan
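Abstract summary: We propose MBench, a benchmark for quantifying and evaluating the memory capability of video world models. The benchmark is built on rigorously curated real-captured long videos and is evaluated with rule-based quantitative metrics and a VLM.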
Reference score (independently computed attention score): 36.71271805993198
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advancements in video-based world models have demonstrated an unprecedented ability to synthesize high-fidelity visual sequences. However, a fundamental gap persists between visually plausible video generation and the functional requirements of a world model, particularly in maintaining a stable and reasonable internal state over extended temporal horizons. While existing benchmarks primarily emphasize visual quality, motion coherence, and text-video alignment, they largely overlook memory, the core capability of a world model to preserve consistency across long-term horizons and complex interactions. To address this gap, we present MBench, a comprehensive benchmark dedicated to quantifying and evaluating the memory capability of video world models. We systematically decompose the memory capability of video world models into three hierarchical and complementary core dimensions: entity consistency, environment consistency, and causal consistency, which are further refined into 12 quantifiable sub-dimensions for comprehensive characterization of long-term memory. Our benchmark is built upon rigorously curated real-captured long videos, and evaluated by rule-based quantitative metrics and a VLM to enable objective and comprehensive consistency assessment. Extensive evaluations of mainstream state-of-the-art video world models reveal critical systemic limitations of existing methods in long-term state retention, providing a standardized benchmark and clear research direction to advance the field.
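Abstract (reference translation): Recent advances in video-based world models have shown an unprecedented ability to synthesize high-fidelity visual sequences. However, a fundamental gap persists between visually plausible video generation and the functional requirements of a world model, particularly in maintaining a stable and reasonable internal state over extended temporal horizons. Existing benchmarks mainly emphasize visual quality, motion coherence, and text-video alignment, but they largely overlook memory. To address this gap, we present MBench, a comprehensive benchmark for quantifying and evaluating the memory capability of video world models. We systematically decompose the memory capability of video world models into three hierarchical and complementary core dimensions (entity consistency, environment consistency, and causal consistency) and further refine them into 12 quantifiable sub-dimensions for a comprehensive characterization of long-term memory. Our benchmark is built on rigorously curated real-captured long videos and uses rule-based quantitative metrics and a VLM to enable objective and comprehensive consistency assessment. Extensive evaluation of mainstream state-of-the-art video world models reveals systemic limitations of existing methods in long-term state retention, providing a standardized benchmark and a clear research direction to advance the field.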