Fugu-MT paper translation (abstract): WorldOlympiad: Can Your World Model Survive a Triathlon?

Paper overview: WorldOlympiad: Can Your World Model Survive a Triathlon?

arxiv url: http://arxiv.org/abs/2606.11129v1
Date: Tue, 09 Jun 2026 17:24:36 GMT
Status: translation complete
Last updated in system: 2026-06-10 15:40:58.637292
Title: WorldOlympiad: Can Your World Model Survive a Triathlon?
Authors: Yuke Zhao, Wangbo Zhao, Weijie Wang, Zeyu Zhang, Dakai An, Akide Liu, Yinghao Yu, Jiasheng Tang, Fan Wang, Wei Wang, Bohan Zhuang
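Abstract summary: We introduce WorldOlympiad, a benchmark for diagnosing video-based world models across physical faithfulness, geometric consistency, and interaction fidelity. WorldOlympiad decomposes world-model evaluation into three complementary dimensions.
Reference score (independently computed attention metric): 39.69359039523777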
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We introduce WorldOlympiad, a benchmark for diagnosing video-based world models across physical faithfulness, geometric consistency, and interaction fidelity. While existing benchmarks often focus on visual quality, semantic alignment, or short-term temporal coherence, they provide limited insight into whether generated videos obey physical rules, preserve coherent 3D structure, and sustain controllable interactions over long horizons. To address this gap, WorldOlympiad decomposes world-model evaluation into three complementary dimensions. The physical track uses object segmentation and MLLM-as-judge to assess whether generated videos follow interpretable rules in mechanics, thermal phenomena, and material properties. The geometry track reconstructs generated videos with Gaussian splatting and evaluates structural consistency, cross-view coherence, and camera-trajectory alignment. The interaction track assesses whether generated rollouts follow complex action prompts and maintain smooth, coherent transitions across consecutive video chunks. WorldOlympiad further covers three major downstream scenarios, including gaming, robotics, and general real-world videos, capturing diverse challenges from interactive control and embodied manipulation to open-domain motion and camera dynamics. Together, these tracks and scenarios form a scalable and interpretable evaluation suite that exposes failure modes beyond generic video quality. Experiments on state-of-the-art models reveal substantial gaps in physical reasoning, 3D consistency, and long-horizon interaction, underscoring the need for more structured evaluation protocols for generative world models.


Related papers