Fugu-MT Paper Translation (Abstract): Bridging the Gap Between Multimodal Foundation Models and World Models

Paper overview: Bridging the Gap Between Multimodal Foundation Models and World Models

arxiv url: http://arxiv.org/abs/2510.03727v1
Date: Sat, 04 Oct 2025 08:14:20 GMT
Status: Translation complete
Last updated in system: 2025-10-07 16:52:59.217517
Title: Bridging the Gap Between Multimodal Foundation Models and World Models
Title (reference translation): Bridging the Gap Between Multimodal Foundation Models and World Models
Authors: Xuehai He
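Abstract summary: We investigate what it takes to bridge the gap between multimodal foundation models and world models. Our approaches incorporate scene graphs, multimodal conditioning, and alignment strategies to guide the generation process. We extend these techniques to controllable 4D generation, enabling interactive, editable, and morphable object synthesis over time and space.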
Reference score (independently computed attention metric): 10.001347956177879
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Humans understand the world through the integration of multiple sensory modalities, enabling them to perceive, reason about, and imagine dynamic physical processes. Inspired by this capability, multimodal foundation models (MFMs) have emerged as powerful tools for multimodal understanding and generation. However, today's MFMs fall short of serving as effective world models. They lack essential abilities, such as performing counterfactual reasoning, simulating dynamics, understanding spatiotemporal information, controlling generated visual outcomes, and performing multifaceted reasoning. We investigate what it takes to bridge the gap between multimodal foundation models and world models. We begin by improving the reasoning capabilities of MFMs through discriminative tasks and equipping MFMs with structured reasoning skills, such as causal inference, counterfactual thinking, and spatiotemporal reasoning, enabling them to go beyond surface correlations and understand deeper relationships within visual and textual data. Next, we explore the generative capabilities of multimodal foundation models across both image and video modalities, introducing new frameworks for structured and controllable generation. Our approaches incorporate scene graphs, multimodal conditioning, and multimodal alignment strategies to guide the generation process, ensuring consistency with high-level semantics and fine-grained user intent. We further extend these techniques to controllable 4D generation, enabling interactive, editable, and morphable object synthesis over time and space.
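Abstract (reference translation): Humans understand the world by integrating multiple sensory modalities, which lets them perceive, reason about, and imagine dynamic physical processes. Inspired by this ability, multimodal foundation models (MFMs) have emerged as powerful tools for multimodal understanding and generation. However, today's MFMs do not function as effective world models. They lack essential abilities such as counterfactual reasoning, simulating dynamics, understanding spatiotemporal information, controlling generated visual outcomes, and multifaceted reasoning. We examine what is needed to bridge the gap between multimodal foundation models and world models. We begin by improving the reasoning capabilities of MFMs through discriminative tasks and equipping them with structured reasoning skills such as causal inference, counterfactual thinking, and spatiotemporal reasoning, so that they can go beyond surface correlations and understand deeper relationships within visual and textual data. Next, we examine the generative capabilities of multimodal foundation models across both image and video modalities and introduce new frameworks for structured and controllable generation. Our approaches incorporate scene graphs, multimodal conditioning, and multimodal alignment strategies to guide the generation process, ensuring consistency with high-level semantics and fine-grained user intent. We further extend these techniques to controllable 4D generation, enabling interactive, editable, and morphable object synthesis over time and space.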

Related papers