Fugu-MT Paper Translation (Summary): Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World

Paper summary: Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World

arxiv url: http://arxiv.org/abs/2603.12746v1
Date: Fri, 13 Mar 2026 07:42:16 GMT
Status: Translation complete
Last updated in system: 2026-03-16 17:38:11.974296
Title: Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World
Title (reference translation): Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason about Dynamics in the Physical 4D World
Authors: Yuzhi Huang, Kairun Wen, Rongxin Gao, Dongxuan Liu, Yibin Lou, Jie Wu, Jing Xu, Jian Zhang, Zheng Yang, Yunlong Lin, Chenxin Li, Panwang Pan, Junbin Lu, Jingyan Jiang, Xinghao Ding, Yue Huang, Zhi Wang
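Abstract summary: Humans inhabit a physical 4D world in which geometric structure and semantic content evolve over time. The paper introduces Dyn-Bench, a large-scale benchmark built from diverse real-world and synthetic video datasets. It finds that existing models cannot simultaneously maintain strong performance in both spatio-temporal reasoning and dynamic object grounding.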
Reference score (independently computed attention score): 49.80040477190479
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Humans inhabit a physical 4D world where geometric structure and semantic content evolve over time, constituting a dynamic 4D reality (spatial with temporal dimension). While current Multimodal Large Language Models (MLLMs) excel in static visual understanding, can they also be adept at "thinking in dynamics", i.e., perceive, track and reason about spatio-temporal dynamics in evolving scenes? To systematically assess their spatio-temporal reasoning and localized dynamics perception capabilities, we introduce Dyn-Bench, a large-scale benchmark built from diverse real-world and synthetic video datasets, enabling robust and scalable evaluation of spatio-temporal understanding. Through multi-stage filtering from massive 2D and 4D data sources, Dyn-Bench provides a high-quality collection of dynamic scenes, comprising 1k videos, 7k visual question answering (VQA) pairs, and 3k dynamic object grounding pairs. We probe general, spatial and region-level MLLMs to express how they think in dynamics both linguistically and visually, and find that existing models cannot simultaneously maintain strong performance in both spatio-temporal reasoning and dynamic object grounding, often producing inconsistent interpretations of motion and interaction. Notably, conventional prompting strategies (e.g., chain-of-thought or caption-based hints) provide limited improvement, whereas structured integration approaches, including Mask-Guided Fusion and Spatio-Temporal Textual Cognitive Map (ST-TCM), significantly enhance MLLMs' dynamics perception and spatio-temporal reasoning in the physical 4D world. Code and benchmark are available at https://dyn-bench.github.io/.
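Abstract (reference translation): Humans inhabit a physical 4D world in which geometric structure and semantic content evolve over time, constituting a dynamic 4D reality (space with a temporal dimension). Current Multimodal Large Language Models (MLLMs) excel at static visual understanding, but are they also adept at "thinking in dynamics", that is, perceiving, tracking, and reasoning about spatio-temporal dynamics in evolving scenes? To systematically assess their spatio-temporal reasoning and localized dynamics perception, we introduce Dyn-Bench, a large-scale benchmark built from diverse real-world and synthetic video datasets, which enables robust and scalable evaluation of spatio-temporal understanding. Through multi-stage filtering of massive 2D and 4D data sources, Dyn-Bench provides a high-quality collection of dynamic scenes comprising 1k videos, 7k visual question answering (VQA) pairs, and 3k dynamic object grounding pairs. We probe general, spatial, and region-level MLLMs to have them express, both linguistically and visually, how they think in dynamics, and find that existing models cannot simultaneously maintain strong performance in both spatio-temporal reasoning and dynamic object grounding, often producing inconsistent interpretations of motion and interaction. Notably, conventional prompting strategies (e.g., chain-of-thought or caption-based hints) provide limited improvement, whereas structured integration approaches, including Mask-Guided Fusion and Spatio-Temporal Textual Cognitive Map (ST-TCM), significantly enhance MLLMs' dynamics perception and spatio-temporal reasoning in the physical 4D world. Code and benchmark are available at https://dyn-bench.github.io/.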

Related papers