Fugu-MT Paper Translation (Abstract): 4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

Paper abstract: 4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

arxiv url: http://arxiv.org/abs/2605.05997v1
Date: Thu, 07 May 2026 10:48:46 GMT
Status: translation complete
Last updated in system: 2026-05-08 22:27:11.707068
Title: 4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding
Title (reference translation): 4DThinker: Thinking with 4D Images for Dynamic Spatial Understanding
Authors: Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xiang An, Bo Li, Xin Xie, ZiDong Wang, Mingze Sun, Shuang Chen, Hongyu Li, Xiaobin Hu, Ruqi Huang
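Abstract summary: We introduce 4DThinker, the first framework that enables vision-language models to "think in 4D". First, we introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. Next, we propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises text tokens and 4D latents to ground the model in dynamic visual semantics.
Reference score (attention score computed independently by this site): 31.082079260882896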
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to "think with 4D" through dynamic latent mental imagery, i.e., internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) further tackles complex reasoning tasks via outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. Extensive experiments across multiple dynamic spatial reasoning benchmarks demonstrate that 4DThinker consistently outperforms strong baselines and offers a new perspective toward 4D reasoning in VLMs. Our code is available at https://github.com/zhangquanchen/4DThinker.
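Abstract (reference translation): Dynamic spatial reasoning from monocular video is essential for bridging visual information and the physical world, but it remains difficult for vision-language models (VLMs). Conventional approaches either verbalize spatial reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or depend on external geometric modules that increase inference complexity without fostering intrinsic model capability. This paper introduces 4DThinker, the first framework that lets VLMs "think in 4D" through dynamic latent mental imagery. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. Next, we propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises text tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) further tackles complex reasoning tasks through outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. Extensive experiments on multiple dynamic spatial reasoning benchmarks show that 4DThinker consistently outperforms strong baselines and offers a new perspective on 4D reasoning in VLMs. Our code is available at https://github.com/zhangquanchen/4DThinker.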

Related papers