Fugu-MT Paper Translation (Summary): INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

Paper summary: INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

arxiv url: http://arxiv.org/abs/2604.07209v1
Date: Wed, 08 Apr 2026 15:31:22 GMT
Status: Translation complete
Last updated in system: 2026-04-09 17:30:51.610788
Title: INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
Title (reference translation): INSPATIO-WORLD: A Real-Time 4D World Simulator Based on a Spatiotemporal Autoregressive Model
Authors: InSpatio Team, Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji, Hujun Bao, Hongjia Zhai, Jialin Liu, Jing Guo, Nan Wang, Siji Pan, Weihong Pan, Weijian Xie, Xianbin Liu, Xiaojun Xiang, Xiaoyu Zhang, Xinyu Chen, Yifu Wang, Yipeng Chen, Zhenzhou Fan, Zhewen Le, Zhichao Ye, Ziqiang Zhao
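Abstract summary: INSPATIO-WORLD is a new real-time framework that can recover and generate high-fidelity interactive scenes from a single video. The Implicit Spatiotemporal Cache aggregates reference and historical observations into a latent world representation. The Explicit Spatial Constraint Module enforces geometric structure and translates user interactions into precise, physically plausible camera trajectories.
Reference score (independently computed attention metric): 44.09983529522167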
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Building world models with spatial consistency and real-time interactivity remains a fundamental challenge in computer vision. Current video generation paradigms often struggle with a lack of spatial persistence and insufficient visual realism, making it difficult to support seamless navigation in complex environments. To address these challenges, we propose INSPATIO-WORLD, a novel real-time framework capable of recovering and generating high-fidelity, dynamic interactive scenes from a single reference video. At the core of our approach is a Spatiotemporal Autoregressive (STAR) architecture, which enables consistent and controllable scene evolution through two tightly coupled components: Implicit Spatiotemporal Cache aggregates reference and historical observations into a latent world representation, ensuring global consistency during long-horizon navigation; Explicit Spatial Constraint Module enforces geometric structure and translates user interactions into precise and physically plausible camera trajectories. Furthermore, we introduce Joint Distribution Matching Distillation (JDMD). By using real-world data distributions as a regularizing guide, JDMD effectively overcomes the fidelity degradation typically caused by over-reliance on synthetic data. Extensive experiments demonstrate that INSPATIO-WORLD significantly outperforms existing state-of-the-art (SOTA) models in spatial consistency and interaction precision, ranking first among real-time interactive methods on the WorldScore-Dynamic benchmark, and establishing a practical pipeline for navigating 4D environments reconstructed from monocular videos.
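Abstract (reference translation): Building world models that combine spatial consistency with real-time interactivity is a fundamental challenge in computer vision. Current video generation paradigms often suffer from a lack of spatial persistence and insufficient visual realism, which makes it difficult to support seamless navigation in complex environments. INSPATIO-WORLD is a new real-time framework that can recover and generate high-fidelity, dynamic scenes from a single reference video. The Implicit Spatiotemporal Cache aggregates reference and historical observations into a latent world representation, ensuring global consistency during long-horizon navigation; the Explicit Spatial Constraint Module enforces geometric structure and translates user interactions into precise and physically plausible camera trajectories. In addition, we introduce Joint Distribution Matching Distillation (JDMD). By using real-world data distributions as a regularizing guide, JDMD effectively overcomes the fidelity degradation caused by over-reliance on synthetic data. Extensive experiments show that INSPATIO-WORLD substantially outperforms existing state-of-the-art (SOTA) models in spatial consistency and interaction precision, ranks first among real-time interactive methods on the WorldScore-Dynamic benchmark, and establishes a practical pipeline for navigating 4D environments reconstructed from monocular videos.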

List of related papers