Fugu-MT Paper Translation (Summary): WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation

Paper summary: WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation

arxiv url: http://arxiv.org/abs/2603.16871v1
Date: Tue, 17 Mar 2026 17:59:56 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:07.478077
Title: WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation
Title (reference translation): WorldCam: Interactive autoregressive 3D game worlds with camera pose as a unifying geometric representation
Authors: Jisu Nam, Yicong Hong, Chun-Hao Paul Huang, Feng Liu, JoungBin Lee, Jiyoung Kim, Siyoon Jin, Yunsung Lee, Jaeyoon Jung, Suhwan Choi, Seungryong Kim, Yang Zhou
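Abstract summary: We establish camera pose as a unifying geometric representation that jointly grounds immediate action control and long-term 3D consistency. Our approach substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency.
Reference score (attention level computed independently by this site): 47.97929550105451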
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in video diffusion transformers have enabled interactive gaming world models that allow users to explore generated environments over extended horizons. However, existing approaches struggle with precise action control and long-horizon 3D consistency. Most prior works treat user actions as abstract conditioning signals, overlooking the fundamental geometric coupling between actions and the 3D world, whereby actions induce relative camera motions that accumulate into a global camera pose within a 3D world. In this paper, we establish camera pose as a unifying geometric representation to jointly ground immediate action control and long-term 3D consistency. First, we define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model via a camera embedder to ensure accurate action alignment. Second, we use global camera poses as spatial indices to retrieve relevant past observations, enabling geometrically consistent revisiting of locations during long-horizon navigation. To support this research, we introduce a large-scale dataset comprising 3,000 minutes of authentic human gameplay annotated with camera trajectories and textual descriptions. Extensive experiments show that our approach substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency.
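Abstract (reference translation): Recent advances in video diffusion transformers have enabled interactive gaming world models that let users explore generated environments over extended horizons. However, existing approaches struggle with precise action control and long-horizon 3D consistency. Most prior work treats user actions as abstract conditioning signals and overlooks the fundamental geometric coupling between actions and the 3D world: actions induce relative camera motions that accumulate into a global camera pose within the 3D world. In this paper, we establish camera pose as a unifying geometric representation that jointly grounds immediate action control and long-term 3D consistency. First, we define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model through a camera embedder to ensure accurate action alignment. Second, we use global camera poses as spatial indices to retrieve relevant past observations, enabling geometrically consistent revisiting of locations during long-horizon navigation. To support this research, we introduce a large-scale dataset comprising 3,000 minutes of authentic human gameplay annotated with camera trajectories and textual descriptions. Extensive experiments show that our approach substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency.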
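
The two geometric ideas in the abstract (user inputs expressed in the Lie algebra and mapped to 6-DoF poses, and global poses used as spatial indices for retrieval) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes each action is a 6-vector twist `(v, w)` in se(3), that relative motions compose by right-multiplication onto a camera-to-world pose, and that retrieval ranks past frames by Euclidean distance between camera positions. The abstract does not specify the actual action parameterization or retrieval criterion, and the helper names `se3_exp`, `accumulate`, and `retrieve` are hypothetical.

```python
import math

def hat(w):
    """Skew-symmetric 3x3 matrix of a 3-vector."""
    x, y, z = w
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def se3_exp(twist):
    """Exponential map: twist (vx, vy, vz, wx, wy, wz) -> 4x4 SE(3) matrix."""
    v, w = twist[:3], twist[3:]
    theta = math.sqrt(sum(c * c for c in w))
    if theta < 1e-8:
        # Small-angle limits of the coefficients below.
        a, b, c = 1.0, 0.5, 1.0 / 6.0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
        c = (theta - math.sin(theta)) / theta ** 3
    K = hat(w)
    K2 = matmul(K, K)
    eye = [[float(i == j) for j in range(3)] for i in range(3)]
    # Rodrigues formula for the rotation, and the matrix V that maps v to translation.
    R = [[eye[i][j] + a * K[i][j] + b * K2[i][j] for j in range(3)] for i in range(3)]
    V = [[eye[i][j] + b * K[i][j] + c * K2[i][j] for j in range(3)] for i in range(3)]
    t = [sum(V[i][j] * v[j] for j in range(3)) for i in range(3)]
    return [R[0] + [t[0]], R[1] + [t[1]], R[2] + [t[2]], [0.0, 0.0, 0.0, 1.0]]

def accumulate(twists):
    """Compose per-step relative motions into global poses (assumed convention)."""
    pose = [[float(i == j) for j in range(4)] for i in range(4)]
    poses = []
    for tw in twists:
        pose = matmul(pose, se3_exp(tw))
        poses.append(pose)
    return poses

def retrieve(poses, query, k=2):
    """Indices of the k past poses whose camera positions are closest to the query."""
    def position(T):
        return [T[0][3], T[1][3], T[2][3]]
    q = position(query)
    ranked = sorted(range(len(poses)), key=lambda i: math.dist(position(poses[i]), q))
    return ranked[:k]
```

A practical retrieval rule would likely also compare viewing directions or field-of-view overlap, since two cameras at the same position can look at entirely different content; position-only distance is used here only to keep the sketch short.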

