Fugu-MT Paper Translation (Abstract): SAMPO: Scale-wise Autoregression with Motion PrOmpt for generative world models

Paper overview: SAMPO: Scale-wise Autoregression with Motion PrOmpt for generative world models

arxiv url: http://arxiv.org/abs/2509.15536v1
Date: Fri, 19 Sep 2025 02:41:37 GMT
Status: translation complete
Last updated in system: 2025-09-22 18:18:10.962451
Title: SAMPO: Scale-wise Autoregression with Motion PrOmpt for generative world models
Authors: Sen Wang, Jingyi Tian, Le Wang, Zhimin Liao, Jiayi Li, Huaiyi Dong, Kun Xia, Sanping Zhou, Wei Tang, Hua Gang
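Abstract summary: SAMPO is a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. It achieves competitive performance in action-conditioned video prediction and model-based control. The authors also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks.
Reference score (independently computed attention score): 42.814012901180774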
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: World models allow agents to simulate the consequences of actions in imagined environments for planning, control, and long-horizon decision-making. However, existing autoregressive world models struggle with visually coherent predictions due to disrupted spatial structure, inefficient decoding, and inadequate motion modeling. In response, we propose Scale-wise Autoregression with Motion PrOmpt (SAMPO), a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. Specifically, SAMPO integrates temporal causal decoding with bidirectional spatial attention, which preserves spatial locality and supports parallel decoding within each scale. This design significantly enhances both temporal consistency and rollout efficiency. To further improve dynamic scene understanding, we devise an asymmetric multi-scale tokenizer that preserves spatial details in observed frames and extracts compact dynamic representations for future frames, optimizing both memory usage and model performance. Additionally, we introduce a trajectory-aware motion prompt module that injects spatiotemporal cues about object and robot trajectories, focusing attention on dynamic regions and improving temporal consistency and physical realism. Extensive experiments show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control, improving generation quality with 4.4× faster inference. We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks and benefit from larger model sizes.
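The abstract's pairing of temporal causal decoding with bidirectional spatial attention can be pictured as a block-causal attention mask: tokens in the same frame see each other freely, while tokens in one frame cannot see any later frame. The following is a minimal sketch of that general idea only, not the authors' implementation. It assumes a flattened, frame-major token sequence, ignores SAMPO's multi-scale token structure and motion prompts, and the function name `block_causal_mask` is hypothetical.

```python
def block_causal_mask(num_frames, tokens_per_frame):
    """Return a square boolean mask; mask[q][k] is True if query q may attend to key k.

    Assumption (not from the paper): tokens are laid out frame by frame,
    with the same number of tokens in every frame.
    """
    # Frame index of each token in the flattened sequence.
    frame_ids = [f for f in range(num_frames) for _ in range(tokens_per_frame)]
    # Same frame: always visible (bidirectional spatial attention).
    # Earlier frame: visible. Later frame: hidden (temporal causality).
    return [[q_frame >= k_frame for k_frame in frame_ids] for q_frame in frame_ids]


if __name__ == "__main__":
    # Three frames of two tokens each; print the mask as 0/1 rows.
    for row in block_causal_mask(num_frames=3, tokens_per_frame=2):
        print("".join("1" if visible else "0" for visible in row))
```

Because visibility is decided per frame rather than per token, all tokens of a frame could in principle be predicted in one parallel step, which is the kind of property the abstract credits for faster rollouts.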

Related papers