SAMPO: Scale-wise Autoregression with Motion PrOmpt for generative world models
- URL: http://arxiv.org/abs/2509.15536v2
- Date: Tue, 21 Oct 2025 03:09:30 GMT
- Title: SAMPO: Scale-wise Autoregression with Motion PrOmpt for generative world models
- Authors: Sen Wang, Jingyi Tian, Le Wang, Zhimin Liao, Jiayi Li, Huaiyi Dong, Kun Xia, Sanping Zhou, Wei Tang, Gang Hua
- Abstract summary: SAMPO is a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. We show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control. We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks.
- Score: 42.814012901180774
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: World models allow agents to simulate the consequences of actions in imagined environments for planning, control, and long-horizon decision-making. However, existing autoregressive world models struggle with visually coherent predictions due to disrupted spatial structure, inefficient decoding, and inadequate motion modeling. In response, we propose Scale-wise Autoregression with Motion PrOmpt (SAMPO), a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. Specifically, SAMPO integrates temporal causal decoding with bidirectional spatial attention, which preserves spatial locality and supports parallel decoding within each scale. This design significantly enhances both temporal consistency and rollout efficiency. To further improve dynamic scene understanding, we devise an asymmetric multi-scale tokenizer that preserves spatial details in observed frames and extracts compact dynamic representations for future frames, optimizing both memory usage and model performance. Additionally, we introduce a trajectory-aware motion prompt module that injects spatiotemporal cues about object and robot trajectories, focusing attention on dynamic regions and improving temporal consistency and physical realism. Extensive experiments show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control, improving generation quality with 4.4× faster inference. We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks and benefit from larger model sizes.
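Read literally, the abstract implies an attention pattern that is causal across frames and across scales within a frame, but bidirectional inside each scale; that is what allows all tokens of a scale to be decoded in one parallel step. The sketch below shows what such a hybrid mask could look like under a simple frame-major, coarse-to-fine token ordering; the function name, scale sizes, and ordering are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch (not the authors' code): a boolean attention mask that is
# causal over frames and over scales within a frame, but bidirectional
# among the tokens of a single scale, matching the abstract's description
# of temporal causal decoding with bidirectional spatial attention.
import torch

def sampo_style_mask(num_frames: int, tokens_per_scale: list[int]) -> torch.Tensor:
    """Return an (N, N) boolean mask, True where attention is allowed."""
    # Assumed generation order: frame 0 scale 0, frame 0 scale 1, ...,
    # then frame 1 scale 0, and so on (coarse-to-fine within each frame).
    block_id = []
    for f in range(num_frames):
        for k, n_tokens in enumerate(tokens_per_scale):
            block_id += [f * len(tokens_per_scale) + k] * n_tokens
    block = torch.tensor(block_id)
    # A query may attend to a key iff the key's block is not later in the
    # generation order; tokens inside one block attend to each other freely.
    return block.unsqueeze(1) >= block.unsqueeze(0)

if __name__ == "__main__":
    # 2 frames, 3 scales of 1, 4, and 16 tokens (e.g. 1x1, 2x2, 4x4 grids)
    mask = sampo_style_mask(2, [1, 4, 16])
    print(mask.shape)          # torch.Size([42, 42])
    print(mask.int()[:6, :6])  # dense within a scale, causal across scales
```

Combined with a key-value cache over already-generated frames, this block-lower-triangular structure would let one forward pass emit an entire scale at once, which is presumably where the reported 4.4× decoding speedup over token-by-token autoregression comes from.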
Related papers
- Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory [101.2076718776139]
We propose a robust interactive world model capable of maintaining coherent visual memory over 1000+ frames in complex real-world environments. We introduce a Hierarchical Pose-free Memory Compression (HPMC) module that distills historical latents into a fixed-budget geometric representation. We also propose an Uncertainty-aware Action Labeling module that discretizes continuous motion into a tri-state logic.
arXiv Detail & Related papers (2026-02-02T17:52:56Z) - Scalable Spatio-Temporal SE(3) Diffusion for Long-Horizon Protein Dynamics [51.85385061275941]
Molecular dynamics (MD) simulations remain the gold standard for studying protein dynamics. Recent generative models have shown promise in accelerating simulations, yet they struggle with long-horizon generation. We present STAR-MD, a scalable diffusion model that generates physically plausible protein trajectories over microsecond timescales.
arXiv Detail & Related papers (2026-02-02T14:13:28Z) - High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation [34.450956424316196]
Modeling high-resolution spatiotemporal representations is essential for video-based human pose estimation (VHPE). Current state-of-the-art methods typically have difficulty balancing global and local dynamic modeling. We present a framework that extends Mamba to separately learn global and local high-resolution representations for VHPE.
arXiv Detail & Related papers (2025-10-13T05:18:27Z) - OccTENS: 3D Occupancy World Model via Temporal Next-Scale Prediction [25.630452373274636]
OccTENS is a generative occupancy world model that enables controllable, high-fidelity long-term occupancy generation. We reformulate the occupancy world model as a temporal next-scale prediction (TENS) task. OccTENS outperforms the state-of-the-art method with both higher occupancy quality and faster inference time.
arXiv Detail & Related papers (2025-09-04T05:06:47Z) - Long-Context State-Space Video World Models [66.28743632951218]
We propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency. Central to our design is a block-wise SSM scanning scheme, which strategically trades off spatial consistency for extended temporal memory (see the sketch after this list). Experiments on Memory Maze and Minecraft datasets demonstrate that our approach surpasses baselines in preserving long-range memory.
arXiv Detail & Related papers (2025-05-26T16:12:41Z) - FACTS: A Factored State-Space Framework For World Modelling [24.08175276756845]
We propose a novel recurrent framework, the FACTored State-space (FACTS) model, for spatio-temporal world modelling. The FACTS framework constructs a graph memory with a routing mechanism that learns permutable memory representations. It consistently outperforms or matches specialised state-of-the-art models, despite its general-purpose world modelling design.
arXiv Detail & Related papers (2024-10-28T11:04:42Z) - Efficient High-Resolution Visual Representation Learning with State Space Model for Human Pose Estimation [60.80423207808076]
Capturing long-range dependencies while preserving high-resolution visual representations is crucial for dense prediction tasks such as human pose estimation. We propose the Dynamic Visual State Space (DVSS) block, which augments visual state space models with multi-scale convolutional operations. On this basis, we build HRVMamba, a novel model for efficient high-resolution representation learning.
arXiv Detail & Related papers (2024-10-04T06:19:29Z) - PoseMamba: Monocular 3D Human Pose Estimation with Bidirectional Global-Local Spatio-Temporal State Space Model [7.286873011001679]
We propose a purely SSM-based approach with linear complexity for 3D human pose estimation in monocular video. Specifically, we propose a bidirectional global-local spatio-temporal block that comprehensively models human joint relations within individual frames as well as across frames. This strategy provides a more logical geometric ordering, resulting in a combined global-local spatial scan.
arXiv Detail & Related papers (2024-08-07T04:38:03Z) - Generative Modeling with Phase Stochastic Bridges [49.4474628881673]
Diffusion models (DMs) represent state-of-the-art generative models for continuous inputs.
We introduce a novel generative modeling framework grounded in phase space dynamics.
Our framework demonstrates the capability to generate realistic data points at an early stage of dynamics propagation.
arXiv Detail & Related papers (2023-10-11T18:38:28Z) - Predicting Physics in Mesh-reduced Space with Temporal Attention [15.054026802351146]
We propose a new method that captures long-term dependencies through a transformer-style temporal attention model.
Our method outperforms a competitive GNN baseline on several complex fluid dynamics prediction tasks.
We believe our approach paves the way to bringing the benefits of attention-based sequence models to solving high-dimensional complex physics tasks.
arXiv Detail & Related papers (2022-01-22T18:32:54Z)
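For the "Long-Context State-Space Video World Models" entry above, the block-wise SSM scanning scheme can be pictured as each spatial block carrying its own recurrent state updated frame by frame, so per-step memory and compute stay constant in history length. The toy sketch below uses a diagonal linear SSM as a deliberately minimal stand-in for Mamba; all shapes, names, and the decay parameterization are assumptions for illustration only.

```python
# Toy sketch of block-wise SSM scanning (assumptions, not the paper's code):
# each spatial block keeps an independent recurrent state over time, giving
# O(1) memory per step regardless of how many frames have been seen.
import torch

def blockwise_ssm_scan(x: torch.Tensor, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """x: (T, B, D) = frames x spatial blocks x channels.
    a, b: (D,) per-channel decay and input gain of a diagonal linear SSM.
    Each block is scanned independently over time: h_t = a*h_{t-1} + b*x_t."""
    T, B, D = x.shape
    h = torch.zeros(B, D)
    states = []
    for t in range(T):        # temporal scan; no mixing across blocks here
        h = a * h + b * x[t]
        states.append(h)
    return torch.stack(states)  # (T, B, D)

if __name__ == "__main__":
    T, B, D = 16, 8, 32       # 16 frames, 8 spatial blocks, 32 channels
    y = blockwise_ssm_scan(torch.randn(T, B, D),
                           torch.full((D,), 0.9), torch.ones(D))
    print(y.shape)            # torch.Size([16, 8, 32])
```

Because states never mix across blocks in such a scan, spatial consistency has to be restored elsewhere, e.g. by interleaving spatial layers between scans; the abstract frames exactly this as the trade-off the block-wise scheme makes.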
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.