Spatial-Temporal State Propagation Autoregressive Model for 4D Object Generation
- URL: http://arxiv.org/abs/2602.18830v1
- Date: Sat, 21 Feb 2026 13:21:21 GMT
- Title: Spatial-Temporal State Propagation Autoregressive Model for 4D Object Generation
- Authors: Liying Yang, Jialun Liu, Jiakui Hu, Chenhao Guan, Haibin Huang, Fangqiu Yi, Chi Zhang, Yanyan Liang
- Abstract summary: A spatial-temporal state propagation autoregressive model (STAR) is proposed, which achieves spatially and temporally consistent generation. Experiments demonstrate that 4DSTAR generates spatially and temporally consistent 4D objects and achieves performance competitive with diffusion models.
- Score: 19.913442608499366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating high-quality 4D objects with spatial-temporal consistency remains a formidable challenge. Existing diffusion-based methods often suffer from spatial-temporal inconsistency because they fail to leverage the outputs of all previous timesteps to guide generation at the current timestep. We therefore propose a Spatial-Temporal State Propagation AutoRegressive Model (4DSTAR), which generates 4D objects while maintaining spatial-temporal consistency. 4DSTAR formulates generation as the prediction of tokens that represent the 4D object, and it consists of two key components. (1) A dynamic spatial-temporal state propagation autoregressive model (STAR) that achieves spatially and temporally consistent generation. Unlike standard autoregressive models, STAR divides the prediction tokens into groups by timestep; it models long-term dependencies by propagating spatial-temporal states from previous groups and uses these dependencies to guide generation at the next timestep. To this end, a spatial-temporal container is proposed that dynamically updates the effective spatial-temporal state features from all historical groups; the updated features then serve as conditional features to guide prediction of the next token group. (2) A 4D VQ-VAE that implicitly encodes the 4D structure into a discrete space and decodes the discrete tokens predicted by STAR into temporally coherent dynamic 3D Gaussians. Experiments demonstrate that 4DSTAR generates spatially and temporally consistent 4D objects and achieves performance competitive with diffusion models.
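The abstract describes STAR's core loop concretely enough to sketch: tokens are grouped by timestep, a container summarizes the state of all previously generated groups, and that state conditions prediction of the next group. The PyTorch sketch below is a minimal illustration of that loop under stated assumptions; the class names (StateContainer, GroupARModel), the gated container update, and all hyperparameters are illustrative guesses, not the paper's architecture, and the generated codes would feed the paper's 4D VQ-VAE decoder (not shown).

```python
# Illustrative sketch (not the authors' code) of group-wise autoregressive
# generation with a spatial-temporal state container, as described in the
# 4DSTAR abstract. All names and hyperparameters are assumptions.
import torch
import torch.nn as nn


class StateContainer(nn.Module):
    """Fixed-size bank of spatial-temporal state features, refreshed
    after every generated token group (gated write -- an assumption)."""

    def __init__(self, dim: int, slots: int = 64):
        super().__init__()
        self.init_slots = nn.Parameter(torch.randn(1, slots, dim) * 0.02)
        self.write = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def init_state(self, batch: int) -> torch.Tensor:
        return self.init_slots.expand(batch, -1, -1)

    def update(self, state: torch.Tensor, group: torch.Tensor) -> torch.Tensor:
        # Read the new group's features into the container slots, then
        # gate how much of the previous state each slot overwrites.
        read, _ = self.write(state, group, group)
        g = torch.sigmoid(self.gate(torch.cat([state, read], dim=-1)))
        return g * read + (1 - g) * state


class GroupARModel(nn.Module):
    """Predicts one token group (one timestep) at a time, conditioned
    on the container state summarizing all earlier groups."""

    def __init__(self, vocab: int = 1024, dim: int = 256, group_len: int = 16):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.container = StateContainer(dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)
        self.query = nn.Parameter(torch.randn(1, group_len, dim) * 0.02)

    @torch.no_grad()
    def generate(self, batch: int, timesteps: int) -> torch.Tensor:
        state = self.container.init_state(batch)
        groups = []
        for _ in range(timesteps):
            q = self.query.expand(batch, -1, -1)
            h = self.decoder(q, memory=state)       # condition on state
            tokens = self.head(h).argmax(dim=-1)    # (batch, group_len)
            groups.append(tokens)
            # Propagate: fold this group back into the container so it
            # can guide every later timestep.
            state = self.container.update(state, self.embed(tokens))
        return torch.stack(groups, dim=1)           # (batch, T, group_len)


model = GroupARModel()
codes = model.generate(batch=2, timesteps=8)
print(codes.shape)  # torch.Size([2, 8, 16]); discrete codes for a VQ-VAE decoder
```

One design point worth noting: compared with a standard autoregressive decoder that attends to the full token history, a fixed-size container keeps the per-step cost constant in the number of past timesteps, which is consistent with the abstract's emphasis on propagating long-term dependencies through state rather than through raw tokens.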
Related papers
- Orthogonal Spatial-temporal Distributional Transfer for 4D Generation [58.30004328671699]
We propose a framework that transfers rich spatial priors from existing 3D diffusion models and temporal priors from video diffusion models to enhance 4D synthesis.
We develop a spatial-temporal-disentangled 4D (STD-4D) diffusion model, which synthesizes 4D-aware videos through disentangled spatial and temporal latents.
Experiments demonstrate that our method significantly outperforms existing approaches, achieving superior spatial-temporal consistency and higher-quality 4D synthesis.
arXiv Detail & Related papers (2026-03-05T11:52:21Z)
- SS4D: Native 4D Generative Model via Structured Spacetime Latents [50.29500511908054]
We present SS4D, a native 4D generative model that synthesizes dynamic 3D objects directly from monocular video.
We train a generator directly on 4D data, achieving high fidelity, temporal coherence, and structural consistency.
arXiv Detail & Related papers (2025-12-16T10:45:06Z)
- Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation [21.075786141331974]
We present Track4DGen, a framework for generating dynamic 4D objects from sparse inputs.
In Stage One, we enforce dense, feature-level point correspondences inside the diffusion generator.
In Stage Two, we reconstruct a dynamic 4D-GS using a hybrid motion encoding.
arXiv Detail & Related papers (2025-12-05T21:13:04Z)
- U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences [54.77163447282599]
Existing generative frameworks treat all spatial regions uniformly, overlooking the varying uncertainty across real-world scenes.
We present U4D, an uncertainty-aware framework for 4D LiDAR world modeling.
Our approach first estimates spatial uncertainty maps from a pretrained segmentation model to localize semantically challenging regions.
It then performs generation in a "hard-to-easy" manner through two sequential stages: (1) uncertainty-region modeling, which reconstructs high-entropy regions with fine geometric fidelity, and (2) uncertainty-conditioned completion, which synthesizes the remaining areas under learned structural priors.
arXiv Detail & Related papers (2025-12-02T17:59:57Z)
- 4DSTR: Advancing Generative 4D Gaussians with Spatial-Temporal Rectification for High-Quality and Consistent 4D Generation [28.11338918279445]
We propose a novel 4D generation network called 4DSTR, which modulates generative 4D Gaussian Splatting with spatial-temporal rectification.
Experiments demonstrate that our 4DSTR achieves state-of-the-art performance in video-to-4D generation, excelling in reconstruction quality, spatial-temporal consistency, and adaptation to rapid temporal movements.
arXiv Detail & Related papers (2025-11-10T15:57:03Z)
- Dream4D: Lifting Camera-Controlled I2V towards Spatiotemporally Consistent 4D Generation [3.1852855132066673]
Current approaches often struggle to maintain view consistency while handling complex scene dynamics.
This framework is the first to leverage both the rich temporal priors of video diffusion models and the geometric awareness of reconstruction models.
It significantly facilitates 4D generation and shows higher quality (e.g., mPSNR, mSSIM) than existing methods.
arXiv Detail & Related papers (2025-08-11T08:55:47Z)
- Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency [49.875459658889355]
Free4D is a tuning-free framework for 4D scene generation from a single image.
Our key insight is to distill pre-trained foundation models for consistent 4D scene representation.
The resulting 4D representation enables real-time, controllable rendering.
arXiv Detail & Related papers (2025-03-26T17:59:44Z)
- DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation [50.01520547454224]
Current generative models struggle to synthesize 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS).
We propose DiST-4D, which disentangles the problem into two diffusion processes: DiST-T, which predicts future metric depth and multi-view RGB sequences directly from past observations, and DiST-S, which enables spatial NVS by training only on existing viewpoints while enforcing cycle consistency.
Experiments demonstrate that DiST-4D achieves state-of-the-art performance in both temporal prediction and NVS tasks, while also delivering competitive performance in planning-related evaluations.
arXiv Detail & Related papers (2025-03-19T13:49:48Z)
- Diffusion4D: Fast Spatial-temporal Consistent 4D Generation via Video Diffusion Models [116.31344506738816]
We present a novel framework, Diffusion4D, for efficient and scalable 4D content generation.
We develop a 4D-aware video diffusion model capable of synthesizing orbital views of dynamic 3D assets.
Our method surpasses prior state-of-the-art techniques in terms of generation efficiency and 4D geometry consistency.
arXiv Detail & Related papers (2024-05-26T17:47:34Z)
- Motion2VecSets: 4D Latent Vector Set Diffusion for Non-rigid Shape Reconstruction and Tracking [52.393359791978035]
Motion2VecSets is a 4D diffusion model for dynamic surface reconstruction from point cloud sequences.
We parameterize 4D dynamics with latent sets instead of using global latent codes.
For more temporally-coherent object tracking, we synchronously denoise deformation latent sets and exchange information across multiple frames.
arXiv Detail & Related papers (2024-01-12T15:05:08Z)