StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation
- URL: http://arxiv.org/abs/2501.05763v4
- Date: Sun, 13 Apr 2025 06:21:42 GMT
- Title: StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation
- Authors: Shangjin Zhai, Zhichao Ye, Jialin Liu, Weijian Xie, Jiaqi Hu, Zhen Peng, Hua Xue, Danpeng Chen, Xiaomeng Wang, Lei Yang, Nan Wang, Haomin Liu, Guofeng Zhang,
- Abstract summary: StarGen is a framework that employs a pre-trained video diffusion model in an autoregressive manner for long-range scene generation. The generation of each video clip is conditioned on the 3D warping of spatially adjacent images and the temporally overlapping image from previously generated clips.
- Score: 12.016502857454228
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in large reconstruction and generative models have significantly improved scene reconstruction and novel view generation. However, due to compute limitations, each inference with these large models is confined to a small area, making long-range consistent scene generation challenging. To address this, we propose StarGen, a novel framework that employs a pre-trained video diffusion model in an autoregressive manner for long-range scene generation. The generation of each video clip is conditioned on the 3D warping of spatially adjacent images and the temporally overlapping image from previously generated clips, improving spatiotemporal consistency in long-range scene generation with precise pose control. The spatiotemporal condition is compatible with various input conditions, facilitating diverse tasks, including sparse view interpolation, perpetual view generation, and layout-conditioned city generation. Quantitative and qualitative evaluations demonstrate StarGen's superior scalability, fidelity, and pose accuracy compared to state-of-the-art methods. Project page: https://zju3dv.github.io/StarGen.
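The abstract describes an autoregressive loop in which each new clip is produced by a pose-controlled video diffusion model conditioned on two signals: a 3D warp of a spatially adjacent, previously generated image into the target views, and the temporally overlapping frame of the preceding clip. The Python sketch below is a hypothetical illustration of that control flow only, not the authors' implementation; the warp_to_pose and video_diffusion_sample placeholders stand in for the real 3D warping and diffusion components, which are not specified here.

```python
# Hypothetical sketch (not the authors' code) of the spatiotemporal
# autoregressive loop described in the abstract: each new clip is
# conditioned on (a) a 3D warp of a spatially adjacent key image into the
# target poses and (b) the temporally overlapping frame of the previous clip.
import numpy as np

H, W, CLIP_LEN = 64, 64, 8  # toy resolution and clip length

def warp_to_pose(image: np.ndarray, src_pose: np.ndarray, dst_pose: np.ndarray) -> np.ndarray:
    """Placeholder for depth-based 3D warping; a real system would
    reproject pixels through an estimated depth map. Here: identity."""
    return image.copy()

def video_diffusion_sample(spatial_cond: np.ndarray,
                           temporal_cond: np.ndarray,
                           poses: list) -> np.ndarray:
    """Placeholder for the pose-controlled video diffusion model.
    Returns one clip with one frame per target pose; here the two
    conditioning images are simply blended so the loop runs end to end."""
    blend = 0.5 * spatial_cond + 0.5 * temporal_cond
    return np.stack([blend for _ in poses], axis=0)

def generate_long_trajectory(start_image: np.ndarray, poses: list) -> list:
    frames = [start_image]
    anchor_image, anchor_pose = start_image, poses[0]
    # Step through the trajectory clip by clip, overlapping by one frame.
    for clip_start in range(1, len(poses), CLIP_LEN - 1):
        clip_poses = poses[clip_start:clip_start + CLIP_LEN - 1]
        if not clip_poses:
            break
        # Spatial condition: warp a previously generated key frame into the new views.
        spatial_cond = warp_to_pose(anchor_image, anchor_pose, clip_poses[-1])
        # Temporal condition: the last frame of the previous clip (the overlap).
        temporal_cond = frames[-1]
        clip = video_diffusion_sample(spatial_cond, temporal_cond, clip_poses)
        frames.extend(list(clip))
        anchor_image, anchor_pose = clip[-1], clip_poses[-1]
    return frames

poses = [np.eye(4) for _ in range(40)]               # toy camera trajectory
start = np.random.rand(H, W, 3).astype(np.float32)   # toy start view
print(len(generate_long_trajectory(start, poses)))   # number of generated frames
```

In a full system the spatial anchor would presumably be chosen among all previously generated key frames by spatial proximity to the new views rather than always being the last clip's final frame; the sketch keeps the simpler choice only to show how the two conditions feed each autoregressive step.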
Related papers
- RAYNOVA: Scale-Temporal Autoregressive World Modeling in Ray Space [51.441415833480505]
RAYNOVA is a multiview world model for driving scenarios that employs a dual-causal autoregressive framework. It constructs an isotropic-temporal representation across views, frames, and scales based on relative Plücker-ray positional encoding.
arXiv Detail & Related papers (2026-02-24T08:41:40Z) - Plenoptic Video Generation [80.3116444692858]
We introduce PlenopticDreamer, a framework that synchronizes generative hallucinations to maintain spatiotemporal memory. The core idea is to train a multi-in-out video-conditioned model in an autoregressive manner. Our training incorporates context-scaling to improve convergence, self-conditioning to mitigate hallucinations caused by error accumulation, and a long-video conditioning mechanism to support extended video generation.
arXiv Detail & Related papers (2026-01-08T18:58:32Z) - iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation [60.66986667921744]
iMontage is a unified framework designed to repurpose a powerful video model into an all-in-one image generator. We propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data curation process and training paradigm. This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors.
arXiv Detail & Related papers (2025-11-25T18:54:16Z) - Towards Geometric and Textural Consistency 3D Scene Generation via Single Image-guided Model Generation and Layout Optimization [14.673302810271219]
We propose a novel three-stage framework for 3D scene generation with explicit geometric representations and high-quality textural details. Our approach not only outperforms state-of-the-art methods in terms of geometric accuracy and texture fidelity of individual generated 3D models, but also has significant advantages in scene layout synthesis.
arXiv Detail & Related papers (2025-07-20T06:59:42Z) - GenFusion: Closing the Loop between Reconstruction and Generation via Videos [24.195304481751602]
We propose a reconstruction-driven video diffusion model that learns to condition video frames on artifact-prone RGB-D renderings.
We also propose a cyclical fusion pipeline that iteratively adds restoration frames from the generative model to the training set.
arXiv Detail & Related papers (2025-03-27T07:16:24Z) - VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling [20.329392012132885]
We propose VideoRFSplat, a text-to-3D model leveraging a video generation model to generate realistic 3D Gaussian Splatting (3DGS) for unbounded real-world scenes.
VideoRFSplat outperforms existing text-to-3D direct generation methods that heavily depend on post-hoc refinement via score distillation sampling.
arXiv Detail & Related papers (2025-03-20T05:26:09Z) - Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion [61.929653153389964]
We present a method for generating Streetscapes: long sequences of views through an on-the-fly synthesized city-scale scene.
Our method can scale to much longer-range camera trajectories, spanning several city blocks, while maintaining visual quality and consistency.
arXiv Detail & Related papers (2024-07-18T17:56:30Z) - MultiDiff: Consistent Novel View Synthesis from a Single Image [60.04215655745264]
MultiDiff is a novel approach for consistent novel view synthesis of scenes from a single RGB image.
Our results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet.
arXiv Detail & Related papers (2024-06-26T17:53:51Z) - GenS: Generalizable Neural Surface Reconstruction from Multi-View Images [20.184657468900852]
GenS is an end-to-end generalizable neural surface reconstruction model.
Our representation is more powerful, which can recover high-frequency details while maintaining global smoothness.
Experiments on popular benchmarks show that our model can generalize well to new scenes.
arXiv Detail & Related papers (2024-06-04T17:13:10Z) - 3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation [51.64796781728106]
We propose a generative refinement network to synthesize new contents with higher quality by exploiting the natural image prior of the 2D diffusion model and the global 3D information of the current scene.
Our approach supports wide variety of scene generation and arbitrary camera trajectories with improved visual quality and 3D consistency.
arXiv Detail & Related papers (2024-03-14T14:31:22Z) - VastGaussian: Vast 3D Gaussians for Large Scene Reconstruction [59.40711222096875]
We present VastGaussian, the first method for high-quality reconstruction and real-time rendering on large scenes based on 3D Gaussian Splatting.
Our approach outperforms existing NeRF-based methods and achieves state-of-the-art results on multiple large scene datasets.
arXiv Detail & Related papers (2024-02-27T11:40:50Z) - RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z) - DiffDreamer: Towards Consistent Unsupervised Single-view Scene Extrapolation with Conditional Diffusion Models [91.94566873400277]
DiffDreamer is an unsupervised framework capable of synthesizing novel views depicting a long camera trajectory.
We show that image-conditioned diffusion models can effectively perform long-range scene extrapolation while preserving consistency significantly better than prior GAN-based methods.
arXiv Detail & Related papers (2022-11-22T10:06:29Z) - GAUDI: A Neural Architect for Immersive 3D Scene Generation [67.97817314857917]
GAUDI is a generative model capable of capturing the distribution of complex and realistic 3D scenes that can be rendered immersively from a moving camera.
We show that GAUDI obtains state-of-the-art performance in the unconditional generative setting across multiple datasets.
arXiv Detail & Related papers (2022-07-27T19:10:32Z) - Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image [73.56631858393148]
We introduce the problem of perpetual view generation -- long-range generation of novel views corresponding to an arbitrarily long camera trajectory given a single image.
We take a hybrid approach that integrates both geometry and image synthesis in an iterative render, refine, and repeat framework (sketched below, after this list).
Our approach can be trained from a set of monocular video sequences without any manual annotation.
arXiv Detail & Related papers (2020-12-17T18:59:57Z)
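The Infinite Nature entry above refers to an iterative render, refine, and repeat loop for perpetual view generation. The sketch below is a hypothetical illustration of that loop's structure, not the paper's implementation; render_forward and refine are placeholder stand-ins for depth-based forward warping and the learned refinement network.

```python
# Hypothetical sketch of the render-refine-repeat loop for perpetual view
# generation: the current RGB-D view is rendered into the next camera pose,
# a refinement step fills disocclusions, and the refined view becomes the
# input for the next iteration.
import numpy as np

def render_forward(rgb: np.ndarray, depth: np.ndarray, pose_delta: np.ndarray):
    """Placeholder for forward-warping an RGB-D frame into the next pose;
    a real system reprojects pixels through the depth map. Here: identity."""
    return rgb.copy(), depth.copy()

def refine(rgb: np.ndarray, depth: np.ndarray):
    """Placeholder for the learned refinement network that inpaints
    disoccluded regions and predicts depth for them. Here: identity."""
    return rgb, depth

def perpetual_view_generation(rgb0: np.ndarray, depth0: np.ndarray, pose_deltas: list) -> list:
    rgb, depth = rgb0, depth0
    views = [rgb0]
    for pose_delta in pose_deltas:                            # repeat
        rgb, depth = render_forward(rgb, depth, pose_delta)   # render
        rgb, depth = refine(rgb, depth)                       # refine
        views.append(rgb)
    return views

rgb0 = np.random.rand(64, 64, 3).astype(np.float32)
depth0 = np.ones((64, 64), dtype=np.float32)
deltas = [np.eye(4)] * 10
print(len(perpetual_view_generation(rgb0, depth0, deltas)))  # 11 views
```

The key property the loop illustrates is that each output is fed back as the next input, which is why error accumulation and drift are the central challenges these long-range generation methods try to control.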
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.