InfinityStar: Unified Spacetime AutoRegressive Modeling for Visual Generation
- URL: http://arxiv.org/abs/2511.04675v1
- Date: Thu, 06 Nov 2025 18:58:03 GMT
- Title: InfinityStar: Unified Spacetime AutoRegressive Modeling for Visual Generation
- Authors: Jinlai Liu, Jian Han, Bin Yan, Hui Wu, Fengda Zhu, Xing Wang, Yi Jiang, Bingyue Peng, Zehuan Yuan
- Abstract summary: We introduce InfinityStar, a unified spacetime autoregressive framework for high-resolution image and dynamic video synthesis. To our knowledge, InfinityStar is the first discrete autoregressive video generator capable of producing industrial level 720p videos.
- Score: 39.324043558250544
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce InfinityStar, a unified spacetime autoregressive framework for high-resolution image and dynamic video synthesis. Building on the recent success of autoregressive modeling in both vision and language, our purely discrete approach jointly captures spatial and temporal dependencies within a single architecture. This unified design naturally supports a variety of generation tasks such as text-to-image, text-to-video, image-to-video, and long interactive video synthesis via straightforward temporal autoregression. Extensive experiments demonstrate that InfinityStar scores 83.74 on VBench, outperforming all autoregressive models by large margins, even surpassing some diffusion competitors like HunyuanVideo. Without extra optimizations, our model generates a 5s, 720p video approximately 10x faster than leading diffusion-based methods. To our knowledge, InfinityStar is the first discrete autoregressive video generator capable of producing industrial level 720p videos. We release all code and models to foster further research in efficient, high-quality video generation.
Related papers
- Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation [41.993197533574126]
Inferix is an inference engine to enable immersive world synthesis through optimized semi-autoregressive decoding processes. Inferix further enhances its offering with interactive video streaming and profiling, enabling real-time interaction and realistic simulation.
arXiv Detail & Related papers (2025-11-25T01:45:04Z)
- SkyReels-V2: Infinite-length Film Generative Model [35.00453687783287]
We propose SkyReels-V2, an Infinite-length Film Generative Model, that synergizes Multi-modal Large Language Model (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing Framework. We establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement.
arXiv Detail & Related papers (2025-04-17T16:37:27Z)
- SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device [61.42406720183769]
We propose a comprehensive acceleration framework to bring the power of the large-scale video diffusion model to the hands of edge users. Our model, with only 0.6B parameters, can generate a 5-second video on an iPhone 16 PM within 5 seconds.
arXiv Detail & Related papers (2024-12-13T18:59:56Z)
- HunyuanVideo: A Systematic Framework For Large Video Generative Models [82.4392082688739]
HunyuanVideo is an innovative open-source video foundation model. It incorporates data curation, advanced architectural design, progressive model scaling and training. As a result, we successfully trained a video generative model with over 13 billion parameters.
arXiv Detail & Related papers (2024-12-03T23:52:37Z)
- Lumiere: A Space-Time Diffusion Model for Video Generation [75.54967294846686]
We introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once.
This is in contrast to existing video models, which synthesize distant keyframes followed by temporal super-resolution.
By deploying both spatial and (importantly) temporal down- and up-sampling, our model learns to directly generate a full-frame-rate, low-resolution video.
arXiv Detail & Related papers (2024-01-23T18:05:25Z)
- Photorealistic Video Generation with Diffusion Models [44.95407324724976]
W.A.L.T. is a transformer-based approach for video generation via diffusion modeling.
We use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities.
We also train a cascade of three models for the task of text-to-video generation, consisting of a base latent video diffusion model and two video super-resolution diffusion models, to generate videos of $512 \times 896$ resolution at $8$ frames per second.
arXiv Detail & Related papers (2023-12-11T18:59:57Z)
- Latent Video Diffusion Models for High-Fidelity Long Video Generation [58.346702410885236]
We introduce lightweight video diffusion models using a low-dimensional 3D latent space.
We also propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced.
Our framework generates more realistic and longer videos than previous strong baselines.
arXiv Detail & Related papers (2022-11-23T18:58:39Z)
- Video Diffusion Models [47.99413440461512]
Generating temporally coherent high fidelity video is an important milestone in generative modeling research.
We propose a diffusion model for video generation that shows very promising initial results.
We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on an established unconditional video generation benchmark.
arXiv Detail & Related papers (2022-04-07T14:08:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.