Rolling Forcing: Autoregressive Long Video Diffusion in Real Time
- URL: http://arxiv.org/abs/2509.25161v1
- Date: Mon, 29 Sep 2025 17:57:14 GMT
- Title: Rolling Forcing: Autoregressive Long Video Diffusion in Real Time
- Authors: Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, Shijian Lu
- Abstract summary: Rolling Forcing is a novel video generation technique that enables streaming long videos with minimal error accumulation. Rolling Forcing comes with three novel designs. First, instead of iteratively sampling individual frames, which accelerates error propagation, we design a joint denoising scheme. Second, we introduce the attention sink mechanism into the long-horizon stream video generation task, which allows the model to keep key-value states of initial frames as a global context anchor. Third, we design an efficient training algorithm that enables few-step distillation over largely extended denoising windows.
- Score: 86.40480237741609
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Streaming video generation, as one fundamental component of interactive world models and neural game engines, aims to generate high-quality, low-latency, and temporally coherent long video streams. However, most existing work suffers from severe error accumulation that often significantly degrades the generated video streams over long horizons. We design Rolling Forcing, a novel video generation technique that enables streaming long videos with minimal error accumulation. Rolling Forcing comes with three novel designs. First, instead of iteratively sampling individual frames, which accelerates error propagation, we design a joint denoising scheme that simultaneously denoises multiple frames with progressively increasing noise levels. This design relaxes the strict causality across adjacent frames, effectively suppressing error growth. Second, we introduce the attention sink mechanism into the long-horizon stream video generation task, which allows the model to keep key-value states of initial frames as a global context anchor and thereby enhances long-term global consistency. Third, we design an efficient training algorithm that enables few-step distillation over largely extended denoising windows. This algorithm operates on non-overlapping windows and mitigates exposure bias by conditioning on self-generated histories. Extensive experiments show that Rolling Forcing enables real-time streaming generation of multi-minute videos on a single GPU, with substantially reduced error accumulation.
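To make the joint denoising scheme concrete, the sketch below rolls a fixed-size window of latents whose noise levels increase from the head (nearly clean) to the tail (pure noise); each step jointly denoises the whole window, emits the cleanest frame, and appends a fresh noise frame, while the initial frames are retained as an attention-sink anchor. This is a minimal sketch under stated assumptions: the names `denoise_step`, `sink`, and the linear noise schedule are illustrative, not the authors' released implementation.

```python
# Illustrative sketch of rolling joint denoising with an attention-sink
# anchor, assuming a user-supplied denoiser; not the authors' released code.
import torch

def rolling_forcing_stream(denoise_step, first_frame, num_frames, window=8):
    """Stream `num_frames` frames by jointly denoising a rolling window.

    denoise_step(latents, noise_levels, sink) -> latents one level cleaner,
    where `sink` holds initial frames kept as a global context anchor.
    """
    # Progressively increasing noise levels across the window:
    # index 0 is nearly clean, the last index is pure noise.
    levels = torch.linspace(1.0 / window, 1.0, window)

    # Attention sink: the initial frame acts as a fixed global anchor.
    sink = first_frame.unsqueeze(0)                    # (1, C, H, W)

    # Start from pure-noise latents for the whole window.
    buf = torch.randn(window, *first_frame.shape)      # (window, C, H, W)

    outputs = []
    for _ in range(num_frames):
        # One joint step denoises all frames in the window simultaneously,
        # conditioned on the sink, pushing each frame one level cleaner.
        buf = denoise_step(buf, levels, sink)
        # The head frame has reached the lowest noise level: emit it,
        # shift the window, and append a fresh pure-noise frame at the tail.
        outputs.append(buf[0])
        buf = torch.cat([buf[1:], torch.randn(1, *first_frame.shape)], dim=0)
    return torch.stack(outputs)                        # (num_frames, C, H, W)
```

Because adjacent frames within one joint step always sit at different noise levels, strict frame-by-frame causality is relaxed, which is the property the abstract credits with suppressing error growth.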
Related papers
- LoL: Longer than Longer, Scaling Video Generation to Hour [50.945885467651216]
This work achieves the first demonstration of real-time, streaming, and infinite-length video generation with little quality decay.
As an illustration, we generate continuous videos up to 12 hours in length, which, to our knowledge, is among the longest publicly demonstrated results in streaming video generation.
arXiv Detail & Related papers (2026-01-23T17:21:35Z) - JoyAvatar: Real-time and Infinite Audio-Driven Avatar Generation with Autoregressive Diffusion [19.420963062956222]
JoyAvatar is an audio-driven autoregressive model capable of real-time inference and infinite-length video generation.
Our model achieves competitive results in visual quality, temporal consistency, and lip synchronization.
arXiv Detail & Related papers (2025-12-12T10:06:01Z) - StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation [65.90400162290057]
Generative models are reshaping the live-streaming industry by redefining how content is created, styled, and delivered.
Recent advances in video diffusion have markedly improved temporal consistency and sampling efficiency for offline generation.
Live online streaming operates under strict service-level objectives (SLOs): time-to-first-frame must be minimal, and every frame must meet a per-frame deadline with low jitter.
arXiv Detail & Related papers (2025-11-10T18:51:28Z) - MotionStream: Real-Time Video Generation with Interactive Motion Controls [60.403597895657505]
We present MotionStream, enabling sub-second latency with up to 29 FPS streaming generation on a single GPU.
Our approach begins by augmenting a text-to-video model with motion control, which generates high-quality videos that adhere to the global text prompt and local motion guidance, but does not perform inference on the fly.
Our models achieve state-of-the-art results in motion following and video quality while being two orders of magnitude faster, uniquely enabling infinite-length streaming.
arXiv Detail & Related papers (2025-11-03T06:37:53Z) - FreeLong++: Training-Free Long Video Generation via Multi-band SpectralFusion [24.48220892418698]
FreeLong is a training-free framework designed to balance the frequency distribution of long video features during the denoising process.
FreeLong achieves this by blending global low-frequency features, which capture holistic semantics across the full video, with local high-frequency features extracted from short temporal windows.
FreeLong++ extends FreeLong into a multi-branch architecture with multiple attention branches, each operating at a distinct temporal scale.
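A minimal sketch of this spectral-fusion idea: low temporal frequencies are taken from features computed over the full video (global semantics) and high frequencies from features computed over short windows (local detail). The (T, C, H, W) layout, cutoff value, and hard low-pass mask below are assumptions for illustration, not the paper's exact design.

```python
# Illustrative temporal-frequency blend of global and local video features;
# the tensor layout and hard cutoff are assumptions, not the paper's code.
import torch

def spectral_blend(global_feat, local_feat, cutoff=0.25):
    """Blend low frequencies of `global_feat` with high frequencies of
    `local_feat` along the temporal axis (dim 0)."""
    T = global_feat.shape[0]
    g = torch.fft.rfft(global_feat, dim=0)   # temporal spectrum, global path
    h = torch.fft.rfft(local_feat, dim=0)    # temporal spectrum, local path

    # Low-pass mask over temporal frequencies (cycles per frame, max 0.5):
    # keep global semantics below the cutoff, local detail above it.
    freqs = torch.arange(g.shape[0], dtype=torch.float32) / T
    low = (freqs <= cutoff).float().view(-1, 1, 1, 1)

    blended = g * low + h * (1.0 - low)
    return torch.fft.irfft(blended, n=T, dim=0)
```

In the multi-branch FreeLong++ setting described above, such a fusion would presumably be applied per branch at its own temporal scale before combining the bands.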
arXiv Detail & Related papers (2025-06-30T18:11:21Z) - SplatVoxel: History-Aware Novel View Streaming without Temporal Training [29.759664150610362]
We study the problem of novel view streaming from sparse-view videos.
Existing novel view synthesis methods struggle with temporal coherence and visual fidelity.
We propose a hybrid splat-voxel feed-forward scene reconstruction approach.
arXiv Detail & Related papers (2025-03-18T20:00:47Z) - Tuning-Free Long Video Generation via Global-Local Collaborative Diffusion [22.988212617368095]
We propose GLC-Diffusion, a tuning-free method for long video generation.
It models the long video denoising process by establishing Global-Local Collaborative Denoising.
We also propose a Video Motion Consistency Refinement (VMCR) module that computes the gradient of pixel-wise and frequency-wise losses.
arXiv Detail & Related papers (2025-01-08T05:49:39Z) - RAIN: Real-time Animation of Infinite Video Stream [52.97171098038888]
RAIN is a pipeline solution capable of animating infinite video streams in real time with low latency.
RAIN generates video frames with much shorter latency and faster speed, while maintaining long-range attention over extended video streams.
RAIN can animate characters in real time with much better quality, accuracy, and consistency than competitors.
arXiv Detail & Related papers (2024-12-27T07:13:15Z) - Progressive Autoregressive Video Diffusion Models [24.97019070991881]
We introduce a more natural formulation of autoregressive long video generation by revisiting the noise level assumption in video diffusion models.
Our key idea is to assign the frames per-frame, progressively increasing noise levels, rather than a single shared noise level, during denoising.
Video diffusion models equipped with our progressive noise schedule can autoregressively generate long videos with much improved fidelity compared to the baselines.
arXiv Detail & Related papers (2024-10-10T17:36:15Z) - FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention [57.651429116402554]
This paper investigates a straightforward and training-free approach to extend an existing short video diffusion model for consistent long video generation.
We find that directly applying the short video diffusion model to generate long videos can lead to severe video quality degradation.
Motivated by this, we propose a novel solution named FreeLong to balance the frequency distribution of long video features during the denoising process.
arXiv Detail & Related papers (2024-07-29T11:52:07Z) - Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models [64.2445487645478]
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio.
We present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation.
arXiv Detail & Related papers (2024-07-11T17:34:51Z) - RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z) - FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling [85.60543452539076]
Existing video generation models are typically trained on a limited number of frames, resulting in the inability to generate high-fidelity long videos during inference.
This study explores the potential of extending the text-driven capability to generate longer videos conditioned on multiple texts.
We propose FreeNoise, a tuning-free and time-efficient paradigm to enhance the generative capabilities of pretrained video diffusion models.
arXiv Detail & Related papers (2023-10-23T17:59:58Z)