LoL: Longer than Longer, Scaling Video Generation to Hour
- URL: http://arxiv.org/abs/2601.16914v1
- Date: Fri, 23 Jan 2026 17:21:35 GMT
- Title: LoL: Longer than Longer, Scaling Video Generation to Hour
- Authors: Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, Cho-Jui Hsieh,
- Abstract summary: This work achieves the first demonstration of real-time, streaming, and infinite-length video generation with little quality decay. As an illustration, we generate continuous videos up to 12 hours in length, which, to our knowledge, is among the longest publicly demonstrated results in streaming video generation.
- Score: 50.945885467651216
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent research in long-form video generation has shifted from bidirectional to autoregressive models, yet these methods commonly suffer from error accumulation and a loss of long-term coherence. While attention sink frames have been introduced to mitigate this performance decay, they often induce a critical failure mode we term sink-collapse: the generated content repeatedly reverts to the sink frame, resulting in abrupt scene resets and cyclic motion patterns. Our analysis reveals that sink-collapse originates from an inherent conflict between the periodic structure of Rotary Position Embedding (RoPE) and the multi-head attention mechanisms prevalent in current generative models. To address it, we propose a lightweight, training-free approach that effectively suppresses this behavior by introducing multi-head RoPE jitter that breaks inter-head attention homogenization and mitigates long-horizon collapse. Extensive experiments show that our method successfully alleviates sink-collapse while preserving generation quality. To the best of our knowledge, this work achieves the first demonstration of real-time, streaming, and infinite-length video generation with little quality decay. As an illustration of this robustness, we generate continuous videos up to 12 hours in length, which, to our knowledge, is among the longest publicly demonstrated results in streaming video generation.
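As described in the abstract, the fix is a small, training-free change: each attention head rotates with slightly different RoPE phases, which breaks the inter-head homogenization behind sink-collapse. The sketch below is one illustrative reading of that idea, not the authors' released implementation; the jitter scale, per-head seeding, and helper names are assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): rotary position embedding
# with a small fixed per-head phase jitter, so different heads rotate with
# slightly different phases instead of collapsing onto the same sink pattern.
import torch

def rope_angles(positions: torch.Tensor, head_dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE angles, shape (seq_len, head_dim // 2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return positions.float()[:, None] * inv_freq[None, :]

def apply_rope_with_jitter(x: torch.Tensor, positions: torch.Tensor,
                           jitter_scale: float = 0.02) -> torch.Tensor:
    """x: (batch, num_heads, seq_len, head_dim). Adds a deterministic per-head
    phase offset to the RoPE angles before rotating the features."""
    b, h, s, d = x.shape
    angles = rope_angles(positions, d)                   # (s, d/2)
    gen = torch.Generator().manual_seed(0)               # fixed seed -> same jitter every call
    jitter = jitter_scale * torch.randn(h, 1, d // 2, generator=gen)
    theta = angles[None, :, :] + jitter                  # (h, s, d/2), one phase set per head
    cos, sin = theta.cos()[None], theta.sin()[None]      # (1, h, s, d/2)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Usage: apply to queries and keys before computing attention.
q = torch.randn(1, 8, 16, 64)
pos = torch.arange(16)
q_rot = apply_rope_with_jitter(q, pos)
```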
Related papers
- Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression [36.99018442740971]
We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. We introduce Deep Forcing, which consists of two training-free mechanisms that address this without any fine-tuning. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressively streaming long-video generation.
arXiv Detail & Related papers (2025-12-04T18:46:44Z) - Uniform Discrete Diffusion with Metric Path for Video Generation [103.86033350602908]
Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-duration inconsistency. We present URSA, a powerful uniform discrete diffusion framework with a metric path that bridges the gap with continuous approaches for scalable video generation. URSA consistently outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods.
arXiv Detail & Related papers (2025-10-28T17:59:57Z) - Self-Forcing++: Towards Minute-Scale High-Quality Video Generation [50.945885467651216]
Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. We propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets.
arXiv Detail & Related papers (2025-10-02T17:55:42Z) - Rolling Forcing: Autoregressive Long Video Diffusion in Real Time [86.40480237741609]
Rolling Forcing is a novel video generation technique that enables streaming long videos with minimal error accumulation. Rolling Forcing comes with three novel designs. First, instead of iteratively sampling individual frames, which accelerates error propagation, we design a joint denoising scheme. Second, we introduce the attention sink mechanism into the long-horizon streaming video generation task, which allows the model to keep the key-value states of initial frames as a global context anchor (a minimal sketch of such a sink-anchored cache appears after this list). Third, we design an efficient training algorithm that enables few-step distillation over largely extended denoising windows.
arXiv Detail & Related papers (2025-09-29T17:57:14Z) - LongScape: Advancing Long-Horizon Embodied World Models with Context-Aware MoE [16.561410415129778]
LongScape is a hybrid framework that combines intra-chunk diffusion denoising with inter-chunk autoregressive causal generation. Our core innovation is an action-guided, variable-length chunking mechanism that partitions video based on the semantic context of robotic actions.
arXiv Detail & Related papers (2025-09-26T02:47:05Z) - WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception [40.96323549891244]
Current methods predominantly rely on RGB signals, leading to accumulated errors in object structure and motion over extended durations. We introduce WorldWeaver, a robust framework for long video generation that jointly models RGB frames and perceptual conditions within a unified long-horizon modeling scheme. Our training framework offers three key advantages. First, by jointly predicting perceptual conditions and color information from a unified representation, it significantly enhances temporal consistency and motion dynamics.
arXiv Detail & Related papers (2025-08-21T16:57:33Z) - Generating Long Videos of Dynamic Scenes [66.56925105992472]
We present a video generation model that reproduces object motion, changes in camera viewpoint, and new content that arises over time.
A common failure case is for content to never change due to over-reliance on inductive biases to provide temporal consistency.
arXiv Detail & Related papers (2022-06-07T16:29:51Z) - Intrinsic Temporal Regularization for High-resolution Human Video Synthesis [59.54483950973432]
Temporal consistency is crucial for extending image processing pipelines to the video domain.
We propose an effective intrinsic temporal regularization scheme, where an intrinsic confidence map is estimated via the frame generator to regulate motion estimation.
We apply our intrinsic temporal regularization to a single-image generator, leading to a powerful "INTERnet" capable of generating $512\times512$ resolution human action videos.
arXiv Detail & Related papers (2020-12-11T05:29:45Z)
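Several of the entries above (Deep Forcing, Rolling Forcing, and the LoL abstract itself) revolve around attention-sink frames kept in a streaming KV cache. As a rough illustration of that shared idea, the sketch below pins the key/value states of the first frame(s) as a global anchor and keeps only a sliding window of recent frames; the class name, shapes, and window sizes are assumptions rather than any paper's released code.

```python
# Minimal sketch (assumptions): a sink-anchored rolling KV cache for
# autoregressive streaming video generation. Sink frames are never evicted;
# all other frames live in a fixed-size sliding window.
from collections import deque
import torch

class SinkKVCache:
    def __init__(self, num_sink_frames: int = 1, window_frames: int = 8):
        self.num_sink_frames = num_sink_frames
        self.sink = []                              # permanently pinned (K, V) per sink frame
        self.window = deque(maxlen=window_frames)   # recent frames, oldest evicted automatically

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        """k, v: (num_heads, tokens_per_frame, head_dim) for one generated frame."""
        if len(self.sink) < self.num_sink_frames:
            self.sink.append((k, v))
        else:
            self.window.append((k, v))

    def context(self) -> tuple[torch.Tensor, torch.Tensor]:
        """Keys/values the next frame attends to: sink frames plus the recent window."""
        frames = self.sink + list(self.window)
        ks = torch.cat([k for k, _ in frames], dim=1)
        vs = torch.cat([v for _, v in frames], dim=1)
        return ks, vs

# Usage: stream ten frames; only 1 sink + 4 most recent frames are retained.
cache = SinkKVCache(num_sink_frames=1, window_frames=4)
for _ in range(10):
    cache.append(torch.randn(8, 64, 32), torch.randn(8, 64, 32))
K, V = cache.context()
```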