Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression
- URL: http://arxiv.org/abs/2512.05081v1
- Date: Thu, 04 Dec 2025 18:46:44 GMT
- Title: Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression
- Authors: Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, Seungryong Kim
- Abstract summary: We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. We introduce Deep Forcing, which consists of two training-free mechanisms that address this without any fine-tuning. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressively streaming long-video generation.
- Score: 36.99018442740971
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms that address this without any fine-tuning. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution length generation. Together, these components enable over 12x extrapolation (e.g., from a 5-second training horizon to 60+ seconds of generation) with better imaging quality than LongLive, better aesthetic quality than Rolling Forcing, near-parity in overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressively streaming long-video generation.
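Since the abstract names two concrete cache-management mechanisms, a small sketch may help make them tangible. The following is a minimal, hypothetical PyTorch illustration of both ideas as described above; it is not the authors' code. The function names, window sizes, and the participation score (mean recent attention mass per cached token) are all illustrative assumptions.

```python
import torch

def realign_rope(sink_k: torch.Tensor, sink_pos: torch.Tensor,
                 cur_pos: int, freqs: torch.Tensor) -> torch.Tensor:
    """Rotate cached sink keys so their temporal RoPE phase matches the
    current timeline (a guessed form of Deep Sink's re-alignment)."""
    delta = (cur_pos - sink_pos).float()             # (S,) phase shift per sink
    ang = delta[:, None] * freqs[None, :]            # (S, D/2) rotation angles
    cos, sin = ang.cos(), ang.sin()
    k_even, k_odd = sink_k[:, 0::2], sink_k[:, 1::2]
    out = torch.empty_like(sink_k)
    out[:, 0::2] = k_even * cos - k_odd * sin        # standard rotary update
    out[:, 1::2] = k_even * sin + k_odd * cos
    return out

def participative_prune(k: torch.Tensor, v: torch.Tensor,
                        recent_attn: torch.Tensor, keep: int):
    """Keep only cached tokens that recent queries actually attend to
    (one plausible reading of importance-aware KV cache pruning)."""
    score = recent_attn.mean(dim=0)                  # (N,) participation per token
    idx = score.topk(min(keep, k.size(0))).indices.sort().values
    return k[idx], v[idx]

# Toy demo with made-up sizes: a 16-slot window, half reserved for sinks.
if __name__ == "__main__":
    D, window = 8, 16
    freqs = 1.0 / (10000 ** (torch.arange(0, D, 2) / D))
    sink_k = torch.randn(window // 2, D)
    sink_pos = torch.arange(window // 2)             # sinks came from frames 0..7
    sink_k = realign_rope(sink_k, sink_pos, cur_pos=120, freqs=freqs)

    hist_k, hist_v = torch.randn(32, D), torch.randn(32, D)
    attn = torch.softmax(torch.randn(4, 32), dim=-1)  # stand-in recent attention
    hist_k, hist_v = participative_prune(hist_k, hist_v, attn, keep=window // 2)
    print(sink_k.shape, hist_k.shape)                 # (8, 8) and (8, 8)
```

Note that this sketch prunes individual cache entries; the actual method may operate on whole frames or token chunks and may use a different importance statistic than mean attention mass.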
Related papers
- EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation [8.795438456031512]
Multi-modal generation models have achieved high visual quality, but their prohibitive latency and limited temporal stability hinder real-time deployment. Streaming inference exacerbates these issues, leading to pronounced multimodal ambiguities such as blurring, temporal drift, and lip desynchronization. We propose EchoTorrent, a novel framework with a fourfold schema: Multi-Teacher Training fine-tunes a pre-trained model on distinct preference domains; Adaptive DMD (ACCDMD) calibrates audio CFG degradation errors in phases via a schedule; Long Hybrid Tail enforces alignment exclusively on tail frames during long-horizon self-rollout.
arXiv Detail & Related papers (2026-02-14T08:32:38Z)
- LoL: Longer than Longer, Scaling Video Generation to Hour [50.945885467651216]
This work achieves the first demonstration of real-time, streaming, and infinite-length video generation with little quality decay. As an illustration, we generate continuous videos up to 12 hours in length, which, to our knowledge, is among the longest publicly demonstrated results in streaming video generation.
arXiv Detail & Related papers (2026-01-23T17:21:35Z)
- Knot Forcing: Taming Autoregressive Video Diffusion Models for Real-time Infinite Interactive Portrait Animation [16.692450893925148]
We present a novel streaming framework named Knot Forcing for real-time portrait animation. Knot Forcing enables high-fidelity, temporally consistent, and interactive portrait animation over infinite sequences.
arXiv Detail & Related papers (2025-12-25T16:34:56Z)
- Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation [69.57572900337176]
We introduce Reward Forcing, a novel framework for efficient streaming video generation. EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying. Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model.
arXiv Detail & Related papers (2025-12-04T11:12:13Z)
- MotionStream: Real-Time Video Generation with Interactive Motion Controls [60.403597895657505]
We present MotionStream, enabling sub-second latency with up to 29 FPS streaming generation on a single GPU. Our approach begins by augmenting a text-to-video model with motion control, which generates high-quality videos that adhere to the global text prompt and local motion guidance but does not support on-the-fly inference. Our models achieve state-of-the-art results in motion following and video quality while being two orders of magnitude faster, uniquely enabling infinite-length streaming.
arXiv Detail & Related papers (2025-11-03T06:37:53Z)
- InfVSR: Breaking Length Limits of Generic Video Super-Resolution [40.30527504651693]
InfVSR is an autoregressive one-step-diffusion paradigm for long sequences. We distill the diffusion process into a single step efficiently, with patch-wise pixel supervision and cross-chunk distribution matching. Our method pushes the frontier of long-form VSR, achieves state-of-the-art quality with enhanced semantic consistency, and delivers up to a 58x speed-up over existing methods.
arXiv Detail & Related papers (2025-10-01T14:21:45Z)
- Rolling Forcing: Autoregressive Long Video Diffusion in Real Time [86.40480237741609]
Rolling Forcing is a novel video generation technique that enables streaming long videos with minimal error accumulation. Rolling Forcing comes with three novel designs. First, instead of iteratively sampling individual frames, which accelerates error propagation, we design a joint denoising scheme. Second, we introduce the attention sink mechanism into the long-horizon streaming video generation task, which allows the model to keep the key-value states of initial frames as a global context anchor. Third, we design an efficient training algorithm that enables few-step distillation over largely extended denoising windows.
arXiv Detail & Related papers (2025-09-29T17:57:14Z)
- Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion [67.94300151774085]
We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs.
arXiv Detail & Related papers (2025-06-09T17:59:55Z)