Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion
- URL: http://arxiv.org/abs/2602.07775v1
- Date: Sun, 08 Feb 2026 02:16:02 GMT
- Title: Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion
- Authors: Haodong Li, Shaoteng Liu, Zhe Lin, Manmohan Chandraker,
- Abstract summary: Autoregressive (AR) video diffusion models have achieved remarkable performance.<n>Due to their limited training durations, a train-test gap emerges when testing at longer horizons, leading to rapid visual degradations.<n>Rolling Sink scales the AR video synthesis to ultra-long durations at test time, with consistent subjects, stable colors, coherent structures, and smooth motions.
- Score: 62.3543999991324
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, autoregressive (AR) video diffusion models has achieved remarkable performance. However, due to their limited training durations, a train-test gap emerges when testing at longer horizons, leading to rapid visual degradations. Following Self Forcing, which studies the train-test gap within the training duration, this work studies the train-test gap beyond the training duration, i.e., the gap between the limited horizons during training and open-ended horizons during testing. Since open-ended testing can extend beyond any finite training window, and long-video training is computationally expensive, we pursue a training-free solution to bridge this gap. To explore a training-free solution, we conduct a systematic analysis of AR cache maintenance. These insights lead to Rolling Sink. Built on Self Forcing (trained on only 5s clips), Rolling Sink effectively scales the AR video synthesis to ultra-long durations (e.g., 5-30 minutes at 16 FPS) at test time, with consistent subjects, stable colors, coherent structures, and smooth motions. As demonstrated by extensive experiments, Rolling Sink achieves superior long-horizon visual fidelity and temporal consistency compared to SOTA baselines. Project page: https://rolling-sink.github.io/
Related papers
- Train Short, Inference Long: Training-free Horizon Extension for Autoregressive Video Generation [15.110494847628212]
FLEX is a training-free inference-time framework that bridges the gap between short-term training and long-term inference.<n>It significantly outperforms state-of-the-art models at 6x extrapolation (30s duration) and matches the performance of long-video fine-tuned baselines at 12x scale (60s duration)<n>As a plug-and-play augmentation, FLEX seamlessly integrates into existing inference pipelines for horizon extension.
arXiv Detail & Related papers (2026-02-15T07:14:47Z) - End-to-End Training for Autoregressive Video Diffusion via Self-Resampling [63.84672807009907]
Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch.<n>We introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale.
arXiv Detail & Related papers (2025-12-17T18:53:29Z) - LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling [87.98096428508181]
LongVT is an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought.<n>We exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames.<n>Our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning.
arXiv Detail & Related papers (2025-11-25T19:22:48Z) - Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion [67.94300151774085]
We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models.<n>It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs.
arXiv Detail & Related papers (2025-06-09T17:59:55Z) - Refining Pre-Trained Motion Models [56.18044168821188]
We take on the challenge of improving state-of-the-art supervised models with self-supervised training.
We focus on obtaining a "clean" training signal from real-world unlabelled video.
We show that our method yields reliable gains over fully-supervised methods in real videos.
arXiv Detail & Related papers (2024-01-01T18:59:33Z) - Accelerating the Training of Video Super-Resolution [26.449738545078986]
We show that it is possible to gradually train video models from small to large spatial/temporal sizes in an easy-to-hard manner.
Our method is capable of largely speeding up training (up to $6.2times$ speedup in wall-clock training time) without performance drop for various VSR models.
arXiv Detail & Related papers (2022-05-10T17:55:24Z) - A Large-Scale Study on Unsupervised Spatiotemporal Representation
Learning [60.720251418816815]
We present a large-scale study on unsupervised representation learning from videos.
Our objective encourages temporally-persistent features in the same video.
We find that encouraging long-spanned persistency can be effective even if the timespan is 60 seconds.
arXiv Detail & Related papers (2021-04-29T17:59:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.