Related papers: Train Short, Inference Long: Training-free Horizon Extension for Autoregressive Video Generation

URL: http://arxiv.org/abs/2602.14027v2
Date: Tue, 17 Feb 2026 04:53:36 GMT
Title: Train Short, Inference Long: Training-free Horizon Extension for Autoregressive Video Generation
Authors: Jia Li, Xiaomeng Fu, Xurui Peng, Weifeng Chen, Youwei Zheng, Tianyu Zhao, Jiexi Wang, Fangmin Chen, Xing Wang, Hayden Kwok-Hay So,
Abstract summary: FLEX is a training-free inference-time framework that bridges the gap between short-term training and long-term inference. It significantly outperforms state-of-the-art models at 6x extrapolation (30s duration) and matches the performance of long-video fine-tuned baselines at 12x scale (60s duration). As a plug-and-play augmentation, FLEX seamlessly integrates into existing inference pipelines for horizon extension.
Score: 15.110494847628212
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autoregressive video diffusion models have emerged as a scalable paradigm for long video generation. However, they often suffer from severe extrapolation failure, where rapid error accumulation leads to significant temporal degradation when extending beyond training horizons. We identify that this failure primarily stems from the spectral bias of 3D positional embeddings and the lack of dynamic priors in noise sampling. To address these issues, we propose FLEX (Frequency-aware Length EXtension), a training-free inference-time framework that bridges the gap between short-term training and long-term inference. FLEX introduces Frequency-aware RoPE Modulation to adaptively interpolate under-trained low-frequency components while extrapolating high-frequency ones to preserve multi-scale temporal discriminability. This is integrated with Antiphase Noise Sampling (ANS) to inject high-frequency dynamic priors and Inference-only Attention Sink to anchor global structure. Extensive evaluations on VBench demonstrate that FLEX significantly outperforms state-of-the-art models at 6x extrapolation (30s duration) and matches the performance of long-video fine-tuned baselines at 12x scale (60s duration). As a plug-and-play augmentation, FLEX seamlessly integrates into existing inference pipelines for horizon extension. It effectively pushes the generation limits of models such as LongLive, supporting consistent and dynamic video synthesis at a 4-minute scale. Project page is available at https://ga-lee.github.io/FLEX_demo.
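
The abstract describes Frequency-aware RoPE Modulation only at a high level: low-frequency positional components are interpolated, high-frequency ones are extrapolated. The following is a minimal sketch of that general idea, not the paper's actual method. The hard period threshold, the function names (`rope_frequencies`, `modulated_angles`), and the frame counts are all hypothetical; the abstract says the modulation is adaptive and does not give its exact form.

```python
import math


def rope_frequencies(dim, base=10000.0):
    """Standard RoPE angular frequencies for one axis with `dim` channels."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def modulated_angles(position, freqs, train_len, target_len):
    """Rotation angles for one temporal position under a per-band rule.

    ASSUMPTION: a hard threshold on the period decides the band. This is an
    illustrative stand-in, not the modulation rule used by FLEX.
    """
    scale = train_len / target_len  # below 1 when extending the horizon
    angles = []
    for f in freqs:
        period = 2.0 * math.pi / f
        if period > train_len:
            # Low frequency: no full cycle was seen during training, so
            # compress positions back into the trained angle range.
            angles.append(position * scale * f)
        else:
            # High frequency: every phase was seen in training, so keep the
            # original spacing to preserve fine temporal discriminability.
            angles.append(position * f)
    return angles


# Hypothetical setup: trained on 16 latent frames, generating 96 (6x).
freqs = rope_frequencies(8)
angles = modulated_angles(position=95, freqs=freqs, train_len=16, target_len=96)
```

With these hypothetical numbers, the highest-frequency channel keeps its raw angle at position 95, while the three lower-frequency channels stay below the largest angle they would have reached within the 16-frame training horizon.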

Related papers

EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation [8.795438456031512]
Multi-modal generation models have achieved high visual quality, but their prohibitive latency and limited temporal stability hinder real-time deployment. Streaming inference exacerbates these issues, leading to pronounced multimodal ambiguities, such as blurring, temporal drift, and lip desynchronization. We propose EchoTorrent, a novel framework with a fourfold schema: Multi-Teacher Training fine-tunes a pre-trained model on distinct preference domains; Adaptive DMD (ACCDMD) calibrates the audio CFG degradation errors in phases via a schedule; Long Hybrid Tail, which enforces alignment exclusively on tail frames during long-horizon self-roll
arXiv Detail & Related papers (2026-02-14T08:32:38Z)
LoL: Longer than Longer, Scaling Video Generation to Hour [50.945885467651216]
This work achieves the first demonstration of real-time, streaming, and infinite-length video generation with little quality decay. As an illustration, we generate continuous videos up to 12 hours in length, which, to our knowledge, is among the longest publicly demonstrated results in streaming video generation.
arXiv Detail & Related papers (2026-01-23T17:21:35Z)
Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers [95.68243351895107]
We propose a holistic, video-centric paradigm named Local Diffusion Forcing for Video Frame Interpolation (LDF-VFI). Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence. LDF-VFI achieves state-of-the-art performance on challenging long-sequence benchmarks, demonstrating superior per
arXiv Detail & Related papers (2026-01-21T12:58:52Z)
TIDAL: Temporally Interleaved Diffusion and Action Loop for High-Frequency VLA Control [15.534182843429043]
Large-scale Vision-Language-Action (VLA) models offer semantic generalization but suffer from high inference latency. We propose TIDAL, a hierarchical framework that decouples semantic reasoning from high-frequency actuation. TIDAL operates as a backbone-agnostic module for diffusion-based VLAs, using a dual-frequency architecture.
arXiv Detail & Related papers (2026-01-21T12:43:11Z)
SoulX-FlashTalk: Real-Time Infinite Streaming of Audio-Driven Avatars via Self-Correcting Bidirectional Distillation [16.34443339642213]
SoulX-FlashTalk is the first 14B-scale system to achieve a sub-second start-up latency (0.87s) while reaching a real-time throughput of 32 FPS.
arXiv Detail & Related papers (2025-12-29T11:18:24Z)
End-to-End Training for Autoregressive Video Diffusion via Self-Resampling [63.84672807009907]
Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch. We introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale.
arXiv Detail & Related papers (2025-12-17T18:53:29Z)
Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression [36.99018442740971]
We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. We introduce Deep Forcing, which consists of two training-free mechanisms that address this without any fine-tuning. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressively streaming long-video generation.
arXiv Detail & Related papers (2025-12-04T18:46:44Z)
Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference [5.146388234814547]
Long videos often exceed the token budget of modern language models, leading to severe context limitations and latency issues. We introduce Efficient Video Sampling (EVS), a simple, plug-and-play method for reducing token redundancy in videos by identifying and pruning temporally static patches. EVS substantially reduces token count while maintaining semantic fidelity, enabling faster inference and longer input sequences.
arXiv Detail & Related papers (2025-10-16T12:34:38Z)
Rolling Forcing: Autoregressive Long Video Diffusion in Real Time [86.40480237741609]
Rolling Forcing is a novel video generation technique that enables streaming long videos with minimal error accumulation. Rolling Forcing comes with three novel designs. First, instead of iteratively sampling individual frames, which accelerates error propagation, we design a joint denoising scheme. Second, we introduce the attention sink mechanism into the long-horizon stream video generation task, which allows the model to keep key value states of initial frames as a global context anchor. Third, we design an efficient training algorithm that enables few-step distillation over largely extended denoising windows.
arXiv Detail & Related papers (2025-09-29T17:57:14Z)
FLEX: A Backbone for Diffusion-Based Modeling of Spatio-temporal Physical Systems [51.15230303652732]
FLEX (FLow EXpert) is a backbone architecture for generative modeling of spatio-temporal physical systems. It reduces the variance of the velocity field in the diffusion model, which helps stabilize training. It achieves accurate predictions for super-resolution and forecasting tasks using as few as two reverse diffusion steps.
arXiv Detail & Related papers (2025-05-23T00:07:59Z)
Long-Context Autoregressive Video Modeling with Next-Frame Prediction [17.710915002557996]
Long-context video modeling is essential for enabling generative models to function as world simulators. While training directly on long videos is a natural solution, the rapid growth of vision tokens makes it computationally prohibitive. We propose Frame AutoRegressive (FAR), which models temporal dependencies between continuous frames, converges faster than video diffusion transformers, and outperforms token-level autoregressive models.
arXiv Detail & Related papers (2025-03-25T03:38:06Z)
STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers [29.663251658875673]
RIFLEx is a free lunch, achieving high-quality 2x extrapolation on state-of-the-art video diffusion transformers. It enhances quality and enables 3x extrapolation by minimal fine-tuning without long videos.
arXiv Detail & Related papers (2025-02-21T19:28:05Z)
FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention [57.651429116402554]
This paper investigates a straightforward and training-free approach to extend an existing short video diffusion model for consistent long video generation. We find that directly applying the short video diffusion model to generate long videos can lead to severe video quality degradation. Motivated by this, we propose a novel solution named FreeLong to balance the frequency distribution of long video features during the denoising process.
arXiv Detail & Related papers (2024-07-29T11:52:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.