Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout
- URL: http://arxiv.org/abs/2511.20649v1
- Date: Tue, 25 Nov 2025 18:59:46 GMT
- Title: Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout
- Authors: Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Pinar Yanardag
- Abstract summary: $\infty$-RoPE is a unified inference-time framework for autoregressive video diffusion. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame. KV Flush renews the KV cache by retaining only two latent frames: the global sink and the last generated latent frame. RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model's 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness in maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce $\infty$-RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, where each newly generated latent block is rotated relative to the base model's maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two latent frames, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout. Together, these components establish $\infty$-RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that $\infty$-RoPE consistently surpasses previous autoregressive models in overall VBench scores.
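The abstract's three mechanisms reduce to simple bookkeeping over temporal positions and the KV cache. The toy sketch below illustrates that bookkeeping only; it is not the authors' implementation. `MAX_FRAMES`, the list-based cache, and all function names are illustrative assumptions, and real RoPE operates on per-channel rotation angles rather than bare integer indices.

```python
# Toy sketch of the three inference-time mechanisms from the abstract.
# NOT the authors' code: MAX_FRAMES, the list-based KV cache, and all
# names here are illustrative assumptions.

MAX_FRAMES = 8  # hypothetical temporal horizon of the base model's 3D-RoPE

def relativistic_indices(num_frames: int) -> list[int]:
    """Block-Relativistic RoPE: pin the newest frame at the horizon
    (MAX_FRAMES - 1) and slide earlier frames backward, preserving
    relative temporal geometry while no index ever exceeds the base
    model's positional limit."""
    newest = MAX_FRAMES - 1
    return [newest - (num_frames - 1 - k) for k in range(num_frames)]

def kv_flush(cache: list) -> list:
    """KV Flush: retain only the global sink (first entry) and the most
    recently generated frame, so a new prompt takes effect immediately
    without re-encoding the full history."""
    return cache if len(cache) <= 2 else [cache[0], cache[-1]]

def rope_cut(indices: list[int], gap: int) -> list[int]:
    """RoPE Cut: offset the next block's temporal coordinates by `gap`,
    introducing a positional discontinuity that the model reads as a
    scene cut within one continuous rollout."""
    return [i + gap for i in indices]
```

For example, with `MAX_FRAMES = 8` a 3-frame block occupies indices `[5, 6, 7]` no matter how many frames were generated before it; frames whose index would fall below 0 are precisely those the KV flush has already discarded.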
Related papers
- LoL: Longer than Longer, Scaling Video Generation to Hour [50.945885467651216]
This work achieves the first demonstration of real-time, streaming, infinite-length video generation with little quality decay. As an illustration, we generate continuous videos up to 12 hours in length, which, to our knowledge, is among the longest publicly demonstrated results in streaming video generation.
arXiv Detail & Related papers (2026-01-23T17:21:35Z) - Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers [95.68243351895107]
We propose a holistic, video-centric paradigm named Local Diffusion Forcing for Video Frame Interpolation (LDF-VFI). Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence. LDF-VFI achieves state-of-the-art performance on challenging long-sequence benchmarks.
arXiv Detail & Related papers (2026-01-21T12:58:52Z) - MultiShotMaster: A Controllable Multi-Shot Video Generation Framework [67.38203939500157]
Current generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos. We propose MultiShotMaster, a framework for highly controllable multi-shot video generation.
arXiv Detail & Related papers (2025-12-02T18:59:48Z) - PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs [57.790910044227935]
Video LLMs suffer from temporal inconsistency: small shifts in frame timing can flip attention and suppress relevant frames. We present Phase Aggregated Smoothing (PAS), a training-free mechanism that applies small opposed phase offsets across heads and then aggregates their outputs. Our analysis shows that the RoPE-rotated logit can be approximated as a content dot product scaled by a time kernel; smoothing this kernel yields Lipschitz stability of attention to small temporal shifts, and multi-phase averaging attenuates high-frequency ripples while preserving per-head spectra under Nyquist-valid sampling.
arXiv Detail & Related papers (2025-11-14T05:56:47Z) - DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion [62.589889759543446]
DriveGen3D is a novel framework for generating high-quality and highly controllable dynamic 3D driving scenes. Our work bridges this methodological gap by integrating accelerated long-term video generation with large-scale dynamic scene reconstruction.
arXiv Detail & Related papers (2025-10-17T03:00:08Z) - InfVSR: Breaking Length Limits of Generic Video Super-Resolution [40.30527504651693]
InfVSR is an autoregressive one-step-diffusion paradigm for long sequences. We distill the diffusion process into a single step efficiently, with patch-wise pixel supervision and cross-chunk distribution matching. Our method pushes the frontier of long-form VSR, achieving state-of-the-art quality with enhanced semantic consistency and delivering up to a 58x speed-up over existing methods.
arXiv Detail & Related papers (2025-10-01T14:21:45Z) - Arbitrary Generative Video Interpolation [27.953958715353608]
Video frame interpolation (VFI) generates intermediate frames from given start and end frames. Existing VFI methods are constrained to synthesizing a fixed number of intermediate frames. We present ArbInterp, a novel generative VFI framework that enables efficient synthesis at any timestamp.
arXiv Detail & Related papers (2025-10-01T06:57:10Z) - RAPID^3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformer [86.57077884971478]
Diffusion Transformers (DiTs) excel at visual generation yet remain hampered by slow sampling. We introduce RAPID3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformers. It delivers image-wise acceleration with zero updates to the base generator, achieving nearly 3x faster sampling with competitive generation quality.
arXiv Detail & Related papers (2025-09-26T13:20:52Z) - VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate [16.826081397057774]
VGDFR is a training-free approach for diffusion-based video generation with a dynamic latent frame rate. We show that VGDFR can achieve a speedup of up to 3x for video generation with minimal quality degradation.
arXiv Detail & Related papers (2025-04-16T17:09:13Z) - VRoPE: Rotary Position Embedding for Video Large Language Models [20.76019756946152]
Rotary Position Embedding (RoPE) has shown strong performance in text-based Large Language Models (LLMs), but extending it to video remains a challenge due to the intricate structure of video frames. We propose VRoPE, a novel positional encoding method tailored for Video-LLMs.
arXiv Detail & Related papers (2025-02-17T10:53:57Z) - VideoRoPE: What Makes for Good Video Rotary Position Embedding? [109.88966080843608]
VideoRoPE consistently surpasses previous RoPE variants across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. VideoRoPE features low-frequency temporal allocation to mitigate periodic oscillations, a diagonal layout to maintain spatial symmetry, and adjustable temporal spacing to decouple temporal and spatial indexing.
arXiv Detail & Related papers (2025-02-07T18:56:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.