Recurrent Autoregressive Diffusion: Global Memory Meets Local Attention
- URL: http://arxiv.org/abs/2511.12940v1
- Date: Mon, 17 Nov 2025 03:47:12 GMT
- Title: Recurrent Autoregressive Diffusion: Global Memory Meets Local Attention
- Authors: Taiye Chen, Zihan Ding, Anjian Li, Christina Zhang, Zeqi Xiao, Yisen Wang, Chi Jin
- Abstract summary: We propose a novel Recurrent Autoregressive Diffusion (RAD) framework, which executes frame-wise autoregression for memory update and retrieval. Experiments on the Memory Maze and Minecraft datasets demonstrate the superiority of RAD for long video generation.
- Score: 40.10862285690496
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in video generation have demonstrated the potential of using video diffusion models as world models, with autoregressive generation of infinitely long videos through masked conditioning. However, such models, usually with local full attention, lack effective memory compression and retrieval for long-term generation beyond the window size, leading to forgetting and spatiotemporal inconsistencies. To enhance the retention of historical information within a fixed memory budget, we introduce a recurrent neural network (RNN) into the diffusion transformer framework. Specifically, a diffusion model incorporating an LSTM with attention achieves performance comparable to state-of-the-art RNN blocks such as TTT and Mamba2. Moreover, existing diffusion-RNN approaches often suffer from performance degradation due to a training-inference gap or a lack of overlap across windows. To address these limitations, we propose a novel Recurrent Autoregressive Diffusion (RAD) framework, which executes frame-wise autoregression for memory update and retrieval, consistently across training and inference time. Experiments on the Memory Maze and Minecraft datasets demonstrate the superiority of RAD for long video generation, highlighting the efficiency of LSTM in sequence modeling.
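The core mechanism, windowed local attention paired with a recurrent global memory that is updated and read frame by frame in the same way at training and inference time, can be sketched as follows. This is a minimal illustration assuming generic PyTorch modules; the class name, the single memory token, and all shapes are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RecurrentMemoryBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.read = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.memory = nn.LSTMCell(dim, dim)  # global memory update

    def forward(self, frame_tokens, window_tokens, state):
        h, c = state
        # Retrieval: the current frame attends to the recurrent memory.
        mem = h.unsqueeze(1)                              # (B, 1, D)
        x = frame_tokens + self.read(frame_tokens, mem, mem)[0]
        # Local attention over the recent window (fixed compute budget).
        x = x + self.local_attn(x, window_tokens, window_tokens)[0]
        # Update: frame-wise autoregressive memory write; the same op runs
        # at training and inference, avoiding a train-test gap.
        h, c = self.memory(x.mean(dim=1), (h, c))
        return x, (h, c)

B, T, N, D = 2, 4, 16, 64        # batch, frames, tokens per frame, width
block = RecurrentMemoryBlock(D)
state = (torch.zeros(B, D), torch.zeros(B, D))
frames = torch.randn(B, T, N, D)
for t in range(T):               # frame-wise autoregression
    window = frames[:, max(0, t - 2):t + 1].flatten(1, 2)
    out, state = block(frames[:, t], window, state)
```

The point of the design, as the abstract describes it, is that the memory budget stays fixed regardless of how many frames have been generated: only the LSTM state carries history beyond the local window.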
Related papers
- Causal Autoregressive Diffusion Language Model [70.7353007255797]
CARD reformulates the diffusion process within a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass.
Our results demonstrate that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation.
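A hedged sketch of that idea, not the CARD implementation: run the denoiser under a causal attention mask so every position yields a loss in one forward pass. The toy corruption below stands in for the actual diffusion process.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, vocab, L, B = 64, 100, 8, 2
layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
embed, head = nn.Embedding(vocab, dim), nn.Linear(dim, vocab)

tokens = torch.randint(0, vocab, (B, L))
t = torch.rand(B, 1, 1)                      # per-sequence noise level
noisy = (1 - t) * embed(tokens) + t * torch.randn(B, L, dim)

# Causal mask: position i may only attend to positions <= i.
causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
logits = head(layer(noisy, src_mask=causal))  # (B, L, vocab)
# Dense per-token supervision in a single pass.
loss = F.cross_entropy(logits.flatten(0, 1), tokens.flatten())
```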
arXiv Detail & Related papers (2026-01-29T17:38:29Z)
- VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction [31.191310873846177]
VideoAR is the first large-scale Visual Autoregressive framework for video generation that combines multi-scale next-frame prediction with autoregressive modeling.
VideoAR disentangles spatial and temporal dependencies by integrating intra-frame VAR with causal next-frame prediction, supported by a 3D multi-scale tokenizer.
Empirically, VideoAR achieves new state-of-the-art results among autoregressive models, improving FVD on UCF-101 from 99.5 to 88.6 while reducing inference steps by over 10x, and reaching a VBench score of 81.74, competitive with diffusion-based models.
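A speculative sketch of how multi-scale next-frame prediction might proceed under our reading of the summary: each frame is refined coarse-to-fine (intra-frame VAR) while frames are produced causally. The scale schedule, the `predict` placeholder, and the residual upsampling are assumptions, not the actual VideoAR tokenizer or model.

```python
import torch
import torch.nn.functional as F

def predict(context, shape):
    # Placeholder for an autoregressive predictor conditioned on `context`.
    return torch.randn(*shape)

scales = [(2, 2), (4, 4), (8, 8)]       # coarse -> fine token maps
num_frames, C = 3, 16
frames = []
for _ in range(num_frames):             # causal next-frame prediction
    context = torch.stack(frames).mean(0) if frames else torch.zeros(C, 8, 8)
    frame = torch.zeros(C, *scales[-1])
    for h, w in scales:                 # refine the frame scale by scale
        residual = predict(context, (C, h, w))
        frame = frame + F.interpolate(residual[None], size=scales[-1])[0]
    frames.append(frame)
video = torch.stack(frames)             # (T, C, H, W)
```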
arXiv Detail & Related papers (2026-01-09T17:34:59Z)
- End-to-End Training for Autoregressive Video Diffusion via Self-Resampling [63.84672807009907]
Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch.
We introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale.
arXiv Detail & Related papers (2025-12-17T18:53:29Z)
- Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion [67.94300151774085]
We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models.
It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs.
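The gap can be illustrated with a generic sequence model: condition each training step on the model's own rollout instead of the ground truth, so training matches how inference actually proceeds. The GRU and MSE loss below are our simplifications standing in for the paper's video diffusion model and objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.GRU(16, 16, batch_first=True)   # stand-in for the generator
proj = nn.Linear(16, 16)
gt = torch.randn(2, 10, 16)                # ground-truth frame features

context = gt[:, :1]                        # seed with the first real frame
loss = torch.zeros(())
for t in range(1, gt.size(1)):
    out, _ = model(context)
    pred = proj(out[:, -1:])
    loss = loss + F.mse_loss(pred, gt[:, t:t + 1])
    # Key step: feed the model's own (imperfect) output back in as the
    # next conditioning frame rather than the ground-truth frame.
    context = torch.cat([context, pred], dim=1)
loss.backward()
```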
arXiv Detail & Related papers (2025-06-09T17:59:55Z)
- Long-Context State-Space Video World Models [66.28743632951218]
We propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency.
Central to our design is a block-wise SSM scanning scheme, which strategically trades off spatial consistency for extended temporal memory.
Experiments on Memory Maze and Minecraft datasets demonstrate that our approach surpasses baselines in preserving long-range memory.
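A rough sketch of a block-wise temporal scan under our reading of the summary: each frame's tokens are split into spatial blocks, and each block is scanned independently over time. The diagonal linear recurrence is a stand-in for the paper's actual SSM block.

```python
import torch

B, T, N, D = 2, 12, 16, 32          # batch, time, tokens per frame, width
n_blocks = 4                         # spatial blocks scanned independently
x = torch.randn(B, T, N, D)
blocks = x.view(B, T, n_blocks, N // n_blocks, D)

decay = torch.sigmoid(torch.randn(D))          # per-channel state decay
state = torch.zeros(B, n_blocks, N // n_blocks, D)
outs = []
for t in range(T):                   # O(T) scan: memory beyond any window
    state = decay * state + blocks[:, t]
    outs.append(state)
y = torch.stack(outs, dim=1).reshape(B, T, N, D)
```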
arXiv Detail & Related papers (2025-05-26T16:12:41Z)
- Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework [127.61297651993561]
A variety of Auto-Regressive Video Diffusion Models (ARVDM) have achieved remarkable successes in generating realistic long-form videos.
We develop theoretical underpinnings for these models and use our insights to improve the performance of existing models.
arXiv Detail & Related papers (2025-03-12T15:32:44Z)
- AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion [19.98565541640125]
We introduce Auto-Regressive Diffusion (AR-Diffusion), a novel model that combines the strengths of auto-regressive and diffusion models for flexible video generation.
Inspired by auto-regressive generation, we incorporate a non-decreasing constraint on the corruption timesteps of individual frames.
This setup, together with temporal causal attention, enables flexible generation of videos with varying lengths while preserving temporal coherence.
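The constraint itself is simple to illustrate: draw one diffusion timestep per frame, then sort along the frame axis so that a later frame is never cleaner than an earlier one. A minimal sketch, with the timestep range assumed:

```python
import torch

def sample_nondecreasing_timesteps(batch, frames, T=1000):
    t = torch.randint(0, T, (batch, frames))
    return t.sort(dim=1).values      # enforce t_1 <= t_2 <= ... <= t_F

timesteps = sample_nondecreasing_timesteps(2, 8)
# Earlier frames carry less noise, so together with temporal causal
# attention they can serve as conditioning context for later frames.
```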
arXiv Detail & Related papers (2025-03-10T15:05:59Z)
- ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer [95.80384464922147]
ACDiT is a blockwise Conditional Diffusion Transformer.
It offers a flexible interpolation between token-wise autoregression and full-sequence diffusion.
We show that ACDiT performs best among all autoregressive baselines on image and video generation tasks.
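One way to picture the interpolation: generate block by block, running diffusion inside each block conditioned on the clean blocks produced so far, so that a block size of 1 recovers token-wise autoregression and a block size equal to the sequence length recovers full-sequence diffusion. The `denoise` placeholder below is ours, not the ACDiT model.

```python
import torch

def denoise(noisy_block, clean_context):
    return 0.5 * noisy_block + 0.1 * clean_context.mean()  # placeholder

L, D, block_size, steps = 12, 8, 4, 10
done = torch.zeros(0, D)                      # clean blocks so far
for start in range(0, L, block_size):
    block = torch.randn(block_size, D)        # start from pure noise
    for _ in range(steps):                    # diffusion within the block
        block = denoise(block, done) if done.numel() else 0.5 * block
    done = torch.cat([done, block])           # block becomes AR context
```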
arXiv Detail & Related papers (2024-12-10T18:13:20Z)
- LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? [10.72249123249003]
We revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding.
We introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions.
LaDiC achieves state-of-the-art performance for diffusion-based methods on the MS COCO dataset with 38.2 BLEU@4 and 126.2 CIDEr.
arXiv Detail & Related papers (2024-04-16T17:47:16Z)
- SSM Meets Video Diffusion Models: Efficient Long-Term Video Generation with Structured State Spaces [20.23192934634197]
Recent diffusion models for video generation have predominantly utilized attention layers to extract temporal features.
This limitation presents significant challenges when generating longer video sequences using diffusion models.
We propose leveraging state-space models (SSMs) as temporal feature extractors.
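In code, the substitution amounts to replacing a temporal self-attention layer with a linear-time scan over frames. The gated per-channel recurrence below is a stand-in for a structured state-space layer, not the paper's exact block.

```python
import torch
import torch.nn as nn

class TemporalSSM(nn.Module):
    """Drop-in temporal mixer for video features shaped (B, T, N, D)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        a = torch.sigmoid(self.gate)          # learned per-channel decay
        state = torch.zeros_like(x[:, 0])
        out = []
        for t in range(x.size(1)):            # O(T) time and memory
            state = a * state + (1 - a) * x[:, t]
            out.append(self.proj(state))
        return torch.stack(out, dim=1)

feats = torch.randn(2, 24, 16, 32)            # longer sequences stay cheap
mixed = TemporalSSM(32)(feats)                # (B, T, N, D), same shape
```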
arXiv Detail & Related papers (2024-03-12T14:53:56Z)