Memorize-and-Generate: Towards Long-Term Consistency in Real-Time Video Generation
- URL: http://arxiv.org/abs/2512.18741v2
- Date: Tue, 23 Dec 2025 16:47:46 GMT
- Title: Memorize-and-Generate: Towards Long-Term Consistency in Real-Time Video Generation
- Authors: Tianrui Zhu, Shiyi Zhang, Zhirui Sun, Jingqi Tian, Yansong Tang
- Abstract summary: Memorize-and-Generate (MAG) is a framework that decouples memory compression and frame generation into distinct tasks. We train a memory model to compress historical information into a compact KV cache, and a separate generator model to synthesize subsequent frames utilizing this compressed representation. Experiments demonstrate that MAG achieves superior historical consistency while maintaining competitive performance on standard video generation benchmarks.
- Score: 33.32047364623734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Frame-level autoregressive (frame-AR) models have achieved significant progress, enabling real-time video generation comparable to bidirectional diffusion models and serving as a foundation for interactive world models and game engines. However, current approaches in long video generation typically rely on window attention, which naively discards historical context outside the window, leading to catastrophic forgetting and scene inconsistency; conversely, retaining full history incurs prohibitive memory costs. To address this trade-off, we propose Memorize-and-Generate (MAG), a framework that decouples memory compression and frame generation into distinct tasks. Specifically, we train a memory model to compress historical information into a compact KV cache, and a separate generator model to synthesize subsequent frames utilizing this compressed representation. Furthermore, we introduce MAG-Bench to strictly evaluate historical memory retention. Extensive experiments demonstrate that MAG achieves superior historical scene consistency while maintaining competitive performance on standard video generation benchmarks.
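The decoupling described in the abstract lends itself to a two-module layout: a compressor that folds arbitrary-length history into a fixed-size KV cache, and a generator that attends to that cache. The PyTorch sketch below is a minimal illustration of that idea; the slot-attention-style compressor, the module names, and all dimensions are assumptions made for exposition, not MAG's actual architecture.

```python
import torch
import torch.nn as nn

class MemoryCompressor(nn.Module):
    """Folds a growing token history into a fixed-size KV cache.

    Learned query slots cross-attend to the history, so the cache stays
    (num_slots, dim) regardless of how many frames came before.
    """
    def __init__(self, dim=512, num_slots=64, num_heads=8):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_kv = nn.Linear(dim, 2 * dim)

    def forward(self, history):                      # history: (B, T_hist, dim)
        q = self.slots.unsqueeze(0).expand(history.shape[0], -1, -1)
        compressed, _ = self.attn(q, history, history)
        return self.to_kv(compressed).chunk(2, dim=-1)   # fixed-size K, V

class FrameGenerator(nn.Module):
    """One transformer block that reads the compressed memory via cross-attention."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, frame, mem_k, mem_v):          # frame: (B, N, dim)
        x = frame + self.self_attn(frame, frame, frame)[0]
        x = x + self.cross_attn(x, mem_k, mem_v)[0]  # read compressed history
        return x + self.ff(x)

compressor, generator = MemoryCompressor(), FrameGenerator()
history = torch.randn(1, 256, 512)                   # tokens from frames so far
for _ in range(4):                                   # rolling autoregressive loop
    mem_k, mem_v = compressor(history)               # cache size is constant
    frame = generator(torch.randn(1, 64, 512), mem_k, mem_v)
    history = torch.cat([history, frame], dim=1)
```

The point of the shape bookkeeping is that `mem_k`/`mem_v` stay `(B, num_slots, dim)` no matter how long `history` grows, which is what bounds memory cost during rollout; a real system would presumably update the cache incrementally rather than re-reading the raw history each step as this toy loop does.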
Related papers
- StoryMem: Multi-shot Long Video Storytelling with Memory [32.97816766878247]
We propose StoryMem, a paradigm that reformulates long-form video storytelling as iterative shot synthesis conditioned on explicit visual memory; a toy sketch of the iterative loop follows this entry. The proposed framework naturally accommodates smooth shot transitions and customized story generation applications.
arXiv Detail & Related papers (2025-12-22T16:23:24Z)
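Read as an algorithm, StoryMem's "iterative shot synthesis conditioned on explicit visual memory" could be a loop like the following sketch. Everything here is assumed for illustration: the keyframe-per-shot memory policy, the `VisualMemory` container, and the `synthesize_shot` stand-in for the actual video model.

```python
from dataclasses import dataclass, field
import torch

@dataclass
class VisualMemory:
    """Keeps keyframe features from previously generated shots."""
    keyframes: list = field(default_factory=list)

    def add(self, shot):                       # shot: (T, C, H, W)
        self.keyframes.append(shot[-1])        # e.g. keep each shot's last frame

    def as_condition(self):
        if not self.keyframes:
            return None
        return torch.stack(self.keyframes)     # (num_shots, C, H, W)

def generate_story(shot_prompts, synthesize_shot):
    """synthesize_shot(prompt, memory_cond) -> (T, C, H, W) stands in for
    whatever model actually performs the per-shot synthesis."""
    memory, shots = VisualMemory(), []
    for prompt in shot_prompts:
        shot = synthesize_shot(prompt, memory.as_condition())
        memory.add(shot)                       # memory grows shot by shot
        shots.append(shot)
    return torch.cat(shots)                    # full multi-shot video
```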
- MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives [54.07515675393396]
Existing solutions maintain memory by compressing historical frames with predefined strategies. We propose MemFlow to address this problem. MemFlow achieves outstanding long-context consistency with negligible overhead.
arXiv Detail & Related papers (2025-12-16T18:59:59Z)
- RELIC: Interactive Video World Model with Long-Horizon Memory [74.81433479334821]
A truly interactive world model requires real-time long-horizon streaming, consistent spatial memory, and precise user control. We present RELIC, a unified framework that tackles all three challenges at once. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time.
arXiv Detail & Related papers (2025-12-03T18:29:20Z)
- Recurrent Autoregressive Diffusion: Global Memory Meets Local Attention [40.10862285690496]
We propose a novel Recurrent Autoregressive Diffusion (RAD) framework, which executes frame-wise autoregression for memory update and retrieval; a toy recurrence sketch follows this entry. Experiments on the Memory Maze and Minecraft datasets demonstrate the superiority of RAD for long video generation.
arXiv Detail & Related papers (2025-11-17T03:47:12Z)
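The frame-wise "memory update and retrieval" in RAD suggests a recurrent global memory that is read before and written after each frame. The sketch below is one plausible instantiation; the gated slot update and all sizes are assumptions, not RAD's actual recurrence.

```python
import torch
import torch.nn as nn

class RecurrentMemory(nn.Module):
    """Fixed-size global memory, read and written once per generated frame."""
    def __init__(self, dim=512, num_slots=32, num_heads=8):
        super().__init__()
        self.state = None                      # lazily initialized per rollout
        self.init_state = nn.Parameter(torch.randn(num_slots, dim))
        self.read = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.write = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def retrieve(self, frame_tokens):          # (B, N, dim) -> (B, N, dim)
        if self.state is None:
            self.state = self.init_state.unsqueeze(0).expand(
                frame_tokens.shape[0], -1, -1)
        out, _ = self.read(frame_tokens, self.state, self.state)
        return out                             # history context for this frame

    def update(self, frame_tokens):
        new, _ = self.write(self.state, frame_tokens, frame_tokens)
        g = torch.sigmoid(self.gate(torch.cat([self.state, new], dim=-1)))
        self.state = g * new + (1 - g) * self.state   # gated memory update

mem = RecurrentMemory()
for _ in range(3):                             # per-frame autoregressive loop
    frame = torch.randn(2, 64, 512)
    context = mem.retrieve(frame)              # read global memory
    mem.update(frame)                          # write the new frame in
```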
- Mixture of Contexts for Long Video Generation [72.96361488755986]
We recast long-context video generation as an internal information retrieval task. We propose a simple, learnable sparse attention routing module, Mixture of Contexts (MoC), as an effective long-term memory retrieval engine; a toy routing sketch follows this entry. As we scale the data and gradually sparsify the routing, the model allocates compute to salient history, preserving identities, actions, and scenes over minutes of content.
arXiv Detail & Related papers (2025-08-28T17:57:55Z)
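A "learnable sparse attention routing module" can be pictured as top-k chunk selection followed by attention restricted to the selected chunks. The function below is a toy, non-learnable version under assumptions (mean-pooled chunk descriptors, dot-product routing); MoC's real router is trained end-to-end.

```python
import torch
import torch.nn.functional as F

def mixture_of_contexts(query, chunks, top_k=4):
    """Route each query token to its top-k history chunks, attend only there.

    query:  (N, d) tokens of the frame being generated
    chunks: (M, L, d) history split into M chunks of L tokens
    """
    desc = chunks.mean(dim=1)                            # (M, d) chunk descriptors
    scores = query @ desc.T                              # (N, M) routing scores
    idx = scores.topk(top_k, dim=-1).indices             # (N, top_k) chosen chunks
    selected = chunks[idx].flatten(1, 2)                 # (N, top_k*L, d)
    attn = F.softmax(
        torch.einsum("nd,nkd->nk", query, selected) / query.shape[-1] ** 0.5,
        dim=-1)
    return torch.einsum("nk,nkd->nd", attn, selected)    # sparse attention output
```

Compute scales with `top_k * L` per query rather than with the full history length, which is the whole appeal of routing over dense attention.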
- VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory [55.73900731190389]
We introduce Surfel-Indexed View Memory (VMem), a memory module that remembers past views by indexing them geometrically based on the 3D surface elements (surfels) they have observed. VMem enables efficient retrieval of the most relevant past views when generating new ones; a toy inverted-index sketch follows this entry.
arXiv Detail & Related papers (2025-06-23T17:59:56Z)
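Surfel indexing boils down to an inverted index from surfel IDs to the views that observed them; retrieving memory for a new view then becomes a voting problem. A minimal sketch, assuming surfel IDs come from some external geometry estimate:

```python
from collections import Counter, defaultdict

class SurfelViewMemory:
    """Index past views by the surfels they observed, so retrieval returns
    views that saw the same geometry as the view being generated."""
    def __init__(self):
        self.surfel_to_views = defaultdict(set)

    def add_view(self, view_id, observed_surfels):
        for s in observed_surfels:
            self.surfel_to_views[s].add(view_id)

    def retrieve(self, target_surfels, k=4):
        votes = Counter()
        for s in target_surfels:               # each shared surfel is one vote
            for v in self.surfel_to_views.get(s, ()):
                votes[v] += 1
        return [v for v, _ in votes.most_common(k)]
```

Because lookup touches only the surfels visible in the target view, retrieval cost scales with scene overlap rather than with the total number of past frames.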
- Video World Models with Long-term Spatial Memory [110.530715838396]
We introduce a novel framework to enhance the long-term consistency of video world models. Our framework includes mechanisms to store and retrieve information from the long-term spatial memory; a pose-keyed toy sketch follows this entry. Our evaluations show improved quality, consistency, and context length compared to relevant baselines.
arXiv Detail & Related papers (2025-06-05T17:42:34Z)
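The summary does not say how the spatial memory is keyed, so the sketch below assumes the simplest option: store frame features under camera positions and retrieve the nearest entries at generation time. The class name and the nearest-neighbour rule are illustrative, not the paper's design.

```python
import torch

class SpatialMemory:
    """Store frame features keyed by camera position; retrieve nearby entries."""
    def __init__(self):
        self.positions, self.features = [], []

    def store(self, position, feature):        # position: (3,), feature: (d,)
        self.positions.append(position)
        self.features.append(feature)

    def retrieve(self, position, k=4):
        if not self.positions:
            return None
        pos = torch.stack(self.positions)      # (N, 3) stored camera positions
        dist = (pos - position).norm(dim=-1)   # distance to the query pose
        idx = dist.topk(min(k, len(dist)), largest=False).indices
        return torch.stack(self.features)[idx]  # (k, d) nearest memories
```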
- Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval [33.15952106579093]
We propose Context-as-Memory, which utilizes historical context as memory for video generation. Given the enormous computational overhead of incorporating all historical context, we propose a Memory Retrieval module; a toy retrieval sketch follows this entry. Experiments demonstrate that Context-as-Memory achieves superior memory capabilities in interactive long video generation compared to SOTAs.
arXiv Detail & Related papers (2025-06-03T17:59:05Z)
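One concrete way to build such a Memory Retrieval module is to score historical frames by how much their camera view overlaps the target view and keep only the top scorers. The NumPy sketch below uses a pure view-direction overlap test as a stand-in; the paper's actual retrieval criterion may differ.

```python
import numpy as np

def fov_overlap_retrieval(history_dirs, target_dir, k=4, fov_deg=90.0):
    """Pick the k historical frames whose view directions overlap most
    with the target camera's field of view.

    history_dirs: (N, 3) unit view directions of past frames
    target_dir:   (3,) unit view direction of the frame to generate
    """
    cos = history_dirs @ target_dir                  # angular similarity
    visible = cos > np.cos(np.deg2rad(fov_deg))      # coarse overlap test
    scores = np.where(visible, cos, -np.inf)         # mask non-overlapping views
    return np.argsort(-scores)[:k]                   # indices of best frames
```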
- Long-Context State-Space Video World Models [66.28743632951218]
We propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency. Central to our design is a block-wise SSM scanning scheme, which strategically trades off spatial consistency for extended temporal memory; a simplified scan sketch follows this entry. Experiments on Memory Maze and Minecraft datasets demonstrate that our approach surpasses baselines in preserving long-range memory.
arXiv Detail & Related papers (2025-05-26T16:12:41Z)
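"Block-wise SSM scanning" can be read as: split each frame's tokens into spatial blocks and run an independent temporal state-space recurrence per block, so temporal memory extends while blocks never exchange information inside the scan. The diagonal recurrence below is a deliberate simplification of practical SSM layers (no discretization or selective gating).

```python
import torch
import torch.nn as nn

class BlockwiseSSM(nn.Module):
    """Independent linear state-space recurrence over time per spatial block.

    Blocks not mixing within the scan is the "spatial consistency for
    temporal memory" trade-off named in the abstract.
    """
    def __init__(self, dim=256, state=64):
        super().__init__()
        self.A = nn.Parameter(torch.rand(state) * 0.9)  # per-channel decay
        self.B = nn.Linear(dim, state)                  # input projection
        self.C = nn.Linear(state, dim)                  # output projection

    def forward(self, x):            # x: (B, T, blocks, block_tokens, dim)
        bsz, T, nb, nt, d = x.shape
        h = x.new_zeros(bsz, nb, nt, self.A.shape[0])
        ys = []
        for t in range(T):           # sequential scan over time only
            h = self.A * h + self.B(x[:, t])    # per-block state update
            ys.append(self.C(h))
        return torch.stack(ys, dim=1)           # same shape as the input
```

Since the recurrent state `h` is fixed-size, the temporal horizon the model can remember is decoupled from attention window length, which is the efficiency argument the abstract makes.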
- InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO [73.33751812982342]
InfLVG is an inference-time framework that enables coherent long video generation without requiring additional long-form video data. We show that InfLVG can extend video length by up to 9×, achieving strong consistency and semantic fidelity across scenes.
arXiv Detail & Related papers (2025-05-23T07:33:25Z)