Pack and Force Your Memory: Long-form and Consistent Video Generation
- URL: http://arxiv.org/abs/2510.01784v2
- Date: Fri, 03 Oct 2025 16:01:28 GMT
- Title: Pack and Force Your Memory: Long-form and Consistent Video Generation
- Authors: Xiaofei Wu, Guozhen Zhang, Zhiyong Xu, Yuan Zhou, Qinglin Lu, Xuming He,
- Abstract summary: Long-form video generation presents a dual challenge. Models must capture long-range dependencies while preventing the error accumulation inherent in autoregressive decoding. MemoryPack and Direct Forcing substantially enhance the context consistency and reliability of long-form video generation.
- Score: 26.53691150499802
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long-form video generation presents a dual challenge: models must capture long-range dependencies while preventing the error accumulation inherent in autoregressive decoding. To address these challenges, we make two contributions. First, for dynamic context modeling, we propose MemoryPack, a learnable context-retrieval mechanism that leverages both textual and image information as global guidance to jointly model short- and long-term dependencies, achieving minute-level temporal consistency. This design scales gracefully with video length, preserves computational efficiency, and maintains linear complexity. Second, to mitigate error accumulation, we introduce Direct Forcing, an efficient single-step approximating strategy that improves training-inference alignment and thereby curtails error propagation during inference. Together, MemoryPack and Direct Forcing substantially enhance the context consistency and reliability of long-form video generation, advancing the practical usability of autoregressive video models.
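The abstract describes both contributions only at a high level, so the following is a minimal PyTorch sketch of the kind of learnable context retrieval that MemoryPack suggests: a fixed bank of learnable query slots cross-attends over the frame-token history, optionally prefixed with text/image guidance tokens, and the resulting constant-size packed memory (rather than the raw history) conditions the next chunk, keeping cost linear in video length. All names, shapes, and the cross-attention design here are illustrative assumptions, not the paper's released implementation.

```python
# Hedged sketch of a learnable context-retrieval ("memory packing") module.
# Module name, slot count, and cross-attention design are assumptions.
import torch
import torch.nn as nn

class MemoryPacker(nn.Module):
    """Compress a growing frame history into a fixed-size memory.

    A fixed bank of learnable queries cross-attends to the frame tokens,
    so the packed memory has constant size: each new chunk is packed once
    and the packed memory (not the raw history) conditions generation.
    """
    def __init__(self, dim: int, num_slots: int = 64, num_heads: int = 8):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, history_tokens: torch.Tensor,
                guidance: torch.Tensor | None = None) -> torch.Tensor:
        # history_tokens: (B, T, dim) flattened frame tokens seen so far.
        # guidance: optional (B, G, dim) text/image tokens used as global
        # guidance, prepended to the keys/values.
        b = history_tokens.size(0)
        q = self.slots.unsqueeze(0).expand(b, -1, -1)           # (B, S, dim)
        kv = history_tokens if guidance is None else torch.cat(
            [guidance, history_tokens], dim=1)
        packed, _ = self.attn(q, kv, kv)                        # (B, S, dim)
        return self.norm(packed)                                # fixed-size memory
```

Because the packed memory has a fixed number of slots regardless of how many frames have been generated, conditioning each new chunk costs the same, which matches the abstract's linear-complexity claim. On the abstract's description alone, Direct Forcing would amount to conditioning some training steps on the model's own single-step approximation of the previous chunk instead of ground truth, so that training-time context statistics match inference; the exact procedure is not spelled out above, so no sketch is attempted.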
Related papers
- KlingAvatar 2.0 Technical Report [43.949604396366425]
Our model effectively addresses the challenges of efficient, multimodally aligned long-form high-resolution video generation. It delivers enhanced visual clarity, realistic lip-teeth rendering with accurate lip synchronization, strong identity preservation, and coherent multimodal instruction following.
arXiv Detail & Related papers (2025-12-15T13:30:51Z)
- RELIC: Interactive Video World Model with Long-Horizon Memory [74.81433479334821]
A truly interactive world model requires real-time long-horizon streaming, consistent spatial memory, and precise user control. We present RELIC, a unified framework that tackles these three challenges together. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time.
arXiv Detail & Related papers (2025-12-03T18:29:20Z)
- Uniform Discrete Diffusion with Metric Path for Video Generation [103.86033350602908]
Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-duration inconsistency. We revisit uniform discrete generative modeling and present URSA, a powerful framework that bridges the gap with continuous approaches for scalable video generation. URSA consistently outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods.
arXiv Detail & Related papers (2025-10-28T17:59:57Z)
- Mixture of Contexts for Long Video Generation [72.96361488755986]
We recast long-context video generation as an internal information retrieval task. We propose a simple, learnable sparse attention routing module, Mixture of Contexts (MoC), as an effective long-term memory retrieval engine. As we scale the data and gradually sparsify the routing, the model allocates compute to salient history, preserving identities, actions, and scenes over minutes of content.
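To make the routing idea concrete, here is a hedged sketch: past key/value tokens are grouped into chunks, each chunk is scored against the current queries, and attention runs only over the top-k chunks. Chunking, mean-pooled similarity scoring, and the value of k are assumptions for illustration, not MoC's exact design.

```python
# Hedged sketch of sparse "mixture of contexts" routing over past chunks.
import torch
import torch.nn.functional as F

def route_and_attend(q, chunk_keys, chunk_values, k=4):
    """q: (B, Lq, D); chunk_keys/values: (B, C, Lc, D) past context chunks."""
    B, C, Lc, D = chunk_keys.shape
    # Score each chunk by similarity between pooled queries and pooled keys.
    q_pool = q.mean(dim=1)                          # (B, D)
    k_pool = chunk_keys.mean(dim=2)                 # (B, C, D)
    scores = torch.einsum('bd,bcd->bc', q_pool, k_pool) / D ** 0.5
    top = scores.topk(k, dim=-1).indices            # (B, k)
    idx = top[:, :, None, None].expand(-1, -1, Lc, D)
    keys = torch.gather(chunk_keys, 1, idx).reshape(B, k * Lc, D)
    vals = torch.gather(chunk_values, 1, idx).reshape(B, k * Lc, D)
    # Dense attention, but only over the k selected chunks.
    attn = F.softmax(q @ keys.transpose(1, 2) / D ** 0.5, dim=-1)
    return attn @ vals                              # (B, Lq, D)
```

Selecting k chunks out of C caps the attended context at k * Lc tokens, which is how this family of methods keeps per-step compute roughly constant as history grows.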
arXiv Detail & Related papers (2025-08-28T17:57:55Z)
- LoViC: Efficient Long Video Generation with Context Compression [68.22069741704158]
We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations.
arXiv Detail & Related papers (2025-07-17T09:46:43Z)
- VideoMerge: Towards Training-free Long Video Generation [46.108622251662176]
Long video generation remains a challenging and compelling topic in computer vision. We propose VideoMerge, a training-free method that can be seamlessly adapted to merge short videos.
arXiv Detail & Related papers (2025-03-13T00:47:59Z)
- CD-NGP: A Fast Scalable Continual Representation for Dynamic Scenes [31.783117836434403]
CD-NGP is a continual learning framework that reduces memory overhead and enhances scalability. It significantly reduces training memory usage to 14 GB and requires only 0.4 MB/frame in streaming bandwidth on DyNeRF.
arXiv Detail & Related papers (2024-09-08T17:35:48Z)
- RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
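Since the blurb names a tri-plane representation without detail, the sketch below shows the explicit half of such a hybrid: a coordinate is projected onto three axis-aligned feature planes and the sampled features are fused, after which a small implicit MLP would decode them. Treating time as the third axis (x, y, t) and sum fusion are assumptions for illustration; RAVEN's exact factorization is not given here.

```python
# Illustrative tri-plane feature lookup (the explicit half of a hybrid
# explicit-implicit representation), adapted to video coordinates (x, y, t).
import torch
import torch.nn.functional as F

def triplane_features(planes, pts):
    """planes: 3 tensors of shape (B, C, R, R) for the xy, xt, yt planes.
    pts: (B, N, 3) coordinates in [-1, 1], ordered (x, y, t)."""
    xy, xt, yt = planes

    def sample(plane, uv):
        # grid_sample expects a (B, N, 1, 2) grid; returns (B, C, N, 1).
        g = uv.unsqueeze(2)
        return F.grid_sample(plane, g, align_corners=True).squeeze(-1)

    f_xy = sample(xy, pts[..., [0, 1]])
    f_xt = sample(xt, pts[..., [0, 2]])
    f_yt = sample(yt, pts[..., [1, 2]])
    # Sum-fuse the three projections; a small MLP would decode the result.
    return (f_xy + f_xt + f_yt).permute(0, 2, 1)    # (B, N, C)
```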
arXiv Detail & Related papers (2024-01-11T16:48:44Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [57.758863967770594]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion. We expose two limitations of the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- Generating Long Videos of Dynamic Scenes [66.56925105992472]
We present a video generation model that reproduces object motion, changes in camera viewpoint, and new content that arises over time.
A common failure case is for content to never change due to over-reliance on inductive biases to provide temporal consistency.
arXiv Detail & Related papers (2022-06-07T16:29:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.