Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval
- URL: http://arxiv.org/abs/2506.03141v1
- Date: Tue, 03 Jun 2025 17:59:05 GMT
- Title: Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval
- Authors: Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Xihui Liu
- Abstract summary: We propose Context-as-Memory, which utilizes historical context as memory for video generation. Considering the enormous computational overhead of incorporating all historical context, we propose the Memory Retrieval module. Experiments demonstrate that Context-as-Memory achieves superior memory capabilities in interactive long video generation compared to SOTAs.
- Score: 33.15952106579093
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in interactive video generation have shown promising results, yet existing approaches struggle with scene-consistent memory capabilities in long video generation due to limited use of historical context. In this work, we propose Context-as-Memory, which utilizes historical context as memory for video generation. It includes two simple yet effective designs: (1) storing context in frame format without additional post-processing; (2) conditioning by concatenating context and frames to be predicted along the frame dimension at the input, requiring no external control modules. Furthermore, considering the enormous computational overhead of incorporating all historical context, we propose the Memory Retrieval module to select truly relevant context frames by determining FOV (Field of View) overlap between camera poses, which significantly reduces the number of candidate frames without substantial information loss. Experiments demonstrate that Context-as-Memory achieves superior memory capabilities in interactive long video generation compared to SOTAs, even generalizing effectively to open-domain scenarios not seen during training. Our project page is available at https://context-as-memory.github.io/.
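The abstract specifies the retrieval criterion only as FOV overlap between camera poses, so the following is a rough sketch rather than the paper's exact rule; the overlap heuristic, the `half_fov_deg` threshold, and the pose representation (position plus unit viewing direction per stored frame) are assumptions made for illustration.

```python
import numpy as np

def fov_overlap_score(pose_a, pose_b, half_fov_deg=45.0):
    """Heuristic FOV-overlap score between two camera poses.

    Each pose is a (position, unit viewing direction) pair. Views whose
    directions diverge by more than the combined half-FOV are assumed to
    share no visible content (score 0).
    """
    (pos_a, dir_a), (pos_b, dir_b) = pose_a, pose_b
    angle_deg = np.degrees(np.arccos(np.clip(np.dot(dir_a, dir_b), -1.0, 1.0)))
    if angle_deg >= 2.0 * half_fov_deg:
        return 0.0
    angular_overlap = 1.0 - angle_deg / (2.0 * half_fov_deg)
    # Down-weight views captured far from the target viewpoint.
    return angular_overlap / (1.0 + np.linalg.norm(pos_a - pos_b))

def retrieve_context(history, target_pose, k=8):
    """Select the k history frames most likely to overlap the target view.

    `history` is a list of (frame, pose) pairs accumulated so far; the
    selected frames would then be concatenated with the frames to be
    predicted along the frame dimension as conditioning.
    """
    scores = [fov_overlap_score(pose, target_pose) for _, pose in history]
    order = np.argsort(scores)[::-1][:k]
    return [history[i][0] for i in order if scores[i] > 0.0]
```

Bounding the conditioning set this way is what keeps the context length, and hence the compute cost, roughly constant as the generated video grows.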
Related papers
- VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory [55.73900731190389]
We introduce Surfel-Indexed View Memory (VMem), a mechanism that remembers past views by indexing them geometrically based on the 3D surface elements they have observed. VMem enables the efficient retrieval of the most relevant past views when generating new ones. We evaluate our approach on challenging long-term scene synthesis benchmarks and demonstrate superior performance compared to existing methods in maintaining scene coherence and camera control.
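The surfel-indexing idea above can be caricatured as an inverted index from surface elements to the views that observed them; the identifiers and data layout below are illustrative assumptions, not VMem's actual implementation.

```python
from collections import defaultdict

class SurfelIndexedViewMemory:
    """Toy surfel index: maps each surfel id to the views that observed it."""

    def __init__(self):
        self._surfel_to_views = defaultdict(set)
        self._views = {}  # view id -> (frame, observed surfel ids)

    def add_view(self, view_id, frame, observed_surfels):
        self._views[view_id] = (frame, set(observed_surfels))
        for surfel in observed_surfels:
            self._surfel_to_views[surfel].add(view_id)

    def retrieve(self, target_surfels, k=4):
        """Return the k stored frames sharing the most surfels with the target view."""
        votes = defaultdict(int)
        for surfel in target_surfels:
            for view_id in self._surfel_to_views.get(surfel, ()):
                votes[view_id] += 1
        ranked = sorted(votes, key=votes.get, reverse=True)
        return [self._views[v][0] for v in ranked[:k]]
```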
arXiv Detail & Related papers (2025-06-23T17:59:56Z) - InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO [73.33751812982342]
InfLVG is an inference-time framework that enables coherent long video generation without requiring additional long-form video data. We show that InfLVG can extend video length by up to 9×, achieving strong consistency and semantic fidelity across scenes.
arXiv Detail & Related papers (2025-05-23T07:33:25Z) - ReWind: Understanding Long Videos with Instructed Learnable Memory [8.002949551539297]
Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding of textual and visual information. We introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity. We empirically demonstrate ReWind's superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks.
arXiv Detail & Related papers (2024-11-23T13:23:22Z) - Streaming Long Video Understanding with Large Language Models [83.11094441893435]
VideoStreaming is an advanced vision-language large model (VLLM) for video understanding.
It capably understands arbitrary-length videos with a constant number of video streaming tokens that are encoded and selected through propagation.
Our model achieves superior performance and higher efficiency on long video benchmarks.
arXiv Detail & Related papers (2024-05-25T02:22:09Z) - XMem++: Production-level Video Segmentation From Few Annotated Frames [32.68978079571079]
We introduce a novel semi-supervised video object segmentation (SSVOS) model, XMem++, that improves existing memory-based models.
Our method can extract highly consistent results while keeping the required number of frame annotations low.
We demonstrate SOTA performance on challenging (partial and multi-class) segmentation scenarios as well as long videos.
arXiv Detail & Related papers (2023-07-29T11:18:23Z) - READMem: Robust Embedding Association for a Diverse Memory in Unconstrained Video Object Segmentation [24.813416082160224]
We present READMem, a modular framework for sVOS methods to handle unconstrained videos.
We propose a robust association of the embeddings stored in the memory with query embeddings during the update process.
Our approach achieves competitive results on the Long-time Video dataset (LV1) while not hindering performance on short sequences.
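As a minimal sketch of keeping a bounded yet diverse memory (the cosine-similarity rule, threshold, and replacement policy are assumptions, not READMem's actual association procedure): a new embedding is admitted only if it is sufficiently dissimilar from what is already stored, so the memory stays informative on unconstrained long videos without growing.

```python
import numpy as np

def update_diverse_memory(memory, new_embedding, capacity=16, sim_threshold=0.9):
    """Admit `new_embedding` only if it adds diversity to a bounded memory.

    `memory` is a list of L2-normalised embedding vectors. Embeddings that
    closely resemble an existing entry are dropped; once the memory is full,
    a sufficiently novel embedding replaces the entry it most resembles.
    """
    new_embedding = new_embedding / np.linalg.norm(new_embedding)
    if not memory:
        return [new_embedding]
    sims = np.array([float(np.dot(m, new_embedding)) for m in memory])
    if sims.max() >= sim_threshold:
        return memory  # redundant with stored content, skip it
    if len(memory) < capacity:
        return memory + [new_embedding]
    updated = list(memory)
    updated[int(sims.argmax())] = new_embedding
    return updated
```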
arXiv Detail & Related papers (2023-05-22T08:31:16Z) - Learning a Condensed Frame for Memory-Efficient Video Class-Incremental Learning [41.514250287733354]
We propose FrameMaker, a memory-efficient video class-incremental learning approach.
We show that FrameMaker can achieve better performance than recent advanced methods while consuming only 20% of the memory.
Under the same memory consumption conditions, FrameMaker significantly outperforms existing state-of-the-art methods by a convincing margin.
arXiv Detail & Related papers (2022-11-02T02:37:20Z) - Memory Efficient Temporal & Visual Graph Model for Unsupervised Video Domain Adaptation [50.158454960223274]
Existing video domain adaptation (DA) methods need to store all temporal combinations of video frames or pair the source and target videos.
We propose a memory-efficient graph-based video DA approach.
arXiv Detail & Related papers (2022-08-13T02:56:10Z) - Recurrent Dynamic Embedding for Video Object Segmentation [54.52527157232795]
We propose a Recurrent Dynamic Embedding (RDE) to build a memory bank of constant size.
We propose an unbiased guidance loss during the training stage, which makes SAM more robust in long videos.
We also design a novel self-correction strategy so that the network can repair the embeddings of masks with different qualities in the memory bank.
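A bare-bones illustration of a constant-size, recurrently updated memory bank (the nearest-slot routing and gated blending below are assumptions standing in for RDE's learned recurrence):

```python
import numpy as np

def recurrent_memory_update(bank, new_embeddings, gate=0.1):
    """Fold new frame embeddings into a fixed-size memory bank.

    `bank` has shape (slots, dim) and never grows: each incoming embedding
    is routed to its most similar slot and blended in, so memory cost stays
    constant regardless of video length.
    """
    bank = bank.copy()
    for emb in new_embeddings:
        # Cosine similarity of the new embedding to every slot.
        sims = bank @ emb / (np.linalg.norm(bank, axis=1) * np.linalg.norm(emb) + 1e-8)
        slot = int(np.argmax(sims))
        bank[slot] = (1.0 - gate) * bank[slot] + gate * emb
    return bank
```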
arXiv Detail & Related papers (2022-05-08T02:24:43Z) - Multi-Scale Memory-Based Video Deblurring [34.488707652997704]
We design a memory branch to memorize the blurry-sharp feature pairs in the memory bank.
To enrich the memory of our memory bank, we also design bidirectional recurrency and a multi-scale strategy.
Experimental results demonstrate that our model outperforms other state-of-the-art methods.
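The memory-branch idea — store blurry/sharp feature pairs, then at restoration time read out sharp features whose blurry keys best match the current features — might look roughly like this; the softmax nearest-neighbour read is an assumed stand-in for the paper's learned attention over the bank.

```python
import numpy as np

class BlurrySharpMemory:
    """Toy memory bank of (blurry key, sharp value) feature pairs."""

    def __init__(self):
        self.keys = []    # blurry feature vectors
        self.values = []  # corresponding sharp feature vectors

    def write(self, blurry_feat, sharp_feat):
        self.keys.append(np.asarray(blurry_feat, dtype=float))
        self.values.append(np.asarray(sharp_feat, dtype=float))

    def read(self, query_blurry_feat, temperature=1.0):
        """Blend stored sharp features, weighted by similarity of blurry keys."""
        keys = np.stack(self.keys)
        values = np.stack(self.values)
        scores = keys @ np.asarray(query_blurry_feat, dtype=float) / temperature
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ values
```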
arXiv Detail & Related papers (2022-04-06T08:48:56Z) - Local-Global Context Aware Transformer for Language-Guided Video Segmentation [103.35509224722097]
We explore the task of language-guided video segmentation (LVS).
We present Locater, which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner.
To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S+, which is built upon the A2D-S dataset.
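Querying an entire video through a finite memory, as described above, amounts to cross-attending the language expression against a bounded set of memory slots rather than against every video token; the single-head attention and shapes below are illustrative assumptions.

```python
import numpy as np

def query_finite_memory(language_query, memory_slots):
    """Single-head cross-attention of a language query over a finite memory.

    `language_query` has shape (dim,) and `memory_slots` has shape
    (slots, dim); the read-out cost depends on the number of slots,
    not on the video length.
    """
    dim = language_query.shape[-1]
    scores = memory_slots @ language_query / np.sqrt(dim)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ memory_slots  # attended read-out, shape (dim,)
```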
arXiv Detail & Related papers (2022-03-18T07:35:26Z)