Memory-V2V: Augmenting Video-to-Video Diffusion Models with Memory
- URL: http://arxiv.org/abs/2601.16296v1
- Date: Thu, 22 Jan 2026 19:59:17 GMT
- Title: Memory-V2V: Augmenting Video-to-Video Diffusion Models with Memory
- Authors: Dohun Lee, Chun-Hao Paul Huang, Xuelin Chen, Jong Chul Ye, Duygu Ceylan, Hyeonho Jeong
- Abstract summary: Current video editors struggle to maintain cross-consistency across sequential edits. Memory-V2V is a framework that augments existing video-to-video models with explicit memory. We show that Memory-V2V produces videos that are significantly more cross-consistent with minimal computational overhead.
- Score: 63.32726513381937
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent foundational video-to-video diffusion models have achieved impressive results in editing user-provided videos by modifying appearance, motion, or camera movement. However, real-world video editing is often an iterative process, where users refine results across multiple rounds of interaction. In this multi-turn setting, current video editors struggle to maintain cross-consistency across sequential edits. In this work, we tackle, for the first time, the problem of cross-consistency in multi-turn video editing and introduce Memory-V2V, a simple, yet effective framework that augments existing video-to-video models with explicit memory. Given an external cache of previously edited videos, Memory-V2V employs accurate retrieval and dynamic tokenization strategies to condition the current editing step on prior results. To further mitigate redundancy and computational overhead, we propose a learnable token compressor within the DiT backbone that compresses redundant conditioning tokens while preserving essential visual cues, achieving an overall speedup of 30%. We validate Memory-V2V on challenging tasks including video novel view synthesis and text-conditioned long video editing. Extensive experiments show that Memory-V2V produces videos that are significantly more cross-consistent with minimal computational overhead, while maintaining or even improving task-specific performance over state-of-the-art baselines. Project page: https://dohunlee1.github.io/MemoryV2V
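The abstract names two concrete mechanisms: retrieval from an external cache of previously edited videos, and a learnable token compressor inside the DiT backbone. Below is a minimal sketch of how such memory conditioning could be wired up, assuming PyTorch; `MemoryCache` and `TokenCompressor` are hypothetical stand-ins, and the paper's actual retrieval metric and DiT integration are not specified here.

```python
# Hypothetical sketch of memory-conditioned editing in the spirit of
# Memory-V2V; class names and the cosine-similarity retrieval are
# illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenCompressor(nn.Module):
    """Compress N conditioning tokens to M << N learned summary tokens."""
    def __init__(self, dim: int, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) -> compressed (B, num_queries, dim)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)
        return out

class MemoryCache:
    """External cache of previously edited clips, keyed by an embedding."""
    def __init__(self):
        self.keys: list[torch.Tensor] = []    # (dim,) clip embeddings
        self.values: list[torch.Tensor] = []  # (N, dim) token maps

    def add(self, key: torch.Tensor, tokens: torch.Tensor) -> None:
        self.keys.append(key)
        self.values.append(tokens)

    def retrieve(self, query: torch.Tensor) -> torch.Tensor:
        # Return the token map of the most similar previously edited clip.
        sims = torch.stack([F.cosine_similarity(query, k, dim=0) for k in self.keys])
        return self.values[int(sims.argmax())]
```

At each editing turn, the retrieved token map would be compressed and appended to the DiT's conditioning sequence; compressing redundant conditioning tokens in this way is what the authors credit for the reported 30% speedup.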
Related papers
- RFDM: Residual Flow Diffusion Model for Efficient Causal Video Editing [15.876564621196684]
We introduce a causal, efficient video editing model that edits variable-length videos frame by frame. For efficiency, we start from a 2D image-to-image (I2I) diffusion model and adapt it to video-to-video (V2V) editing. We call this the Residual Flow Diffusion Model (RFDM), which focuses the denoising process on changes between consecutive frames.
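A toy sketch of the frame-by-frame residual idea, assuming PyTorch; `denoiser` is a stand-in callable for the adapted I2I model, and the update rule below only illustrates the spirit of focusing denoising on inter-frame changes, not RFDM's actual formulation.

```python
# Toy illustration: edit frames causally, letting the model concentrate on
# the change between consecutive frames. `denoiser` is a hypothetical
# stand-in for the adapted image-to-image diffusion model.
import torch

def edit_causally(frames, denoiser, cond):
    """frames: list of (C, H, W) source frames, edited one at a time."""
    edited = [denoiser(frames[0], cond)]          # first frame edited in full
    for t in range(1, len(frames)):
        residual = frames[t] - frames[t - 1]      # inter-frame change
        # Denoise the residual rather than the full frame, then apply the
        # edited change on top of the previous result.
        edited.append(edited[-1] + denoiser(residual, cond))
    return edited
```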
arXiv Detail & Related papers (2026-02-06T16:56:30Z) - EasyV2V: A High-quality Instruction-based Video Editing Framework [108.78294392167017]
EasyV2V is a framework for instruction-based video editing. EasyV2V works with flexible inputs (e.g., video+text, video+mask+reference) and achieves state-of-the-art video editing results.
arXiv Detail & Related papers (2025-12-18T18:59:57Z) - Towards Effective and Efficient Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval [57.88666884515147]
We propose One-shot video-Clip based Retrieval AuGmentation (OneClip-RAG). OneClip-RAG makes full use of the merits of video clips for augmented video understanding. It is also equipped with a novel query-guided video chunking algorithm.
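A minimal sketch of clip-level retrieval as the summary describes it, assuming per-frame embeddings from some vision encoder; the fixed-length sliding window below stands in for the paper's query-guided chunking algorithm.

```python
# Hypothetical sketch: pick the video clip whose pooled embedding best
# matches the query. Fixed-length windows are a simplification of
# OneClip-RAG's query-guided chunking.
import torch
import torch.nn.functional as F

def retrieve_best_clip(frame_embs: torch.Tensor, query_emb: torch.Tensor,
                       clip_len: int = 16) -> tuple[int, int]:
    """frame_embs: (T, D) per-frame embeddings; returns the [start, end)
    span of the clip most similar to the (D,) query embedding."""
    T = frame_embs.size(0)
    best_sim, best_span = float("-inf"), (0, min(clip_len, T))
    for start in range(max(T - clip_len + 1, 1)):
        clip = frame_embs[start:start + clip_len].mean(dim=0)  # pool the clip
        sim = F.cosine_similarity(clip, query_emb, dim=0).item()
        if sim > best_sim:
            best_sim, best_span = sim, (start, start + clip_len)
    return best_span
```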
arXiv Detail & Related papers (2025-12-09T09:40:20Z) - VideoMem: Enhancing Ultra-Long Video Understanding via Adaptive Memory Management [17.645183933549458]
VideoMem is a novel framework that models long video understanding as a sequential generation task via adaptive memory management. We show that VideoMem significantly outperforms existing open-source models across diverse benchmarks for ultra-long video understanding tasks.
arXiv Detail & Related papers (2025-12-04T07:42:13Z) - Déjà Vu: Efficient Video-Language Query Engine with Learning-based Inter-Frame Computation Reuse [13.680753232748705]
This paper introduces Déjà Vu, a video-language query engine that accelerates ViT-based VideoLMs by reusing computations across consecutive frames. At its core is ReuseViT, a modified ViT model specifically designed for VideoLM tasks, which learns to detect inter-frame reuse opportunities. We show that Déjà Vu accelerates embedding generation by up to 2.64x within a 2% error bound, dramatically enhancing the practicality of VideoLMs for large-scale video analytics.
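A toy sketch of inter-frame reuse for a token-wise sublayer such as a ViT MLP (attention, which mixes tokens, needs more care); ReuseViT learns where reuse is safe, so the fixed threshold `tau` here is a simplification.

```python
# Recompute only the patch tokens that changed beyond a threshold and reuse
# the previous frame's cached outputs for the rest. Illustrative only:
# ReuseViT learns reuse decisions instead of thresholding.
import torch

def reuse_sublayer(tokens, prev_tokens, prev_out, sublayer, tau: float = 0.05):
    """tokens, prev_tokens: (N, D) patch tokens; prev_out: cached (N, D)."""
    if prev_tokens is None:
        return sublayer(tokens)
    changed = (tokens - prev_tokens).norm(dim=-1) > tau  # per-token mask
    out = prev_out.clone()
    if changed.any():
        out[changed] = sublayer(tokens[changed])         # recompute the subset
    return out
```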
arXiv Detail & Related papers (2025-06-17T01:59:10Z) - Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing [52.050036778325094]
Video-Ma$^2$mba is a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework. Our approach significantly reduces the memory footprint compared to standard gradient checkpointing. By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long video understanding tasks.
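For reference, a minimal sketch of gradient checkpointing along the layer axis, one of the axes a multi-axis scheme could trade compute for memory on; the blocks below are generic stand-ins for Mamba-2 / SSM layers, not the paper's architecture.

```python
# Standard PyTorch activation checkpointing: activations are recomputed in
# the backward pass instead of being stored, cutting memory at the cost of
# extra compute.
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = checkpoint(block, x, use_reentrant=False)
        return x
```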
arXiv Detail & Related papers (2024-11-29T04:12:13Z) - ReWind: Understanding Long Videos with Instructed Learnable Memory [8.002949551539297]
Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding of textual and visual information. We introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity. We empirically demonstrate ReWind's superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks.
arXiv Detail & Related papers (2024-11-23T13:23:22Z) - Neural Video Fields Editing [56.558490998753456]
NVEdit is a text-driven video editing framework designed to mitigate memory overhead and improve consistency.
We construct a neural video field, powered by tri-plane and sparse-grid representations, to encode long videos with hundreds of frames.
Next, we update the video field with off-the-shelf Text-to-Image (T2I) models to impose text-driven editing effects.
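A minimal sketch of the tri-plane part of such a video field, assuming PyTorch; the plane resolution, channel count, and summation fusion are illustrative choices rather than NVEdit's configuration.

```python
# Hypothetical tri-plane lookup: a spatio-temporal point (x, y, t) gathers
# features from three orthogonal 2D planes, which a small MLP would then
# decode to RGB.
import torch
import torch.nn.functional as F

class TriPlaneField(torch.nn.Module):
    def __init__(self, channels: int = 16, res: int = 128):
        super().__init__()
        # Feature planes for the (x,y), (x,t), and (y,t) axes.
        self.planes = torch.nn.Parameter(torch.randn(3, channels, res, res) * 0.01)

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (N, 3) with x, y, t each normalized to [-1, 1].
        x, y, t = coords[:, 0], coords[:, 1], coords[:, 2]
        feats = 0
        for i, (u, v) in enumerate([(x, y), (x, t), (y, t)]):
            grid = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)
            sampled = F.grid_sample(self.planes[i:i + 1], grid, align_corners=True)
            feats = feats + sampled.view(self.planes.size(1), -1).t()  # (N, C)
        return feats
```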
arXiv Detail & Related papers (2023-12-12T14:48:48Z) - Video-P2P: Video Editing with Cross-attention Control [68.64804243427756]
Video-P2P is a novel framework for real-world video editing with cross-attention control.
Video-P2P works well on real-world videos for generating new characters while optimally preserving their original poses and scenes.
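A minimal sketch of the cross-attention control idea Video-P2P builds on (Prompt-to-Prompt style), assuming PyTorch: attention maps recorded during the source prompt's pass are injected into the edited prompt's pass for unchanged tokens, preserving layout and poses. The function and mask names are hypothetical.

```python
# Hypothetical cross-attention step with attention-map injection in the
# Prompt-to-Prompt style that Video-P2P extends to video.
import torch

def controlled_cross_attention(q, k, v, stored_probs=None, keep_mask=None):
    """q: (B, Nq, d); k, v: (B, Nt, d); keep_mask: (Nt,) bool over prompt
    tokens shared between the source and edited prompts."""
    scale = q.size(-1) ** -0.5
    probs = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    if stored_probs is not None and keep_mask is not None:
        # Reuse the source pass's attention for unchanged prompt tokens so
        # the original scene layout and poses are preserved.
        probs = probs.clone()
        probs[..., keep_mask] = stored_probs[..., keep_mask]
    return probs @ v
```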
arXiv Detail & Related papers (2023-03-08T17:53:49Z)