Edit-Your-Interest: Efficient Video Editing via Feature Most-Similar Propagation
- URL: http://arxiv.org/abs/2510.13084v1
- Date: Wed, 15 Oct 2025 01:55:32 GMT
- Title: Edit-Your-Interest: Efficient Video Editing via Feature Most-Similar Propagation
- Authors: Yi Zuo, Zitao Wang, Lingling Li, Xu Liu, Fang Liu, Licheng Jiao
- Abstract summary: Edit-Your-Interest is a text-driven, zero-shot video editing method. It reduces computational overhead compared to full-sequence spatio-temporal modeling approaches. It outperforms state-of-the-art methods in both efficiency and visual fidelity.
- Score: 53.05471174430247
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image (T2I) diffusion models have recently demonstrated significant progress in video editing. However, existing video editing methods are severely limited by their high computational overhead and memory consumption. Furthermore, these approaches often sacrifice visual fidelity, leading to undesirable temporal inconsistencies and artifacts such as blurring and pronounced mosaic-like patterns. We propose Edit-Your-Interest, a lightweight, text-driven, zero-shot video editing method. Edit-Your-Interest introduces a spatio-temporal feature memory to cache features from previous frames, significantly reducing computational overhead compared to full-sequence spatio-temporal modeling approaches. Specifically, we first introduce a Spatio-Temporal Feature Memory bank (SFM), which is designed to efficiently cache and retain the crucial image tokens processed by spatial attention. Second, we propose the Feature Most-Similar Propagation (FMP) method. FMP propagates the most relevant tokens from previous frames to subsequent ones, preserving temporal consistency. Finally, we introduce an SFM update algorithm that continuously refreshes the cached features, ensuring their long-term relevance and effectiveness throughout the video sequence. Furthermore, we leverage cross-attention maps to automatically extract masks for the instances of interest. These masks are seamlessly integrated into the diffusion denoising process, enabling fine-grained control over target objects and allowing Edit-Your-Interest to perform highly accurate edits while robustly preserving the background integrity. Extensive experiments decisively demonstrate that the proposed Edit-Your-Interest outperforms state-of-the-art methods in both efficiency and visual fidelity, validating its superior effectiveness and practicality.
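The caching and propagation steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how similarity is measured or how the memory bank is refreshed, so cosine similarity and a moving-average update are assumptions made here for illustration, and every function name below (cosine, most_similar_index, propagate, update_memory) is hypothetical.

```python
import math
from typing import List, Sequence

Token = Sequence[float]


def cosine(a: Token, b: Token) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_index(query: Token, memory: List[Token]) -> int:
    """Index of the cached token most similar to `query` (first one on ties)."""
    return max(range(len(memory)), key=lambda i: cosine(query, memory[i]))


def propagate(frame_tokens: List[Token], memory: List[Token]) -> List[List[float]]:
    """For each current-frame token, copy its most similar cached token."""
    return [list(memory[most_similar_index(t, memory)]) for t in frame_tokens]


def update_memory(
    memory: List[Token], frame_tokens: List[Token], momentum: float = 0.5
) -> List[List[float]]:
    """ASSUMPTION: refresh the cache with a position-wise moving average.

    The paper's actual update algorithm is not described in the abstract.
    """
    if len(memory) != len(frame_tokens):
        raise ValueError("memory and frame must hold the same number of tokens")
    return [
        [momentum * m + (1.0 - momentum) * f for m, f in zip(mem_tok, new_tok)]
        for mem_tok, new_tok in zip(memory, frame_tokens)
    ]


# Toy example: two cached tokens from a previous frame, two tokens from the current one.
memory = [[1.0, 0.0], [0.0, 1.0]]
frame = [[0.9, 0.1], [0.2, 0.8]]
propagated = propagate(frame, memory)   # each token snaps to its nearest cached token
memory = update_memory(memory, frame)   # cache drifts toward the current frame
```

Matching each token only against a small cache, rather than attending over every token of every frame, is what keeps the cost of this kind of scheme low; the cache size and refresh rate then control how far back temporal consistency reaches.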
Related papers
- MAKIMA: Tuning-free Multi-Attribute Open-domain Video Editing via Mask-Guided Attention Modulation [55.101611012677616]
Diffusion-based text-to-image (T2I) models have demonstrated remarkable results in global video editing tasks. We present MAKIMA, a tuning-free multi-attribute editing (MAE) framework built upon pretrained T2I models for open-domain video editing.
arXiv Detail & Related papers (2024-12-28T02:36:51Z)
- FreeMask: Rethinking the Importance of Attention Masks for Zero-Shot Video Editing [22.876290778155514]
Cross-attention masks are effective in video editing but can introduce artifacts such as blurring and flickering.
We propose FreeMask, a method for selecting optimal masks tailored to specific video editing tasks.
Our approach achieves superior semantic fidelity, temporal consistency, and editing quality compared to state-of-the-art methods.
arXiv Detail & Related papers (2024-09-30T17:01:26Z)
- Blended Latent Diffusion under Attention Control for Real-World Video Editing [5.659933808910005]
We propose to adapt an image-level blended latent diffusion model to perform local video editing tasks.
Specifically, we leverage DDIM inversion to acquire the latents as background latents instead of the randomly noised ones.
We also introduce an autonomous mask manufacture mechanism derived from cross-attention maps in diffusion steps.
arXiv Detail & Related papers (2024-09-05T13:23:52Z)
- COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video. We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing. COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
arXiv Detail & Related papers (2024-06-13T06:27:13Z)
- FastDrag: Manipulate Anything in One Step [20.494157877241665]
We introduce a novel one-step drag-based image editing method, i.e., FastDrag, to accelerate the editing process.
This innovation achieves one-step latent semantic optimization and hence significantly promotes editing speeds.
Our FastDrag is validated on the DragBench dataset, demonstrating substantial improvements in processing time over existing methods.
arXiv Detail & Related papers (2024-05-24T17:59:26Z)
- Object-Centric Diffusion for Efficient Video Editing [64.71639719352636]
Diffusion-based video editing has reached impressive quality.
Such solutions typically incur heavy memory and computational costs to generate temporally-coherent frames.
We propose modifications that allow significant speed-ups whilst maintaining quality.
arXiv Detail & Related papers (2024-01-11T08:36:15Z)
- MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers [30.924202893340087]
State-of-the-art approaches predominantly rely on diffusion models to accomplish these tasks.
This paper breaks down the text-based video editing task into two stages.
First, we leverage a pre-trained text-to-image diffusion model to simultaneously edit a few keyframes in a zero-shot way.
Second, we introduce an efficient model called MaskINT, which is built on non-autoregressive masked generative transformers.
arXiv Detail & Related papers (2023-12-19T07:05:39Z)
- Edit Temporal-Consistent Videos with Image Diffusion Model [49.88186997567138]
Large-scale text-to-image (T2I) diffusion models have been extended for text-guided video editing.
The proposed method achieves state-of-the-art performance in both video temporal consistency and video editing capability.
arXiv Detail & Related papers (2023-08-17T16:40:55Z)
- FateZero: Fusing Attentions for Zero-shot Text-based Video Editing [104.27329655124299]
We propose FateZero, a zero-shot text-based editing method on real-world videos without per-prompt training or use-specific mask.
Our method is the first one to show the ability of zero-shot text-driven video style and local attribute editing from the trained text-to-image model.
arXiv Detail & Related papers (2023-03-16T17:51:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.