Related papers: MAKIMA: Tuning-free Multi-Attribute Open-domain Video Editing via Mask-Guided Attention Modulation

MAKIMA: Tuning-free Multi-Attribute Open-domain Video Editing via Mask-Guided Attention Modulation

URL: http://arxiv.org/abs/2412.19978v1
Date: Sat, 28 Dec 2024 02:36:51 GMT
Title: MAKIMA: Tuning-free Multi-Attribute Open-domain Video Editing via Mask-Guided Attention Modulation
Authors: Haoyu Zheng, Wenqiao Zhang, Zheqi Lv, Yu Zhong, Yang Dai, Jianxiang An, Yongliang Shen, Juncheng Li, Dongping Zhang, Siliang Tang, Yueting Zhuang,
Abstract summary: Diffusion-based text-to-image (T2I) models have demonstrated remarkable results in global video editing tasks.<n>We present MAKIMA, a tuning-free MAE framework built upon pretrained T2I models for open-domain video editing.
Score: 55.101611012677616
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion-based text-to-image (T2I) models have demonstrated remarkable results in global video editing tasks. However, their focus is primarily on global video modifications, and achieving desired attribute-specific changes remains a challenging task, specifically in multi-attribute editing (MAE) in video. Contemporary video editing approaches either require extensive fine-tuning or rely on additional networks (such as ControlNet) for modeling multi-object appearances, yet they remain in their infancy, offering only coarse-grained MAE solutions. In this paper, we present MAKIMA, a tuning-free MAE framework built upon pretrained T2I models for open-domain video editing. Our approach preserves video structure and appearance information by incorporating attention maps and features from the inversion process during denoising. To facilitate precise editing of multiple attributes, we introduce mask-guided attention modulation, enhancing correlations between spatially corresponding tokens and suppressing cross-attribute interference in both self-attention and cross-attention layers. To balance video frame generation quality and efficiency, we implement consistent feature propagation, which generates frame sequences by editing keyframes and propagating their features throughout the sequence. Extensive experiments demonstrate that MAKIMA outperforms existing baselines in open-domain multi-attribute video editing tasks, achieving superior results in both editing accuracy and temporal consistency while maintaining computational efficiency.

Related papers

CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion [62.04833878126661]
We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework.<n>We propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic)<n>Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.
arXiv Detail & Related papers (2025-11-26T07:27:11Z)
ConsistEdit: Highly Consistent and Precise Training-free Visual Editing [17.162316662697965]
We propose ConsistEdit, a novel attention control method specifically tailored for MM-DiT.<n>It incorporates vision-only attention control, mask-guided pre-attention fusion, and differentiated manipulation of the query, key, and value tokens.<n>It achieves state-of-the-art performance across a wide range of image and video editing tasks, including both structure-consistent and structure-inconsistent scenarios.
arXiv Detail & Related papers (2025-10-20T17:59:52Z)
Edit-Your-Interest: Efficient Video Editing via Feature Most-Similar Propagation [53.05471174430247]
Edit-Your-Interest is a text-driven, zero-shot video editing method.<n>It reduces computational overhead compared to full-sequence-temporal modeling approaches.<n>It outperforms state-of-the-art methods in both efficiency and visual fidelity.
arXiv Detail & Related papers (2025-10-15T01:55:32Z)
Taming Flow-based I2V Models for Creative Video Editing [64.67801702413122]
Video editing, which aims to manipulate videos according to user intent, remains an emerging challenge.<n>Most existing image-conditioned video editing methods require inversion with model-specific design or need extensive optimization.<n>We propose IF-V2V, an Inversion-Free method that can adapt off-the-shelf flow-matching-based I2V models for video editing without significant computational overhead.
arXiv Detail & Related papers (2025-09-26T05:57:04Z)
STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing [35.50656689789427]
STR-Match is a training-free video editing system that produces visually appealing and coherent videos.<n> STR-Match consistently outperforms existing methods in both visual quality andtemporal consistency.
arXiv Detail & Related papers (2025-06-28T12:36:19Z)
LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning [8.077442711429317]
Video editing using diffusion models has achieved remarkable results in generating high-quality edits for videos.<n>First-frame-guided editing provides control over the first frame, but lacks flexibility over subsequent frames.<n>We propose a mask-based LoRA tuning method that adapts pretrained Image-to-Video (I2V) models for flexible video editing.
arXiv Detail & Related papers (2025-06-11T18:03:55Z)
COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video.<n>We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing.<n>COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
arXiv Detail & Related papers (2024-06-13T06:27:13Z)
HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness [57.18183962641015]
We present HOI-Swap, a video editing framework trained in a self-supervised manner. The first stage focuses on object swapping in a single frame with HOI awareness. The second stage extends the single-frame edit across the entire sequence.
arXiv Detail & Related papers (2024-06-11T22:31:29Z)
Temporally Consistent Object Editing in Videos using Extended Attention [9.605596668263173]
We propose a method to edit videos using a pre-trained inpainting image diffusion model. We ensure that the edited information will be consistent across all the video frames.
arXiv Detail & Related papers (2024-06-01T02:31:16Z)
I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models [18.36472998650704]
We introduce a novel and generic solution that extends the applicability of image editing tools to videos by propagating edits from a single frame to the entire video using a pre-trained image-to-video model. Our method, dubbed I2VEdit, adaptively preserves the visual and motion integrity of the source video depending on the extent of the edits.
arXiv Detail & Related papers (2024-05-26T11:47:40Z)
Consolidating Attention Features for Multi-view Image Editing [126.19731971010475]
We focus on spatial control-based geometric manipulations and introduce a method to consolidate the editing process across various views. We introduce QNeRF, a neural radiance field trained on the internal query features of the edited images. We refine the process through a progressive, iterative method that better consolidates queries across the diffusion timesteps.
arXiv Detail & Related papers (2024-02-22T18:50:18Z)
MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers [30.924202893340087]
State-of-the-art approaches predominantly rely on diffusion models to accomplish these tasks. This paper breaks down the text-based video editing task into two stages. First, we leverage an pre-trained text-to-image diffusion model to simultaneously edit fews in a zero-shot way. Second, we introduce an efficient model called MaskINT, which is built on non-autoregressive masked generative transformers.
arXiv Detail & Related papers (2023-12-19T07:05:39Z)
Edit Temporal-Consistent Videos with Image Diffusion Model [49.88186997567138]
Large-scale text-to-image (T2I) diffusion models have been extended for text-guided video editing. T achieves state-of-the-art performance in both video temporal consistency and video editing capability.
arXiv Detail & Related papers (2023-08-17T16:40:55Z)
Edit-A-Video: Single Video Editing with Object-Aware Consistency [49.43316939996227]
We propose a video editing framework given only a pretrained TTI model and a single text, video> pair, which we term Edit-A-Video. The framework consists of two stages: (1) inflating the 2D model into the 3D model by appending temporal modules tuning and on the source video (2) inverting the source video into the noise and editing with target text prompt and attention map injection. We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.
arXiv Detail & Related papers (2023-03-14T14:35:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.