FreeMask: Rethinking the Importance of Attention Masks for Zero-Shot Video Editing
- URL: http://arxiv.org/abs/2409.20500v1
- Date: Mon, 30 Sep 2024 17:01:26 GMT
- Title: FreeMask: Rethinking the Importance of Attention Masks for Zero-Shot Video Editing
- Authors: Lingling Cai, Kang Zhao, Hangjie Yuan, Yingya Zhang, Shiwei Zhang, Kejie Huang
- Abstract summary: Cross-attention masks are effective in video editing but can introduce artifacts such as blurring and flickering.
We propose FreeMask, a method for selecting optimal masks tailored to specific video editing tasks.
Our approach achieves superior semantic fidelity, temporal consistency, and editing quality compared to state-of-the-art methods.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-video diffusion models have made remarkable advancements. Driven by their ability to generate temporally coherent videos, research on zero-shot video editing using these fundamental models has expanded rapidly. To enhance editing quality, structural controls are frequently employed in video editing. Among these techniques, cross-attention mask control stands out for its effectiveness and efficiency. However, when cross-attention masks are naively applied to video editing, they can introduce artifacts such as blurring and flickering. Our experiments uncover a critical factor overlooked in previous video editing research: cross-attention masks are not consistently clear but vary with model structure and denoising timestep. To address this issue, we introduce the metric Mask Matching Cost (MMC), which quantifies this variability, and propose FreeMask, a method for selecting optimal masks tailored to specific video editing tasks. Using MMC-selected masks, we further improve the masked fusion mechanism within comprehensive attention features, e.g., temporal, cross-, and self-attention modules. Our approach can be seamlessly integrated into existing zero-shot video editing frameworks with better performance, requiring no control assistance or parameter fine-tuning but enabling adaptive decoupling of unedited semantic layouts with mask precision control. Extensive experiments demonstrate that FreeMask achieves superior semantic fidelity, temporal consistency, and editing quality compared to state-of-the-art methods.
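The abstract does not give MMC's exact formula. The sketch below only illustrates the selection idea under stated assumptions: masks are soft cross-attention maps indexed by (layer, timestep), and a hypothetical stand-in cost ranks them against a binarized consensus layout; both the cost and the consensus reference are illustrative, not the paper's definition.

```python
import torch

def mask_matching_cost(soft_mask: torch.Tensor, reference: torch.Tensor) -> float:
    """Stand-in matching cost: mean absolute deviation between a soft
    cross-attention mask and a binarized reference layout (lower = clearer)."""
    return (soft_mask - reference).abs().mean().item()

def select_masks(masks: dict, keep: int = 4) -> list:
    """Rank per-(layer, timestep) masks by the cost above and keep the
    `keep` clearest ones."""
    stacked = torch.stack(list(masks.values()))
    reference = (stacked.mean(dim=0) > 0.5).float()  # consensus layout
    ranked = sorted(masks.items(), key=lambda kv: mask_matching_cost(kv[1], reference))
    return [key for key, _ in ranked[:keep]]

# Toy usage: soft masks indexed by (attention layer, denoising timestep).
masks = {(l, t): torch.rand(64, 64) for l in range(2) for t in (999, 751, 501, 251)}
print(select_masks(masks))
```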
Related papers
- Blended Latent Diffusion under Attention Control for Real-World Video Editing
We propose to adapt an image-level blended latent diffusion model to perform local video editing tasks.
Specifically, we leverage DDIM inversion to acquire the latents as background latents instead of the randomly noised ones.
We also introduce an autonomous mask manufacture mechanism derived from cross-attention maps in diffusion steps.
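The paper's exact mask manufacture mechanism is not detailed here; a minimal sketch of the general idea, averaging the attention paid to one prompt token across cross-attention maps and thresholding it into a binary editing mask:

```python
import torch

def attention_mask(attn_maps: list[torch.Tensor], token_idx: int,
                   threshold: float = 0.5) -> torch.Tensor:
    """Average the attention each spatial location pays to one prompt token,
    normalize to [0, 1], and binarize into an editing mask."""
    # Each map: (num_spatial, num_tokens) attention probabilities.
    per_pixel = torch.stack([m[:, token_idx] for m in attn_maps]).mean(dim=0)
    per_pixel = (per_pixel - per_pixel.min()) / (per_pixel.max() - per_pixel.min() + 1e-8)
    return (per_pixel > threshold).float()

# Toy usage: three 16x16 attention maps over a 6-token prompt.
maps = [torch.softmax(torch.randn(256, 6), dim=-1) for _ in range(3)]
mask = attention_mask(maps, token_idx=2).reshape(16, 16)
```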
arXiv Detail & Related papers (2024-09-05T13:23:52Z)
- Text-Guided Video Masked Autoencoder
We introduce a novel text-guided masking algorithm (TGM) that masks the video regions with highest correspondence to paired captions.
We show that across existing masking algorithms, unifying MAE and masked video-text contrastive learning improves downstream performance compared to pure MAE.
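A minimal sketch of text-guided masking, assuming patch and caption embeddings live in a shared space; the cosine similarity and masking ratio here are illustrative stand-ins, not TGM's exact recipe:

```python
import torch

def text_guided_mask(patch_feats: torch.Tensor, text_feat: torch.Tensor,
                     mask_ratio: float = 0.75) -> torch.Tensor:
    """Boolean mask (True = masked) that hides the fraction of patches whose
    features align most strongly with the caption embedding."""
    sims = torch.nn.functional.cosine_similarity(patch_feats, text_feat[None, :], dim=-1)
    num_masked = int(mask_ratio * patch_feats.shape[0])
    mask = torch.zeros(patch_feats.shape[0], dtype=torch.bool)
    mask[sims.topk(num_masked).indices] = True  # highest text correspondence
    return mask

# Toy usage: 196 patch embeddings and one caption embedding, both 512-d.
mask = text_guided_mask(torch.randn(196, 512), torch.randn(512))
```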
arXiv Detail & Related papers (2024-08-01T17:58:19Z)
- Towards Efficient Diffusion-Based Image Editing with Instant Attention Masks
In this paper, we propose a novel and efficient image editing method for Text-to-Image (T2I) diffusion models, termed Instant Diffusion Editing (InstDiffEdit).
In particular, InstDiffEdit aims to employ the cross-modal attention ability of existing diffusion models to achieve instant mask guidance during the diffusion steps.
To supplement the existing evaluations of diffusion-based image editing (DIE), we propose a new benchmark called Editing-Mask to examine the mask accuracy and local editing ability of existing methods.
arXiv Detail & Related papers (2024-01-15T14:25:54Z)
- MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers
State-of-the-art approaches predominantly rely on diffusion models to accomplish these tasks.
This paper breaks down the text-based video editing task into two stages.
First, we leverage a pre-trained text-to-image diffusion model to simultaneously edit a few key frames in a zero-shot way.
Second, we introduce an efficient model called MaskINT, which is built on non-autoregressive masked generative transformers.
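MaskINT's exact decoding schedule is not given here; the sketch below shows generic MaskGIT-style iterative parallel decoding, the family of non-autoregressive masked generation the summary refers to, with `predictor` as a hypothetical stand-in for the transformer:

```python
import math
import torch

def iterative_decode(predictor, tokens: torch.Tensor, mask: torch.Tensor,
                     steps: int = 8) -> torch.Tensor:
    """Parallel decoding: predict all masked tokens each step, keep the most
    confident predictions, and re-mask the rest on a cosine schedule."""
    for step in range(steps):
        probs = predictor(tokens, mask).softmax(dim=-1)  # (N, vocab_size)
        conf, pred = probs.max(dim=-1)
        conf = conf.masked_fill(~mask, float("inf"))     # fixed tokens stay fixed
        tokens = torch.where(mask, pred, tokens)
        num_remask = int(math.cos(math.pi / 2 * (step + 1) / steps) * mask.sum().item())
        if num_remask == 0:
            break
        mask = torch.zeros_like(mask)
        mask[conf.topk(num_remask, largest=False).indices] = True  # least confident
    return tokens

# Toy usage: 100 tokens from a 1024-way codebook, all masked initially.
predictor = lambda tok, m: torch.randn(tok.shape[0], 1024)
out = iterative_decode(predictor, torch.zeros(100, dtype=torch.long),
                       torch.ones(100, dtype=torch.bool))
```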
arXiv Detail & Related papers (2023-12-19T07:05:39Z)
- MotionEditor: Editing Video Motion via Content-Aware Diffusion
MotionEditor is a diffusion model for video motion editing.
It incorporates a novel content-aware motion adapter into ControlNet to capture temporal motion correspondence.
arXiv Detail & Related papers (2023-11-30T18:59:33Z)
- MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation
MagicProp disentangles the video editing process into two stages: appearance editing and motion-aware appearance propagation.
In the first stage, MagicProp selects a single frame from the input video and applies image-editing techniques to modify the content and/or style of the frame.
In the second stage, MagicProp employs the edited frame as an appearance reference and generates the remaining frames using an autoregressive rendering approach.
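A schematic of the two-stage flow, with `edit_image` and `render_next` as hypothetical stand-ins for the image-editing and motion-aware rendering components:

```python
import torch

def propagate_appearance(frames: list, edit_image, render_next) -> list:
    """Stage 1: edit one reference frame. Stage 2: autoregressively render
    each remaining frame from the previous output and the source frame."""
    edited = [edit_image(frames[0])]
    for t in range(1, len(frames)):
        edited.append(render_next(edited[-1], frames[t]))
    return edited

# Toy usage with trivial stand-ins for the two components.
frames = [torch.rand(3, 64, 64) for _ in range(5)]
out = propagate_appearance(frames, edit_image=lambda f: f,
                           render_next=lambda prev, src: 0.5 * prev + 0.5 * src)
```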
arXiv Detail & Related papers (2023-09-02T11:13:29Z)
- MGMAE: Motion Guided Masking for Video Masked Autoencoding
Temporal redundancy has led to a high masking ratio and a customized masking strategy in VideoMAE.
Our motion-guided masking explicitly incorporates motion information to build a temporally consistent masking volume.
We perform experiments on the datasets of Something-Something V2 and Kinetics-400, demonstrating the superior performance of our MGMAE to the original VideoMAE.
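A crude sketch of the masking-volume idea, using integer shifts as a stand-in for real optical-flow warping so that one initial patch mask stays temporally consistent:

```python
import torch

def motion_guided_masks(flow: torch.Tensor, init_mask: torch.Tensor) -> torch.Tensor:
    """Build a (T, H, W) masking volume by shifting an initial frame mask
    along per-frame integer motion offsets (crude stand-in for flow warping)."""
    masks = [init_mask]
    for t in range(1, flow.shape[0]):
        dy, dx = int(flow[t, 0].item()), int(flow[t, 1].item())
        masks.append(torch.roll(masks[-1], shifts=(dy, dx), dims=(0, 1)))
    return torch.stack(masks)

# Toy usage: 8 frames, each shifting a 14x14 patch mask (~90% masked) by (1, 1).
init = torch.rand(14, 14) < 0.9
volume = motion_guided_masks(torch.ones(8, 2), init)
```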
arXiv Detail & Related papers (2023-08-21T15:39:41Z)
- Siamese Masked Autoencoders
We present Siamese Masked Autoencoders (SiamMAE) for learning visual correspondence from videos.
SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them.
It outperforms state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks.
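A minimal sketch of the asymmetric masking: the past frame is left visible while a high fraction (such as 95%) of the future frame's patches is masked at random:

```python
import torch

def asymmetric_mask_pair(num_patches: int, future_ratio: float = 0.95):
    """Return (past_mask, future_mask): the past frame stays fully visible,
    the future frame has a large random fraction of patches masked."""
    past_mask = torch.zeros(num_patches, dtype=torch.bool)   # nothing masked
    future_mask = torch.zeros(num_patches, dtype=torch.bool)
    perm = torch.randperm(num_patches)
    future_mask[perm[: int(future_ratio * num_patches)]] = True
    return past_mask, future_mask

# Toy usage: a 196-patch frame pair.
past, future = asymmetric_mask_pair(196)
```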
arXiv Detail & Related papers (2023-05-23T17:59:46Z)
- Mask-Free Video Instance Segmentation
Video masks are tedious and expensive to annotate, limiting the scale and diversity of existing VIS datasets.
We propose MaskFreeVIS, achieving highly competitive VIS performance, while only using bounding box annotations for the object state.
Our Temporal KNN-patch Loss (TK-Loss) finds one-to-many matches across frames through an efficient patch-matching step followed by K-nearest-neighbor selection.
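A sketch of the KNN-consistency idea, assuming per-patch features and soft mask scores per frame; details such as the match-confidence threshold are omitted, so this is not the paper's exact loss:

```python
import torch

def tk_loss(masks_t: torch.Tensor, masks_t1: torch.Tensor,
            feats_t: torch.Tensor, feats_t1: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Match each patch in frame t to its K nearest patches in frame t+1 by
    feature distance, then penalize disagreement between matched mask scores."""
    dists = torch.cdist(feats_t, feats_t1)       # (N, N) patch distances
    knn = dists.topk(k, largest=False).indices   # (N, K) matches per patch
    matched = masks_t1[knn]                      # (N, K) mask scores at matches
    return (masks_t[:, None] - matched).abs().mean()

# Toy usage: 64 patches per frame with 128-d features and soft mask scores.
loss = tk_loss(torch.rand(64), torch.rand(64), torch.randn(64, 128), torch.randn(64, 128))
```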
arXiv Detail & Related papers (2023-03-28T11:48:07Z)
- FateZero: Fusing Attentions for Zero-shot Text-based Video Editing
We propose FateZero, a zero-shot text-based editing method for real-world videos that requires no per-prompt training or user-specific masks.
Our method is the first to demonstrate zero-shot text-driven video style and local attribute editing from a trained text-to-image model.
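A sketch of the attention-fusion idea: attention stored during inversion is reused outside the edited region, preserving the unedited layout; the blending mask and tensor shapes here are illustrative:

```python
import torch

def fuse_attention(edit_attn: torch.Tensor, src_attn: torch.Tensor,
                   blend_mask: torch.Tensor) -> torch.Tensor:
    """Inside the edited region keep the editing pass's attention; outside it,
    reuse the attention recorded during inversion of the source video."""
    return blend_mask * edit_attn + (1.0 - blend_mask) * src_attn

# Toy usage: (heads, queries, keys) attention maps blended by a per-query mask.
edit_a, src_a = torch.rand(8, 256, 77), torch.rand(8, 256, 77)
mask = (torch.rand(1, 256, 1) > 0.5).float()
fused = fuse_attention(edit_a, src_a, mask)
```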
arXiv Detail & Related papers (2023-03-16T17:51:13Z)
- Edit-A-Video: Single Video Editing with Object-Aware Consistency
We propose a video editing framework given only a pretrained text-to-image (TTI) model and a single <text, video> pair, which we term Edit-A-Video.
The framework consists of two stages: (1) inflating the 2D model into a 3D model by appending temporal modules and tuning on the source video; (2) inverting the source video into noise and editing it with the target text prompt and attention map injection.
We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.
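A sketch of stage (1), inflating a 2D block by appending a temporal attention module; the layer sizes and the identity spatial layer are placeholders, not Edit-A-Video's architecture:

```python
import torch
import torch.nn as nn

class InflatedBlock(nn.Module):
    """Keep a spatial layer from the 2D model and append a trainable
    temporal attention that mixes information across frames."""
    def __init__(self, spatial: nn.Module, dim: int):
        super().__init__()
        self.spatial = spatial
        self.temporal = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim)
        b, f, n, d = x.shape
        x = self.spatial(x.reshape(b * f, n, d)).reshape(b, f, n, d)
        x = x.permute(0, 2, 1, 3).reshape(b * n, f, d)  # attend over frames
        x = x + self.temporal(x, x, x)[0]
        return x.reshape(b, n, f, d).permute(0, 2, 1, 3)

# Toy usage: inflate an identity "spatial" layer for 2 videos of 8 frames.
block = InflatedBlock(nn.Identity(), dim=64)
out = block(torch.randn(2, 8, 196, 64))
```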
arXiv Detail & Related papers (2023-03-14T14:35:59Z)