Related papers: Investigating the Effectiveness of Cross-Attention to Unlock Zero-Shot Editing of Text-to-Video Diffusion Models

Investigating the Effectiveness of Cross-Attention to Unlock Zero-Shot Editing of Text-to-Video Diffusion Models

URL: http://arxiv.org/abs/2404.05519v1
Date: Mon, 8 Apr 2024 13:40:01 GMT
Title: Investigating the Effectiveness of Cross-Attention to Unlock Zero-Shot Editing of Text-to-Video Diffusion Models
Authors: Saman Motamed, Wouter Van Gansbeke, Luc Van Gool,
Abstract summary: Cross-attention guidance can be a promising approach for editing videos. We show that despite the limitations of current T2V models, cross-attention guidance can be a promising approach for editing videos.
Score: 52.28245595257831
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With recent advances in image and video diffusion models for content creation, a plethora of techniques have been proposed for customizing their generated content. In particular, manipulating the cross-attention layers of Text-to-Image (T2I) diffusion models has shown great promise in controlling the shape and location of objects in the scene. Transferring image-editing techniques to the video domain, however, is extremely challenging as object motion and temporal consistency are difficult to capture accurately. In this work, we take a first look at the role of cross-attention in Text-to-Video (T2V) diffusion models for zero-shot video editing. While one-shot models have shown potential in controlling motion and camera movement, we demonstrate zero-shot control over object shape, position and movement in T2V models. We show that despite the limitations of current T2V models, cross-attention guidance can be a promising approach for editing videos.

Related papers

MotionMatcher: Motion Customization of Text-to-Video Diffusion Models via Motion Feature Matching [27.28898943916193]
Text-to-video (T2V) diffusion models have promising capabilities in synthesizing realistic videos from input text prompts. In this work, we tackle the motion customization problem, where a reference video is provided as motion guidance. We propose MotionMatcher, a motion customization framework that fine-tunes the pre-trained T2V diffusion model at the feature level.
arXiv Detail & Related papers (2025-02-18T19:12:51Z)
VideoDirector: Precise Video Editing via Text-to-Video Models [45.53826541639349]
Current video editing methods rely on text-to-video (T2V) models, which inherently lack temporal-coherence generative ability. We propose a spatial-temporal decoupled guidance (STDG) and multi-frame null-text optimization strategy to provide pivotal temporal cues for more precise pivotal inversion. Experimental results demonstrate that our method effectively harnesses the powerful temporal generation capabilities of T2V models.
arXiv Detail & Related papers (2024-11-26T16:56:53Z)
HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness [57.18183962641015]
We present HOI-Swap, a video editing framework trained in a self-supervised manner. The first stage focuses on object swapping in a single frame with HOI awareness. The second stage extends the single-frame edit across the entire sequence.
arXiv Detail & Related papers (2024-06-11T22:31:29Z)
Slicedit: Zero-Shot Video Editing With Text-to-Image Diffusion Models Using Spatio-Temporal Slices [19.07572422897737]
We present Slicedit, a method for text-based video editing that utilize a pretrained T2I diffusion model to process both spatial andtemporal slices. Our method generates videos retain the structure and motion of the original video while adhering to the target text.
arXiv Detail & Related papers (2024-05-20T17:55:56Z)
Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models [48.56724784226513]
We propose Customize-A-Video that models the motion from a single reference video and adapts it to new subjects and scenes with both spatial and temporal varieties. The proposed modules are trained in a staged pipeline and inferred in a plug-and-play fashion, enabling easy extensions to various downstream tasks.
arXiv Detail & Related papers (2024-02-22T18:38:48Z)
TokenFlow: Consistent Diffusion Features for Consistent Video Editing [27.736354114287725]
We present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing. Our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video. Our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method.
arXiv Detail & Related papers (2023-07-19T18:00:03Z)
Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps. We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process. Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z)
Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models [68.31777975873742]
Recent attempts at video editing require significant text-to-video data and computation resources for training. We propose vid2vid-zero, a simple yet effective method for zero-shot video editing. Experiments and analyses show promising results in editing attributes, subjects, places, etc., in real-world videos.
arXiv Detail & Related papers (2023-03-30T17:59:25Z)
Edit-A-Video: Single Video Editing with Object-Aware Consistency [49.43316939996227]
We propose a video editing framework given only a pretrained TTI model and a single text, video> pair, which we term Edit-A-Video. The framework consists of two stages: (1) inflating the 2D model into the 3D model by appending temporal modules tuning and on the source video (2) inverting the source video into the noise and editing with target text prompt and attention map injection. We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.
arXiv Detail & Related papers (2023-03-14T14:35:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.