FateZero: Fusing Attentions for Zero-shot Text-based Video Editing
- URL: http://arxiv.org/abs/2303.09535v3
- Date: Wed, 11 Oct 2023 17:46:21 GMT
- Title: FateZero: Fusing Attentions for Zero-shot Text-based Video Editing
- Authors: Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying
Shan, Qifeng Chen
- Abstract summary: We propose FateZero, a zero-shot text-based editing method for real-world videos that requires neither per-prompt training nor a user-specific mask.
Our method is the first to demonstrate zero-shot text-driven video style and local attribute editing with a pretrained text-to-image model.
- Score: 104.27329655124299
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion-based generative models have achieved remarkable success in
text-based image generation. However, because the generation process contains
enormous randomness, it remains challenging to apply such models to real-world
visual content editing, especially for videos. In this paper, we propose
FateZero, a zero-shot text-based editing method for real-world videos that
requires neither per-prompt training nor a user-specific mask. To edit videos
consistently, we propose several techniques based on pre-trained models. First, in
contrast to the straightforward DDIM inversion technique, our approach captures
intermediate attention maps during inversion, which effectively retain both
structural and motion information. These maps are directly fused in the editing
process rather than generated during denoising. To further minimize semantic
leakage of the source video, we then fuse self-attentions with a blending mask
obtained from cross-attention features of the source prompt. Furthermore, we
reform the self-attention mechanism in the denoising UNet by introducing
spatial-temporal attention to ensure frame consistency. Despite its simplicity,
our method is the first to demonstrate zero-shot text-driven video style and
local attribute editing with a pretrained text-to-image model. It also achieves
better zero-shot shape-aware editing when built on a text-to-video model.
Extensive experiments demonstrate superior temporal consistency and editing
capability compared with previous works.
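As a concrete illustration of the attention-fusion idea described above, the following is a minimal, self-contained PyTorch sketch (not the authors' implementation): attention maps recorded during inversion are stored per layer, and during editing they are fused back, optionally blended with a mask derived from cross-attention of the source prompt. The class name `RecordingAttention`, the `store` dictionary, and the `blend_mask` argument are illustrative assumptions; a real implementation would hook the attention layers of a pretrained denoising UNet and handle multiple frames and heads.

```python
import torch


class RecordingAttention(torch.nn.Module):
    """Toy self-attention layer that can record or reuse its attention maps."""

    def __init__(self, dim: int, store: dict, key: str):
        super().__init__()
        self.to_qkv = torch.nn.Linear(dim, dim * 3, bias=False)
        self.store, self.key = store, key
        self.mode = "record"  # "record" during inversion, "fuse" during editing

    def forward(self, x, blend_mask=None):
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        if self.mode == "record":          # inversion pass: store the source maps
            self.store[self.key] = attn.detach()
        else:                              # editing pass: fuse the stored maps
            src_attn = self.store[self.key]
            if blend_mask is None:
                attn = src_attn            # plain replacement keeps source structure
            else:                          # blend: edit inside the mask, keep source outside
                attn = blend_mask * attn + (1 - blend_mask) * src_attn
        return attn @ v


# Toy usage: one "inversion" pass that records, one "editing" pass that fuses.
store = {}
layer = RecordingAttention(dim=64, store=store, key="unet.block0.attn")
frame_tokens = torch.randn(1, 16, 64)        # [batch, tokens, channels]

layer.mode = "record"
_ = layer(frame_tokens)                      # source attention map is stored

layer.mode = "fuse"
mask = torch.rand(1, 16, 16)                 # stand-in for a cross-attention-derived mask
edited = layer(frame_tokens, blend_mask=mask)
```

Recording the maps during inversion and reusing them at the matching denoising step is what lets the edit preserve the source structure and motion without any per-video training.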
Related papers
- Blended Latent Diffusion under Attention Control for Real-World Video Editing [5.659933808910005]
We propose to adapt an image-level blended latent diffusion model to perform local video editing tasks.
Specifically, we leverage DDIM inversion to acquire the latents as background latents instead of randomly noised ones (a minimal DDIM-inversion sketch appears after this list).
We also introduce an autonomous mask manufacture mechanism derived from cross-attention maps in diffusion steps.
arXiv Detail & Related papers (2024-09-05T13:23:52Z) - Slicedit: Zero-Shot Video Editing With Text-to-Image Diffusion Models Using Spatio-Temporal Slices [19.07572422897737]
We present Slicedit, a method for text-based video editing that utilizes a pretrained T2I diffusion model to process both spatial and temporal slices.
Our method generates videos that retain the structure and motion of the original video while adhering to the target text.
arXiv Detail & Related papers (2024-05-20T17:55:56Z) - FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video
editing [65.60744699017202]
We introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue for text-to-video editing.
Our method, FLATTEN, enforces the patches on the same flow path across different frames to attend to each other in the attention module.
Results on existing text-to-video editing benchmarks show that our proposed method achieves new state-of-the-art performance.
arXiv Detail & Related papers (2023-10-09T17:59:53Z) - TokenFlow: Consistent Diffusion Features for Consistent Video Editing [27.736354114287725]
We present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing.
Our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video.
Our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method.
arXiv Detail & Related papers (2023-07-19T18:00:03Z) - VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing [18.24307442582304]
We introduce VidEdit, a novel method for zero-shot text-based video editing.
Our experiments show that VidEdit outperforms state-of-the-art methods on the DAVIS dataset.
arXiv Detail & Related papers (2023-06-14T19:15:49Z) - Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models [68.31777975873742]
Recent attempts at video editing require significant text-to-video data and computation resources for training.
We propose vid2vid-zero, a simple yet effective method for zero-shot video editing.
Experiments and analyses show promising results in editing attributes, subjects, places, etc., in real-world videos.
arXiv Detail & Related papers (2023-03-30T17:59:25Z) - Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video
Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z) - Edit-A-Video: Single Video Editing with Object-Aware Consistency [49.43316939996227]
We propose a video editing framework given only a pretrained TTI model and a single <text, video> pair, which we term Edit-A-Video.
The framework consists of two stages: (1) inflating the 2D model into a 3D model by appending temporal modules and tuning on the source video; (2) inverting the source video into the noise and editing it with the target text prompt and attention map injection.
We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.
arXiv Detail & Related papers (2023-03-14T14:35:59Z) - Zero-shot Image-to-Image Translation [57.46189236379433]
We propose pix2pix-zero, an image-to-image translation method that can preserve the original image without manual prompting.
We propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process.
Our method does not need additional training for these edits and can directly use the existing text-to-image diffusion model.
arXiv Detail & Related papers (2023-02-06T18:59:51Z) - Dreamix: Video Diffusion Models are General Video Editors [22.127604561922897]
Text-driven image and video diffusion models have recently achieved unprecedented generation realism.
We present the first diffusion-based method that is able to perform text-based motion and appearance editing of general videos.
arXiv Detail & Related papers (2023-02-02T18:58:58Z)
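Several of the entries above (and FateZero itself) rely on DDIM inversion to map a real image or video frame back into the diffusion noise space before editing. The sketch below shows the deterministic inversion update in isolation; `eps_model`, the timestep schedule, and the tensor shapes are illustrative assumptions rather than any specific paper's interface.

```python
import torch


def ddim_invert(x0, eps_model, alphas_cumprod, timesteps):
    """Run the deterministic DDIM update backwards: clean latent -> noisy latent."""
    x = x0
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):     # timesteps increase
        a_cur, a_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
        eps = eps_model(x, t_cur)                                # predicted noise at t_cur
        x0_pred = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()  # implied clean latent
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps  # re-noise to t_next
    return x


# Toy usage with a dummy noise predictor, just to show the call pattern.
alphas_cumprod = torch.linspace(0.9999, 0.02, 1000)
dummy_eps_model = lambda x, t: torch.zeros_like(x)               # stands in for a UNet
latent = torch.randn(1, 4, 64, 64)                               # e.g. a VAE-encoded frame
noisy_latent = ddim_invert(latent, dummy_eps_model, alphas_cumprod,
                           timesteps=list(range(0, 1000, 20)))
```

Because the update is deterministic, running the same schedule forward with the target prompt (and, in methods like FateZero, with stored attention maps fused in) yields an edit that stays close to the source content.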