FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video
editing
- URL: http://arxiv.org/abs/2310.05922v3
- Date: Thu, 29 Feb 2024 21:06:58 GMT
- Title: FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video
editing
- Authors: Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren,
Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, Sen He
- Abstract summary: We introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue for text-to-video editing.
Our method, FLATTEN, enforces the patches on the same flow path across different frames to attend to each other in the attention module.
Results on existing text-to-video editing benchmarks show that our proposed method achieves the new state-of-the-art performance.
- Score: 65.60744699017202
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-video editing aims to edit the visual appearance of a source video
conditional on textual prompts. A major challenge in this task is to ensure
that all frames in the edited video are visually consistent. Most recent works
apply advanced text-to-image diffusion models to this task by inflating 2D
spatial attention in the U-Net into spatio-temporal attention. Although
temporal context can be added through spatio-temporal attention, it may
introduce some irrelevant information for each patch and therefore cause
inconsistency in the edited video. In this paper, for the first time, we
introduce optical flow into the attention module in the diffusion model's U-Net
to address the inconsistency issue for text-to-video editing. Our method,
FLATTEN, enforces the patches on the same flow path across different frames to
attend to each other in the attention module, thus improving the visual
consistency in the edited videos. Additionally, our method is training-free and
can be seamlessly integrated into any diffusion-based text-to-video editing
methods and improve their visual consistency. Experiment results on existing
text-to-video editing benchmarks show that our proposed method achieves the new
state-of-the-art performance. In particular, our method excels in maintaining
the visual consistency in the edited videos.
Related papers
- COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video.
We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing.
COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
arXiv Detail & Related papers (2024-06-13T06:27:13Z) - Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image
Diffusion Models [65.268245109828]
Ground-A-Video is a video-to-video translation framework for multi-attribute video editing.
It attains temporally consistent editing of input videos in a training-free manner.
Experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
arXiv Detail & Related papers (2023-10-02T11:28:37Z) - MagicProp: Diffusion-based Video Editing via Motion-aware Appearance
Propagation [74.32046206403177]
MagicProp disentangles the video editing process into two stages: appearance editing and motion-aware appearance propagation.
In the first stage, MagicProp selects a single frame from the input video and applies image-editing techniques to modify the content and/or style of the frame.
In the second stage, MagicProp employs the edited frame as an appearance reference and generates the remaining frames using an autoregressive rendering approach.
arXiv Detail & Related papers (2023-09-02T11:13:29Z) - StableVideo: Text-driven Consistency-aware Diffusion Video Editing [24.50933856309234]
Diffusion-based methods can generate realistic images and videos, but they struggle to edit existing objects in a video while preserving their appearance over time.
This paper introduces temporal dependency to existing text-driven diffusion models, which allows them to generate consistent appearance for the edited objects.
We build up a text-driven video editing framework based on this mechanism, namely StableVideo, which can achieve consistency-aware video editing.
arXiv Detail & Related papers (2023-08-18T14:39:16Z) - TokenFlow: Consistent Diffusion Features for Consistent Video Editing [27.736354114287725]
We present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing.
Our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video.
Our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method.
arXiv Detail & Related papers (2023-07-19T18:00:03Z) - FateZero: Fusing Attentions for Zero-shot Text-based Video Editing [104.27329655124299]
We propose FateZero, a zero-shot text-based editing method on real-world videos without per-prompt training or use-specific mask.
Our method is the first one to show the ability of zero-shot text-driven video style and local attribute editing from the trained text-to-image model.
arXiv Detail & Related papers (2023-03-16T17:51:13Z) - Edit-A-Video: Single Video Editing with Object-Aware Consistency [49.43316939996227]
We propose a video editing framework given only a pretrained TTI model and a single text, video> pair, which we term Edit-A-Video.
The framework consists of two stages: (1) inflating the 2D model into the 3D model by appending temporal modules tuning and on the source video (2) inverting the source video into the noise and editing with target text prompt and attention map injection.
We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.
arXiv Detail & Related papers (2023-03-14T14:35:59Z) - Shape-aware Text-driven Layered Video Editing [39.56765973770167]
We present a shape-aware, text-driven video editing method to handle shape changes.
We first propagate the deformation field between the input and edited to all frames.
We then leverage a pre-trained text-conditioned diffusion model as guidance for refining shape distortion and completing unseen regions.
arXiv Detail & Related papers (2023-01-30T18:41:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.