FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing
- URL: http://arxiv.org/abs/2403.06269v1
- Date: Sun, 10 Mar 2024 17:12:01 GMT
- Title: FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing
- Authors: Youyuan Zhang and Xuan Ju and James J. Clark
- Abstract summary: Existing approaches relying on image generation models for video editing suffer from time-consuming one-shot fine-tuning, additional condition extraction, or DDIM inversion.
We propose FastVideoEdit, an efficient zero-shot video editing approach inspired by Consistency Models (CMs).
Our method maps the source video directly to the target video with strong content preservation, using a special variance schedule.
- Score: 10.011515580084243
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion models have demonstrated remarkable capabilities in text-to-image
and text-to-video generation, opening up possibilities for video editing based
on textual input. However, the computational cost associated with sequential
sampling in diffusion models poses challenges for efficient video editing.
Existing approaches relying on image generation models for video editing suffer
from time-consuming one-shot fine-tuning, additional condition extraction, or
DDIM inversion, making real-time applications impractical. In this work, we
propose FastVideoEdit, an efficient zero-shot video editing approach inspired
by Consistency Models (CMs). By leveraging the self-consistency property of
CMs, we eliminate the need for time-consuming inversion or additional condition
extraction, reducing editing time. Our method maps the source video directly
to the target video with strong content preservation, using a special
variance schedule. This yields a clear speed advantage: fewer sampling steps
suffice while maintaining comparable generation quality.
Experimental results validate the state-of-the-art performance and speed
advantages of FastVideoEdit across evaluation metrics encompassing editing
speed, temporal consistency, and text-video alignment.
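To make the few-step editing idea concrete, the toy sketch below applies multistep consistency sampling to editing: the source latents are noised once (no DDIM inversion), then repeatedly mapped back to a clean estimate under the target condition, with a shrinking re-injected variance acting as the knob between edit strength and source preservation. The `consistency_fn` stand-in, the three-step schedule, and the tensor shapes are all illustrative assumptions, not the paper's implementation.

```python
# Toy sketch of consistency-model-style editing (illustrative only).
import torch

def consistency_fn(x_t, t, cond):
    """Stand-in for a trained consistency model f(x_t, t | c) that maps
    a noisy latent directly to a clean-sample estimate. A real model
    would be a conditioned U-Net; this one just decays toward `cond`
    so the loop below runs end to end."""
    return (1.0 - t) * x_t + t * cond

def edit_video(source_latents, cond, timesteps=(0.6, 0.3, 0.1)):
    """Few-step editing: noise the source once (no DDIM inversion),
    then repeatedly denoise with the consistency function."""
    x = source_latents + timesteps[0] * torch.randn_like(source_latents)
    for i, t in enumerate(timesteps):
        x0_hat = consistency_fn(x, t, cond)  # direct jump to a clean estimate
        if i + 1 < len(timesteps):
            # Re-noise at the next, smaller level; shrinking this injected
            # variance trades edit strength against source preservation.
            x = x0_hat + timesteps[i + 1] * torch.randn_like(x0_hat)
        else:
            x = x0_hat
    return x

frames = torch.randn(8, 4, 32, 32)     # toy (frames, channels, h, w) latents
cond = torch.zeros_like(frames)        # stand-in for a target-prompt condition
print(edit_video(frames, cond).shape)  # torch.Size([8, 4, 32, 32])
```

In a real system `consistency_fn` would be a prompt-conditioned consistency model distilled from a video diffusion model; the point of the sketch is that each step jumps directly to a clean estimate, so a handful of steps can replace dozens of diffusion sampling steps.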
Related papers
- COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt a pre-trained text-to-image (T2I) diffusion model to edit the source video.
We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing.
COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
arXiv Detail & Related papers (2024-06-13T06:27:13Z) - I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models [18.36472998650704]
We introduce a novel and generic solution that extends the applicability of image editing tools to videos by propagating edits from a single frame to the entire video using a pre-trained image-to-video model.
Our method, dubbed I2VEdit, adaptively preserves the visual and motion integrity of the source video depending on the extent of the edits.
arXiv Detail & Related papers (2024-05-26T11:47:40Z) - EffiVED:Efficient Video Editing via Text-instruction Diffusion Models [9.287394166165424]
We introduce EffiVED, an efficient diffusion-based model that supports instruction-guided video editing.
We transform vast image editing datasets and open-world videos into a high-quality dataset for training EffiVED.
arXiv Detail & Related papers (2024-03-18T08:42:08Z) - VidToMe: Video Token Merging for Zero-Shot Video Editing [100.79999871424931]
We propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames.
Our method improves temporal coherence and reduces memory consumption in self-attention computations; a minimal token-merging sketch appears after this list.
arXiv Detail & Related papers (2023-12-17T09:05:56Z) - Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image
Diffusion Models [65.268245109828]
Ground-A-Video is a video-to-video translation framework for multi-attribute video editing.
It attains temporally consistent editing of input videos in a training-free manner.
Experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
arXiv Detail & Related papers (2023-10-02T11:28:37Z) - MagicEdit: High-Fidelity and Temporally Coherent Video Editing [70.55750617502696]
We present MagicEdit, a surprisingly simple yet effective solution to the text-guided video editing task.
We found that high-fidelity and temporally coherent video-to-video translation can be achieved by explicitly disentangling the learning of content, structure and motion signals during training.
arXiv Detail & Related papers (2023-08-28T17:56:22Z) - Edit Temporal-Consistent Videos with Image Diffusion Model [49.88186997567138]
Large-scale text-to-image (T2I) diffusion models have been extended for text-guided video editing.
The proposed method achieves state-of-the-art performance in both video temporal consistency and video editing capability.
arXiv Detail & Related papers (2023-08-17T16:40:55Z) - VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing [18.24307442582304]
We introduce VidEdit, a novel method for zero-shot text-based video editing.
Our experiments show that VidEdit outperforms state-of-the-art methods on the DAVIS dataset.
arXiv Detail & Related papers (2023-06-14T19:15:49Z) - Edit-A-Video: Single Video Editing with Object-Aware Consistency [49.43316939996227]
We propose a video editing framework given only a pretrained text-to-image (TTI) model and a single <text, video> pair, which we term Edit-A-Video.
The framework consists of two stages: (1) inflating the 2D model into a 3D model by appending temporal modules and tuning on the source video, and (2) inverting the source video into noise and editing with the target text prompt and attention map injection; a stub-level sketch of these two stages follows this entry.
We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.
arXiv Detail & Related papers (2023-03-14T14:35:59Z)