Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models
- URL: http://arxiv.org/abs/2303.17599v3
- Date: Thu, 4 Jan 2024 01:30:50 GMT
- Title: Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models
- Authors: Wen Wang, Yan Jiang, Kangyang Xie, Zide Liu, Hao Chen, Yue Cao,
Xinlong Wang, Chunhua Shen
- Abstract summary: Recent attempts at video editing require significant text-to-video data and computation resources for training.
We propose vid2vid-zero, a simple yet effective method for zero-shot video editing.
Experiments and analyses show promising results in editing attributes, subjects, places, etc., in real-world videos.
- Score: 68.31777975873742
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale text-to-image diffusion models achieve unprecedented success in
image generation and editing. However, how to extend such success to video
editing is unclear. Recent initial attempts at video editing require
significant text-to-video data and computation resources for training, which is
often not accessible. In this work, we propose vid2vid-zero, a simple yet
effective method for zero-shot video editing. Our vid2vid-zero leverages
off-the-shelf image diffusion models, and doesn't require training on any
video. At the core of our method is a null-text inversion module for
text-to-video alignment, a cross-frame modeling module for temporal
consistency, and a spatial regularization module for fidelity to the original
video. Without any training, we leverage the dynamic nature of the attention
mechanism to enable bi-directional temporal modeling at test time. Experiments
and analyses show promising results in editing attributes, subjects, places,
etc., in real-world videos. Code is made available at
\url{https://github.com/baaivision/vid2vid-zero}.
Related papers
- Fine-gained Zero-shot Video Sampling [21.42513407755273]
We propose a novel Zero-Shot video sampling algorithm, denoted as $mathcalZS2$.
$mathcalZS2$ is capable of directly sampling high-quality video clips without any training or optimization.
It achieves state-of-the-art performance in zero-shot video generation, occasionally outperforming recent supervised methods.
arXiv Detail & Related papers (2024-07-31T09:36:58Z) - Slicedit: Zero-Shot Video Editing With Text-to-Image Diffusion Models Using Spatio-Temporal Slices [19.07572422897737]
We present Slicedit, a method for text-based video editing that utilize a pretrained T2I diffusion model to process both spatial andtemporal slices.
Our method generates videos retain the structure and motion of the original video while adhering to the target text.
arXiv Detail & Related papers (2024-05-20T17:55:56Z) - Investigating the Effectiveness of Cross-Attention to Unlock Zero-Shot Editing of Text-to-Video Diffusion Models [52.28245595257831]
Cross-attention guidance can be a promising approach for editing videos.
We show that despite the limitations of current T2V models, cross-attention guidance can be a promising approach for editing videos.
arXiv Detail & Related papers (2024-04-08T13:40:01Z) - Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image
Diffusion Models [65.268245109828]
Ground-A-Video is a video-to-video translation framework for multi-attribute video editing.
It attains temporally consistent editing of input videos in a training-free manner.
Experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
arXiv Detail & Related papers (2023-10-02T11:28:37Z) - TokenFlow: Consistent Diffusion Features for Consistent Video Editing [27.736354114287725]
We present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing.
Our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video.
Our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method.
arXiv Detail & Related papers (2023-07-19T18:00:03Z) - Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video
Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z) - FateZero: Fusing Attentions for Zero-shot Text-based Video Editing [104.27329655124299]
We propose FateZero, a zero-shot text-based editing method on real-world videos without per-prompt training or use-specific mask.
Our method is the first one to show the ability of zero-shot text-driven video style and local attribute editing from the trained text-to-image model.
arXiv Detail & Related papers (2023-03-16T17:51:13Z) - Dreamix: Video Diffusion Models are General Video Editors [22.127604561922897]
Text-driven image and video diffusion models have recently achieved unprecedented generation realism.
We present the first diffusion-based method that is able to perform text-based motion and appearance editing of general videos.
arXiv Detail & Related papers (2023-02-02T18:58:58Z) - MagicVideo: Efficient Video Generation With Latent Diffusion Models [76.95903791630624]
We present an efficient text-to-video generation framework based on latent diffusion models, termed MagicVideo.
Due to a novel and efficient 3D U-Net design and modeling video distributions in a low-dimensional space, MagicVideo can synthesize video clips with 256x256 spatial resolution on a single GPU card.
We conduct extensive experiments and demonstrate that MagicVideo can generate high-quality video clips with either realistic or imaginary content.
arXiv Detail & Related papers (2022-11-20T16:40:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.