Pix2Video: Video Editing using Image Diffusion
- URL: http://arxiv.org/abs/2303.12688v1
- Date: Wed, 22 Mar 2023 16:36:10 GMT
- Title: Pix2Video: Video Editing using Image Diffusion
- Authors: Duygu Ceylan, Chun-Hao Paul Huang, Niloy J. Mitra
- Abstract summary: We investigate how to use pre-trained image models for text-guided video editing.
Our method works in two simple steps: first, we use a pre-trained structure-guided (e.g., depth) image diffusion model to perform text-guided edits on an anchor frame; then, we progressively propagate these edits to the remaining frames via self-attention feature injection.
We demonstrate that realistic text-guided video edits are possible, without any compute-intensive preprocessing or video-specific finetuning.
- Score: 43.07444438561277
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Image diffusion models, trained on massive image collections, have emerged as
the most versatile image generators in terms of quality and diversity.
They support inverting real images and conditional (e.g., text) generation,
making them attractive for high-quality image editing applications. We
investigate how to use such pre-trained image models for text-guided video
editing. The critical challenge is to achieve the target edits while still
preserving the content of the source video. Our method works in two simple
steps: first, we use a pre-trained structure-guided (e.g., depth) image
diffusion model to perform text-guided edits on an anchor frame; then, in the
key step, we progressively propagate the changes to the future frames via
self-attention feature injection to adapt the core denoising step of the
diffusion model. We then consolidate the changes by adjusting the latent code
for the frame before continuing the process. Our approach is training-free and
generalizes to a wide range of edits. We demonstrate the effectiveness of the
approach by extensive experimentation and compare it against four different
prior and parallel efforts (on ArXiv). We demonstrate that realistic
text-guided video edits are possible, without any compute-intensive
preprocessing or video-specific finetuning.
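To make the key propagation step above concrete, the snippet below is a minimal PyTorch sketch of self-attention feature injection: queries are computed from the current frame while keys and values are taken from reference-frame features (e.g., the edited anchor frame) cached at the same UNet layer and timestep. The tensor shapes, function names, and choice of reference frames are illustrative assumptions rather than the authors' exact implementation, and the follow-up latent-code adjustment step is omitted.

```python
# Sketch of self-attention feature injection (illustrative, not the authors'
# code): queries come from the current frame, keys/values from reference
# frames (e.g., the edited anchor frame), encouraging appearance consistency.
import torch
import torch.nn.functional as F


def injected_self_attention(x_cur, x_ref, to_q, to_k, to_v, to_out, num_heads=8):
    """x_cur: (B, N, C) current-frame features at one UNet self-attention layer.
    x_ref: (B, M, C) reference features at the same layer and timestep,
           e.g. anchor-frame and previous-frame tokens concatenated along dim 1."""
    q = to_q(x_cur)   # queries from the frame being denoised
    k = to_k(x_ref)   # keys injected from the reference frame(s)
    v = to_v(x_ref)   # values injected from the reference frame(s)

    def split_heads(t):
        b, n, c = t.shape
        return t.view(b, n, num_heads, c // num_heads).transpose(1, 2)

    q, k, v = map(split_heads, (q, k, v))
    out = F.scaled_dot_product_attention(q, k, v)       # (B, heads, N, C/heads)
    out = out.transpose(1, 2).reshape(x_cur.shape)      # back to (B, N, C)
    return to_out(out)


# Toy usage: in practice, x_ref would hold features cached while denoising the
# edited anchor frame; here random tensors stand in for them.
C = 64
to_q, to_k, to_v, to_out = (torch.nn.Linear(C, C) for _ in range(4))
cur, ref = torch.randn(1, 256, C), torch.randn(1, 512, C)
print(injected_self_attention(cur, ref, to_q, to_k, to_v, to_out).shape)  # (1, 256, 64)
```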
Related papers
- Portrait Video Editing Empowered by Multimodal Generative Priors [39.747581584889495]
We introduce PortraitGen, a powerful portrait video editing method that achieves consistent and expressive stylization with multimodal prompts.
Our approach incorporates multimodal inputs through knowledge distilled from large-scale 2D generative models.
Our system also incorporates expression similarity guidance and a face-aware portrait editing module, effectively mitigating degradation issues associated with iterative dataset updates.
arXiv Detail & Related papers (2024-09-20T15:45:13Z) - Unified Editing of Panorama, 3D Scenes, and Videos Through Disentangled Self-Attention Injection [60.47731445033151]
We propose a novel unified editing framework that combines the strengths of both approaches by utilizing only a basic 2D text-to-image (T2I) diffusion model.
Experimental results confirm that our method enables editing across diverse modalities including 3D scenes, videos, and panorama images.
arXiv Detail & Related papers (2024-05-27T04:44:36Z) - I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models [18.36472998650704]
We introduce a novel and generic solution that extends the applicability of image editing tools to videos by propagating edits from a single frame to the entire video using a pre-trained image-to-video model.
Our method, dubbed I2VEdit, adaptively preserves the visual and motion integrity of the source video depending on the extent of the edits.
arXiv Detail & Related papers (2024-05-26T11:47:40Z) - VASE: Object-Centric Appearance and Shape Manipulation of Real Videos [108.60416277357712]
In this work, we introduce an object-centric framework designed both to control the object's appearance and, notably, to execute precise and explicit structural modifications on the object.
We build our framework on a pre-trained image-conditioned diffusion model, integrate layers to handle the temporal dimension, and propose training strategies and architectural modifications to enable shape control.
We evaluate our method on the image-driven video editing task showing similar performance to the state-of-the-art, and showcasing novel shape-editing capabilities.
arXiv Detail & Related papers (2024-01-04T18:59:24Z) - InFusion: Inject and Attention Fusion for Multi Concept Zero-Shot
Text-based Video Editing [27.661609140918916]
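The temporal layers mentioned in the VASE summary follow a pattern common to video-editing methods built on image diffusion models: small attention blocks that mix features across frames while the pre-trained spatial backbone stays frozen. The sketch below illustrates that generic pattern only; the layer placement, tensor layout, and class name are assumptions, not VASE's actual architecture.

```python
# Generic sketch (not VASE's code) of a temporal attention layer that can be
# interleaved with the frozen spatial layers of an image diffusion UNet so
# that features are mixed across frames at each spatial location.
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (B, F, C, H, W) video features; attention runs along the frame axis F."""
        b, f, c, h, w = x.shape
        # Treat every spatial location as an independent sequence of F frame tokens.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        normed = self.norm(tokens)
        mixed, _ = self.attn(normed, normed, normed)
        tokens = tokens + mixed  # residual keeps the frozen backbone's features intact
        return tokens.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)


# Toy usage: one such block would typically be inserted after each spatial
# attention block of the pre-trained UNet, and only these new layers trained.
x = torch.randn(2, 8, 64, 16, 16)      # (batch, frames, channels, height, width)
print(TemporalAttention(64)(x).shape)  # torch.Size([2, 8, 64, 16, 16])
```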
- InFusion: Inject and Attention Fusion for Multi Concept Zero-Shot Text-based Video Editing [27.661609140918916]
InFusion is a framework for zero-shot text-based video editing.
It supports editing multiple concepts at once, with pixel-level control over each concept mentioned in the editing prompt.
Our framework is a low-cost alternative to one-shot tuned models for editing since it does not require training.
arXiv Detail & Related papers (2023-07-22T17:05:47Z) - Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models [68.31777975873742]
Recent attempts at video editing require significant text-to-video data and computation resources for training.
We propose vid2vid-zero, a simple yet effective method for zero-shot video editing.
Experiments and analyses show promising results in editing attributes, subjects, places, etc., in real-world videos.
arXiv Detail & Related papers (2023-03-30T17:59:25Z) - FateZero: Fusing Attentions for Zero-shot Text-based Video Editing [104.27329655124299]
We propose FateZero, a zero-shot text-based editing method for real-world videos that requires no per-prompt training or user-specific masks.
Our method is the first one to show the ability of zero-shot text-driven video style and local attribute editing from the trained text-to-image model.
arXiv Detail & Related papers (2023-03-16T17:51:13Z) - Zero-shot Image-to-Image Translation [57.46189236379433]
We propose pix2pix-zero, an image-to-image translation method that can preserve the content of the original image without manual prompting.
We propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process.
Our method does not need additional training for these edits and can directly use the existing text-to-image diffusion model.
arXiv Detail & Related papers (2023-02-06T18:59:51Z) - Dreamix: Video Diffusion Models are General Video Editors [22.127604561922897]
- Dreamix: Video Diffusion Models are General Video Editors [22.127604561922897]
Text-driven image and video diffusion models have recently achieved unprecedented generation realism.
We present the first diffusion-based method that is able to perform text-based motion and appearance editing of general videos.
arXiv Detail & Related papers (2023-02-02T18:58:58Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.