Diffusion Video Autoencoders: Toward Temporally Consistent Face Video
Editing via Disentangled Video Encoding
- URL: http://arxiv.org/abs/2212.02802v2
- Date: Mon, 27 Mar 2023 11:15:59 GMT
- Title: Diffusion Video Autoencoders: Toward Temporally Consistent Face Video
Editing via Disentangled Video Encoding
- Authors: Gyeongman Kim, Hajin Shim, Hyunsu Kim, Yunjey Choi, Junho Kim, Eunho
Yang
- Abstract summary: We propose a novel face video editing framework based on diffusion autoencoders.
Because our model is based on diffusion models, it can provide both reconstruction and editing capabilities at the same time.
- Score: 35.18070525015657
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inspired by the impressive performance of recent face image editing
methods, several studies have naturally been proposed to extend these methods
to the face video editing task. One of the main challenges here is temporal
consistency among edited frames, which remains unresolved. To this end, we
propose a novel face video editing framework based on diffusion autoencoders
that can successfully extract decomposed features of identity and motion from
a given video - for the first time in a face video editing model. This
modeling allows us to edit the video consistently by simply manipulating the
temporally invariant feature in the desired direction. Another unique strength
of our model is that, since it is based on diffusion models, it can provide
both reconstruction and editing capabilities at the same time, and it is
robust to corner cases in wild face videos (e.g., occluded faces), unlike
existing GAN-based methods.
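A minimal sketch of the editing recipe described in the abstract, assuming a per-frame encoder that splits each frame into an identity part and a motion part plus a conditional decoder (the class names, feature sizes, and tiny linear stand-in modules below are illustrative assumptions, not the authors' implementation): one clip-level, temporally invariant identity feature is shifted along an attribute direction and shared by every frame, while the per-frame motion features are left untouched.

```python
# Hypothetical sketch of disentangled video editing with a diffusion autoencoder.
# The real model uses pretrained identity/motion encoders and a conditional
# diffusion decoder; tiny linear stand-ins are used here so the sketch runs.
import torch
import torch.nn as nn

FEATURE_DIM = 512

class ToyFrameEncoder(nn.Module):
    """Stand-in for the per-frame encoder (identity + motion features)."""
    def __init__(self, frame_dim: int = 3 * 64 * 64):
        super().__init__()
        self.id_head = nn.Linear(frame_dim, FEATURE_DIM)      # identity-related part
        self.motion_head = nn.Linear(frame_dim, FEATURE_DIM)  # frame-specific motion part

    def forward(self, frames: torch.Tensor):
        flat = frames.flatten(1)                              # (T, C*H*W)
        return self.id_head(flat), self.motion_head(flat)

class ToyDecoder(nn.Module):
    """Stand-in for the conditional decoder (deterministic here, diffusion in the paper)."""
    def __init__(self, frame_dim: int = 3 * 64 * 64):
        super().__init__()
        self.net = nn.Linear(2 * FEATURE_DIM, frame_dim)

    def forward(self, id_feat: torch.Tensor, motion_feat: torch.Tensor):
        out = self.net(torch.cat([id_feat, motion_feat], dim=-1))
        return out.view(-1, 3, 64, 64)

def edit_video(frames, encoder, decoder, edit_direction, scale=1.0):
    """Edit a whole clip by shifting only the shared, temporally invariant feature."""
    id_per_frame, motion_per_frame = encoder(frames)
    shared_id = id_per_frame.mean(dim=0, keepdim=True)        # one identity feature per clip
    edited_id = shared_id + scale * edit_direction            # e.g. a "+beard" attribute direction
    edited_id = edited_id.expand(frames.shape[0], -1)         # same edited feature for every frame
    return decoder(edited_id, motion_per_frame)               # per-frame motion is left untouched

if __name__ == "__main__":
    clip = torch.randn(8, 3, 64, 64)                          # 8 frames of a toy face video
    direction = torch.randn(1, FEATURE_DIM)                   # hypothetical attribute direction
    edited = edit_video(clip, ToyFrameEncoder(), ToyDecoder(), direction, scale=0.5)
    print(edited.shape)                                       # torch.Size([8, 3, 64, 64])
```

Because every frame is decoded from the same edited identity feature, the edit itself cannot flicker across frames; in the actual framework the decoder would be a conditional diffusion model rather than a linear layer.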
Related papers
- I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models [18.36472998650704]
We introduce a novel and generic solution that extends the applicability of image editing tools to videos by propagating edits from a single frame to the entire video using a pre-trained image-to-video model.
Our method, dubbed I2VEdit, adaptively preserves the visual and motion integrity of the source video depending on the extent of the edits.
arXiv Detail & Related papers (2024-05-26T11:47:40Z)
- DreamMotion: Space-Time Self-Similar Score Distillation for Zero-Shot Video Editing [48.238213651343784]
Video score distillation can introduce new content indicated by target text, but can also cause structure and motion deviation.
We propose to match space-time self-similarities of the original video and the edited video during the score distillation.
Our approach is model-agnostic, which can be applied for both cascaded and non-cascaded video diffusion frameworks.
arXiv Detail & Related papers (2024-03-18T17:38:53Z)
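A minimal sketch of the space-time self-similarity matching idea summarized in the DreamMotion entry above, assuming per-frame feature maps of the original and edited clips are available (the function names and feature shapes are illustrative, not the paper's code): pairwise cosine similarities over all spatio-temporal tokens are compared between the two clips, and the resulting penalty can be added to the score-distillation objective to discourage structure and motion deviation.

```python
# Hypothetical space-time self-similarity matching loss (a sketch, not DreamMotion's code).
import torch
import torch.nn.functional as F

def self_similarity(features: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity over all spatio-temporal tokens.

    features: (T, C, H, W) feature maps of a video clip.
    returns:  (T*H*W, T*H*W) self-similarity matrix.
    """
    tokens = features.permute(0, 2, 3, 1).reshape(-1, features.shape[1])  # (T*H*W, C)
    tokens = F.normalize(tokens, dim=-1)
    return tokens @ tokens.t()

def self_similarity_matching_loss(orig_feats: torch.Tensor,
                                  edited_feats: torch.Tensor) -> torch.Tensor:
    """Penalize deviation of the edited clip's space-time structure from the original's."""
    return F.l1_loss(self_similarity(edited_feats), self_similarity(orig_feats))

if __name__ == "__main__":
    orig = torch.randn(4, 32, 8, 8)                       # toy features of the source clip
    edited = (orig + 0.1 * torch.randn_like(orig)).requires_grad_(True)
    loss = self_similarity_matching_loss(orig, edited)
    loss.backward()                                       # would be combined with score distillation
    print(float(loss))
```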
- MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation [74.32046206403177]
MagicProp disentangles the video editing process into two stages: appearance editing and motion-aware appearance propagation.
In the first stage, MagicProp selects a single frame from the input video and applies image-editing techniques to modify the content and/or style of the frame.
In the second stage, MagicProp employs the edited frame as an appearance reference and generates the remaining frames using an autoregressive rendering approach.
arXiv Detail & Related papers (2023-09-02T11:13:29Z)
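A minimal sketch of the two-stage edit-then-propagate pattern summarized in the MagicProp entry above (the callables are hypothetical placeholders, not MagicProp's API): a single frame is edited with an image editor, and each remaining frame is then rendered autoregressively from the previously edited frame (appearance reference) and the corresponding source frame (motion reference).

```python
# Hypothetical edit-then-propagate loop (a sketch of the general pattern, not MagicProp's code).
from typing import Callable, List, Sequence

def edit_video_autoregressively(
    frames: Sequence[float],          # source frames, in order (floats stand in for images)
    edit_frame: Callable,             # stage 1: image editor applied to one selected frame
    render_next: Callable,            # stage 2: renderer conditioned on (edited prev, source frame)
) -> List[float]:
    edited = [edit_frame(frames[0])]
    for source_frame in frames[1:]:
        edited.append(render_next(edited[-1], source_frame))
    return edited

if __name__ == "__main__":
    # Toy stand-ins: "editing" brightens a frame, "rendering" mixes appearance and motion cues.
    frames = [float(t) for t in range(5)]
    result = edit_video_autoregressively(
        frames,
        edit_frame=lambda f: f + 10.0,
        render_next=lambda prev_edited, src: 0.5 * (prev_edited + (src + 10.0)),
    )
    print(result)
```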
- StableVideo: Text-driven Consistency-aware Diffusion Video Editing [24.50933856309234]
Diffusion-based methods can generate realistic images and videos, but they struggle to edit existing objects in a video while preserving their appearance over time.
This paper introduces temporal dependency to existing text-driven diffusion models, which allows them to generate consistent appearance for the edited objects.
We build up a text-driven video editing framework based on this mechanism, namely StableVideo, which can achieve consistency-aware video editing.
arXiv Detail & Related papers (2023-08-18T14:39:16Z)
- InFusion: Inject and Attention Fusion for Multi Concept Zero-Shot Text-based Video Editing [27.661609140918916]
InFusion is a framework for zero-shot text-based video editing.
It supports editing of multiple concepts with pixel-level control over diverse concepts mentioned in the editing prompt.
Our framework is a low-cost alternative to one-shot tuned models for editing since it does not require training.
arXiv Detail & Related papers (2023-07-22T17:05:47Z)
- FateZero: Fusing Attentions for Zero-shot Text-based Video Editing [104.27329655124299]
We propose FateZero, a zero-shot text-based editing method for real-world videos that requires no per-prompt training or user-specific mask.
Our method is the first one to show the ability of zero-shot text-driven video style and local attribute editing from the trained text-to-image model.
arXiv Detail & Related papers (2023-03-16T17:51:13Z)
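A minimal sketch of the attention-fusion idea behind the FateZero entry above, under the assumption that self-attention maps are saved during inversion of the source video and blended back in during the editing denoising pass (the function name, tensor shapes, and mask semantics are assumptions, not FateZero's code):

```python
# Hypothetical attention fusion: reuse source attention outside the edited region.
import torch

def fuse_attention(inverted_attn: torch.Tensor,
                   editing_attn: torch.Tensor,
                   blend_mask: torch.Tensor) -> torch.Tensor:
    """Blend attention maps saved at the same layer/timestep.

    inverted_attn, editing_attn: (heads, queries, keys) attention probabilities
        from the inversion pass and the editing pass, respectively.
    blend_mask: (queries,) in [0, 1]; 1 = region being edited (keep editing attention),
        0 = unedited region (reuse the source attention from inversion).
    """
    mask = blend_mask.view(1, -1, 1)
    return mask * editing_attn + (1.0 - mask) * inverted_attn

if __name__ == "__main__":
    heads, q, k = 8, 64, 64
    src_attn = torch.softmax(torch.randn(heads, q, k), dim=-1)   # saved during inversion
    edit_attn = torch.softmax(torch.randn(heads, q, k), dim=-1)  # produced during editing
    mask = (torch.rand(q) > 0.5).float()                         # toy edit-region mask
    fused = fuse_attention(src_attn, edit_attn, mask)
    print(fused.shape)                                           # torch.Size([8, 64, 64])
```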
- Edit-A-Video: Single Video Editing with Object-Aware Consistency [49.43316939996227]
We propose a video editing framework, termed Edit-A-Video, that is given only a pretrained text-to-image (TTI) model and a single <text, video> pair.
The framework consists of two stages: (1) inflating the 2D model into a 3D model by appending temporal modules and tuning on the source video, and (2) inverting the source video into noise and editing it with the target text prompt and attention map injection.
We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.
arXiv Detail & Related papers (2023-03-14T14:35:59Z)
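A minimal sketch of stage (1) of the Edit-A-Video entry above, inflating a 2D block into a pseudo-3D block by appending a temporal module (the class name and module choices are assumptions, not the paper's code): the pretrained spatial layer runs on each frame independently, and a newly added temporal attention lets every spatial location attend across frames.

```python
# Hypothetical 2D-to-3D "inflation" of a pretrained image-model block (a sketch, not Edit-A-Video's code).
import torch
import torch.nn as nn

class InflatedBlock(nn.Module):
    def __init__(self, spatial_layer: nn.Module, channels: int, num_heads: int = 4):
        super().__init__()
        self.spatial = spatial_layer                       # pretrained 2D layer, applied per frame
        self.temporal = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, T, C, H, W)
        b, t, c, h, w = video.shape
        x = self.spatial(video.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        # temporal attention: each (h, w) location attends across the T frames
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        normed = self.norm(tokens)
        attn_out, _ = self.temporal(normed, normed, normed)
        tokens = tokens + attn_out                         # residual keeps the 2D behaviour as a baseline
        return tokens.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)

if __name__ == "__main__":
    block = InflatedBlock(nn.Conv2d(16, 16, 3, padding=1), channels=16)
    clip = torch.randn(2, 8, 16, 32, 32)                   # (batch, frames, channels, H, W)
    print(block(clip).shape)                               # torch.Size([2, 8, 16, 32, 32])
```

In such one-shot setups, typically only the appended temporal modules and a few attention projections are tuned on the source video, while the pretrained spatial weights stay largely fixed, which keeps the tuning cost low.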
- Dreamix: Video Diffusion Models are General Video Editors [22.127604561922897]
Text-driven image and video diffusion models have recently achieved unprecedented generation realism.
We present the first diffusion-based method that is able to perform text-based motion and appearance editing of general videos.
arXiv Detail & Related papers (2023-02-02T18:58:58Z)
- UniFaceGAN: A Unified Framework for Temporally Consistent Facial Video Editing [78.26925404508994]
We propose a unified temporally consistent facial video editing framework termed UniFaceGAN.
Our framework is designed to handle face swapping and face reenactment simultaneously.
Compared with the state-of-the-art facial image editing methods, our framework generates video portraits that are more photo-realistic and temporally smooth.
arXiv Detail & Related papers (2021-08-12T10:35:22Z)
- Task-agnostic Temporally Consistent Facial Video Editing [84.62351915301795]
We propose a task-agnostic, temporally consistent facial video editing framework.
Based on a 3D reconstruction model, our framework is designed to handle several editing tasks in a more unified and disentangled manner.
Compared with the state-of-the-art facial image editing methods, our framework generates video portraits that are more photo-realistic and temporally smooth.
arXiv Detail & Related papers (2020-07-03T02:49:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.