V-LASIK: Consistent Glasses-Removal from Videos Using Synthetic Data
- URL: http://arxiv.org/abs/2406.14510v1
- Date: Thu, 20 Jun 2024 17:14:43 GMT
- Title: V-LASIK: Consistent Glasses-Removal from Videos Using Synthetic Data
- Authors: Rotem Shalev-Arkushin, Aharon Azulay, Tavi Halperin, Eitan Richardson, Amit H. Bermano, Ohad Fried
- Abstract summary: Diffusion-based generative models have recently shown remarkable image and video editing capabilities.
We focus on consistent and identity-preserving removal of glasses in videos, using it as a case study for consistent local attribute removal in videos.
We show that despite data imperfection, our model is able to perform the desired edit consistently while preserving the original video content.
- Score: 20.23001319056999
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion-based generative models have recently shown remarkable image and video editing capabilities. However, local video editing, particularly removal of small attributes like glasses, remains a challenge. Existing methods either alter the videos excessively, generate unrealistic artifacts, or fail to perform the requested edit consistently throughout the video. In this work, we focus on consistent and identity-preserving removal of glasses in videos, using it as a case study for consistent local attribute removal in videos. Due to the lack of paired data, we adopt a weakly supervised approach and generate synthetic imperfect data, using an adjusted pretrained diffusion model. We show that despite data imperfection, by learning from our generated data and leveraging the prior of pretrained diffusion models, our model is able to perform the desired edit consistently while preserving the original video content. Furthermore, we exemplify the generalization ability of our method to other local video editing tasks by applying it successfully to facial sticker-removal. Our approach demonstrates significant improvement over existing methods, showcasing the potential of leveraging synthetic data and strong video priors for local video editing tasks.
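The abstract does not spell out how the synthetic pairs are produced; one plausible minimal recipe is to mask the glasses region of a face image and repaint it with an off-the-shelf diffusion inpainting model prompted toward a glasses-free face. The sketch below uses Hugging Face diffusers; the bounding box stands in for the output of any face parser or glasses detector, and the checkpoint name is a common public one, not necessarily the one used in the paper.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image, ImageDraw

def eye_region_mask(size, boxes):
    """Build a binary inpainting mask from glasses bounding boxes.
    In practice the boxes would come from a face parser or glasses
    detector; here they are passed in explicitly."""
    mask = Image.new("L", size, 0)
    draw = ImageDraw.Draw(mask)
    for box in boxes:
        draw.rectangle(box, fill=255)  # white = region to repaint
    return mask

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

face = Image.open("face_with_glasses.png").convert("RGB").resize((512, 512))
mask = eye_region_mask(face.size, boxes=[(140, 180, 380, 280)])

# Repaint the masked eye region toward a glasses-free face. The result
# is an *imperfect* pseudo-target: identity may drift slightly inside
# the repainted region.
result = pipe(
    prompt="a photo of a face, no glasses",
    negative_prompt="glasses, eyeglasses, sunglasses",
    image=face,
    mask_image=mask,
).images[0]
result.save("face_without_glasses_pseudo.png")
```

Because such pseudo-targets are imperfect, the paper treats the resulting pairs as weak supervision rather than exact ground truth, relying on the pretrained diffusion prior to compensate.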
Related papers
- Data Collection-free Masked Video Modeling [6.641717260925999]
We introduce an effective self-supervised learning framework for videos that generates pseudo-motion videos from less costly static images.
These pseudo-motion videos are then leveraged in masked video modeling.
Our approach is applicable to synthetic images as well, thus entirely freeing video training from data collection costs and other concerns associated with real data.
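The abstract does not specify how pseudo-motion is synthesized; a simple construction consistent with the description is to sweep a crop window across a static image so consecutive frames exhibit translation-like motion, then apply standard tube masking for masked video modeling. A minimal sketch under that assumption:

```python
import torch
import torchvision.transforms.functional as TF

def pseudo_motion_clip(image, num_frames=16, crop=160, stride=8):
    """Turn one static image (C, H, W) into a (T, C, crop, crop) clip
    by sliding a crop window, so frames exhibit translation-like motion."""
    _, H, W = image.shape
    frames = []
    for t in range(num_frames):
        top = min(t * stride, H - crop)
        left = min(t * stride, W - crop)
        frames.append(TF.crop(image, top, left, crop, crop))
    return torch.stack(frames)

def tube_mask(T, num_patches, ratio=0.9):
    """Masked video modeling commonly uses 'tube' masking: the same
    spatial patches are hidden in every frame."""
    hidden = torch.rand(num_patches) < ratio
    return hidden.unsqueeze(0).expand(T, -1)  # (T, num_patches)

image = torch.rand(3, 320, 320)          # stand-in for a real photo
clip = pseudo_motion_clip(image)          # (16, 3, 160, 160)
mask = tube_mask(T=16, num_patches=100)   # e.g. a 10x10 patch grid
print(clip.shape, mask.shape, mask.float().mean().item())
```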
arXiv Detail & Related papers (2024-09-10T17:34:07Z)
- COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video.
We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing.
COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
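As a rough illustration of the underlying idea, diffusion feature correspondence can be computed by matching U-Net feature tokens across frames with cosine similarity; COVE's actual algorithm refines this, so the sketch below is generic rather than the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def token_correspondence(feat_a, feat_b):
    """Given diffusion U-Net feature maps for two frames, (C, H, W)
    each, return for every token in frame A the index of its
    best-matching token in frame B by cosine similarity."""
    C, H, W = feat_a.shape
    a = F.normalize(feat_a.reshape(C, -1).T, dim=-1)  # (HW, C)
    b = F.normalize(feat_b.reshape(C, -1).T, dim=-1)  # (HW, C)
    sim = a @ b.T                                     # (HW, HW)
    return sim.argmax(dim=-1)                         # (HW,)

feat_t = torch.randn(1280, 16, 16)    # frame t features (stand-ins)
feat_t1 = torch.randn(1280, 16, 16)   # frame t+1 features
match = token_correspondence(feat_t, feat_t1)
# Corresponding tokens can then share attention values across frames
# so that edits stay consistent over time.
print(match.shape)  # torch.Size([256])
```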
arXiv Detail & Related papers (2024-06-13T06:27:13Z)
- Temporally Consistent Object Editing in Videos using Extended Attention [9.605596668263173]
We propose a method to edit videos using a pre-trained inpainting image diffusion model.
Our method ensures that the edited content remains consistent across all video frames.
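Extended attention is commonly implemented by letting each frame's queries attend to keys and values gathered from all frames of the clip; a minimal sketch of that pattern (not necessarily the paper's exact variant):

```python
import torch

def extended_attention(q, k, v):
    """Extended (cross-frame) self-attention: queries of each frame
    attend to keys/values concatenated from *all* frames, so edited
    appearance is shared across the whole clip.
    q, k, v: (T, N, C) with T frames, N tokens per frame."""
    T, N, C = k.shape
    k_all = k.reshape(1, T * N, C)  # all frames' keys, broadcast to each frame
    v_all = v.reshape(1, T * N, C)
    attn = torch.softmax(q @ k_all.transpose(1, 2) / C**0.5, dim=-1)  # (T, N, T*N)
    return attn @ v_all  # (T, N, C)

T, N, C = 8, 256, 64
q, k, v = (torch.randn(T, N, C) for _ in range(3))
out = extended_attention(q, k, v)
print(out.shape)  # torch.Size([8, 256, 64])
```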
arXiv Detail & Related papers (2024-06-01T02:31:16Z)
- EffiVED: Efficient Video Editing via Text-instruction Diffusion Models [9.287394166165424]
We introduce EffiVED, an efficient diffusion-based model that supports instruction-guided video editing.
We transform vast image editing datasets and open-world videos into a high-quality dataset for training EffiVED.
arXiv Detail & Related papers (2024-03-18T08:42:08Z)
- VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models [96.55004961251889]
Video Instruction Diffusion (VIDiff) is a unified foundation model designed for a wide range of video tasks.
Our model can edit and translate input videos into the desired results within seconds, based on user instructions.
We provide convincing generative results for diverse input videos and written instructions, both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-11-30T18:59:52Z)
- Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models [65.268245109828]
Ground-A-Video is a video-to-video translation framework for multi-attribute video editing.
It attains temporally consistent editing of input videos in a training-free manner.
Experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
arXiv Detail & Related papers (2023-10-02T11:28:37Z)
- Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models [68.31777975873742]
Recent attempts at video editing require significant amounts of text-to-video data and compute for training.
We propose vid2vid-zero, a simple yet effective method for zero-shot video editing.
Experiments and analyses show promising results in editing attributes, subjects, places, etc., in real-world videos.
arXiv Detail & Related papers (2023-03-30T17:59:25Z)
- FateZero: Fusing Attentions for Zero-shot Text-based Video Editing [104.27329655124299]
We propose FateZero, a zero-shot text-based editing method for real-world videos that requires no per-prompt training or user-specific masks.
Our method is the first to demonstrate zero-shot, text-driven editing of video style and local attributes using a trained text-to-image model.
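At its core, FateZero-style fusion replaces attention maps in the editing pass with maps saved during DDIM inversion outside the edited region, so unedited content keeps its original structure. A simplified sketch of that blending step (the mask and shapes here are illustrative):

```python
import torch

def fuse_attention(edit_attn, inv_attn, blend_mask):
    """Simplified FateZero-style fusion: inside the region being
    edited (blend_mask == 1) keep the editing pass's attention;
    outside it, reuse the attention stored during DDIM inversion so
    unedited content keeps its original structure.
    edit_attn, inv_attn: (heads, N_q, N_k); blend_mask: (N_q,)."""
    m = blend_mask.view(1, -1, 1)  # broadcast over heads and keys
    return m * edit_attn + (1 - m) * inv_attn

heads, Nq, Nk = 8, 256, 77
edit_attn = torch.rand(heads, Nq, Nk)
inv_attn = torch.rand(heads, Nq, Nk)         # saved during inversion
blend_mask = (torch.rand(Nq) > 0.7).float()  # e.g. a thresholded
# cross-attention map of the edited word
fused = fuse_attention(edit_attn, inv_attn, blend_mask)
print(fused.shape)  # torch.Size([8, 256, 77])
```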
arXiv Detail & Related papers (2023-03-16T17:51:13Z)
- Dreamix: Video Diffusion Models are General Video Editors [22.127604561922897]
Text-driven image and video diffusion models have recently achieved unprecedented generation realism.
We present the first diffusion-based method that is able to perform text-based motion and appearance editing of general videos.
arXiv Detail & Related papers (2023-02-02T18:58:58Z)
- Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding [35.18070525015657]
We propose a novel face video editing framework based on diffusion autoencoders.
Our model is based on diffusion models and provides both reconstruction and editing capabilities at the same time.
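A toy sketch of the disentangled-encoding idea: one latent shared across the clip carries identity, per-frame latents carry motion, and editing the shared latent once changes every frame consistently. The linear encoder/decoder below is a stand-in for the paper's diffusion autoencoder:

```python
import torch
import torch.nn as nn

class DisentangledVideoAutoencoder(nn.Module):
    """Schematic only: one identity latent shared by the whole clip
    plus a per-frame latent for motion/background. Editing the shared
    latent changes the face attribute identically in every frame."""
    def __init__(self, d_frame=512, d_id=64, d_motion=64):
        super().__init__()
        self.enc_id = nn.Linear(d_frame, d_id)
        self.enc_motion = nn.Linear(d_frame, d_motion)
        self.dec = nn.Linear(d_id + d_motion, d_frame)

    def forward(self, frames, id_edit=None):
        # frames: (T, d_frame) flattened frame features
        z_id = self.enc_id(frames).mean(dim=0)  # shared over time
        if id_edit is not None:
            z_id = z_id + id_edit               # one edit, all frames
        z_motion = self.enc_motion(frames)      # varies per frame
        z = torch.cat([z_id.expand(len(frames), -1), z_motion], dim=-1)
        return self.dec(z)

model = DisentangledVideoAutoencoder()
frames = torch.randn(16, 512)
edited = model(frames, id_edit=torch.randn(64) * 0.1)
print(edited.shape)  # torch.Size([16, 512])
```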
arXiv Detail & Related papers (2022-12-06T07:41:51Z)
- Learning to Cut by Watching Movies [114.57935905189416]
This paper focuses on a new task for computational video editing, namely the task of ranking cut plausibility.
Our key idea is to leverage content that has already been edited to learn fine-grained audiovisual patterns that trigger cuts.
We devise a model that learns to discriminate between real and artificial cuts via contrastive learning.
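The contrastive setup can be sketched as an InfoNCE-style objective in which the real, editor-made cut must outscore artificial cuts sampled from the same footage; the shapes and the scoring model below are placeholders, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

def cut_ranking_loss(scores_real, scores_fake, temperature=0.1):
    """Contrastive cut ranking: a model scores audiovisual snippets
    around candidate cut points, and the real (editor-made) cut must
    outscore artificially sampled cuts from the same footage via an
    InfoNCE-style cross-entropy.
    scores_real: (B,); scores_fake: (B, K)."""
    logits = torch.cat([scores_real.unsqueeze(1), scores_fake], dim=1)
    targets = torch.zeros(len(scores_real), dtype=torch.long)  # index 0 = real
    return F.cross_entropy(logits / temperature, targets)

B, K = 4, 7
scores_real = torch.randn(B, requires_grad=True)  # from a learned scorer
scores_fake = torch.randn(B, K)
loss = cut_ranking_loss(scores_real, scores_fake)
loss.backward()
print(loss.item())
```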
arXiv Detail & Related papers (2021-08-09T18:37:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.