EVE: Efficient zero-shot text-based Video Editing with Depth Map
Guidance and Temporal Consistency Constraints
- URL: http://arxiv.org/abs/2308.10648v1
- Date: Mon, 21 Aug 2023 11:36:46 GMT
- Title: EVE: Efficient zero-shot text-based Video Editing with Depth Map
Guidance and Temporal Consistency Constraints
- Authors: Yutao Chen, Xingning Dong, Tian Gan, Chunluan Zhou, Ming Yang, and
Qingpei Guo
- Abstract summary: Current video editing methods mainly suffer from the dilemma between the high fine-tuning cost and the limited generation capacity.
We propose EVE, a robust and efficient zero-shot video editing method.
Under the guidance of depth maps and temporal consistency constraints, EVE derives satisfactory video editing results with an affordable computational and time cost.
- Score: 20.1875350156484
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Motivated by the superior performance of image diffusion models, more and
more researchers strive to extend these models to the text-based video editing
task. Nevertheless, current video editing methods mainly suffer from the dilemma
between the high fine-tuning cost and the limited generation capacity. Compared
with images, we conjecture that videos necessitate more constraints to preserve
the temporal consistency during editing. Towards this end, we propose EVE, a
robust and efficient zero-shot video editing method. Under the guidance of
depth maps and temporal consistency constraints, EVE derives satisfactory video
editing results with an affordable computational and time cost. Moreover,
recognizing the absence of a publicly available video editing dataset for fair
comparisons, we construct a new benchmark dataset, ZVE-50. Through comprehensive
experimentation, we validate that EVE could achieve a satisfactory trade-off
between performance and efficiency. We will release our dataset and codebase to
facilitate future research.
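As a rough illustration of the two ingredients named above (depth-map guidance and a temporal consistency constraint), the sketch below conditions a stand-in denoiser on per-frame depth maps and applies the gradient of a temporal-consistency loss to the frame latents. The model, shapes, and guidance weight are assumptions for illustration only, not EVE's released code.

```python
# Illustrative sketch only: a depth-conditioned noise estimate plus a
# temporal-consistency gradient over frame latents (not EVE's actual code).
import torch


def temporal_consistency_loss(latents: torch.Tensor) -> torch.Tensor:
    # latents: (T, C, H, W); penalize differences between adjacent frame latents.
    return (latents[1:] - latents[:-1]).pow(2).mean()


class DepthConditionedDenoiser(torch.nn.Module):
    # Stand-in for a depth-conditioned UNet; the text embedding is ignored here.
    def __init__(self, channels: int = 4):
        super().__init__()
        self.net = torch.nn.Conv2d(channels + 1, channels, kernel_size=3, padding=1)

    def forward(self, latents, depth, text_emb):
        return self.net(torch.cat([latents, depth], dim=1))


def guided_step(latents, depth, text_emb, denoiser, weight: float = 0.1):
    # One editing step: the noise estimate is conditioned on depth, and the
    # temporal-consistency gradient nudges adjacent frame latents together.
    latents = latents.detach().requires_grad_(True)
    eps = denoiser(latents, depth, text_emb)
    grad, = torch.autograd.grad(temporal_consistency_loss(latents), latents)
    return (latents - weight * grad).detach(), eps


num_frames = 8
latents = torch.randn(num_frames, 4, 32, 32)   # per-frame latents
depth = torch.rand(num_frames, 1, 32, 32)      # per-frame depth maps
text_emb = torch.randn(num_frames, 77, 768)    # placeholder prompt embeddings
latents, eps = guided_step(latents, depth, text_emb, DepthConditionedDenoiser())
```

The actual denoising schedule is abstracted away; the point is only that a temporal constraint can be applied as a per-step gradient on the frame latents while depth maps steer the structure of each frame.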
Related papers
- VIVID-10M: A Dataset and Baseline for Versatile and Interactive Video Local Editing [13.006616304789878]
This paper introduces a dataset, VIVID-10M, and a baseline model, VIVID.
VIVID-10M is the first large-scale hybrid image-video local editing dataset.
Our approach achieves state-of-the-art performance in video local editing, surpassing baseline methods in both automated metrics and user studies.
arXiv Detail & Related papers (2024-11-22T10:04:05Z)
- StableV2V: Stablizing Shape Consistency in Video-to-Video Editing [11.09708780767668]
In this paper, we present StableV2V, a shape-consistent video editing method.
Our method decomposes the editing pipeline into several sequential procedures: it first edits the first video frame, then establishes an alignment between the delivered motions and the user prompts, and finally propagates the edited contents to all other frames based on this alignment.
Experimental results and analyses demonstrate the superior performance, visual consistency, and inference efficiency of our method compared to existing state-of-the-art studies.
arXiv Detail & Related papers (2024-11-17T11:48:01Z)
- Taming Rectified Flow for Inversion and Editing [57.3742655030493]
Rectified-flow-based diffusion transformers, such as FLUX and OpenSora, have demonstrated exceptional performance in the field of image and video generation.
Despite their robust generative capabilities, these models often suffer from inaccurate inversion, which could limit their effectiveness in downstream tasks such as image and video editing.
We propose RF-Solver, a novel training-free sampler that enhances inversion precision by reducing errors in the process of solving rectified flow ODEs.
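To see why solver accuracy matters when inverting a rectified-flow ODE, the toy comparison below contrasts a first-order Euler step with a second-order midpoint step on a synthetic velocity field. The conventions (data at t=0, noise at t=1) and the field itself are assumptions for illustration; this is a generic numerical sketch, not the paper's sampler.

```python
# Toy comparison of ODE solvers for rectified-flow inversion (data -> noise).
# The velocity field is synthetic and stands in for a learned model.
import torch


def velocity(x: torch.Tensor, t: float) -> torch.Tensor:
    return -x * (1.0 - t) + 0.5 * t


def invert_euler(x0: torch.Tensor, steps: int) -> torch.Tensor:
    x, dt = x0.clone(), 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt)               # first-order step
    return x


def invert_midpoint(x0: torch.Tensor, steps: int) -> torch.Tensor:
    x, dt = x0.clone(), 1.0 / steps
    for i in range(steps):
        t = i * dt
        x_mid = x + 0.5 * dt * velocity(x, t)          # half step
        x = x + dt * velocity(x_mid, t + 0.5 * dt)     # second-order correction
    return x


x0 = torch.randn(4, 3, 8, 8)
reference = invert_euler(x0, steps=10_000)             # near-exact integration
print("Euler error:   ", (invert_euler(x0, 10) - reference).abs().mean().item())
print("Midpoint error:", (invert_midpoint(x0, 10) - reference).abs().mean().item())
```

Each midpoint step costs an extra velocity evaluation but reduces the global integration error from O(dt) to O(dt^2), which is the general reason more careful ODE solving yields more faithful inversions.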
arXiv Detail & Related papers (2024-11-07T14:29:02Z)
- COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video.
We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing.
COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
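As a loose sketch of correspondence-guided aggregation, the snippet below matches per-frame feature tokens across frames by cosine similarity and blends each frame's tokens with their matched counterparts from the previous frame. The shapes and the blending rule are illustrative assumptions, not COVE's actual mechanism.

```python
# Illustrative sketch: cosine-similarity correspondences between per-frame
# feature tokens, used to smooth features across frames (not COVE's code).
import torch
import torch.nn.functional as F


def nearest_correspondence(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    # feat_*: (N, C) token features; return, for each token in feat_a, the index
    # of its most similar token in feat_b under cosine similarity.
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    return (a @ b.t()).argmax(dim=-1)


def smooth_with_correspondence(feats: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    # feats: (T, N, C) per-frame token features; blend each frame's tokens with
    # their corresponding tokens from the (already smoothed) previous frame.
    out = [feats[0]]
    for t in range(1, feats.shape[0]):
        idx = nearest_correspondence(feats[t], out[-1])
        out.append(alpha * feats[t] + (1.0 - alpha) * out[-1][idx])
    return torch.stack(out)


feats = torch.randn(8, 64, 320)        # 8 frames, 64 tokens, 320-dim features
consistent = smooth_with_correspondence(feats)
print(consistent.shape)                # torch.Size([8, 64, 320])
```

Propagating through the already-smoothed previous frame chains the correspondences over time, which is the intuition behind using feature matches, rather than extra training, to keep edits consistent.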
arXiv Detail & Related papers (2024-06-13T06:27:13Z)
- Zero-Shot Video Editing through Adaptive Sliding Score Distillation [51.57440923362033]
This study proposes a novel paradigm of video-based score distillation, facilitating direct manipulation of original video content.
We propose an Adaptive Sliding Score Distillation strategy, which incorporates both global and local video guidance to reduce the impact of editing errors.
arXiv Detail & Related papers (2024-06-07T12:33:59Z)
- EffiVED: Efficient Video Editing via Text-instruction Diffusion Models [9.287394166165424]
We introduce EffiVED, an efficient diffusion-based model that supports instruction-guided video editing.
We transform vast image editing datasets and open-world videos into a high-quality dataset for training EffiVED.
arXiv Detail & Related papers (2024-03-18T08:42:08Z)
- FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing [8.907836546058086]
Existing approaches relying on image generation models for video editing suffer from time-consuming one-shot fine-tuning, additional condition extraction, or DDIM inversion.
We propose FastVideoEdit, an efficient zero-shot video editing approach inspired by Consistency Models (CMs).
Our method enables direct mapping from source video to target video with strong preservation ability utilizing a special variance schedule.
arXiv Detail & Related papers (2024-03-10T17:12:01Z)
- The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse [58.0132400208411]
Even a single edit can trigger model collapse, manifesting as significant performance degradation in various benchmark tasks.
Benchmarking Large Language Models after each edit is impractically time-consuming and resource-intensive.
We have utilized GPT-3.5 to develop a new dataset, HardEdit, based on hard cases.
arXiv Detail & Related papers (2024-02-15T01:50:38Z)
- Edit Temporal-Consistent Videos with Image Diffusion Model [49.88186997567138]
Large-scale text-to-image (T2I) diffusion models have been extended for text-guided video editing.
The proposed method achieves state-of-the-art performance in both video temporal consistency and video editing capability.
arXiv Detail & Related papers (2023-08-17T16:40:55Z)
- INVE: Interactive Neural Video Editing [79.48055669064229]
Interactive Neural Video Editing (INVE) is a real-time video editing solution that consistently propagates sparse frame edits to the entire video clip.
Our method is inspired by the recent work on Layered Neural Atlas (LNA).
LNA suffers from two major drawbacks: (1) the method is too slow for interactive editing, and (2) it offers insufficient support for some editing use cases.
arXiv Detail & Related papers (2023-07-15T00:02:41Z)