Edit Temporal-Consistent Videos with Image Diffusion Model
- URL: http://arxiv.org/abs/2308.09091v2
- Date: Sat, 30 Dec 2023 04:09:45 GMT
- Title: Edit Temporal-Consistent Videos with Image Diffusion Model
- Authors: Yuanzhi Wang, Yong Li, Xiaoya Zhang, Xin Liu, Anbo Dai, Antoni B.
Chan, Zhen Cui
- Abstract summary: Large-scale text-to-image (T2I) diffusion models have been extended for text-guided video editing.
TCVE achieves state-of-the-art performance in both video temporal consistency and video editing capability.
- Score: 49.88186997567138
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale text-to-image (T2I) diffusion models have been extended for
text-guided video editing, yielding impressive zero-shot video editing
performance. Nonetheless, the generated videos usually show spatial
irregularities and temporal inconsistencies as the temporal characteristics of
videos have not been faithfully modeled. In this paper, we propose an elegant
yet effective Temporal-Consistent Video Editing (TCVE) method to mitigate the
temporal inconsistency challenge for robust text-guided video editing. In
addition to the utilization of a pretrained T2I 2D Unet for spatial content
manipulation, we establish a dedicated temporal Unet architecture to faithfully
capture the temporal coherence of the input video sequences. Furthermore, to
establish coherence and interrelation between the spatial-focused and
temporal-focused components, a cohesive spatial-temporal modeling unit is
formulated. This unit effectively interconnects the temporal Unet with the
pretrained 2D Unet, thereby enhancing the temporal consistency of the generated
videos while preserving the capacity for video content manipulation.
Quantitative experimental results and visualization results demonstrate that
TCVE achieves state-of-the-art performance in both video temporal consistency
and video editing capability, surpassing existing benchmarks in the field.
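The abstract describes three pieces: a pretrained T2I 2D Unet applied to each frame for spatial content manipulation, a temporal Unet operating along the time axis, and a cohesive spatial-temporal modeling unit connecting the two. As a rough illustration of how such components could be wired together, here is a minimal PyTorch sketch; the module names, shapes, gated fusion rule, and the placeholder spatial-UNet call are assumptions, not the authors' implementation.

```python
# Minimal sketch (not the TCVE code): a frozen per-frame spatial 2D UNet,
# a lightweight temporal module over the time axis, and a fusion unit that
# blends the two feature streams. Names, shapes, and fusion are assumptions.
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """1D convolution over the time axis, applied independently at each pixel."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):  # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        x = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, t)
        x = self.conv(x)
        return x.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)

class SpatialTemporalUnit(nn.Module):
    """Blends spatial (per-frame) and temporal features with a learned gate."""
    def __init__(self):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init: spatial-only at start

    def forward(self, spatial_feat, temporal_feat):
        return spatial_feat + torch.tanh(self.gate) * temporal_feat

class VideoDenoiser(nn.Module):
    def __init__(self, spatial_unet_2d, channels):
        super().__init__()
        self.spatial_unet = spatial_unet_2d        # pretrained T2I UNet (frozen)
        self.temporal_block = TemporalBlock(channels)
        self.fusion = SpatialTemporalUnit()

    def forward(self, video, timestep, text_emb):  # video: (B, C, T, H, W)
        b, c, t, h, w = video.shape
        frames = video.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        # Placeholder call: a real pretrained UNet has a different interface
        # (e.g. the text embedding would need to be repeated per frame).
        spatial = self.spatial_unet(frames, timestep, text_emb)
        spatial = spatial.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
        temporal = self.temporal_block(spatial)    # coherence across frames
        return self.fusion(spatial, temporal)
```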
Related papers
- VideoDirector: Precise Video Editing via Text-to-Video Models [45.53826541639349]
Current video editing methods rely on text-to-image (T2I) models, which inherently lack temporal-coherence generative ability.
We propose a spatial-temporal decoupled guidance (STDG) and multi-frame null-text optimization strategy to provide temporal cues for more precise pivotal inversion.
Experimental results demonstrate that our method effectively harnesses the powerful temporal generation capabilities of T2V models.
arXiv Detail & Related papers (2024-11-26T16:56:53Z)
- COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video.
We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing.
COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
arXiv Detail & Related papers (2024-06-13T06:27:13Z)
- FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing [8.907836546058086]
Existing approaches relying on image generation models for video editing suffer from time-consuming one-shot fine-tuning, additional condition extraction, or DDIM inversion.
We propose FastVideoEdit, an efficient zero-shot video editing approach inspired by Consistency Models (CMs).
Our method enables direct mapping from source video to target video with strong preservation ability utilizing a special variance schedule.
arXiv Detail & Related papers (2024-03-10T17:12:01Z)
- VidToMe: Video Token Merging for Zero-Shot Video Editing [100.79999871424931]
We propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames.
Our method improves temporal coherence and reduces memory consumption in self-attention computations.
arXiv Detail & Related papers (2023-12-17T09:05:56Z)
- ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation [33.37279673304]
We introduce ConditionVideo, a training-free approach to text-to-video generation based on the provided condition, video, and input text.
ConditionVideo generates realistic dynamic videos from random noise or given scene videos.
Our method exhibits superior performance in terms of frame consistency, clip score, and conditional accuracy, outperforming other compared methods.
arXiv Detail & Related papers (2023-10-11T17:46:28Z)
- VideoComposer: Compositional Video Synthesis with Motion Controllability [52.4714732331632]
VideoComposer allows users to flexibly compose a video with textual conditions, spatial conditions, and more importantly temporal conditions.
We introduce the motion vector from compressed videos as an explicit control signal to provide guidance regarding temporal dynamics.
In addition, we develop a Spatio-Temporal Condition encoder (STC-encoder) that serves as a unified interface to effectively incorporate the spatial and temporal relations of sequential inputs.
arXiv Detail & Related papers (2023-06-03T06:29:02Z)
- Edit-A-Video: Single Video Editing with Object-Aware Consistency [49.43316939996227]
We propose a video editing framework given only a pretrained TTI model and a single <text, video> pair, which we term Edit-A-Video.
The framework consists of two stages: (1) inflating the 2D model into a 3D model by appending temporal modules and tuning on the source video, and (2) inverting the source video into noise and editing it with the target text prompt and attention map injection (see the inflation sketch after this list).
We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.
arXiv Detail & Related papers (2023-03-14T14:35:59Z)
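Edit-A-Video above (and, in a related spirit, TCVE itself) turns a pretrained 2D image UNet into a video model by appending temporal modules to it. A minimal, hypothetical sketch of that inflation step follows; the class names, the zero-initialized gate, and the choice to insert one temporal layer after each spatial block are illustrative assumptions, not code from any of the listed papers.

```python
# Hypothetical sketch of inflating a 2D image UNet block into a video block by
# appending a temporal self-attention layer; not any listed paper's actual code.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the time axis, zero-gated so it starts as a no-op."""
    def __init__(self, channels, num_heads=8):  # channels must be divisible by num_heads
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x, num_frames):  # x: (B*T, C, H, W)
        bt, c, h, w = x.shape
        b = bt // num_frames
        # One attention sequence per spatial location, of length T.
        tokens = x.reshape(b, num_frames, c, h * w).permute(0, 3, 1, 2)
        tokens = tokens.reshape(b * h * w, num_frames, c)
        normed = self.norm(tokens)
        attn_out, _ = self.attn(normed, normed, normed)
        tokens = tokens + self.gate * attn_out
        tokens = tokens.reshape(b, h * w, num_frames, c).permute(0, 2, 3, 1)
        return tokens.reshape(bt, c, h, w)

def inflate_2d_block(block_2d, out_channels):
    """Wrap a pretrained 2D block so a temporal layer runs right after it.

    `block_2d` is assumed to map (B*T, C_in, H, W) -> (B*T, out_channels, H, W).
    """
    class Inflated(nn.Module):
        def __init__(self):
            super().__init__()
            self.spatial = block_2d                       # keeps pretrained weights
            self.temporal = TemporalAttention(out_channels)

        def forward(self, x, num_frames):
            return self.temporal(self.spatial(x), num_frames)

    return Inflated()
```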
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.