Edit Temporal-Consistent Videos with Image Diffusion Model
- URL: http://arxiv.org/abs/2308.09091v2
- Date: Sat, 30 Dec 2023 04:09:45 GMT
- Title: Edit Temporal-Consistent Videos with Image Diffusion Model
- Authors: Yuanzhi Wang, Yong Li, Xiaoya Zhang, Xin Liu, Anbo Dai, Antoni B.
Chan, Zhen Cui
- Abstract summary: Large-scale text-to-image (T2I) diffusion models have been extended for text-guided video editing.
TCVE achieves state-of-the-art performance in both video temporal consistency and video editing capability.
- Score: 49.88186997567138
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale text-to-image (T2I) diffusion models have been extended for
text-guided video editing, yielding impressive zero-shot video editing
performance. Nonetheless, the generated videos usually show spatial
irregularities and temporal inconsistencies as the temporal characteristics of
videos have not been faithfully modeled. In this paper, we propose an elegant
yet effective Temporal-Consistent Video Editing (TCVE) method to mitigate the
temporal inconsistency challenge for robust text-guided video editing. In
addition to the utilization of a pretrained T2I 2D Unet for spatial content
manipulation, we establish a dedicated temporal Unet architecture to faithfully
capture the temporal coherence of the input video sequences. Furthermore, to
establish coherence and interrelation between the spatial-focused and
temporal-focused components, a cohesive spatial-temporal modeling unit is
formulated. This unit effectively interconnects the temporal Unet with the
pretrained 2D Unet, thereby enhancing the temporal consistency of the generated
videos while preserving the capacity for video content manipulation.
Quantitative experimental results and visualization results demonstrate that
TCVE achieves state-of-the-art performance in both video temporal consistency
and video editing capability, surpassing existing benchmarks in the field.
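The pipeline the abstract describes, per-frame spatial manipulation by a pretrained 2D Unet, a temporal Unet over the frame axis, and a unit that fuses the two, can be caricatured in a few lines of numpy. Everything below is an illustrative assumption: the placeholder ops, the blending fusion rule, and the `alpha` weight are stand-ins, not TCVE's actual Unet architectures or its spatial-temporal modeling unit.

```python
import numpy as np

def spatial_op(frame):
    # Stand-in for the pretrained T2I 2D Unet: edits each frame independently.
    # A real system would run a denoising Unet here; we blend toward the
    # per-channel mean purely to have a deterministic per-frame operation.
    return 0.5 * frame + 0.5 * frame.mean(axis=(-2, -1), keepdims=True)

def temporal_op(video):
    # Stand-in for the temporal Unet: a fixed 1D smoothing kernel along the
    # frame axis, encouraging coherence between neighbouring frames.
    kernel = np.array([0.25, 0.5, 0.25])
    padded = np.pad(video, ((1, 1), (0, 0), (0, 0), (0, 0)), mode="edge")
    return kernel[0] * padded[:-2] + kernel[1] * padded[1:-1] + kernel[2] * padded[2:]

def spatial_temporal_unit(video, alpha=0.5):
    """Fuse per-frame spatial edits with a temporal-coherence pass.

    video: array of shape (T, C, H, W). The weighted-sum fusion and the
    `alpha` weight are simplifying assumptions for illustration only.
    """
    spatial = np.stack([spatial_op(f) for f in video])
    temporal = temporal_op(spatial)
    return alpha * spatial + (1 - alpha) * temporal

video = np.random.rand(8, 3, 16, 16)  # 8 frames, RGB, 16x16
edited = spatial_temporal_unit(video)
print(edited.shape)  # (8, 3, 16, 16)
```

The shape-preserving design mirrors the abstract's point: the temporal branch operates on top of the spatially edited frames rather than replacing them, so content manipulation and cross-frame coherence are handled by separate components.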
Related papers
- COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video.
We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing.
COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
arXiv Detail & Related papers (2024-06-13T06:27:13Z) - FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing [10.011515580084243]
Existing approaches relying on image generation models for video editing suffer from time-consuming one-shot fine-tuning, additional condition extraction, or DDIM inversion.
We propose FastVideoEdit, an efficient zero-shot video editing approach inspired by Consistency Models (CMs).
Our method enables direct mapping from source video to target video with strong preservation ability utilizing a special variance schedule.
arXiv Detail & Related papers (2024-03-10T17:12:01Z) - VidToMe: Video Token Merging for Zero-Shot Video Editing [100.79999871424931]
We propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames.
Our method improves temporal coherence and reduces memory consumption in self-attention computations.
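The cross-frame token-merging idea can be sketched in a few lines: for each token in a frame, find its most similar token in the previous frame and, if similar enough, let both frames share an averaged representation. This is a toy sketch under stated assumptions; the cosine-similarity matching and the `threshold` parameter are my simplifications, not VidToMe's actual bipartite merging inside self-attention.

```python
import numpy as np

def merge_tokens_across_frames(tokens, threshold=0.9):
    """Toy cross-frame token merging (not VidToMe's exact algorithm).

    tokens: (T, N, D) array of per-frame token embeddings. Each token in
    frame t is matched to its nearest token (cosine similarity) in frame
    t-1; sufficiently similar pairs are replaced by their average.
    """
    out = tokens.copy()
    for t in range(1, len(out)):
        prev, cur = out[t - 1], out[t]
        # Cosine-similarity matrix between current and previous frame tokens.
        pn = prev / np.linalg.norm(prev, axis=1, keepdims=True)
        cn = cur / np.linalg.norm(cur, axis=1, keepdims=True)
        sim = cn @ pn.T                      # (N, N)
        match = sim.argmax(axis=1)           # best previous-frame token
        keep = sim.max(axis=1) >= threshold  # merge only confident matches
        merged = 0.5 * (cur + prev[match])
        out[t] = np.where(keep[:, None], merged, cur)
    return out

tokens = np.random.rand(4, 16, 32)  # 4 frames, 16 tokens, 32-dim
merged = merge_tokens_across_frames(tokens, threshold=0.5)
print(merged.shape)
```

Averaging matched tokens is what buys both benefits the summary mentions: shared representations keep attention outputs consistent across frames, and fewer distinct tokens shrink the self-attention computation.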
arXiv Detail & Related papers (2023-12-17T09:05:56Z) - ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation [33.37279673304]
We introduce ConditionVideo, a training-free approach to text-to-video generation based on the provided condition, video, and input text.
ConditionVideo generates realistic dynamic videos from random noise or given scene videos.
Our method exhibits superior performance in terms of frame consistency, clip score, and conditional accuracy, outperforming other compared methods.
arXiv Detail & Related papers (2023-10-11T17:46:28Z) - VideoComposer: Compositional Video Synthesis with Motion Controllability [52.4714732331632]
VideoComposer allows users to flexibly compose a video with textual conditions, spatial conditions, and more importantly temporal conditions.
We introduce the motion vector from compressed videos as an explicit control signal to provide guidance regarding temporal dynamics.
In addition, we develop a Spatio-Temporal Condition encoder (STC-encoder) that serves as a unified interface to effectively incorporate the spatial and temporal relations of sequential inputs.
arXiv Detail & Related papers (2023-06-03T06:29:02Z) - Edit-A-Video: Single Video Editing with Object-Aware Consistency [49.43316939996227]
We propose a video editing framework given only a pretrained TTI model and a single <text, video> pair, which we term Edit-A-Video.
The framework consists of two stages: (1) inflating the 2D model into a 3D model by appending temporal modules and tuning on the source video; (2) inverting the source video into noise and editing with the target text prompt and attention map injection.
We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.
arXiv Detail & Related papers (2023-03-14T14:35:59Z) - Video Demoireing with Relation-Based Temporal Consistency [68.20281109859998]
Moiré patterns, appearing as color distortions, severely degrade image and video quality when filming a screen with digital cameras.
We study how to remove such undesirable moiré patterns in videos, namely video demoireing.
arXiv Detail & Related papers (2022-04-06T17:45:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.