Towards Consistent Video Editing with Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2305.17431v1
- Date: Sat, 27 May 2023 10:03:36 GMT
- Title: Towards Consistent Video Editing with Text-to-Image Diffusion Models
- Authors: Zicheng Zhang, Bonan Li, Xuecheng Nie, Congying Han, Tiande Guo, Luoqi Liu
- Abstract summary: Existing works have advanced Text-to-Image (TTI) diffusion models for video editing in a one-shot learning manner.
These methods might produce results with unsatisfactory consistency with the text prompt as well as the temporal sequence.
We propose a novel EI$^2$ model towards Enhancing vIdeo Editing consIstency of TTI-based frameworks.
- Score: 10.340371518799444
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing works have advanced Text-to-Image (TTI) diffusion models for video
editing in a one-shot learning manner. Despite their low requirements of data
and computation, these methods might produce results with unsatisfactory consistency
with the text prompt as well as the temporal sequence, limiting their applications in
the real world. In this paper, we propose to address the above issues with a
novel EI$^2$ model towards \textbf{E}nhancing v\textbf{I}deo \textbf{E}diting
cons\textbf{I}stency of TTI-based frameworks. Specifically, we analyze and find
that the inconsistency problem is caused by modules newly added to TTI models
for learning temporal information. These modules lead to covariate shift in the
feature space, which harms the editing capability. Thus, we design EI$^2$ to
tackle the above drawbacks with two classical modules: Shift-restricted
Temporal Attention Module (STAM) and Fine-coarse Frame Attention Module (FFAM).
First, through theoretical analysis, we demonstrate that covariate shift is
highly related to Layer Normalization, thus STAM employs an \textit{Instance
Centering} layer replacing it to preserve the distribution of temporal
features. In addition, STAM employs an attention layer with normalized
mapping to transform temporal features while constraining the variance shift.
As the second part, we incorporate STAM with a novel FFAM, which
efficiently leverages fine-coarse spatial information of overall frames to
further enhance temporal consistency. Extensive experiments demonstrate the
superiority of the proposed EI$^2$ model for text-driven video editing.
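The contrast between Layer Normalization and a centering-only layer can be illustrated with a minimal sketch. The abstract does not define Instance Centering, so the code below assumes it means subtracting the per-instance mean without rescaling the variance; the function names `layer_norm` and `mean_centering` and the toy data are hypothetical and are not taken from the paper.

```python
import random
import statistics


def layer_norm(x, eps=1e-5):
    """Layer Normalization over one feature vector (no learned affine terms)."""
    mean = statistics.fmean(x)
    var = statistics.pvariance(x, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]


def mean_centering(x):
    """Assumed stand-in for Instance Centering: remove the mean, keep the scale."""
    mean = statistics.fmean(x)
    return [v - mean for v in x]


random.seed(0)
# Toy feature vector with a deliberately large spread.
feats = [random.gauss(2.0, 5.0) for _ in range(64)]

ln_out = layer_norm(feats)
ic_out = mean_centering(feats)

print("input variance     :", statistics.pvariance(feats))
print("LayerNorm variance :", statistics.pvariance(ln_out))
print("centering variance :", statistics.pvariance(ic_out))
```

Under this assumption, Layer Normalization forces the output variance to roughly 1 regardless of the input, while centering leaves the input variance unchanged, which is one way to read "preserve the distribution of temporal features".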
Related papers
- Enhancing Low-Cost Video Editing with Lightweight Adaptors and Temporal-Aware Inversion [20.308013151046616]
We propose a framework that integrates temporal-spatial and semantic consistency with Bilateral DDIM inversion.
Our method significantly improves perceptual quality, text-image alignment, and temporal coherence, as demonstrated on the MSR-VTT dataset.
arXiv Detail & Related papers (2025-01-08T16:41:31Z)
- VideoDirector: Precise Video Editing via Text-to-Video Models [45.53826541639349]
Current video editing methods rely on text-to-image (T2I) models, which inherently lack temporal-coherence generative ability.
We propose a spatial-temporal decoupled guidance (STDG) and multi-frame null-text optimization strategy to provide pivotal temporal cues for more precise pivotal inversion.
Experimental results demonstrate that our method effectively harnesses the powerful temporal generation capabilities of T2V models.
arXiv Detail & Related papers (2024-11-26T16:56:53Z)
- Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing [60.730661748555214]
We introduce Task-Oriented Diffusion Inversion (TODInv), a novel framework that inverts and edits real images tailored to specific editing tasks.
TODInv seamlessly integrates inversion and editing through reciprocal optimization, ensuring both high fidelity and precise editability.
arXiv Detail & Related papers (2024-08-23T22:16:34Z)
- Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z)
- COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video.
We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing.
COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
arXiv Detail & Related papers (2024-06-13T06:27:13Z)
- Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing.
We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample.
We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z)
- DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation [37.25815760042241]
This paper introduces a new framework, dubbed DirecT2V, for text-to-video (T2V) generation.
We equip a diffusion model with a novel value mapping method and dual-softmax filtering, which do not require any additional training.
The experimental results validate the effectiveness of our framework in producing visually coherent and storyful videos.
arXiv Detail & Related papers (2023-05-23T17:57:09Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Edit-A-Video: Single Video Editing with Object-Aware Consistency [49.43316939996227]
We propose a video editing framework given only a pretrained TTI model and a single <text, video> pair, which we term Edit-A-Video.
The framework consists of two stages: (1) inflating the 2D model into a 3D model by appending temporal modules and tuning on the source video; (2) inverting the source video into noise and editing with the target text prompt and attention map injection.
We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.
arXiv Detail & Related papers (2023-03-14T14:35:59Z)
- Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding [35.73830796500975]
We present an end-to-end one-stage framework, termed Spatio-Temporal Consistency-Aware Transformer (STCAT).
To generate the above template under sufficient video-textual perception, an encoder-decoder architecture is proposed for effective global context modeling.
Our method outperforms previous state-of-the-art methods by clear margins on two challenging video benchmarks.
arXiv Detail & Related papers (2022-09-27T11:13:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.