ControlVideo: Conditional Control for One-shot Text-driven Video Editing
and Beyond
- URL: http://arxiv.org/abs/2305.17098v2
- Date: Tue, 28 Nov 2023 02:37:16 GMT
- Title: ControlVideo: Conditional Control for One-shot Text-driven Video Editing
and Beyond
- Authors: Min Zhao, Rongzhen Wang, Fan Bao, Chongxuan Li, Jun Zhu
- Abstract summary: ControlVideo generates a video that aligns with a given text while preserving the structure of the source video.
Building on a pre-trained text-to-image diffusion model, it enhances fidelity and temporal consistency by incorporating additional conditions (such as edge maps) and fine-tuning attention on the source video-text pair.
- Score: 45.188722895165505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents \emph{ControlVideo} for text-driven video editing --
generating a video that aligns with a given text while preserving the structure
of the source video. Building on a pre-trained text-to-image diffusion model,
ControlVideo enhances the fidelity and temporal consistency by incorporating
additional conditions (such as edge maps), and fine-tuning the key-frame and
temporal attention on the source video-text pair via an in-depth exploration of
the design space. Extensive experimental results demonstrate that ControlVideo
outperforms various competitive baselines by delivering videos that exhibit
high fidelity w.r.t. the source content and temporal consistency, all while
aligning with the text. By incorporating low-rank adaptation (LoRA) layers into the
model before training, ControlVideo is further empowered to generate videos
that align seamlessly with reference images. More importantly, ControlVideo can
be readily extended to the more challenging task of long video editing (e.g.,
with hundreds of frames), where maintaining long-range temporal consistency is
crucial. To achieve this, we propose to construct a fused ControlVideo by
applying basic ControlVideo to overlapping short video segments and key-frame
videos and then merging them with pre-defined weight functions. Empirical results
validate its capability to create videos of 140 frames, approximately
5.83 to 17.5 times more than previous works achieved. The
code is available at
\href{https://github.com/thu-ml/controlvideo}{https://github.com/thu-ml/controlvideo}
and the visualization results are available at
\href{https://drive.google.com/file/d/1wEgc2io3UwmoC5vTPbkccFvTkwVqsZlK/view?usp=drive_link}{HERE}.
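The low-rank adaptation mentioned in the abstract injects small trainable matrices into otherwise frozen projection layers of the pre-trained diffusion model. Below is a minimal, hypothetical PyTorch sketch of such a layer; the rank, scaling, and the choice of which projections to wrap are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update: y = W x + (alpha / r) * B(A(x))."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # keep the pre-trained weights frozen
            p.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A: project down to rank
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B: project back up
        nn.init.zeros_(self.up.weight)  # start as a zero update so behavior is unchanged at init
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

# Example: wrap a (hypothetical) attention projection so only the LoRA factors are trained.
proj = nn.Linear(320, 320)
lora_proj = LoRALinear(proj, rank=4)
out = lora_proj(torch.randn(2, 77, 320))  # shape preserved: (2, 77, 320)
```

Only the `down` and `up` factors receive gradients, which keeps the number of trainable parameters small relative to the frozen base model.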
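For the long-video extension, the abstract describes running basic ControlVideo on overlapping short segments (plus key-frame videos) and merging the results with pre-defined weight functions. The sketch below illustrates one plausible reading of that merge step, per-frame weighted averaging of overlapping segment outputs under a triangular weight function; the actual weight functions, segment lengths, and overlaps in the released code may differ.

```python
import numpy as np

def triangular_weights(length: int) -> np.ndarray:
    """Pre-defined per-frame weights that peak at the segment centre (illustrative choice)."""
    ramp = np.minimum(np.arange(1, length + 1), np.arange(length, 0, -1))
    return ramp.astype(np.float64)

def fuse_segments(segments, starts, total_frames):
    """Merge overlapping edited segments into one video by per-frame weighted averaging.

    segments: list of arrays of shape (seg_len, H, W, C), outputs of basic ControlVideo.
    starts:   first-frame index of each segment within the full video.
    """
    h, w, c = segments[0].shape[1:]
    acc = np.zeros((total_frames, h, w, c))
    norm = np.zeros((total_frames, 1, 1, 1))
    for seg, start in zip(segments, starts):
        wts = triangular_weights(len(seg))[:, None, None, None]
        acc[start:start + len(seg)] += wts * seg   # weighted contribution of this segment
        norm[start:start + len(seg)] += wts        # accumulate weights for normalization
    return acc / np.clip(norm, 1e-8, None)

# Example: three 16-frame segments with 8-frame overlaps covering a 32-frame video.
segs = [np.random.rand(16, 64, 64, 3) for _ in range(3)]
video = fuse_segments(segs, starts=[0, 8, 16], total_frames=32)
print(video.shape)  # (32, 64, 64, 3)
```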
Related papers
- CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer [55.515836117658985]
We present CogVideoX, a large-scale text-to-video generation model based on a diffusion transformer.
It can generate 10-second continuous videos aligned with a text prompt, at a frame rate of 16 fps and a resolution of 768 x 1360 pixels.
arXiv Detail & Related papers (2024-08-12)
- LOVECon: Text-driven Training-Free Long Video Editing with ControlNet [9.762680144118061]
This paper aims to bridge the gap, establishing a simple and effective baseline for training-free, diffusion-model-based long video editing.
We build the pipeline upon ControlNet, which excels at various image editing tasks based on text prompts.
Our method manages to edit videos comprising hundreds of frames according to user requirements.
arXiv Detail & Related papers (2023-10-15)
- Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models [65.268245109828]
Ground-A-Video is a video-to-video translation framework for multi-attribute video editing.
It attains temporally consistent editing of input videos in a training-free manner.
Experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit accuracy and frame consistency.
arXiv Detail & Related papers (2023-10-02)
- VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing [18.24307442582304]
We introduce VidEdit, a novel method for zero-shot text-based video editing.
Our experiments show that VidEdit outperforms state-of-the-art methods on the DAVIS dataset.
arXiv Detail & Related papers (2023-06-14)
- Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps such as edge and depth maps.
We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process.
Our framework generates higher-quality, more consistent videos than existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23)
- ControlVideo: Training-free Controllable Text-to-Video Generation [117.06302461557044]
ControlVideo is a framework to enable natural and efficient text-to-video generation.
It generates both short and long videos within several minutes using one NVIDIA 2080Ti.
arXiv Detail & Related papers (2023-05-22)
- Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models [0.0]
We propose a novel approach that combines zero-shot text-to-video generation with ControlNet to improve the output of these models.
Our method takes multiple sketched frames as input and generates video output that matches the flow of these frames.
arXiv Detail & Related papers (2023-05-10)
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably to, and sometimes better than, recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.