ControlVideo: Conditional Control for One-shot Text-driven Video Editing
and Beyond
- URL: http://arxiv.org/abs/2305.17098v2
- Date: Tue, 28 Nov 2023 02:37:16 GMT
- Title: ControlVideo: Conditional Control for One-shot Text-driven Video Editing
and Beyond
- Authors: Min Zhao, Rongzhen Wang, Fan Bao, Chongxuan Li, Jun Zhu
- Abstract summary: ControlVideo generates a video that aligns with a given text while preserving the structure of the source video.
Building on a pre-trained text-to-image diffusion model, it enhances fidelity and temporal consistency by incorporating additional conditions (such as edge maps) and fine-tuning attention on the source video-text pair.
- Score: 45.188722895165505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents \emph{ControlVideo} for text-driven video editing --
generating a video that aligns with a given text while preserving the structure
of the source video. Building on a pre-trained text-to-image diffusion model,
ControlVideo enhances the fidelity and temporal consistency by incorporating
additional conditions (such as edge maps), and fine-tuning the key-frame and
temporal attention on the source video-text pair via an in-depth exploration of
the design space. Extensive experimental results demonstrate that ControlVideo
outperforms various competitive baselines by delivering videos that exhibit
high fidelity w.r.t. the source content and temporal consistency, all while
aligning with the text. By incorporating low-rank adaptation (LoRA) layers into the
model before training, ControlVideo is further empowered to generate videos
that align seamlessly with reference images. More importantly, ControlVideo can
be readily extended to the more challenging task of long video editing (e.g.,
with hundreds of frames), where maintaining long-range temporal consistency is
crucial. To achieve this, we propose to construct a fused ControlVideo by
applying basic ControlVideo to overlapping short video segments and key-frame
videos and then merging them with pre-defined weight functions. Empirical results
validate its capability to create videos of 140 frames, approximately
5.83 to 17.5 times more than previous works achieved. The
code is available at
\href{https://github.com/thu-ml/controlvideo}{https://github.com/thu-ml/controlvideo}
and the visualization results are available at
\href{https://drive.google.com/file/d/1wEgc2io3UwmoC5vTPbkccFvTkwVqsZlK/view?usp=drive_link}{HERE}.
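The low-rank adaptation mentioned in the abstract injects small trainable matrices into otherwise frozen projection layers of the pre-trained diffusion model. Below is a minimal, hypothetical PyTorch sketch of such a layer; the rank, scaling, and the choice of which projections to wrap are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update: y = W x + (alpha / r) * B(A(x))."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # keep the pre-trained weights frozen
            p.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A: project down to rank
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B: project back up
        nn.init.zeros_(self.up.weight)  # start as a zero update so behavior is unchanged at init
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

# Example: wrap a (hypothetical) attention projection so only the LoRA factors are trained.
proj = nn.Linear(320, 320)
lora_proj = LoRALinear(proj, rank=4)
out = lora_proj(torch.randn(2, 77, 320))  # shape preserved: (2, 77, 320)
```

Only the `down` and `up` factors receive gradients, which keeps the number of trainable parameters small relative to the frozen base model.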
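For the long-video extension, the abstract describes running basic ControlVideo on overlapping short segments (plus key-frame videos) and merging the results with pre-defined weight functions. The sketch below illustrates one plausible reading of that merge step, per-frame weighted averaging of overlapping segment outputs under a triangular weight function; the actual weight functions, segment lengths, and overlaps in the released code may differ.

```python
import numpy as np

def triangular_weights(length: int) -> np.ndarray:
    """Pre-defined per-frame weights that peak at the segment centre (illustrative choice)."""
    ramp = np.minimum(np.arange(1, length + 1), np.arange(length, 0, -1))
    return ramp.astype(np.float64)

def fuse_segments(segments, starts, total_frames):
    """Merge overlapping edited segments into one video by per-frame weighted averaging.

    segments: list of arrays of shape (seg_len, H, W, C), outputs of basic ControlVideo.
    starts:   first-frame index of each segment within the full video.
    """
    h, w, c = segments[0].shape[1:]
    acc = np.zeros((total_frames, h, w, c))
    norm = np.zeros((total_frames, 1, 1, 1))
    for seg, start in zip(segments, starts):
        wts = triangular_weights(len(seg))[:, None, None, None]
        acc[start:start + len(seg)] += wts * seg   # weighted contribution of this segment
        norm[start:start + len(seg)] += wts        # accumulate weights for normalization
    return acc / np.clip(norm, 1e-8, None)

# Example: three 16-frame segments with 8-frame overlaps covering a 32-frame video.
segs = [np.random.rand(16, 64, 64, 3) for _ in range(3)]
video = fuse_segments(segs, starts=[0, 8, 16], total_frames=32)
print(video.shape)  # (32, 64, 64, 3)
```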
Related papers
- CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer [55.515836117658985]
We present CogVideoX, a large-scale text-to-video generation model based on a diffusion transformer.
It can generate 10-second continuous videos aligned with a text prompt, at a frame rate of 16 fps and a resolution of 768 x 1360 pixels.
arXiv Detail & Related papers (2024-08-12)
- LOVECon: Text-driven Training-Free Long Video Editing with ControlNet [9.762680144118061]
This paper aims to bridge the gap, establishing a simple and effective baseline for training-free, diffusion-model-based long video editing.
We build the pipeline upon ControlNet, which excels at various image editing tasks based on text prompts.
Our method manages to edit videos comprising hundreds of frames according to user requirements.
arXiv Detail & Related papers (2023-10-15)
- Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models [65.268245109828]
Ground-A-Video is a video-to-video translation framework for multi-attribute video editing.
It attains temporally consistent editing of input videos in a training-free manner.
Experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit accuracy and frame consistency.
arXiv Detail & Related papers (2023-10-02)
- VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing [18.24307442582304]
We introduce VidEdit, a novel method for zero-shot text-based video editing.
Our experiments show that VidEdit outperforms state-of-the-art methods on the DAVIS dataset.
arXiv Detail & Related papers (2023-06-14)
- Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps such as edge and depth maps.
We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process.
Our framework generates higher-quality, more consistent videos than existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23)
- ControlVideo: Training-free Controllable Text-to-Video Generation [117.06302461557044]
ControlVideo is a framework to enable natural and efficient text-to-video generation.
It generates both short and long videos within several minutes using one NVIDIA 2080Ti.
arXiv Detail & Related papers (2023-05-22)
- Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models [0.0]
We propose a novel approach that combines zero-shot text-to-video generation with ControlNet to improve the output of these models.
Our method takes multiple sketched frames as input and generates video output that matches the flow of these frames.
arXiv Detail & Related papers (2023-05-10)
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably to, and sometimes better than, recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.