SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models
- URL: http://arxiv.org/abs/2311.16933v1
- Date: Tue, 28 Nov 2023 16:33:08 GMT
- Title: SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models
- Authors: Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, Bo Dai
- Abstract summary: We present SparseCtrl to enable flexible structure control with temporally sparse signals.
It incorporates an additional condition encoder that processes these sparse signals while leaving the pre-trained T2V model untouched.
The proposed approach is compatible with various modalities, including sketches, depth maps, and RGB images.
- Score: 84.71887272654865
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The development of text-to-video (T2V), i.e., generating videos with a given
text prompt, has been significantly advanced in recent years. However, relying
solely on text prompts often results in ambiguous frame composition due to
spatial uncertainty. The research community thus leverages dense structure
signals, e.g., per-frame depth/edge sequences, to enhance controllability, but
collecting such signals adds a considerable burden at inference time. In this work,
we present SparseCtrl to enable flexible structure control with temporally
sparse signals, requiring only one or a few inputs, as shown in Figure 1. It
incorporates an additional condition encoder to process these sparse signals
while leaving the pre-trained T2V model untouched. The proposed approach is
compatible with various modalities, including sketches, depth maps, and RGB
images, providing more practical control for video generation and promoting
applications such as storyboarding, depth rendering, keyframe animation, and
interpolation. Extensive experiments demonstrate the generalization of
SparseCtrl on both original and personalized T2V generators. Codes and models
will be publicly available at https://guoyww.github.io/projects/SparseCtrl .
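To make the mechanism described above concrete, below is a minimal, hypothetical PyTorch sketch of the idea: an auxiliary condition encoder consumes temporally sparse control frames (e.g., a single RGB keyframe or a few depth maps) together with a per-frame mask, and its zero-initialized output is injected as a residual into a frozen, pre-trained T2V backbone. All module names, layer sizes, and the injection point are assumptions for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of a sparse condition encoder for a frozen T2V model.
import torch
import torch.nn as nn


class SparseConditionEncoder(nn.Module):
    """Encodes sparse per-frame conditions; unconditioned frames are masked out."""

    def __init__(self, cond_channels: int = 3, hidden: int = 64, out_channels: int = 320):
        super().__init__()
        # +1 input channel for the binary "is this frame conditioned?" mask
        self.net = nn.Sequential(
            nn.Conv3d(cond_channels + 1, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden, out_channels, kernel_size=3, padding=1),
        )
        # Zero-init the last layer so training starts as a no-op w.r.t. the backbone.
        nn.init.zeros_(self.net[-1].weight)
        nn.init.zeros_(self.net[-1].bias)

    def forward(self, cond: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # cond: (B, C, F, H, W) sparse control frames, zeros where no condition exists
        # mask: (B, 1, F, 1, 1) binary indicator of conditioned frames
        mask = mask.expand(-1, 1, -1, cond.shape[-2], cond.shape[-1])
        return self.net(torch.cat([cond * mask, mask], dim=1))


# Usage: only the encoder would be trained; the pre-trained T2V backbone stays
# frozen, and the encoder output is added to the backbone's input-level features.
encoder = SparseConditionEncoder()
cond = torch.zeros(1, 3, 16, 64, 64)       # 16-frame clip, conditions only at frames 0 and 15
mask = torch.zeros(1, 1, 16, 1, 1)
cond[:, :, [0, 15]] = torch.rand(1, 3, 2, 64, 64)
mask[:, :, [0, 15]] = 1.0
residual = encoder(cond, mask)             # (1, 320, 16, 64, 64) features to inject
```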
Related papers
- BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations [82.94002870060045]
Existing video generation models struggle to follow complex text prompts and synthesize multiple objects.
We develop a blob-grounded video diffusion model named BlobGEN-Vid that allows users to control object motions and fine-grained object appearance.
We show that our framework is model-agnostic and build BlobGEN-Vid based on both U-Net and DiT-based video diffusion models.
arXiv Detail & Related papers (2025-01-13T19:17:06Z)
- Mimir: Improving Video Diffusion Models for Precise Text Understanding [53.72393225042688]
Text serves as the key control signal in video generation due to its narrative nature.
The recent success of large language models (LLMs) showcases the power of decoder-only transformers.
This work addresses this challenge with Mimir, an end-to-end training framework featuring a carefully tailored token fuser.
arXiv Detail & Related papers (2024-12-04T07:26:44Z)
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", tailored with dedicatedly designed components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
- Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation [115.09597127418452]
Latent-Shift is an efficient text-to-video generation method based on a pretrained text-to-image generation model.
We show that Latent-Shift achieves comparable or better results while being significantly more efficient.
arXiv Detail & Related papers (2023-04-17T17:57:06Z)
- Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z)
- All in One: Exploring Unified Video-Language Pre-training [44.22059872694995]
We introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations.
The code and pretrained model have been released in https://github.com/showlab/all-in-one.
arXiv Detail & Related papers (2022-03-14T17:06:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.