Fine-grained Controllable Video Generation via Object Appearance and Context
- URL: http://arxiv.org/abs/2312.02919v1
- Date: Tue, 5 Dec 2023 17:47:33 GMT
- Title: Fine-grained Controllable Video Generation via Object Appearance and Context
- Authors: Hsin-Ping Huang, Yu-Chuan Su, Deqing Sun, Lu Jiang, Xuhui Jia, Yukun Zhu, Ming-Hsuan Yang
- Abstract summary: We propose fine-grained controllable video generation (FACTOR) to achieve detailed control.
FACTOR aims to control objects' appearances and context, including their location and category.
Our method achieves controllability of object appearances without finetuning, which reduces the per-subject optimization efforts for the users.
- Score: 74.23066823064575
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-video generation has shown promising results. However, by taking only
natural languages as input, users often face difficulties in providing detailed
information to precisely control the model's output. In this work, we propose
fine-grained controllable video generation (FACTOR) to achieve detailed
control. Specifically, FACTOR aims to control objects' appearances and context,
including their location and category, in conjunction with the text prompt. To
achieve detailed control, we propose a unified framework to jointly inject
control signals into the existing text-to-video model. Our model consists of a
joint encoder and adaptive cross-attention layers. By optimizing the encoder
and the inserted layer, we adapt the model to generate videos that are aligned
with both text prompts and fine-grained control. Compared to existing methods
relying on dense control signals such as edge maps, we provide a more intuitive
and user-friendly interface to allow object-level fine-grained control. Our
method achieves controllability of object appearances without finetuning, which
reduces the per-subject optimization efforts for the users. Extensive
experiments on standard benchmark datasets and user-provided inputs validate
that our model obtains a 70% improvement in controllability metrics over
competitive baselines.
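The abstract's "joint encoder and adaptive cross-attention layers" can be pictured with a minimal sketch. Everything below (the shapes, the residual gate, and the concatenation of text and control tokens into one conditioning sequence) is an illustrative assumption, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_cross_attention(video_tokens, text_tokens, control_tokens,
                             Wq, Wk, Wv, gate):
    # Joint conditioning: text tokens and fine-grained control tokens
    # (appearance / location / category embeddings) form one sequence.
    cond = np.concatenate([text_tokens, control_tokens], axis=0)
    q = video_tokens @ Wq
    k = cond @ Wk
    v = cond @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    # Gated residual: an inserted layer can start near-identity (gate ~ 0)
    # so the pre-trained text-to-video backbone is not disturbed initially.
    return video_tokens + gate * (attn @ v)
```

With `gate=0` the layer is an exact identity, which is one common way to inject new conditioning layers into a frozen generative backbone without degrading it at initialization.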
Related papers
- Compass Control: Multi Object Orientation Control for Text-to-Image Generation [24.4172525865806]
Existing approaches for controlling text-to-image diffusion models, while powerful, do not allow for explicit 3D object-centric control.
We address the problem of multi-object orientation control in text-to-image diffusion models.
This enables the generation of diverse multi-object scenes with precise orientation control for each object.
arXiv Detail & Related papers (2025-04-09T10:15:15Z)
- Enabling Versatile Controls for Video Diffusion Models [18.131652071161266]
VCtrl is a novel framework designed to enable fine control over pre-trained video diffusion models.
Comprehensive experiments and human evaluations demonstrate VCtrl effectively enhances controllability and generation quality.
arXiv Detail & Related papers (2025-03-21T09:48:00Z)
- BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations [82.94002870060045]
Existing video generation models struggle to follow complex text prompts and synthesize multiple objects.
We develop a blob-grounded video diffusion model named BlobGEN-Vid that allows users to control object motions and fine-grained object appearance.
We show that our framework is model-agnostic and build BlobGEN-Vid based on both U-Net and DiT-based video diffusion models.
arXiv Detail & Related papers (2025-01-13T19:17:06Z)
- DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation [63.63429658282696]
We propose DynamicControl, which supports dynamic combinations of diverse control signals.
We show that DynamicControl is superior to existing methods in terms of controllability, generation quality and composability under various conditional controls.
arXiv Detail & Related papers (2024-12-04T11:54:57Z)
- I2VControl: Disentangled and Unified Video Motion Synthesis Control [11.83645633418189]
We present a disentangled and unified framework, namely I2VControl, that unifies multiple motion control tasks in image-to-video synthesis.
Our approach partitions the video into individual motion units and represents each unit with disentangled control signals.
Our methodology seamlessly integrates as a plug-in for pre-trained models and remains agnostic to specific model architectures.
arXiv Detail & Related papers (2024-11-26T04:21:22Z)
- Generating Compositional Scenes via Text-to-image RGBA Instance Generation [82.63805151691024]
Text-to-image diffusion generative models can generate high quality images at the cost of tedious prompt engineering.
We propose a novel multi-stage generation paradigm that is designed for fine-grained control, flexibility and interactivity.
Our experiments show that our RGBA diffusion model is capable of generating diverse and high quality instances with precise control over object attributes.
arXiv Detail & Related papers (2024-11-16T23:44:14Z)
- DiVE: DiT-based Video Generation with Enhanced Control [23.63288169762629]
We propose the first DiT-based framework specifically designed for generating temporally and multi-view consistent videos.
Specifically, the proposed framework leverages a parameter-free spatial view-inflated attention mechanism to guarantee cross-view consistency.
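One common way to realize a "parameter-free view-inflated attention" is to fold the view axis into the spatial token axis, so the frozen spatial self-attention attends across all views at once without any new weights. The sketch below illustrates that idea under assumed tensor shapes; it is not DiVE's actual code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    # Plain single-head attention over one token sequence.
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    a = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return a @ v

def view_inflated_attention(x, Wq, Wk, Wv):
    # x: (views, tokens, dim). Merging the view axis into the token axis
    # lets the existing spatial attention span every view jointly,
    # introducing no additional parameters.
    V, N, D = x.shape
    flat = x.reshape(V * N, D)
    out = self_attention(flat, Wq, Wk, Wv)
    return out.reshape(V, N, D)
```

For a single view (`V=1`) this reduces exactly to the original spatial attention, which is why the pre-trained weights can be reused unchanged.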
arXiv Detail & Related papers (2024-09-03T04:29:59Z)
- EasyControl: Transfer ControlNet to Video Diffusion for Controllable Generation and Interpolation [73.80275802696815]
We propose a universal framework called EasyControl for video generation.
Our method enables users to control video generation with a single condition map.
Our model demonstrates powerful image retention ability, resulting in strong FVD and IS scores on UCF101 and MSR-VTT.
arXiv Detail & Related papers (2024-08-23T11:48:29Z)
- VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control [74.5434726968562]
We tame video diffusion transformers for 3D camera control using a ControlNet-like conditioning mechanism based on Plucker coordinates.
Our work is the first to enable camera control for transformer-based video diffusion models.
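Plucker coordinates describe each camera ray by its unit direction and its moment about the origin, giving a per-pixel embedding that is independent of where along the ray a point sits. The sketch below shows one way such embeddings might be computed; the conventions (world-to-camera extrinsics, pixel-center offsets) are assumptions for illustration, not VD3D's exact pipeline:

```python
import numpy as np

def plucker_embedding(K, R, t, H, W):
    # Per-pixel Plucker coordinates (d, m) of camera rays in world space.
    # d: unit ray direction; m = c x d: ray moment about the world origin,
    # where c = -R^T t is the camera center under x_cam = R x_world + t.
    K_inv = np.linalg.inv(K)
    c = -R.T @ t
    ys, xs = np.mgrid[0:H, 0:W]
    ones = np.ones_like(xs, dtype=float)
    pix = np.stack([xs + 0.5, ys + 0.5, ones], axis=-1)      # (H, W, 3)
    dirs = pix @ K_inv.T @ R                                  # R^T K^-1 p per pixel
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)      # unit directions
    moments = np.cross(np.broadcast_to(c, dirs.shape), dirs)  # m = c x d
    return np.concatenate([dirs, moments], axis=-1)           # (H, W, 6)
```

The resulting (H, W, 6) map can be fed to a conditioning branch the same way a spatial control map would be in a ControlNet-style design.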
arXiv Detail & Related papers (2024-07-17T17:59:05Z)
- PerlDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Models [55.080748327139176]
PerlDiff is a method for effective street view image generation that fully leverages perspective 3D geometric information.
Our results show that PerlDiff markedly enhances the precision of generation on the NuScenes and KITTI datasets.
arXiv Detail & Related papers (2024-07-08T16:46:47Z)
- ECNet: Effective Controllable Text-to-Image Diffusion Models [31.21525123716149]
We introduce two innovative solutions for conditional text-to-image models.
Firstly, we propose a Spatial Guidance Injector (SGI) which enhances conditional detail by encoding text inputs with precise annotation information.
Secondly, to overcome the issue of limited conditional supervision, we introduce Diffusion Consistency Loss.
This encourages consistency between the latent code at each time step and the input signal, thereby enhancing the robustness and accuracy of the output.
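The "consistency between the latent code at each time step and the input signal" can be read roughly as follows. This is a speculative sketch built on the standard DDPM clean-sample estimate, with `project` a hypothetical alignment head; it is not ECNet's published formulation:

```python
import numpy as np

def diffusion_consistency_loss(z_t, eps_pred, alpha_bar_t, cond_embed, project):
    # Recover the clean-latent estimate x0_hat from the noisy latent and the
    # predicted noise via the DDPM identity
    #   x0_hat = (z_t - sqrt(1 - a_bar_t) * eps) / sqrt(a_bar_t),
    # then penalize its distance to the encoded condition. `project` stands in
    # for whatever map aligns latent space with the condition embedding space.
    x0_hat = (z_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    diff = project(x0_hat) - cond_embed
    return float(np.mean(diff ** 2))
```

Because the loss is evaluated from the estimate at every timestep rather than only at the end of sampling, it supplies conditional supervision throughout the diffusion trajectory.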
arXiv Detail & Related papers (2024-03-27T10:09:38Z)
- LiFi: Lightweight Controlled Text Generation with Fine-Grained Control Codes [46.74968005604948]
We present LIFI, which offers a lightweight approach with fine-grained control for controlled text generation.
We evaluate LIFI on two conventional tasks -- sentiment control and topic control -- and one newly proposed task -- stylistic novel writing.
arXiv Detail & Related papers (2024-02-10T11:53:48Z)
- ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation [33.37279673304]
We introduce ConditionVideo, a training-free approach to text-to-video generation based on the provided condition, video, and input text.
ConditionVideo generates realistic dynamic videos from random noise or given scene videos.
Our method exhibits superior performance in terms of frame consistency, clip score, and conditional accuracy, outperforming other compared methods.
arXiv Detail & Related papers (2023-10-11T17:46:28Z)
- Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation [79.8881514424969]
Text-conditional diffusion models are able to generate high-fidelity images with diverse contents.
However, linguistic representations frequently exhibit ambiguous descriptions of the envisioned objective imagery.
We propose Cocktail, a pipeline to mix various modalities into one embedding.
arXiv Detail & Related papers (2023-06-01T17:55:32Z)
- Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models [0.0]
We propose a novel approach that combines zero-shot text-to-video generation with ControlNet to improve the output of these models.
Our method takes multiple sketched frames as input and generates video output that matches the flow of these frames.
arXiv Detail & Related papers (2023-05-10T02:33:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.