Fine-grained Controllable Video Generation via Object Appearance and Context
- URL: http://arxiv.org/abs/2312.02919v1
- Date: Tue, 5 Dec 2023 17:47:33 GMT
- Title: Fine-grained Controllable Video Generation via Object Appearance and Context
- Authors: Hsin-Ping Huang, Yu-Chuan Su, Deqing Sun, Lu Jiang, Xuhui Jia, Yukun Zhu, Ming-Hsuan Yang
- Abstract summary: We propose fine-grained controllable video generation (FACTOR) to achieve detailed control.
FACTOR aims to control objects' appearances and context, including their location and category.
Our method achieves controllability of object appearances without finetuning, which reduces the per-subject optimization efforts for the users.
- Score: 74.23066823064575
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-video generation has shown promising results. However, by taking only
natural languages as input, users often face difficulties in providing detailed
information to precisely control the model's output. In this work, we propose
fine-grained controllable video generation (FACTOR) to achieve detailed
control. Specifically, FACTOR aims to control objects' appearances and context,
including their location and category, in conjunction with the text prompt. To
achieve detailed control, we propose a unified framework to jointly inject
control signals into the existing text-to-video model. Our model consists of a
joint encoder and adaptive cross-attention layers. By optimizing the encoder
and the inserted layer, we adapt the model to generate videos that are aligned
with both text prompts and fine-grained control. Compared to existing methods
relying on dense control signals such as edge maps, we provide a more intuitive
and user-friendly interface to allow object-level fine-grained control. Our
method achieves controllability of object appearances without finetuning, which
reduces the per-subject optimization efforts for the users. Extensive
experiments on standard benchmark datasets and user-provided inputs validate
that our model obtains a 70% improvement in controllability metrics over
competitive baselines.
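To make the architecture described in the abstract more concrete, here is a minimal PyTorch-style sketch of such a control path: a joint encoder fuses the text prompt with per-object control tokens, and a newly inserted adaptive cross-attention layer injects the result into features of a frozen text-to-video backbone. This is only an illustration under assumed names and shapes (ControlEncoder, AdaptiveCrossAttention, the zero-initialized gate); it is not the paper's actual implementation.

```python
# Hypothetical sketch of a FACTOR-style control path: a joint encoder fuses
# per-object control tokens (appearance/location/category) with the text
# prompt, and an adaptive cross-attention layer injects the result into a
# frozen text-to-video backbone. Names and shapes are illustrative only.
import torch
import torch.nn as nn


class ControlEncoder(nn.Module):
    """Jointly encodes text tokens and per-object control tokens."""

    def __init__(self, d_model=512, n_layers=2, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, text_tokens, object_tokens):
        # Concatenate prompt tokens and object-level control tokens and encode
        # them jointly so the two kinds of conditioning can attend to each other.
        joint = torch.cat([text_tokens, object_tokens], dim=1)
        return self.encoder(joint)                  # (B, L_text + L_obj, D)


class AdaptiveCrossAttention(nn.Module):
    """Inserted layer letting backbone features attend to control tokens."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Zero-initialized gate so the frozen backbone's behavior is unchanged
        # at the start of training (a common adapter trick, assumed here).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden, control_tokens):
        out, _ = self.attn(hidden, control_tokens, control_tokens)
        return hidden + self.gate * out


# Usage: only the encoder and the inserted layers would be optimized,
# while the pretrained text-to-video model stays frozen.
encoder = ControlEncoder()
adapter = AdaptiveCrossAttention()
text = torch.randn(1, 77, 512)      # prompt embeddings from a text encoder
objects = torch.randn(1, 8, 512)    # per-object appearance/location/category tokens
hidden = torch.randn(1, 1024, 512)  # spatio-temporal features from the frozen backbone
control = encoder(text, objects)
hidden = adapter(hidden, control)
```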
Related papers
- BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations [82.94002870060045]
Existing video generation models struggle to follow complex text prompts and synthesize multiple objects.
We develop a blob-grounded video diffusion model named BlobGEN-Vid that allows users to control object motions and fine-grained object appearance.
We show that our framework is model-agnostic and build BlobGEN-Vid based on both U-Net and DiT-based video diffusion models.
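One plausible way to picture the blob video representation mentioned above is as a per-object track of simple region parameters plus a text description. The sketch below is an assumption for illustration, not BlobGEN-Vid's actual interface.

```python
# Hypothetical illustration of a "blob" video representation: each object is
# tracked as a per-frame ellipse-like region plus a short text description.
# Field names are assumptions for illustration, not BlobGEN-Vid's actual API.
from dataclasses import dataclass
from typing import List


@dataclass
class BlobFrame:
    cx: float      # region center (normalized image coordinates)
    cy: float
    w: float       # region extent (normalized width/height)
    h: float
    theta: float   # orientation in radians


@dataclass
class BlobTrack:
    description: str          # free-form text describing the object's appearance
    frames: List[BlobFrame]   # one entry per video frame, encoding its motion


# A user could then specify object motion by editing per-frame parameters,
# e.g. a "red car" blob sliding from left to right across 16 frames.
car = BlobTrack(
    description="a red car",
    frames=[BlobFrame(cx=0.1 + 0.05 * t, cy=0.6, w=0.2, h=0.1, theta=0.0)
            for t in range(16)],
)
```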
arXiv Detail & Related papers (2025-01-13T19:17:06Z)
- DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation [63.63429658282696]
We propose DynamicControl, which supports dynamic combinations of diverse control signals.
We show that DynamicControl is superior to existing methods in terms of controllability, generation quality and composability under various conditional controls.
arXiv Detail & Related papers (2024-12-04T11:54:57Z)
- I2VControl: Disentangled and Unified Video Motion Synthesis Control [11.83645633418189]
We present a disentangled and unified framework, namely I2VControl, that unifies multiple motion control tasks in image-to-video synthesis.
Our approach partitions the video into individual motion units and represents each unit with disentangled control signals.
Our methodology seamlessly integrates as a plug-in for pre-trained models and remains agnostic to specific model architectures.
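A rough sketch of the motion-unit idea described above, assuming each unit pairs a spatial region with its own disentangled control signal (here a point trajectory); all names are illustrative and not taken from I2VControl.

```python
# Hypothetical sketch of "motion units": the video is partitioned into regions,
# and each region carries its own control signal (e.g. a point trajectory),
# independent of the others. Names are assumptions, not I2VControl's interface.
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class MotionUnit:
    mask: np.ndarray                       # (H, W) binary region in the first frame
    trajectory: List[Tuple[float, float]]  # per-frame (x, y) control for this unit


def build_units(height: int, width: int) -> List[MotionUnit]:
    """Toy partition: left half follows one trajectory, right half stays put."""
    left = np.zeros((height, width), dtype=bool)
    left[:, : width // 2] = True
    right = ~left
    return [
        MotionUnit(mask=left, trajectory=[(0.25, 0.5 + 0.02 * t) for t in range(16)]),
        MotionUnit(mask=right, trajectory=[(0.75, 0.5) for _ in range(16)]),
    ]
```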
arXiv Detail & Related papers (2024-11-26T04:21:22Z)
- DiVE: DiT-based Video Generation with Enhanced Control [23.63288169762629]
We propose the first DiT-based framework specifically designed for generating temporally and multi-view consistent videos.
Specifically, the proposed framework leverages a parameter-free spatial view-inflated attention mechanism to guarantee cross-view consistency.
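A minimal sketch of what a parameter-free, view-inflated spatial attention could look like, assuming it simply flattens the view axis into the token axis before reusing an existing attention layer; the function and shapes below are assumptions, not DiVE's actual code.

```python
# Hypothetical sketch of "view-inflated" attention: tokens from all camera
# views are flattened into one sequence before an existing self-attention is
# applied, so cross-view consistency is encouraged without new parameters.
import torch
import torch.nn as nn


def view_inflated_attention(attn: nn.MultiheadAttention, x: torch.Tensor) -> torch.Tensor:
    """x: (B, V, N, C) features for V views with N spatial tokens each."""
    B, V, N, C = x.shape
    tokens = x.reshape(B, V * N, C)        # merge the view axis into the token axis
    out, _ = attn(tokens, tokens, tokens)  # reuse the existing attention weights
    return out.reshape(B, V, N, C)


# Usage with a stand-in attention module:
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(2, 6, 256, 64)             # 2 samples, 6 views, 256 tokens, 64 channels
y = view_inflated_attention(attn, x)
```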
arXiv Detail & Related papers (2024-09-03T04:29:59Z)
- EasyControl: Transfer ControlNet to Video Diffusion for Controllable Generation and Interpolation [73.80275802696815]
We propose a universal framework called EasyControl for video generation.
Our method enables users to control video generation with a single condition map.
Our model demonstrates powerful image retention ability, achieving strong FVD and IS scores on UCF101 and MSR-VTT.
arXiv Detail & Related papers (2024-08-23T11:48:29Z)
- PerLDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Models [55.080748327139176]
We introduce PerLDiff, a method for effective street view image generation that fully leverages perspective 3D geometric information.
Our results show that PerLDiff markedly improves generation precision on the NuScenes and KITTI datasets.
arXiv Detail & Related papers (2024-07-08T16:46:47Z)
- ECNet: Effective Controllable Text-to-Image Diffusion Models [31.21525123716149]
We introduce two innovative solutions for conditional text-to-image models.
Firstly, we propose Spatial Guidance (SGI), which enhances conditional detail by encoding text inputs with precise annotation information.
Secondly, to overcome the issue of limited conditional supervision, we introduce Diffusion Consistency Loss.
This encourages consistency between the latent code at each time step and the input signal, thereby enhancing the robustness and accuracy of the output.
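One way to read the described consistency objective is as a DDPM-style estimate of the clean latent at each timestep being compared against an encoding of the input condition; the sketch below is an assumption, and the exact formulation in ECNet may differ.

```python
# Hypothetical sketch of a diffusion consistency-style loss: reconstruct an
# estimate of the clean latent from the model's noise prediction at timestep t
# and penalize its disagreement with (an encoding of) the input condition.
import torch


def consistency_loss(z_t, eps_pred, alpha_bar_t, cond_latent):
    """Standard DDPM estimate of z_0 from z_t, compared against the condition."""
    z0_hat = (z_t - torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)
    return torch.mean((z0_hat - cond_latent) ** 2)


# Toy usage with random tensors:
z_t = torch.randn(4, 4, 32, 32)
eps = torch.randn_like(z_t)
cond = torch.randn_like(z_t)
loss = consistency_loss(z_t, eps, alpha_bar_t=torch.tensor(0.7), cond_latent=cond)
```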
arXiv Detail & Related papers (2024-03-27T10:09:38Z)
- LiFi: Lightweight Controlled Text Generation with Fine-Grained Control Codes [46.74968005604948]
We present LIFI, which offers a lightweight approach with fine-grained control for controlled text generation.
We evaluate LIFI on two conventional tasks -- sentiment control and topic control -- and one newly proposed task -- stylistic novel writing.
arXiv Detail & Related papers (2024-02-10T11:53:48Z)
- Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation [79.8881514424969]
Text-conditional diffusion models are able to generate high-fidelity images with diverse contents.
However, linguistic representations often describe the envisioned imagery only ambiguously.
We propose Cocktail, a pipeline to mix various modalities into one embedding.
arXiv Detail & Related papers (2023-06-01T17:55:32Z)