Fine-grained Controllable Video Generation via Object Appearance and Context
- URL: http://arxiv.org/abs/2312.02919v1
- Date: Tue, 5 Dec 2023 17:47:33 GMT
- Title: Fine-grained Controllable Video Generation via Object Appearance and Context
- Authors: Hsin-Ping Huang, Yu-Chuan Su, Deqing Sun, Lu Jiang, Xuhui Jia, Yukun Zhu, Ming-Hsuan Yang
- Abstract summary: We propose fine-grained controllable video generation (FACTOR) to achieve detailed control.
FACTOR aims to control objects' appearances and context, including their location and category.
Our method achieves controllability of object appearances without finetuning, which reduces the per-subject optimization efforts for the users.
- Score: 74.23066823064575
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-video generation has shown promising results. However, by taking only
natural languages as input, users often face difficulties in providing detailed
information to precisely control the model's output. In this work, we propose
fine-grained controllable video generation (FACTOR) to achieve detailed
control. Specifically, FACTOR aims to control objects' appearances and context,
including their location and category, in conjunction with the text prompt. To
achieve detailed control, we propose a unified framework to jointly inject
control signals into the existing text-to-video model. Our model consists of a
joint encoder and adaptive cross-attention layers. By optimizing the encoder
and the inserted layer, we adapt the model to generate videos that are aligned
with both text prompts and fine-grained control. Compared to existing methods
relying on dense control signals such as edge maps, we provide a more intuitive
and user-friendly interface to allow object-level fine-grained control. Our
method achieves controllability of object appearances without finetuning, which
reduces the per-subject optimization efforts for the users. Extensive
experiments on standard benchmark datasets and user-provided inputs validate
that our model obtains a 70% improvement in controllability metrics over
competitive baselines.
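The abstract's "joint encoder and adaptive cross-attention layers" can be pictured with a minimal sketch. Everything below (the shapes, the residual gate, and the concatenation of text and control tokens into one conditioning sequence) is an illustrative assumption, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_cross_attention(video_tokens, text_tokens, control_tokens,
                             Wq, Wk, Wv, gate):
    # Joint conditioning: text tokens and fine-grained control tokens
    # (appearance / location / category embeddings) form one sequence.
    cond = np.concatenate([text_tokens, control_tokens], axis=0)
    q = video_tokens @ Wq
    k = cond @ Wk
    v = cond @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    # Gated residual: an inserted layer can start near-identity (gate ~ 0)
    # so the pre-trained text-to-video backbone is not disturbed initially.
    return video_tokens + gate * (attn @ v)
```

With `gate=0` the layer is an exact identity, which is one common way to inject new conditioning layers into a frozen generative backbone without degrading it at initialization.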
Related papers
- Compass Control: Multi Object Orientation Control for Text-to-Image Generation [24.4172525865806]
Existing approaches for controlling text-to-image diffusion models, while powerful, do not allow for explicit 3D object-centric control.
We address the problem of multi-object orientation control in text-to-image diffusion models.
This enables the generation of diverse multi-object scenes with precise orientation control for each object.
arXiv Detail & Related papers (2025-04-09T10:15:15Z)
- Enabling Versatile Controls for Video Diffusion Models [18.131652071161266]
VCtrl is a novel framework designed to enable fine control over pre-trained video diffusion models.
Comprehensive experiments and human evaluations demonstrate VCtrl effectively enhances controllability and generation quality.
arXiv Detail & Related papers (2025-03-21T09:48:00Z)
- BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations [82.94002870060045]
Existing video generation models struggle to follow complex text prompts and synthesize multiple objects.
We develop a blob-grounded video diffusion model named BlobGEN-Vid that allows users to control object motions and fine-grained object appearance.
We show that our framework is model-agnostic and build BlobGEN-Vid based on both U-Net and DiT-based video diffusion models.
arXiv Detail & Related papers (2025-01-13T19:17:06Z)
- DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation [63.63429658282696]
We propose DynamicControl, which supports dynamic combinations of diverse control signals.
We show that DynamicControl is superior to existing methods in terms of controllability, generation quality and composability under various conditional controls.
arXiv Detail & Related papers (2024-12-04T11:54:57Z)
- I2VControl: Disentangled and Unified Video Motion Synthesis Control [11.83645633418189]
We present a disentangled and unified framework, namely I2VControl, that unifies multiple motion control tasks in image-to-video synthesis.
Our approach partitions the video into individual motion units and represents each unit with disentangled control signals.
Our methodology seamlessly integrates as a plug-in for pre-trained models and remains agnostic to specific model architectures.
arXiv Detail & Related papers (2024-11-26T04:21:22Z)
- Generating Compositional Scenes via Text-to-image RGBA Instance Generation [82.63805151691024]
Text-to-image diffusion generative models can generate high quality images at the cost of tedious prompt engineering.
We propose a novel multi-stage generation paradigm that is designed for fine-grained control, flexibility and interactivity.
Our experiments show that our RGBA diffusion model is capable of generating diverse and high quality instances with precise control over object attributes.
arXiv Detail & Related papers (2024-11-16T23:44:14Z)
- DiVE: DiT-based Video Generation with Enhanced Control [23.63288169762629]
We propose the first DiT-based framework specifically designed for generating temporally and multi-view consistent videos.
Specifically, the proposed framework leverages a parameter-free spatial view-inflated attention mechanism to guarantee cross-view consistency.
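One common way to realize a "parameter-free view-inflated attention" is to fold the view axis into the spatial token axis, so the frozen spatial self-attention attends across all views at once without any new weights. The sketch below illustrates that idea under assumed tensor shapes; it is not DiVE's actual code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    # Plain single-head attention over one token sequence.
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    a = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return a @ v

def view_inflated_attention(x, Wq, Wk, Wv):
    # x: (views, tokens, dim). Merging the view axis into the token axis
    # lets the existing spatial attention span every view jointly,
    # introducing no additional parameters.
    V, N, D = x.shape
    flat = x.reshape(V * N, D)
    out = self_attention(flat, Wq, Wk, Wv)
    return out.reshape(V, N, D)
```

For a single view (`V=1`) this reduces exactly to the original spatial attention, which is why the pre-trained weights can be reused unchanged.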
arXiv Detail & Related papers (2024-09-03T04:29:59Z)
- EasyControl: Transfer ControlNet to Video Diffusion for Controllable Generation and Interpolation [73.80275802696815]
We propose a universal framework called EasyControl for video generation.
Our method enables users to control video generation with a single condition map.
Our model demonstrates powerful image retention ability, resulting in strong FVD and IS scores on UCF101 and MSR-VTT.
arXiv Detail & Related papers (2024-08-23T11:48:29Z)
- VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control [74.5434726968562]
We tame video diffusion transformers for 3D camera control using a ControlNet-like conditioning mechanism based on Plucker coordinates.
Our work is the first to enable camera control for transformer-based video diffusion models.
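Plucker coordinates describe each camera ray by its unit direction and its moment about the origin, giving a per-pixel embedding that is independent of where along the ray a point sits. The sketch below shows one way such embeddings might be computed; the conventions (world-to-camera extrinsics, pixel-center offsets) are assumptions for illustration, not VD3D's exact pipeline:

```python
import numpy as np

def plucker_embedding(K, R, t, H, W):
    # Per-pixel Plucker coordinates (d, m) of camera rays in world space.
    # d: unit ray direction; m = c x d: ray moment about the world origin,
    # where c = -R^T t is the camera center under x_cam = R x_world + t.
    K_inv = np.linalg.inv(K)
    c = -R.T @ t
    ys, xs = np.mgrid[0:H, 0:W]
    ones = np.ones_like(xs, dtype=float)
    pix = np.stack([xs + 0.5, ys + 0.5, ones], axis=-1)      # (H, W, 3)
    dirs = pix @ K_inv.T @ R                                  # R^T K^-1 p per pixel
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)      # unit directions
    moments = np.cross(np.broadcast_to(c, dirs.shape), dirs)  # m = c x d
    return np.concatenate([dirs, moments], axis=-1)           # (H, W, 6)
```

The resulting (H, W, 6) map can be fed to a conditioning branch the same way a spatial control map would be in a ControlNet-style design.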
arXiv Detail & Related papers (2024-07-17T17:59:05Z)
- PerlDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Models [55.080748327139176]
PerlDiff is a method for effective street view image generation that fully leverages perspective 3D geometric information.
Our results show that PerlDiff markedly enhances the precision of generation on the NuScenes and KITTI datasets.
arXiv Detail & Related papers (2024-07-08T16:46:47Z)
- ECNet: Effective Controllable Text-to-Image Diffusion Models [31.21525123716149]
We introduce two innovative solutions for conditional text-to-image models.
Firstly, we propose a Spatial Guidance Injector (SGI) which enhances conditional detail by encoding text inputs with precise annotation information.
Secondly, to overcome the issue of limited conditional supervision, we introduce Diffusion Consistency Loss.
This encourages consistency between the latent code at each time step and the input signal, thereby enhancing the robustness and accuracy of the output.
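The "consistency between the latent code at each time step and the input signal" can be read roughly as follows. This is a speculative sketch built on the standard DDPM clean-sample estimate, with `project` a hypothetical alignment head; it is not ECNet's published formulation:

```python
import numpy as np

def diffusion_consistency_loss(z_t, eps_pred, alpha_bar_t, cond_embed, project):
    # Recover the clean-latent estimate x0_hat from the noisy latent and the
    # predicted noise via the DDPM identity
    #   x0_hat = (z_t - sqrt(1 - a_bar_t) * eps) / sqrt(a_bar_t),
    # then penalize its distance to the encoded condition. `project` stands in
    # for whatever map aligns latent space with the condition embedding space.
    x0_hat = (z_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    diff = project(x0_hat) - cond_embed
    return float(np.mean(diff ** 2))
```

Because the loss is evaluated from the estimate at every timestep rather than only at the end of sampling, it supplies conditional supervision throughout the diffusion trajectory.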
arXiv Detail & Related papers (2024-03-27T10:09:38Z)
- LiFi: Lightweight Controlled Text Generation with Fine-Grained Control Codes [46.74968005604948]
We present LIFI, which offers a lightweight approach with fine-grained control for controlled text generation.
We evaluate LIFI on two conventional tasks -- sentiment control and topic control -- and one newly proposed task -- stylistic novel writing.
arXiv Detail & Related papers (2024-02-10T11:53:48Z)
- ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation [33.37279673304]
We introduce ConditionVideo, a training-free approach to text-to-video generation based on the provided condition, video, and input text.
ConditionVideo generates realistic dynamic videos from random noise or given scene videos.
Our method exhibits superior performance in terms of frame consistency, clip score, and conditional accuracy, outperforming other compared methods.
arXiv Detail & Related papers (2023-10-11T17:46:28Z)
- Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation [79.8881514424969]
Text-conditional diffusion models are able to generate high-fidelity images with diverse contents.
However, linguistic representations frequently exhibit ambiguous descriptions of the envisioned objective imagery.
We propose Cocktail, a pipeline to mix various modalities into one embedding.
arXiv Detail & Related papers (2023-06-01T17:55:32Z)
- Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models [0.0]
We propose a novel approach that combines zero-shot text-to-video generation with ControlNet to improve the output of these models.
Our method takes multiple sketched frames as input and generates video output that matches the flow of these frames.
arXiv Detail & Related papers (2023-05-10T02:33:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.