SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions
- URL: http://arxiv.org/abs/2404.06451v1
- Date: Tue, 9 Apr 2024 16:53:43 GMT
- Title: SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions
- Authors: Xiaoyu Liu, Yuxiang Wei, Ming Liu, Xianhui Lin, Peiran Ren, Xuansong Xie, Wangmeng Zuo,
- Abstract summary: We present a novel T2I generation method dubbed SmartControl to modify the rough visual conditions for adapting to text prompt.
The key idea of our SmartControl is to relax the visual condition on the areas that are conflicted with text prompts.
Experiments on four typical visual condition types clearly show the efficacy of our SmartControl against state-of-the-arts.
- Score: 59.53867290769282
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human visual imagination usually begins with analogies or rough sketches. For example, given an image with a girl playing guitar before a building, one may analogously imagine how it seems like if Iron Man playing guitar before Pyramid in Egypt. Nonetheless, visual condition may not be precisely aligned with the imaginary result indicated by text prompt, and existing layout-controllable text-to-image (T2I) generation models is prone to producing degraded generated results with obvious artifacts. To address this issue, we present a novel T2I generation method dubbed SmartControl, which is designed to modify the rough visual conditions for adapting to text prompt. The key idea of our SmartControl is to relax the visual condition on the areas that are conflicted with text prompts. In specific, a Control Scale Predictor (CSP) is designed to identify the conflict regions and predict the local control scales, while a dataset with text prompts and rough visual conditions is constructed for training CSP. It is worth noting that, even with a limited number (e.g., 1,000~2,000) of training samples, our SmartControl can generalize well to unseen objects. Extensive experiments on four typical visual condition types clearly show the efficacy of our SmartControl against state-of-the-arts. Source code, pre-trained models, and datasets are available at https://github.com/liuxiaoyu1104/SmartControl.
Related papers
- AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation [24.07613591217345]
Linguistic control enables effective content creation, but struggles with fine-grained control over image generation.
AnyControl develops a novel Multi-Control framework that extracts a unified multi-modal embedding to guide the generation process.
This approach enables a holistic understanding of user inputs, and produces high-quality, faithful results under versatile control signals.
arXiv Detail & Related papers (2024-06-27T07:40:59Z) - Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control [73.6361029556484]
Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs.
We consider pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts.
We show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
arXiv Detail & Related papers (2024-05-09T15:39:54Z) - Enhancing Prompt Following with Visual Control Through Training-Free Mask-Guided Diffusion [27.61734719689046]
We propose a training-free approach named Mask-guided Prompt Following (MGPF) to enhance prompt following with visual control.
The efficacy and superiority of MGPF are validated through comprehensive quantitative and qualitative experiments.
arXiv Detail & Related papers (2024-04-23T06:10:43Z) - FineControlNet: Fine-level Text Control for Image Generation with
Spatially Aligned Text Control Injection [28.65209293141492]
FineControlNet provides fine control over each instance's appearance while maintaining the precise pose control capability.
FineControlNet achieves superior performance in generating images that follow the user-provided instance-specific text prompts and poses.
arXiv Detail & Related papers (2023-12-14T18:59:43Z) - ControlNet-XS: Designing an Efficient and Effective Architecture for
Controlling Text-to-Image Diffusion Models [21.379896810560282]
A popular approach is to use a controlling network, such as ControlNet, in combination with a pre-trained image generation model, such as Stable Diffusion.
In this work we propose a new controlling architecture, called ControlNet-XS, which does not suffer from this problem.
In contrast to ControlNet, our model needs only a fraction of parameters, and hence is about twice as fast during inference and training time.
arXiv Detail & Related papers (2023-12-11T17:58:06Z) - Fine-grained Controllable Video Generation via Object Appearance and
Context [74.23066823064575]
We propose fine-grained controllable video generation (FACTOR) to achieve detailed control.
FACTOR aims to control objects' appearances and context, including their location and category.
Our method achieves controllability of object appearances without finetuning, which reduces the per-subject optimization efforts for the users.
arXiv Detail & Related papers (2023-12-05T17:47:33Z) - AutoStory: Generating Diverse Storytelling Images with Minimal Human
Effort [55.83007338095763]
We propose an automated story visualization system that can effectively generate diverse, high-quality, and consistent sets of story images.
We utilize the comprehension and planning capabilities of large language models for layout planning, and then leverage large-scale text-to-image models to generate sophisticated story images.
arXiv Detail & Related papers (2023-11-19T06:07:37Z) - Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models [82.19740045010435]
We introduce Uni-ControlNet, a unified framework that allows for the simultaneous utilization of different local controls and global controls.
Unlike existing methods, Uni-ControlNet only requires the fine-tuning of two additional adapters upon frozen pre-trained text-to-image diffusion models.
Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability.
arXiv Detail & Related papers (2023-05-25T17:59:58Z) - UniControl: A Unified Diffusion Model for Controllable Visual Generation
In the Wild [166.25327094261038]
We introduce UniControl, a new generative foundation model for controllable condition-to-image (C2I) tasks.
UniControl consolidates a wide array of C2I tasks within a singular framework, while still allowing for arbitrary language prompts.
trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities.
arXiv Detail & Related papers (2023-05-18T17:41:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.