Related papers: SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions

URL: http://arxiv.org/abs/2404.06451v1
Date: Tue, 9 Apr 2024 16:53:43 GMT
Title: SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions
Authors: Xiaoyu Liu, Yuxiang Wei, Ming Liu, Xianhui Lin, Peiran Ren, Xuansong Xie, Wangmeng Zuo
Abstract summary: We present SmartControl, a novel T2I generation method that modifies rough visual conditions to adapt them to the text prompt. The key idea of SmartControl is to relax the visual condition in the areas that conflict with the text prompt. Experiments on four typical visual condition types show the efficacy of SmartControl against state-of-the-art methods.
Score: 59.53867290769282
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Human visual imagination usually begins with analogies or rough sketches. For example, given an image of a girl playing guitar in front of a building, one may analogously imagine how it would look if Iron Man were playing guitar in front of a pyramid in Egypt. Nonetheless, the visual condition may not be precisely aligned with the imaginary result indicated by the text prompt, and existing layout-controllable text-to-image (T2I) generation models are prone to producing degraded results with obvious artifacts. To address this issue, we present a novel T2I generation method dubbed SmartControl, which is designed to modify rough visual conditions to adapt them to the text prompt. The key idea of SmartControl is to relax the visual condition in the areas that conflict with the text prompt. Specifically, a Control Scale Predictor (CSP) is designed to identify the conflict regions and predict the local control scales, while a dataset with text prompts and rough visual conditions is constructed for training the CSP. It is worth noting that, even with a limited number (e.g., 1,000 to 2,000) of training samples, SmartControl can generalize well to unseen objects. Extensive experiments on four typical visual condition types clearly show the efficacy of SmartControl against state-of-the-art methods. Source code, pre-trained models, and datasets are available at https://github.com/liuxiaoyu1104/SmartControl.
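
The idea of local control scales can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the predicted scales are applied, so the sketch assumes a ControlNet-style setup in which a control residual is added to backbone features, and it weights that residual per location. The function name `apply_local_control` and the toy 2x2 grids are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def apply_local_control(base: Grid, control: Grid, scale: Grid) -> Grid:
    """Add a control residual to base features, weighted per location.

    Assumption for illustration: scale[i][j] lies in [0, 1], where 1 keeps
    the visual condition fully and 0 ignores it at that location.
    """
    if not (len(base) == len(control) == len(scale)):
        raise ValueError("all grids must have the same height")
    out: Grid = []
    for b_row, c_row, s_row in zip(base, control, scale):
        if not (len(b_row) == len(c_row) == len(s_row)):
            raise ValueError("all grids must have the same width")
        # Per-location blend: base feature plus scaled control residual.
        out.append([b + s * c for b, c, s in zip(b_row, c_row, s_row)])
    return out


# Toy example: the bottom row stands in for a region that conflicts with
# the text prompt, so its control scale is lowered there.
base = [[1.0, 1.0], [1.0, 1.0]]
control = [[2.0, 2.0], [2.0, 2.0]]
scale = [[1.0, 1.0], [0.0, 0.5]]
result = apply_local_control(base, control, scale)
print(result)  # [[3.0, 3.0], [1.0, 2.0]]
```

With a scale of 1 everywhere, the sketch reduces to adding the control residual uniformly; lowering the scale in selected regions is what "relaxing the visual condition" amounts to in this simplified picture.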

Related papers

VersaGen: Unleashing Versatile Visual Control for Text-to-Image Synthesis [59.12590059101254]
We present VersaGen, a generative AI agent that enables versatile visual control in text-to-image (T2I) synthesis. We train an adaptor on top of a frozen T2I model to incorporate the visual information into the text-dominated diffusion process.
arXiv Detail & Related papers (2024-12-16T09:32:23Z)
CtrLoRA: An Extensible and Efficient Framework for Controllable Image Generation [69.43106794519193]
We propose the CtrLoRA framework, which trains a Base ControlNet to learn the common knowledge of image-to-image generation from multiple base conditions. Our framework reduces the learnable parameters by 90% compared to ControlNet, significantly lowering the threshold to distribute and deploy the model weights.
arXiv Detail & Related papers (2024-10-12T07:04:32Z)
PreciseControl: Enhancing Text-To-Image Diffusion Models with Fine-Grained Attribute Control [24.569528214869113]
StyleGAN models learn a rich face prior and enable smooth control towards fine-grained attribute editing by latent manipulation. This work uses the disentangled $\mathcal{W}+$ space of StyleGANs to condition the T2I model. We perform extensive experiments to validate our method for face personalization and fine-grained attribute editing.
arXiv Detail & Related papers (2024-07-24T07:10:25Z)
AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation [24.07613591217345]
Linguistic control enables effective content creation, but struggles with fine-grained control over image generation. AnyControl develops a novel Multi-Control framework that extracts a unified multi-modal embedding to guide the generation process. This approach enables a holistic understanding of user inputs, and produces high-quality, faithful results under versatile control signals.
arXiv Detail & Related papers (2024-06-27T07:40:59Z)
Enhancing Prompt Following with Visual Control Through Training-Free Mask-Guided Diffusion [27.61734719689046]
We propose a training-free approach named Mask-guided Prompt Following (MGPF) to enhance prompt following with visual control. The efficacy and superiority of MGPF are validated through comprehensive quantitative and qualitative experiments.
arXiv Detail & Related papers (2024-04-23T06:10:43Z)
FineControlNet: Fine-level Text Control for Image Generation with Spatially Aligned Text Control Injection [28.65209293141492]
FineControlNet provides fine control over each instance's appearance while maintaining the precise pose control capability. FineControlNet achieves superior performance in generating images that follow the user-provided instance-specific text prompts and poses.
arXiv Detail & Related papers (2023-12-14T18:59:43Z)
Fine-grained Controllable Video Generation via Object Appearance and Context [74.23066823064575]
We propose fine-grained controllable video generation (FACTOR) to achieve detailed control. FACTOR aims to control objects' appearances and context, including their location and category. Our method achieves controllability of object appearances without finetuning, which reduces the per-subject optimization efforts for the users.
arXiv Detail & Related papers (2023-12-05T17:47:33Z)
AutoStory: Generating Diverse Storytelling Images with Minimal Human Effort [55.83007338095763]
We propose an automated story visualization system that can effectively generate diverse, high-quality, and consistent sets of story images. We utilize the comprehension and planning capabilities of large language models for layout planning, and then leverage large-scale text-to-image models to generate sophisticated story images.
arXiv Detail & Related papers (2023-11-19T06:07:37Z)
Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models [82.19740045010435]
We introduce Uni-ControlNet, a unified framework that allows for the simultaneous utilization of different local controls and global controls. Unlike existing methods, Uni-ControlNet only requires the fine-tuning of two additional adapters upon frozen pre-trained text-to-image diffusion models. Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability.
arXiv Detail & Related papers (2023-05-25T17:59:58Z)
UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild [166.25327094261038]
We introduce UniControl, a new generative foundation model for controllable condition-to-image (C2I) tasks. UniControl consolidates a wide array of C2I tasks within a singular framework, while still allowing for arbitrary language prompts. Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities.
arXiv Detail & Related papers (2023-05-18T17:41:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.