Related papers: OmniControlNet: Dual-stage Integration for Conditional Image Generation

OmniControlNet: Dual-stage Integration for Conditional Image Generation

URL: http://arxiv.org/abs/2406.05871v1
Date: Sun, 9 Jun 2024 18:03:47 GMT
Title: OmniControlNet: Dual-stage Integration for Conditional Image Generation
Authors: Yilin Wang, Haiyang Xu, Xiang Zhang, Zeyuan Chen, Zhizhou Sha, Zirui Wang, Zhuowen Tu,
Abstract summary: We provide a two-way integration for the widely adopted ControlNet by integrating external condition generation algorithms into a single dense prediction method. Our proposed OmniControlNet consolidates 1) the condition generation by a single multi-tasking dense prediction algorithm under the task embedding guidance and 2) the image generation process for different conditioning types under the textual embedding guidance.
Score: 61.1432268643639
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We provide a two-way integration for the widely adopted ControlNet by integrating external condition generation algorithms into a single dense prediction method and incorporating its individually trained image generation processes into a single model. Despite its tremendous success, the ControlNet of a two-stage pipeline bears limitations in being not self-contained (e.g. calls the external condition generation algorithms) with a large model redundancy (separately trained models for different types of conditioning inputs). Our proposed OmniControlNet consolidates 1) the condition generation (e.g., HED edges, depth maps, user scribble, and animal pose) by a single multi-tasking dense prediction algorithm under the task embedding guidance and 2) the image generation process for different conditioning types under the textual embedding guidance. OmniControlNet achieves significantly reduced model complexity and redundancy while capable of producing images of comparable quality for conditioned text-to-image generation.

Related papers

Condition Weaving Meets Expert Modulation: Towards Universal and Controllable Image Generation [17.898556887669997]
We propose a Unified image-to-image Generation (UniGen) framework that supports diverse conditional inputs.<n>Condition Modulated Expert (CoMoE) module aggregates semantically similar patch features for visual representation and conditional modeling.<n>We also propose WeaveNet, a dynamic, snake-like connection mechanism that enables effective interaction between global text-level control from the backbone and fine-grained control from conditional branches.
arXiv Detail & Related papers (2025-08-24T13:47:10Z)
MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models [30.494968865008513]
Recent text-to-image models struggle with precise visual control, balancing multimodal inputs, and requiring extensive training for complex image generation.<n>We propose MENTOR, a novel framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation.<n>Our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods.
arXiv Detail & Related papers (2025-07-13T10:52:59Z)
Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think [38.258453761376586]
We propose Dream Engine, an efficient framework designed for arbitrary text-image interleaved control in image generation models. Our approach utilizes a two-stage training paradigm, consisting of joint text-image alignment and multimodal interleaved instruction tuning. Our experiments demonstrate that this training method is effective, achieving a 0.69 overall score on the GenEval benchmark.
arXiv Detail & Related papers (2025-02-27T15:08:39Z)
DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation [63.63429658282696]
We propose DynamicControl, which supports dynamic combinations of diverse control signals. We show that DynamicControl is superior to existing methods in terms of controllability, generation quality and composability under various conditional controls.
arXiv Detail & Related papers (2024-12-04T11:54:57Z)
One Diffusion to Generate Them All [54.82732533013014]
OneDiffusion is a versatile, large-scale diffusion model that supports bidirectional image synthesis and understanding. It enables conditional generation from inputs such as text, depth, pose, layout, and semantic maps. OneDiffusion allows for multi-view generation, camera pose estimation, and instant personalization using sequential image inputs.
arXiv Detail & Related papers (2024-11-25T12:11:05Z)
OminiControl: Minimal and Universal Control for Diffusion Transformer [68.3243031301164]
OminiControl is a framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models. At its core, OminiControl leverages a parameter reuse mechanism, enabling the DiT to encode image conditions using itself as a powerful backbone. OminiControl addresses a wide range of image conditioning tasks in a unified manner, including subject-driven generation and spatially-aligned conditions.
arXiv Detail & Related papers (2024-11-22T17:55:15Z)
A Simple Approach to Unifying Diffusion-based Conditional Generation [63.389616350290595]
We introduce a simple, unified framework to handle diverse conditional generation tasks. Our approach enables versatile capabilities via different inference-time sampling schemes. Our model supports additional capabilities like non-spatially aligned and coarse conditioning.
arXiv Detail & Related papers (2024-10-15T09:41:43Z)
AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation [24.07613591217345]
Linguistic control enables effective content creation, but struggles with fine-grained control over image generation. AnyControl develops a novel Multi-Control framework that extracts a unified multi-modal embedding to guide the generation process. This approach enables a holistic understanding of user inputs, and produces high-quality, faithful results under versatile control signals.
arXiv Detail & Related papers (2024-06-27T07:40:59Z)
Instruct-Imagen: Image Generation with Multi-modal Instruction [90.04481955523514]
instruct-imagen is a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision. Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain.
arXiv Detail & Related papers (2024-01-03T19:31:58Z)
Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models [82.19740045010435]
We introduce Uni-ControlNet, a unified framework that allows for the simultaneous utilization of different local controls and global controls. Unlike existing methods, Uni-ControlNet only requires the fine-tuning of two additional adapters upon frozen pre-trained text-to-image diffusion models. Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability.
arXiv Detail & Related papers (2023-05-25T17:59:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.