DivControl: Knowledge Diversion for Controllable Image Generation
- URL: http://arxiv.org/abs/2507.23620v1
- Date: Thu, 31 Jul 2025 15:00:15 GMT
- Title: DivControl: Knowledge Diversion for Controllable Image Generation
- Authors: Yucheng Xie, Fu Feng, Ruixiao Shi, Jing Wang, Yong Rui, Xin Geng
- Abstract summary: DivControl is a decomposable pretraining framework for unified controllable generation. It achieves state-of-the-art controllability with 36.4$\times$ less training cost. It also delivers strong zero-shot and few-shot performance on unseen conditions.
- Score: 38.166949036830886
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models have advanced from text-to-image (T2I) to image-to-image (I2I) generation by incorporating structured inputs such as depth maps, enabling fine-grained spatial control. However, existing methods either train separate models for each condition or rely on unified architectures with entangled representations, resulting in poor generalization and high adaptation costs for novel conditions. To this end, we propose DivControl, a decomposable pretraining framework for unified controllable generation and efficient adaptation. DivControl factorizes ControlNet via SVD into basic components (pairs of singular vectors), which are disentangled into condition-agnostic learngenes and condition-specific tailors through knowledge diversion during multi-condition training. Knowledge diversion is implemented via a dynamic gate that performs soft routing over tailors based on the semantics of condition instructions, enabling zero-shot generalization and parameter-efficient adaptation to novel conditions. To further improve condition fidelity and training efficiency, we introduce a representation alignment loss that aligns condition embeddings with early diffusion features. Extensive experiments demonstrate that DivControl achieves state-of-the-art controllability with 36.4$\times$ less training cost, while simultaneously improving average performance on basic conditions. It also delivers strong zero-shot and few-shot performance on unseen conditions, demonstrating superior scalability, modularity, and transferability.
Related papers
- A Practical Investigation of Spatially-Controlled Image Generation with Transformers [16.682348277650817]
We aim to provide clear takeaways across generation paradigms for practitioners wishing to develop systems for spatially-controlled generation. We perform controlled experiments on ImageNet across diffusion-based/flow-based and autoregressive (AR) models.
arXiv Detail & Related papers (2025-07-21T15:33:49Z) - RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation [16.038598998902767]
Text-to-image (T2I) diffusion models have shown remarkable success in generating high-quality images from text prompts. We propose a flexible feature injection framework that decouples the injection timestep from the denoising process. Our approach achieves state-of-the-art performance across diverse zero-shot conditioning scenarios.
arXiv Detail & Related papers (2025-07-03T16:56:15Z) - EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer [15.879712910520801]
We propose EasyControl, a novel framework designed to unify condition-guided diffusion transformers with high efficiency and flexibility. Our framework is built on three key innovations. First, we introduce a lightweight Condition Injection LoRA Module. Second, we propose a Position-Aware Training Paradigm. This approach standardizes input conditions to fixed resolutions, allowing the generation of images with arbitrary aspect ratios and flexible resolutions. Third, we develop a Causal Attention Mechanism combined with the KV Cache technique, adapted for conditional generation tasks.
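The "Condition Injection LoRA" idea in this summary can be read as a low-rank update applied only on the condition-token path while the base projection stays frozen. The sketch below assumes hypothetical dimensions and a zero-initialized up-projection; it illustrates the general LoRA pattern, not EasyControl's actual code.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 64, 8                             # hidden size and LoRA rank (illustrative)

W = rng.standard_normal((d, d))          # frozen base projection in the block
A = rng.standard_normal((r, d)) * 0.01   # LoRA down-projection (trainable)
B = np.zeros((d, r))                     # LoRA up-projection, zero-initialized

def project(x, is_condition):
    """Base path for image tokens; base + low-rank update for condition tokens."""
    out = x @ W.T
    if is_condition:
        out = out + (x @ A.T) @ B.T      # rank-r conditional correction
    return out

cond_tokens = rng.standard_normal((16, d))
img_tokens = rng.standard_normal((256, d))
# With B zero-initialized, the condition path starts identical to the base model,
# so training begins from the pretrained behavior.
assert np.allclose(project(cond_tokens, True), cond_tokens @ W.T)
```

Zero-initializing one of the two LoRA factors is the standard trick that makes the adapter a no-op at the start of training.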
arXiv Detail & Related papers (2025-03-10T08:07:17Z) - DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation [63.63429658282696]
We propose DynamicControl, which supports dynamic combinations of diverse control signals. We show that DynamicControl is superior to existing methods in terms of controllability, generation quality and composability under various conditional controls.
arXiv Detail & Related papers (2024-12-04T11:54:57Z) - OminiControl: Minimal and Universal Control for Diffusion Transformer [68.3243031301164]
We present OminiControl, a novel approach that rethinks how image conditions are integrated into Diffusion Transformer (DiT) architectures. OminiControl addresses these limitations through three key innovations.
arXiv Detail & Related papers (2024-11-22T17:55:15Z) - EasyControl: Transfer ControlNet to Video Diffusion for Controllable Generation and Interpolation [73.80275802696815]
We propose a universal framework called EasyControl for video generation.
Our method enables users to control video generation with a single condition map.
Our model demonstrates powerful image retention ability, resulting in strong FVD and IS scores on UCF101 and MSR-VTT.
arXiv Detail & Related papers (2024-08-23T11:48:29Z) - ControlVAR: Exploring Controllable Visual Autoregressive Modeling [48.66209303617063]
Conditional visual generation has witnessed remarkable progress with the advent of diffusion models (DMs).
Challenges such as expensive computational cost, high inference latency, and difficulties of integration with large language models (LLMs) have necessitated exploring alternatives to DMs.
This paper introduces ControlVAR, a novel framework that explores pixel-level controls in visual autoregressive modeling for flexible and efficient conditional generation.
arXiv Detail & Related papers (2024-06-14T06:35:33Z) - Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control [73.6361029556484]
Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs.
We consider pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts.
We show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
arXiv Detail & Related papers (2024-05-09T15:39:54Z) - ECNet: Effective Controllable Text-to-Image Diffusion Models [31.21525123716149]
We introduce two innovative solutions for conditional text-to-image models.
Firstly, we propose a Spatial Guidance Injector (SGI), which enhances conditional detail by encoding text inputs with precise annotation information.
Secondly, to overcome the issue of limited conditional supervision, we introduce Diffusion Consistency Loss.
This encourages consistency between the latent code at each time step and the input signal, thereby enhancing the robustness and accuracy of the output.
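One plausible reading of the Diffusion Consistency Loss described above is a per-timestep penalty tying the latent code to the encoded input signal. The sketch below is an interpretation under assumed shapes (a stack of per-step latents and a single condition embedding), not ECNet's actual formulation.

```python
import numpy as np

rng = np.random.default_rng(2)

def diffusion_consistency_loss(latents, cond_signal):
    """Mean-squared penalty between each timestep's latent and the condition.

    latents:     (T, C) latent codes across T denoising steps (illustrative)
    cond_signal: (C,) encoded control input, e.g. a pose or edge embedding
    """
    diffs = latents - cond_signal          # broadcasts over the T timesteps
    return float(np.mean(diffs ** 2))

latents = rng.standard_normal((10, 64))
cond = rng.standard_normal(64)
loss = diffusion_consistency_loss(latents, cond)
assert loss >= 0.0
# The loss vanishes exactly when every step's latent matches the signal.
assert diffusion_consistency_loss(np.tile(cond, (10, 1)), cond) == 0.0
```

Applying the penalty at every timestep, rather than only at the final step, is what gives the supervision signal the "limited conditional supervision" fix the summary mentions.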
arXiv Detail & Related papers (2024-03-27T10:09:38Z) - CCM: Adding Conditional Controls to Text-to-Image Consistency Models [89.75377958996305]
We consider alternative strategies for adding ControlNet-like conditional control to Consistency Models.
A lightweight adapter can be jointly optimized under multiple conditions through Consistency Training.
We study these three solutions across various conditional controls, including edge, depth, human pose, low-resolution image and masked image.
arXiv Detail & Related papers (2023-12-12T04:16:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.