Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback
- URL: http://arxiv.org/abs/2507.02321v1
- Date: Thu, 03 Jul 2025 05:25:53 GMT
- Title: Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback
- Authors: Nina Konovalova, Maxim Nikolaev, Andrey Kuznetsov, Aibek Alanov
- Abstract summary: ControlNet addresses this by introducing an auxiliary conditioning module. ControlNet++ further refines alignment through a cycle consistency loss applied only to the final denoising steps. We propose InnerControl, a training strategy that enforces spatial consistency across all diffusion steps.
- Score: 1.7749342709605145
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite significant progress in text-to-image diffusion models, achieving precise spatial control over generated outputs remains challenging. ControlNet addresses this by introducing an auxiliary conditioning module, while ControlNet++ further refines alignment through a cycle consistency loss applied only to the final denoising steps. However, this approach neglects intermediate generation stages, limiting its effectiveness. We propose InnerControl, a training strategy that enforces spatial consistency across all diffusion steps. Our method trains lightweight convolutional probes to reconstruct input control signals (e.g., edges, depth) from intermediate UNet features at every denoising step. These probes efficiently extract signals even from highly noisy latents, enabling pseudo ground truth controls for training. By minimizing the discrepancy between predicted and target conditions throughout the entire diffusion process, our alignment loss improves both control fidelity and generation quality. Combined with established techniques like ControlNet++, InnerControl achieves state-of-the-art performance across diverse conditioning methods (e.g., edges, depth).
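The alignment objective described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the probe is modeled as a hypothetical 1x1-convolution (channel-mixing) layer, and the loss is a plain MSE between the probe's predicted control and the target control, averaged over all denoising steps.

```python
import numpy as np

def probe_predict(features, W):
    # Hypothetical lightweight probe: a 1x1 convolution over channels,
    # i.e. a per-pixel linear map from C input channels to C_out channels.
    # features: (C, H, W) intermediate UNet activations; W: (C_out, C).
    return np.einsum('oc,chw->ohw', W, features)

def alignment_loss(feature_maps, target_control, W):
    # MSE between the probe's predicted control signal and the target
    # control (e.g. an edge or depth map), averaged over every denoising
    # step -- a sketch of the InnerControl-style alignment loss.
    losses = []
    for feats in feature_maps:  # one feature map per diffusion step
        pred = probe_predict(feats, W)
        losses.append(np.mean((pred - target_control) ** 2))
    return float(np.mean(losses))
```

In training, this loss would be added to the standard diffusion objective so that the discrepancy is minimized throughout the entire denoising trajectory rather than only at the final steps.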
Related papers
- Noise Consistency Training: A Native Approach for One-Step Generator in Learning Additional Controls [6.343348427620997]
One-step generators offer excellent generation quality and computational efficiency. But adapting them to new control conditions poses a significant challenge. This paper introduces a novel and lightweight approach to directly integrate new control signals into pre-trained one-step generators.
arXiv Detail & Related papers (2025-06-24T15:58:55Z) - Minimal Impact ControlNet: Advancing Multi-ControlNet Integration [35.40147040893738]
In current ControlNet training, each control is designed to influence all areas of an image. Silent control signals can suppress the generation of textures in related areas. We propose Minimal Impact ControlNet to address this problem.
arXiv Detail & Related papers (2025-06-02T13:41:43Z) - PixelPonder: Dynamic Patch Adaptation for Enhanced Multi-Conditional Text-to-Image Generation [24.964136963713102]
We present PixelPonder, a novel unified control framework that allows for effective control of multiple visual conditions under a single control structure. Specifically, we design a patch-level adaptive condition selection mechanism that dynamically prioritizes spatially relevant control signals at the sub-region level. Extensive experiments demonstrate that PixelPonder surpasses previous methods across different benchmark datasets.
arXiv Detail & Related papers (2025-03-09T16:27:02Z) - CoDe: Blockwise Control for Denoising Diffusion Models [9.235074675079767]
Aligning diffusion models to downstream tasks often requires finetuning new models or gradient-based guidance at inference time. In this work, we explore a simple inference-time, gradient-free guidance approach, called controlled denoising (CoDe). CoDe is a blockwise sampling method applied during intermediate denoising steps, allowing for alignment with downstream rewards.
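Blockwise, gradient-free guidance of the kind the CoDe summary describes can be sketched as best-of-N selection at each block of denoising steps. Everything below is a hypothetical illustration: the denoiser is a random-walk stand-in for a real diffusion sampler, and the reward function is an arbitrary example.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_block(x):
    # Stand-in for a few denoising steps of a diffusion sampler
    # (a real model would predict and remove noise here).
    return x + 0.1 * rng.standard_normal(x.shape)

def reward(x):
    # Hypothetical downstream reward: prefer samples close to zero.
    return -float(np.sum(x ** 2))

def blockwise_sample(x0, num_blocks=4, candidates=8):
    # At each block, draw several candidate continuations and keep the
    # one with the highest reward -- no gradients through the model.
    x = x0
    for _ in range(num_blocks):
        cands = [denoise_block(x) for _ in range(candidates)]
        x = max(cands, key=reward)
    return x
```

Because selection replaces backpropagation, this style of guidance works with any black-box reward, at the cost of sampling several candidates per block.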
arXiv Detail & Related papers (2025-02-03T00:23:04Z) - Denoising as Adaptation: Noise-Space Domain Adaptation for Image Restoration [64.84134880709625]
We show that it is possible to perform domain adaptation via the noise space using diffusion models. In particular, by leveraging the unique property of how auxiliary conditional inputs influence the multi-step denoising process, we derive a meaningful diffusion loss. We present crucial strategies such as a channel-shuffling layer and residual-swapping contrastive learning in the diffusion model.
arXiv Detail & Related papers (2024-06-26T17:40:30Z) - Growing Q-Networks: Solving Continuous Control Tasks with Adaptive Control Resolution [51.83951489847344]
In robotics applications, smooth control signals are commonly preferred to reduce system wear and improve energy efficiency.
In this work, we aim to bridge this performance gap by growing discrete action spaces from coarse to fine control resolution.
Our work indicates that adaptive control resolution combined with value decomposition yields simple critic-only algorithms with surprisingly strong performance on continuous control tasks.
arXiv Detail & Related papers (2024-04-05T17:58:37Z) - Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}(\ln(T) / T^{1 - \frac{1}{\alpha}})$.
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - Unsupervised learning based end-to-end delayless generative fixed-filter active noise control [22.809445468752262]
Delayless noise control is achieved by our earlier generative fixed-filter active noise control (GFANC) framework.
The one-dimensional convolutional neural network (1D CNN) in the co-processor requires initial training using labelled noise datasets.
We propose an unsupervised-GFANC approach to simplify the 1D CNN training process and enhance its practicality.
arXiv Detail & Related papers (2024-02-08T06:14:12Z) - DITTO: Diffusion Inference-Time T-Optimization for Music Generation [49.90109850026932]
Diffusion Inference-Time T-Optimization (DITTO) is a framework for controlling pre-trained text-to-music diffusion models at inference time.
We demonstrate a surprisingly wide range of applications for music generation, including inpainting, outpainting, and looping, as well as intensity, melody, and musical structure control.
arXiv Detail & Related papers (2024-01-22T18:10:10Z) - DeNoising-MOT: Towards Multiple Object Tracking with Severe Occlusions [52.63323657077447]
We propose DNMOT, an end-to-end trainable DeNoising Transformer for multiple object tracking.
Specifically, we augment the trajectory with noises during training and make our model learn the denoising process in an encoder-decoder architecture.
We conduct extensive experiments on the MOT17, MOT20, and DanceTrack datasets, and the experimental results show that our method outperforms previous state-of-the-art methods by a clear margin.
arXiv Detail & Related papers (2023-09-09T04:40:01Z) - AdaStereo: An Efficient Domain-Adaptive Stereo Matching Approach [50.855679274530615]
We present a novel domain-adaptive approach called AdaStereo to align multi-level representations for deep stereo matching networks.
Our models achieve state-of-the-art cross-domain performance on multiple benchmarks, including KITTI, Middlebury, ETH3D and DrivingStereo.
Our method is robust to various domain adaptation settings, and can be easily integrated into quick adaptation application scenarios and real-world deployments.
arXiv Detail & Related papers (2021-12-09T15:10:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.