RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation
- URL: http://arxiv.org/abs/2507.02792v2
- Date: Tue, 08 Jul 2025 16:01:31 GMT
- Title: RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation
- Authors: Liheng Zhang, Lexi Pang, Hang Ye, Xiaoxuan Ma, Yizhou Wang,
- Abstract summary: Text-to-image (T2I) diffusion models have shown remarkable success in generating high-quality images from text prompts. We propose a flexible feature injection framework that decouples the injection timestep from the denoising process. Our approach achieves state-of-the-art performance across diverse zero-shot conditioning scenarios.
- Score: 16.038598998902767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image (T2I) diffusion models have shown remarkable success in generating high-quality images from text prompts. Recent efforts extend these models to incorporate conditional images (e.g., depth or pose maps) for fine-grained spatial control. Among them, feature injection methods have emerged as a training-free alternative to traditional fine-tuning approaches. However, they often suffer from structural misalignment, condition leakage, and visual artifacts, especially when the condition image diverges significantly from natural RGB distributions. By revisiting existing methods, we identify a core limitation: the synchronous injection of condition features fails to account for the trade-off between domain alignment and structural preservation during denoising. Inspired by this observation, we propose a flexible feature injection framework that decouples the injection timestep from the denoising process. At its core is a structure-rich injection module, which enables the model to better adapt to the evolving interplay between alignment and structure preservation throughout the diffusion steps, resulting in more faithful structural generation. In addition, we introduce appearance-rich prompting and a restart refinement strategy to further enhance appearance control and visual quality. Together, these designs enable training-free generation that is both structure-rich and appearance-rich. Extensive experiments show that our approach achieves state-of-the-art performance across diverse zero-shot conditioning scenarios.
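The core mechanism stated in the abstract, extracting and injecting condition features at a timestep chosen independently of the current denoising step, can be pictured with a short sketch. This is an illustrative reconstruction rather than the authors' implementation: `injection_schedule`, `extract_condition_features`, and `inject` are hypothetical callables passed in by the caller, and the structure-rich injection module, appearance-rich prompting, and restart refinement described in the abstract are not modeled here.

```python
# Minimal sketch of decoupled feature injection (assumed interfaces, not the paper's code).
import torch

@torch.no_grad()
def sample_with_decoupled_injection(
    unet,                        # noise-prediction network, e.g. a diffusers-style UNet
    scheduler,                   # DDIM-style scheduler with .timesteps and .step()
    latents,                     # initial noise latents for the generated image
    cond_latent,                 # encoded condition image (depth map, pose map, etc.)
    prompt_emb,                  # text-prompt embeddings
    injection_schedule,          # hypothetical: (step_idx, t) -> injection timestep t_inj
    extract_condition_features,  # hypothetical: (unet, cond_latent, t_inj, prompt_emb) -> feats
    inject,                      # hypothetical: context manager patching UNet blocks with feats
):
    for i, t in enumerate(scheduler.timesteps):
        # Synchronous methods would extract condition features at the same timestep t
        # being denoised; here the injection timestep is chosen independently, which is
        # the decoupling the abstract describes.
        t_inj = injection_schedule(i, t)
        feats = extract_condition_features(unet, cond_latent, t_inj, prompt_emb)

        # Predict noise for the sample while the cached condition features are
        # injected into the corresponding UNet blocks.
        with inject(unet, feats):
            noise_pred = unet(latents, t, encoder_hidden_states=prompt_emb).sample

        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```

For instance, a schedule that keeps `t_inj` fixed at a low-noise timestep already decouples injection from denoising; the abstract's structure-rich injection module and restart refinement strategy go beyond this simple sketch.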
Related papers
- Restoring Real-World Images with an Internal Detail Enhancement Diffusion Model [9.520471615470267]
Restoring real-world degraded images, such as old photographs or low-resolution images, presents a significant challenge.
Recent data-driven approaches have struggled with achieving high-fidelity restoration and providing object-level control over colorization.
We propose an internal detail-preserving diffusion model for high-fidelity restoration of real-world degraded images.
arXiv Detail & Related papers (2025-05-24T12:32:53Z)
- From Missing Pieces to Masterpieces: Image Completion with Context-Adaptive Diffusion [98.31811240195324]
ConFill is a novel framework that reduces discrepancies between generated and original images at each diffusion step.
It outperforms current methods, setting a new benchmark in image completion.
arXiv Detail & Related papers (2025-04-19T13:40:46Z)
- PixelPonder: Dynamic Patch Adaptation for Enhanced Multi-Conditional Text-to-Image Generation [24.964136963713102]
We present PixelPonder, a novel unified control framework that allows for effective control of multiple visual conditions under a single control structure.
Specifically, we design a patch-level adaptive condition selection mechanism that dynamically prioritizes spatially relevant control signals at the sub-region level.
Extensive experiments demonstrate that PixelPonder surpasses previous methods across different benchmark datasets.
arXiv Detail & Related papers (2025-03-09T16:27:02Z)
- ZePo: Zero-Shot Portrait Stylization with Faster Sampling [61.14140480095604]
This paper presents an inversion-free portrait stylization framework based on diffusion models that accomplishes content and style feature fusion in merely four sampling steps.
We propose a feature merging strategy to amalgamate redundant features in Consistency Features, thereby reducing the computational load of attention control.
arXiv Detail & Related papers (2024-08-10T08:53:41Z)
- TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization [59.412236435627094]
TALE is a training-free framework harnessing the generative capabilities of text-to-image diffusion models.
We equip TALE with two mechanisms dubbed Adaptive Latent Manipulation and Energy-guided Latent Optimization.
Our experiments demonstrate that TALE surpasses prior baselines and attains state-of-the-art performance in image-guided composition.
arXiv Detail & Related papers (2024-08-07T08:52:21Z)
- Coherent and Multi-modality Image Inpainting via Latent Space Optimization [61.99406669027195]
PILOT (inPainting vIa Latent OpTimization) is an optimization approach grounded on a novel semantic centralization and background preservation loss.
Our method searches latent spaces capable of generating inpainted regions that exhibit high fidelity to user-provided prompts while maintaining coherence with the background.
arXiv Detail & Related papers (2024-07-10T19:58:04Z)
- Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning [40.06403155373455]
We propose a novel reinforcement learning framework for personalized text-to-image generation.
Our proposed approach outperforms existing state-of-the-art methods by a large margin on visual fidelity while maintaining text-alignment.
arXiv Detail & Related papers (2024-07-09T08:11:53Z)
- TCIG: Two-Stage Controlled Image Generation with Quality Enhancement through Diffusion [0.0]
A two-stage method is proposed that combines controllability and high quality in image generation.
By separating controllability from quality enhancement, this method achieves outstanding results.
arXiv Detail & Related papers (2024-03-02T13:59:02Z)
- DreamDrone: Text-to-Image Diffusion Models are Zero-shot Perpetual View Generators [56.994967294931286]
We introduce DreamDrone, a novel zero-shot and training-free pipeline for generating flythrough scenes from textual prompts.
We advocate explicitly warping the intermediate latent code of the pre-trained text-to-image diffusion model for high-quality image generation and unbounded generalization ability.
arXiv Detail & Related papers (2023-12-14T08:42:26Z)
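The DreamDrone entry above advocates explicitly warping the intermediate latent code during sampling. The sketch below is a generic illustration of latent warping, not DreamDrone's actual pipeline: the per-pixel displacement field `flow` is assumed to come from camera motion or a similar external source, and how the warped latent is fed back into denoising is not shown.

```python
# Generic latent-warping sketch (assumed interface, not DreamDrone's code).
import torch
import torch.nn.functional as F

def warp_latent(latent: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Resample an intermediate latent [B, C, H, W] with a displacement field
    [B, 2, H, W] given in pixel units of the latent grid (same dtype as latent)."""
    b, _, h, w = latent.shape
    # Base sampling grid in normalized [-1, 1] coordinates, as grid_sample expects.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=latent.device, dtype=latent.dtype),
        torch.linspace(-1, 1, w, device=latent.device, dtype=latent.dtype),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).expand(b, h, w, 2)
    # Convert the displacement from latent pixels to normalized coordinates.
    norm_flow = torch.stack(
        (flow[:, 0] * 2.0 / max(w - 1, 1), flow[:, 1] * 2.0 / max(h - 1, 1)), dim=-1
    )
    # Sample the latent at the displaced positions (bilinear, border padding).
    return F.grid_sample(latent, base + norm_flow, mode="bilinear",
                         padding_mode="border", align_corners=True)
```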
- Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis [62.07413805483241]
Steered Diffusion is a framework for zero-shot conditional image generation using a diffusion model trained for unconditional generation.
We present experiments using steered diffusion on several tasks including inpainting, colorization, text-guided semantic editing, and image super-resolution.
arXiv Detail & Related papers (2023-09-30T02:03:22Z)
- Controlling Text-to-Image Diffusion by Orthogonal Finetuning [74.21549380288631]
We introduce a principled finetuning method, Orthogonal Finetuning (OFT), for adapting text-to-image diffusion models to downstream tasks.
Unlike existing methods, OFT can provably preserve hyperspherical energy which characterizes the pairwise neuron relationship on the unit hypersphere.
We empirically show that our OFT framework outperforms existing methods in generation quality and convergence speed.
arXiv Detail & Related papers (2023-06-12T17:59:23Z)
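The OFT entry above describes replacing additive weight updates with a learned orthogonal transform of the frozen pretrained weights, which preserves pairwise neuron angles. The sketch below is a minimal, generic reconstruction of that reparameterization for a single linear layer; the class name, tensor layout, and the choice to rotate the input dimension of the `nn.Linear` weight are illustrative assumptions, and the block-diagonal parameterization used in the paper for efficiency is omitted.

```python
# Illustrative sketch of orthogonal finetuning, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OFTLinearSketch(nn.Module):
    """The frozen pretrained weight is multiplied by a learned orthogonal matrix R
    (Cayley transform of a skew-symmetric matrix). Because every neuron vector is
    transformed by the same rotation, pairwise angles between neurons (and hence
    their hyperspherical energy) are preserved."""

    def __init__(self, pretrained: nn.Linear):
        super().__init__()
        self.register_buffer("w0", pretrained.weight.detach().clone())  # frozen weight
        self.bias = pretrained.bias
        d = pretrained.in_features
        # Unconstrained trainable parameter; only its skew-symmetric part is used,
        # so R starts at the identity and the layer initially matches the original.
        self.q = nn.Parameter(torch.zeros(d, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skew = 0.5 * (self.q - self.q.T)
        eye = torch.eye(skew.shape[0], device=skew.device, dtype=skew.dtype)
        # Cayley transform: R = (I - S)^{-1}(I + S) is orthogonal for skew-symmetric S.
        r = torch.linalg.solve(eye - skew, eye + skew)
        weight = self.w0 @ r  # same rotation applied to every (row) neuron vector
        return F.linear(x, weight, self.bias)
```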