FreeControl: Efficient, Training-Free Structural Control via One-Step Attention Extraction
- URL: http://arxiv.org/abs/2511.05219v1
- Date: Fri, 07 Nov 2025 13:17:46 GMT
- Title: FreeControl: Efficient, Training-Free Structural Control via One-Step Attention Extraction
- Authors: Jiang Lin, Xinyu Chen, Song Wu, Zhiqiu Zhang, Jizhi Zhang, Ye Wang, Qiang Tang, Qian Wang, Jian Yang, Zili Yi
- Abstract summary: FreeControl is a training-free framework for semantic structural control in diffusion models. It performs one-step attention extraction from a single, optimally chosen key timestep and reuses it throughout denoising. Latent-Condition Decoupling (LCD) provides finer control over attention quality and eliminates structural artifacts.
- Score: 26.92307756692061
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Controlling the spatial and semantic structure of diffusion-generated images remains a challenge. Existing methods like ControlNet rely on handcrafted condition maps and retraining, limiting flexibility and generalization. Inversion-based approaches offer stronger alignment but incur high inference cost due to dual-path denoising. We present FreeControl, a training-free framework for semantic structural control in diffusion models. Unlike prior methods that extract attention across multiple timesteps, FreeControl performs one-step attention extraction from a single, optimally chosen key timestep and reuses it throughout denoising. This enables efficient structural guidance without inversion or retraining. To further improve quality and stability, we introduce Latent-Condition Decoupling (LCD): a principled separation of the key timestep and the noised latent used in attention extraction. LCD provides finer control over attention quality and eliminates structural artifacts. FreeControl also supports compositional control via reference images assembled from multiple sources, enabling intuitive scene layout design and stronger prompt alignment. FreeControl introduces a new paradigm for test-time control, enabling structurally and semantically aligned, visually coherent generation directly from raw images, with the flexibility for intuitive compositional design and compatibility with modern diffusion models at approximately 5 percent additional cost.
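To make the core mechanism concrete, below is a minimal, self-contained PyTorch sketch of the idea described in the abstract: attention is extracted once from the noised reference latent at a single key timestep, cached, and reused at every denoising step, with Latent-Condition Decoupling modeled as separate timesteps for noising the latent and for conditioning. This is an illustrative sketch under stated assumptions, not the authors' implementation; the module and helper names and the timestep values (SelfAttention, add_noise, key_t=601, latent_t=401) are hypothetical, and the sampler update is a placeholder.

```python
import torch

class SelfAttention(torch.nn.Module):
    """Toy attention block that can cache and replay its attention map."""
    def __init__(self, dim):
        super().__init__()
        self.to_q = torch.nn.Linear(dim, dim)
        self.to_k = torch.nn.Linear(dim, dim)
        self.to_v = torch.nn.Linear(dim, dim)
        self.t_proj = torch.nn.Linear(1, dim)  # toy timestep conditioning
        self.stored_map = None   # attention extracted at the key timestep
        self.use_stored = False  # when True, replay the cached map as guidance

    def forward(self, x, t):
        h = x + self.t_proj(torch.tensor([[t / 1000.0]]))
        q, k, v = self.to_q(h), self.to_k(h), self.to_v(h)
        if self.use_stored and self.stored_map is not None:
            attn = self.stored_map  # structural guidance from the reference
        else:
            attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
            self.stored_map = attn.detach()  # one-step extraction: cache it
        return attn @ v

def add_noise(latent, t, T=1000):
    # toy DDPM-style forward noising with a linear placeholder schedule
    alpha_bar = 1.0 - t / T
    return alpha_bar ** 0.5 * latent + (1.0 - alpha_bar) ** 0.5 * torch.randn_like(latent)

block = SelfAttention(dim=64)
ref_latent = torch.randn(1, 16, 64)  # stand-in for an encoded reference image

# One-step extraction: no inversion, no retraining. Latent-Condition
# Decoupling: the latent is noised to latent_t while the attention is
# conditioned on a separately chosen key timestep key_t.
key_t, latent_t = 601, 401  # assumed values for illustration only
_ = block(add_noise(ref_latent, latent_t), key_t)

# Reuse the cached attention map at every denoising step.
block.use_stored = True
x = torch.randn(1, 16, 64)
for t in reversed(range(0, 1000, 100)):
    x = x - 0.01 * block(x, t)  # placeholder update; a real sampler differs
print(x.shape)  # torch.Size([1, 16, 64])
```

Because extraction happens once rather than along a full inversion trajectory, the only extra work is a single forward pass through the attention blocks, which is consistent with the roughly 5 percent overhead quoted in the abstract.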
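The compositional control mentioned in the abstract only changes how the reference is built before extraction. A hedged sketch, assuming simple box placement onto a blank canvas (the assemble_reference helper and all coordinates are hypothetical, not the paper's procedure):

```python
import torch
import torch.nn.functional as F

def assemble_reference(sources, boxes, size=(64, 64)):
    """sources: list of (3, H, W) tensors; boxes: list of (top, left, h, w)."""
    canvas = torch.zeros(3, *size)
    for img, (top, left, h, w) in zip(sources, boxes):
        # resize each source crop to its target box, then paste it in place
        patch = F.interpolate(img.unsqueeze(0), size=(h, w),
                              mode="bilinear", align_corners=False)[0]
        canvas[:, top:top + h, left:left + w] = patch
    return canvas

# Two sources pasted onto one canvas; later boxes overwrite earlier ones.
ref = assemble_reference(
    [torch.rand(3, 32, 32), torch.rand(3, 48, 48)],
    [(0, 0, 32, 32), (20, 20, 44, 44)],
)
print(ref.shape)  # torch.Size([3, 64, 64])
```

The assembled canvas then plays the role of the reference image in the one-step extraction sketch above, so scene layout design reduces to choosing the boxes.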
Related papers
- ScaleWeaver: Weaving Efficient Controllable T2I Generation with Multi-Scale Reference Attention [86.93601565563954]
ScaleWeaver is a framework designed to achieve high-fidelity, controllable generation built on advanced visual autoregressive (VAR) models. The proposed Reference Attention module discards the unnecessary image→condition attention, reducing computational cost. Experiments show that ScaleWeaver delivers high-quality generation and precise control while attaining superior efficiency over diffusion-based methods.
arXiv Detail & Related papers (2025-10-16T17:00:59Z)
- Optimal Control Meets Flow Matching: A Principled Route to Multi-Subject Fidelity [35.95129874095729]
Text-to-image (T2I) models excel on single-entity prompts but struggle with multi-subject descriptions. We introduce the first theoretical framework with a principled, optimizable objective for steering sampling dynamics toward multi-subject fidelity.
arXiv Detail & Related papers (2025-10-02T17:59:58Z)
- FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios [49.09128364751743]
Action customization involves generating videos where the subject performs actions dictated by input control signals. Current methods use pose-guided or global motion customization but are limited by strict constraints on spatial structure. We propose FlexiAct, which transfers actions from a reference video to an arbitrary target image.
arXiv Detail & Related papers (2025-05-06T17:58:02Z)
- PixelPonder: Dynamic Patch Adaptation for Enhanced Multi-Conditional Text-to-Image Generation [24.964136963713102]
We present PixelPonder, a novel unified control framework that allows for effective control of multiple visual conditions under a single control structure. Specifically, we design a patch-level adaptive condition selection mechanism that dynamically prioritizes spatially relevant control signals at the sub-region level. Extensive experiments demonstrate that PixelPonder surpasses previous methods across different benchmark datasets.
arXiv Detail & Related papers (2025-03-09T16:27:02Z)
- DC-ControlNet: Decoupling Inter- and Intra-Element Conditions in Image Generation with Diffusion Models [55.42794740244581]
We introduce DC (Decouple)-ControlNet, a framework for multi-condition image generation. The core idea behind DC-ControlNet is to decouple control conditions, transforming global control into a hierarchical system. For interactions between elements, we introduce the Inter-Element Controller.
arXiv Detail & Related papers (2025-02-20T18:01:02Z)
- FlexControl: Computation-Aware ControlNet with Differentiable Router for Text-to-Image Generation [10.675687253961595]
ControlNet offers a powerful way to guide diffusion-based generative models. Most implementations rely on ad-hoc heuristics to choose which network blocks to control, an approach that varies unpredictably across tasks. We propose FlexControl, a framework that copies all diffusion blocks during training and employs a trainable gating mechanism.
arXiv Detail & Related papers (2025-02-11T23:27:58Z)
- ControlNeXt: Powerful and Efficient Control for Image and Video Generation [59.62289489036722]
We propose ControlNeXt: a powerful and efficient method for controllable image and video generation. We first design a more straightforward and efficient architecture, replacing heavy additional branches at minimal additional cost. As for training, we reduce learnable parameters by up to 90% compared to the alternatives.
arXiv Detail & Related papers (2024-08-12T11:41:18Z)
- FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation [99.4649330193233]
Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs of other modalities like edge maps.
We propose a novel Flexible and Efficient method, FlexEControl, for controllable T2I generation.
arXiv Detail & Related papers (2024-05-08T06:09:11Z)
- FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition [41.92032568474062]
FreeControl is a training-free approach for controllable T2I generation.
It supports multiple conditions, architectures, and checkpoints simultaneously.
It achieves competitive synthesis quality with training-based approaches.
arXiv Detail & Related papers (2023-12-12T18:59:14Z)
- UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild [166.25327094261038]
We introduce UniControl, a new generative foundation model for controllable condition-to-image (C2I) tasks.
UniControl consolidates a wide array of C2I tasks within a singular framework, while still allowing for arbitrary language prompts.
Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities.
arXiv Detail & Related papers (2023-05-18T17:41:34Z)
- Optimal PID and Antiwindup Control Design as a Reinforcement Learning Problem [3.131740922192114]
We focus on the interpretability of DRL control methods.
In particular, we view linear fixed-structure controllers as shallow neural networks embedded in the actor-critic framework.
arXiv Detail & Related papers (2020-05-10T01:05:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.