Related papers: Localized Control in Diffusion Models via Latent Vector Prediction

Localized Control in Diffusion Models via Latent Vector Prediction

URL: http://arxiv.org/abs/2602.01991v2
Date: Wed, 11 Feb 2026 16:40:46 GMT
Title: Localized Control in Diffusion Models via Latent Vector Prediction
Authors: Pablo Domingo-Gregorio, Javier Ruiz-Hidalgo,
Abstract summary: We propose a novel methodology to enable precise local control over user-defined regions of an image.<n>Our method effectively synthesizes high-quality images with controlled local conditions.
Score: 2.4923006485141284
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Diffusion models emerged as a leading approach in text-to-image generation, producing high-quality images from textual descriptions. However, attempting to achieve detailed control to get a desired image solely through text remains a laborious trial-and-error endeavor. Recent methods have introduced image-level controls alongside with text prompts, using prior images to extract conditional information such as edges, segmentation and depth maps. While effective, these methods apply conditions uniformly across the entire image, limiting localized control. In this paper, we propose a novel methodology to enable precise local control over user-defined regions of an image, while leaving to the diffusion model the task of autonomously generating the remaining areas according to the original prompt. Our approach introduces a new training framework that incorporates masking features and an additional loss term, which leverages the prediction of the initial latent vector at any diffusion step to enhance the correspondence between the current step and the final sample in the latent space. Extensive experiments demonstrate that our method effectively synthesizes high-quality images with controlled local conditions.

Related papers

SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation [9.212970624261272]
State-of-the-art text-to-image models produce visually impressive results but often struggle with precise alignment to text prompts.<n>We propose a novel approach that learns a high-success-rate distribution conditioned on a target prompt.<n>Our method explicitly models the signal component during the denoising process, offering fine-grained control that mitigates over-optimization.
arXiv Detail & Related papers (2025-08-19T14:31:15Z)
TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization [59.412236435627094]
TALE is a training-free framework harnessing the generative capabilities of text-to-image diffusion models. We equip TALE with two mechanisms dubbed Adaptive Latent Manipulation and Energy-guided Latent Optimization. Our experiments demonstrate that TALE surpasses prior baselines and attains state-of-the-art performance in image-guided composition.
arXiv Detail & Related papers (2024-08-07T08:52:21Z)
Coherent and Multi-modality Image Inpainting via Latent Space Optimization [61.99406669027195]
PILOT (intextbfPainting vtextbfIa textbfLatent textbfOptextbfTimization) is an optimization approach grounded on a novel textitsemantic centralization and textitbackground preservation loss. Our method searches latent spaces capable of generating inpainted regions that exhibit high fidelity to user-provided prompts while maintaining coherence with the background.
arXiv Detail & Related papers (2024-07-10T19:58:04Z)
Local Conditional Controlling for Text-to-Image Diffusion Models [26.54188248406709]
Diffusion models have exhibited impressive prowess in the text-to-image task. Recent methods add image-level structure controls, e.g., edge and depth maps, to manipulate the generation process together with text prompts to obtain desired images. This controlling process is globally operated on the entire image, which limits the flexibility of control regions.
arXiv Detail & Related papers (2023-12-14T09:31:33Z)
LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis [24.925757148750684]
We propose a training-free approach for layout-to-image Synthesis that excels in producing high-quality images aligned with both textual prompts and layout instructions. LoCo seamlessly integrates into existing text-to-image and layout-to-image models, enhancing their performance in spatial control and addressing semantic failures observed in prior methods.
arXiv Detail & Related papers (2023-11-21T04:28:12Z)
Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling. We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content. We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z)
Improving Diffusion-based Image Translation using Asymmetric Gradient Guidance [51.188396199083336]
We present an approach that guides the reverse process of diffusion sampling by applying asymmetric gradient guidance. Our model's adaptability allows it to be implemented with both image-fusion and latent-dif models. Experiments show that our method outperforms various state-of-the-art models in image translation tasks.
arXiv Detail & Related papers (2023-06-07T12:56:56Z)
Conditional Score Guidance for Text-Driven Image-to-Image Translation [52.73564644268749]
We present a novel algorithm for text-driven image-to-image translation based on a pretrained text-to-image diffusion model. Our method aims to generate a target image by selectively editing the regions of interest in a source image.
arXiv Detail & Related papers (2023-05-29T10:48:34Z)
SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control. In addition to a global text prompt that describes the entire scene, the user provides a segmentation map. We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-conditional-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.