Related papers: Noise Projection: Closing the Prompt-Agnostic Gap Behind Text-to-Image Misalignment in Diffusion Models

URL: http://arxiv.org/abs/2510.14526v1
Date: Thu, 16 Oct 2025 10:14:34 GMT
Title: Noise Projection: Closing the Prompt-Agnostic Gap Behind Text-to-Image Misalignment in Diffusion Models
Authors: Yunze Tong, Didi Zhu, Zijing Hu, Jinluan Yang, Ziyu Zhao,
Abstract summary: In text-to-image generation, different initial noises induce distinct denoising paths with a pretrained Stable Diffusion (SD) model. While this pattern could output diverse images, some of them may fail to align well with the prompt. We propose a noise projector that applies text-conditioned refinement to the initial noise before denoising.
Score: 9.683618735282414
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In text-to-image generation, different initial noises induce distinct denoising paths with a pretrained Stable Diffusion (SD) model. While this pattern could output diverse images, some of them may fail to align well with the prompt. Existing methods alleviate this issue either by altering the denoising dynamics or by drawing multiple noises and conducting post-selection. In this paper, we attribute the misalignment to a training-inference mismatch: during training, prompt-conditioned noises lie in a prompt-specific subset of the latent space, whereas at inference the noise is drawn from a prompt-agnostic Gaussian prior. To close this gap, we propose a noise projector that applies text-conditioned refinement to the initial noise before denoising. Conditioned on the prompt embedding, it maps the noise to a prompt-aware counterpart that better matches the distribution observed during SD training, without modifying the SD model. Our framework consists of these steps: we first sample some noises and obtain token-level feedback for their corresponding images from a vision-language model (VLM), then distill these signals into a reward model, and finally optimize the noise projector via a quasi-direct preference optimization. Our design has two benefits: (i) it requires no reference images or handcrafted priors, and (ii) it incurs small inference cost, replacing multi-sample selection with a single forward pass. Extensive experiments further show that our prompt-aware noise projection improves text-image alignment across diverse prompts.
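The core idea in the abstract, refining the initial noise conditioned on the prompt embedding before denoising starts, can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the residual MLP architecture, the zero initialization, the dimensions, and the names `init_projector`, `project_noise`, and `pairwise_preference_loss` are all assumptions made for illustration. The loss shown is a generic pairwise logistic (Bradley-Terry style) preference loss standing in for the paper's "quasi-direct preference optimization", whose exact form is not given in the abstract.

```python
import numpy as np


def init_projector(noise_dim, text_dim, hidden_dim=64, seed=0):
    """Create parameters for a hypothetical residual noise projector."""
    rng = np.random.default_rng(seed)
    return {
        "W_in": rng.normal(0.0, 0.02, (noise_dim + text_dim, hidden_dim)),
        # Zero output weights (an assumption): the projector starts as the
        # identity map, so untrained it returns the Gaussian noise unchanged.
        "W_out": np.zeros((hidden_dim, noise_dim)),
    }


def project_noise(params, noise, text_emb):
    """Map prompt-agnostic noise to a prompt-conditioned counterpart.

    noise:    (batch, noise_dim) samples from the standard Gaussian prior
    text_emb: (batch, text_dim) prompt embeddings
    """
    x = np.concatenate([noise, text_emb], axis=-1)
    hidden = np.tanh(x @ params["W_in"])
    # Residual form: the refined noise is the original plus a learned offset.
    return noise + hidden @ params["W_out"]


def pairwise_preference_loss(reward_preferred, reward_rejected, beta=1.0):
    """Generic pairwise logistic loss: -log(sigmoid(beta * (r_w - r_l))).

    Computed as log(1 + exp(-margin)) via logaddexp for numerical stability.
    """
    margin = beta * (np.asarray(reward_preferred) - np.asarray(reward_rejected))
    return np.logaddexp(0.0, -margin)


# Usage: refine a batch of noises for a batch of (random stand-in) prompts.
rng = np.random.default_rng(0)
params = init_projector(noise_dim=16, text_dim=8)
noise = rng.standard_normal((4, 16))
text_emb = rng.standard_normal((4, 8))
refined = project_noise(params, noise, text_emb)
```

In a real pipeline the refined noise would replace the sampled noise as the starting latent of the frozen SD model, which is why inference needs only one extra forward pass through the projector rather than multi-sample selection.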

Related papers

It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models [80.53672733210111]
We show that a simple noise optimization objective can mitigate mode collapse while preserving the fidelity of the base model. Our experiments demonstrate that noise optimization yields superior results in terms of generation quality and variety.
arXiv Detail & Related papers (2025-12-31T19:47:49Z)
Mitigating the Noise Shift for Denoising Generative Models via Noise Awareness Guidance [54.88271057438763]
Noise Awareness Guidance (NAG) is a correction method that explicitly steers sampling trajectories to remain consistent with the pre-defined noise schedule. NAG consistently mitigates noise shift and substantially improves the generation quality of mainstream diffusion models.
arXiv Detail & Related papers (2025-10-14T13:31:34Z)
Be Decisive: Noise-Induced Layouts for Multi-Subject Generation [56.80513553424086]
Complex prompts lead to subject leakage, causing inaccuracies in quantities, attributes, and visual features. We introduce a new approach that predicts a spatial layout aligned with the prompt, derived from the initial noise, and refines it throughout the denoising process. Our method employs a small neural network to predict and refine the evolving noise-induced layout at each denoising step.
arXiv Detail & Related papers (2025-05-27T17:54:24Z)
Enhancing Sample Generation of Diffusion Models using Noise Level Correction [9.014666170540304]
We propose a novel method to enhance sample generation by aligning the estimated noise level with the true distance of noisy samples to the manifold. Specifically, we introduce a noise level correction network, leveraging a pre-trained denoising network, to refine noise level estimates during the denoising process. Experimental results demonstrate that our method significantly improves sample quality in both unconstrained and constrained generation scenarios.
arXiv Detail & Related papers (2024-12-07T01:19:14Z)
The Silent Assistant: NoiseQuery as Implicit Guidance for Goal-Driven Image Generation [31.599902235859687]
We propose to leverage an aligned Gaussian noise as implicit guidance to complement explicit user-defined inputs, such as text prompts. NoiseQuery enables fine-grained control and yields significant performance boosts over high-level semantics and over low-level visual attributes.
arXiv Detail & Related papers (2024-12-06T14:59:00Z)
Beyond Image Prior: Embedding Noise Prior into Conditional Denoising Transformer [17.430622649002427]
Existing learning-based denoising methods typically train models to generalize the image prior from large-scale datasets. We propose a new perspective on the denoising challenge by highlighting the distinct separation between noise and image priors. We introduce a Locally Noise Prior Estimation algorithm, which accurately estimates the noise prior directly from a single raw noisy image.
arXiv Detail & Related papers (2024-07-12T08:43:11Z)
InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization [27.508861002013358]
InitNO is a paradigm that refines the initial noise in semantically-faithful images. A strategically crafted noise optimization pipeline is developed to guide the initial noise towards valid regions. Our method, validated through rigorous experimentation, shows a commendable proficiency in generating images in strict accordance with text prompts.
arXiv Detail & Related papers (2024-04-06T14:56:59Z)
DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective. Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process. During training, our model learns to reverse the noising process by converting noisy latent queries to the groundtruth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
Score Priors Guided Deep Variational Inference for Unsupervised Real-World Single Image Denoising [14.486289176696438]
We propose a score priors-guided deep variational inference, namely ScoreDVI, for practical real-world denoising. We exploit a Non-$i.i.d$ Gaussian mixture model and variational noise posterior to model the real-world noise. Our method outperforms other single image-based real-world denoising methods and achieves comparable performance to dataset-based unsupervised methods.
arXiv Detail & Related papers (2023-08-09T03:26:58Z)
NLIP: Noise-robust Language-Image Pre-training [95.13287735264937]
We propose a principled Noise-robust Language-Image Pre-training framework (NLIP) to stabilize pre-training via two schemes: noise-harmonization and noise-completion. Our NLIP can alleviate the common noise effects during image-text pre-training in a more efficient way.
arXiv Detail & Related papers (2022-12-14T08:19:30Z)
Variational Denoising Network: Toward Blind Noise Modeling and Removal [59.36166491196973]
Blind image denoising is an important yet very challenging problem in computer vision. We propose a new variational inference method, which integrates both noise estimation and image denoising.
arXiv Detail & Related papers (2019-08-29T15:54:06Z)

This list is automatically generated from the titles and abstracts of the papers on this site.