The Silent Assistant: NoiseQuery as Implicit Guidance for Goal-Driven Image Generation
- URL: http://arxiv.org/abs/2412.05101v2
- Date: Mon, 17 Mar 2025 05:01:35 GMT
- Title: The Silent Assistant: NoiseQuery as Implicit Guidance for Goal-Driven Image Generation
- Authors: Ruoyu Wang, Huayang Huang, Ye Zhu, Olga Russakovsky, Yu Wu
- Abstract summary: We propose to leverage an aligned Gaussian noise as implicit guidance to complement explicit user-defined inputs, such as text prompts. NoiseQuery enables fine-grained control and yields significant performance boosts on both high-level semantics and low-level visual attributes.
- Score: 31.599902235859687
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we introduce NoiseQuery as a novel method for enhanced noise initialization in versatile goal-driven text-to-image (T2I) generation. Specifically, we propose to leverage an aligned Gaussian noise as implicit guidance to complement explicit user-defined inputs, such as text prompts, for better generation quality and controllability. Unlike existing noise optimization methods designed for specific models, our approach is grounded in a fundamental examination of the generic finite-step noise scheduler design in diffusion formulations, allowing better generalization across different diffusion-based architectures in a tuning-free manner. This model-agnostic nature allows us to construct a reusable noise library compatible with multiple T2I models and enhancement techniques, serving as a foundational layer for more effective generation. Extensive experiments demonstrate that NoiseQuery enables fine-grained control and yields significant performance boosts not only on high-level semantics but also on low-level visual attributes, which are typically difficult to specify through text alone, while integrating seamlessly into current workflows with minimal computational overhead.
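The abstract does not give implementation details, but the noise-library idea admits a minimal sketch, assuming each stored Gaussian latent is indexed by a CLIP-style embedding of the image it yields under a null prompt; the NoiseLibrary class and its query method below are hypothetical names, not the authors' API.

```python
import torch

class NoiseLibrary:
    """Hypothetical reusable noise library: each Gaussian latent is paired
    with a precomputed embedding of the image it produces when the diffusion
    model is run with an empty prompt."""
    def __init__(self, latents: torch.Tensor, embeddings: torch.Tensor):
        self.latents = latents  # (N, C, H, W) fixed Gaussian noises
        self.embeddings = torch.nn.functional.normalize(embeddings, dim=-1)

    @torch.no_grad()
    def query(self, prompt_embedding: torch.Tensor) -> torch.Tensor:
        """Return the stored noise whose implicit content best matches the
        target prompt (cosine similarity in a shared embedding space)."""
        q = torch.nn.functional.normalize(prompt_embedding, dim=-1)
        scores = self.embeddings @ q  # one similarity score per library entry
        return self.latents[scores.argmax()]
```

The chosen latent would then replace the default random initialization of any diffusion sampler, consistent with the tuning-free, model-agnostic usage described above.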
Related papers
- DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers [86.5541501589166]
DiffMoE introduces a batch-level global token pool that enables experts to access global token distributions during training.
It achieves state-of-the-art performance among diffusion models on the ImageNet benchmark.
The effectiveness of our approach extends beyond class-conditional generation to more challenging tasks such as text-to-image generation.
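Read at face value, a batch-level global token pool means the router competes over every token in the batch rather than within each sample; the sketch below illustrates that reading, with route_batch_global and its capacity argument as illustrative assumptions rather than the DiffMoE code.

```python
import torch

def route_batch_global(tokens: torch.Tensor, router: torch.nn.Linear, capacity: int):
    """Flatten all tokens in the batch into one pool, then let each expert
    pick its highest-affinity tokens from that global distribution."""
    pool = tokens.reshape(-1, tokens.shape[-1])     # (B, T, D) -> (B*T, D)
    probs = router(pool).softmax(dim=-1)            # (B*T, num_experts)
    picks = []
    for e in range(probs.shape[-1]):
        k = min(capacity, pool.shape[0])
        picks.append(probs[:, e].topk(k).indices)   # expert e's token indices
    return picks
```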
arXiv Detail & Related papers (2025-03-18T17:57:07Z) - Using Random Noise Equivariantly to Boost Graph Neural Networks Universally [27.542173012315413]
Graph Neural Networks (GNNs) have explored the potential of random noise as an input feature to enhance expressivity across diverse tasks.
This paper lays down a theoretical framework that elucidates the increased sample complexity when random noise is injected into GNNs without careful design.
We propose Equivariant Noise GNN (ENGNN), a novel architecture that harnesses the symmetrical properties of noise to reduce sample complexity and bolster generalization.
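One way to read "harnessing the symmetrical properties of noise" is to mix random node features only through permutation-equivariant operations, so relabeling the nodes permutes the output identically; the function below is a sketch under that assumption, not the ENGNN architecture.

```python
import torch

def equivariant_noise_features(x: torch.Tensor, adj: torch.Tensor, noise_dim: int = 8):
    """Append per-node Gaussian noise and aggregate it over neighborhoods.
    Both operations commute with node permutations, so expressivity grows
    without breaking the symmetry GNNs rely on."""
    n = x.shape[0]
    z = torch.randn(n, noise_dim)                    # fresh random node features
    deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
    z_agg = (adj @ z) / deg                          # equivariant mean over neighbors
    return torch.cat([x, z, z_agg], dim=-1)
```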
arXiv Detail & Related papers (2025-02-04T16:54:28Z) - Enhance Vision-Language Alignment with Noise [59.2608298578913]
We investigate whether the frozen model can be fine-tuned by customized noise.
We propose Positive-incentive Noise (PiNI) which can fine-tune CLIP via injecting noise into both visual and text encoders.
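A minimal sketch of noise injection into a frozen encoder follows, assuming the injected noise has learnable mean and scale while the encoder weights stay fixed; NoisyAdapter is a hypothetical module name, not the PiNI code.

```python
import torch

class NoisyAdapter(torch.nn.Module):
    """Additive noise on a frozen encoder's features; only the noise
    distribution's parameters receive gradients."""
    def __init__(self, dim: int):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.zeros(dim))
        self.log_sigma = torch.nn.Parameter(torch.zeros(dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        eps = torch.randn_like(feats)
        return feats + self.mu + self.log_sigma.exp() * eps
```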
arXiv Detail & Related papers (2024-12-14T12:58:15Z) - Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis [9.11767497956649]
This paper proposes leveraging the language comprehension capabilities of large vision-language models to guide the optimization of the initial noisy latent.
We introduce the Noise Diffusion process, which updates the noisy latent to generate semantically faithful images while preserving distribution consistency.
Experimental results demonstrate the effectiveness and adaptability of our framework, consistently enhancing semantic alignment across various diffusion models.
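The update rule is not spelled out here, but a plausible single step looks like the sketch below: move the latent along an alignment gradient (e.g. from a vision-language scorer), then blend in fresh Gaussian noise so the latent stays approximately distributed as the scheduler expects. The step and beta values are illustrative.

```python
import torch

@torch.no_grad()
def noise_diffusion_step(latent: torch.Tensor, grad: torch.Tensor,
                         step: float = 0.05, beta: float = 0.02) -> torch.Tensor:
    """One hypothetical update of the initial noisy latent: a small move
    along the semantic-alignment gradient, followed by a variance-preserving
    mix with fresh noise to maintain distribution consistency."""
    moved = latent + step * grad / grad.norm().clamp(min=1e-8)
    fresh = torch.randn_like(latent)
    return (1.0 - beta) ** 0.5 * moved + beta ** 0.5 * fresh
```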
arXiv Detail & Related papers (2024-11-25T15:40:47Z) - Golden Noise for Diffusion Models: A Learning Framework [26.117889730713923]
The text-to-image diffusion model is a popular paradigm that synthesizes personalized images given a text prompt and a random Gaussian noise.
While people observe that some noises are "golden noises" that can achieve better text-image alignment and higher human preference than others, we still lack a machine learning framework to obtain those golden noises.
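One conceivable learning framework, sketched below, trains a small network to map a random noise and a prompt embedding to a refined noise, supervised toward noises that scored better on alignment or preference; NoisePromptNet and its shapes are assumptions, not the paper's model.

```python
import torch

class NoisePromptNet(torch.nn.Module):
    """Predicts a residual that turns a random Gaussian noise into a
    'golden' candidate, conditioned on the text prompt embedding."""
    def __init__(self, latent_dim: int, text_dim: int, hidden: int = 512):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(latent_dim + text_dim, hidden),
            torch.nn.SiLU(),
            torch.nn.Linear(hidden, latent_dim),
        )

    def forward(self, noise_flat: torch.Tensor, text_emb: torch.Tensor):
        residual = self.net(torch.cat([noise_flat, text_emb], dim=-1))
        return noise_flat + residual  # refined noise candidate
```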
arXiv Detail & Related papers (2024-11-14T15:13:13Z) - Beyond Image Prior: Embedding Noise Prior into Conditional Denoising Transformer [17.430622649002427]
Existing learning-based denoising methods typically train models to generalize the image prior from large-scale datasets.
We propose a new perspective on the denoising challenge by highlighting the distinct separation between noise and image priors.
We introduce a Locally Noise Prior Estimation algorithm, which accurately estimates the noise prior directly from a single raw noisy image.
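The algorithm itself is not reproduced here, but the core intuition (reading the noise level off the flattest regions of a single image, where signal variance is minimal) can be sketched as follows; the patch size and keep ratio are illustrative parameters.

```python
import torch

def estimate_noise_level(raw: torch.Tensor, patch: int = 8, keep: float = 0.1):
    """Rough single-image noise estimate: tile the raw image into patches
    and average the standard deviation of the flattest ones, where almost
    all variation should come from noise rather than content."""
    c, _, _ = raw.shape
    tiles = raw.unfold(1, patch, patch).unfold(2, patch, patch)
    tiles = tiles.reshape(c, -1, patch * patch)
    stds = tiles.std(dim=-1)                        # (C, num_patches)
    k = max(1, int(keep * stds.shape[1]))
    flattest = stds.topk(k, dim=1, largest=False).values
    return flattest.mean(dim=1)                     # per-channel noise std
```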
arXiv Detail & Related papers (2024-07-12T08:43:11Z) - InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization [27.508861002013358]
InitNO is a paradigm that refines the initial noise so that it yields semantically faithful images.
A strategically crafted noise optimization pipeline is developed to guide the initial noise towards valid regions.
Our method, validated through rigorous experimentation, shows a commendable proficiency in generating images in strict accordance with text prompts.
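A generic version of such a pipeline is sketched below: ascend a differentiable validity score (for InitNO this is built from cross-attention responses) while renormalizing so the latent stays close to a unit Gaussian. The optimizer settings and score_fn are assumptions.

```python
import torch

def optimize_initial_noise(noise: torch.Tensor, score_fn, steps: int = 10,
                           lr: float = 0.1) -> torch.Tensor:
    """Gradient-based refinement of the initial latent toward 'valid'
    regions, with a renormalization step to keep it roughly N(0, I)."""
    noise = noise.clone().requires_grad_(True)
    opt = torch.optim.Adam([noise], lr=lr)
    for _ in range(steps):
        loss = -score_fn(noise)        # higher score = more semantically valid
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():          # pull the latent back toward unit variance
            noise.div_(noise.std().clamp(min=1e-6))
    return noise.detach()
```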
arXiv Detail & Related papers (2024-04-06T14:56:59Z) - AdaDiff: Adaptive Step Selection for Fast Diffusion [88.8198344514677]
We introduce AdaDiff, a framework designed to learn instance-specific step usage policies.
AdaDiff is optimized using a policy gradient method to maximize a carefully designed reward function.
Our approach matches the visual quality of a baseline that uses a fixed 50 denoising steps.
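A textbook policy-gradient (REINFORCE) formulation of instance-specific step selection is sketched below; the reward shape and step_cost penalty are assumptions standing in for the paper's carefully designed reward function.

```python
import torch

def reinforce_step_selection(policy: torch.nn.Module, features: torch.Tensor,
                             quality_fn, step_cost: float = 0.01):
    """Sample a per-instance step count, reward quality minus a per-step
    penalty, and return the REINFORCE loss with a mean baseline."""
    dist = torch.distributions.Categorical(logits=policy(features))
    steps = dist.sample()                             # step count per sample
    reward = quality_fn(steps) - step_cost * steps.float()
    advantage = reward - reward.mean()                # simple variance reduction
    loss = -(dist.log_prob(steps) * advantage.detach()).mean()
    return loss, steps
```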
arXiv Detail & Related papers (2023-11-24T11:20:38Z) - POS: A Prompts Optimization Suite for Augmenting Text-to-Video Generation [11.556147036111222]
This paper aims to enhance diffusion-based text-to-video generation by improving its two inputs: the noise and the text prompt.
We propose POS, a training-free Prompt Optimization Suite to boost text-to-video models.
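Since POS is training-free, one natural reading is a joint search over the two inputs, as sketched below; generate_fn and align_fn stand in for whatever video generator and video-text scorer are used, and are assumptions.

```python
import torch

@torch.no_grad()
def pos_search(prompts, noises, generate_fn, align_fn):
    """Training-free search over text variants x noise seeds: generate for
    each pair and keep the combination with the best alignment score."""
    best, best_score = None, float("-inf")
    for p in prompts:                  # candidate optimized text prompts
        for z in noises:               # candidate Gaussian noise seeds
            s = align_fn(generate_fn(p, z), p)
            if s > best_score:
                best, best_score = (p, z), s
    return best
```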
arXiv Detail & Related papers (2023-11-02T02:33:09Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - Advancing Unsupervised Low-light Image Enhancement: Noise Estimation, Illumination Interpolation, and Self-Regulation [55.07472635587852]
Low-Light Image Enhancement (LLIE) techniques have made notable advancements in preserving image details and enhancing contrast.
These approaches encounter persistent challenges in efficiently mitigating dynamic noise and accommodating diverse low-light scenarios.
We first propose a method for estimating the noise level in low light images in a quick and accurate way.
We then devise a Learnable Illumination Interpolator (LII) to satisfy general constraints between illumination and input.
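The constraint between illumination and input suggests an interpolation of the form sketched below, where the predicted map is confined between the input luminance and full brightness before a Retinex-style division; illum_net and the clamping constants are assumptions.

```python
import torch

def enhance_low_light(img: torch.Tensor, illum_net: torch.nn.Module):
    """Predict an illumination map that interpolates between the input
    luminance and 1, then divide it out of the image (Retinex-style)."""
    luminance = img.max(dim=0, keepdim=True).values  # coarse illumination cue
    alpha = torch.sigmoid(illum_net(img))            # learnable interpolator in (0, 1)
    illum = luminance + alpha * (1.0 - luminance)    # constrained to [luminance, 1]
    return img / illum.clamp(min=1e-3)
```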
arXiv Detail & Related papers (2023-05-17T13:56:48Z) - Conditional Denoising Diffusion for Sequential Recommendation [62.127862728308045]
Two prominent generative models, Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs), have been applied to sequential recommendation.
GANs suffer from unstable optimization, while VAEs are prone to posterior collapse and over-smoothed generations.
We present a conditional denoising diffusion model, which includes a sequence encoder, a cross-attentive denoising decoder, and a step-wise diffuser.
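The cross-attentive denoising decoder can be sketched as follows, assuming the noisy target-item embedding attends over the encoded interaction sequence at each diffusion step; the module and its dimensions are illustrative, not the paper's exact design.

```python
import torch

class CrossAttentiveDenoiser(torch.nn.Module):
    """Noisy item embedding (query) attends over the sequence encoder's
    states (keys/values), conditioned on the diffusion step."""
    def __init__(self, dim: int, heads: int = 4, max_steps: int = 1000):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
        self.step_emb = torch.nn.Embedding(max_steps, dim)
        self.out = torch.nn.Linear(dim, dim)

    def forward(self, noisy_item, seq_states, t):
        # noisy_item: (B, 1, D); seq_states: (B, T, D); t: (B,) step indices
        q = noisy_item + self.step_emb(t).unsqueeze(1)
        ctx, _ = self.attn(q, seq_states, seq_states)
        return self.out(ctx)           # predicted denoised item embedding
```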
arXiv Detail & Related papers (2023-04-22T15:32:59Z) - NLIP: Noise-robust Language-Image Pre-training [95.13287735264937]
We propose a principled Noise-robust Language-Image Pre-training framework (NLIP) to stabilize pre-training via two schemes: noise-harmonization and noise-completion.
Our NLIP can alleviate the common noise effects during image-text pre-training in a more efficient way.
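The two schemes are not detailed here; one plausible form of noise-harmonization, sketched below, down-weights likely-mismatched image-text pairs in a contrastive loss using their own cross-modal similarity. This weighting rule is an assumption, not the NLIP objective.

```python
import torch
import torch.nn.functional as F

def harmonized_contrastive_loss(img_emb, txt_emb, tau: float = 0.07):
    """CLIP-style contrastive loss where each pair's term is scaled by its
    (detached) cross-modal similarity, softening suspected noisy pairs."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / tau
    targets = torch.arange(len(img), device=img.device)
    per_pair = F.cross_entropy(logits, targets, reduction="none")
    weights = (img * txt).sum(dim=-1).clamp(min=0).detach()
    return (weights * per_pair).mean()
```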
arXiv Detail & Related papers (2022-12-14T08:19:30Z) - Improving Pre-trained Language Model Fine-tuning with Noise Stability Regularization [94.4409074435894]
We propose a novel and effective fine-tuning framework, named Layerwise Noise Stability Regularization (LNSR).
Specifically, we propose to inject standard Gaussian noise and regularize the hidden representations of the fine-tuned model.
We demonstrate the advantages of the proposed method over other state-of-the-art algorithms including L2-SP, Mixout and SMART.
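The regularizer described above admits a compact sketch: perturb a hidden representation with Gaussian noise and penalize how much the layers above it move. The injection point and sigma below are assumptions.

```python
import torch
import torch.nn.functional as F

def noise_stability_loss(upper_layers, hidden: torch.Tensor, sigma: float = 0.01):
    """LNSR-style penalty (sketch): propagate clean and noise-perturbed
    copies of a hidden state through the remaining layers and match them."""
    clean = hidden
    noisy = hidden + sigma * torch.randn_like(hidden)
    for layer in upper_layers:         # layers above the injection point
        clean, noisy = layer(clean), layer(noisy)
    return F.mse_loss(noisy, clean.detach())
```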
arXiv Detail & Related papers (2022-06-12T04:42:49Z)