Related papers: Diverse Text-to-Image Generation via Contrastive Noise Optimization

Diverse Text-to-Image Generation via Contrastive Noise Optimization

URL: http://arxiv.org/abs/2510.03813v2
Date: Sat, 11 Oct 2025 04:37:42 GMT
Title: Diverse Text-to-Image Generation via Contrastive Noise Optimization
Authors: Byungjun Kim, Soobin Um, Jong Chul Ye,
Abstract summary: Text-to-image (T2I) diffusion models have demonstrated impressive performance in generating high-fidelity images.<n>Existing approaches typically optimize intermediate latents or text conditions during inference.<n>We introduce Contrastive Noise Optimization, a simple yet effective method that addresses the diversity issue from a distinct perspective.
Score: 60.48914865049489
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text-to-image (T2I) diffusion models have demonstrated impressive performance in generating high-fidelity images, largely enabled by text-guided inference. However, this advantage often comes with a critical drawback: limited diversity, as outputs tend to collapse into similar modes under strong text guidance. Existing approaches typically optimize intermediate latents or text conditions during inference, but these methods deliver only modest gains or remain sensitive to hyperparameter tuning. In this work, we introduce Contrastive Noise Optimization, a simple yet effective method that addresses the diversity issue from a distinct perspective. Unlike prior techniques that adapt intermediate latents, our approach shapes the initial noise to promote diverse outputs. Specifically, we develop a contrastive loss defined in the Tweedie data space and optimize a batch of noise latents. Our contrastive optimization repels instances within the batch to maximize diversity while keeping them anchored to a reference sample to preserve fidelity. We further provide theoretical insights into the mechanism of this preprocessing to substantiate its effectiveness. Extensive experiments across multiple T2I backbones demonstrate that our approach achieves a superior quality-diversity Pareto frontier while remaining robust to hyperparameter choices.

Related papers

Bridging the Discrete-Continuous Gap: Unified Multimodal Generation via Coupled Manifold Discrete Absorbing Diffusion [60.186310080523135]
Bifurcation of generative modeling into autoregressive approaches for discrete data (text) and diffusion approaches for continuous data (images) hinders development of truly unified multimodal systems.<n>We propose textbfCoM-DAD, a novel probabilistic framework that reformulates multimodal generation as a hierarchical dual-process.<n>Our method demonstrates superior stability over standard masked modeling, establishing a new paradigm for scalable, unified text-image generation.
arXiv Detail & Related papers (2026-01-07T16:21:19Z)
DiverseAR: Boosting Diversity in Bitwise Autoregressive Image Generation [22.400053095939402]
We introduce DiverseAR, a principled and effective method that enhances image diversity without sacrificing visual quality.<n>Specifically, we introduce an adaptive logits distribution scaling mechanism that dynamically adjusts the sharpness of the binary output distribution during sampling.<n>To mitigate potential fidelity loss caused by distribution smoothing, we develop an energy-based generation path search algorithm that avoids sampling low-confidence tokens.
arXiv Detail & Related papers (2025-12-02T16:54:36Z)
SPREAD: Sampling-based Pareto front Refinement via Efficient Adaptive Diffusion [0.8594140167290097]
SPREAD is a generative framework based on Denoising Diffusion Probabilistic Models (DDPMs)<n>It learns a conditional diffusion process over points sampled from the decision space.<n>It refines candidates via a sampling scheme that uses an adaptive multiple gradient descent-inspired update for fast convergence.
arXiv Detail & Related papers (2025-09-25T12:09:37Z)
Investigating the Invertibility of Multimodal Latent Spaces: Limitations of Optimization-Based Methods [0.0]
This paper investigates the inverse capabilities and broader utility of multimodal latent spaces within task-specific AI (Artificial Intelligence) models.<n>Our central hypothesis posits that while optimization can guide models towards inverse tasks, their multimodal latent spaces will not consistently support semantically meaningful and perceptually coherent inverse mappings.<n>Our work underscores the need for further research into developing truly semantically rich and invertible multimodal latent spaces.
arXiv Detail & Related papers (2025-07-30T18:19:11Z)
Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis [9.11767497956649]
This paper proposes leveraging the language comprehension capabilities of large vision-language models to guide the optimization of the initial noisy latent. We introduce the Noise Diffusion process, which updates the noisy latent to generate semantically faithful images while preserving distribution consistency. Experimental results demonstrate the effectiveness and adaptability of our framework, consistently enhancing semantic alignment across various diffusion models.
arXiv Detail & Related papers (2024-11-25T15:40:47Z)
Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack [51.16384207202798]
Vision-language pre-training models are vulnerable to multimodal adversarial examples (AEs) Previous approaches augment image-text pairs to enhance diversity within the adversarial example generation process. We propose sampling from adversarial evolution triangles composed of clean, historical, and current adversarial examples to enhance adversarial diversity.
arXiv Detail & Related papers (2024-11-04T23:07:51Z)
OrientDream: Streamlining Text-to-3D Generation with Explicit Orientation Control [66.03885917320189]
OrientDream is a camera orientation conditioned framework for efficient and multi-view consistent 3D generation from textual prompts. Our strategy emphasizes the implementation of an explicit camera orientation conditioned feature in the pre-training of a 2D text-to-image diffusion module. Our experiments reveal that our method not only produces high-quality NeRF models with consistent multi-view properties but also achieves an optimization speed significantly greater than existing methods.
arXiv Detail & Related papers (2024-06-14T13:16:18Z)
Digging into contrastive learning for robust depth estimation with diffusion models [55.62276027922499]
We propose a novel robust depth estimation method called D4RD. It features a custom contrastive learning mode tailored for diffusion models to mitigate performance degradation in complex environments. In experiments, D4RD surpasses existing state-of-the-art solutions on synthetic corruption datasets and real-world weather conditions.
arXiv Detail & Related papers (2024-04-15T14:29:47Z)
SimAC: A Simple Anti-Customization Method for Protecting Face Privacy against Text-to-Image Synthesis of Diffusion Models [16.505593270720034]
We propose an adaptive greedy search for optimal time steps that seamlessly integrates with existing anti-customization methods. Our approach significantly increases identity disruption, thereby protecting user privacy and copyright.
arXiv Detail & Related papers (2023-12-13T03:04:22Z)
OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization [65.57380193070574]
Vision-language pre-training models are vulnerable to multi-modal adversarial examples. Recent works have indicated that leveraging data augmentation and image-text modal interactions can enhance the transferability of adversarial examples. We propose an Optimal Transport-based Adversarial Attack, dubbed OT-Attack.
arXiv Detail & Related papers (2023-12-07T16:16:50Z)
Conditional Denoising Diffusion for Sequential Recommendation [62.127862728308045]
Two prominent generative models, Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs) GANs suffer from unstable optimization, while VAEs are prone to posterior collapse and over-smoothed generations. We present a conditional denoising diffusion model, which includes a sequence encoder, a cross-attentive denoising decoder, and a step-wise diffuser.
arXiv Detail & Related papers (2023-04-22T15:32:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.